The Genesis of Data
Where does data come from and what does it mean?
There’s obviously a story to tell here about the construction of sophisticated storage methods, from old clay tablets in Mesopotamia to S3 buckets in AWS. That story is bound to include the history of writing techniques. A section of it would focus on cybernetics and information theory; perhaps it would cite Bernard Dionysius Geoghegan’s book Code. That might be because the 20th century was obsessed with language. That obsession remains with us today in a now-passé term “big data,” and the technology referred to as “Artificial Intelligence.”
That story is not what I’m here to tell. Instead I’m here to consider the origin of data in another sense. I want to focus on the connection with, or I should say functional analogy between, data and language. Geoghegan’s book explores how this connection became thinkable, and I want to say more about what that thought entails.
Code tells of how the anthropological theory of cultural codes combined with mathematical practices, biology, and psychological discourses to produce the field of cybernetics. It then talks about how various institutions introduced cybernetics into linguistics and back into anthropology to produce structuralism and what the French and francophiles like myself call the “human sciences.” That field is basically what Americans think of as the social sciences plus studies of literature, and is decidedly non-quantitative. French academics could run with this because their country was rapidly industrializing with help from programs like the Marshall Plan and then flourishing as a booming economy. In retrospect we might think of this as the maturation of a Grand Unifying Field: Communication studies.
What is structuralism, though, and why does it matter here? One thing that has stuck around from structuralism is the idea that things exist in relation with one another in a more or less coherent, abstract system. That system, and how each thing relates to the others within it, determines what that thing means. Meaning, therefore, can be discussed and quantified (here comes the mathematics) as differential relations. As the abstract relations change, the meaning of the actual things changes.
If this seems confusing, don’t worry. You understand structuralism just fine. Look at this XKCD comic, Dependency. If you understand that the whole structure makes the indicated project significant, i.e. important or meaningful, to the system, and that if the structure were reorganized that project might not be so critical, then you understand structuralism.
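For readers who think more easily in code, here is a minimal sketch of the same point. The project names and both dependency graphs are hypothetical, invented for illustration, and "significance" is crudely assumed to mean the number of things that rest on a project, directly or transitively.

```python
from collections import defaultdict


def dependents(graph, target):
    """Return every project that depends, directly or transitively, on target."""
    # Invert the "depends on" edges so we can walk upward from the target.
    reverse = defaultdict(set)
    for project, deps in graph.items():
        for dep in deps:
            reverse[dep].add(project)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in reverse[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical structure: everything ultimately rests on "tiny_lib".
before = {
    "infrastructure": ["service_a", "service_b"],
    "service_a": ["tiny_lib"],
    "service_b": ["tiny_lib"],
    "tiny_lib": [],
}

# The same pieces, rearranged so that nothing rests on "tiny_lib".
after = {
    "infrastructure": ["service_a", "service_b"],
    "service_a": [],
    "service_b": [],
    "tiny_lib": [],
}

print(len(dependents(before, "tiny_lib")))
print(len(dependents(after, "tiny_lib")))
```

Nothing about tiny_lib itself differs between the two graphs. Only its relations differ, and with them its significance to the system.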
But I was talking about language and data. One of the criticisms that so-called “post-structuralist” thinkers made concerned the lack of history that the idea of a structuralist system entailed. Return to the XKCD comic. Where did that whole thing come from? And is rearranging the existing pieces the only way to change the indicated project’s meaning? Why can’t we talk about where the whole thing came from and where it might be going? Structuralism resists this because there is supposed to be a steady relation of reference between, say, a linguistic sign and that thing in the state of affairs (‘the world’) that it signifies. Post-structuralists reject that assumption. They think that languages and other such systems can meaningfully, that is qualitatively, change. The question is how.
Gilles Deleuze, in his work The Logic of Sense, proposes that language has a very interesting feature: it can proliferate infinitely. It does this by way of its unending ability to talk about itself; one can always say something about what was said before because propositions can always refer to other propositions and thereby comment on them. That commentary is not only a new proposition, but it also changes the meaning of the prior sentence because it changes its position and relationships in the language system. That speech act makes a difference; it transforms things and means that their meaning is in flux. Since the meaning is in flux, it can’t act as a steady signifier of anything in the state of affairs and in fact constitutes a change in the state of affairs in and of itself. There is no dualism between language and the world.
I contend that this is how data operates too. It is always possible to collect metadata about some datum, even if that datum is itself metadata about some other datum. For instance, I’m sure Squarespace collects data about this blog post once it’s published; observability instrumentation may collect data about the operation of creating and storing the data about this post; that data may be sent to an observability tool for analysis, which will record data about that operation; and so on and so forth.
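The regress described above can be put as a rough sketch. The Datum class and the describe and depth helpers are hypothetical names invented for illustration; they do not correspond to any real analytics or observability API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Datum:
    content: str
    about: Optional["Datum"] = None  # the datum this one describes, if any


def describe(datum):
    """Produce a new datum that records something about an existing datum."""
    return Datum(content=f"record of {len(datum.content)} characters", about=datum)


def depth(datum):
    """Count how many layers of 'data about data' sit beneath this datum."""
    layers = 0
    while datum.about is not None:
        layers += 1
        datum = datum.about
    return layers


post = Datum("The Genesis of Data")  # the blog post
analytics = describe(post)           # data about the post
telemetry = describe(analytics)      # data about collecting that data
trace = describe(telemetry)          # data about that operation, and so on

print(depth(trace))
```

Nothing in describe cares whether its argument is a "primary" datum or already metadata, which is why the chain has no natural stopping point.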
Further, data don’t represent anything; rather, they act effectively on the very things they supposedly represent. There are lots of good works out there on this point, so I won’t delve into it here. I really like Paul Cilliers’s work on this in Complexity and Postmodernism, where he explains Derrida’s thinking on deconstruction and how it relates to complexity. C. Thi Nguyen also published a piece recently which is quite good on this and has some additional references.
It’s this capacity to continually produce new metadata, which is ultimately just more data, that has led to the explosion of data so often commented on in our times.
Finally, let me say that I find it interesting that this feature has led to fears in the AI discourse. Researchers are alarmed at the effects of using AI-generated data as training data for AI systems, which will of course produce more AI-generated data which may be subsequently used as training data ad infinitum. I take it the worry is about that spiral and its implications. The fear of that spiral isn’t surprising since so many people naively take data to have a stable intentional referent, though I think it’s misguided because the spiral is not actually anything new (if we work from the understanding of cybernetic theory and its offshoots). But what are the implications here? That the AI systems will become even more opaque to their handlers? That they’ll do even more things we don’t understand and that they may even do things that overwhelm their handlers? These seem like the same concerns people have about bio- and geoengineering. So maybe we should stop doing these things and resist the capitalist system which impels us to do them. Just a thought; say what you will about it.
PS, 26 Dec 2024: Since writing this post I’ve read an excellent article on Deleuze’s concept of sense by Daniel W. Smith, Professor of Philosophy at Purdue. Smith is widely considered one of the foremost scholars of Deleuze working today. He’s the author of the instant classic Essays on Deleuze, which I’ve cited in this blog before, and has mentored numerous philosophers. He’s a true gentleman and scholar.
His article has induced me to reconsider this post, or at least to see its shortcomings more clearly. First of all, in an early footnote Smith calls attention to the reservations Deleuze expressed later in his life about The Logic of Sense. His work with Félix Guattari seems to have helped him overcome certain problems in his earlier thinking, where he relied too much on structuralist principles; score one for people who give Guattari his due. I’ll set that aside, though, as its implications are something I don’t yet understand well.
A second point, though, leads me to want to change the title of this post, since the post doesn’t seem to properly account for the genesis of data. As it exists today (and I’m not changing it, for the sake of honesty), the post simply describes how data proliferates (the “static genesis”) but doesn’t account for its origins very well. The latter would have required a more involved theory of how a datum is constructed (the “dynamic genesis”). Such an account would require considering how a change in intensity, say the transformation of an electrical current in a circuit, is registered and made into the integer of a metric or the string of a log file.
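To make that last step slightly more concrete, here is a minimal sketch of how a continuous intensity might be registered as an integer and then as a log string. The 3.3 V reference, the 10-bit resolution, the function names, and the log format are all assumptions chosen for illustration; they are not drawn from any particular device or logging library.

```python
def quantize(voltage, v_ref=3.3, bits=10):
    """Map an analog voltage in [0, v_ref] onto an integer code.

    v_ref and bits are assumed values, loosely modeled on a generic
    analog-to-digital converter.
    """
    levels = (1 << bits) - 1                 # 1023 for 10 bits
    clamped = min(max(voltage, 0.0), v_ref)  # out-of-range input is clipped
    return round(clamped / v_ref * levels)


def to_log_line(timestamp, code):
    """Render the integer code as the string of a log file (hypothetical format)."""
    return f"{timestamp} adc_code={code}"


# Two distinct intensities become one and the same datum.
print(quantize(1.0), quantize(1.0001))
print(to_log_line("2024-12-26T00:00:00Z", quantize(1.0)))
```

Even in this toy version the datum is constructed rather than found: the choice of reference voltage and resolution decides which differences in intensity register at all and which disappear.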
Ultimately I think the post is fine. But it’s got shortcomings, and I think it’s important to acknowledge that publicly. If you liked the post and want to use it to help your thinking, perhaps consider reading Smith’s article and continuing to stretch beyond what I wrote initially. As you can tell, I’m continuing to learn too.