What is data?
Defining what “data” is turns out to be a hard task, as the concept is incredibly broad. Some of the most common definitions that people give when asked, together with the objections each one invites, are:
- Everything that can be represented in bits.
  - But what about a fossilized bone? Or a cave painting? Is an artifact not data? If not, what are museums for?
- Anything that can be used to extract knowledge or insight.
  - Are thoughts derived purely through logic data, then?
- Anything that a researcher uses to conduct their work.
  - What work does a researcher do?
  - What about non-human researchers? Do they count?
The EOSC Glossary used to define data as “reinterpretable digital representation of information in a formalized manner suitable for communication, interpretation or processing”, which is also very bits-based (i.e. “digital representation”). As of today, 2024-08-03, the glossary no longer has any definition of “Data”.
Sabina Leonelli, in her work “Scientific Research and Big Data”, identifies two possible definitions of what “data” is. The first is the representational definition: data is a representation, a “footprint” of the real world, captured in some form like a number, a sentence, etc…
In this view, data “captures” real features of the world and gives them tangibility and the capacity to be interpreted in order to generate knowledge. Data objects are therefore immutable and have a well-defined, fixed context: that of the moment in which the data was recorded. In this sense, data is objective, and this very characteristic endows the resulting knowledge with epistemic value.
Data is also, in this view, the basis for Inductive Logic, and gives it its meaningfulness.
If data is an objective representation of reality, then it must be the case that there are correct and incorrect ways of interpreting it, just as there are correct and incorrect logical processes. However, the demarcation between what is a “correct” and what is an “incorrect” interpretation of data is unclear. Leonelli finds that the context in which the data is gathered and interpreted is essential to the very definition of what data is:
“[…] despite their epistemic value as ‘given’, data are clearly made. They are the results of complex processes of interaction between researchers and the world, which typically happen with the help of interfaces such as observational techniques, registration and measurement devices, and the re-scaling and manipulation of objects of inquiry for the purposes of making them amenable to investigation.”
- Sabina Leonelli, “What Counts as Scientific Data? A Relational Framework”, Philosophy of Science, 2016 https://doi.org/10.1086/684083
This led her to formulate a new definition for data, the so-called “relational view”. This relational view arises from the statement quoted before: data is not given - it is made. The process of creating data is laden with human preconceptions and potential biases, as well as technical or practical constraints: it is “theory-laden”. How can such a “theory-laden” object be found to be an objective representation of reality?
“How can data, understood as an intrinsically local, situated, idiosyncratic, theory-laden product of specific research conditions, serve as confirmation for universal truths about nature?”
- Sabina Leonelli, “What Counts as Scientific Data? A Relational Framework”
The relational view defines Data as any product of research activities that can be used as evidence for scientific claims. In other words, the act of using something as evidence for a claim makes that something data.
This definition shifts the focus from the assumption that data is objective to the problem of justifying the use of a particular piece of data in support of a given theory. It makes explicit the need to scrutinize and explain why an object is considered to be data (evidence) for a specific theory or claim: How is this object used? How was it obtained? Who is using it to make the claim? What form does it have, and how does that form influence its interpretation? What manipulations has it undergone, and how do these influence its fitness for purpose?
Leonelli also highlights data portability as an essential aspect of being data. Replicability is a core requirement for scientific statements, and equally essential is the need to share one’s evidence with a group of peers in order to support one’s statements. Data that is not portable, just like something that only one person has ever observed, has no meaning and therefore is not data.
This also means that the medium in which data is shared may affect its epistemic meaning:
“the physical characteristics of the medium significantly affect the ways in which data can be disseminated, and thus their usability as evidence. In other words, when data change medium, their scientific significance may also shift.”
- Sabina Leonelli, “Scientific Research and Big Data”, 2020, https://plato.stanford.edu/archives/sum2020/entries/science-big-data/
If this is the case, then attribution of data creation might be ephemeral: if a person profoundly edits the format of a dataset, or even combines it with others, is the result completely new data? Is authorship linked to the data, or to its epistemic meaning?
An example: a scientist provides data on the migration of birds in different continents. The scientist defines five different continents: Europe, Asia, America, Africa and Oceania. To conform with the requirements of a deposition database, they are forced to replace their definition of “continent” with a finer set of regions: East and West Europe, North and South America, etc… The underlying observations have not changed between the two formats, but arguably their epistemological meaning has.
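The bird-migration example can be made concrete with a small sketch. Everything in it is hypothetical and invented purely for illustration: the sites, the counts, the two region mappings, and the helper `totals_by_region`. It only shows that the same observations, relabelled under a different regional scheme, support different kinds of claims.

```python
from collections import Counter

# Hypothetical observations: (site, number of migrating birds counted).
# Sites and counts are invented for illustration only.
OBSERVATIONS = [
    ("Portugal", 120),
    ("Poland", 80),
    ("Canada", 200),
    ("Brazil", 150),
    ("Kenya", 90),
]

# Scheme A: the scientist's own continent definition (only the sites used here).
FIVE_CONTINENTS = {
    "Portugal": "Europe",
    "Poland": "Europe",
    "Canada": "America",
    "Brazil": "America",
    "Kenya": "Africa",
}

# Scheme B: finer regions imposed by a hypothetical deposition database.
DATABASE_REGIONS = {
    "Portugal": "West Europe",
    "Poland": "East Europe",
    "Canada": "North America",
    "Brazil": "South America",
    "Kenya": "Africa",
}


def totals_by_region(observations, scheme):
    """Sum the bird counts per region label under the given scheme."""
    totals = Counter()
    for site, count in observations:
        totals[scheme[site]] += count
    return dict(totals)


by_continent = totals_by_region(OBSERVATIONS, FIVE_CONTINENTS)
by_region = totals_by_region(OBSERVATIONS, DATABASE_REGIONS)

print(by_continent)
print(by_region)
```

The grand total is identical under both schemes, yet a claim such as “most birds were counted in Europe” can only be phrased, checked, or disputed in terms of the first one; under the second, “Europe” no longer exists as a category.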
Sources and Additional Reading Material
- Sabina Leonelli, “Scientific Research and Big Data”, Stanford Encyclopedia of Philosophy, 2020, https://plato.stanford.edu/archives/sum2020/entries/science-big-data/
- Sabina Leonelli, “What Counts as Scientific Data? A Relational Framework”, Philosophy of Science, 2016 https://doi.org/10.1086/684083