What isn't data?

Matt Sinderbrand, MPH • April 19, 2018 • 6 minute read

I was not a well-behaved child. By age 6 I already knew everything while my parents and teachers obviously didn't, so I operated under a "you can't tell me nothing" policy that infuriated any adult who dared to cross my path. By age 12 that petulant tone had run its course, and I realized that I in fact knew almost nothing about anything. This cycle from "all is known" to "all is unknown" seems to repeat itself as I gain more experience and process more data through my learning computer (my brain).

At birth we immediately begin processing data through our five sensory inputs in an attempt to make sense of this crazy thing called existence. As children we pursue knowledge rabidly and seek to learn as much as possible, continuously asking "but why?" and refusing to accept answers like "because that's how it is". Like me, all of you experienced that moment when the small world of your youth suddenly expanded to a massive and scary place, forcing you to confront all the unknowns and seek information to answer life's greatest questions.

But at some point this thirst for more information goes away, and we start to believe that there is a finality of knowledge - that we have used all available information to arrive at an answer that cannot be challenged. The truth is that knowledge is a function of inputs and outputs: if we don't have ALL the inputs, the output is always subject to change. Essentially, the power of knowledge is only as strong as the information used to support it. But to define the strength of information, we must understand how it is fundamentally rooted in data.

So what is data? I honestly don't know!

Instead I'll ask a different, but similar question: “What isn’t data?”

A brief history of data

The first evidence of data appeared when our ancestors began forming sounds into distinct words. Different sounds communicated different meanings, and eventually, named concepts began to form the foundation for local languages.

The first evidence of data storage can be seen in cave drawings attributed to Neanderthals, a hominid species, dating back as far as 65,000 years.

Fast forward some 60,000 years and we arrive at the first evidence of data transmission - the Sumerian writing system called cuneiform. The written word became the foundation from which ancient humans could contextualize the world around them and transmit information on tablets and parchment over long distances (necessitated by trade). Written records began to develop around 3500 BCE, with culture, law and civilization following soon thereafter.

From this point forward, named concepts - or data conveyed through natural language - became the dominant data transmission and storage method, and they remain so to this day.

Down the rabbit hole…

Throughout the next 5000 years, humans observed the world around them and recorded their findings, and over time started to classify these observations in an attempt to make sense of the complex interrelationships between the spiritual, biologic and physiologic realms. For the purposes of this review, I'd like to take a look at a timeline of our attempts to "make sense" of biology, specifically the human body and its medicaments. As we've already identified the first examples of data storage and transmission, now we get to the fun part: data classification!

~1500 BCE: Ancient Ayurvedic and Chinese medicine practitioners began classifying biological ailments in texts

~350 BCE: Aristotle creates an early forerunner of modern taxonomy (the kingdom, phylum, class hierarchy came much later)

[2000 years of religious wars, Dark Ages, etc. - not a great time for science]

1763: François Boissier de Sauvages de Lacroix publishes "Nosologia Methodica" (nosology = study of classification of disease)

1893: Jacques Bertillon introduces “Bertillon Classification of Causes of Death”

1900: First "Controlled Vocabulary" - International Classification of Causes of Death (ICD, later renamed the International Classification of Diseases) adopted in Europe - 161 classifications

1975: ICD-9 - 17,000 disease classifications

1999: LOINC introduced for lab results classification

2002: RxNorm introduced for drug code standardization

2003: SNOMED CT introduced for clinical terminology

2010: ICD-10 - 90,000 disease classifications

NOW: 22 "Controlled Vocabularies" (e.g. ICD, LOINC, SNOMED) representing over 1,000,000 biomedical concepts

Starting in the late 19th century, we began classifying diseases as named concepts in an attempt to more efficiently communicate (i.e. transmit data) about "what happened". As our ability to measure biological processes and ailments improved, systems of classifications became more complex, text strings were replaced by codes, and the named concepts from which these classifications originated became obsolete.

Now, with more than 1M different codes used to identify these named concepts, we've encountered a much larger problem in our attempt to efficiently communicate "what happened". Given that 80% of these named concepts are still not structured or standardized, we've essentially invented 1M codes to describe 20% of the relevant information needed to support sound knowledge. This is one aspect of what's known in academic circles as The Naming Problem.

Data in its natural state

The history lesson above points out that our attempts to simplify the way we communicate named concepts led to the creation of coding vocabularies that increase in complexity each time we get better at naming things. Unfortunately, we will NEVER stop improving our ability to identify, measure and NAME things. Additionally, organization of concepts by name fundamentally limits our ability to interpret their meaning. But after traveling this far down the wrong road, how can we possibly go back? How do we solve the Naming Problem and get back to data in its natural state?

Fortunately for us, an interdisciplinary group led by the National Library of Medicine (NLM) started working on this problem back in 1986 - their solution is updated regularly to this day, and is known as the Unified Medical Language System (UMLS®). UMLS organizes concepts by meaning, and focuses on what the relationship between two data points means, rather than what it's called by name. This represents a fundamental shift from classification by concept to classification by context.
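
To make "organizing by meaning" a little more concrete, here is a minimal sketch in Python. It is not the UMLS data model or its API, just a toy crosswalk under stated assumptions: the concept identifiers (CONCEPT_MI, CONCEPT_T2DM) are made up for illustration, and the vocabulary codes are examples that should be checked against current releases before anyone relies on them. The idea is simply that codes from different vocabularies point at one shared concept, so two records can be compared by what they mean rather than by what they're called.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (vocabulary, code, label, concept_id).
# The concept_id values are invented for this sketch and are NOT real UMLS
# identifiers; the codes are illustrative and should be verified.
ROWS = [
    ("ICD-9", "410", "Acute myocardial infarction", "CONCEPT_MI"),
    ("ICD-10", "I21", "Acute myocardial infarction", "CONCEPT_MI"),
    ("SNOMED CT", "22298006", "Myocardial infarction", "CONCEPT_MI"),
    ("ICD-10", "E11", "Type 2 diabetes mellitus", "CONCEPT_T2DM"),
]


def build_index(rows):
    """Return (code -> concept, concept -> list of codes) lookups."""
    code_to_concept = {}
    concept_to_codes = defaultdict(list)
    for vocab, code, _label, concept in rows:
        code_to_concept[(vocab, code)] = concept
        concept_to_codes[concept].append((vocab, code))
    return code_to_concept, dict(concept_to_codes)


def same_meaning(code_to_concept, a, b):
    """True when both (vocabulary, code) pairs map to the same known concept."""
    concept_a = code_to_concept.get(a)
    concept_b = code_to_concept.get(b)
    return concept_a is not None and concept_a == concept_b


CODE_TO_CONCEPT, CONCEPT_TO_CODES = build_index(ROWS)
```

With this in place, same_meaning(CODE_TO_CONCEPT, ("ICD-9", "410"), ("SNOMED CT", "22298006")) compares two differently named codes by the concept behind them, and an unknown code never matches anything.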


UMLS is a biomedical interoperability engine that maps key terminology, classification and coding standards to transform data into universally indexable information. UMLS and its associated Lexical Tools serve as the technological foundation for "making sense" of unstructured clinical text, evolving multiple lexicons in combination with controlled vocabularies to refine all sources of health data into a single "source of truth". This is an essential first step toward producing complete, population-level datasets: the missing foundation for actionable machine learning and artificial intelligence in medicine.
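
As a rough illustration of what "making sense" of unstructured clinical text can involve, the sketch below normalizes phrases (lowercase, strip punctuation, sort the words) and looks them up in a small synonym lexicon. This is loosely inspired by the general idea of lexical normalization and is not the actual UMLS Lexical Tools; the lexicon, the concept identifiers and the function names are all hypothetical.

```python
import re

# Hypothetical synonym lexicon: made-up concept IDs mapped to a few phrasings.
SYNONYMS = {
    "CONCEPT_MI": ["myocardial infarction", "heart attack"],
    "CONCEPT_T2DM": ["type 2 diabetes", "type 2 diabetes mellitus"],
}


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def normalize(term):
    """Order-insensitive key, so 'Infarction, Myocardial' matches 'myocardial infarction'."""
    return " ".join(sorted(tokenize(term)))


# Normalized phrase -> concept ID.
LEXICON = {
    normalize(phrase): concept
    for concept, phrases in SYNONYMS.items()
    for phrase in phrases
}


def find_concepts(note, max_words=4):
    """Return the set of concept IDs whose synonyms appear in a free-text note."""
    tokens = tokenize(note)
    found = set()
    for size in range(1, max_words + 1):
        for start in range(len(tokens) - size + 1):
            key = " ".join(sorted(tokens[start:start + size]))
            if key in LEXICON:
                found.add(LEXICON[key])
    return found
```

Calling find_concepts on a note such as "Pt reports prior heart attack; hx of type 2 diabetes." returns concept identifiers rather than strings, which is the step that lets free-text notes sit alongside coded data. A real pipeline would also need to handle negation, abbreviations and misspellings, which this toy version ignores.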

Quick recap…

Without the ability to capture and structure complete sets of health data, medical knowledge suffers. It's possible to use meta-ontologies like UMLS to capture and structure ALL forms of data, including the often-ignored and highly relevant clinical notes that represent the doctor's opinion.

Remember: Medicine is a PRACTICE, and medical knowledge should evolve continuously as we build better, more accurate sets of information based on complete data.