Can LLMs Extract Meaning from Clinical Data? Part I
Author
Senthil Nachimuthu
I’ve been testing Large Language Models for various healthcare use-cases, and a common theme among them is the extraction of meaning from clinical data. Extracting and representing the meaning of clinical data consistently, also known as semantic normalization, is foundational to any subsequent use of the data. It is a need I’ve worked on in the industry for almost two decades, and I’m curious to find out whether LLMs can revolutionize it.
I’ll start with an introduction to semantic normalization in Part I of this multi-part series. The articles will be in an accessible style and will not require a background in medicine or computer science. In Part II, I’ll describe how semantic normalization was done traditionally, before the invention of transformers. In Part III, I’ll talk about how transformers have improved semantic normalization. In Parts IV and beyond, I’ll talk about what LLMs bring to the table and what the future holds.
In each part of this series, I’ll address both structured data (problem list, demographics, vital signs, lab orders/results, medication orders/administration, list of diagnoses/procedures, etc.) and free-text notes (progress notes, procedure notes, discharge summaries, etc. — also called unstructured data). A patient’s healthcare journey is a complex, lifelong story, and it is impossible to fit a story into little drop-down boxes on the screen. At the same time, the story includes some things that do fit into the boxes. Many past efforts, including electronic health record user interfaces, do a poor job of integrating structured and unstructured data. Looking at just one or the other will provide an incomplete picture. I will describe how approaches to normalize both structured and unstructured data have evolved over the last 20 years.
Semantic normalization is a technical way to say ‘denote the meaning of data consistently’ — for both humans and machines to process. In other words, Semantic Normalization is the process of extracting meaning contained in clinical data (structured data or free-text notes) and representing it using a standard vocabulary.
Contrast this with Syntactic Normalization, the process of converting medical data into a common physical format. Adoption of standard data models and ontologies makes this process less painful; even so, it still requires a lot of elbow grease.
Note that vocabulary, terminology, ontology, and nomenclature are similar but not exactly the same. For purposes of the current topic, they are used interchangeably. The distinctions will be called out where pertinent.
When it comes to structured data, semantic normalization involves mapping local codes or terms to standard terminology concepts. A concept is a shorthand for a unique meaning. An example that I first heard 20 years ago from my professor and mentor Dr. Lee Min Lau involves the term ‘cold’. Cold may denote the common cold (e.g. ‘I have a cold’), chronic obstructive lung disease (e.g. ‘I have COLD’ **), or the sensory perception of low temperature (e.g. ‘I feel cold’). The hyperlinks take you to three distinct concepts from SNOMED CT (the world’s largest medical ontology), which provides their human-readable definitions and machine-processable expressions. (** the uppercase acronym can be unreliable in typed or transcribed text. So, COPD — Pulmonary instead of Lung — is the preferred acronym).

Three meanings of the term ‘cold’, first heard from Lau LM, ca. 2002–2004. Image generated by DALL-E 3
Local terms and codes are standardized through a process of ‘mapping’ — linking them to standard terminology concepts that are equivalent (or, depending on the use-case, close enough) in meaning. The cold homonyms (words with the same spelling or pronunciation but different meanings) show that context is important if mapping is to preserve meaning. Often, context is not available in structured data that has been aggregated and de-identified.
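To make the mapping step concrete, here is a minimal Python sketch of context-aware mapping for the ‘cold’ homonym. This is a toy illustration, not a real terminology service: the disambiguation rules are deliberately naive, and while the first two SNOMED CT concept IDs are real, the third is a made-up placeholder.

```python
# Toy sense lexicon for the homonym 'cold', keyed by intended meaning.
# 82272006 and 13645005 are real SNOMED CT codes; the third ID is a placeholder.
SENSES = {
    "common_cold": {"concept_id": "82272006", "label": "Common cold (disorder)"},
    "copd": {"concept_id": "13645005", "label": "Chronic obstructive lung disease (disorder)"},
    "cold_sensation": {"concept_id": "X-PLACEHOLDER", "label": "Cold sensation (finding)"},
}

def map_cold(utterance: str) -> dict:
    """Very naive context-based disambiguation for the term 'cold'."""
    if "feel" in utterance.lower():       # 'I feel cold' -> sensory perception
        return SENSES["cold_sensation"]
    if "COLD" in utterance:               # uppercase acronym -> COPD (unreliable in practice!)
        return SENSES["copd"]
    return SENSES["common_cold"]          # default reading: 'I have a cold'

print(map_cold("I have a cold")["label"])  # Common cold (disorder)
print(map_cold("I have COLD")["label"])    # Chronic obstructive lung disease (disorder)
print(map_cold("I feel cold")["label"])    # Cold sensation (finding)
```

A production mapper would of course use a full terminology server and a trained disambiguation model rather than string matching, but the shape of the problem is the same: one surface term, several candidate concepts, and context deciding among them.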
On the other hand, context is available in free-text notes (such as progress notes, procedure notes, or discharge summaries) — however, we need a Natural Language Processing (NLP)* engine capable of understanding the context, known in technical terms as co-reference detection and word sense disambiguation. (* To be precise, this is called Natural Language Understanding — NLU, which is a subset of NLP).
Finally, a named entity recognizer takes these words or phrases and links them to ontology concepts. Modules that perform various specialized tasks are strung together in an intuitively-named pipeline, and they read and write data to an object that looks like the layers of an onion — sometimes called a CAS (Common Analysis Structure) — with different layers containing paragraphs, sentences, nouns, verbs, pronouns, negations, and so on.
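The onion-layered structure and the pipeline of modules can be sketched in a few lines of Python. This is a toy stand-in, not the actual UIMA CAS API: each ‘module’ is simply a function that reads the shared structure and writes its own annotation layer, and the dictionary-lookup entity recognizer is a hypothetical placeholder for a trained model.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    layer: str   # e.g. 'sentence', 'token', 'negation', 'entity'
    begin: int
    end: int
    label: str

@dataclass
class CAS:
    """Minimal stand-in for a Common Analysis Structure:
    the raw text plus stacked annotation layers."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, layer, begin, end, label):
        self.annotations.append(Annotation(layer, begin, end, label))

    def layer(self, name):
        return [a for a in self.annotations if a.layer == name]

def sentence_splitter(cas):
    """Writes the 'sentence' layer by splitting on periods (naive)."""
    start = 0
    for i, ch in enumerate(cas.text):
        if ch == ".":
            cas.add("sentence", start, i + 1, cas.text[start:i + 1].strip())
            start = i + 1

def entity_recognizer(cas):
    """Writes the 'entity' layer via a hypothetical dictionary lookup."""
    for term in ("cough", "fever"):
        idx = cas.text.lower().find(term)
        if idx >= 0:
            cas.add("entity", idx, idx + len(term), term)

cas = CAS("Patient reports cough. No fever.")
for module in (sentence_splitter, entity_recognizer):  # the 'pipeline'
    module(cas)
print([a.label for a in cas.layer("entity")])  # ['cough', 'fever']
```

Real frameworks add type systems, offsets into multiple views, and many more layers (tokens, part-of-speech tags, negation spans), but the read-a-layer, write-a-layer contract between modules is the essential idea.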
Specialized pipelines have been available for years to detect specialized pieces of information in a medical record such as smoking status, cardiac ejection fraction (in an echocardiography report), etc. Of late, general pipelines are able to detect just about any imaginable type of information in a medical record — diagnoses, procedures, medications, vital signs, lab results, allergies, outcomes, etc. These named entities are linked to standard terminology concepts so that we don’t mistake a patient with chronic obstructive lung disease for common cold. Finally, modern pipelines are able to output the extracted data using standard data models such as FHIR (pronounced ‘fire’, here’s a 6-minute intro).
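As a sketch of that last step, here is roughly what emitting an extracted, SNOMED-coded finding as a FHIR Condition resource could look like. This uses a plain dict rather than a real FHIR library, the patient ID is made up, and a production pipeline would validate the output against the FHIR specification.

```python
# Hedged sketch: wrap a coded finding in a FHIR-style Condition resource.
# 'http://snomed.info/sct' is the standard FHIR system URI for SNOMED CT.
def to_fhir_condition(patient_id: str, concept_id: str, display: str) -> dict:
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},  # illustrative patient reference
        "code": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": concept_id,
                "display": display,
            }],
            "text": display,
        },
    }

condition = to_fhir_condition("example-123", "13645005",
                              "Chronic obstructive lung disease")
print(condition["code"]["coding"][0]["code"])  # 13645005
```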
The less-obvious question one might ask is how we got from specialized pipelines (for smoking status or ejection fraction) to general pipelines (everything in the medical record, including the kitchen sink). The answer lies in transformers, the deep learning architecture that Google invented in 2017. Transformer-based NLP engines have all but replaced classical engines that used rules, statistics, or both. Two popular open-source NLP toolkits that support transformer models are Spark NLP and spaCy. Both also have commercially licensed versions with more bells and whistles, but the open-source versions are more than sufficient to get started.
Now, the more-obvious question on everyone’s mind is how do LLMs compare? Large Language Models have stolen the limelight since OpenAI introduced ChatGPT, powered by GPT-3.5, to the world last year.
So, one is bound to ask whether LLMs obviate NLP engines. Do they also obviate the need for terminology mapping for structured data? These are the questions that I’ll address through experiments on a few different LLMs. If you have any questions about this article or if you want me to address specific questions in future articles in this series, please let me know.