Needle in a Haystack, Part 2: Testing GPT’s Ability to Read Clinical Guidelines

Part 2 of a 3-part series

By Maaike Desmedt, Lauren Pendo, Nikhita Luthra, and Sauhard Sahi

Overview of the prototype

In our previous post we shared an overview of how Oscar’s AI R&D team uses LLMs to enhance clinician workflows, specifically in the Prior Authorization (PA) process. In this post, we’ll share how we tested GPT’s ability to read clinical guidelines, and some of the learnings we amassed through our experimentation. 

Oscar develops and adopts clinical guidelines to support medical necessity decisions and to ensure our members receive high-quality, evidence-based care. One way we leverage these guidelines is by informing clinicians during the PA Request workflow. Oscar’s criteria for medical and pharmacy benefits are publicly available to members and providers, here. These guidelines outline the conditions under which a service will be covered, and can vary by state, line of business, or even plan type. The list of services that have a corresponding clinical guideline is shared with providers in advance. Clinical guidelines cover a specific set of procedures represented by Current Procedural Terminology (CPT) codes. For example, a hip replacement could refer to either CPT 27130 or CPT 27132.

Our prototype iterates through the requirements in Oscar Clinical Guideline (CG070). It translates each criterion into a clinical question, then asks GPT whether the requirement is met based on the patient’s medical record and, if so, to cite the relevant evidence. Our use of LLMs in this application streamlines the clinician experience by enabling a better understanding of unstructured natural language in the medical record.

Here is an overview of how our prototype works:
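At a high level, the prototype loops over each criterion in the guideline and asks GPT one focused question per criterion against the member’s medical record. Below is a minimal Python sketch of that loop using the OpenAI chat API; the helper names, model name, and prompt wording are illustrative assumptions, not our production code.

```python
# Minimal sketch of the prototype loop (hypothetical helpers and prompts, not production code).
from openai import OpenAI

client = OpenAI()

# Each guideline criterion is translated into a true/false clinical statement up front.
CRITERIA_STATEMENTS = [
    "There has been a failure to decrease pain or improve function after at "
    "least a 3-month trial of therapeutic injections to the hip.",
    "Therapeutic injections are contraindicated, for example if there is "
    "evidence of bone-on-bone, osteolytic cysts, osteophytes, and sclerosis.",
]

def evaluate_criterion(statement: str, medical_record: str) -> str:
    """Ask GPT whether one criterion is met, with justification and a citation."""
    response = client.chat.completions.create(
        model="gpt-4",  # model name is illustrative
        messages=[
            {
                "role": "system",
                "content": "You are reviewing a prior authorization request. "
                "Answer True or False, justify your reasoning, and cite the "
                "relevant passage from the medical record.",
            },
            {
                "role": "user",
                "content": f"Medical record:\n{medical_record}\n\nStatement: {statement}",
            },
        ],
    )
    return response.choices[0].message.content

def review_request(medical_record: str) -> list[str]:
    # Iterate through every criterion in the guideline and collect GPT's answers.
    return [evaluate_criterion(s, medical_record) for s in CRITERIA_STATEMENTS]
```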

Below are the outcomes of 16 different experiments. The experiments ran on a sample of medical records to test GPT’s ability to correctly answer a set of clinical questions derived from the clinical guideline. After each iteration, we looked at what GPT was getting right and wrong, and updated our technique to address the errors. As part of our experimentation, we uncovered interesting insights about the way GPT works.

Prompt engineering techniques that successfully improved results:

  1. GPT is more likely to give correct answers (in line with a human reviewer) when prompted to provide justification for its reasoning (as distinct from providing citations).

    In one example where we did not ask for a justification, GPT missed a reference in a chart to a required criterion for approval (note: as our focus is on accelerating approvals, this would not have any negative consequences for the member). This is classic chain-of-thought: LLMs have to resolve inputs into an answer within a limited number of intermediate steps in their inner layers, and by asking for a chain-of-thought justification, the output tokens become a sort of “scratch pad” for the LLM that allows it to iterate longer to get to the right answer.
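To illustrate, here are two paraphrased prompt templates (hypothetical wording, not our exact prompts): the second asks for the reasoning and citation before the final True/False answer, giving the model that scratch pad.

```python
# Illustrative prompt variants (paraphrased). The second asks for justification
# first, which acts as a scratch pad for the model's intermediate reasoning.
ANSWER_ONLY_PROMPT = (
    "Statement: {statement}\n"
    "Based on the medical record above, answer True or False."
)

JUSTIFY_THEN_ANSWER_PROMPT = (
    "Statement: {statement}\n"
    "Based on the medical record above, first explain step by step which parts "
    "of the record are relevant, then cite the supporting passage, and only "
    "then answer True or False."
)
```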

  2. We break out a question into multiple smaller, simpler questions:

    a. For example, we initially asked whether the following statement, taken straight from the clinical guideline, was true or false:

    i. “Therapeutic injections to the hip and not within three months before the surgery date (unless contraindicated. Examples include, but not limited to as bone-on-bone, osteolytic cysts, osteophytes, and sclerosis);”

    b. After we noticed GPT was struggling to identify when injections are contraindicated, we saw success by separating this into two distinct statements and asking whether each was true or false:

    i. “There has been a failure to decrease pain or improve function after at least a 3-month trial of therapeutic injections to the hip.”

    ii. “Therapeutic injections are contraindicated, for example if there is evidence of bone-on-bone, osteolytic cysts, osteophytes, and sclerosis.”

This matches how humans intuitively approach these kinds of questions: reduce complexity by breaking them into independent steps. But from an LLM perspective, we might also tie this to a discovery from interpretability research: papers like this one (“In-Context Learning Creates Task Vectors”) or this one (“Time is Encoded in the Weights of Finetuned Language Models”) show that LLMs learn to express concepts in the form of look-up vectors and generate their answers that way (or at least some part of the answering process seems to work this way). For example, an LLM learns a “time of day” vector of neuron activations that, when multiplied with an actual input, reliably transforms that input into an output token that reflects the time of day. When you think about it that way, you want to ask questions at exactly the right level of specificity: the probability that an LLM has learned a neuron activation that correctly answers just one of these simpler questions ought to be higher than for a more complex combination of issues.
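To make the decomposition concrete, here is a small sketch. It assumes a hypothetical ask_true_false helper that wraps a single GPT call and returns a boolean, and shows how the two simpler answers can be recombined with ordinary Python logic so the original compound criterion is still evaluated.

```python
# Sketch: evaluate the two simpler statements separately, then recombine them
# with explicit logic that mirrors the original compound criterion.
FAILED_INJECTIONS = (
    "There has been a failure to decrease pain or improve function after at "
    "least a 3-month trial of therapeutic injections to the hip."
)
INJECTIONS_CONTRAINDICATED = (
    "Therapeutic injections are contraindicated, for example if there is "
    "evidence of bone-on-bone, osteolytic cysts, osteophytes, and sclerosis."
)

def injection_criterion_met(ask_true_false, medical_record: str) -> bool:
    """`ask_true_false` is any callable that sends one statement plus the
    medical record to GPT and returns True/False (hypothetical helper)."""
    failed_trial = ask_true_false(FAILED_INJECTIONS, medical_record)
    contraindicated = ask_true_false(INJECTIONS_CONTRAINDICATED, medical_record)
    # The criterion is satisfied if the injection trial failed OR injections
    # were contraindicated.
    return failed_trial or contraindicated
```

Keeping the boolean recombination in code rather than in the prompt also keeps the guideline logic explicit and auditable.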

3. When GPT is asked the same question twice with different wording (first a formal version, followed by a simpler version), GPT will answer in a format more appropriate for the formal version.

a. Originally, we asked if the member was a non-smoker, and GPT would sometimes answer incorrectly.

b. We came to the correct response by providing both statements: “The member is a non-smoker” and “The member is a smoker.”

c. Replacing negatives with positives: Asking, “Are you a smoker?” yielded better results than asking, “Are you a non-smoker?” We updated our smoker questions to be positive instead of negative.

In the literature, this relates to the many papers that have studied the positive impact of voting schemes on output quality. Different ways of asking what is (to a human) the same question seem to elicit different activations of the LLM’s learned material, and when you then run the different results through a majority vote, you get an overall better answer. Here is a paper with the most elaborate version of this (“Graph of Thoughts: Solving Elaborate Problems with Large Language Models”, February 2024), and here is a recent paper using this approach (“Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding”, February 2024). Be careful, though: not providing any variation in external stimulus (e.g., different ways of asking the same question) will likely not improve quality, contrary to popular belief, as this paper showed (“Large Language Models Cannot Self-Correct Reasoning Yet”, March 2024).
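A minimal sketch of such a voting scheme, reusing the same hypothetical ask_true_false helper as in the sketch above (the phrasings are illustrative):

```python
from collections import Counter

# Sketch of a simple voting scheme: ask the same underlying question in several
# phrasings and take the majority answer.
SMOKER_PHRASINGS = [
    "The member is a smoker.",
    "The member currently smokes tobacco.",
    "The medical record documents active smoking.",
]

def is_smoker(ask_true_false, medical_record: str) -> bool:
    """Majority vote over several phrasings of the same clinical question."""
    votes = [ask_true_false(p, medical_record) for p in SMOKER_PHRASINGS]
    return Counter(votes).most_common(1)[0][0]
```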

4. Providing definitions as additional context helps GPT with medical jargon.

a. GPT struggled with certain specific medical definitions, e.g., the Tonnis Classification and Kellgren-Lawrence grading. At first, GPT was unable to answer questions about these definitions unless the name was directly mentioned in the medical record.

b. However, when we provided GPT with the definition of a Grade III Tonnis, it was able to answer correctly more consistently by detecting whether a member had multiple osteophytes or joint space narrowing. Here’s the definition we passed as additional context to GPT, taken directly from the clinical guideline:

  i. “Grade III is defined as Moderate multiple osteophytes, definite joint space narrowing, some sclerosis, and possible deformity of bone contour and Grade IV is defined as Large osteophytes, marked joint space narrowing, severe sclerosis, and definite deformity of bone contour.”

This phenomenon is consistent with Google’s recent paper on “Capabilities of Gemini Models in Medicine”: the paper showed that allowing Gemini 1.0 Ultra to do web searches when answering clinical questions took its accuracy from 87% to 91% (which, interestingly, is a bigger lift than the paper gets from fine-tuning the Ultra model on clinical Q&A). So, “creating mappings” between the task vectors that the model has learned in its pretraining and the task you’re trying to get done works. (A web search is probably just an automated version of creating that kind of mapping at run time; here we’re doing it manually at prompt time.)
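In practice, this can be as simple as prepending the guideline’s own definitions to the prompt. A rough sketch (the prompt wording is an assumption, not our exact prompt; the definitions are the ones quoted above):

```python
# Sketch: pass the guideline's own definitions as extra context so GPT can map
# clinical jargon (e.g., "Grade III") onto concrete findings in the record.
GRADE_DEFINITIONS = (
    "Grade III is defined as Moderate multiple osteophytes, definite joint "
    "space narrowing, some sclerosis, and possible deformity of bone contour "
    "and Grade IV is defined as Large osteophytes, marked joint space "
    "narrowing, severe sclerosis, and definite deformity of bone contour."
)

def build_messages(statement: str, medical_record: str) -> list[dict]:
    """Build chat messages that include the definitions as additional context."""
    return [
        {
            "role": "system",
            "content": "Use the following definitions when interpreting the "
            f"medical record:\n{GRADE_DEFINITIONS}",
        },
        {
            "role": "user",
            "content": f"Medical record:\n{medical_record}\n\n"
            f"Statement: {statement}\n"
            "Answer True or False, with justification and a citation.",
        },
    ]
```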

5. Incorporating function calling to return a more structured and reliable output, where we ask for the decision and reasoning for each piece of logic. We’re asking for a JSON response:

a. There are two ways we can try to get the output in a certain format:

1. Ask in the prompt.

2. In the API call, add the desired format as one of the parameters.

Only asking in the prompt often results in incorrectly formatted output. The difference is relying on GPT’s native JSON output feature versus trying to bake the format into the prompt. Adding this step is important for ensuring standards and consistency in the response.

b. We can also leverage function calling to request reasoning, citations, and a boolean decision for each criterion, as shown in the sketch below. By parsing this information into a reliable format, we can pass it into Python functions containing the clinical guideline’s logic.
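Here is a hedged sketch of what that could look like with the OpenAI Python SDK’s function-calling (tools) interface; the schema, function name, prompt, and model are illustrative, and exact parameters may vary by SDK version.

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative tool schema: have GPT return a decision, reasoning, and
# citations for a single criterion as structured JSON.
CRITERION_TOOL = {
    "type": "function",
    "function": {
        "name": "record_criterion_decision",
        "description": "Record whether a clinical guideline criterion is met.",
        "parameters": {
            "type": "object",
            "properties": {
                "decision": {"type": "boolean"},
                "reasoning": {"type": "string"},
                "citations": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["decision", "reasoning", "citations"],
        },
    },
}

def evaluate_criterion_structured(statement: str, medical_record: str) -> dict:
    """Return {"decision": bool, "reasoning": str, "citations": [...]} for one criterion."""
    response = client.chat.completions.create(
        model="gpt-4",  # model name is illustrative
        messages=[
            {
                "role": "user",
                "content": f"Medical record:\n{medical_record}\n\nStatement: {statement}",
            },
        ],
        tools=[CRITERION_TOOL],
        # Force the model to respond via the tool so the output is always valid JSON.
        tool_choice={"type": "function", "function": {"name": "record_criterion_decision"}},
    )
    call = response.choices[0].message.tool_calls[0]
    result = json.loads(call.function.arguments)
    # The parsed dict can then feed the Python functions that encode the
    # guideline's approval logic, e.g. criterion_met = result["decision"].
    return result
```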

In the next part of this series, we’ll share where GPT is still struggling.
