Needle in a Haystack: Using LLMs to Search for Answers in Medical Records

Part 1 of a 3-part series

By Maaike Desmedt, Lauren Pendo, Nikhita Luthra, and Sauhard Sahi

We are constantly thinking about how to make the workflows of clinicians less tedious, so they can do what they do best: serve patients and provide care. In a previous post on Messaging Encounter Documentation, we reviewed how our clinicians leverage AI to speed up documentation of their secure messaging visits with patients. In this next series of posts, we will share our learnings from a different application with the same goal: improving clinician workflows to deliver faster, better care for our members. We’ll delve specifically into how we can leverage LLMs to answer questions about the medical record.

Background

There are many workflows where clinicians must answer questions using a patient’s medical record. But these records are often dense, with many unstructured sections, which makes querying them cumbersome. Enter LLMs. They have the ability both to create structure out of unstructured natural language and to extract information from the medical record that can be used to answer targeted clinical questions.

One application where clinicians currently spend a lot of time digging through medical records is Prior Authorization (PA). When your doctor determines that you need a particular medical procedure, like a hip surgery, they submit a request to Oscar to get prior approval to perform the procedure. These PA requests are evaluated by clinical staff at Oscar for medical necessity, which usually takes the form of a set of clinical criteria derived from medical literature. A subset of these PA requests get automatically approved without human review by in-house technology, and we do not automatically deny requests. This saves everyone in the process time, getting the member the care they need as quickly as possible. The remainder require meticulous review from our clinicians. We wanted to understand whether GPT could help us approve PA requests more quickly. We took a highly iterative approach to this problem, ultimately testing out 10+ different system design strategies.

For each experiment, we computed the accuracy, precision, and recall against historical requests and dove deeper into examples to understand where the responses were and were not working. We then tweaked or changed the design for the next experiment; the complete set of experiments is represented in the table below (experiments noted with V2, V3, etc. contain small tweaks to the prompt based on our evaluation).

We focused on three main evaluation metrics; a short code sketch of how they can be computed follows the definitions:

Accuracy

  • In this case, accuracy means GPT recommends approval for auths that should have been approved and correctly flags auths where further review is needed.

Precision

  • What percentage of the auths that GPT recommended for approval were actually approved?

Recall 

  • What percentage of all auth approvals was GPT able to capture?
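The sketch below shows one way to compute these metrics against historical human decisions. It is a minimal illustration with hypothetical field names, not our actual evaluation pipeline:

```python
# Minimal sketch of the evaluation metrics, assuming each historical PA request
# has a human decision ("approve" vs. "needs review") and a GPT recommendation.
# The field names and structure here are illustrative only.

def evaluate(records):
    tp = fp = fn = tn = 0
    for r in records:
        gpt_approves = r["gpt_recommendation"] == "approve"
        human_approved = r["human_decision"] == "approve"
        if gpt_approves and human_approved:
            tp += 1  # GPT recommended approval, and the human approved
        elif gpt_approves and not human_approved:
            fp += 1  # GPT recommended approval that a human would not have given
        elif not gpt_approves and human_approved:
            fn += 1  # GPT flagged for review an auth a human approved
        else:
            tn += 1  # both agreed that further review was needed

    accuracy = (tp + tn) / len(records)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined if GPT approves nothing
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```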

Precision and recall trade off against each other:

  • If we approve everything, then recall is 100%: we are able to catch all of the ‘true’ approvals. However, precision is low: not all of the approvals are ‘true’ approvals, so we are letting out the door some requests that should have been raised for further review, and the leakage is very high.

  • If we approve nothing, then recall is 0: we aren’t catching any of the ‘true’ approvals. However, precision is high: we aren’t letting anything out the door that shouldn’t be, but there is also no benefit from an AI augmentation system.

In general, we are more comfortable with the AI being conservative with approvals, as we want the AI to augment the process and help with review. This is because the AI can never deny a request; if the AI is conservative, the worst that will happen is that a human will have to review it and make a choice, i.e., the status quo. In other words, we care more about high precision than high recall.

That being said, if the AI is too conservative (i.e., the model never recommends approval for anything), then it won’t actually streamline the process.
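To make that constraint concrete, the routing logic amounts to something like the sketch below. The function name and labels are hypothetical; the point is that every path other than an approval recommendation falls back to the existing human review queue:

```python
# Hypothetical routing sketch: the model can recommend approval, but it can
# never deny. Everything else falls back to the status quo of human review.

def route_pa_request(gpt_recommendation: str) -> str:
    if gpt_recommendation == "approve":
        # Surfaced as an approval recommendation in the existing approval flow.
        return "recommend_approval"
    # Unsure, criteria not met, or missing evidence: a clinician reviews as before.
    return "manual_review"
```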

You can see that in the end our best experiments were able to increase our precision to the 83-94% range. The results show that when our evaluation via GPT indicated a positive outcome, it matched the human assessment the vast majority of the time. But when we started, the results were much more humble.

Medical necessity is established using a collection of clinical criteria. One of the first strategies we tried (“Ask Questions 10 at a Time”) consisted of a mega prompt that attempted to evaluate all criteria in a single task. Not surprisingly, the model struggled with this.
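To illustrate the shape of such a mega prompt (the criterion text below is invented, not the actual clinical guideline language), the approach essentially packs every criterion into one request:

```python
# Sketch of the "ask everything at once" prompt shape. The criteria listed here
# are illustrative placeholders, not the actual guideline criteria.

criteria = [
    "Does the patient have radiographic evidence of advanced joint degeneration?",
    "Has the patient failed at least 6 weeks of conservative treatment?",
    # ... the remaining criteria ...
]

def build_mega_prompt(medical_record: str) -> str:
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    return (
        "You are reviewing a prior authorization request for hip surgery.\n"
        "Using only the medical record below, answer each criterion with "
        "True, False, or Unknown.\n\n"
        f"Criteria:\n{numbered}\n\n"
        f"Medical record:\n{medical_record}"
    )
```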

More surprising is that a subsequent test, which asked 1-3 questions at a time with function calling and citations (in the table it’s “Function Calling Ask One by One with Citations”; a sketch of this setup follows the example below), actually did much worse. Without the context of the other questions, the model became confused by certain questions and evaluated them to True, which led to false positives. To give a concrete example, GPT struggled with answering the following question:

  • Does the patient have a post-traumatic injury (e.g., fracture, infection) causing debilitating hip joint destruction affecting movement, causing pain or stiffness?

Instead of interpreting whether or not the member had a fracture or infection using information in the medical record, it gave a roundabout answer as to why a hip replacement was recommended (listing the conservative treatments that had been tried). It did not directly address whether a post-traumatic injury such as a fracture or infection had occurred. You can see a comparison of the two different cases below.
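For reference, the one-question-at-a-time setup is roughly the pattern sketched below, using the OpenAI chat completions API with a tool definition that forces a structured verdict plus a citation. The model name, schema, and prompt wording are our illustrative assumptions, not the exact configuration used in the experiments:

```python
# Sketch of asking a single criterion question with a structured "answer" tool,
# so the model must return a True/False/Unknown verdict plus a citation.
import json
from openai import OpenAI

client = OpenAI()

answer_tool = {
    "type": "function",
    "function": {
        "name": "record_answer",
        "description": "Record the answer to one clinical criterion question.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {"type": "string", "enum": ["True", "False", "Unknown"]},
                "citation": {
                    "type": "string",
                    "description": "Verbatim excerpt from the medical record supporting the answer.",
                },
            },
            "required": ["answer", "citation"],
        },
    },
}

def ask_one_question(question: str, medical_record: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer strictly from the medical record provided."},
            {"role": "user", "content": f"Medical record:\n{medical_record}\n\nQuestion: {question}"},
        ],
        tools=[answer_tool],
        tool_choice={"type": "function", "function": {"name": "record_answer"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)
```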

Comparison of Responses Asking Questions One at a Time vs. All at Once:

In using AI to improve clinician workflows and streamline member care, it’s important to note that denials will always be handled by a human, not AI. We have not yet deployed this prototype and are still in the testing phase. In the next post, we’ll delve deeper into the series of experiments and show how we iterated from early failures to much more promising results.
