Related Condition Search

By Bingjie Wang

It’s always nice to use “old but gold” techniques from the literature. LLMs are cool, but precision-recall curves are in every machine learning paper, so they should make an appearance on this blog too.

Semantic Search

In the first step of our ‘find care’ pipeline, Oscar has an omni-search bar that’s able to distinguish between reasons for visit, providers, facilities, and drugs. The bar uses lexicographic (i.e., close in dictionary order) techniques to find autocomplete matches within each group type, and a rules-based model then decides which results from each group should be surfaced. These rules help us make sure the results are relevant. For example, typing “cli” only brings up types of clinics, but not doctors named “Clive” until the user types “cliv” into the search bar. The judgment being made here trades recall for precision: Oscar wants to serve users interested in doctors named Clive (recall), but also doesn’t want to overload the user with results (precision).
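To make the idea concrete, here is a minimal sketch of prefix matching with a per-group minimum-prefix rule. The catalog, group names, and character cutoffs below are hypothetical illustrations, not Oscar’s actual data or rules.

```python
# Sketch of prefix-based autocomplete with a simple precision rule.
# CATALOG and MIN_PREFIX are illustrative placeholders, not Oscar's real config.
CATALOG = {
    "facility": ["clinic - walk in", "clinic - urgent care"],
    "provider": ["clive barker, md", "claire smith, do"],
}

# Require a longer prefix before surfacing providers, so "cli" shows clinics
# but doctors named Clive only appear once the user has typed "cliv".
MIN_PREFIX = {"facility": 3, "provider": 4}

def autocomplete(query: str) -> dict[str, list[str]]:
    query = query.lower().strip()
    results = {}
    for group, terms in CATALOG.items():
        if len(query) < MIN_PREFIX[group]:
            continue  # rule-based gate: too few characters typed for this group
        matches = [t for t in terms if t.startswith(query)]
        if matches:
            results[group] = matches
    return results

print(autocomplete("cli"))   # clinics only
print(autocomplete("cliv"))  # now "clive barker, md" appears
```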

The lexicographic matcher uses standard spelling-correction and normalization tricks to correct “opthalmology” to “ophthalmology,” and that gets us quite far, but it’s better to also use semantic similarity (i.e., closeness in meaning) for the initial search. For example, a “reason for visit” query for “sneeze” should return “cough.” We do this by mapping each term in Oscar’s medical database to a vector using OpenAI’s embeddings (though the older GloVe or word2vec models would work fine). When a user query comes in, we embed it the same way and find the database vectors closest to the query vector.
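As a rough sketch of how this can work, assuming the openai Python client (v1+) and the text-embedding-3-small model — the model choice and the database terms below are assumptions for illustration, not necessarily what Oscar uses:

```python
# Sketch: embed the reason-for-visit database once, then embed each incoming query.
# Assumes the openai>=1.0 Python client; the terms below are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REASONS = ["cough", "fatigue", "sleep disorders", "insomnia", "annual physical"]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

reason_vecs = embed(REASONS)       # done offline, once per database update
query_vec = embed(["sneeze"])[0]   # done at search time for each query
```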

Precision and Recall

The same precision and recall tradeoff happens with semantic search. What do we mean by “the closest vectors”? Two vectors are “close” if the cosine distance between them is smaller than a threshold. 
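Continuing the sketch above, cosine distance and the threshold rule look like this (the function names are ours, for illustration):

```python
# Cosine distance between two vectors; a result is "close" if it is below the threshold.
def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def search(query_vec: np.ndarray, threshold: float) -> list[str]:
    scored = sorted(
        (cosine_distance(query_vec, v), term)
        for v, term in zip(reason_vecs, REASONS)
    )
    # keep only database terms whose distance to the query is below the threshold
    return [term for dist, term in scored if dist < threshold]
```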

If the threshold is too high, irrelevant results are surfaced (losing precision), but if it’s too low, relevant results are missed (losing recall). To determine the threshold, the search team compiled a list of the most common query words that currently don’t return any results, such as “vegetarian,” “swollen,” or “tiredness.” To get the whole team involved, we divided up the queries and each labeled the database terms we thought were relevant. For example, “tiredness” should return “fatigue,” “sleep disorders,” “depression,” and “insomnia.” Using this data, we replayed the queries and computed how precision and recall change with respect to the threshold.

A small detail: precision is the number of relevant results returned divided by the total number of results returned. This is undefined when a query returns zero results, so we treat such queries as having a precision of zero. As a result, both precision and recall go up as the threshold increases, until zero-result searches are eliminated.
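In code, the per-query metrics with that convention might look like this (the “tiredness” labels are the hypothetical ones from above):

```python
# Precision and recall for one replayed query, with the zero-result convention.
def precision_recall(returned: set[str], relevant: set[str]) -> tuple[float, float]:
    if not returned:
        return 0.0, 0.0  # zero results: define precision (and recall) as 0
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

# Hand-labeled relevant terms, echoing the "tiredness" example above.
LABELS = {"tiredness": {"fatigue", "sleep disorders", "depression", "insomnia"}}
```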

Theoretically, we could pick any point along this curve, but the standard in machine learning papers is to use the F1 score, the harmonic mean of precision and recall, which balances the two evenly. For this curve, the F1 score is maximized when the threshold is 0.44. With the threshold chosen, let’s QA the results!
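A sketch of the threshold sweep, building on the functions above (the grid of thresholds is arbitrary, and query_vecs is assumed to map each labeled query to its embedding):

```python
# Sweep the distance threshold, average precision/recall over the labeled queries,
# and keep the threshold that maximizes F1, the harmonic mean of the two.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def best_threshold(query_vecs: dict[str, np.ndarray],
                   labels: dict[str, set[str]]) -> float:
    thresholds = np.linspace(0.1, 0.9, 81)
    scores = []
    for t in thresholds:
        prs = [precision_recall(set(search(query_vecs[q], t)), relevant)
               for q, relevant in labels.items()]
        avg_p = float(np.mean([p for p, _ in prs]))
        avg_r = float(np.mean([r for _, r in prs]))
        scores.append(f1(avg_p, avg_r))
    return float(thresholds[int(np.argmax(scores))])
```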

Results

First, the good. A search for “dermatologist” clearly brings up things that are related to skin:

Now the bad. For “urgent,” we want to see results that might prompt a member to go to an urgent care center or emergency room, such as “chest pain.” However, the general-purpose embedding model doesn’t have the context that it’s being used in a healthcare search engine. Future work on fine-tuning could solve this problem.

Finally, the ugly. AI is never perfect, and occasionally there are wild misses; this is why humans are kept in the loop. Ultimately, these cases are unavoidable, so Oscar tracks user behavior to ensure we aren’t showing an excess of irrelevant results and that our omni-search bar achieves its intended purpose of routing members to the best care.

Future Work

Semantic search is currently available to a subset of users. We’d like to know whether members with semantic search enabled click more on our search results and whether the feature increases visits to Oscar-recommended doctors. This study will tell us whether our precision-recall tuning was accurate and how much more tuning the model needs before we’re comfortable with a full release.
