Oscar Health’s early observations on OpenAI o1-preview

By Nikhita Luthra and Lauren Pendo

Yesterday, OpenAI dropped a preview of a new series of generative models focused on reasoning, dubbed “OpenAI o1.” Our initial observations are that the “o1-preview” model is more autonomous in generating a set of steps to solve a problem, more precise, and able to handle more complex tasks with more consistent output. We also appreciated the more digestible formatting of its outputs and its increased sense of ‘confidence’ relative to GPT-4o, likely powered by well-thought-out explanations of its thinking. In short, o1-preview appears to reason significantly better.

In terms of impact, o1-preview will unlock net new use cases as well as accelerate time to deployment for existing use cases in healthcare. In this article, we lay out what applications we believe will be accelerated in the healthcare landscape with o1-preview’s advanced reasoning capabilities.

Claim Pricing

Problem in healthcare: There are thousands of rules stored in natural language – including legally binding contracts, industry best practices, and government guidelines – that determine how much a medical service will cost. Synthesizing these rules is extremely cumbersome, prone to error, and opaque.
How Oscar is using AI to solve it: 4o hasn’t really come close to being able to take in these rules and determine the cost of a service without a lot of manual intervention and technical scaffolding.
Our observations on o1-preview: We tested o1-preview’s ability to determine the cost for a newborn delivery. o1-preview demonstrated a breakthrough in determining costs by correctly identifying the relevant rules to apply from a contract, making assumptions where needed, and performing calculations. We were particularly impressed by its ability to identify nuances such as special carve-outs for high-cost drugs that the mother had to take, apply compounding cost increases, and explain cost discrepancies due to secondary insurance plans and the mother’s complications, including a UTI. Most notably, it applied this logic independently, without specific prompting.

We passed o1-preview the claim and the relevant section of a contract and simply said ‘Determine the contracted rate.’ It autonomously generated a plan to follow to determine the correct amount:

Here’s o1-preview correctly identifying the relevant parts of the contract to apply based on lines from the claim.

In contrast, 4o was not able to get this right at all. In one case, 4o even incorrectly applied the contract clause for a C-section, instead of a vaginal delivery, which would have been a lot more expensive for the patient.

Finally, here’s o1-preview performing the calculations to arrive at the correct price for the delivery. In addition to identifying the correct clauses of the contract to apply, o1-preview also paid attention to a footnote in the contract specifying an annual inflation adjuster, and even applied the inflation adjuster to recompute the service level unit cost. This is a nuanced calculation that the healthcare ecosystem messes up all the time.
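To make the inflation-adjuster step concrete, here is a minimal Python sketch of how a compounding annual adjuster might be applied to a unit rate. It is an illustration only: the function names, the base rate, the adjuster percentage, the number of years, and the rounding rule are all hypothetical assumptions, since the actual contract terms are not reproduced in this article.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def adjusted_unit_rate(base_rate: Decimal, annual_adjuster: Decimal, years_elapsed: int) -> Decimal:
    """Compound a base unit rate by an annual inflation adjuster.

    Hypothetical helper: real contracts define their own effective
    dates, adjuster terms, and rounding rules.
    """
    if years_elapsed < 0:
        raise ValueError("years_elapsed must be non-negative")
    # Compounding: the adjuster is applied once per elapsed contract year.
    rate = base_rate * (Decimal(1) + annual_adjuster) ** years_elapsed
    return rate.quantize(CENT, rounding=ROUND_HALF_UP)


def line_total(unit_rate: Decimal, units: int) -> Decimal:
    """Price one claim line as adjusted unit rate times billed units."""
    return (unit_rate * units).quantize(CENT, rounding=ROUND_HALF_UP)


# Made-up numbers: a 1,000.00 base rate with a 3% annual adjuster,
# two years after the contract's base year, billed for 2 units.
rate = adjusted_unit_rate(Decimal("1000.00"), Decimal("0.03"), 2)
total = line_total(rate, 2)
print(rate, total)
```

Using `Decimal` rather than floats keeps the cents exact. Missing the footnote, or applying the adjuster once instead of compounding it, yields a plausible-looking but wrong price.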

o1-preview is a huge leap forward towards making healthcare costs more transparent and meeting consumers’ expectations that costs are calculated correctly.

Navigating clinical guidelines

Problem in healthcare: Clinical teams navigate complex clinical guidelines to approve procedures for members and ensure they get timely access to care. Here’s an example of clinical guidelines for a hip replacement.
How Oscar is using AI to solve it: Oscar is building an AI assistant for nurses that will speed up the care approval process for members who require complicated and expensive procedures. This care approval process requires nurses to substantiate criteria from the clinical guideline in medical documentation submitted by the provider. We have tested 4o’s ability to navigate complicated and esoteric clinical guidelines. We found that we have to simplify the rules for 4o because it can only handle a small subset of criteria at a time. Otherwise it gets lost and at times hallucinates. This ‘simplification’ effort can add months to implementation.
Our observations on o1-preview: Unlike 4o, o1-preview can navigate the complex rules laid out in the clinical guideline on its own. It can then evaluate the rules against a medical record, providing direct quotations from the record as evidence. While it still needs some guidance, we hypothesize we will no longer need to ‘dumb down’ the criteria as we do for 4o. This could accelerate tool deployment, leading to faster approvals for members seeking care.

Here’s an example of o1-preview reviewing a medical record for a Total Hip Arthroplasty. Given the clinical guideline and a medical record, o1-preview was able to navigate the complex criteria and substantiate approval for the requested hip replacement:

Let's go through the clinical guideline and evaluate whether the criteria for total hip arthroplasty (THA) approval are met based on the provided medical record.

Criteria Evaluation:

Radiologic or Arthroscopic Findings:

Kellgren Lawrence Grade III or IV radiographic findings (Table 1); or

Tönnis Classification Grade 3 radiographic findings (Table 2); or

Evidence from Medical Record:

The medical record states: "Multiple orthogonal radiographs of bilateral hips demonstrate no evidence of fracture. There is advanced bone-on-bone changes bilaterally. Morphology of the femoral head is deformed with evidence of coxa valga and excessive femoral anteversion. Morphology of the acetabulum is dysplastic bilaterally."

This satisfies the criteria for both Kellgren Lawrence Grade III/IV and Tönnis Classification Grade 3 radiographic findings, as described in the medical record.

Conclusion: Criteria met.

Severe Pain Interfering with Functional Activity for at Least 3 Months:

Evidence from Medical Record:

Patient reports "pain is worse with activity, weightbearing, and it interferes with activities of daily living."

Pain is rated as 7/10, which is moderate to severe.

Conclusion: Criteria met.

Failure of Conservative Treatment:

Evidence from Medical Record:

The patient has failed conservative treatment including "antiinflammatory medication and other oral medications, activity modification, physical therapy, assistive devices, weight loss, bracing as indicated, and joint injections."

No intra-articular steroid injections to the hip within three months before the surgery date.

Conclusion: Criteria met.

No Contraindications Present:

Evidence from Medical Record:

No contraindications such as active joint infection, systemic bacteremia, or severe medical conditions that outweigh the potential benefits of surgery are mentioned.

It will be more than 12 weeks from the most recent injection to the joint replacement surgery date.

Conclusion: Criteria met.

Failure of Previous Surgical Interventions (if applicable):

Evidence from Medical Record:

No history of previous hip surgeries or interventions mentioned, therefore this criterion does not apply in this case.

Conclusion: Not applicable.

Overall Evaluation:

Based on the evaluation of the clinical guideline and the provided medical record, the patient meets all the criteria for total hip arthroplasty approval:

Radiographic findings are consistent with both Kellgren Lawrence Grade III/IV and Tönnis Classification Grade 3.

The patient experiences severe pain that significantly interferes with daily activities for more than 3 months, despite conservative treatment.

There are no contraindications present.

Recommendation: The request for total hip arthroplasty should be approved based on the comprehensive documentation supporting each criterion. There is sufficient evidence of advanced joint disease and failure of conservative management, making the procedure medically necessary to improve the patient's quality of life and functional status.
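Because the output above pairs each criterion with quotations from the record, a reviewing tool could check those quotations mechanically before a nurse sees them. The Python sketch below is one hedged way to do that, not a description of Oscar’s tooling. The `CriterionResult` structure, the `Status` values, and the rule that anything short of “all applicable criteria met, all quotes found” goes to human review are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    MET = "met"
    NOT_MET = "not met"
    NOT_APPLICABLE = "not applicable"


@dataclass
class CriterionResult:
    """One criterion from the guideline, as evaluated by the model (hypothetical shape)."""
    name: str
    status: Status
    quotes: list[str] = field(default_factory=list)


def _normalize(text: str) -> str:
    # Collapse whitespace and case so line breaks in the chart don't cause false misses.
    return " ".join(text.split()).lower()


def unsupported_quotes(result: CriterionResult, record_text: str) -> list[str]:
    """Return quotes that do not appear verbatim (after normalization) in the record."""
    record = _normalize(record_text)
    return [q for q in result.quotes if _normalize(q) not in record]


def recommend(results: list[CriterionResult], record_text: str) -> str:
    """Suggest approval only when every applicable criterion is met and every quote is grounded."""
    if any(unsupported_quotes(r, record_text) for r in results):
        return "needs human review"
    applicable = [r for r in results if r.status is not Status.NOT_APPLICABLE]
    if applicable and all(r.status is Status.MET for r in applicable):
        return "approve"
    return "needs human review"
```

In this sketch a quotation that cannot be found in the record never produces an approval. That is a deliberately conservative choice for a workflow where hallucinated evidence is the main risk.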

Fraud, Waste, and Abuse

Problem in healthcare: Detecting Fraud, Waste, and Abuse (FWA) is critical to avoiding excess costs that impact members.
How Oscar is using AI to solve it: We are leveraging GPT-4o in prototypes to detect common Fraud, Waste, or Abuse schemes. With 4o, we have to break down each common scheme into our own rule set to evaluate the medical record.
Our observations on o1-preview: Unlike 4o, o1-preview is able to navigate disparate data sources and mine them for common FWA signals on its own. It also provides a more logical and thorough explanation of its findings, which is crucial for maintaining robust and auditable processes in a highly regulated industry. For example, one common signal for FWA investigation is when the claim for an encounter is not billed under the same provider who rendered the service. This is called ‘incident-to billing,’ which is not allowed per Oscar’s reimbursement policies. Our testing suggested o1-preview was able to successfully identify incident-to billing, given a claim and corresponding clinical documentation for the visit.

Here’s o1-preview evaluating a synthetic claim and medical record. It correctly identifies that the services were provided by an Advanced Practice Registered Nurse (APRN) and flags that there is no signature from the MD, despite the claim being billed under the MD.

While GPT-4o arrived at the correct final decision of denial, its reasoning was incorrect. The model incorrectly stated that the services were provided by Dr. John Doe and made no reference to the services being provided by an APRN. Instead, 4o suggested a denial due to the difference between the rendering and billing NPIs on the claim, which is not the correct reason for denial.

Here’s GPT-4o’s reasoning:

This example highlights that while both models can suggest the same action, the reasoning behind the action is equally important. o1-preview is able to reason at a much higher level, one that reflects the process our internal teams would follow to review this claim for incident-to billing.
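The incident-to signal described in this section reduces to a comparison between the claim and the clinical documentation. The Python sketch below shows that comparison under simplifying assumptions. The `Claim` and `VisitNote` structures and their field names are hypothetical, the APRN’s name is invented, and a real review would work from unstructured chart text rather than clean fields.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    billing_provider: str  # provider the claim is billed under


@dataclass
class VisitNote:
    rendering_provider: str  # who the documentation says delivered the service
    signed_by: list[str] = field(default_factory=list)  # signatures found on the note


def incident_to_signals(claim: Claim, note: VisitNote) -> list[str]:
    """Return human-readable reasons a claim looks like incident-to billing.

    An empty list means this sketch found no signal; it does not mean the claim is clean.
    """
    signals: list[str] = []
    if note.rendering_provider != claim.billing_provider:
        signals.append(
            f"services rendered by {note.rendering_provider} "
            f"but billed under {claim.billing_provider}"
        )
        if claim.billing_provider not in note.signed_by:
            signals.append(f"no signature from {claim.billing_provider} in the documentation")
    return signals


claim = Claim(billing_provider="Dr. John Doe")
note = VisitNote(rendering_provider="Jane Roe, APRN", signed_by=["Jane Roe, APRN"])
for reason in incident_to_signals(claim, note):
    print(reason)
```

The check reads the rendering provider from the documentation, not from the NPI fields on the claim. That mirrors the distinction between the two models’ reasoning above.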

Medical Records

Problem in healthcare: Clinicians have to navigate extensive medical records where errors or omissions can have high stakes, especially for the sickest patients, who have the longest charts. For these patients, charts can run from hundreds of pages to thousands in the most acute scenarios.
How Oscar is using AI to solve it: We use GPT-4o to extract clinical data from charts. The insights from charts are then leveraged by clinicians and coders to automate various workflows and deliver better care. Current implementations for more complicated, multi-step workflows require Oscar teams to spend quite a bit of time independently verifying and sense-checking 4o’s insights to get the full clinical context.
Our observations on o1-preview: o1-preview is able to perform clinical data extraction similar to 4o, but offers more specificity and detailed explanations of its reasoning. For example, we had one chart where both 4o and o1-preview identified that the patient had immunosuppression, but o1-preview provided a more precise identification because it could correlate the patient’s immunosuppression as a side effect of a particular drug they were on:

4o:

o1-preview:

We found another example where o1-preview was able to provide a more nuanced and helpful clinical context than 4o. 4o identified the primary reason for a hospital visit as cancer. Meanwhile, o1-preview was able to gather that while the patient had a history of cancer, the more specific reason the patient was at the hospital on that day was actually for removal of a feeding tube:

o1-preview’s increased robustness and specificity lead to more accurate identification of the conditions or diagnoses a member has. You’ll note that o1-preview also provides an explanation of its reasoning, which is critical in a highly regulated and high-stakes industry. This will increase the trust that Oscar teams have in the insights it delivers. Ultimately, o1-preview is a step forward for workflows where we need to extract clinical insights reliably and quickly so that we can deliver better, more immediate care for our members.

Automating providers’ administrative tasks

Problem in healthcare: Providers are burdened with an overwhelming amount of administrative tasks that raise costs in the system overall and reduce access to care.
How Oscar is using AI to solve it: We launched a feature for Oscar’s Virtual Care practice leveraging 4o to ask the patient basic questions before a secure-messaging encounter with a clinician.
Our observations on o1-preview: While 4o does a fine job, when we tested o1-preview, we were stunned at how much more specific and helpful the conversation was. Simply put, o1-preview asked better questions. In an intake for a dermatology visit for a rash on the patient’s neck, 4o asked relatively unhelpful questions. For example, 4o asked for ‘triggers.’ As a patient, if I already know the triggers for my rash, I probably wouldn’t need the visit in the first place! We found that 4o also quickly jumped to ‘contact dermatitis’ while o1-preview focused on asking more questions to gather evidence: o1-preview asked if the patient ever had a previous diagnosis of two particular skin diseases - eczema and psoriasis. o1-preview also asked if the patient had specific historical labs it deemed would be helpful for the doctor to be aware of.
o1-preview was able to independently generate better questions to collect the clinical context from the patient. This will improve the quality of the pre-visit insights our clinicians receive before they see a patient, while continuing to dramatically reduce administrative time for clinicians and allowing them to focus on what they do best – providing care.

If you’ve made it to the end of this article, you probably have gathered that healthcare is gnarly, processes are often nonsensical, and systems are antiquated. This adds cost to the consumer. Previous generative models promised efficiency gains and quality improvements but required us to ‘simplify’ how healthcare works to actually realize gains.

On the other hand, reasoning offers an alternative where we may be able to deliver efficiency and quality gains without necessarily trying to change or simplify the world for a model; rather, the model is empowered to live in a world of messiness but still make sense of it, much like we do.