Making Sense of AI Report Cards: Understanding the Numbers that Could Save Your Life

Authors: Yashbir Singh, ME, PhD1, Jesper B. Andersen, MSc, PhD2, Gregory J. Gores, MD3

Affiliations:

1 Radiology, Mayo Clinic, Rochester, Minnesota, USA
2 Biotech Research and Innovation Centre (BRIC), Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
3 Division of Gastroenterology & Hepatology, Mayo Clinic, Rochester, Minnesota, USA

When your clinician mentions that a new artificial intelligence (AI) system has “95% accuracy” in detecting your disease, cholangiocarcinoma, what does this number mean for you? And should you trust it completely? Let us break down the numbers behind this accuracy measure in a way that will hopefully make more sense to you, because understanding them could genuinely impact your care decisions.

Imagine you are at a TSA security checkpoint at the airport. The metal detector needs to catch dangerous items, say a gun or a knife (this relates to the machine’s sensitivity). However, it should not beep for smaller items such as your glasses or belt buckle (this relates to the machine’s specificity). In medical AI, we face the same challenge. When detecting cholangiocarcinoma, sensitivity means catching the cancer when it is there: if the AI model has 90% sensitivity, it finds 9 out of 10 actual cases. However, here is the catch: that means it still misses 1 in 10 people with cholangiocarcinoma who really need treatment.

Specificity is equally crucial. If the AI model has 85% specificity, it correctly identifies healthy people 85% of the time (85 people out of 100). However, what about the other 15%? Those 15 people are told they might have cancer when they really do not, leading to sleepless nights, unnecessary biopsies, and emotional trauma for the affected individual and their family.
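
For readers who want to see the arithmetic, here is a minimal sketch in Python; the counts are purely illustrative and are not taken from any real cholangiocarcinoma study.

```python
# Minimal sketch: sensitivity and specificity from hypothetical screening counts.
# The numbers below are illustrative only, not from any real cholangiocarcinoma study.

true_positives = 90    # patients with cholangiocarcinoma the model correctly flagged
false_negatives = 10   # patients with cholangiocarcinoma the model missed
true_negatives = 85    # healthy people the model correctly cleared
false_positives = 15   # healthy people incorrectly flagged as possible cancer

# Sensitivity: of everyone who truly has the disease, what fraction is caught?
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: of everyone who is truly healthy, what fraction is correctly cleared?
specificity = true_negatives / (true_negatives + false_positives)

print(f"Sensitivity: {sensitivity:.0%}")  # 90% -> 9 of 10 real cases found
print(f"Specificity: {specificity:.0%}")  # 85% -> 15 of 100 healthy people falsely alarmed
```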

Here is where it gets interesting.

Clinicians look at something called the receiver operating characteristic (ROC) curve. Think of the ROC curve as the AI’s report card, showing how well it balances catching disease against avoiding false alarms. The “area under the curve” (AUC) gives a single grade: 0.5 means the AI is guessing randomly (a failing grade, similar to flipping a coin), while 1.0 means perfect performance (which never happens in real medicine). Most cholangiocarcinoma AI systems score between 0.75 and 0.92: good, but not infallible.
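
For clinicians and researchers, here is a minimal sketch of how the AUC grade is computed from a toy set of predictions, assuming scikit-learn is available; the labels and risk scores below are invented for illustration.

```python
# Minimal sketch: the "report card" grade (AUC) for a toy set of predictions.
# Labels and risk scores are invented for illustration; requires scikit-learn.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]          # 1 = confirmed cholangiocarcinoma
y_score = [0.10, 0.20, 0.15, 0.35, 0.40, 0.80, 0.70, 0.30, 0.85, 0.90]  # model risk scores

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(f"AUC: {auc:.2f}")   # 0.5 ~ coin flip, 1.0 ~ perfect (never seen in real medicine)

# Each threshold trades sensitivity (tpr) against false alarms (fpr):
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th:.2f}: sensitivity {t:.2f}, false-positive rate {f:.2f}")
```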

For clinicians and researchers: Consider evaluating AI models using additional metrics such as positive predictive value (PPV) and negative predictive value (NPV) in your specific patient population. The F1 score, which balances precision and recall, may be particularly relevant when comparing models across different prevalence settings. Document the confidence intervals around these metrics, as they provide crucial information about model reliability in clinical decision-making.

Why does this matter? It matters because cholangiocarcinoma is rare: even a 95% accurate test can be wrong more often than right. Consider this example: if only 1 in 1,000 people have cholangiocarcinoma and the AI model flags 100 people as positive, statistically only about 2 of them might have the disease, while roughly 98 experience false alarms and the personal grief that follows such devastating information. This is why your clinician never relies on AI alone but uses it as one piece of a larger puzzle.
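
The rarity example above can be worked through directly. In the sketch below, the 95% sensitivity and 95% specificity values are assumptions standing in for a “95% accurate” test, and the population size is arbitrary.

```python
# Worked version of the rarity example. The 95% sensitivity and 95% specificity
# figures are illustrative stand-ins for a "95% accurate" test.

prevalence = 1 / 1000        # 1 in 1,000 people actually have cholangiocarcinoma
sensitivity = 0.95
specificity = 0.95
population = 100_000

diseased = population * prevalence                    # 100 people
healthy = population - diseased                       # 99,900 people

true_positives = diseased * sensitivity               # 95 caught
false_positives = healthy * (1 - specificity)         # 4,995 false alarms

# Positive predictive value: if the model flags you, how likely is real disease?
ppv = true_positives / (true_positives + false_positives)
print(f"PPV: {ppv:.1%}")     # about 1.9% -- roughly 2 true cases per 100 flagged
```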

Clinical implementation note: When integrating AI tools into practice workflows, establish clear protocols for managing borderline cases (e.g., predictions between 40% and 60% probability). Consider implementing a “human-in-the-loop” approach for high-stakes decisions, and maintain detailed logs of AI predictions versus final clinical outcomes to support continuous model improvement and validation in your specific patient population.
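
One possible shape for such a protocol is sketched below; the thresholds, the route_case function, and the log format are hypothetical placeholders rather than a validated workflow.

```python
# Hypothetical sketch of a borderline-case protocol. Thresholds, the route_case
# function, and the log format are placeholders, not a validated workflow.
import csv
from datetime import datetime, timezone

REVIEW_LOW, REVIEW_HIGH = 0.40, 0.60   # borderline band sent for human review

def route_case(patient_id: str, predicted_risk: float) -> str:
    """Return a disposition and append the prediction to an audit log."""
    if REVIEW_LOW <= predicted_risk <= REVIEW_HIGH:
        disposition = "human_review"          # human-in-the-loop for borderline cases
    elif predicted_risk > REVIEW_HIGH:
        disposition = "expedite_workup"
    else:
        disposition = "routine_follow_up"

    # Log every prediction so it can later be compared with the final clinical outcome.
    with open("ai_prediction_log.csv", "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), patient_id, predicted_risk, disposition]
        )
    return disposition

print(route_case("anon-001", 0.52))   # -> human_review
```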

The most promising development is something called “calibration”: teaching the AI model to express uncertainty. Instead of giving a flat verdict of “cancer” or “no cancer,” newer AI systems report a percentage risk, such as “70% risk of cancer,” helping clinicians decide whether more tests are needed.
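
Here is a minimal sketch of how calibration can be checked, assuming scikit-learn; the predicted risks and outcomes are synthetic. A well-calibrated model is one whose predicted risks match the observed disease rates, bin by bin.

```python
# Minimal sketch of checking calibration: do predicted risks match observed rates?
# Synthetic predictions and outcomes; requires scikit-learn.
from sklearn.calibration import calibration_curve

y_true = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.25, 0.55, 0.40, 0.60, 0.70, 0.80,
          0.15, 0.35, 0.90, 0.45, 0.65, 0.85]

# For each bin of predicted risk, compare the mean predicted risk with the
# fraction of patients in that bin who actually had cancer.
observed, predicted = calibration_curve(y_true, y_prob, n_bins=4)
for p, o in zip(predicted, observed):
    print(f"predicted ~{p:.0%} risk -> observed {o:.0%} with disease")
```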

Research considerations: When developing or validating AI models for cholangiocarcinoma detection, ensure your training and validation datasets reflect the demographic and clinical diversity of your target population. Report model performance across different patient subgroups (age, sex, comorbidities, imaging protocols) and consider external validation in multiple healthcare settings. Document any preprocessing steps, feature selection methods, and hyperparameter tuning to ensure reproducibility.
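
As one possible illustration of subgroup reporting, the sketch below groups a hypothetical results table by sex and computes sensitivity per subgroup; the table, column names, and values are invented, and in practice the same grouping would be repeated for age, comorbidities, and imaging protocol.

```python
# Hypothetical sketch of reporting performance by subgroup; the toy table and
# column names are illustrative only. Requires pandas.
import pandas as pd

df = pd.DataFrame({
    "sex":       ["F", "F", "F", "M", "M", "M", "F", "M"],
    "has_cca":   [1,   0,   1,   1,   0,   0,   1,   1],   # confirmed diagnosis
    "predicted": [1,   0,   0,   1,   1,   0,   1,   1],   # model output
})

def subgroup_sensitivity(group: pd.DataFrame) -> float:
    """Fraction of confirmed cases the model flagged within one subgroup."""
    cases = group[group["has_cca"] == 1]
    return (cases["predicted"] == 1).mean() if len(cases) else float("nan")

for sex, group in df.groupby("sex"):
    print(f"sex={sex}: sensitivity {subgroup_sensitivity(group):.0%} (n={len(group)})")
```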

Remember, these metrics are not just abstract numbers – they guide real decisions about your health. When evaluating any AI-based diagnosis, ask your clinician: What’s the sensitivity and specificity for someone like me? How confident is the model in this prediction? What other evidence supports or contradicts this finding?

Key questions for clinical teams:

Has this AI model been validated in patients with similar demographics and clinical characteristics as your population?

What is the model’s performance specifically in early-stage versus advanced cholangiocarcinoma?

How does the AI tool integrate with existing diagnostic workflows and electronic health records?

What ongoing monitoring systems are in place to detect model drift or performance degradation over time?

Understanding these numbers empowers you to be an active participant in your care, asking the right questions and making informed decisions alongside your medical team.

References:

  1. Singh, Y., Eaton, J. E., Venkatesh, S. K., Welle, C. L., Smith, B., Faghani, S., … & Erickson, B. J. Deep learning analysis of MRI accurately detects early-stage perihilar cholangiocarcinoma in patients with primary sclerosing cholangitis. Hepatology, 10-1097.
  2. Schattenberg, J. M., Chalasani, N., & Alkhouri, N. (2023). Artificial intelligence applications in hepatology. Clinical Gastroenterology and Hepatology, 21(8), 2015-2025.
  3. Park, S. H., & Han, K. (2018). Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology, 286(3), 800-809.