Case Studies

From data readiness to evaluation design: a structured approach to AI reliability in medical imaging.

Project: NIH ChestX-ray14 Data Due Diligence Evaluation


Dataset Scale: 112,120 images / 30,805 patients


Objective: Apply the VeraDP Data Due Diligence Protocol to identify the dataset’s strengths, limitations, and risks


Identified Risks:
(Selected examples; the exhaustive list remains proprietary and confidential)

  • Traceability & Provenance: Core documentation refers to an earlier version of the dataset (ChestX-ray8, "CXR8"), leaving traceability gaps for the extended 14-disease version.
  • Annotation Reliability: Labels were produced by NLP-based text mining of radiological reports, which introduces structural label uncertainty (estimated error rate of ~10%).
  • Patient-Level Leakage: The number of images per patient is strongly imbalanced: a small subset of patients contributes disproportionately to the dataset (up to 184 images per patient). This creates a significant risk of data leakage if splits are not made at the patient level.
  • Pathology Sparsity: Extreme label imbalance (e.g., Hernia at 0.2% of images) limits the statistical power available to support claims about specific pathologies.
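To make the sparsity risk concrete, the sketch below estimates how many Hernia images a held-out test set would contain and how wide the resulting confidence interval on sensitivity would be. The 20% test fraction and the 40-of-45 detection count are illustrative assumptions, not findings from the evaluation, and the Wilson score interval is one common choice among several.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

TOTAL_IMAGES = 112_120      # dataset scale stated above
HERNIA_PREVALENCE = 0.002   # ~0.2%, as stated above
TEST_FRACTION = 0.2         # assumption: 20% of images held out for testing

hernia_total = round(TOTAL_IMAGES * HERNIA_PREVALENCE)  # approximate count of Hernia images
hernia_test = round(hernia_total * TEST_FRACTION)       # expected Hernia images in the test set

# Hypothetical outcome: the model detects 40 of the Hernia test images.
detected = 40
low, high = wilson_interval(detected, hernia_test)
print(f"Hernia test images: {hernia_test}")
print(f"Observed sensitivity: {detected / hernia_test:.3f}")
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

With only a few dozen positive cases, the interval spans a wide range of sensitivities, which is the practical meaning of limited statistical power for a rare class.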

Strategic Outcome:

The evaluation provides a framework for judgment under uncertainty, allowing R&D teams to account for data limitations before locking in technical and validation strategies.

Figure: Distribution of the number of images per patient in the NIH ChestX-ray14 dataset. Most patients have very few images, while a small subset has many repeat scans.
Figure: Frequency of clinical labels in the NIH ChestX-ray14 dataset, showing the prevalence of the most common pathologies (Infiltration, Effusion).

Project: Evaluation Design of an AI System Trained on the NIH ChestX-ray14 Dataset


From data findings to evaluation strategy
The Data Due Diligence study of the NIH ChestX-ray14 dataset directly shapes the evaluation framework. What the data reveals determines how performance must be measured.


What the label distribution tells us

  • The “No Finding” label represents ~54% of the dataset, making global accuracy an unreliable metric: it would reflect the dominant class while masking failures on rare pathologies.
  • Nine of the 14 pathology classes each represent less than 5% of images. Evaluation must therefore be conducted per class, not globally.
Figure: Clinical label distribution in the NIH ChestX-ray14 dataset, showing extreme pathology sparsity; Hernia represents only 0.2% of the 112,120 images.
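The effect of class imbalance on accuracy can be illustrated with a trivial predictor that never flags anything. The sketch below uses a synthetic set of 1,000 images built from the proportions quoted above (~54% "No Finding", 0.2% Hernia); the data and helper names are illustrative assumptions, not part of the dataset tooling.

```python
def binary_accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_sensitivity(y_true, y_pred):
    """Fraction of true positives that were detected (NaN if there are none)."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_on_positives:
        return float("nan")
    return sum(preds_on_positives) / len(preds_on_positives)

N = 1000
always_negative = [0] * N  # a "model" that never reports a finding

# Synthetic ground truth matching the quoted proportions.
abnormal_true = [1] * 460 + [0] * 540  # ~54% "No Finding"
hernia_true = [1] * 2 + [0] * 998      # 0.2% Hernia

for name, truth in [("Any finding", abnormal_true), ("Hernia", hernia_true)]:
    acc = binary_accuracy(truth, always_negative)
    sens = binary_sensitivity(truth, always_negative)
    print(f"{name}: accuracy={acc:.3f}, sensitivity={sens:.3f}")
```

The do-nothing predictor scores high accuracy on the Hernia task while detecting no cases at all, which is why accuracy alone says little about rare classes.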

What the patient distribution tells us

  • 56% of patients have a single image.
  • A small subset contributes up to 184 images.

This asymmetry makes patient-level data splitting mandatory to prevent data leakage between train, validation, and test sets.
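A minimal sketch of a patient-level split follows, using only the standard library. The record layout (image ID, patient ID), the 70/10/20 fractions, and the function name are assumptions made for illustration; libraries such as scikit-learn offer group-aware splitters (for example `GroupShuffleSplit`) that serve the same purpose.

```python
import random
from collections import defaultdict

def patient_level_split(records, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split (image_id, patient_id) records so that each patient lands in one split only.

    Fractions apply to the number of patients, not images; the test split
    receives whatever remains after train and val are filled.
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)

    patients = sorted(by_patient)             # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {name: [img for p in pts for img in by_patient[p]]
            for name, pts in groups.items()}

# Toy data: two patients with repeat scans, eight with a single image.
counts = {"p00": 6, "p01": 3, **{f"p{i:02d}": 1 for i in range(2, 10)}}
records = [(f"{p}_img{k}", p) for p, c in counts.items() for k in range(c)]

splits = patient_level_split(records)
for name, images in splits.items():
    print(name, len(images), "images")
```

Because the split is made over patients, the share of images in each split can deviate from the nominal fractions when a heavy-tailed patient lands in a small split; that deviation is worth checking after splitting. Per-label stratification is not handled in this sketch.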

Figure: Distribution of images per patient in the NIH ChestX-ray14 dataset, showing a heavy tail in which a small subset of patients contributes up to 184 images, the source of the patient-level leakage risk.

Proposed evaluation strategy

  • Data splits:
    • Patient-level split to avoid leakage
    • Stratification per label to preserve class distribution across splits
  • Metrics (recorded per class):
    Sensitivity, Specificity, AUC, Precision (PPV), NPV, F1-score, Mean Average Precision (mAP)
  • Subgroup evaluation:
    A dedicated stress-test subset covering the 9 extreme minority classes (Pneumothorax, Consolidation, Pleural Thickening, Cardiomegaly, Emphysema, Edema, Fibrosis, Pneumonia, Hernia), with the option to define further subgroups depending on clinical objectives.
  • Generalization & robustness: Two complementary approaches assess whether the model generalizes beyond its training distribution:
    • Testing on chest X-ray images outside the 15 listed classes (the 14 pathologies plus “No Finding”). This requires using other datasets and introduces both OOD and domain shift effects simultaneously, which cannot be fully separated.
    • Testing on the same 15 classes from other datasets with different manufacturers and acquisition settings. This isolates the domain shift effect while keeping the label space consistent.
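As an illustration of per-class reporting, the sketch below computes the threshold-based metrics listed above (sensitivity, specificity, PPV, NPV, F1) for each class from binary labels and predictions. The toy labels and the function name are assumptions; AUC and average precision need continuous scores and are not covered here.

```python
def per_class_metrics(y_true, y_pred):
    """Confusion-matrix metrics for one class, given 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # An undefined ratio (empty denominator) is reported as NaN, not 0.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Toy multi-label example on 8 images: one list of 0/1 values per class.
truth = {
    "Effusion": [1, 1, 1, 0, 0, 0, 0, 0],
    "Hernia":   [0, 0, 0, 0, 0, 0, 0, 1],
}
preds = {
    "Effusion": [1, 1, 0, 1, 0, 0, 0, 0],
    "Hernia":   [0, 0, 0, 0, 0, 0, 0, 0],
}

report = {cls: per_class_metrics(truth[cls], preds[cls]) for cls in truth}
for cls, metrics in report.items():
    print(cls, {k: round(v, 3) for k, v in metrics.items()})
```

Reporting an undefined PPV as NaN rather than zero keeps "the model never predicted this class" distinguishable from "the model predicted it and was always wrong".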

Without data due diligence, there is no reliable evaluation design. The data determines the strategy before any model development begins.

From data evidence to technical decisions

The findings from the Data Due Diligence and Evaluation Design studies raise a practical question that any R&D team must address before model development: what do we do with the minority classes?


Observations

  • 9 out of 14 pathology classes represent less than 5% of the dataset each.
  • At this level of representation, training a classification model on these classes is unlikely to yield reliable performance and may introduce false confidence if global metrics are used without scrutiny.

A considered option: OOD-by-design

Rather than forcing the model to learn from insufficient data, one strategic option is to:

  • deliberately exclude these extreme minority classes from training
  • reserve them exclusively for out-of-distribution (OOD) testing.

This transforms a dataset limitation into a reliability instrument:

  • The model is trained on classes with sufficient representation
  • Minority class images become a dedicated OOD test set
  • The model’s behavior on these unseen classes provides measurable evidence of its generalization limits.

This approach yields a system with known and documented boundaries, which supports a strong clinical and regulatory position.
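The partitioning step behind the OOD-by-design option could look like the sketch below. It assumes each record carries a set of label strings spelled as in this page, and it makes one design choice that the text does not specify: any image carrying at least one held-out label is routed to the OOD test set, even if it also carries a retained label, so that no held-out pathology is seen during training.

```python
# The nine minority classes named in the evaluation strategy above.
HELD_OUT = frozenset({
    "Pneumothorax", "Consolidation", "Pleural Thickening", "Cardiomegaly",
    "Emphysema", "Edema", "Fibrosis", "Pneumonia", "Hernia",
})

def partition_ood(records, held_out=HELD_OUT):
    """Split (image_id, labels) records into in-distribution and OOD image IDs.

    Assumption: an image with any held-out label goes to the OOD set.
    """
    in_distribution, ood = [], []
    for image_id, labels in records:
        if set(labels) & held_out:
            ood.append(image_id)
        else:
            in_distribution.append(image_id)
    return in_distribution, ood

# Hypothetical records for illustration only.
records = [
    ("img_001", {"No Finding"}),
    ("img_002", {"Infiltration", "Effusion"}),
    ("img_003", {"Hernia"}),
    ("img_004", {"Effusion", "Edema"}),   # mixed: retained + held-out label
    ("img_005", {"Atelectasis"}),
]

train_pool, ood_test = partition_ood(records)
print("in-distribution:", train_pool)
print("OOD test set:", ood_test)
```

The in-distribution pool would still need a patient-level split before training, and the label spellings would need to match those used in the dataset's own metadata.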


Different options. Your call.

The data due diligence evaluation clarifies that ignoring minority class representation is not a neutral choice.

The right strategy depends on clinical objectives, regulatory pathway, and acceptable risk thresholds. VeraDP provides clarity and options. Decisions belong to the development team.