Case Studies


Project: NIH ChestX-ray14 Data Due Diligence Evaluation

Dataset Scale: 112,120 images / 30,805 patients

Objective: Apply the VeraDP Data Due Diligence Protocol to identify the dataset’s strengths, limitations, and risks.

Identified Risks:
(Selected examples; the exhaustive list remains proprietary and confidential)

Traceability & Provenance: Core documentation refers to an outdated version (CXR8), presenting traceability gaps for the extended 14-disease version.
Annotation Reliability: Labels are derived by NLP-based text mining of radiological reports, which introduces structural uncertainty (estimated ~10% label error rate).
Patient-Level Leakage: Images are unevenly distributed across patients: a small subset contributes disproportionately to the dataset (up to 184 images per patient). This creates a significant risk of data leakage if splits are not handled at the patient level.
Pathology Sparsity: Extreme label disparities (e.g., Hernia at 0.2%) present a barrier to achieving statistical power for specific clinical claims.
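
To make the statistical-power concern concrete, the sketch below estimates how many Hernia images a held-out test split might contain and how wide a 95% Wilson score interval on sensitivity would be at that sample size. The 20% test share and the 80% observed sensitivity are illustrative assumptions, not figures from the evaluation.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

TOTAL_IMAGES = 112_120       # dataset scale quoted above
HERNIA_PREVALENCE = 0.002    # ~0.2%, quoted above
TEST_FRACTION = 0.20         # assumption: illustrative hold-out share

hernia_images = round(TOTAL_IMAGES * HERNIA_PREVALENCE)
test_positives = round(hernia_images * TEST_FRACTION)
detected = round(0.8 * test_positives)   # assumption: 80% observed sensitivity

low, high = wilson_interval(detected, test_positives)
print(f"{test_positives} test positives, sensitivity 95% CI: [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval spans more than 20 percentage points, which illustrates why a sensitivity claim on such a class is hard to defend from this dataset alone.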

Strategic Outcome:

The evaluation provides a framework for judgment under uncertainty, allowing R&D teams to account for data limitations before locking in technical and validation strategies.

Bar chart showing the distribution of the number of images per patient in the NIH ChestX-ray14 dataset, highlighting that the majority of patients have very few images while a small subset has a high frequency of repeat scans.

Bar chart showing the frequency of clinical labels in the NIH ChestX-ray14 dataset, illustrating the prevalence of specific pathologies (Infiltration, Effusion).

Project: Evaluation Design of an AI System Trained on the NIH ChestX-ray14 Dataset

From data findings to evaluation strategy
The Data Due Diligence study of the NIH ChestX-ray14 dataset directly shapes the evaluation framework. What the data reveals determines how performance must be measured.

What the label distribution tells us

The “No Finding” label represents ~54% of the dataset, making global accuracy an unreliable metric: it would reflect the dominant class while masking failures on rare pathologies.
9 out of 14 pathology classes represent less than 5% of images each. Evaluation must therefore be conducted per class, not globally.
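
As a toy illustration of why global accuracy misleads here, the sketch below scores a degenerate model that never flags any finding. The 100-image sample and the collapse of all pathologies into a single "Finding" label are simplifying assumptions; only the ~54% share comes from the text above.

```python
# Toy labels: 54 "No Finding" images and 46 images with some finding,
# mirroring the ~54% share quoted above (sample size is an assumption).
y_true = ["No Finding"] * 54 + ["Finding"] * 46

# Degenerate model: predicts "No Finding" for every image.
y_pred = ["No Finding"] * len(y_true)

# Global accuracy only counts agreement, whatever the class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Sensitivity looks only at images that actually carry a finding.
positives = [i for i, t in enumerate(y_true) if t == "Finding"]
sensitivity = sum(y_pred[i] == "Finding" for i in positives) / len(positives)

print(f"accuracy={accuracy:.2f}, sensitivity on findings={sensitivity:.2f}")
```

The model scores at the level of the majority class on accuracy while detecting no pathology at all, which is the failure mode that per-class reporting exposes.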

Bar chart showing the clinical label distribution in the NIH ChestX-ray14 dataset, demonstrating extreme pathology sparsity, with specific conditions like Hernia representing only 0.2% of the total 112,120 images.

What the patient distribution tells us

56% of patients have a single image.
A small subset contributes up to 184 images.

This asymmetry makes patient-level data splitting mandatory to prevent data leakage between train, validation, and test sets.
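
A minimal sketch of a patient-level split follows. It assigns each patient to one split by hashing the patient ID, so every image of a given patient lands in the same split. The record layout, the split shares, and the hashing approach are assumptions chosen for illustration; label stratification is not handled here and would need a dedicated grouped, stratified splitter.

```python
import hashlib

def assign_split(patient_id: str, val_share: float = 0.1, test_share: float = 0.2) -> str:
    """Deterministically map a patient to one split via a hash of the ID."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8   # value in [0, 1)
    if bucket < test_share:
        return "test"
    if bucket < test_share + val_share:
        return "val"
    return "train"

# Hypothetical (image file, patient ID) records.
records = [
    ("img_001.png", "P1"), ("img_002.png", "P1"), ("img_003.png", "P2"),
    ("img_004.png", "P3"), ("img_005.png", "P3"), ("img_006.png", "P3"),
]

splits = {"train": [], "val": [], "test": []}
for image, patient in records:
    splits[assign_split(patient)].append((image, patient))

# Leakage check: each patient must appear in exactly one split.
patient_splits = {}
for name, items in splits.items():
    for _, patient in items:
        patient_splits.setdefault(patient, set()).add(name)
leaking = [p for p, names in patient_splits.items() if len(names) > 1]
print("patients present in several splits:", leaking)
```

Because assignment depends only on the patient ID, images added later for an existing patient fall into the same split without re-running the whole procedure.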

Bar chart illustrating the distribution of images per patient in the NIH ChestX-ray14 dataset, showing a heavy tail where a small subset of patients contributes up to 184 images, highlighting the risk of patient-level data leakage.

Proposed evaluation strategy

Data splits:
- Patient-level split to avoid leakage
- Stratification per label to preserve class distribution across splits
Metrics (recorded per class):
Sensitivity, Specificity, AUC, Precision (PPV), NPV, F1-score, Average Precision (AP), with Mean Average Precision (mAP) as the cross-class summary
Subgroup evaluation:
A dedicated stress-test subset on the 9 extreme minority classes (Pneumothorax, Consolidation, Pleural Thickening, Cardiomegaly, Emphysema, Edema, Fibrosis, Pneumonia, Hernia), with the option to define further subgroups depending on clinical objectives.
Generalization & robustness: Two complementary approaches assess whether the model generalizes beyond its training distribution:
- Testing on chest X-ray images outside the 15 listed classes. This requires other datasets and therefore introduces OOD and domain shift effects simultaneously; the two cannot be fully dissociated.
- Testing on the same 15 classes from other datasets with different manufacturers and acquisition settings. This isolates the domain shift effect while keeping the label space consistent.
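
The threshold-based metrics in the strategy above can be derived per class from confusion counts, as in the sketch below. The two-class toy data and the function name are hypothetical. AUC and average precision need prediction scores rather than hard labels and are left out.

```python
import math

def per_class_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Confusion-matrix metrics for one binary label (1 = finding present)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num: int, den: int) -> float:
        # A metric is undefined (NaN) when its denominator is empty.
        return num / den if den else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical multi-label ground truth and predictions, one list per class.
truth = {"Effusion": [1, 1, 0, 0, 1, 0], "Hernia": [0, 0, 0, 0, 0, 1]}
preds = {"Effusion": [1, 0, 0, 1, 1, 0], "Hernia": [0, 0, 0, 0, 0, 0]}

report = {name: per_class_metrics(truth[name], preds[name]) for name in truth}
for name, metrics in report.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

Returning NaN instead of zero for undefined ratios keeps a class that was never predicted from being mistaken for one that was predicted badly.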

Without data due diligence, there is no reliable evaluation design. The data determines the strategy before any model development begins.

From data evidence to technical decisions

The findings from the Data Due Diligence and Evaluation Design studies raise a practical question that any R&D team must address before model development: what do we do with the minority classes?

Observations

9 out of 14 pathology classes represent less than 5% of the dataset each.
At this level of representation, training a classification model on these classes is unlikely to yield reliable performance and may introduce false confidence if global metrics are used without scrutiny.

A considered option: OOD-by-design

Rather than forcing the model to learn from insufficient data, one strategic option is to:
- deliberately exclude these extreme minority classes from training
- reserve them exclusively for out-of-distribution (OOD) testing.

This transforms a dataset limitation into a reliability instrument:

The model is trained on classes with sufficient representation
Minority class images become a dedicated OOD test set
The model’s behavior on these unseen classes provides measurable evidence of its generalization limits.
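
One way this partition could be sketched is shown below. Only the "No Finding" and Hernia shares come from the text above; the other shares are illustrative placeholders, and the rule that sends any image carrying a held-out label to the OOD pool is an assumption about how multi-label images might be handled.

```python
MIN_SHARE = 0.05  # threshold from the "less than 5%" observation above

def partition_classes(label_share: dict[str, float],
                      min_share: float = MIN_SHARE) -> tuple[list[str], list[str]]:
    """Split class names into training classes and held-out OOD classes."""
    train = sorted(c for c, s in label_share.items() if s >= min_share)
    held_out = sorted(c for c, s in label_share.items() if s < min_share)
    return train, held_out

def route_image(labels: set[str], held_out: set[str]) -> str:
    """Assumption: an image with any held-out label goes to the OOD test pool."""
    return "ood_test" if labels & held_out else "in_distribution"

# Illustrative shares; only "No Finding" and "Hernia" are quoted in the text.
label_share = {"No Finding": 0.54, "Infiltration": 0.18, "Effusion": 0.12, "Hernia": 0.002}

train_classes, ood_classes = partition_classes(label_share)
print("train on:", train_classes)
print("hold out for OOD testing:", ood_classes)
print(route_image({"Effusion", "Hernia"}, set(ood_classes)))
```

The routing rule matters in a multi-label dataset: keeping an image with a held-out label in training would expose the model to the very classes meant to remain unseen.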

This approach produces a reliable system with known and documented boundaries, which supports a strong clinical and regulatory position.

This data due diligence makes one thing clear: ignoring minority class representation is not a neutral choice. The right strategy depends on clinical objectives, regulatory pathway, and acceptable risk thresholds.

VeraDP provides the clarity and the options. Technical decisions can now be made on solid ground.

Context

CT reconstruction algorithms can substantially alter image appearance, noise characteristics, texture, and sharpness. As a result, reconstruction settings may influence the behavior of AI systems and should not necessarily be treated as simple protocol descriptors. They may instead represent acquisition-state variables against which a model should be validated and monitored.

GammaMetric [1] investigated whether reconstruction characteristics could be identified directly from image pixels independently of DICOM metadata availability or quality. Such an approach could support local acceptance testing and drift monitoring by verifying whether incoming images remain within validated acquisition conditions.

Before model development, however, a key question needed to be addressed:
Can reconstruction metadata be considered reliable ground truth for a reconstruction classification study?

VeraDP’s Contribution

VeraDP conducted a metadata traceability assessment on the QIBA [2] dataset from TCIA [3].

The objective was to evaluate whether reconstruction-related metadata could be considered reliable ground truth for model development and validation.

The goal was to identify which reconstruction descriptors were directly observable, which were inferred, and which required further validation before being used as reference information.

Dataset Scope

The QIBA dataset includes 3 sets of CT scan images acquired from anthropomorphic phantoms with replaceable liver inserts. Acquisition and reconstruction parameters were deliberately varied, including tube current, slice thickness, reconstruction algorithm, convolution kernel, and pitch.
Because these factors were controlled during acquisition, the dataset provided a suitable environment for investigating reconstruction traceability and metadata reliability.

The assessment combined online dataset documentation, metadata spreadsheets, and DICOM headers.

Key Findings

1. Documentation mismatch in dataset size

The dataset documentation indicates 642 scans. The accompanying metadata spreadsheet contains 684 unique series, while 627 unique DICOM series were identified in the downloaded data.

Conclusions:

The DICOM series constitute the most directly verifiable source and can therefore be used as the reference count.
Cross-referencing the metadata spreadsheet against the available DICOM series is recommended to ensure that only metadata corresponding to available image series are used.
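
The cross-referencing step can be sketched with plain set operations. The sketch assumes that both sources expose a shared series identifier such as SeriesInstanceUID; the identifiers below are made up.

```python
def cross_reference(spreadsheet_uids: list[str], dicom_uids: list[str]) -> dict[str, set[str]]:
    """Compare series identifiers listed in the spreadsheet with those found on disk."""
    sheet, disk = set(spreadsheet_uids), set(dicom_uids)
    return {
        "usable": sheet & disk,          # metadata with a matching image series
        "metadata_only": sheet - disk,   # spreadsheet rows without images
        "images_only": disk - sheet,     # image series without spreadsheet rows
    }

# Hypothetical identifiers standing in for real series UIDs.
spreadsheet = ["uid-1", "uid-2", "uid-3", "uid-4"]
downloaded = ["uid-2", "uid-3", "uid-5"]

result = cross_reference(spreadsheet, downloaded)
for group, uids in result.items():
    print(group, sorted(uids))
```

Only the "usable" group should feed label generation; the other two groups are worth logging so the mismatch stays documented.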

2. Images with low slice count

Three CT series contained only 21 slices.

Conclusion: These series should be treated with caution (identified and possibly excluded), as their unusually low slice count may provide an alternative explanation for unsatisfactory results independently of reconstruction characteristics.
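
A simple way to surface such series is to count slices per series and flag those far below the typical count. The sketch below is one possible rule; the half-of-median threshold and the toy series are assumptions, not the criterion used in the study.

```python
from collections import Counter
from statistics import median

def low_slice_series(slice_series_ids: list[str], ratio: float = 0.5) -> dict[str, int]:
    """Flag series whose slice count falls below `ratio` times the median count."""
    counts = Counter(slice_series_ids)       # one entry per slice, keyed by series
    typical = median(counts.values())
    return {sid: n for sid, n in counts.items() if n < ratio * typical}

# Hypothetical input: the series identifier of every slice file in the download.
slices = ["S1"] * 100 + ["S2"] * 120 + ["S3"] * 21

flagged = low_slice_series(slices)
print(flagged)
```

Flagged series can then be reviewed manually before deciding whether to exclude them.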

Figure 1. Distribution of slice counts across CT series. Three series were identified as low-slice-count outliers with only 21 slices each.

3. Incomplete reconstruction descriptors

The ConvolutionKernel DICOM tag distribution revealed three GE series for which no reconstruction kernel could be identified from the available metadata.

Conclusion: Although limited in number, these series illustrate that reconstruction metadata may be incomplete even within a controlled dataset and should not be assumed to be universally available.
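
A check of this kind can be sketched as a tally of series with an empty kernel descriptor, grouped by manufacturer. The records below are hypothetical dictionaries keyed by the standard Manufacturer and ConvolutionKernel attribute names; the kernel values are placeholders.

```python
from collections import Counter

def missing_kernel_by_manufacturer(series: list[dict]) -> Counter:
    """Count series with an absent or blank ConvolutionKernel, per manufacturer."""
    missing = Counter()
    for record in series:
        kernel = (record.get("ConvolutionKernel") or "").strip()
        if not kernel:
            missing[record.get("Manufacturer", "UNKNOWN")] += 1
    return missing

# Hypothetical per-series metadata extracted from DICOM headers.
series_metadata = [
    {"Manufacturer": "GE MEDICAL SYSTEMS", "ConvolutionKernel": "STANDARD"},
    {"Manufacturer": "GE MEDICAL SYSTEMS", "ConvolutionKernel": ""},
    {"Manufacturer": "GE MEDICAL SYSTEMS"},
    {"Manufacturer": "SIEMENS", "ConvolutionKernel": "B30f"},
]

missing = missing_kernel_by_manufacturer(series_metadata)
print(dict(missing))
```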

Figure 2. ConvolutionKernel DICOM tag distribution. Most series carry an identifiable reconstruction kernel; three GE series have no associated kernel descriptor.

4. Discrepancies between string-extracted and standard DICOM tags

Additional acquisition parameters were extracted from the SeriesDescription field using regular expressions. These values were then compared against the corresponding standard DICOM tags.

Discrepancies were identified for both slice thickness and tube current.

Conclusion: Metadata derived from SeriesDescription should not be considered reliable ground-truth information without additional validation against standard DICOM metadata.
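
The comparison can be sketched as follows. The regular expressions and the SeriesDescription layout (tokens such as "2.5mm" and "200mA") are assumptions made for illustration; the actual description format in the dataset may differ and the patterns would need to be adapted.

```python
import re

# Hypothetical patterns: a number followed by "mm" or "mA" in the description.
PATTERNS = {
    "SliceThickness": re.compile(r"(\d+(?:\.\d+)?)\s*mm", re.IGNORECASE),
    "XRayTubeCurrent": re.compile(r"(\d+(?:\.\d+)?)\s*mA(?![A-Za-z])"),
}

def find_mismatches(record: dict, tol: float = 1e-3) -> dict[str, tuple[float, float]]:
    """Return {tag: (value_from_text, value_from_tag)} where the two sources disagree."""
    mismatches = {}
    description = record.get("SeriesDescription", "")
    for tag, pattern in PATTERNS.items():
        match = pattern.search(description)
        if match is None or record.get(tag) in (None, ""):
            continue  # nothing to compare; missing values are handled elsewhere
        from_text, from_tag = float(match.group(1)), float(record[tag])
        if abs(from_text - from_tag) > tol:
            mismatches[tag] = (from_text, from_tag)
    return mismatches

# Hypothetical series record: text says 200 mA, the standard tag says 180.
record = {
    "SeriesDescription": "LIVER 2.5mm 200mA STD",
    "SliceThickness": "2.5",
    "XRayTubeCurrent": "180",
}

mismatches = find_mismatches(record)
print(mismatches)
```

Series that produce a non-empty result are candidates for manual review rather than automatic correction, since neither source is guaranteed to be the correct one.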

Figure 3. Tube current mismatch between values extracted from the SeriesDescription field and the standard XRayTubeCurrent DICOM tag. Several series show inconsistent values across the two sources.

Figure 4. Slice thickness mismatch between values extracted from the SeriesDescription field and the standard SliceThickness DICOM tag. Several series show inconsistent values across the two sources.

5. Manufacturer-dependent reconstruction metadata

Reconstruction-related metadata availability differed across manufacturers. For example, GE reconstruction information may be available through private tags, while no equivalent source was identified for the Siemens reconstruction families present in the dataset. Furthermore, private reconstruction tags were not consistently available across all GE series.

Conclusion: Reconstruction metadata availability is manufacturer-dependent, limiting the possibility of relying on a single reconstruction reference strategy.

Impact

Observation	Implication
Documentation mismatch in dataset size	Ensure labels are generated only for image series actually available in the dataset.
Images with low slice count	Investigate potential confounding factors before attributing performance differences to reconstruction characteristics.
Incomplete reconstruction descriptors	Expect missing labels and define a strategy for handling unlabeled series. Given that only three affected series were identified, exclusion may be a practical and defensible option.
Discrepancies between string-extracted and standard DICOM tags	Avoid treating text-derived parameters as reference labels without verification.
Manufacturer-dependent reconstruction metadata	Reconstruction labeling strategy needs to differ between manufacturers. Manufacturer-specific label generation rules may be required.

By clarifying metadata traceability before experimentation, the study provided a more defensible foundation for subsequent reconstruction classification work.

Note: Dataset download and DICOM metadata extraction were performed by GammaMetric. VeraDP conducted the metadata traceability assessment and evaluated the suitability of reconstruction metadata as ground truth for reconstruction classification.

References

[1] https://gammametric.com/
[2] https://www.cancerimagingarchive.net/collection/qiba-ct-liver-phantom/
[3] https://www.cancerimagingarchive.net/