AI driven breast cancer subtyping and risk stratification from slide-inferred pathologist-interpretable nuclear features

ruppinlab/EXPAND

EXPAND (EXplainable Pathologist Aligned Nuclear Discriminator)

Note: The manuscript has been submitted and is under review.

Citation (preprint available):
R. K. Barman, S. R. Dhruba, D. T. Hoang, E. D. Shulman, E. M. Campagnolo, A. T. Wang, S. A. Harmon, T. C. Hu, A. Papanicolau-Sengos, M. P. Nasrallah, K. D. Aldape, E. Ruppin.
"Pathologist-interpretable breast cancer subtyping and stratification from AI-inferred nuclear features", 2025.


Overview

EXPAND is an open-source, interpretable AI pipeline designed to predict breast cancer (BC) subtypes and patient survival risk directly from H&E-stained whole-slide images (WSIs).

While many deep learning models achieve strong accuracy, they often lack interpretability and do not reflect how pathologists evaluate morphology. EXPAND bridges this gap by focusing on a compact set of biologically meaningful nuclear features, making the pipeline intuitive, reproducible, and clinically relevant.

Figure: The full EXPAND pipeline.


Why EXPAND?

  • Transparent: Uses 12 Nuclear Pathologist-Interpretable Features (NPIFs) derived from nuclei segmented with open-source tools.
  • Robust: Using cross-validated logistic regression, achieves predictive performance comparable to or better than black-box DL models.
  • Generalizable: Validated on CPTAC-BRCA and POST-NAT-BRCA cohorts in addition to TCGA-BRCA.
  • Scalable: Requires only standard H&E slides and Hover-Net segmentation, making it deployable across cancer types and settings.
  • Prognostic: NPIFs independently predict survival outcomes (OS, PFS), enabling clinically interpretable risk stratification.

Key Features

  1. 12 NPIFs – compact, biologically interpretable nuclear features (area, perimeter, eccentricity, etc.) aligned with pathologist workflows.
  2. Subtype prediction – HER2+, HR+, and TNBC classifiers trained with logistic regression.
  3. External validation – tested on CPTAC-BRCA and POST-NAT-BRCA datasets.
  4. Survival modeling – multivariate Cox regression models per subtype with Kaplan–Meier analysis for OS and PFS.
  5. Workflow example – an end-to-end walkthrough from WSI tiling to NPIF computation.

Availability

  • All source code is included in this repository.
  • A short user guide is provided below for quick setup. A full, step-by-step pipeline walkthrough is available in the detailed User Guide (PDF).
  • The ML predictors were developed on macOS (Python) and tested on Linux (HPC environment). Scripts can be run interactively in a Python IDE or from the command line:
    python script_name.py
  • Please make sure to update the working directory and adjust all file/folder paths in each script to match your environment before running.

Dependencies

Developed with Python ≥ 3.10. Core dependencies:

numpy >= 1.24.4
pandas >= 2.0.3
matplotlib >= 3.7.2
seaborn >= 0.13.2
scikit-learn >= 1.3.0
joblib >= 1.3.0 
opencv-python >= 4.10.0
torch >= 1.12.1
torchvision >= 0.13
Pillow >= 9.2.0
openslide-python >= 1.3.1
tqdm >= 4.65.0
pickle (part of the Python standard library; no separate installation needed)
lifelines >= 0.28.0

To install requirements:

pip install -r requirements.txt
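
After installing, it can be useful to confirm which versions actually ended up in the environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of this repository, and the package list is just a subset of the dependencies listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    # Subset of the core dependencies listed above (pip distribution names).
    core = ["numpy", "pandas", "scikit-learn", "opencv-python", "torch",
            "openslide-python", "lifelines"]
    for name, found in installed_versions(core).items():
        print(f"{name}: {found if found is not None else 'NOT INSTALLED'}")
```

Compare the printed versions against the minimum versions listed above by eye; the sketch deliberately does not parse version strings.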

EXPAND Pipeline Overview

This repository contains the complete EXPAND pipeline for tile generation, nuclear segmentation, NPIF computation, subtype prediction, external validation, and survival analysis. The steps are organized sequentially so users can reproduce the workflow end-to-end.


1. Tile Generation (TCGA-BRCA)

  • Folder:
    Slide_preprocessing_codes
  • Scripts:
    • 1_01_get_tiles_from_slide.py
    • 1_11_jobs_to_get_tiles.py
  • Task: Generate 512×512 tiles from H&E WSIs at 20× magnification.
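
The tiling step can be pictured as laying a regular grid over the slide. The following is a rough sketch of that grid arithmetic only, not the code in `1_01_get_tiles_from_slide.py`: the function names are hypothetical, non-overlapping tiles are an assumption, and reading pixels (for example with openslide-python) and filtering out background are left out.

```python
def tile_origins(slide_width, slide_height, tile_size=512, stride=None):
    """Top-left (x, y) corners of all full tiles that fit inside the slide.

    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    stride = tile_size if stride is None else stride
    origins = []
    for y in range(0, slide_height - tile_size + 1, stride):
        for x in range(0, slide_width - tile_size + 1, stride):
            origins.append((x, y))
    return origins


def downsample_for_target(native_magnification, target_magnification=20.0):
    """Factor by which full-resolution pixels shrink to reach the target magnification."""
    if target_magnification > native_magnification:
        raise ValueError("target magnification exceeds the scanned magnification")
    return native_magnification / target_magnification


if __name__ == "__main__":
    # A slide scanned at 40x must be downsampled 2x to obtain 20x tiles.
    print(downsample_for_target(40.0))
    print(len(tile_origins(2048, 1024)))
```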

2. Tile-level Nucleus Segmentation with Hover-Net (TCGA-BRCA)

  • Folder:
    NPIFs_generation_codes/TCGA_BRCA/Segmentation
  • Scripts:
    • 2_01_22_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
    • 2_01_100_01_JobSubmissionCode.py/.ipynb
  • Task: Run Hover-Net to segment and classify nuclei per tile.

3. TCGA-BRCA: Morphology Computation

  • Folder:
    NPIFs_generation_codes/TCGA_BRCA/Morphology_features_calculation
  • Scripts:
    • 2_02_03_MorphologyCalculation_All_Slides.py/.ipynb
    • 2_02_13_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
  • Task: Compute per-nucleus morphology (area, perimeter, axis length, eccentricity, circularity).
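
To make the listed morphology measures concrete, here is a minimal sketch of how they could be computed from a nucleus contour given as polygon vertices. It uses textbook definitions (shoelace area, circularity as 4πA/P², ellipse eccentricity from axis lengths); the repository scripts may define or compute these quantities differently, and all function names here are illustrative.

```python
import math


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula; points are (x, y) in order."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of edge lengths, closing the contour back to the first vertex."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def circularity(area, perimeter):
    """4*pi*A / P^2: 1.0 for a perfect circle, lower for irregular outlines."""
    if perimeter <= 0:
        return 0.0
    return 4.0 * math.pi * area / perimeter ** 2


def eccentricity(major_axis, minor_axis):
    """Eccentricity of the ellipse with these axis lengths (0 means circular)."""
    if major_axis <= 0:
        return 0.0
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)
```

For a 2×2 square contour, this gives an area of 4, a perimeter of 8, and a circularity of π/4.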

4. NPIF Calculation (TCGA-BRCA)

  • Folder:
    NPIFs_generation_codes/TCGA_BRCA/NPIFs_Generation
  • Scripts:
    • 2_03_01_01_NPIFs_Calculation_HoverNet_V0.py/.ipynb (all tiles)
    • 2_03_01_01_NPIFs_Calculation_HoverNet_V1.py/.ipynb (top 25% cancer-enriched tiles)
  • Task: Compute 12 NPIFs per slide from Hover-Net outputs.
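
The sketch below illustrates the general shape of this step: keep the most cancer-enriched quarter of tiles, then pool per-nucleus measurements into slide-level summaries. It is an assumption-laden illustration rather than the repository's method. The exact definitions of the 12 NPIFs are not given here, so mean and standard deviation are placeholders, and the tile score (for example, the fraction of nuclei Hover-Net labels as neoplastic) and all names are hypothetical.

```python
import math
from statistics import mean, pstdev


def top_fraction_tiles(tile_scores, fraction=0.25):
    """IDs of the highest-scoring tiles; always keeps at least one tile."""
    if not tile_scores:
        return []
    keep = max(1, math.ceil(len(tile_scores) * fraction))
    ranked = sorted(tile_scores, key=tile_scores.get, reverse=True)
    return ranked[:keep]


def summarize_slide(nuclei_by_tile, selected_tiles, feature_names):
    """Mean and population standard deviation of each feature over selected tiles."""
    summary = {}
    for name in feature_names:
        values = [nucleus[name]
                  for tile in selected_tiles
                  for nucleus in nuclei_by_tile.get(tile, [])]
        if not values:
            continue  # no nuclei in the selected tiles for this feature
        summary[f"mean_{name}"] = mean(values)
        summary[f"std_{name}"] = pstdev(values)
    return summary


if __name__ == "__main__":
    scores = {"t1": 0.9, "t2": 0.1, "t3": 0.5, "t4": 0.3}  # toy cancer fractions
    nuclei = {"t1": [{"area": 10.0}, {"area": 14.0}], "t2": [{"area": 100.0}]}
    chosen = top_fraction_tiles(scores)
    print(chosen, summarize_slide(nuclei, chosen, ["area"]))
```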

5. Mapping NPIFs to BRCA Biomarker Status

  • Folder:
    NPIFs_generation_codes/TCGA_BRCA/NPIFs_Generation
  • Scripts:
    • 3_01_01_02_Mapped_Original_Value_Hovernet_NPIFs_to_BRCA_Subtypes.py/.ipynb (all tiles)
    • 3_01_01_06_...Top25Q.py/.ipynb (top 25% tiles)
  • Task: Merge NPIFs with HER2, ER, PR metadata.
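
As an illustration of the mapping step, the sketch below joins slide-level features to receptor status by patient ID and derives binary labels. The label rules are the common clinical convention and an assumption about this pipeline: HR+ when ER or PR is positive, TNBC when HER2, ER, and PR are all negative. The status strings, dictionary layout, and function names are hypothetical.

```python
def to_flag(status):
    """True/False for clearly positive/negative strings, None otherwise."""
    text = str(status).strip().lower()
    if text == "positive":
        return True
    if text == "negative":
        return False
    return None  # e.g. equivocal, indeterminate, or missing


def subtype_labels(her2_status, er_status, pr_status):
    """Binary labels (1/0) per subtype; None where the status is not determinable."""
    her2, er, pr = (to_flag(s) for s in (her2_status, er_status, pr_status))
    labels = {"HER2+": None, "HR+": None, "TNBC": None}
    if her2 is not None:
        labels["HER2+"] = int(her2)
    if er is True or pr is True:
        labels["HR+"] = 1
    elif er is False and pr is False:
        labels["HR+"] = 0
    # TNBC is assigned only when all three receptor statuses are known.
    if None not in (her2, er, pr):
        labels["TNBC"] = int(not (her2 or er or pr))
    return labels


def merge_features_with_status(npifs_by_patient, status_by_patient):
    """Inner join on patient ID: features plus derived subtype labels."""
    merged = {}
    for patient_id, features in npifs_by_patient.items():
        status = status_by_patient.get(patient_id)
        if status is None:
            continue  # drop patients without biomarker metadata
        merged[patient_id] = {
            **features,
            **subtype_labels(status["HER2"], status["ER"], status["PR"]),
        }
    return merged
```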

6. BRCA Subtype Prediction Using NPIFs

  • Folder:
    Subtypes_prediction_codes/TCGA_BRCA
  • Scripts:
    • 4_01_04_103_04_101_...All_Tiles_Using_Lasso.py/.ipynb
    • 4_01_04_103_04_103_...Top25Q.py/.ipynb
  • Task: Train logistic regression classifiers (L1 penalty) for HER2+, HR+, TNBC.
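
For readers unfamiliar with L1-penalized logistic regression, the following is a minimal scikit-learn sketch on synthetic data standing in for a 12-feature NPIF matrix. It is not the repository's training script: the data, the regularization strength `C=0.5`, the 5-fold setup, and the feature scaling are all assumptions chosen for illustration. In practice one such binary classifier would be fit per subtype.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a slides x 12-feature matrix (not real data).
rng = np.random.default_rng(0)
n_slides, n_features = 200, 12
X = rng.normal(size=(n_slides, n_features))
# The toy label depends on the first two features only.
signal = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (signal + rng.normal(scale=0.5, size=n_slides) > 0).astype(int)

# The L1 penalty drives uninformative coefficients toward exactly zero.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)

# Stratified cross-validation keeps the class balance similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

model.fit(X, y)
coefficients = model.named_steps["logisticregression"].coef_.ravel()

print("mean cross-validated AUC:", auc_scores.mean())
print("non-zero coefficients:", int(np.count_nonzero(coefficients)))
```

Because the fitted model is linear in a handful of named features, the sign and size of each coefficient can be read directly, which is where the interpretability comes from.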

7. CPTAC-BRCA Pipeline

  • Folder: NPIFs_generation_codes/CPTAC_BRCA/Segmentation
  • Segmentation:
    • 2_01_22_02_Test_CPTAC_Dataset_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
    • 2_01_100_02_01_JobSubmissionCode.py/.ipynb
  • Folder: NPIFs_generation_codes/CPTAC_BRCA/Morphology_features_calculation
  • Morphology:
    • 2_02_03_02_CPTAC_MorphologyCalculation_All_Slides.py/.ipynb
    • 2_02_13_02_CPTAC_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
  • Folder: NPIFs_generation_codes/CPTAC_BRCA/NPIFs_Generation
  • NPIFs: 2_03_02_05_CPTAC_BRCA_NPIFs_Calculation_HoverNetPrediction_Filtered_Tiles_Top25Q.py/.ipynb
  • Folder: NPIFs_generation_codes/CPTAC_BRCA/NPIFs_Generation
  • Mapping: 3_01_01_07_CPTAC_Mapped_Original_Value...Top25Q.py/.ipynb
  • Folder: Subtypes_prediction_codes/CPTAC_BRCA
  • External Prediction: 6_01_04_103_04_103_CPTAC_Prediction_Using_...Top25Q.py/.ipynb

8. POST-NAT-BRCA Pipeline (equivalent steps)

  • Folder: NPIFs_generation_codes/POST_NAT_BRCA/Segmentation
  • Segmentation:
    • 2_01_22_02_Test_POST_NAT_Dataset_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
    • 2_01_100_02_POST_NAT_JobSubmissionCode.py/.ipynb
  • Folder: NPIFs_generation_codes/POST_NAT_BRCA/Morphology_features_calculation
  • Morphology:
    • 2_02_03_02_POST_NAT_MorphologyCalculation_All_Slides.py
    • 2_02_13_02_POST_NAT_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
  • Folder: NPIFs_generation_codes/POST_NAT_BRCA/NPIFs_Generation
  • NPIFs: 2_03_02_05_POST_NAT_BRCA_NPIFs_Calculation_HoverNetPrediction_Filtered_Tiles_Top25Q.py/.ipynb
  • Folder: NPIFs_generation_codes/POST_NAT_BRCA/NPIFs_Generation
  • Mapping: 3_01_01_07_POST_NAT_Mapped_Original_Value...Top25Q.py/.ipynb
  • Folder: Subtypes_prediction_codes/POST_NAT_BRCA
  • Subtype Prediction: 6_01_04_103_04_103_Lasso_POST_NAT_Prediction...Top25Q.py/.ipynb

9. Survival Analysis with EXPAND Features

  • Folder: Survival_codes
  • Mapping scripts:
    • 5_01_01_mapped_hovernet_npifs_to_tcga_survival.py/.ipynb
    • 5_01_02_mapped_pathai_hifs_to_tcga_survival.py/.ipynb
    • 5_01_03_mapped_pathai_nuhifs_to_tcga_survival.py/.ipynb
    • 5_01_04_mapped_pathai_pifs_to_tcga_survival.py/.ipynb
  • Folder: Survival_codes
  • Model scripts:
    • 6_01_01_all_npifs_OS_analysis_with_age_cv.py/.ipynb
    • 6_01_02_01_all_hifs_OS_analysis...py/.ipynb
    • 6_01_03_01_all_nuhifs_OS_analysis...py/.ipynb
    • 6_01_04_01_all_pifs_OS_analysis...py/.ipynb
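
The Key Features section mentions Kaplan–Meier analysis alongside Cox regression. As background, here is a small pure-Python sketch of the Kaplan–Meier estimator itself, assuming right-censored data given as durations and event indicators (1 = event observed, 0 = censored). It is for illustration only; the repository lists lifelines as a dependency, and the function name here is hypothetical.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    Returns a list of (time, survival_probability) at each distinct time
    where at least one event occurred.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same time point.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set after time t.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Four patients; the second is censored at time 2.
    print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Censored subjects never lower the curve themselves; they only shrink the risk set used at later event times.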

10. Subtype Prediction from PathAI-derived Features

  • Folder: PathAI_codes
  • Scripts:
    • 1_01_01_mapped_tcga_biomarker_status_to_original_hifs_with_comments.py/.ipynb
    • 2_01_01_PathAI_Metadata_Original_nuHIFs_And_TCGA_BiomarkerStatus.py/.ipynb
    • 3_01_01_PathAI_Metadata_Original_PIFs_And_TCGA_BiomarkerStatus.py/.ipynb
    • 1_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_HIFs_...Classification.py/.ipynb
    • 2_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_nuHIFs_..._Classification.py/.ipynb
    • 3_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_PIFs_..._Classification.py/.ipynb
    • 3_01_04_103_04_103_01_BRCA_Clinical_Subtype_..._All_PathAI_NPIFs_..._Classification.py/.ipynb

11. Direct Feature Extraction (ResNet50)

  • Folder: Direct_codes
  • Scripts:
    • 1_01_get_tiles_from_slide.py
    • 1_02_get_features_from_tiles2.py
    • 1_03_collect_all_features_masks.py
    • 1_11_jobs_to_get_tiles.py
    • 1_12_jobs_to_get_features2.py
    • 1_13_jobs_to_collect_features2.py
    • 3_01_01_02_TCGA_BRCASubtypes_to_DirectHnE_Features_Resnet50.py/.ipynb
    • 3_01_04_103_04_103_02_BRCA_Clinical_Subtype_Prediction_Using_All_Direct_Features.py/.ipynb
  • Task: Extract slide-level embeddings with ResNet50 and train subtype classifiers.

Reproducing Results

All results described in the manuscript can be reproduced using the scripts provided in this repository.

  • Follow the step-by-step workflow in the User Guide (PDF) to replicate subtype classification, external validation, and survival analyses.
  • All manuscript-related figures are available here: Figures/.
  • All TCGA-BRCA subtype-specific models are available here: Models/.

Contact

Cancer Data Science Lab, NCI, NIH
