Note: The manuscript has been submitted and is under review.
Citation (preprint available):
R. K. Barman, S. R. Dhruba, D. T. Hoang, E. D. Shulman, E. M. Campagnolo, A. T. Wang, S. A. Harmon, T. C. Hu, A. Papanicolau-Sengos, M. P. Nasrallah, K. D. Aldape, E. Ruppin.
"Pathologist-interpretable breast cancer subtyping and stratification from AI-inferred nuclear features", 2025.
EXPAND is an open-source, interpretable AI pipeline designed to predict breast cancer (BC) subtypes and patient survival risk directly from H&E-stained whole-slide images (WSIs).
While many deep learning models achieve strong accuracy, they often lack interpretability and do not reflect how pathologists evaluate morphology. EXPAND bridges this gap by focusing on a compact set of biologically meaningful nuclear features, making the pipeline intuitive, reproducible, and clinically relevant.
Figure: The full pipeline for EXPAND
- Transparent: Uses 12 Nuclear Pathologist-Interpretable Features (NPIFs) derived from nuclei segmented with open-source tools.
- Robust: Using logistic regression with cross-validation, achieves predictive performance comparable to or better than black-box DL models.
- Generalizable: Validated on CPTAC-BRCA and POST-NAT-BRCA cohorts in addition to TCGA-BRCA.
- Scalable: Requires only standard H&E slides and Hover-Net segmentation, making it deployable across cancer types and settings.
- Prognostic: NPIFs independently predict survival outcomes (OS, PFS), enabling clinically interpretable risk stratification.
- 12 NPIFs – compact, biologically interpretable nuclear features (area, perimeter, eccentricity, etc.) aligned with pathologist workflows.
- Subtype prediction – HER2+, HR+, and TNBC classifiers trained with logistic regression.
- External validation – tested on CPTAC-BRCA and POST-NAT-BRCA datasets.
- Survival modeling – multivariate Cox regression models per subtype with Kaplan–Meier analysis for OS and PFS.
- Workflow example – from WSI tiling through NPIF computation.
- All source code is included in this repository.
- A short user guide is provided below for quick setup. A full, step-by-step pipeline walkthrough is available in the detailed User Guide (PDF).
- The ML predictors were developed on macOS (Python) and tested on Linux (HPC environment). Scripts can be run interactively in a Python IDE or from the command line:
python script_name.py
- Please make sure to update the working directory and adjust all file/folder paths in each script to match your environment before running.
Developed with Python ≥ 3.10. Core dependencies:
numpy >= 1.24.4
pandas >= 2.0.3
matplotlib >= 3.7.2
seaborn >= 0.13.2
scikit-learn >= 1.3.0
joblib >= 1.3.0
opencv-python >= 4.10.0
torch >= 1.12.1
torchvision >= 0.13
Pillow >= 9.2.0
openslide-python >= 1.3.1
tqdm >= 4.65.0
pickle (part of the Python standard library; no separate installation needed)
lifelines >= 0.28.0
To install requirements:
pip install -r requirements.txt
This repository contains the complete EXPAND pipeline for tile generation, nuclear segmentation, NPIF computation, subtype prediction, external validation, and survival analysis. The steps are organized sequentially so users can reproduce the workflow end-to-end.
- Folder:
Slide_preprocessing_codes
- Scripts:
1_01_get_tiles_from_slide.py
1_11_jobs_to_get_tiles.py
- Task: Generate 512×512 tiles from H&E WSIs at 20× magnification.
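The tiling step can be pictured with a minimal sketch that only computes where 512×512 tiles would start on a slide. It is not the code in `1_01_get_tiles_from_slide.py`: the function name `tile_origins` is hypothetical, and dropping partial tiles at the right and bottom edges is an assumption. Reading the pixels themselves (for example with openslide-python at the level corresponding to 20×) is left out.

```python
def tile_origins(width, height, tile_size=512):
    """Return the top-left (x, y) pixel coordinate of every full tile.

    Assumption: tiles do not overlap and partial edge tiles are dropped.
    """
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, tile_size)
        for x in range(0, width - tile_size + 1, tile_size)
    ]


# A 1300 x 1024 pixel region yields a 2 x 2 grid of full tiles.
origins = tile_origins(width=1300, height=1024)
print(len(origins), origins)
```

Each origin would then be passed to whatever slide reader is in use to extract the corresponding tile image.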
- Folder:
NPIFs_generation_codes/TCGA_BRCA/Segmentation
- Scripts:
2_01_22_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
2_01_100_01_JobSubmissionCode.py/.ipynb
- Task: Run Hover-Net to segment and classify nuclei per tile.
- Folder:
NPIFs_generation_codes/TCGA_BRCA/Morphology_features_calculation
- Scripts:
2_02_03_MorphologyCalculation_All_Slides.py/.ipynb
2_02_13_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
- Task: Compute per-nucleus morphology (area, perimeter, axis length, eccentricity, circularity).
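As an illustration of the per-nucleus measurements listed in this task, the sketch below derives area, perimeter, circularity, and eccentricity from a nucleus outline using standard geometric formulas. The function names are hypothetical and the repository scripts may compute these quantities differently (for example from segmentation masks rather than polygon vertices).

```python
import math


def polygon_area(points):
    """Area of a closed polygon [(x, y), ...] via the shoelace formula."""
    n = len(points)
    twice_area = sum(
        points[i][0] * points[(i + 1) % n][1]
        - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of edge lengths, closing the polygon back to the first vertex."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def circularity(area, perimeter):
    """4*pi*A / P^2: equals 1 for a perfect circle, lower for irregular shapes."""
    return 4.0 * math.pi * area / perimeter ** 2


def eccentricity(major_axis, minor_axis):
    """Eccentricity of an ellipse with these axis lengths (0 means a circle)."""
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)


# Toy outline: a 4 x 4 square standing in for a segmented nucleus contour.
outline = [(0, 0), (4, 0), (4, 4), (0, 4)]
area = polygon_area(outline)
perimeter = polygon_perimeter(outline)
print(area, perimeter, circularity(area, perimeter), eccentricity(2.0, 1.0))
```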
- Folder:
NPIFs_generation_codes/TCGA_BRCA/NPIFs_Generation
- Scripts:
2_03_01_01_NPIFs_Calculation_HoverNet_V0.py/.ipynb (all tiles)
2_03_01_01_NPIFs_Calculation_HoverNet_V1.py/.ipynb (top 25% cancer-enriched tiles)
- Task: Compute 12 NPIFs per slide from Hover-Net outputs.
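To make the "top 25% cancer-enriched tiles" variant concrete, here is a small sketch that ranks tiles by a score, keeps the top quartile, and averages one feature over the retained tiles. Everything named here is an assumption for illustration: the field names `cancer_fraction` and `mean_area`, the use of a simple mean, and the ranking criterion itself. The actual definitions of the 12 NPIFs are in the repository scripts and the User Guide.

```python
from statistics import mean


def top_quartile_tiles(tiles, score_key="cancer_fraction"):
    """Keep the 25% of tiles with the highest score (always at least one)."""
    ranked = sorted(tiles, key=lambda t: t[score_key], reverse=True)
    keep = max(1, len(ranked) // 4)
    return ranked[:keep]


def slide_level_mean(tiles, feature):
    """Aggregate a per-tile feature into a single slide-level value."""
    return mean(t[feature] for t in tiles)


# Hypothetical per-tile summaries for one slide.
tiles = [
    {"tile_id": "t1", "cancer_fraction": 0.10, "mean_area": 30.0},
    {"tile_id": "t2", "cancer_fraction": 0.90, "mean_area": 50.0},
    {"tile_id": "t3", "cancer_fraction": 0.20, "mean_area": 32.0},
    {"tile_id": "t4", "cancer_fraction": 0.80, "mean_area": 40.0},
    {"tile_id": "t5", "cancer_fraction": 0.30, "mean_area": 31.0},
    {"tile_id": "t6", "cancer_fraction": 0.40, "mean_area": 33.0},
    {"tile_id": "t7", "cancer_fraction": 0.50, "mean_area": 35.0},
    {"tile_id": "t8", "cancer_fraction": 0.60, "mean_area": 36.0},
]

selected = top_quartile_tiles(tiles)
print([t["tile_id"] for t in selected], slide_level_mean(selected, "mean_area"))
```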
- Folder:
NPIFs_generation_codes/TCGA_BRCA/NPIFs_Generation
- Scripts:
3_01_01_02_Mapped_Original_Value_Hovernet_NPIFs_to_BRCA_Subtypes.py/.ipynb (all tiles)
3_01_01_06_...Top25Q.py/.ipynb (top 25% tiles)
- Task: Merge NPIFs with HER2, ER, PR metadata.
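The merge described in this task can be sketched as a join on slide ID followed by deriving binary subtype labels from receptor status. The label rules below follow a common clinical convention (HR+ if ER or PR is positive; TNBC if ER, PR, and HER2 are all negative) and are an assumption, not necessarily the definitions used in the mapping scripts. The slide IDs, the `mean_area` field, and the function name are hypothetical.

```python
def subtype_labels(her2, er, pr):
    """Map receptor status (True = positive) to three binary subtype labels.

    Assumed convention; the repository scripts may define subtypes differently.
    """
    hr_positive = er or pr
    return {
        "HER2+": int(her2),
        "HR+": int(hr_positive),
        "TNBC": int(not her2 and not hr_positive),
    }


# Hypothetical inputs: slide-level NPIFs and (HER2, ER, PR) status per slide.
npifs = {"slide_A": {"mean_area": 41.2}, "slide_B": {"mean_area": 55.0}}
receptors = {"slide_A": (False, True, False), "slide_C": (True, False, False)}

# Inner join: keep only slides that have both NPIFs and receptor metadata.
merged = {
    slide_id: {**features, **subtype_labels(*receptors[slide_id])}
    for slide_id, features in npifs.items()
    if slide_id in receptors
}
print(merged)
```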
- Folder:
Subtypes_prediction_codes/TCGA_BRCA
- Scripts:
4_01_04_103_04_101_...All_Tiles_Using_Lasso.py/.ipynb
4_01_04_103_04_103_...Top25Q.py/.ipynb
- Task: Train logistic regression classifiers (L1 penalty) for HER2+, HR+, TNBC.
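To show why an L1 penalty yields a compact, interpretable model, the following is a dependency-free sketch of L1-regularized logistic regression fitted by proximal gradient descent (ISTA). It is a teaching example and not the repository's training code, which would more typically rely on scikit-learn. The function names, hyperparameter values, and toy data are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.1, n_iter=500):
    """Minimise mean log-loss + lam * ||w||_1 (the intercept is not penalised)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # Gradient step on the smooth loss, then soft-thresholding for L1.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [[-2.0, 0.1], [-1.0, -0.1], [1.0, 0.1], [2.0, -0.1]]
y = [0, 0, 1, 1]
w, b = fit_l1_logistic(X, y)
print(w, b)
```

On this toy data the weight of the uninformative feature is shrunk to exactly zero while the informative one stays positive, which is the feature-selection behaviour that makes L1-penalised models easy to read.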
- Folder:
NPIFs_generation_codes/CPTAC_BRCA/Segmentation
- Segmentation:
2_01_22_02_Test_CPTAC_Dataset_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
2_01_100_02_01_JobSubmissionCode.py/.ipynb
- Folder:
NPIFs_generation_codes/CPTAC_BRCA/Morphology_features_calculation
- Morphology:
2_02_03_02_CPTAC_MorphologyCalculation_All_Slides.py/.ipynb
2_02_13_02_CPTAC_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
- Folder:
NPIFs_generation_codes/CPTAC_BRCA/NPIFs_Generation
- NPIFs:
2_03_02_05_CPTAC_BRCA_NPIFs_Calculation_HoverNetPrediction_Filtered_Tiles_Top25Q.py/.ipynb
- Folder:
NPIFs_generation_codes/CPTAC_BRCA/NPIFs_Generation
- Mapping:
3_01_01_07_CPTAC_Mapped_Original_Value...Top25Q.py/.ipynb
- Folder:
Subtypes_prediction_codes/CPTAC_BRCA
- External Prediction:
6_01_04_103_04_103_CPTAC_Prediction_Using_...Top25Q.py/.ipynb
- Folder:
NPIFs_generation_codes/POST_NAT_BRCA/Segmentation
- Segmentation:
2_01_22_02_Test_POST_NAT_Dataset_ExtractMorphologicalFeaturesFromHnE.py/.ipynb
2_01_100_02_POST_NAT_JobSubmissionCode.py/.ipynb
- Folder:
NPIFs_generation_codes/POST_NAT_BRCA/Morphology_features_calculation
- Morphology:
2_02_03_02_POST_NAT_MorphologyCalculation_All_Slides.py
2_02_13_02_POST_NAT_Job_Submission_MorphologyCalculation_All_Slides.py/.ipynb
- Folder:
NPIFs_generation_codes/POST_NAT_BRCA/NPIFs_Generation
- NPIFs:
2_03_02_05_POST_NAT_BRCA_NPIFs_Calculation_HoverNetPrediction_Filtered_Tiles_Top25Q.py/.ipynb
- Folder:
NPIFs_generation_codes/POST_NAT_BRCA/NPIFs_Generation
- Mapping:
3_01_01_07_POST_NAT_Mapped_Original_Value...Top25Q.py/.ipynb
- Folder:
Subtypes_prediction_codes/POST_NAT_BRCA
- Subtype Prediction:
6_01_04_103_04_103_Lasso_POST_NAT_Prediction...Top25Q.py/.ipynb
- Folder:
Survival_codes
- Mapping scripts:
5_01_01_mapped_hovernet_npifs_to_tcga_survival.py/.ipynb
5_01_02_mapped_pathai_hifs_to_tcga_survival.py/.ipynb
5_01_03_mapped_pathai_nuhifs_to_tcga_survival.py/.ipynb
5_01_04_mapped_pathai_pifs_to_tcga_survival.py/.ipynb
- Folder:
Survival_codes
- Model scripts:
6_01_01_all_npifs_OS_analysis_with_age_cv.py/.ipynb
6_01_02_01_all_hifs_OS_analysis...py/.ipynb
6_01_03_01_all_nuhifs_OS_analysis...py/.ipynb
6_01_04_01_all_pifs_OS_analysis...py/.ipynb
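For readers unfamiliar with the Kaplan–Meier analysis used alongside the Cox models, the sketch below implements the estimator in plain Python on toy data. It is illustrative only: the repository lists lifelines as a dependency, which provides a full implementation, and the function name and inputs here are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient.
    events: 1 if the event (e.g. death) was observed, 0 if censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after time t.
        at_risk -= leaving
    return curve


# Four patients; the second is censored at t=2.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# -> [(1, 0.75), (3, 0.375), (4, 0.0)]
```

Risk stratification then amounts to computing one such curve per risk group (for example, above versus below the median predicted risk) and comparing them.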
- Folder:
PathAI_codes
- Scripts:
1_01_01_mapped_tcga_biomarker_status_to_original_hifs_with_comments.py/.ipynb
2_01_01_PathAI_Metadata_Original_nuHIFs_And_TCGA_BiomarkerStatus.py/.ipynb
3_01_01_PathAI_Metadata_Original_PIFs_And_TCGA_BiomarkerStatus.py/.ipynb
1_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_HIFs_...Classification.py/.ipynb
2_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_nuHIFs_..._Classification.py/.ipynb
3_01_04_103_04_103_BRCA_Clinical_Subtype_..._All_PathAI_PIFs_..._Classification.py/.ipynb
3_01_04_103_04_103_01_BRCA_Clinical_Subtype_..._All_PathAI_NPIFs_..._Classification.py/.ipynb
- Folder:
Direct_codes
- Scripts:
1_01_get_tiles_from_slide.py
1_02_get_features_from_tiles2.py
1_03_collect_all_features_masks.py
1_11_jobs_to_get_tiles.py
1_12_jobs_to_get_features2.py
1_13_jobs_to_collect_features2.py
3_01_01_02_TCGA_BRCASubtypes_to_DirectHnE_Features_Resnet50.py/.ipynb
3_01_04_103_04_103_02_BRCA_Clinical_Subtype_Prediction_Using_All_Direct_Features.py/.ipynb
- Task: Extract slide-level embeddings with ResNet50 and train subtype classifiers.
All results described in the manuscript can be reproduced using the scripts provided in this repository.
- Follow the step-by-step workflow in the User Guide (PDF) to replicate subtype classification, external validation, and survival analyses.
- All manuscript-related figures are available here: Figures/.
- All TCGA-BRCA subtype-specific models are available here: Models/.
- Ranjan Kumar Barman
- Saugato Rahman Dhruba
Cancer Data Science Lab, NCI, NIH