This directory contains comprehensive demonstrations of the ML Pipeline framework capabilities using real-world datasets. Each demo showcases different aspects of the framework and provides practical examples of machine learning workflows.
```
demos/
├── README.md                      # This file
├── data/                          # Real datasets downloaded via wget
│   ├── telco_customer_churn.csv   # Telecom customer churn data
│   ├── wine_quality_red.csv       # Wine quality ratings
│   ├── adult_income.csv           # Census income data
│   └── titanic.csv                # Titanic passenger data
├── configs/                       # Demo configurations
│   ├── telco_churn_demo.yaml      # Customer churn classification
│   ├── wine_quality_demo.yaml     # Wine quality regression
│   ├── titanic_survival_demo.yaml # Survival prediction
│   └── adult_income_demo.yaml     # Income prediction
├── schemas/                       # Data schemas (optional)
└── results/                       # Demo outputs and logs
```
**File:** `configs/telco_churn_demo.yaml`
**Dataset:** Telecom customer data (7,043 customers)

**What it demonstrates:**
- ✅ Binary classification with imbalanced classes
- ✅ Feature engineering with mixed data types
- ✅ Handling categorical variables with one-hot encoding
- ✅ Class balancing techniques
- ✅ Hyperparameter optimization with Optuna
- ✅ Comprehensive evaluation metrics
- ✅ Drift detection and monitoring

**Business Context:** Predict which customers are likely to churn so marketing can proactively retain them.

**Key Features:**
- 20+ features including tenure, charges, services
- Binary target: Churn (Yes/No)
- Class imbalance handling with `class_weight="balanced"`
- F1-score optimization for imbalanced data
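The `class_weight="balanced"` setting reweights each class inversely to its frequency, which is what lets the model pay attention to the minority churn class. A minimal sketch of the underlying formula (the same one scikit-learn uses, `n_samples / (n_classes * class_count)`); the 3:1 split below is illustrative, not the real Telco distribution:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so rarer classes count for more during training."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative 3:1 imbalance, similar in spirit to the churn target
weights = balanced_class_weights(["No"] * 75 + ["Yes"] * 25)
print(weights)  # {'No': 0.666..., 'Yes': 2.0}
```

Each minority-class ("Yes") sample now contributes three times the loss of a majority-class sample, which is also why F1 rather than raw accuracy is the metric worth optimizing here.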
**File:** `configs/wine_quality_demo.yaml`
**Dataset:** Portuguese red wine quality (1,599 wines)

**What it demonstrates:**
- ✅ Regression with continuous target variable
- ✅ Feature scaling for numerical data
- ✅ Regression metrics (MSE, MAE, R², MAPE)
- ✅ Residual analysis and prediction plots
- ✅ Feature importance analysis

**Business Context:** Predict wine quality scores based on physicochemical properties to optimize wine production.

**Key Features:**
- 11 numerical features (acidity, sugar, alcohol, etc.)
- Target: Quality score (0-10)
- All features are numerical and require standardization
- Perfect for demonstrating regression capabilities
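Because the eleven features live on very different scales (total sulfur dioxide runs into the hundreds while density sits near 1.0), they are standardized before training. A quick sketch of the z-score transform, using made-up alcohol values rather than rows from the actual dataset:

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

alcohol = [9.4, 9.8, 10.5, 11.2, 12.8]  # illustrative values only
scaled = standardize(alcohol)
# scaled now has mean 0 and standard deviation 1
```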
**File:** `configs/titanic_survival_demo.yaml`
**Dataset:** Titanic passenger data (891 passengers)

**What it demonstrates:**
- ✅ Feature engineering with missing data
- ✅ Handling mixed data types (numerical + categorical)
- ✅ Missing value imputation strategies
- ✅ Feature selection and engineering
- ✅ Classic binary classification

**Business Context:** Historical analysis of survival factors to understand passenger demographics and safety factors.

**Key Features:**
- Mixed data types: Age, Fare (numerical), Sex, Class (categorical)
- Significant missing data requiring imputation
- Classic ML problem for learning and comparison
- Well-balanced target variable
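About a fifth of the Age values in this dataset are missing, so the imputation strategy matters. One common choice is median imputation, sketched here on a made-up column (`None` marks a missing value):

```python
import statistics

def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]

ages = [22.0, None, 26.0, 35.0, None, 54.0]  # illustrative, not real rows
print(impute_median(ages))  # [22.0, 30.5, 26.0, 35.0, 30.5, 54.0]
```

The median is preferred over the mean here because it is robust to the handful of very old or very young passengers.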
**File:** `configs/adult_income_demo.yaml`
**Dataset:** US Census data (32,561 individuals)

**What it demonstrates:**
- ✅ Large dataset handling
- ✅ Complex categorical encoding
- ✅ XGBoost for structured data
- ✅ High-cardinality categorical features
- ✅ Early stopping and advanced optimization

**Business Context:** Predict income levels for socioeconomic analysis and policy planning.

**Key Features:**
- Large dataset (32K+ samples)
- Many categorical features with high cardinality
- Income prediction (>50K vs ≤50K)
- Demonstrates scalability and performance
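Columns such as `native-country` in this dataset take dozens of distinct values, so naive one-hot encoding produces a very wide matrix. A bare-bones sketch of the idea (the pipeline's actual encoder is configured in the YAML; this is just an illustration):

```python
def one_hot(column):
    """Expand a categorical column into one 0/1 indicator per category."""
    categories = sorted(set(column))
    return [[1 if value == cat else 0 for cat in categories] for value in column]

# Illustrative workclass values; the real column has more categories
encoded = one_hot(["Private", "State-gov", "Private", "Self-emp"])
# Indicator order is sorted: ['Private', 'Self-emp', 'State-gov']
print(encoded[0])  # [1, 0, 0]
```

With high-cardinality columns the feature count grows quickly, which is one reason tree-based models such as XGBoost are a good fit for this demo.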
- Ensure ML Pipeline is installed:

  ```bash
  cd /path/to/mlpipeline
  pip install -e .
  ```

- Start the MLflow tracking server:

  ```bash
  # Option 1: Using Docker (recommended)
  make up-dev
  # MLflow UI available at http://localhost:5000

  # Option 2: Local MLflow
  mlflow server --host 0.0.0.0 --port 5000
  ```

- Validate and analyze the demo configuration:

  ```bash
  # From project root directory
  mlpipeline validate --config demos/configs/telco_churn_demo.yaml
  mlpipeline analyze --config demos/configs/telco_churn_demo.yaml --check-data --suggest-improvements
  ```

- Run training:

  ```bash
  # Quick demo run (reduced trials for speed)
  mlpipeline train --config demos/configs/telco_churn_demo.yaml --experiment-id "demo_$(date +%Y%m%d)"

  # Production-quality run (full hyperparameter optimization)
  mlpipeline train --config demos/configs/telco_churn_demo.yaml --experiment-id "full_demo_$(date +%Y%m%d)"
  ```

- Monitor progress:

  ```bash
  # Real-time monitoring
  mlpipeline progress --experiment-id demo_20240118 --follow

  # Check experiments
  mlpipeline experiments --limit 5 --sort-by accuracy
  ```

- Where to find the outputs:
  - MLflow UI: http://localhost:5000
  - Logs: `demos/results/telco_churn_demo.log`
  - Artifacts: saved in MLflow (models, plots, metrics)
| Demo | Dataset Size | Problem Type | Complexity | Time to Run | Key Learning |
|---|---|---|---|---|---|
| Telco Churn | 7K rows | Binary Classification | Medium | ~20 min | Imbalanced data, business impact |
| Wine Quality | 1.6K rows | Regression | Low | ~15 min | Regression metrics, feature scaling |
| Titanic | 891 rows | Binary Classification | Medium | ~10 min | Feature engineering, missing data |
| Adult Income | 32K rows | Binary Classification | High | ~40 min | Large data, complex categories |
**Wine Quality:**
- Simple regression problem
- All numerical features
- Clear interpretation
- Fast execution

**Titanic:**
- Feature engineering
- Missing data handling
- Mixed data types
- Classic ML problem

**Telco Churn:**
- Business-relevant problem
- Class imbalance handling
- Comprehensive evaluation
- Drift detection

**Adult Income:**
- Large dataset challenges
- Complex categorical features
- Advanced optimization
- Production considerations
- Reduce training time for quick tests:

  ```yaml
  model:
    hyperparameter_tuning:
      n_trials: 10   # Instead of 50
      timeout: 300   # 5 minutes instead of 30
  ```

- Use the distributed backend:

  ```bash
  # Start distributed stack
  make up-distributed
  ```

  ```yaml
  # Update config to use distributed backend (add to config.yaml):
  distributed:
    backend: "dask"
    scheduler_address: "localhost:8786"
  ```

- Try XGBoost instead of Random Forest:

  ```yaml
  model:
    type: xgboost
    parameters:
      objective: "binary:logistic"
      n_estimators: 100
      max_depth: 6
  ```

Expected results:

**Telco Churn:**
- Accuracy: ~80-85%
- F1-Score: ~60-65% (due to class imbalance)
- ROC-AUC: ~85-90%
- Training Time: 15-30 minutes
**Wine Quality:**
- R² Score: ~0.35-0.45
- RMSE: ~0.65-0.75
- MAE: ~0.50-0.60
- Training Time: 10-20 minutes
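An R² in this range means the model explains roughly 40% of the variance in quality scores, which is typical for this dataset. As a refresher on what the metric computes, a sketch with made-up scores and predictions:

```python
def r2_score(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [5, 6, 5, 7, 6]            # made-up quality scores
y_pred = [5.5, 5.8, 5.2, 6.4, 6.1]  # made-up predictions
print(round(r2_score(y_true, y_pred), 3))  # 0.75
```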
**Titanic:**
- Accuracy: ~80-85%
- F1-Score: ~75-80%
- ROC-AUC: ~85-88%
- Training Time: 5-15 minutes
**Adult Income:**
- Accuracy: ~85-87%
- F1-Score: ~70-75%
- ROC-AUC: ~90-92%
- Training Time: 30-60 minutes
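The spread between accuracy and F1 in the ranges above is driven by class balance: on imbalanced problems like Telco Churn, F1 drops even when accuracy looks healthy. A reminder of how F1 combines precision and recall, with made-up confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision (tp / (tp + fp)) and recall (tp / (tp + fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts where many true positives are missed (low recall)
print(round(f1_score(tp=50, fp=20, fn=40), 3))  # 0.625
```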
Common issues:

- **"Dataset not found"**

  ```bash
  # Ensure you're running from project root
  pwd             # Should end with /mlpipeline
  ls demos/data/  # Should show CSV files
  ```

- **"MLflow connection error"**

  ```bash
  # Start MLflow server
  make up-dev
  # Or manually:
  mlflow server --host 0.0.0.0 --port 5000
  ```

- **"Configuration validation failed"**

  ```bash
  # Check configuration syntax
  mlpipeline validate --config demos/configs/your_demo.yaml
  ```

- **"Memory/performance issues"**

  ```bash
  # Reduce hyperparameter trials: change n_trials in the config from 50 to 10
  # Or use distributed computing:
  make up-distributed
  ```

Getting help:

- Check logs: `demos/results/*.log`
- Enable verbose mode: add `--verbose` to any command
- Validate setup: `mlpipeline status --detailed`
- Check system: `mlpipeline analyze --config your_config.yaml`
After running the demos:
- **Explore MLflow UI** - Compare experiments and models
- **Modify configurations** - Try different algorithms and parameters
- **Add custom features** - Implement your own preprocessing steps
- **Scale up** - Test with distributed computing
- **Deploy models** - Use inference capabilities
- **Monitor drift** - Set up real-time monitoring
Happy experimenting! 🚀 These demos showcase the full power of the ML Pipeline framework. Start with any demo that matches your interest and skill level!