This project demonstrates a minimal setup for data versioning with DVC and CI/CD with GitHub Actions, focused on the real-world scenario of managing datasets stored in the cloud (e.g., AWS S3 or GCP Cloud Storage).
It addresses a common need in data engineering and MLOps:
- Version and track datasets used in data pipelines with DVC.
- Simulate remote data access from cloud storage providers such as AWS S3 or GCP Cloud Storage (see the sketch after this list).
- Validate dataset accessibility and structure before running pipelines, which helps to:
  - Avoid unnecessary data pushes or downloads
  - Prevent pipeline failures due to missing or malformed files
  - Reduce cloud compute and storage costs
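One way to exercise remote data access from Python is DVC's built-in API. The snippet below is a minimal sketch, assuming `data/sample.csv` is tracked by DVC in this repository and a default remote (e.g., an S3 bucket) is already configured; it is not part of the project's scripts.

```python
import dvc.api
import pandas as pd

# Minimal sketch (assumption): stream a DVC-tracked CSV from the configured
# remote without materializing the whole local cache first.
with dvc.api.open(
    "data/sample.csv",  # path as tracked by DVC in this repo
    repo=".",           # current repository; a Git URL also works
    mode="r",
) as f:
    df = pd.read_csv(f)

print(f"Loaded {len(df)} rows and {len(df.columns)} columns from the DVC remote")
```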
The script `test_pipeline.py` performs a dry run and is executed automatically in CI/CD:
- It attempts to access a CSV file (e.g., `data/sample.csv`).
- It prints the number of rows and columns to verify the structure.
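Based on that description, a minimal sketch of what `test_pipeline.py` might contain is shown below; the actual script in the repository may differ.

```python
"""Dry-run check executed in CI/CD: verify the dataset is reachable and well formed."""
import sys

import pandas as pd

DATA_PATH = "data/sample.csv"  # example path from the project description


def main() -> int:
    try:
        df = pd.read_csv(DATA_PATH)
    except FileNotFoundError:
        print(f"ERROR: {DATA_PATH} not found - was the data pulled from the DVC remote?")
        return 1
    # Print the shape so the CI log shows the dataset structure at a glance.
    print(f"{DATA_PATH}: {df.shape[0]} rows x {df.shape[1]} columns")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```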
Further checks and setup steps worth considering:
- Check whether the file exists before trying to read it.
- Use a partial read (e.g., `nrows=10`) or a wildcard/glob to test the schema and layout.
- Learn and set up a DVC remote (e.g., S3, SSH, or a mapped drive).
- Optionally add data schema validation (e.g., using `pandera` or `great_expectations`).
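The first, second, and fourth items could be combined into a single pre-flight check. The sketch below is illustrative only; the column names and dtypes in the `pandera` schema are placeholders, not the project's real schema.

```python
from pathlib import Path

import pandas as pd
import pandera as pa

DATA_PATH = Path("data/sample.csv")

# 1. Check that the file exists before trying to read it.
if not DATA_PATH.is_file():
    raise SystemExit(f"Missing dataset: {DATA_PATH}")

# 2. Partial read: the first 10 rows are enough to confirm the layout
#    without parsing the whole file.
sample = pd.read_csv(DATA_PATH, nrows=10)
print(f"Columns found: {list(sample.columns)}")

# 3. Optional schema validation with pandera.
#    The column names and dtypes below are placeholders for illustration.
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),
        "value": pa.Column(float, nullable=True),
    }
)
schema.validate(sample)
print("Schema check passed on the sampled rows.")
```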
Key tools:
- DVC: for data versioning, remote data storage, and pipeline tracking
- pandas: for reading and validating structured data (CSV, etc.)