This project demonstrates a minimal setup for data versioning with DVC and CI/CD with GitHub Actions, focused on the real-world scenario of managing datasets stored in the cloud (e.g., AWS S3 or GCP Cloud Storage).
It addresses a common need in data engineering and MLOps:
- Version and track datasets used in data pipelines with DVC.
- Simulate remote data access from cloud storage providers such as AWS S3 or GCP Cloud Storage (see the sketch after this list).
- Validate dataset accessibility and structure before running pipelines, which helps to:
  - Avoid unnecessary data pushes or downloads
  - Prevent pipeline failures due to missing or malformed files
  - Reduce cloud compute and storage costs
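One way to exercise remote data access from Python is DVC's built-in API. The snippet below is a minimal sketch, assuming `data/sample.csv` is tracked by DVC in this repository and a default remote (e.g., an S3 bucket) is already configured; it is not part of the project's scripts.

```python
import dvc.api
import pandas as pd

# Minimal sketch (assumption): stream a DVC-tracked CSV from the configured
# remote without materializing the whole local cache first.
with dvc.api.open(
    "data/sample.csv",  # path as tracked by DVC in this repo
    repo=".",           # current repository; a Git URL also works
    mode="r",
) as f:
    df = pd.read_csv(f)

print(f"Loaded {len(df)} rows and {len(df.columns)} columns from the DVC remote")
```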
The script `test_pipeline.py` performs a dry run and is executed automatically in CI/CD:
- It attempts to access a CSV file (e.g., `data/sample.csv`).
- It prints the number of rows and columns to verify the structure.
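Based on that description, a minimal sketch of what `test_pipeline.py` might contain is shown below; the actual script in the repository may differ.

```python
"""Dry-run check executed in CI/CD: verify the dataset is reachable and well formed."""
import sys

import pandas as pd

DATA_PATH = "data/sample.csv"  # example path from the project description


def main() -> int:
    try:
        df = pd.read_csv(DATA_PATH)
    except FileNotFoundError:
        print(f"ERROR: {DATA_PATH} not found - was the data pulled from the DVC remote?")
        return 1
    # Print the shape so the CI log shows the dataset structure at a glance.
    print(f"{DATA_PATH}: {df.shape[0]} rows x {df.shape[1]} columns")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```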
Further checks and setup steps worth considering:
- Check whether the file exists before trying to read it.
- Use a partial read (e.g., `nrows=10`) or a wildcard/glob to test the schema and layout.
- Learn and set up a DVC remote (e.g., S3, SSH, or a mapped drive).
- Optionally add data schema validation (e.g., using `pandera` or `great_expectations`).
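The first, second, and fourth items could be combined into a single pre-flight check. The sketch below is illustrative only; the column names and dtypes in the `pandera` schema are placeholders, not the project's real schema.

```python
from pathlib import Path

import pandas as pd
import pandera as pa

DATA_PATH = Path("data/sample.csv")

# 1. Check that the file exists before trying to read it.
if not DATA_PATH.is_file():
    raise SystemExit(f"Missing dataset: {DATA_PATH}")

# 2. Partial read: the first 10 rows are enough to confirm the layout
#    without parsing the whole file.
sample = pd.read_csv(DATA_PATH, nrows=10)
print(f"Columns found: {list(sample.columns)}")

# 3. Optional schema validation with pandera.
#    The column names and dtypes below are placeholders for illustration.
schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int),
        "value": pa.Column(float, nullable=True),
    }
)
schema.validate(sample)
print("Schema check passed on the sampled rows.")
```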
Key tools:
- DVC: for data versioning, remote data storage, and pipeline tracking
- pandas: for reading and validating structured data (CSV, etc.)