
πŸ“ Data Versioning and CI/CD with DVC

This project demonstrates a minimal setup for data versioning using DVC and CI/CD using GitHub Actions, focused on the real-world scenario of managing datasets stored in the cloud (e.g., AWS S3 or GCP Cloud Storage).


🧩 Scenario

This project addresses a common real-world need in data engineering and MLOps:

  • βœ… Version and track datasets used in data pipelines using DVC.
  • ☁️ Simulate remote data access from cloud storage providers like AWS S3 or GCP Cloud Storage.
  • πŸ›‘οΈ Validate dataset accessibility and structure before running pipelines β€” helping to:
    • Avoid unnecessary data pushes or downloads
    • Prevent pipeline failures due to missing or malformed files
    • Reduce cloud compute and storage costs

βœ… Current Logic

The script test_pipeline.py performs a dry run:

  • Attempts to access a CSV file (e.g., data/sample.csv).
  • Prints the number of rows and columns to verify structure.
  • The script runs automatically as part of the GitHub Actions CI/CD workflow.
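A minimal sketch of such a dry run (the function name and `data/sample.csv` path here are illustrative, not necessarily the repository's actual `test_pipeline.py`):

```python
import os
import sys

import pandas as pd

DATA_PATH = "data/sample.csv"  # illustrative path; adjust to the dataset in use


def dry_run(path: str) -> tuple[int, int]:
    """Check that the CSV exists and report its shape as (rows, columns)."""
    if not os.path.exists(path):
        # Fail fast before any pipeline work is done
        print(f"ERROR: dataset not found at {path}")
        sys.exit(1)
    df = pd.read_csv(path)
    rows, cols = df.shape
    print(f"OK: {path} has {rows} rows and {cols} columns")
    return rows, cols


if __name__ == "__main__":
    dry_run(DATA_PATH)
```

Because the script exits non-zero when the file is missing, a CI job that runs it will fail early instead of proceeding to an expensive pipeline step.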

πŸ”§ To-Do

  1. βœ… Check if file exists before trying to read it.
  2. 🟨 Use partial read (e.g., nrows=10) or wildcard/glob to test schema/layout.
  3. 🟨 Learn and set up a DVC remote (e.g., S3, SSH, or a mapped drive).
  4. 🟨 Optionally add data schema validation (e.g., using pandera or great_expectations).
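The partial-read and layout-check items above could be sketched as follows (assumptions: the `EXPECTED_COLUMNS` names are placeholders, and this uses a plain pandas check rather than pandera or great_expectations):

```python
import glob

import pandas as pd

EXPECTED_COLUMNS = ["id", "value"]  # placeholder schema; replace with real column names


def validate_layout(pattern: str, expected: list[str]) -> bool:
    """Read only the first few rows of each CSV matching a glob pattern
    and verify the expected columns are present."""
    ok = True
    for path in glob.glob(pattern):
        # Partial read: nrows=10 avoids loading the full dataset just to check layout
        head = pd.read_csv(path, nrows=10)
        missing = [c for c in expected if c not in head.columns]
        if missing:
            print(f"{path}: missing columns {missing}")
            ok = False
        else:
            print(f"{path}: layout OK ({len(head.columns)} columns)")
    return ok
```

Reading only a handful of rows keeps the check cheap even when the underlying files are large, which fits the goal of reducing unnecessary data transfer and compute.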

πŸ“š Libraries Used

  • DVC – For data versioning, remote data storage, and pipeline tracking
  • pandas – For reading and validating structured data (CSV, etc.)
