Data Processing

This repository provides reference data-processing pipelines and examples for Open Data Hub / Red Hat OpenShift AI. It focuses on document conversion and chunking using the Docling toolkit, packaged as Kubeflow Pipelines (KFP), example Jupyter Notebooks, and helper scripts.

The custom-workbench-image directory also provides a guide to building a custom workbench image that runs Docling and the example notebooks in this repository.
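As a rough illustration of what "chunking" means in this context, the sketch below splits text into overlapping word windows using only the Python standard library. It is not Docling's chunker and is not taken from the pipelines in this repository; the function name `chunk_words` and its parameters `max_words` and `overlap` are hypothetical, chosen only for this example.

```python
def chunk_words(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that content near a
    boundary appears in both neighbouring chunks.
    Illustrative only: this is NOT how Docling chunks documents.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(str(i) for i in range(12))
    for chunk in chunk_words(sample, max_words=5, overlap=2):
        print(chunk)
```

Word-window chunking ignores document structure such as headings and tables, so treat this only as a mental model for the chunking step rather than a substitute for the Docling-based pipelines and notebooks.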

📦 Repository Structure

data-processing
|
|- custom-workbench-image
|
|- kubeflow-pipelines
|   |- docling-standard
|   |- docling-vlm
|
|- notebooks
|   |- tutorials
|   |- use-cases
|
|- scripts
|   |- subset_selection
✨ Getting Started

Kubeflow Pipelines

Refer to the Data Processing Kubeflow Pipelines documentation for instructions on how to install, run, and customize the Standard and VLM pipelines.

Notebooks

The data-processing Jupyter notebooks are organized into use-cases and tutorials.

Custom Workbench Image

Open Data Hub lets users add and run custom workbench images.

A sample Containerfile and instructions to create a custom workbench image are in custom-workbench-image.

Scripts

Curated scripts related to data processing live in the scripts directory.

For example, the subset selection scripts let users identify representative samples from large training datasets.
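To make "representative samples" concrete, here is a minimal sketch of one common approach: greedy farthest-point (k-center) selection over feature vectors. This is an assumption made for illustration; the scripts in subset_selection may use a different technique, and the function name `select_representatives` is hypothetical.

```python
import math


def select_representatives(points, k):
    """Return indices of up to k points chosen by greedy farthest-point selection.

    Starting from the first point, repeatedly pick the point that is
    farthest from everything selected so far, so the subset spreads
    across the dataset. Illustrative sketch only; not the repository's
    actual subset selection method.
    """
    if k <= 0 or not points:
        return []
    k = min(k, len(points))

    selected = [0]
    chosen = {0}
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[0]) for p in points]

    while len(selected) < k:
        candidates = (i for i in range(len(points)) if i not in chosen)
        idx = max(candidates, key=nearest.__getitem__)
        selected.append(idx)
        chosen.add(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return selected


if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (10, 0), (10.1, 0), (5, 5)]
    print(select_representatives(data, 3))
```

In the example data there are two tight clusters and one isolated point; the greedy rule picks one point from each region instead of several near-duplicates, which is the intuition behind selecting a representative subset.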

🤝 Contributing

We welcome issues and pull requests. Please:

Open an issue describing the change.
Include testing instructions.
For pipeline/component changes, recompile the pipeline and update generated YAML if applicable.
Keep parameter names and docs consistent between code and README.

📄 License

Apache License 2.0

