This repository provides reference data-processing pipelines and examples for Open Data Hub / Red Hat OpenShift AI. It focuses on document conversion and chunking using the Docling toolkit, packaged as Kubeflow Pipelines (KFP), example Jupyter Notebooks, and helper scripts.
The custom-workbench-image directory also provides a guide on how to create a custom workbench image to run Docling and the example notebooks in this repository.
```
data-processing
|
|- custom-workbench-image
|
|- kubeflow-pipelines
|  |- docling-standard
|  |- docling-vlm
|
|- notebooks
|  |- tutorials
|  |- use-cases
|
|- scripts
   |- subset_selection
```

Refer to the Data Processing Kubeflow Pipelines documentation for instructions on how to install, run, and customize the Standard and VLM pipelines.
Data-processing Jupyter notebooks are organized into use-cases and tutorials.
Open Data Hub lets users add and run custom workbench images. A sample Containerfile and instructions for creating a custom workbench image are in custom-workbench-image.
Curated scripts related to data processing live in the scripts directory. For example, the subset-selection scripts allow users to identify representative samples from large training datasets.
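To illustrate the general idea behind subset selection (the repository's own implementation lives in scripts/subset_selection and may differ), here is a minimal sketch of one common heuristic, farthest-point sampling: greedily pick points that are far from everything selected so far, so the chosen subset spreads out to cover the dataset. The function name and data here are hypothetical, not taken from this repository:

```python
import math
import random

def farthest_point_sample(points, k):
    """Hypothetical sketch: pick k indices whose points spread out to
    cover the dataset (a common subset-selection heuristic, not
    necessarily the algorithm used in scripts/subset_selection)."""
    n = len(points)
    dim = len(points[0])
    # Seed with the point nearest the dataset centroid.
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    first = min(range(n), key=lambda i: math.dist(points[i], centroid))
    selected = [first]
    # Track each point's distance to its nearest selected point.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # Greedily take the point farthest from the current subset.
        nxt = max(range(n), key=lambda i: nearest[i])
        selected.append(nxt)
        nearest = [min(nearest[i], math.dist(points[i], points[nxt]))
                   for i in range(n)]
    return selected

random.seed(0)
data = [[random.random() for _ in range(4)] for _ in range(200)]
subset = farthest_point_sample(data, k=8)
```

In practice, production subset-selection tools typically operate on learned embeddings of the training examples rather than raw feature vectors, but the coverage intuition is the same.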
We welcome issues and pull requests. Please:
- Open an issue describing the change.
- Include testing instructions.
- For pipeline/component changes, recompile the pipeline and update generated YAML if applicable.
- Keep parameter names and docs consistent between code and README.
Apache License 2.0