RobuRishabh (Contributor) commented Oct 1, 2025

R528: Add KFP Support for Accelerator Devices

Description

This PR extends the existing Docling Kubeflow Pipelines to support accelerator device configuration. The changes build upon the refactored common components from R1065 and add the ability to specify and configure accelerator devices (GPU, CPU, etc.) for both standard and VLM Docling pipelines.

Technical Details:

  • Extended AcceleratorOptions configuration in both standard and VLM pipelines
  • Added device selection parameters (AUTO, CPU, GPU, etc.)
  • Enhanced documentation and examples for accelerator device usage
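
The device selection described above can be sketched as a small normalization step. The enum below is a stdlib stand-in for docling's AcceleratorDevice, and the function name and alias set are illustrative rather than the PR's exact API:

```python
from enum import Enum


class AcceleratorDevice(Enum):
    """Stand-in for docling's AcceleratorDevice enum (illustrative)."""
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def resolve_accelerator_device(value: str) -> AcceleratorDevice:
    """Normalize a user-supplied device string to a known device."""
    device_map = {
        "auto": AcceleratorDevice.AUTO,
        "cpu": AcceleratorDevice.CPU,
        "cuda": AcceleratorDevice.CUDA,
        "gpu": AcceleratorDevice.CUDA,  # common alias for CUDA
        "mps": AcceleratorDevice.MPS,
    }
    key = value.strip().lower()
    if key not in device_map:
        raise ValueError(
            f"Invalid accelerator_device: {value!r}. Must be one of {sorted(device_map)}"
        )
    return device_map[key]
```

In the real pipelines, the resolved device would then be handed to docling's accelerator options together with the thread count.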

How Has This Been Tested?

  • Tested both the standard and VLM pipelines by running local_run.py successfully, both locally and on an OpenShift AI cluster
  • Environment: Tested on local Docker environment
  • Log files:
  1. https://rhods-dashboard-redhat-ods-applications.apps.dsp-nonfips-pool-wlggc.gcp.rh-ods.com/pipelineRuns/hukhan/runs/cf9523d0-3415-400d-a22e-4686b8bd268b
  2. https://rhods-dashboard-redhat-ods-applications.apps.dsp-nonfips-pool-wlggc.gcp.rh-ods.com/pipelineRuns/hukhan/runs/440450e2-71fd-42ae-8dbe-7205c10c25d9

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Related Issues

Dependencies

This PR builds upon the changes from R1065 (Refactor Common Components). It should be merged after R1065 is merged to main.

Summary by CodeRabbit

  • New Features

    • Added accelerator device selection (auto, cpu, cuda, mps) for Standard and VLM pipelines.
    • Enhanced model download options per pipeline type with optional remote endpoint support.
  • Refactor

    • Separated Standard and VLM pipelines and centralized shared components; renamed converters to match.
  • Documentation

    • Reorganized guides with new paths, added accelerator section, expanded configuration options, and updated compile/run examples.
  • Chores

    • Upgraded Kubeflow Pipelines SDK to 2.14.4.

- Merged latest main branch changes
- Resolved conflicts in docling_convert_pipeline_compiled.yaml
- Updated common components to work with latest main
- Ensured compatibility with upstream changes
- Added proper documentation and README updates
- Maintained refactored common components structure

Signed-off-by: roburishabh <[email protected]>
@RobuRishabh RobuRishabh requested a review from a team as a code owner October 1, 2025 17:24
coderabbitai bot commented Oct 1, 2025

Walkthrough

Restructures Kubeflow Pipelines into standard and VLM variants with a new common/ package for shared components and constants. Adds accelerator_device support across pipelines, updates pipeline code and compiled YAML, replaces VLM component module, and revises READMEs and local_run scripts to match new paths, names, and parameters.

Changes

  • Documentation updates (kubeflow-pipelines/README.md, kubeflow-pipelines/docling-standard/README.md, kubeflow-pipelines/docling-vlm/README.md): Repoint paths and commands to the new standard/vlm files; add the accelerator_device option and new config entries; renumber sections; update compile/run examples and YAML names.
  • Common package, shared (kubeflow-pipelines/common/__init__.py, common/components.py, common/constants.py): New shared exports and constants; add import_pdfs, create_pdf_splits, and download_docling_models components; define model/image constants; expose via __all__.
  • Standard pipeline code (kubeflow-pipelines/docling-standard/local_run.py, standard_components.py, standard_convert_pipeline.py): Rename the converter to docling_convert_standard; add the accelerator_device parameter; wire in common components; pass pipeline_type=standard to the model download; update the pipeline signature and description.
  • Standard compiled spec (kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml): Rename component/task to -standard; add the docling_accelerator_device input; extend model download with pipeline_type and remote_model_endpoint_enabled; bump kfp to 2.14.4; update logs and wiring.
  • VLM pipeline restructuring (kubeflow-pipelines/docling-vlm/docling_convert_components.py removed, vlm_components.py added, local_run.py, vlm_convert_pipeline.py): Replace the old VLM components with docling_convert_vlm; add accelerator_device and remote model handling; use common components; pass pipeline_type=vlm; update local_run and the pipeline signature.
  • VLM compiled spec (kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml): Rename component/task to -vlm; add the accelerator parameter; extend model download to support standard/vlm/vlm-remote; adjust env/logs; bump kfp to 2.14.4; update the description and wiring.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor User
    participant Pipeline as Standard Convert Pipeline
    participant Common as common.components
    participant Models as download_docling_models
    participant Convert as docling_convert_standard
    participant Storage as Artifacts/Outputs

    User->>Pipeline: Run with params (pdf_filenames, accelerator_device="auto", ...)
    Pipeline->>Common: import_pdfs()
    Common-->>Storage: PDFs downloaded
    Pipeline->>Common: create_pdf_splits(num_splits)
    Common-->>Pipeline: Splits
    Pipeline->>Models: download_docling_models(pipeline_type="standard", remote_model_endpoint_enabled=false)
    Models-->>Storage: Models prepared
    loop For each split
        Pipeline->>Convert: convert(split, accelerator_device)
        Note right of Convert: Validate accelerator_device<br/>Configure accelerator options
        Convert-->>Storage: JSON/Markdown outputs
    end
    Pipeline-->>User: Completion + output location
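The fan-out in the loop above depends on partitioning the file list into splits. A minimal sketch of such a splitter follows; the real create_pdf_splits signature and splitting strategy may differ:

```python
def create_pdf_splits(filenames: list[str], num_splits: int) -> list[list[str]]:
    """Partition filenames into at most num_splits round-robin chunks."""
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    splits = [filenames[i::num_splits] for i in range(num_splits)]
    # Drop empty splits when there are fewer files than requested splits,
    # so downstream convert tasks are only spawned for non-empty work.
    return [s for s in splits if s]
```

Each resulting split would then drive one iteration of the convert loop in the pipeline.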
sequenceDiagram
    autonumber
    actor User
    participant Pipeline as VLM Convert Pipeline
    participant Common as common.components
    participant Models as download_docling_models
    participant Convert as docling_convert_vlm
    participant Storage as Artifacts/Outputs

    User->>Pipeline: Run with params (pdf_filenames, accelerator_device, remote_model_enabled)
    Pipeline->>Common: import_pdfs()
    Common-->>Storage: PDFs downloaded
    Pipeline->>Common: create_pdf_splits()
    Common-->>Pipeline: Splits
    Pipeline->>Models: download_docling_models(pipeline_type="vlm", remote_model_endpoint_enabled=false)
    Models-->>Storage: VLM models ready
    alt remote model enabled
        Convert->>Convert: Read secrets, configure API VLM options
    else local model
        Convert->>Convert: Configure local VLM options
    end
    loop For each split
        Pipeline->>Convert: convert(split, accelerator_device)
        Note right of Convert: Validate accelerator_device and image export mode
        Convert-->>Storage: JSON/Markdown outputs
    end
    Pipeline-->>User: Completion + output location

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

I thump the ground: new paths appear,
Two trails—Standard, VLM—now clear.
With turbo paws (accelerators, aye!),
I fetch my PDFs, then hopify.
Models in a burrow, neatly shared—
YAMLs compiled, carrots prepared.
Convert complete—the meadow cheered! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — docstring coverage is 45.45%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title "R528 kfp support for accelerator devices" directly reflects the core change of adding accelerator device support to Kubeflow Pipelines and ties back to the referenced issue, making it clear and specific without extraneous detail.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (10)
kubeflow-pipelines/docling-standard/standard_components.py (4)

5-6: Avoid sys.path hacks; package the shared module.

Relying on sys.path.insert to reach common is brittle inside component containers. Prefer a proper package import (e.g., rename directory to a valid module name and install it, or vendor the shared code into the image PYTHONPATH).


14-14: Too many parameters; group config to reduce API surface.

Static analysis flags R0917/R0915. Consider grouping rarely changed options into structured config (e.g., dataclass/pydantic parsed from JSON), or split OCR/table/image config into helper functions to keep the component interface manageable.
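
As a sketch of the suggested grouping, a stdlib dataclass can absorb several related flags into one JSON-encoded parameter; the field and function names below are illustrative, not the component's actual API:

```python
import json
from dataclasses import dataclass, fields


@dataclass
class OcrConfig:
    """Hypothetical grouping of OCR-related component options."""
    enabled: bool = True
    force_full_page: bool = False
    bitmap_area_threshold: float = 0.05


def parse_ocr_config(raw: str) -> OcrConfig:
    """Parse the JSON string a pipeline would pass, ignoring unknown keys."""
    data = json.loads(raw) if raw else {}
    known = {f.name for f in fields(OcrConfig)}
    return OcrConfig(**{k: v for k, v in data.items() if k in known})
```

The component signature then takes one `ocr_config: str = "{}"` parameter instead of three booleans and a float, shrinking the KFP interface while keeping defaults in one place.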


138-155: Build allowed set from device_map to avoid drift; set device via normalized key.

Use the normalized dev above to index device_map when instantiating AcceleratorOptions.
Apply this diff:

-    pipeline_options.accelerator_options = AcceleratorOptions(
-        num_threads=num_threads, device=device_map[accelerator_device.lower()]
-    )
+    pipeline_options.accelerator_options = AcceleratorOptions(
+        num_threads=num_threads,
+        device=device_map[dev],
+    )

120-120: Logging is helpful; include device + threads for traceability.

Augment start/finish logs with accelerator_device and num_threads to aid support.
Apply this diff:

-    print(f"docling-standard-convert: starting with backend='{pdf_backend}', files={len(input_pdfs)}", flush=True)
+    print(
+        f"docling-standard-convert: starting backend='{pdf_backend}', files={len(input_pdfs)}, "
+        f"device='{accelerator_device}', threads={num_threads}",
+        flush=True,
+    )
...
-    print("docling-standard-convert: done", flush=True)
+    print("docling-standard-convert: done", flush=True)

Also applies to: 195-202

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (3)

4-6: Replace runtime path injection with proper package import.

Same as in components: avoid sys.path manipulation; distribute common as a Python package in the image and import it normally.


71-95: No GPU resources requested; 'cuda' won’t schedule on GPU nodes.

If users choose cuda/gpu, add GPU resource requests/limits and (optionally) node selectors/tolerations; otherwise the pod may run CPU-only on generic nodes.
Example (adjust to your cluster conventions):

from kfp import kubernetes
if docling_accelerator_device.lower() in ("cuda", "gpu"):
    converter.set_gpu_limit(1)  # ensure a GPU is scheduled
    # Optional: steer to GPU nodes and tolerate taints
    kubernetes.add_node_selector(converter, "nvidia.com/gpu.present", "true")
    kubernetes.add_toleration(converter, key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")

If your environment uses a different label/taint or requires explicit requests, mirror CPU pattern (e.g., set_gpu_request(1)) per your KFP/kfp-kubernetes version.

Would you like me to open a follow-up to wire GPU scheduling in both standard and VLM pipelines?


99-101: Compilation entry point: ensure compiled YAML is committed/ignored consistently.

Decide whether the compiled YAML belongs in source control. If checked in, add a CI step to recompile and diff; otherwise, add it to .gitignore.

kubeflow-pipelines/common/constants.py (3)

6-8: Centralize accelerator device choices to keep components/pipelines consistent.

Consider defining a canonical set/aliases here (e.g., ACCELERATOR_DEVICE_ALIASES = {'auto': 'auto','cpu':'cpu','cuda':'cuda','gpu':'cuda','mps':'mps'}) and reusing it in both standard and VLM components.
Example:

+# Accelerator device aliases (canonical target on the right)
+ACCELERATOR_DEVICE_ALIASES = {
+    "auto": "auto",
+    "cpu":  "cpu",
+    "cuda": "cuda",
+    "gpu":  "cuda",
+    "mps":  "mps",
+}

Then import and use in validation/mapping to avoid drift.


2-3: Revise image verification for Red Hat registry

  • Quay image tag “2.54.0” verified successfully; the UBI9 Python image tag check failed—registry.access.redhat.com’s V2 API requires authentication (or the tag may not exist).
  • Pin both images to immutable digests (image@sha256:…) and/or update your CI script to authenticate against Red Hat’s registry before querying tags.

8-8: Remove unused MODEL_TYPE_VLM_REMOTE constant
The MODEL_TYPE_VLM_REMOTE constant is only defined in constants.py and re-exported in common/__init__.py but never referenced elsewhere. Remove its declaration (constants.py) and export entries (common/__init__.py).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d688a0c and 14f35cf.

📒 Files selected for processing (15)
  • kubeflow-pipelines/README.md (7 hunks)
  • kubeflow-pipelines/common/__init__.py (1 hunks)
  • kubeflow-pipelines/common/components.py (1 hunks)
  • kubeflow-pipelines/common/constants.py (1 hunks)
  • kubeflow-pipelines/docling-standard/README.md (2 hunks)
  • kubeflow-pipelines/docling-standard/local_run.py (2 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (5 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (5 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (22 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (2 hunks)
  • kubeflow-pipelines/docling-vlm/docling_convert_components.py (0 hunks)
  • kubeflow-pipelines/docling-vlm/local_run.py (2 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (5 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (21 hunks)
💤 Files with no reviewable changes (1)
  • kubeflow-pipelines/docling-vlm/docling_convert_components.py
🧰 Additional context used
🧬 Code graph analysis (5)
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (3)
kubeflow-pipelines/common/components.py (3)
  • import_pdfs (9-115)
  • create_pdf_splits (120-138)
  • download_docling_models (143-212)
kubeflow-pipelines/docling-standard/standard_components.py (1)
  • docling_convert_standard (14-202)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)
  • convert_pipeline (23-83)
kubeflow-pipelines/docling-standard/local_run.py (2)
kubeflow-pipelines/common/components.py (3)
  • create_pdf_splits (120-138)
  • download_docling_models (143-212)
  • import_pdfs (9-115)
kubeflow-pipelines/docling-standard/standard_components.py (1)
  • docling_convert_standard (14-202)
kubeflow-pipelines/common/__init__.py (1)
kubeflow-pipelines/common/components.py (3)
  • import_pdfs (9-115)
  • create_pdf_splits (120-138)
  • download_docling_models (143-212)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (2)
kubeflow-pipelines/common/components.py (3)
  • import_pdfs (9-115)
  • create_pdf_splits (120-138)
  • download_docling_models (143-212)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
  • docling_convert_vlm (14-174)
kubeflow-pipelines/docling-vlm/local_run.py (2)
kubeflow-pipelines/common/components.py (3)
  • create_pdf_splits (120-138)
  • download_docling_models (143-212)
  • import_pdfs (9-115)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
  • docling_convert_vlm (14-174)
🪛 Pylint (3.3.8)
kubeflow-pipelines/docling-vlm/vlm_components.py

[refactor] 14-14: Too many positional arguments (10/5)

(R0917)


[refactor] 107-112: Consider using '{"model_id": remote_model_name, "parameters": dict(max_new_tokens=400), ... }' instead of a call to 'dict'.

(R1735)


[refactor] 109-111: Consider using '{"max_new_tokens": 400}' instead of a call to 'dict'.

(R1735)


[refactor] 14-14: Too many branches (14/12)

(R0912)


[refactor] 14-14: Too many statements (72/50)

(R0915)

kubeflow-pipelines/common/components.py

[refactor] 9-9: Too many branches (21/12)

(R0912)


[refactor] 9-9: Too many statements (69/50)

(R0915)

kubeflow-pipelines/docling-standard/standard_components.py

[refactor] 14-14: Too many positional arguments (18/5)

(R0917)


[refactor] 14-14: Too many statements (66/50)

(R0915)

🪛 Ruff (0.13.2)
kubeflow-pipelines/docling-vlm/vlm_components.py

23-23: Possible hardcoded password assigned to function default: "remote_model_secret_mount_path"

(S107)


43-43: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


44-44: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


45-45: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


49-49: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


50-50: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


51-51: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


52-52: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


63-63: Avoid specifying long messages outside the exception class

(TRY003)


67-69: Avoid specifying long messages outside the exception class

(TRY003)


73-73: Avoid specifying long messages outside the exception class

(TRY003)


75-75: Possible hardcoded password assigned to: "remote_model_endpoint_url_secret"

(S105)


81-81: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Possible hardcoded password assigned to: "remote_model_name_secret"

(S105)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


91-91: Possible hardcoded password assigned to: "remote_model_api_key_secret"

(S105)


97-97: Avoid specifying long messages outside the exception class

(TRY003)


100-100: Avoid specifying long messages outside the exception class

(TRY003)


128-130: Avoid specifying long messages outside the exception class

(TRY003)

kubeflow-pipelines/common/components.py

14-14: Possible hardcoded password assigned to function default: "s3_secret_mount_path"

(S107)


33-33: Avoid specifying long messages outside the exception class

(TRY003)


40-40: Avoid specifying long messages outside the exception class

(TRY003)


42-42: Possible hardcoded password assigned to: "s3_endpoint_url_secret"

(S105)


48-48: Avoid specifying long messages outside the exception class

(TRY003)


50-50: Possible hardcoded password assigned to: "s3_access_key_secret"

(S105)


56-56: Avoid specifying long messages outside the exception class

(TRY003)


58-58: Possible hardcoded password assigned to: "s3_secret_key_secret"

(S105)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


66-66: Possible hardcoded password assigned to: "s3_bucket_secret"

(S105)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


74-74: Possible hardcoded password assigned to: "s3_prefix_secret"

(S105)


80-80: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)


102-102: Avoid specifying long messages outside the exception class

(TRY003)


212-212: Avoid specifying long messages outside the exception class

(TRY003)

kubeflow-pipelines/common/__init__.py

23-32: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py

63-63: Possible hardcoded password assigned to: "remote_model_secret_mount_path"

(S105)

kubeflow-pipelines/docling-standard/standard_components.py

141-143: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (15)
kubeflow-pipelines/docling-standard/README.md (1)

7-48: LGTM! Documentation updates are clear and complete.

The documentation correctly reflects the refactored structure with standard-specific paths and comprehensively documents the new docling_accelerator_device parameter along with other configuration options. The parameter values (auto, cpu, cuda, mps) align with common accelerator device types.

kubeflow-pipelines/docling-vlm/local_run.py (3)

1-6: LGTM! Path manipulation enables local imports.

The sys.path modification correctly allows importing from the parent directory to access the common package for local testing.


10-17: LGTM! Import structure is clean.

The imports correctly reference the new common package and VLM-specific components, aligning with the refactored structure.


37-50: LGTM! Accelerator device support properly integrated.

The pipeline correctly:

  • Uses MODEL_TYPE_VLM constant for model downloads
  • Passes accelerator_device="auto" to the VLM converter
  • Sets remote_model_enabled=False for local testing
kubeflow-pipelines/docling-vlm/README.md (1)

7-41: LGTM! VLM documentation is comprehensive.

The documentation correctly:

  • References VLM-specific pipeline paths
  • Documents the docling_accelerator_device parameter with appropriate recommendation for GPU usage in VLM workloads
  • Includes all relevant configuration options
  • Provides accurate compilation instructions
kubeflow-pipelines/docling-standard/local_run.py (2)

1-14: LGTM! Import structure mirrors VLM pipeline.

The sys.path manipulation and imports correctly reference the common package and standard-specific components, maintaining consistency with the VLM pipeline structure.


33-45: LGTM! Standard pipeline correctly configured.

The pipeline properly:

  • Uses MODEL_TYPE_STANDARD constant for model downloads
  • Invokes docling_convert_standard instead of the generic converter
  • Passes accelerator_device="auto" parameter
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (5)

1-17: LGTM! Import structure is clean and consistent.

The sys.path manipulation and imports correctly reference the common package and VLM-specific components, maintaining consistency across the codebase.


34-34: LGTM! Pipeline signature extended for accelerator support.

The docling_accelerator_device parameter with default value "auto" is appropriately added to the pipeline signature, enabling users to configure accelerator devices.


56-59: LGTM! Model download correctly configured for VLM.

The model download uses MODEL_TYPE_VLM constant and properly passes the docling_remote_model_enabled flag, ensuring the correct models are downloaded for VLM pipelines.


64-74: LGTM! Accelerator device parameter properly threaded.

The accelerator_device parameter is correctly passed from the pipeline signature to the docling_convert_vlm component, enabling accelerator configuration at runtime.


87-89: LGTM! Output filename correctly updated.

The compiled YAML filename and log message appropriately reflect the VLM-specific pipeline naming convention.

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (2)

65-68: Good: centralized model download inputs via common + explicit pipeline_type.

The use of MODEL_TYPE_STANDARD and remote_model_endpoint_enabled=False clarifies intent and makes flows consistent with VLM.


9-15: Common re-exports verified. The symbols import_pdfs, create_pdf_splits, download_docling_models, and MODEL_TYPE_STANDARD are correctly re-exported in kubeflow-pipelines/common/__init__.py.

kubeflow-pipelines/common/__init__.py (1)

8-32: Clean export surface

Re-exporting shared components/constants here keeps downstream imports tidy. 👍

Comment on lines +74 to +113
        s3_prefix_secret = "S3_PREFIX"
        s3_prefix_file_path = os.path.join(s3_secret_mount_path, s3_prefix_secret)
        if os.path.isfile(s3_prefix_file_path):
            with open(s3_prefix_file_path) as f:
                s3_prefix = f.read()
        else:
            raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")

        if not s3_endpoint_url:
            raise ValueError("S3_ENDPOINT_URL must be provided")

        if not s3_bucket:
            raise ValueError("S3_BUCKET must be provided")

        s3_client = boto3.client(
            's3',
            endpoint_url=s3_endpoint_url,
            aws_access_key_id=s3_access_key,
            aws_secret_access_key=s3_secret_key,
        )

        for filename in filenames_list:
            orig = f"{s3_prefix.rstrip('/')}/{filename.lstrip('/')}"
            dest = output_path_p / filename
            print(f"import-test-pdfs: downloading {orig} -> {dest} from s3", flush=True)
            s3_client.download_file(s3_bucket, orig, dest)
    else:
        if not base_url:
            raise ValueError("base_url must be provided")

        for filename in filenames_list:
            url = f"{base_url.rstrip('/')}/{filename.lstrip('/')}"
            dest = output_path_p / filename
            print(f"import-test-pdfs: downloading {url} -> {dest}", flush=True)
            with requests.get(url, stream=True, timeout=30) as resp:
                resp.raise_for_status()
                with dest.open("wb") as f:
                    for chunk in resp.iter_content(chunk_size=8192):
                        if chunk:
                            f.write(chunk)

⚠️ Potential issue | 🟠 Major

Handle empty/whitespace S3 prefixes before building the object key

If S3_PREFIX is left blank (quite common when pulling from the bucket root), the current expression builds orig = "/<filename>", which does not match the actual key and the download fails. Newlines from the secret files also bleed into the key and credentials because we never strip them. Please trim the secret contents and only prepend the prefix when it’s non-empty.

-        if os.path.isfile(s3_prefix_file_path):
-            with open(s3_prefix_file_path) as f:
-                s3_prefix = f.read()
+        if os.path.isfile(s3_prefix_file_path):
+            with open(s3_prefix_file_path) as f:
+                s3_prefix = f.read().strip()
         else:
             raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")
 
-        if not s3_endpoint_url:
+        s3_endpoint_url = s3_endpoint_url.strip()
+        s3_access_key = s3_access_key.strip()
+        s3_secret_key = s3_secret_key.strip()
+        s3_bucket = s3_bucket.strip()
+
+        if not s3_endpoint_url:
             raise ValueError("S3_ENDPOINT_URL must be provided")
 
         if not s3_bucket:
             raise ValueError("S3_BUCKET must be provided")
 
         s3_client = boto3.client(
             's3',
             endpoint_url=s3_endpoint_url,
             aws_access_key_id=s3_access_key,
             aws_secret_access_key=s3_secret_key,
         )
 
         for filename in filenames_list:
-            orig = f"{s3_prefix.rstrip('/')}/{filename.lstrip('/')}"
+            key = filename.lstrip('/')
+            prefix = s3_prefix.rstrip('/')
+            orig = f"{prefix}/{key}" if prefix else key
             dest = output_path_p / filename
             print(f"import-test-pdfs: downloading {orig} -> {dest} from s3", flush=True)
             s3_client.download_file(s3_bucket, orig, dest)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

        s3_prefix_secret = "S3_PREFIX"
        s3_prefix_file_path = os.path.join(s3_secret_mount_path, s3_prefix_secret)
        if os.path.isfile(s3_prefix_file_path):
            with open(s3_prefix_file_path) as f:
                s3_prefix = f.read()
        else:
            raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")
        if not s3_endpoint_url:
            raise ValueError("S3_ENDPOINT_URL must be provided")
        if not s3_bucket:
            raise ValueError("S3_BUCKET must be provided")
        s3_client = boto3.client(
            's3',
            endpoint_url=s3_endpoint_url,
            aws_access_key_id=s3_access_key,
            aws_secret_access_key=s3_secret_key,
        )
        for filename in filenames_list:
            orig = f"{s3_prefix.rstrip('/')}/{filename.lstrip('/')}"
            dest = output_path_p / filename
            print(f"import-test-pdfs: downloading {orig} -> {dest} from s3", flush=True)
            s3_client.download_file(s3_bucket, orig, dest)
    else:
        if not base_url:
            raise ValueError("base_url must be provided")
        for filename in filenames_list:
            url = f"{base_url.rstrip('/')}/{filename.lstrip('/')}"
            dest = output_path_p / filename
            print(f"import-test-pdfs: downloading {url} -> {dest}", flush=True)
            with requests.get(url, stream=True, timeout=30) as resp:
                resp.raise_for_status()
                with dest.open("wb") as f:
                    for chunk in resp.iter_content(chunk_size=8192):
                        if chunk:
                            f.write(chunk)

        s3_prefix_secret = "S3_PREFIX"
        s3_prefix_file_path = os.path.join(s3_secret_mount_path, s3_prefix_secret)
        if os.path.isfile(s3_prefix_file_path):
            with open(s3_prefix_file_path) as f:
                s3_prefix = f.read().strip()
        else:
            raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")
        s3_endpoint_url = s3_endpoint_url.strip()
        s3_access_key = s3_access_key.strip()
        s3_secret_key = s3_secret_key.strip()
        s3_bucket = s3_bucket.strip()
        if not s3_endpoint_url:
            raise ValueError("S3_ENDPOINT_URL must be provided")
        if not s3_bucket:
            raise ValueError("S3_BUCKET must be provided")
        s3_client = boto3.client(
            's3',
            endpoint_url=s3_endpoint_url,
            aws_access_key_id=s3_access_key,
            aws_secret_access_key=s3_secret_key,
        )
        for filename in filenames_list:
            key = filename.lstrip('/')
            prefix = s3_prefix.rstrip('/')
            orig = f"{prefix}/{key}" if prefix else key
            dest = output_path_p / filename
            print(f"import-test-pdfs: downloading {orig} -> {dest} from s3", flush=True)
            s3_client.download_file(s3_bucket, orig, dest)
🧰 Tools
🪛 Ruff (0.13.2)

74-74: Possible hardcoded password assigned to: "s3_prefix_secret"

(S105)


80-80: Avoid specifying long messages outside the exception class

(TRY003)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)


102-102: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In kubeflow-pipelines/common/components.py around lines 74 to 113, the code
reads S3 secret values and builds object keys but doesn't strip
whitespace/newlines and unconditionally prepends a slash when S3_PREFIX is empty
which produces keys like "/filename". Trim whitespace from the secret contents
(s3_prefix, s3_access_key, s3_secret_key) using .strip(), and construct the S3
object key by only prepending the prefix when the stripped s3_prefix is
non-empty (e.g., join prefix and filename without producing a leading slash if
prefix is empty), leaving filename.lstrip('/') as-is so you never end up with a
leading '/' in the final key. Ensure any other secret-derived strings are
similarly stripped before use.
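
The fix requested above boils down to a small, easily testable helper; the function name is illustrative:

```python
def build_s3_key(prefix: str, filename: str) -> str:
    """Join an optional key prefix and a filename without producing a leading '/'.

    Secret-derived values often carry trailing newlines, so strip both inputs
    first; an empty prefix must not contribute a separator.
    """
    prefix = prefix.strip().rstrip("/")
    key = filename.strip().lstrip("/")
    return f"{prefix}/{key}" if prefix else key
```

The download loop can then call `build_s3_key(s3_prefix, filename)` and get a valid object key whether or not S3_PREFIX is set.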

    enrich_formula: bool = False,
    enrich_picture_classes: bool = False,
    enrich_picture_description: bool = False,
    accelerator_device: str = "auto",  # parameter for accelerator device

⚠️ Potential issue | 🟠 Major

Device param naming vs docs: accept 'gpu' alias.

PR text mentions AUTO/CPU/GPU, but code only accepts 'cuda'. Add 'gpu' as an alias to prevent user-visible ValueError.
Apply this diff near device validation/mapping:

-    # device validation
-    allowed_devices = ["auto", "cpu", "cuda", "mps"]
-    if accelerator_device.lower() not in allowed_devices:
-        raise ValueError(
-            f"Invalid accelerator_device: {accelerator_device}. Must be one of {allowed_devices}"
-        )
-    
-    # Map string to AcceleratorDevice enum
-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "cuda": AcceleratorDevice.CUDA,
-        "mps": AcceleratorDevice.MPS,
-    }
+    # Device aliases and validation (keep allowed values in one place to avoid drift)
+    device_map = {
+        "auto": AcceleratorDevice.AUTO,
+        "cpu":  AcceleratorDevice.CPU,
+        "cuda": AcceleratorDevice.CUDA,
+        "gpu":  AcceleratorDevice.CUDA,  # alias
+        "mps":  AcceleratorDevice.MPS,
+    }
+    dev = accelerator_device.lower()
+    if dev not in device_map:
+        raise ValueError(
+            f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}"
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```diff
-    accelerator_device: str = "auto",  # parameter for accelerator device
+    # Device aliases and validation (keep allowed values in one place to avoid drift)
+    device_map = {
+        "auto": AcceleratorDevice.AUTO,
+        "cpu":  AcceleratorDevice.CPU,
+        "cuda": AcceleratorDevice.CUDA,
+        "gpu":  AcceleratorDevice.CUDA,  # alias
+        "mps":  AcceleratorDevice.MPS,
+    }
+    dev = accelerator_device.lower()
+    if dev not in device_map:
+        raise ValueError(
+            f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}"
+        )
```
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/standard_components.py around line 32,
the accelerator_device parameter and its validation only accept 'cuda' but the
docs/PR mention 'gpu' as an allowed alias; update the device validation/mapping
to treat 'gpu' (case-insensitive) as equivalent to 'cuda' (and continue to
accept 'auto' and 'cpu'), map 'gpu' -> 'cuda' before downstream use, and raise
the same ValueError for truly unsupported values; ensure any error messages
mention supported values including 'gpu'.
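The alias handling both prompts ask for boils down to normalizing the user-facing string before it reaches Docling. A minimal sketch, assuming plain device-name strings rather than the actual `AcceleratorDevice` enum:

```python
def normalize_device(accelerator_device: str) -> str:
    """Map a user-facing device name to the value Docling expects.

    Accepts 'gpu' (case-insensitive) as an alias for 'cuda', per the PR docs.
    """
    device_map = {
        "auto": "auto",
        "cpu": "cpu",
        "cuda": "cuda",
        "gpu": "cuda",  # alias so docs saying "GPU" keep working
        "mps": "mps",
    }
    dev = accelerator_device.strip().lower()
    if dev not in device_map:
        raise ValueError(
            f"Invalid accelerator_device: {accelerator_device}. "
            f"Must be one of {sorted(device_map)}"
        )
    return device_map[dev]
```

Keeping the aliases and the validation in one mapping means the error message and the accepted values cannot drift apart.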

Comment on lines 281 to 285
```diff
       - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
         \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
-        \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.3'\
+        \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.4'\
         \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
         $0\" \"$@\"\n"
```

⚠️ Potential issue | 🔴 Critical

kfp==2.14.4 is not published, so pip install will abort

All executor bootstrap scripts pin to kfp==2.14.4, but PyPI only ships 2.14.3 as of October 1, 2025. Every task will fail during setup instead of running the component code. Please drop back to the latest available release.

(pypi.org)

-        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.4' '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"$0\" \"$@\"\n"
+        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.3' '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"$0\" \"$@\"\n"

Also applies to: 317-321, 460-463, 526-530

🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
around lines 281-285 (and similarly at 317-321, 460-463, 526-530), the executor
bootstrap script pins kfp==2.14.4 which is not published; update the pinned
version to the latest available release on PyPI (e.g., kfp==2.14.3) or change to
a permissive spec (e.g., kfp>=2.14.3,<3) so pip install succeeds; edit each
occurrence to replace 2.14.4 with the chosen valid version/spec and keep the
surrounding bootstrap logic unchanged.

docling_enrich_formula: bool = False,
docling_enrich_picture_classes: bool = False,
docling_enrich_picture_description: bool = False,
docling_accelerator_device: str = "auto", # parameter for accelerator device

⚠️ Potential issue | 🟠 Major

Param value mismatch: docs say GPU, code expects cuda.

Either update docs to say cuda, or accept 'gpu' here to match user expectations (and forward to the component).
Apply this diff to normalize earlier and pass through:

-    docling_accelerator_device: str = "auto",  # parameter for accelerator device
+    docling_accelerator_device: str = "auto",  # auto|cpu|cuda|gpu|mps

And (optional) normalize before use:

-            accelerator_device=docling_accelerator_device,  # parameter for accelerator device
+            accelerator_device=docling_accelerator_device,

(The component change will accept 'gpu'.)

📝 Committable suggestion
Suggested change

```diff
 # In kubeflow-pipelines/docling-standard/standard_convert_pipeline.py,
 # expand the parameter doc on line 43:
-    docling_accelerator_device: str = "auto",  # parameter for accelerator device
+    docling_accelerator_device: str = "auto",  # auto|cpu|cuda|gpu|mps
 # …later, when passing through to the component, drop the stale comment:
-            accelerator_device=docling_accelerator_device,  # parameter for accelerator device
+            accelerator_device=docling_accelerator_device,
```
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/standard_convert_pipeline.py around line
43, the parameter docling_accelerator_device currently defaults/accepts values
like "cuda" while docs reference "GPU"; update the parameter handling so
user-facing values accept "gpu" (and "GPU") and are normalized to the
component-expected "cuda" before being passed through — either change the
parameter default/docs to "cuda" and update docs, or add a small normalization
step that maps case-insensitive "gpu" -> "cuda" (and leaves "auto"/"cpu"/"cuda"
unchanged) so both docs and code are consistent.

Comment on lines +125 to +146
```python
    # device validation
    allowed_devices = ["auto", "cpu", "cuda", "mps"]
    if accelerator_device.lower() not in allowed_devices:
        raise ValueError(
            f"Invalid accelerator_device: {accelerator_device}. Must be one of {allowed_devices}"
        )

    # Map string to AcceleratorDevice enum
    device_map = {
        "auto": AcceleratorDevice.AUTO,
        "cpu": AcceleratorDevice.CPU,
        "cuda": AcceleratorDevice.CUDA,
        "mps": AcceleratorDevice.MPS,
    }

    pipeline_cls = VlmPipeline
    pipeline_options.artifacts_path = artifacts_path_p
    pipeline_options.document_timeout = float(timeout_per_document)
    # Replace lines 127-129 with:
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=num_threads, device=device_map[accelerator_device.lower()]
    )
```

⚠️ Potential issue | 🟠 Major

Allow full accelerator device syntax (including cuda:N)

New validation hard-blocks legitimate Docling values such as cuda:0/cuda:1. AcceleratorOptions explicitly supports the cuda:<index> forms, so this now throws for common multi-GPU setups. Please accept the broader syntax (or delegate validation to Docling) so we don’t regress deployments that pin to a specific GPU.

Apply this diff to handle both enumerated names and cuda:<index>:

-    allowed_devices = ["auto", "cpu", "cuda", "mps"]
-    if accelerator_device.lower() not in allowed_devices:
-        raise ValueError(
-            f"Invalid accelerator_device: {accelerator_device}. Must be one of {allowed_devices}"
-        )
-
-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "cuda": AcceleratorDevice.CUDA,
-        "mps": AcceleratorDevice.MPS,
-    }
-
-    pipeline_options.accelerator_options = AcceleratorOptions(
-        num_threads=num_threads, device=device_map[accelerator_device.lower()]
-    )
+    normalized_device = accelerator_device.strip().lower()
+    if normalized_device.startswith("cuda:"):
+        target_device = normalized_device
+    else:
+        try:
+            target_device = AcceleratorDevice(normalized_device)
+        except ValueError as exc:
+            raise ValueError(
+                "Invalid accelerator_device. Expected 'auto', 'cpu', 'cuda', 'mps', or 'cuda:<index>'."
+            ) from exc
+
+    pipeline_options.accelerator_options = AcceleratorOptions(
+        num_threads=num_threads,
+        device=target_device,
+    )
📝 Committable suggestion
Suggested change
```diff
-    # device validation
-    allowed_devices = ["auto", "cpu", "cuda", "mps"]
-    if accelerator_device.lower() not in allowed_devices:
-        raise ValueError(
-            f"Invalid accelerator_device: {accelerator_device}. Must be one of {allowed_devices}"
-        )
-
-    # Map string to AcceleratorDevice enum
-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "cuda": AcceleratorDevice.CUDA,
-        "mps": AcceleratorDevice.MPS,
-    }
-
-    pipeline_cls = VlmPipeline
-    pipeline_options.artifacts_path = artifacts_path_p
-    pipeline_options.document_timeout = float(timeout_per_document)
-    # Replace lines 127-129 with:
-    pipeline_options.accelerator_options = AcceleratorOptions(
-        num_threads=num_threads, device=device_map[accelerator_device.lower()]
-    )
+    # device validation
+    normalized_device = accelerator_device.strip().lower()
+    if normalized_device.startswith("cuda:"):
+        target_device = normalized_device
+    else:
+        try:
+            target_device = AcceleratorDevice(normalized_device)
+        except ValueError as exc:
+            raise ValueError(
+                "Invalid accelerator_device. Expected 'auto', 'cpu', 'cuda', 'mps', or 'cuda:<index>'."
+            ) from exc
+
+    pipeline_cls = VlmPipeline
+    pipeline_options.artifacts_path = artifacts_path_p
+    pipeline_options.document_timeout = float(timeout_per_document)
+    pipeline_options.accelerator_options = AcceleratorOptions(
+        num_threads=num_threads,
+        device=target_device,
+    )
```
🧰 Tools
🪛 Ruff (0.13.2)

128-130: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/vlm_components.py around lines 125 to 146, the
current validation rejects valid values like "cuda:0" and mapping only handles
bare names; update the logic to accept both enumerated names and "cuda:<index>"
forms by loosening validation and mapping: allow strings that match
r"^cuda:\d+$" (case-insensitive) in addition to "auto","cpu","cuda","mps", and
when a "cuda:<index>" form is provided map it to AcceleratorDevice.CUDA for
AcceleratorOptions while preserving the original "cuda:<index>" string (or pass
the string to AcceleratorOptions if it accepts device strings) so multi-GPU
pinned values are accepted; adjust the device_map/assignment accordingly and
ensure accelerator_device.lower() is used for comparisons.
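The `cuda:<index>`-tolerant validation the prompt describes can be sketched as a standalone function. This is an illustrative sketch using plain strings; the real component would hand the result to Docling's `AcceleratorOptions`.

```python
import re

def resolve_device(accelerator_device: str) -> str:
    """Validate a device string, allowing pinned-GPU forms like 'cuda:1'.

    Returns the normalized (lowercased, stripped) device string unchanged,
    so multi-GPU deployments that pin an index are not rejected.
    """
    dev = accelerator_device.strip().lower()
    if re.fullmatch(r"cuda:\d+", dev):
        return dev  # pass the pinned-GPU form through untouched
    if dev in {"auto", "cpu", "cuda", "mps"}:
        return dev
    raise ValueError(
        "Invalid accelerator_device. "
        "Expected 'auto', 'cpu', 'cuda', 'mps', or 'cuda:<index>'."
    )
```

Delegating the final check to Docling would also work; the regex route simply fails fast with a pipeline-level error message.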

Comment on lines 199 to 203
```diff
       - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip ||\
         \ python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1\
-        \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.3'\
+        \ python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.4'\
         \ '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"\
         $0\" \"$@\"\n"
```

⚠️ Potential issue | 🔴 Critical

kfp==2.14.4 will break every task start-up

pip install ... 'kfp==2.14.4' fails today because PyPI only exposes releases through 2.14.3. Every executor shell here (create-pdf-splits, docling-convert-vlm, download-docling-models, import-pdfs) will abort before the component code runs. Please pin to the latest published build instead. (pypi.org)

Apply this diff across the four executor blocks:

-        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.4' '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"$0\" \"$@\"\n"
+        - "\nif ! [ -x \"$(command -v pip)\" ]; then\n    python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip\nfi\n\nPIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp==2.14.3' '--no-deps' 'typing-extensions>=3.7.4,<5; python_version<\"3.9\"' && \"$0\" \"$@\"\n"

Also applies to: 235-239, 371-374, 438-442

🤖 Prompt for AI Agents
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml around lines
199-203 (and also update the same pattern at 235-239, 371-374, 438-442): the
shell executor tries to pip install kfp==2.14.4 which is not available on PyPI
and causes task startup to abort; change the pinned package to the latest
published release (kfp==2.14.3) in each of the four executor blocks so the pip
install succeeds, keeping the rest of the pip flags intact and applying the same
replacement to the other three line ranges mentioned.

@alinaryan
Copy link
Contributor

alinaryan commented Oct 7, 2025

Could you title this following the format Fabiano is using in #19?
It will be easier to map to a Jira ticket.

@RobuRishabh RobuRishabh closed this Nov 3, 2025