[RHAIENG-1045] Add a notebook for standard conversion #19
Conversation
Walkthrough

Adds a new Jupyter notebook implementing a Docling-based PDF → JSON/Markdown conversion workflow with two pipeline variants (standard and OCR-forced), configurable enrichments, per-file exports, and per-file/per-page confidence computation and reporting.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant Notebook
participant PipelineFactory as "Pipeline factory (standard / OCR-forced)"
participant Converter as "Docling converter"
participant Exporter as "JSON & Markdown exporter"
participant FS as "Filesystem (outdir)"
User->>Notebook: open notebook, provide input URLs & outdir, choose pipeline
Notebook->>PipelineFactory: create pipelines (base + variant, enrichments)
loop each PDF
Notebook->>PipelineFactory: select pipeline for PDF
PipelineFactory->>Converter: run conversion
alt conversion success
Converter-->>PipelineFactory: Docling Document + per-page confidence
PipelineFactory-->>Notebook: result
Notebook->>Exporter: write JSON, Markdown, images
Exporter->>FS: save files to outdir
Notebook-->>User: report output paths and confidence summary
else conversion failure
Converter-->>Notebook: error
Notebook-->>User: surface error and suggestions (e.g., enable OCR)
end
end
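In code, the loop the diagram describes boils down to roughly the following sketch. This is a minimal illustration rather than the notebook's exact cell: the input URL and output directory are placeholders, and the default standard pipeline is assumed.

import json
from pathlib import Path

from docling.document_converter import DocumentConverter
from docling_core.types.doc import ImageRefMode

converter = DocumentConverter()  # standard PDF pipeline by default
output_dir = Path("output")      # placeholder outdir
output_dir.mkdir(parents=True, exist_ok=True)

for file in ["https://example.com/sample.pdf"]:  # placeholder input URL
    result = converter.convert(file)             # run the conversion
    document = result.document
    stem = Path(file).stem
    # Export JSON and Markdown side by side in outdir
    (output_dir / f"{stem}.json").write_text(
        json.dumps(document.export_to_dict()), encoding="utf-8"
    )
    (output_dir / f"{stem}.md").write_text(
        document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED),
        encoding="utf-8",
    )
    # Per-file confidence summary, as in the report step above
    print(file, result.confidence.mean_grade.name, f"{result.confidence.mean_score:.3f}")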
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 3 passed.
Force-pushed 7f932b9 to 222fbe1.

This is ready for review.
Force-pushed 222fbe1 to 169fae8.
Actionable comments posted: 2
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
195-195: Consider using literal `{}` instead of `dict()`.

Using `{}` is more idiomatic and slightly more efficient than calling `dict()`. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}
197-197: Remove unused `json_files` list.

The `json_files` list is initialized and populated but never used afterward. If this was intended for future use, consider adding a comment or TODO; otherwise, remove it to reduce cognitive load. Apply this diff to remove the unused list:

-json_files = []
-
 for file in files:
     print(f"Converting {file}...")
@@ -216,7 +214,6 @@
     with open(json_output_path, "w", encoding="utf-8") as f:
         json.dump(dictionary, f)
     print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
-    json_files.append(json_output_path.resolve())

Also applies to: 220-220
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 89-89: Consider using '{}' instead of a call to 'dict'.
(R1735)
Force-pushed 169fae8 to 17a5a79.
Actionable comments posted: 2
🧹 Nitpick comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
194-194: Consider using `{}` instead of `dict()`.

For consistency with Python style conventions, use the literal syntax. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 88-88: Consider using '{}' instead of a call to 'dict'.
(R1735)
| "for file, confidence_report in confidence_reports.items():\n", | ||
| " print(f\"Conversion confidence for {file}:\")\n", | ||
| " \n", | ||
| " print(f\"Average confidence: \\x1b[1m{confidence_report.mean_grade.name}\\033[0m (score {confidence_report.mean_score:.3f})\")\n", |
ANSI escape codes may not render in all notebook environments.
While ANSI escape codes (\x1b[1m and \033[0m) work in terminal output, they may not render correctly in all Jupyter notebook environments or when exported to other formats. Consider using Markdown formatting in the print statement or using IPython's display with HTML instead for better cross-environment compatibility.
For better compatibility, consider using IPython's display capabilities:
from IPython.display import display, Markdown
display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))Or simply remove the escape codes if bold formatting is not critical:
- print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+ print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")🤖 Prompt for AI Agents
notebooks/use-cases/document-conversion-standard.ipynb around line 251: the
print uses ANSI escape codes which may not render in many notebook/export
environments; replace the ANSI formatting by either using
IPython.display.display with Markdown to render bold (e.g., construct a Markdown
string with **...**) or simply remove the escape sequences and print plain text;
ensure you import display and Markdown from IPython.display when using display, and format
the score to three decimals as before.
Force-pushed 17a5a79 to d8ee70c.
Signed-off-by: Fabiano Franz <[email protected]>
Force-pushed d8ee70c to 0d2a207.
Actionable comments posted: 0
♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
246-246: ANSI escape codes may not render in all notebook environments.

The ANSI escape sequences (`\x1b[1m` and `\033[0m`) may not display correctly when the notebook is exported or viewed in certain environments. Consider using IPython's display capabilities for better compatibility:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"Conversion confidence for {file}:")
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
193-193: Consider using literal `{}` instead of `dict()`.

For consistency with modern Python style, prefer the literal syntax. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 87-87: Consider using '{}' instead of a call to 'dict'.
(R1735)
🔇 Additional comments (4)
notebooks/use-cases/document-conversion-standard.ipynb (4)
1-58: LGTM! The installation and file configuration sections are well-structured. The use of public sample files from GitHub is appropriate for demonstration purposes.

61-81: LGTM! Proper use of pathlib for directory creation with appropriate safety flags.

84-171: LGTM! The three converter configurations (standard, OCR, enrichment) are properly set up with distinct pipeline options. The documentation accurately references the correct converter variable names, and the Tesseract environment variable is appropriately configured.

260-271: LGTM! The additional resources and feedback section provides helpful links and encourages user engagement.
alinaryan left a comment
This LGTM! I tried running the notebook with the default files and then also with a random text PDF I had on my laptop, and both outputs looked great and ended with a confidence score of "Excellent". I am happy with the repo structure too; I can easily see the other use cases fitting into this folder structure. I didn't try the other conversion pipelines (OCR or enrichment), but maybe you can demo that for us.
Signed-off-by: Fabiano Franz <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
254-254: ANSI escape codes remain - previous feedback not addressed.

The ANSI escape codes (`\x1b[1m` and `\033[0m`) flagged in a previous review are still present. These may not render correctly in all Jupyter environments, especially when exporting to HTML, PDF, or viewing in JupyterLab vs classic Notebook. As suggested in the previous review, use IPython's display capabilities for better compatibility:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"Conversion confidence for {file}:")
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the formatting:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
201-201: Prefer `{}` over `dict()` for empty dictionary initialization.

As per static analysis:

-confidence_reports = dict()
+confidence_reports = {}
203-229: Add error handling to prevent single file failures from stopping the entire batch.

The conversion loop lacks error handling. If any file fails to download, convert, or export, the entire batch processing will stop. This could be frustrating when processing multiple documents, especially with remote URLs that may be temporarily unavailable.
Consider wrapping the conversion logic in a try-except block:
 for file in files:
-    # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
-    converter = standard_converter
-
-    # Convert the file
-    print(f"Converting {file}...")
-    result = converter.convert(file)
-    document = result.document
-    dictionary = document.export_to_dict()
-
-    file_path = Path(file)
-
-    # Calculate conversion confidence
-    confidence_reports[file] = result.confidence
-
-    # Export the document to JSON
-    json_output_path = (output_dir / f"{file_path.stem}.json")
-    with open(json_output_path, "w", encoding="utf-8") as f:
-        json.dump(dictionary, f)
-    print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
-
-    # Export the document to Markdown
-    md_output_path = output_dir / f"{file_path.stem}.md"
-    with open(md_output_path, "w", encoding="utf-8") as f:
-        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
-        f.write(markdown)
-    print(f"Path of markdown output is: {Path(md_output_path).resolve()}")
+    try:
+        # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
+        converter = standard_converter
+
+        # Convert the file
+        print(f"Converting {file}...")
+        result = converter.convert(file)
+        document = result.document
+        dictionary = document.export_to_dict()
+
+        file_path = Path(file)
+
+        # Calculate conversion confidence
+        confidence_reports[file] = result.confidence
+
+        # Export the document to JSON
+        json_output_path = (output_dir / f"{file_path.stem}.json")
+        with open(json_output_path, "w", encoding="utf-8") as f:
+            json.dump(dictionary, f)
+        print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
+
+        # Export the document to Markdown
+        md_output_path = output_dir / f"{file_path.stem}.md"
+        with open(md_output_path, "w", encoding="utf-8") as f:
+            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
+            f.write(markdown)
+        print(f"Path of markdown output is: {Path(md_output_path).resolve()}")
+    except Exception as e:
+        print(f"Error processing {file}: {e}")
+        continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 87-87: Consider using '{}' instead of a call to 'dict'.
(R1735)
| ")\n", | ||
| "\n", | ||
| "# Force OCR on the entire page\n", | ||
| "%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n", |
Hardcoded Tesseract data path may not work in all environments.
The TESSDATA_PREFIX environment variable is set to /usr/share/tesseract/tessdata, which assumes a specific installation location. This path may not exist in all environments, container images, or operating systems.
Consider one of these alternatives:
- Remove the line entirely if the default Tesseract installation already sets this variable correctly
- Check if the path exists before setting it
- Document the assumption that users must have Tesseract installed at this specific location
Example for option 2:
# Force OCR on the entire page
-%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
+import os
+tessdata_path = "/usr/share/tesseract/tessdata"
+if os.path.exists(tessdata_path):
+ %env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
+else:
+ print(f"Warning: Tesseract data path {tessdata_path} not found. Using default.")Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In notebooks/use-cases/document-conversion-standard.ipynb around line 139, the
notebook hardcodes TESSDATA_PREFIX to /usr/share/tesseract/tessdata which may
not exist in all environments; replace this by either removing the line, or
adding a conditional that detects whether that path exists and only sets
TESSDATA_PREFIX if present, otherwise leave it unset and add a short note in the
notebook explaining the assumption and how to set TESSDATA_PREFIX for custom
installs or containers.
| "source": [ | ||
| "from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n", | ||
| "from docling.document_converter import DocumentConverter, PdfFormatOption\n", | ||
| "from docling.datamodel.base_models import InputFormat\n", | ||
| "from docling.datamodel.pipeline_options import (\n", | ||
| " TesseractOcrOptions,\n", | ||
| " PdfPipelineOptions,\n", | ||
| ")\n", | ||
| "from docling.pipeline.vlm_pipeline import VlmPipeline\n", | ||
| "from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\n", | ||
| "\n", | ||
| "# Standard pipeline options\n", | ||
| "standard_pipeline_options = PdfPipelineOptions()\n", | ||
| "standard_pipeline_options.generate_picture_images = True\n", | ||
| "standard_pipeline_options.do_table_structure = True\n", | ||
| "standard_pipeline_options.table_structure_options.do_cell_matching = True\n", | ||
| "standard_converter = DocumentConverter(\n", | ||
| " format_options={\n", | ||
| " InputFormat.PDF: PdfFormatOption(\n", | ||
| " pipeline_options=standard_pipeline_options,\n", | ||
| " backend=DoclingParseV4DocumentBackend,\n", | ||
| " )\n", | ||
| " }\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Force OCR on the entire page\n", | ||
| "%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n", | ||
| "ocr_pipeline_options = PdfPipelineOptions()\n", | ||
| "ocr_pipeline_options.generate_picture_images = True\n", | ||
| "ocr_pipeline_options.do_table_structure = True\n", | ||
| "ocr_pipeline_options.table_structure_options.do_cell_matching = True\n", | ||
| "ocr_pipeline_options.do_ocr = True\n", | ||
| "ocr_pipeline_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n", | ||
| "ocr_pipeline_options.accelerator_options = AcceleratorOptions(\n", | ||
| " num_threads=4, device=AcceleratorDevice.AUTO\n", | ||
| ")\n", | ||
| "ocr_converter = DocumentConverter(\n", | ||
| " format_options={\n", | ||
| " InputFormat.PDF: PdfFormatOption(\n", | ||
| " pipeline_options=ocr_pipeline_options,\n", | ||
| " backend=DoclingParseV4DocumentBackend,\n", | ||
| " )\n", | ||
| " }\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Code and formula enrichments and picture description and classification\n", | ||
| "enrichment_pipeline_options = PdfPipelineOptions()\n", | ||
| "enrichment_pipeline_options.generate_picture_images = True\n", | ||
| "enrichment_pipeline_options.do_table_structure = True\n", | ||
| "enrichment_pipeline_options.table_structure_options.do_cell_matching = True\n", | ||
| "enrichment_pipeline_options.do_code_enrichment = True\n", | ||
| "enrichment_pipeline_options.do_formula_enrichment = True\n", | ||
| "enrichment_pipeline_options.do_picture_description = True\n", | ||
| "enrichment_pipeline_options.images_scale = 2\n", | ||
| "enrichment_pipeline_options.do_picture_classification = True\n", | ||
| "enrichment_pipeline_options.accelerator_options = AcceleratorOptions(\n", | ||
| " num_threads=4, device=AcceleratorDevice.AUTO\n", | ||
| ")\n", | ||
| "enrichment_converter = DocumentConverter(\n", | ||
| " format_options={\n", | ||
| " InputFormat.PDF: PdfFormatOption(\n", | ||
| " pipeline_options=enrichment_pipeline_options,\n", | ||
| " backend=DoclingParseV4DocumentBackend,\n", | ||
| " )\n", | ||
| " }\n", | ||
| ")" | ||
| ] | ||
| }, |
Here, all three converters share similar base configurations that are repeated. Can we divide this into something like:
- Base configuration (DRY principle)
- Standard converter
- OCR converter (inherit from base)
- Enrichment converter
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
TesseractOcrOptions,
PdfPipelineOptions,
)
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
def create_base_options():
"""Create base pipeline options with common settings."""
options = PdfPipelineOptions()
options.generate_picture_images = True
options.do_table_structure = True
options.table_structure_options.do_cell_matching = True
return options
def create_converter(pipeline_options):
"""Create a DocumentConverter with the given pipeline options."""
return DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_options=pipeline_options,
backend=DoclingParseV4DocumentBackend,
)
}
)
# Standard converter
standard_options = create_base_options()
standard_converter = create_converter(standard_options)
# OCR converter
%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
ocr_options = create_base_options()
ocr_options.do_ocr = True
ocr_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
ocr_options.accelerator_options = AcceleratorOptions(
num_threads=4, device=AcceleratorDevice.AUTO
)
ocr_converter = create_converter(ocr_options)
# Enrichment converter
enrichment_options = create_base_options()
enrichment_options.do_code_enrichment = True
enrichment_options.do_formula_enrichment = True
enrichment_options.do_picture_description = True
enrichment_options.images_scale = 2
enrichment_options.do_picture_classification = True
enrichment_options.accelerator_options = AcceleratorOptions(
num_threads=4, device=AcceleratorDevice.AUTO
)
enrichment_converter = create_converter(enrichment_options)
I think it would also be good to add some context about the enrichment options cell to point at the docs and educate users more about enrichment (https://docling-project.github.io/docling/usage/enrichments/) since it's not technically a conversion option but something that happens at the end of a conversion.
Yes, having docs about enrichment would be nice.
After looking at the overlap in the options more, I realized I had always thought of enrichment as being independent of conversion, since you can convert with or without it. Maybe a section on enrichment could just be done with supplementary pipeline_options for enrichment instead of labeling it as its own converter.
After looking at the overlap in the options more, I realized I had always thought of enrichment as being independent of conversion, since you can convert with or without it. Maybe a section on enrichment could just be done with supplementary pipeline_options for enrichment instead of labeling it as its own converter.
Agree, working on it.
I added a separate section for enrichments and moved things around a bit to adjust the flow (setting enrichments must sit between the creation of pipeline options and the creation of the converter). I implemented approach 1, but given those changes in the flow it's get_pipeline_options rather than get_converter. Let me know what you think.
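Sketched out, that flow might look like the following. `get_pipeline_options` and `pipeline_to_use` are the names mentioned in this thread, and the enrichment defaults shown here are assumptions, not the notebook's exact cell:

pipeline_options = get_pipeline_options(pipeline_to_use)  # "standard" or "ocr"

# Enrichments sit between option creation and converter creation
pipeline_options.do_code_enrichment = False          # assumed off by default
pipeline_options.do_formula_enrichment = False
pipeline_options.do_picture_classification = False
pipeline_options.do_picture_description = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)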
| "\n", | ||
| "Next we set the configuration options for our conversion pipeline. \n", | ||
| "\n", | ||
| "The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another one which enables code, formula, and picture enrichments. Later in the *Conversion* section, you'll set the converter to either `standard_converter`, `ocr_converter`, or `enrichment_converter` depending on which conversion technique you'd like to use.\n", |
The notebook gives an overview of all the options for the converter. Wondering if it is possible to show how different pipeline options impact the confidence scores of the produced jsons. Right now, seems like we leave that exercise to the user and use the standard converter.
Do you mean performing multiple conversions for a given pdf, with the intention of comparing the different conversion techniques and how they impact the quality of the results and the conversion confidence?
I like that idea. It could be a separate notebook, specifically targeted as a tool for conversion comparisons and finding the best technique. The quality of the conversion varies a lot depending on the file, for some the standard pipeline will produce good enough results, for others you get a bad confidence and fix that with OCR, for some others only a VLM will work.
But yeah, in this notebook the intention is to leave the exercise to the users, who can switch between techniques by setting pipeline_to_use and tweaking the enrichment knobs. ;)
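For anyone who does want to run that comparison themselves, a hypothetical sketch: the converter variables come from the notebook's configuration cell, but this comparison loop is not part of the PR.

# Convert one PDF with each technique and compare confidence scores.
techniques = {
    "standard": standard_converter,
    "ocr": ocr_converter,
    "enrichment": enrichment_converter,
}
pdf = files[0]  # any of the sample files
for name, technique_converter in techniques.items():
    report = technique_converter.convert(pdf).confidence
    print(f"{name}: {report.mean_grade.name} (score {report.mean_score:.3f})")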
| "outputs": [], | ||
| "source": [ | ||
| "files = [\n", | ||
| " \"https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf\",\n", |
Are these files guaranteed to live here? If not, we could check-in some files to our own repo and refer to them.
I think they're stable enough, but definitely not guaranteed. I suggest we address this more broadly in a separate PR; maybe add an assets folder to the repo and our own collection of sample PDFs.
@shruthis4 This is a good call out. We should discuss whether or not we want input files/code for notebooks within the repo at all.
Some thoughts around this:
- I think some users might find it beneficial to copy notebooks out of the repo as standalone files and just get going.
- If we have tests continuously running, then if any URLs go out of date we should be able to quickly see such an error and adjust the URL accordingly.
- We could also do a hybrid approach where we have our own `sample-files` directory and point to the link there for all notebooks as well.
Actionable comments posted: 0
♻️ Duplicate comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
141-141: Hardcoded Tesseract data path may not work in all environments.

The `TESSDATA_PREFIX` environment variable is set to `/usr/share/tesseract/tessdata`, which assumes a specific installation location. This path may not exist in all environments, container images, or operating systems. Consider one of these alternatives:

- Remove the line entirely if the default Tesseract installation already sets this variable correctly
- Check if the path exists before setting it
- Document the assumption in a markdown cell that users must have Tesseract installed at this specific location

Example for option 2:

import os
tessdata_path = "/usr/share/tesseract/tessdata"
if os.path.exists(tessdata_path):
    %env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
else:
    print(f"Warning: Tesseract data path {tessdata_path} not found. Using default.")
317-317: ANSI escape codes may not render in all notebook environments.

While ANSI escape codes (`\x1b[1m` and `\033[0m`) work in terminal output, they may not render correctly in all Jupyter notebook environments or when exported to other formats. Consider using IPython's display capabilities for better cross-environment compatibility:

from IPython.display import display, Markdown

for file, confidence_report in confidence_reports.items():
    print(f"✓ Conversion confidence for {file}:")
    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))
    low_score_pages = []
    for page in confidence_report.pages:
        page_confidence_report = confidence_report.pages[page]
        if page_confidence_report.mean_score < confidence_report.mean_score:
            low_score_pages.append(page)
    print(f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages) or 'none'}")
    print()

Or simply remove the escape codes if bold formatting is not critical:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (3)
notebooks/use-cases/document-conversion-standard.ipynb (3)
123-123: Remove unused import.

The `VlmPipeline` import is not used anywhere in this notebook. Remove it to keep the imports clean. Apply this diff:

-from docling.pipeline.vlm_pipeline import VlmPipeline
 from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
267-267: Prefer `{}` literal over `dict()` call.

Static analysis suggests using `{}` instead of `dict()` for creating an empty dictionary, which is more concise and Pythonic. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}
269-292: Add error handling for file operations.

The conversion loop performs file I/O operations (reading source files, writing JSON and Markdown) without error handling. If a file operation fails (e.g., network timeout for URL, disk full, permission denied), the entire loop will stop and the user won't get clear feedback.

Consider wrapping the conversion and file operations in a try-except block:

for file in files:
    try:
        # Convert the file
        print(f"Converting {file}...")
        result = converter.convert(file)
        document = result.document
        dictionary = document.export_to_dict()

        file_path = Path(file)

        # Calculate conversion confidence
        confidence_reports[file] = result.confidence

        # Export the document to JSON
        json_output_path = (output_dir / f"{file_path.stem}.json")
        with open(json_output_path, "w", encoding="utf-8") as f:
            json.dump(dictionary, f)
        print(f"✓ Path of JSON output is: {json_output_path.resolve()}")

        # Export the document to Markdown
        md_output_path = output_dir / f"{file_path.stem}.md"
        with open(md_output_path, "w", encoding="utf-8") as f:
            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
            f.write(markdown)
        print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
    except Exception as e:
        print(f"✗ Error converting {file}: {e}")
        continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.9)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 111-111: Consider using '{}' instead of a call to 'dict'.
(R1735)
🪛 Ruff (0.13.3)
notebooks/use-cases/document-conversion-standard.ipynb
67-70: Avoid specifying long messages outside the exception class
(TRY003)
…ements Signed-off-by: Fabiano Franz <[email protected]>
Force-pushed ab204f7 to b21084c.
Actionable comments posted: 0
♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
320-320: ANSI escape codes may not render in all notebook environments.

This issue was flagged in a previous review but remains unaddressed. ANSI escape codes (`\x1b[1m` and `\033[0m`) may not render correctly in all Jupyter environments or when exported to other formats. For better compatibility, use IPython's display capabilities or remove the formatting:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"✓ Conversion confidence for {file}:")
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
282-282: Consider using `urlparse` or checking for URL scheme.

Using `Path(file)` on URL strings is fragile and semantically incorrect. While it may work for simple URLs ending with a filename, it will fail for URLs with query parameters or fragments (e.g., `example.com/file.pdf?version=1`). Consider using `urllib.parse.urlparse` to extract the filename properly:

+from urllib.parse import urlparse
+
 # Calculate conversion confidence
 confidence_reports[file] = result.confidence

-file_path = Path(file)
+# Extract filename from URL or local path
+if file.startswith(('http://', 'https://')):
+    parsed_url = urlparse(file)
+    file_path = Path(parsed_url.path)
+else:
+    file_path = Path(file)

Alternatively, if you want a simpler approach that works for both URLs and paths:

-file_path = Path(file)
+# Extract base filename from URL or local path
+file_path = Path(file.split('?')[0].split('#')[0])
268-295: Consider adding error handling for conversion and file operations.

The conversion loop lacks error handling, which means a single failed conversion will stop processing all remaining files. Consider wrapping operations in try-except blocks to handle errors gracefully and continue processing other files.

 for file in files:
-    # Convert the file
-    print(f"Converting {file}...")
-
-    result = converter.convert(file)
-    document = result.document
-    dictionary = document.export_to_dict()
-
-    # Calculate conversion confidence
-    confidence_reports[file] = result.confidence
-
-    file_path = Path(file)
-
-    # Export the document to JSON
-    json_output_path = (output_dir / f"{file_path.stem}.json")
-    with open(json_output_path, "w", encoding="utf-8") as f:
-        json.dump(dictionary, f)
-    print(f"✓ Path of JSON output is: {json_output_path.resolve()}")
-
-    # Export the document to Markdown
-    md_output_path = output_dir / f"{file_path.stem}.md"
-    with open(md_output_path, "w", encoding="utf-8") as f:
-        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
-        f.write(markdown)
-    print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
+    try:
+        # Convert the file
+        print(f"Converting {file}...")
+
+        result = converter.convert(file)
+        document = result.document
+        dictionary = document.export_to_dict()
+
+        # Calculate conversion confidence
+        confidence_reports[file] = result.confidence
+
+        file_path = Path(file)
+
+        # Export the document to JSON
+        json_output_path = (output_dir / f"{file_path.stem}.json")
+        with open(json_output_path, "w", encoding="utf-8") as f:
+            json.dump(dictionary, f)
+        print(f"✓ Path of JSON output is: {json_output_path.resolve()}")
+
+        # Export the document to Markdown
+        md_output_path = output_dir / f"{file_path.stem}.md"
+        with open(md_output_path, "w", encoding="utf-8") as f:
+            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
+            f.write(markdown)
+        print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
+    except Exception as e:
+        print(f"✗ Error processing {file}: {e}")
+        continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
notebooks/use-cases/document-conversion-standard.ipynb
66-69: Avoid specifying long messages outside the exception class
(TRY003)
110-110: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
notebooks/use-cases/document-conversion-standard.ipynb (7)
32-32: LGTM! The installation cell correctly installs Docling with quiet output.

62-65: LGTM! The file list configuration is clear and mixes URLs and local paths appropriately. As noted in past reviews, URL stability is a separate concern to be addressed later.

85-90: LGTM! The output directory creation properly uses `parents=True` and `exist_ok=True` for safe directory creation.

116-168: LGTM! The pipeline configuration structure is well-organized with:
- A factory function for standard options following DRY principles
- Separate OCR configuration building on the standard options
- A selection function with clear error handling
This addresses previous feedback about refactoring converter creation.
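A sketch of what that selection function might look like, assuming the option objects defined earlier; the names here follow the review discussion and are not copied from the notebook:

def get_pipeline_options(pipeline: str) -> PdfPipelineOptions:
    """Map a user-facing pipeline name to its options object."""
    options_by_name = {
        "standard": standard_pipeline_options,
        "ocr": ocr_pipeline_options,
    }
    if pipeline not in options_by_name:
        # Fail loudly on misconfiguration instead of silently falling back
        raise ValueError(f"Unknown pipeline {pipeline!r}; expected 'standard' or 'ocr'")
    return options_by_name[pipeline]

pipeline_options = get_pipeline_options(pipeline_to_use)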
190-195: LGTM! The pipeline selection mechanism properly addresses past feedback by using a configuration variable rather than hardcoding the converter in the loop.

226-233: LGTM! Enrichment configuration is clear, with all options disabled by default and helpful comments about the performance trade-offs.

317-330: LGTM with suggestions above. The confidence reporting logic correctly computes per-file metrics and identifies low-scoring pages by comparing against the mean. The page number adjustment (+1) properly converts from 0-indexed to user-facing 1-indexed numbering.
alimaredia left a comment
Looks like all comments were addressed
Description
https://issues.redhat.com/browse/RHAIENG-1045
Adds a notebook for standard Docling conversions. Supports standard (default) configs, OCR, and enrichments.
How Has This Been Tested?
Merge criteria: