
Conversation

@fabianofranz
Contributor

@fabianofranz fabianofranz commented Oct 3, 2025

Description

https://issues.redhat.com/browse/RHAIENG-1045

Adds a notebook for standard Docling conversions. Supports standard (default) configs, OCR, and enrichments.

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

  • Documentation
    • Added a Jupyter notebook demonstrating a standard Docling-based document-conversion workflow.
    • Describes two pipeline variants (standard and OCR-forced) and a configurable pipeline selection per run.
    • Shows batch PDF processing, exporting per-file results to JSON and Markdown with embedded images.
    • Reports per-file and per-page confidence metrics, highlights low-confidence pages, and documents optional enrichments (off by default) with performance guidance.

@coderabbitai

coderabbitai bot commented Oct 3, 2025

Walkthrough

Adds a new Jupyter notebook implementing a Docling-based PDF → JSON/Markdown conversion workflow with two pipeline variants (standard and OCR-forced), configurable enrichments, per-file exports, and per-file/per-page confidence computation and reporting.
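For orientation, here is a minimal sketch of the core loop described above, using Docling's DocumentConverter API; the input URL and output directory are illustrative placeholders, not values taken from the notebook:

import json
from pathlib import Path

from docling.document_converter import DocumentConverter

files = ["https://example.com/sample.pdf"]  # hypothetical input
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)

converter = DocumentConverter()  # default (standard) pipeline

for file in files:
    result = converter.convert(file)
    document = result.document
    stem = Path(file).stem

    # Export the Docling Document to JSON...
    (output_dir / f"{stem}.json").write_text(
        json.dumps(document.export_to_dict()), encoding="utf-8"
    )

    # ...and to Markdown.
    (output_dir / f"{stem}.md").write_text(
        document.export_to_markdown(), encoding="utf-8"
    )

    # Per-file confidence report (mean grade/score), as described above.
    print(file, result.confidence.mean_grade.name, f"{result.confidence.mean_score:.3f}")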

Changes

Cohort / File(s): notebooks/use-cases/document-conversion-standard.ipynb (Notebook: Docling document conversion workflow)

Summary: New notebook that installs/configures Docling, defines pipeline variants (standard and OCR-forced) with a shared base configuration and optional enrichments, sets input PDF URLs and output directory, runs a selected converter per file to produce Docling Document JSON and Markdown (with embedded images), computes per-file mean grade/score, flags below-average pages, and exports/logs results and confidence reports.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Notebook
  participant PipelineFactory as "Pipeline factory (standard / OCR-forced)"
  participant Converter as "Docling converter"
  participant Exporter as "JSON & Markdown exporter"
  participant FS as "Filesystem (outdir)"

  User->>Notebook: open notebook, provide input URLs & outdir, choose pipeline
  Notebook->>PipelineFactory: create pipelines (base + variant, enrichments)
  loop each PDF
    Notebook->>PipelineFactory: select pipeline for PDF
    PipelineFactory->>Converter: run conversion
    alt conversion success
      Converter-->>PipelineFactory: Docling Document + per-page confidence
      PipelineFactory-->>Notebook: result
      Notebook->>Exporter: write JSON, Markdown, images
      Exporter->>FS: save files to outdir
      Notebook-->>User: report output paths and confidence summary
    else conversion failure
      Converter-->>Notebook: error
      Notebook-->>User: surface error and suggestions (e.g., enable OCR)
    end
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I hop through notebook cells with glee,
turning PDFs to tidy JSON and MD.
I flag the pages where confidence dips,
and leave neat outputs from my rabbit skips.
Soft paws, sharp checks — conversion done with tea 🥕

Pre-merge checks

✅ Passed checks (3 passed)
✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "[RHAIENG-1045] Add a notebook for standard conversion" accurately summarizes the primary change by indicating the addition of a new notebook focused on standard document conversion. It is concise and specific, avoiding unnecessary detail while still conveying the central update. Including the ticket identifier is acceptable for context and traceability.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@fabianofranz fabianofranz force-pushed the RHAIENG-1045-standard-notebook branch 3 times, most recently from 7f932b9 to 222fbe1 on October 7, 2025 16:20
@fabianofranz fabianofranz marked this pull request as ready for review October 7, 2025 16:20
@fabianofranz fabianofranz requested a review from a team as a code owner October 7, 2025 16:20
@fabianofranz
Contributor Author

This is ready for review.

@fabianofranz fabianofranz force-pushed the RHAIENG-1045-standard-notebook branch from 222fbe1 to 169fae8 on October 7, 2025 16:24

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)

195-195: Consider using literal {} instead of dict().

Using {} is more idiomatic and slightly more efficient than calling dict().

Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

197-197: Remove unused json_files list.

The json_files list is initialized and populated but never used afterward. If this was intended for future use, consider adding a comment or TODO; otherwise, remove it to reduce cognitive load.

Apply this diff to remove the unused list:

-json_files=[]
-             
 for file in files:
     print(f"Converting {file}...")
 
@@ -216,7 +214,6 @@
     with open(json_output_path, "w", encoding="utf-8") as f:
         json.dump(dictionary, f)
         print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
-        json_files.append(json_output_path.resolve())

Also applies to: 220-220

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2ab4e0 and 169fae8.

📒 Files selected for processing (1)
  • notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb

[refactor] 89-89: Consider using '{}' instead of a call to 'dict'.

(R1735)

@fabianofranz fabianofranz force-pushed the RHAIENG-1045-standard-notebook branch from 169fae8 to 17a5a79 on October 7, 2025 16:39

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)

194-194: Consider using {} instead of dict().

For consistency with Python style conventions, use the literal syntax.

Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 169fae8 and 17a5a79.

📒 Files selected for processing (1)
  • notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb

[refactor] 88-88: Consider using '{}' instead of a call to 'dict'.

(R1735)

"for file, confidence_report in confidence_reports.items():\n",
" print(f\"Conversion confidence for {file}:\")\n",
" \n",
" print(f\"Average confidence: \\x1b[1m{confidence_report.mean_grade.name}\\033[0m (score {confidence_report.mean_score:.3f})\")\n",


⚠️ Potential issue | 🟡 Minor

ANSI escape codes may not render in all notebook environments.

While ANSI escape codes (\x1b[1m and \033[0m) work in terminal output, they may not render correctly in all Jupyter notebook environments or when exported to other formats. Consider using Markdown formatting in the print statement or using IPython's display with HTML instead for better cross-environment compatibility.

For better compatibility, consider using IPython's display capabilities:

from IPython.display import display, Markdown

display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes if bold formatting is not critical:

-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🤖 Prompt for AI Agents
notebooks/use-cases/document-conversion-standard.ipynb around line 251: the
print uses ANSI escape codes which may not render in many notebook/export
environments; replace the ANSI formatting by either using
IPython.display.display with Markdown to render bold (e.g., construct a Markdown
string with **...**) or simply remove the escape sequences and print plain text;
ensure you import display and Markdown from IPython.display when using display, and format
the score to three decimals as before.

@fabianofranz fabianofranz force-pushed the RHAIENG-1045-standard-notebook branch from 17a5a79 to d8ee70c on October 7, 2025 16:45
@fabianofranz fabianofranz force-pushed the RHAIENG-1045-standard-notebook branch from d8ee70c to 0d2a207 on October 7, 2025 18:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)

246-246: ANSI escape codes may not render in all notebook environments.

The ANSI escape sequences (\x1b[1m and \033[0m) may not display correctly when the notebook is exported or viewed in certain environments.

Consider using IPython's display capabilities for better compatibility:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"Conversion confidence for {file}:")
     
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes:

-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)

193-193: Consider using literal {} instead of dict().

For consistency with modern Python style, prefer the literal syntax.

Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17a5a79 and 0d2a207.

📒 Files selected for processing (1)
  • notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb

[refactor] 87-87: Consider using '{}' instead of a call to 'dict'.

(R1735)

🔇 Additional comments (4)
notebooks/use-cases/document-conversion-standard.ipynb (4)

1-58: LGTM!

The installation and file configuration sections are well-structured. The use of public sample files from GitHub is appropriate for demonstration purposes.


61-81: LGTM!

Proper use of pathlib for directory creation with appropriate safety flags.


84-171: LGTM!

The three converter configurations (standard, OCR, enrichment) are properly set up with distinct pipeline options. The documentation accurately references the correct converter variable names, and the Tesseract environment variable is appropriately configured.


260-271: LGTM!

The additional resources and feedback section provides helpful links and encourages user engagement.

Contributor

@alinaryan alinaryan left a comment


This LGTM! I tried running the notebook with the default files and then also with a random text PDF I had on my laptop, and both outputs looked great and ended with a confidence score of "Excellent". I am happy with the repo structure too; I can easily see the other use cases fitting into this folder structure. I didn't try the other conversion pipelines (OCR or enrichment), but maybe you can demo that for us.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)

254-254: ANSI escape codes remain - previous feedback not addressed.

The ANSI escape codes (\x1b[1m and \033[0m) flagged in a previous review are still present. These may not render correctly in all Jupyter environments, especially when exporting to HTML, PDF, or viewing in JupyterLab vs classic Notebook.

As suggested in the previous review, use IPython's display capabilities for better compatibility:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"Conversion confidence for {file}:")
     
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the formatting:

-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)

201-201: Prefer {} over dict() for empty dictionary initialization.

As per static analysis:

-confidence_reports = dict()
+confidence_reports = {}

203-229: Add error handling to prevent single file failures from stopping the entire batch.

The conversion loop lacks error handling. If any file fails to download, convert, or export, the entire batch processing will stop. This could be frustrating when processing multiple documents, especially with remote URLs that may be temporarily unavailable.

Consider wrapping the conversion logic in a try-except block:

 for file in files:
-    # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
-    converter = standard_converter
-    
-    # Convert the file
-    print(f"Converting {file}...")
-    result = converter.convert(file)
-    document = result.document
-    dictionary = document.export_to_dict()
-
-    file_path = Path(file)
-
-    # Calculate conversion confidence
-    confidence_reports[file] = result.confidence
-
-    # Export the document to JSON
-    json_output_path = (output_dir / f"{file_path.stem}.json")
-    with open(json_output_path, "w", encoding="utf-8") as f:
-        json.dump(dictionary, f)
-        print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
-
-    # Export the document to Markdown
-    md_output_path = output_dir / f"{file_path.stem}.md"
-    with open(md_output_path, "w", encoding="utf-8") as f:
-        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
-        f.write(markdown)
-        print(f"Path of markdown output is: {Path(md_output_path).resolve()}")
+    try:
+        # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
+        converter = standard_converter
+        
+        # Convert the file
+        print(f"Converting {file}...")
+        result = converter.convert(file)
+        document = result.document
+        dictionary = document.export_to_dict()
+
+        file_path = Path(file)
+
+        # Calculate conversion confidence
+        confidence_reports[file] = result.confidence
+
+        # Export the document to JSON
+        json_output_path = (output_dir / f"{file_path.stem}.json")
+        with open(json_output_path, "w", encoding="utf-8") as f:
+            json.dump(dictionary, f)
+            print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
+
+        # Export the document to Markdown
+        md_output_path = output_dir / f"{file_path.stem}.md"
+        with open(md_output_path, "w", encoding="utf-8") as f:
+            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
+            f.write(markdown)
+            print(f"Path of markdown output is: {Path(md_output_path).resolve()}")
+    except Exception as e:
+        print(f"Error processing {file}: {e}")
+        continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0d2a207 and 6dc1038.

📒 Files selected for processing (1)
  • notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb

[refactor] 87-87: Consider using '{}' instead of a call to 'dict'.

(R1735)

")\n",
"\n",
"# Force OCR on the entire page\n",
"%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n",


⚠️ Potential issue | 🟡 Minor

Hardcoded Tesseract data path may not work in all environments.

The TESSDATA_PREFIX environment variable is set to /usr/share/tesseract/tessdata, which assumes a specific installation location. This path may not exist in all environments, container images, or operating systems.

Consider one of these alternatives:

  1. Remove the line entirely if the default Tesseract installation already sets this variable correctly
  2. Check if the path exists before setting it
  3. Document the assumption that users must have Tesseract installed at this specific location

Example for option 2:

 # Force OCR on the entire page
-%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
+import os
+tessdata_path = "/usr/share/tesseract/tessdata"
+if os.path.exists(tessdata_path):
+    %env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
+else:
+    print(f"Warning: Tesseract data path {tessdata_path} not found. Using default.")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In notebooks/use-cases/document-conversion-standard.ipynb around line 139, the
notebook hardcodes TESSDATA_PREFIX to /usr/share/tesseract/tessdata which may
not exist in all environments; replace this by either removing the line, or
adding a conditional that detects whether that path exists and only sets
TESSDATA_PREFIX if present, otherwise leave it unset and add a short note in the
notebook explaining the assumption and how to set TESSDATA_PREFIX for custom
installs or containers.

Comment on lines 113 to 180
"source": [
"from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n",
"from docling.document_converter import DocumentConverter, PdfFormatOption\n",
"from docling.datamodel.base_models import InputFormat\n",
"from docling.datamodel.pipeline_options import (\n",
" TesseractOcrOptions,\n",
" PdfPipelineOptions,\n",
")\n",
"from docling.pipeline.vlm_pipeline import VlmPipeline\n",
"from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\n",
"\n",
"# Standard pipeline options\n",
"standard_pipeline_options = PdfPipelineOptions()\n",
"standard_pipeline_options.generate_picture_images = True\n",
"standard_pipeline_options.do_table_structure = True\n",
"standard_pipeline_options.table_structure_options.do_cell_matching = True\n",
"standard_converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=standard_pipeline_options,\n",
" backend=DoclingParseV4DocumentBackend,\n",
" )\n",
" }\n",
")\n",
"\n",
"# Force OCR on the entire page\n",
"%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n",
"ocr_pipeline_options = PdfPipelineOptions()\n",
"ocr_pipeline_options.generate_picture_images = True\n",
"ocr_pipeline_options.do_table_structure = True\n",
"ocr_pipeline_options.table_structure_options.do_cell_matching = True\n",
"ocr_pipeline_options.do_ocr = True\n",
"ocr_pipeline_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n",
"ocr_pipeline_options.accelerator_options = AcceleratorOptions(\n",
" num_threads=4, device=AcceleratorDevice.AUTO\n",
")\n",
"ocr_converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=ocr_pipeline_options,\n",
" backend=DoclingParseV4DocumentBackend,\n",
" )\n",
" }\n",
")\n",
"\n",
"# Code and formula enrichments and picture description and classification\n",
"enrichment_pipeline_options = PdfPipelineOptions()\n",
"enrichment_pipeline_options.generate_picture_images = True\n",
"enrichment_pipeline_options.do_table_structure = True\n",
"enrichment_pipeline_options.table_structure_options.do_cell_matching = True\n",
"enrichment_pipeline_options.do_code_enrichment = True\n",
"enrichment_pipeline_options.do_formula_enrichment = True\n",
"enrichment_pipeline_options.do_picture_description = True\n",
"enrichment_pipeline_options.images_scale = 2\n",
"enrichment_pipeline_options.do_picture_classification = True\n",
"enrichment_pipeline_options.accelerator_options = AcceleratorOptions(\n",
" num_threads=4, device=AcceleratorDevice.AUTO\n",
")\n",
"enrichment_converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=enrichment_pipeline_options,\n",
" backend=DoclingParseV4DocumentBackend,\n",
" )\n",
" }\n",
")"
]
},
Contributor


Here, all three converters share a similar base configuration that is repeated. Can we divide it into something like this:

  1. Base configuration (DRY principle)
  2. Standard converter
  3. OCR converter (inherit from base)
  4. Enrichment converter
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    TesseractOcrOptions,
    PdfPipelineOptions,
)
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend

def create_base_options():
    """Create base pipeline options with common settings."""
    options = PdfPipelineOptions()
    options.generate_picture_images = True
    options.do_table_structure = True
    options.table_structure_options.do_cell_matching = True
    return options

def create_converter(pipeline_options):
    """Create a DocumentConverter with the given pipeline options."""
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=DoclingParseV4DocumentBackend,
            )
        }
    )

# Standard converter
standard_options = create_base_options()
standard_converter = create_converter(standard_options)

# OCR converter
%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata

ocr_options = create_base_options()
ocr_options.do_ocr = True
ocr_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
ocr_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

ocr_converter = create_converter(ocr_options)

# Enrichment converter
enrichment_options = create_base_options()
enrichment_options.do_code_enrichment = True
enrichment_options.do_formula_enrichment = True
enrichment_options.do_picture_description = True
enrichment_options.images_scale = 2
enrichment_options.do_picture_classification = True
enrichment_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

enrichment_converter = create_converter(enrichment_options)

Contributor


I think it would also be good to add some context about the enrichment options cell to point at the docs and educate users more about enrichment (https://docling-project.github.io/docling/usage/enrichments/) since it's not technically a conversion option but something that happens at the end of a conversion.
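For readers who want the shape of it, a short sketch of toggling those enrichment flags on PdfPipelineOptions, mirroring the options used elsewhere in this PR; as the linked docs note, each flag loads an extra model, so runs get slower:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
# Enrichments run as an extra step at the end of conversion and are off
# by default; enable only what you need, since each one costs time.
pipeline_options.do_code_enrichment = True
pipeline_options.do_formula_enrichment = True
pipeline_options.do_picture_classification = True
pipeline_options.do_picture_description = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)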

Contributor


Yes, having docs about enrichment would be nice.

Contributor


After looking at the overlap in the options more, I realized I had always thought of enrichment as independent of conversion, since you can convert with or without it. Maybe a section on enrichment could just be done with supplementary pipeline_options for enrichment instead of labeling it as its own converter.

Contributor Author


After looking at the overlap in the options more, I realized I had always thought of enrichment as independent of conversion, since you can convert with or without it. Maybe a section on enrichment could just be done with supplementary pipeline_options for enrichment instead of labeling it as its own converter.

Agree, working on it.

Contributor Author


I added a separate section for enrichments and moved things around a bit to adjust the flow (setting enrichments must sit between the creation of the pipeline options and the creation of the converter). I implemented approach 1, but given the changes in the flow mentioned above, it's get_pipeline_options rather than a get-converter helper. Let me know what you think.
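A minimal sketch of what that selection flow could look like, assuming the names mentioned in this thread (get_pipeline_options, pipeline_to_use); the notebook's actual implementation may differ:

from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TesseractOcrOptions,
)

def get_pipeline_options(pipeline_to_use: str) -> PdfPipelineOptions:
    """Return the pipeline options for the requested variant."""
    options = PdfPipelineOptions()
    options.generate_picture_images = True
    options.do_table_structure = True
    options.table_structure_options.do_cell_matching = True

    if pipeline_to_use == "ocr":
        options.do_ocr = True
        options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
    elif pipeline_to_use != "standard":
        raise ValueError(f"unknown pipeline: {pipeline_to_use}")

    return options

# Enrichment flags can then be toggled on the returned options before the
# converter is created, which is why enrichment setup sits in between.
pipeline_options = get_pipeline_options("standard")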

"\n",
"Next we set the configuration options for our conversion pipeline. \n",
"\n",
"The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another one which enables code, formula, and picture enrichments. Later in the *Conversion* section, you'll set the converter to either `standard_converter`, `ocr_converter`, or `enrichment_converter` depending on which conversion technique you'd like to use.\n",
Contributor


The notebook gives an overview of all the options for the converter. Wondering if it is possible to show how different pipeline options impact the confidence scores of the produced JSONs. Right now, it seems like we leave that exercise to the user and use the standard converter.

Contributor Author


Do you mean performing multiple conversions for a given PDF, with the intention of comparing the different conversion techniques and how they impact the quality of the results and the conversion confidence?

Contributor Author


I like that idea. It could be a separate notebook, specifically targeted as a tool for conversion comparisons and finding the best technique. The quality of the conversion varies a lot depending on the file: for some, the standard pipeline will produce good enough results; for others, you get a bad confidence and fix that with OCR; for some others, only a VLM will work.
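A rough sketch of that comparison idea, reusing Docling's converter API and the confidence fields shown earlier in this PR; the test PDF URL is hypothetical:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

def make_converter(ocr: bool = False) -> DocumentConverter:
    # Shared base options; the OCR variant forces full-page OCR on top.
    opts = PdfPipelineOptions()
    if ocr:
        opts.do_ocr = True
        opts.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )

pdf = "https://example.com/sample.pdf"  # hypothetical test file

# Convert the same PDF with each technique and compare confidence.
for name, converter in {"standard": make_converter(), "ocr": make_converter(ocr=True)}.items():
    confidence = converter.convert(pdf).confidence
    print(f"{name}: {confidence.mean_grade.name} (score {confidence.mean_score:.3f})")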

Contributor Author


But yeah, in this notebook the intention is to leave the exercise to the users, who can switch between techniques by setting pipeline_to_use and tweaking the enrichment knobs. ;)

"outputs": [],
"source": [
"files = [\n",
" \"https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf\",\n",
Contributor


Are these files guaranteed to live here? If not, we could check some files into our own repo and refer to them.

Contributor Author


I think they're stable enough, but definitely not guaranteed. I suggest we address this more broadly in a separate PR: maybe add an assets folder to the repo and our own collection of sample PDFs.

Contributor


@shruthis4 This is a good call out. We should discuss whether or not we want input files/code for notebooks within the repo at all.

Some thoughts around this:

  • I think some users might find it beneficial to copy notebooks out of the repo as standalone files and just get going.
  • If we have tests running continuously, then if any URLs go out of date we should be able to quickly catch the error and adjust the URL accordingly (a reachability check along these lines is sketched after this list).
  • We could also do a hybrid approach where we have our own sample-files directory and point to the link there for all notebooks as well.
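A hypothetical sketch of such a check, standard library only; the URL below is the sample file the notebook already uses:

import urllib.request

SAMPLE_URLS = [
    "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf",
]

def check_urls(urls: list[str]) -> list[str]:
    """Return the subset of URLs that do not answer a HEAD request with HTTP 200."""
    dead = []
    for url in urls:
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                if resp.status != 200:
                    dead.append(url)
        except Exception:
            dead.append(url)
    return dead

# A CI job could simply fail when any sample URL stops resolving.
assert not check_urls(SAMPLE_URLS), "some sample URLs are unreachable"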


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)

141-141: Hardcoded Tesseract data path may not work in all environments.

The TESSDATA_PREFIX environment variable is set to /usr/share/tesseract/tessdata, which assumes a specific installation location. This path may not exist in all environments, container images, or operating systems.

Consider one of these alternatives:

  1. Remove the line entirely if the default Tesseract installation already sets this variable correctly
  2. Check if the path exists before setting it
  3. Document the assumption in a markdown cell that users must have Tesseract installed at this specific location

Example for option 2:

import os
tessdata_path = "/usr/share/tesseract/tessdata"
if os.path.exists(tessdata_path):
    %env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
else:
    print(f"Warning: Tesseract data path {tessdata_path} not found. Using default.")

317-317: ANSI escape codes may not render in all notebook environments.

While ANSI escape codes (\x1b[1m and \033[0m) work in terminal output, they may not render correctly in all Jupyter notebook environments or when exported to other formats.

Consider using IPython's display capabilities for better cross-environment compatibility:

from IPython.display import display, Markdown

for file, confidence_report in confidence_reports.items():
    print(f"✓ Conversion confidence for {file}:")
    
    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))
    
    low_score_pages = []
    for page in confidence_report.pages:
        page_confidence_report = confidence_report.pages[page]
        if page_confidence_report.mean_score < confidence_report.mean_score:
            low_score_pages.append(page)

    print(f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages) or 'none'}")
    
    print()

Or simply remove the escape codes if bold formatting is not critical:

-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (3)
notebooks/use-cases/document-conversion-standard.ipynb (3)

123-123: Remove unused import.

The VlmPipeline import is not used anywhere in this notebook. Remove it to keep the imports clean.

Apply this diff:

-from docling.pipeline.vlm_pipeline import VlmPipeline
 from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend

267-267: Prefer {} literal over dict() call.

Static analysis suggests using {} instead of dict() for creating an empty dictionary, which is more concise and Pythonic.

Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

269-292: Add error handling for file operations.

The conversion loop performs file I/O operations (reading source files, writing JSON and Markdown) without error handling. If a file operation fails (e.g., network timeout for URL, disk full, permission denied), the entire loop will stop and the user won't get clear feedback.

Consider wrapping the conversion and file operations in a try-except block:

for file in files:
    try:
        # Convert the file
        print(f"Converting {file}...")
        result = converter.convert(file)
        document = result.document
        dictionary = document.export_to_dict()

        file_path = Path(file)

        # Calculate conversion confidence
        confidence_reports[file] = result.confidence

        # Export the document to JSON
        json_output_path = (output_dir / f"{file_path.stem}.json")
        with open(json_output_path, "w", encoding="utf-8") as f:
            json.dump(dictionary, f)
            print(f"✓ Path of JSON output is: {json_output_path.resolve()}")

        # Export the document to Markdown
        md_output_path = output_dir / f"{file_path.stem}.md"
        with open(md_output_path, "w", encoding="utf-8") as f:
            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
            f.write(markdown)
            print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
    
    except Exception as e:
        print(f"✗ Error converting {file}: {e}")
        continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc1038 and ab204f7.

📒 Files selected for processing (1)
  • notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.9)
notebooks/use-cases/document-conversion-standard.ipynb

[refactor] 111-111: Consider using '{}' instead of a call to 'dict'.

(R1735)

🪛 Ruff (0.13.3)
notebooks/use-cases/document-conversion-standard.ipynb

67-70: Avoid specifying long messages outside the exception class

(TRY003)

@fabianofranz fabianofranz force-pushed the RHAIENG-1045-standard-notebook branch from ab204f7 to b21084c on October 10, 2025 15:18

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)

320-320: ANSI escape codes may not render in all notebook environments.

This issue was flagged in a previous review but remains unaddressed. ANSI escape codes (\x1b[1m and \033[0m) may not render correctly in all Jupyter environments or when exported to other formats.

For better compatibility, use IPython's display capabilities or remove the formatting:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"✓ Conversion confidence for {file}:")
     
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes:

-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)

282-282: Consider using urlparse or checking for URL scheme.

Using Path(file) on URL strings is fragile and semantically incorrect. While it may work for simple URLs ending with a filename, it will fail for URLs with query parameters or fragments (e.g., example.com/file.pdf?version=1).

Consider using urllib.parse.urlparse to extract the filename properly:

+    from urllib.parse import urlparse
+    
     # Calculate conversion confidence
     confidence_reports[file] = result.confidence
 
-    file_path = Path(file)
+    # Extract filename from URL or local path
+    if file.startswith(('http://', 'https://')):
+        parsed_url = urlparse(file)
+        file_path = Path(parsed_url.path)
+    else:
+        file_path = Path(file)

Alternatively, if you want a simpler approach that works for both URLs and paths:

-    file_path = Path(file)
+    # Extract base filename from URL or local path
+    file_path = Path(file.split('?')[0].split('#')[0])

268-295: Consider adding error handling for conversion and file operations.

The conversion loop lacks error handling, which means a single failed conversion will stop processing all remaining files. Consider wrapping operations in try-except blocks to handle errors gracefully and continue processing other files.

 for file in files:
-    # Convert the file
-    print(f"Converting {file}...")
-
-    result = converter.convert(file)
-    document = result.document
-    dictionary = document.export_to_dict()
-
-    # Calculate conversion confidence
-    confidence_reports[file] = result.confidence
-
-    file_path = Path(file)
-
-    # Export the document to JSON
-    json_output_path = (output_dir / f"{file_path.stem}.json")
-    with open(json_output_path, "w", encoding="utf-8") as f:
-        json.dump(dictionary, f)
-        print(f"✓ Path of JSON output is: {json_output_path.resolve()}")
-
-    # Export the document to Markdown
-    md_output_path = output_dir / f"{file_path.stem}.md"
-    with open(md_output_path, "w", encoding="utf-8") as f:
-        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
-        f.write(markdown)
-        print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
+    try:
+        # Convert the file
+        print(f"Converting {file}...")
+
+        result = converter.convert(file)
+        document = result.document
+        dictionary = document.export_to_dict()
+
+        # Calculate conversion confidence
+        confidence_reports[file] = result.confidence
+
+        file_path = Path(file)
+
+        # Export the document to JSON
+        json_output_path = (output_dir / f"{file_path.stem}.json")
+        with open(json_output_path, "w", encoding="utf-8") as f:
+            json.dump(dictionary, f)
+            print(f"✓ Path of JSON output is: {json_output_path.resolve()}")
+
+        # Export the document to Markdown
+        md_output_path = output_dir / f"{file_path.stem}.md"
+        with open(md_output_path, "w", encoding="utf-8") as f:
+            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
+            f.write(markdown)
+            print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
+    except Exception as e:
+        print(f"✗ Error processing {file}: {e}")
+        continue
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab204f7 and b21084c.

📒 Files selected for processing (1)
  • notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
notebooks/use-cases/document-conversion-standard.ipynb

66-69: Avoid specifying long messages outside the exception class

(TRY003)


110-110: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (7)
notebooks/use-cases/document-conversion-standard.ipynb (7)

32-32: LGTM!

The installation cell correctly installs Docling with quiet output.


62-65: LGTM!

The file list configuration is clear and mixes URLs and local paths appropriately. As noted in past reviews, URL stability is a separate concern to be addressed later.


85-90: LGTM!

The output directory creation properly uses parents=True and exist_ok=True for safe directory creation.


116-168: LGTM!

The pipeline configuration structure is well-organized with:

  • A factory function for standard options following DRY principles
  • Separate OCR configuration building on the standard options
  • A selection function with clear error handling

This addresses previous feedback about refactoring converter creation.


190-195: LGTM!

The pipeline selection mechanism properly addresses past feedback by using a configuration variable rather than hardcoding the converter in the loop.


226-233: LGTM!

Enrichment configuration is clear, with all options disabled by default and helpful comments about the performance trade-offs.


317-330: LGTM with suggestions above.

The confidence reporting logic correctly computes per-file metrics and identifies low-scoring pages by comparing against the mean. The page number adjustment (+1) properly converts from 0-indexed to user-facing 1-indexed numbering.

Contributor

@alimaredia alimaredia left a comment


Looks like all comments were addressed

@alimaredia alimaredia merged commit e2f1d7e into opendatahub-io:main Oct 10, 2025
1 check passed