[RHAIENG-1045] Add a notebook for standard conversion #19
Conversation
Walkthrough

Adds a new Jupyter notebook implementing a Docling-based PDF → JSON/Markdown conversion workflow with two pipeline variants (standard and OCR-forced), configurable enrichments, per-file exports, and per-file/per-page confidence computation and reporting.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant Notebook
participant PipelineFactory as "Pipeline factory (standard / OCR-forced)"
participant Converter as "Docling converter"
participant Exporter as "JSON & Markdown exporter"
participant FS as "Filesystem (outdir)"
User->>Notebook: open notebook, provide input URLs & outdir, choose pipeline
Notebook->>PipelineFactory: create pipelines (base + variant, enrichments)
loop each PDF
Notebook->>PipelineFactory: select pipeline for PDF
PipelineFactory->>Converter: run conversion
alt conversion success
Converter-->>PipelineFactory: Docling Document + per-page confidence
PipelineFactory-->>Notebook: result
Notebook->>Exporter: write JSON, Markdown, images
Exporter->>FS: save files to outdir
Notebook-->>User: report output paths and confidence summary
else conversion failure
Converter-->>Notebook: error
Notebook-->>User: surface error and suggestions (e.g., enable OCR)
end
end
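In code, the loop the diagram describes boils down to roughly the following sketch. This is a minimal illustration rather than the notebook's exact cell: the input URL and output directory are placeholders, and the default standard pipeline is assumed.

import json
from pathlib import Path

from docling.document_converter import DocumentConverter
from docling_core.types.doc import ImageRefMode

converter = DocumentConverter()  # standard PDF pipeline by default
output_dir = Path("output")      # placeholder outdir
output_dir.mkdir(parents=True, exist_ok=True)

for file in ["https://example.com/sample.pdf"]:  # placeholder input URL
    result = converter.convert(file)             # run the conversion
    document = result.document
    stem = Path(file).stem
    # Export JSON and Markdown side by side in outdir
    (output_dir / f"{stem}.json").write_text(
        json.dumps(document.export_to_dict()), encoding="utf-8"
    )
    (output_dir / f"{stem}.md").write_text(
        document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED),
        encoding="utf-8",
    )
    # Per-file confidence summary, as in the report step above
    print(file, result.confidence.mean_grade.name, f"{result.confidence.mean_score:.3f}")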
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 3 passed.
Force-pushed 7f932b9 to 222fbe1.

This is ready for review.
Force-pushed 222fbe1 to 169fae8.
Actionable comments posted: 2
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
195-195: Consider using literal `{}` instead of `dict()`.

Using `{}` is more idiomatic and slightly more efficient than calling `dict()`. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}
197-197: Remove unused `json_files` list.

The `json_files` list is initialized and populated but never used afterward. If this was intended for future use, consider adding a comment or TODO; otherwise, remove it to reduce cognitive load. Apply this diff to remove the unused list:

-json_files = []
-
 for file in files:
     print(f"Converting {file}...")
@@ -216,7 +214,6 @@
     with open(json_output_path, "w", encoding="utf-8") as f:
         json.dump(dictionary, f)
     print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
-    json_files.append(json_output_path.resolve())

Also applies to: 220-220
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 89-89: Consider using '{}' instead of a call to 'dict'.
(R1735)
Force-pushed 169fae8 to 17a5a79.
Actionable comments posted: 2
🧹 Nitpick comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
194-194: Consider using `{}` instead of `dict()`.

For consistency with Python style conventions, use the literal syntax. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 88-88: Consider using '{}' instead of a call to 'dict'.
(R1735)
| "for file, confidence_report in confidence_reports.items():\n", | ||
| " print(f\"Conversion confidence for {file}:\")\n", | ||
| " \n", | ||
| " print(f\"Average confidence: \\x1b[1m{confidence_report.mean_grade.name}\\033[0m (score {confidence_report.mean_score:.3f})\")\n", |
ANSI escape codes may not render in all notebook environments.
While ANSI escape codes (\x1b[1m and \033[0m) work in terminal output, they may not render correctly in all Jupyter notebook environments or when exported to other formats. Consider using Markdown formatting in the print statement or using IPython's display with HTML instead for better cross-environment compatibility.
For better compatibility, consider using IPython's display capabilities:
from IPython.display import display, Markdown
display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))Or simply remove the escape codes if bold formatting is not critical:
- print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+ print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")🤖 Prompt for AI Agents
notebooks/use-cases/document-conversion-standard.ipynb around line 251: the
print uses ANSI escape codes which may not render in many notebook/export
environments; replace the ANSI formatting by either using
IPython.display.display with Markdown to render bold (e.g., construct a Markdown
string with **...**) or simply remove the escape sequences and print plain text;
ensure you import display and Markdown from IPython.display when using display, and format
the score to three decimals as before.
Force-pushed 17a5a79 to d8ee70c.
Signed-off-by: Fabiano Franz <[email protected]>
Force-pushed d8ee70c to 0d2a207.
Actionable comments posted: 0
♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
246-246: ANSI escape codes may not render in all notebook environments.

The ANSI escape sequences (`\x1b[1m` and `\033[0m`) may not display correctly when the notebook is exported or viewed in certain environments. Consider using IPython's display capabilities for better compatibility:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"Conversion confidence for {file}:")
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
193-193: Consider using literal `{}` instead of `dict()`.

For consistency with modern Python style, prefer the literal syntax. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 87-87: Consider using '{}' instead of a call to 'dict'.
(R1735)
🔇 Additional comments (4)
notebooks/use-cases/document-conversion-standard.ipynb (4)
1-58: LGTM! The installation and file configuration sections are well-structured. The use of public sample files from GitHub is appropriate for demonstration purposes.

61-81: LGTM! Proper use of pathlib for directory creation with appropriate safety flags.

84-171: LGTM! The three converter configurations (standard, OCR, enrichment) are properly set up with distinct pipeline options. The documentation accurately references the correct converter variable names, and the Tesseract environment variable is appropriately configured.

260-271: LGTM! The additional resources and feedback section provides helpful links and encourages user engagement.
alinaryan left a comment
This LGTM! I tried running the notebook with the default files and then also with a random text PDF I had on my laptop, and both outputs looked great and ended with a confidence score of "Excellent". I am happy with the repo structure too; I can easily see the other use cases fitting into this folder structure. I didn't try the other conversion pipelines (OCR or enrichment), but maybe you can demo that for us.
Signed-off-by: Fabiano Franz <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
254-254: ANSI escape codes remain - previous feedback not addressed.

The ANSI escape codes (`\x1b[1m` and `\033[0m`) flagged in a previous review are still present. These may not render correctly in all Jupyter environments, especially when exporting to HTML, PDF, or viewing in JupyterLab vs classic Notebook. As suggested in the previous review, use IPython's display capabilities for better compatibility:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"Conversion confidence for {file}:")
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the formatting:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
201-201: Prefer `{}` over `dict()` for empty dictionary initialization.

As per static analysis:

-confidence_reports = dict()
+confidence_reports = {}
203-229: Add error handling to prevent single file failures from stopping the entire batch.

The conversion loop lacks error handling. If any file fails to download, convert, or export, the entire batch processing will stop. This could be frustrating when processing multiple documents, especially with remote URLs that may be temporarily unavailable.
Consider wrapping the conversion logic in a try-except block:
 for file in files:
-    # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
-    converter = standard_converter
-
-    # Convert the file
-    print(f"Converting {file}...")
-    result = converter.convert(file)
-    document = result.document
-    dictionary = document.export_to_dict()
-
-    file_path = Path(file)
-
-    # Calculate conversion confidence
-    confidence_reports[file] = result.confidence
-
-    # Export the document to JSON
-    json_output_path = (output_dir / f"{file_path.stem}.json")
-    with open(json_output_path, "w", encoding="utf-8") as f:
-        json.dump(dictionary, f)
-    print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
-
-    # Export the document to Markdown
-    md_output_path = output_dir / f"{file_path.stem}.md"
-    with open(md_output_path, "w", encoding="utf-8") as f:
-        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
-        f.write(markdown)
-    print(f"Path of markdown output is: {Path(md_output_path).resolve()}")
+    try:
+        # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)
+        converter = standard_converter
+
+        # Convert the file
+        print(f"Converting {file}...")
+        result = converter.convert(file)
+        document = result.document
+        dictionary = document.export_to_dict()
+
+        file_path = Path(file)
+
+        # Calculate conversion confidence
+        confidence_reports[file] = result.confidence
+
+        # Export the document to JSON
+        json_output_path = (output_dir / f"{file_path.stem}.json")
+        with open(json_output_path, "w", encoding="utf-8") as f:
+            json.dump(dictionary, f)
+        print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
+
+        # Export the document to Markdown
+        md_output_path = output_dir / f"{file_path.stem}.md"
+        with open(md_output_path, "w", encoding="utf-8") as f:
+            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
+            f.write(markdown)
+        print(f"Path of markdown output is: {Path(md_output_path).resolve()}")
+    except Exception as e:
+        print(f"Error processing {file}: {e}")
+        continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.8)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 87-87: Consider using '{}' instead of a call to 'dict'.
(R1735)
| ")\n", | ||
| "\n", | ||
| "# Force OCR on the entire page\n", | ||
| "%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n", |
Hardcoded Tesseract data path may not work in all environments.
The TESSDATA_PREFIX environment variable is set to /usr/share/tesseract/tessdata, which assumes a specific installation location. This path may not exist in all environments, container images, or operating systems.
Consider one of these alternatives:
- Remove the line entirely if the default Tesseract installation already sets this variable correctly
- Check if the path exists before setting it
- Document the assumption that users must have Tesseract installed at this specific location
Example for option 2:
# Force OCR on the entire page
-%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
+import os
+tessdata_path = "/usr/share/tesseract/tessdata"
+if os.path.exists(tessdata_path):
+ %env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
+else:
+ print(f"Warning: Tesseract data path {tessdata_path} not found. Using default.")Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In notebooks/use-cases/document-conversion-standard.ipynb around line 139, the
notebook hardcodes TESSDATA_PREFIX to /usr/share/tesseract/tessdata which may
not exist in all environments; replace this by either removing the line, or
adding a conditional that detects whether that path exists and only sets
TESSDATA_PREFIX if present, otherwise leave it unset and add a short note in the
notebook explaining the assumption and how to set TESSDATA_PREFIX for custom
installs or containers.
| "source": [ | ||
| "from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n", | ||
| "from docling.document_converter import DocumentConverter, PdfFormatOption\n", | ||
| "from docling.datamodel.base_models import InputFormat\n", | ||
| "from docling.datamodel.pipeline_options import (\n", | ||
| " TesseractOcrOptions,\n", | ||
| " PdfPipelineOptions,\n", | ||
| ")\n", | ||
| "from docling.pipeline.vlm_pipeline import VlmPipeline\n", | ||
| "from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\n", | ||
| "\n", | ||
| "# Standard pipeline options\n", | ||
| "standard_pipeline_options = PdfPipelineOptions()\n", | ||
| "standard_pipeline_options.generate_picture_images = True\n", | ||
| "standard_pipeline_options.do_table_structure = True\n", | ||
| "standard_pipeline_options.table_structure_options.do_cell_matching = True\n", | ||
| "standard_converter = DocumentConverter(\n", | ||
| " format_options={\n", | ||
| " InputFormat.PDF: PdfFormatOption(\n", | ||
| " pipeline_options=standard_pipeline_options,\n", | ||
| " backend=DoclingParseV4DocumentBackend,\n", | ||
| " )\n", | ||
| " }\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Force OCR on the entire page\n", | ||
| "%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n", | ||
| "ocr_pipeline_options = PdfPipelineOptions()\n", | ||
| "ocr_pipeline_options.generate_picture_images = True\n", | ||
| "ocr_pipeline_options.do_table_structure = True\n", | ||
| "ocr_pipeline_options.table_structure_options.do_cell_matching = True\n", | ||
| "ocr_pipeline_options.do_ocr = True\n", | ||
| "ocr_pipeline_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n", | ||
| "ocr_pipeline_options.accelerator_options = AcceleratorOptions(\n", | ||
| " num_threads=4, device=AcceleratorDevice.AUTO\n", | ||
| ")\n", | ||
| "ocr_converter = DocumentConverter(\n", | ||
| " format_options={\n", | ||
| " InputFormat.PDF: PdfFormatOption(\n", | ||
| " pipeline_options=ocr_pipeline_options,\n", | ||
| " backend=DoclingParseV4DocumentBackend,\n", | ||
| " )\n", | ||
| " }\n", | ||
| ")\n", | ||
| "\n", | ||
| "# Code and formula enrichments and picture description and classification\n", | ||
| "enrichment_pipeline_options = PdfPipelineOptions()\n", | ||
| "enrichment_pipeline_options.generate_picture_images = True\n", | ||
| "enrichment_pipeline_options.do_table_structure = True\n", | ||
| "enrichment_pipeline_options.table_structure_options.do_cell_matching = True\n", | ||
| "enrichment_pipeline_options.do_code_enrichment = True\n", | ||
| "enrichment_pipeline_options.do_formula_enrichment = True\n", | ||
| "enrichment_pipeline_options.do_picture_description = True\n", | ||
| "enrichment_pipeline_options.images_scale = 2\n", | ||
| "enrichment_pipeline_options.do_picture_classification = True\n", | ||
| "enrichment_pipeline_options.accelerator_options = AcceleratorOptions(\n", | ||
| " num_threads=4, device=AcceleratorDevice.AUTO\n", | ||
| ")\n", | ||
| "enrichment_converter = DocumentConverter(\n", | ||
| " format_options={\n", | ||
| " InputFormat.PDF: PdfFormatOption(\n", | ||
| " pipeline_options=enrichment_pipeline_options,\n", | ||
| " backend=DoclingParseV4DocumentBackend,\n", | ||
| " )\n", | ||
| " }\n", | ||
| ")" | ||
| ] | ||
| }, |
Here, all three converters share similar base configurations that are repeated. Can we divide this into something like:
- Base configuration (DRY principle)
- Standard converter
- OCR converter (inherit from base)
- Enrichment converter
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
TesseractOcrOptions,
PdfPipelineOptions,
)
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
def create_base_options():
"""Create base pipeline options with common settings."""
options = PdfPipelineOptions()
options.generate_picture_images = True
options.do_table_structure = True
options.table_structure_options.do_cell_matching = True
return options
def create_converter(pipeline_options):
"""Create a DocumentConverter with the given pipeline options."""
return DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_options=pipeline_options,
backend=DoclingParseV4DocumentBackend,
)
}
)
# Standard converter
standard_options = create_base_options()
standard_converter = create_converter(standard_options)
# OCR converter
%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
ocr_options = create_base_options()
ocr_options.do_ocr = True
ocr_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
ocr_options.accelerator_options = AcceleratorOptions(
num_threads=4, device=AcceleratorDevice.AUTO
)
ocr_converter = create_converter(ocr_options)
# Enrichment converter
enrichment_options = create_base_options()
enrichment_options.do_code_enrichment = True
enrichment_options.do_formula_enrichment = True
enrichment_options.do_picture_description = True
enrichment_options.images_scale = 2
enrichment_options.do_picture_classification = True
enrichment_options.accelerator_options = AcceleratorOptions(
num_threads=4, device=AcceleratorDevice.AUTO
)
enrichment_converter = create_converter(enrichment_options)
I think it would also be good to add some context about the enrichment options cell to point at the docs and educate users more about enrichment (https://docling-project.github.io/docling/usage/enrichments/) since it's not technically a conversion option but something that happens at the end of a conversion.
Yes, having docs about enrichment would be nice.
After looking at the overlap in the options more, I realized I had always thought of enrichment as being independent of conversion, since you can convert with or without it. Maybe a section on enrichment could just be done with supplementary pipeline_options for enrichment instead of labeling it as its own converter.
After looking at the overlap in the options more, I realized I had always thought of enrichment as being independent of conversion, since you can convert with or without it. Maybe a section on enrichment could just be done with supplementary pipeline_options for enrichment instead of labeling it as its own converter.
Agree, working on it.
I added a separate section for enrichments and moved things around a bit to adjust the flow (setting enrichments must sit between the creation of pipeline options and the creation of the converter). I implemented approach 1, but given those changes in the flow it's get_pipeline_options rather than get_converter. Let me know what you think.
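Sketched out, that flow might look like the following. `get_pipeline_options` and `pipeline_to_use` are the names mentioned in this thread, and the enrichment defaults shown here are assumptions, not the notebook's exact cell:

pipeline_options = get_pipeline_options(pipeline_to_use)  # "standard" or "ocr"

# Enrichments sit between option creation and converter creation
pipeline_options.do_code_enrichment = False          # assumed off by default
pipeline_options.do_formula_enrichment = False
pipeline_options.do_picture_classification = False
pipeline_options.do_picture_description = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)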
| "\n", | ||
| "Next we set the configuration options for our conversion pipeline. \n", | ||
| "\n", | ||
| "The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another one which enables code, formula, and picture enrichments. Later in the *Conversion* section, you'll set the converter to either `standard_converter`, `ocr_converter`, or `enrichment_converter` depending on which conversion technique you'd like to use.\n", |
The notebook gives an overview of all the options for the converter. Wondering if it is possible to show how different pipeline options impact the confidence scores of the produced jsons. Right now, seems like we leave that exercise to the user and use the standard converter.
Do you mean performing multiple conversions for a given pdf, with the intention of comparing the different conversion techniques and how they impact the quality of the results and the conversion confidence?
I like that idea. It could be a separate notebook, specifically targeted as a tool for conversion comparisons and finding the best technique. The quality of the conversion varies a lot depending on the file, for some the standard pipeline will produce good enough results, for others you get a bad confidence and fix that with OCR, for some others only a VLM will work.
But yeah, in this notebook the intention is to leave the exercise to the users, who can switch between techniques by setting pipeline_to_use and tweaking the enrichment knobs. ;)
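For anyone who does want to run that comparison themselves, a hypothetical sketch: the converter variables come from the notebook's configuration cell, but this comparison loop is not part of the PR.

# Convert one PDF with each technique and compare confidence scores.
techniques = {
    "standard": standard_converter,
    "ocr": ocr_converter,
    "enrichment": enrichment_converter,
}
pdf = files[0]  # any of the sample files
for name, technique_converter in techniques.items():
    report = technique_converter.convert(pdf).confidence
    print(f"{name}: {report.mean_grade.name} (score {report.mean_score:.3f})")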
| "outputs": [], | ||
| "source": [ | ||
| "files = [\n", | ||
| " \"https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf\",\n", |
Are these files guaranteed to live here? If not, we could check-in some files to our own repo and refer to them.
I think they're stable enough, but definitely not guaranteed. I suggest we address this more broadly in a separate PR; maybe add an assets folder to the repo and our own collection of sample PDFs.
@shruthis4 This is a good call out. We should discuss whether or not we want input files/code for notebooks within the repo at all.
Some thoughts around this:
- I think some users might find it beneficial to copy notebooks out of the repo as standalone files and just get going.
- If we have tests continuously running, then if any URLs go out of date we should be able to quickly see such an error and adjust the URL accordingly.
- We could also do a hybrid approach where we have our own `sample-files` directory and point to the link there for all notebooks as well.
Actionable comments posted: 0
♻️ Duplicate comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
141-141: Hardcoded Tesseract data path may not work in all environments.

The `TESSDATA_PREFIX` environment variable is set to `/usr/share/tesseract/tessdata`, which assumes a specific installation location. This path may not exist in all environments, container images, or operating systems. Consider one of these alternatives:

- Remove the line entirely if the default Tesseract installation already sets this variable correctly
- Check if the path exists before setting it
- Document the assumption in a markdown cell that users must have Tesseract installed at this specific location

Example for option 2:

import os
tessdata_path = "/usr/share/tesseract/tessdata"
if os.path.exists(tessdata_path):
    %env TESSDATA_PREFIX=/usr/share/tesseract/tessdata
else:
    print(f"Warning: Tesseract data path {tessdata_path} not found. Using default.")
317-317: ANSI escape codes may not render in all notebook environments.

While ANSI escape codes (`\x1b[1m` and `\033[0m`) work in terminal output, they may not render correctly in all Jupyter notebook environments or when exported to other formats. Consider using IPython's display capabilities for better cross-environment compatibility:

from IPython.display import display, Markdown

for file, confidence_report in confidence_reports.items():
    print(f"✓ Conversion confidence for {file}:")
    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))
    low_score_pages = []
    for page in confidence_report.pages:
        page_confidence_report = confidence_report.pages[page]
        if page_confidence_report.mean_score < confidence_report.mean_score:
            low_score_pages.append(page)
    print(f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages) or 'none'}")
    print()

Or simply remove the escape codes if bold formatting is not critical:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (3)
notebooks/use-cases/document-conversion-standard.ipynb (3)
123-123: Remove unused import.

The `VlmPipeline` import is not used anywhere in this notebook. Remove it to keep the imports clean. Apply this diff:

-from docling.pipeline.vlm_pipeline import VlmPipeline
 from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
267-267: Prefer `{}` literal over `dict()` call.

Static analysis suggests using `{}` instead of `dict()` for creating an empty dictionary, which is more concise and Pythonic. Apply this diff:

-confidence_reports = dict()
+confidence_reports = {}
269-292: Add error handling for file operations.

The conversion loop performs file I/O operations (reading source files, writing JSON and Markdown) without error handling. If a file operation fails (e.g., network timeout for URL, disk full, permission denied), the entire loop will stop and the user won't get clear feedback.

Consider wrapping the conversion and file operations in a try-except block:

for file in files:
    try:
        # Convert the file
        print(f"Converting {file}...")
        result = converter.convert(file)
        document = result.document
        dictionary = document.export_to_dict()

        file_path = Path(file)

        # Calculate conversion confidence
        confidence_reports[file] = result.confidence

        # Export the document to JSON
        json_output_path = (output_dir / f"{file_path.stem}.json")
        with open(json_output_path, "w", encoding="utf-8") as f:
            json.dump(dictionary, f)
        print(f"✓ Path of JSON output is: {json_output_path.resolve()}")

        # Export the document to Markdown
        md_output_path = output_dir / f"{file_path.stem}.md"
        with open(md_output_path, "w", encoding="utf-8") as f:
            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
            f.write(markdown)
        print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
    except Exception as e:
        print(f"✗ Error converting {file}: {e}")
        continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.9)
notebooks/use-cases/document-conversion-standard.ipynb
[refactor] 111-111: Consider using '{}' instead of a call to 'dict'.
(R1735)
🪛 Ruff (0.13.3)
notebooks/use-cases/document-conversion-standard.ipynb
67-70: Avoid specifying long messages outside the exception class
(TRY003)
…ements Signed-off-by: Fabiano Franz <[email protected]>
Force-pushed ab204f7 to b21084c.
Actionable comments posted: 0
♻️ Duplicate comments (1)
notebooks/use-cases/document-conversion-standard.ipynb (1)
320-320: ANSI escape codes may not render in all notebook environments.

This issue was flagged in a previous review but remains unaddressed. ANSI escape codes (`\x1b[1m` and `\033[0m`) may not render correctly in all Jupyter environments or when exported to other formats. For better compatibility, use IPython's display capabilities or remove the formatting:

+from IPython.display import display, Markdown
+
 for file, confidence_report in confidence_reports.items():
     print(f"✓ Conversion confidence for {file}:")
-    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+    display(Markdown(f"Average confidence: **{confidence_report.mean_grade.name}** (score {confidence_report.mean_score:.3f})"))

Or simply remove the escape codes:

-print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
+print(f"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})")
🧹 Nitpick comments (2)
notebooks/use-cases/document-conversion-standard.ipynb (2)
282-282: Consider using `urlparse` or checking for URL scheme.

Using `Path(file)` on URL strings is fragile and semantically incorrect. While it may work for simple URLs ending with a filename, it will fail for URLs with query parameters or fragments (e.g., `example.com/file.pdf?version=1`). Consider using `urllib.parse.urlparse` to extract the filename properly:

+from urllib.parse import urlparse
+
 # Calculate conversion confidence
 confidence_reports[file] = result.confidence

-file_path = Path(file)
+# Extract filename from URL or local path
+if file.startswith(('http://', 'https://')):
+    parsed_url = urlparse(file)
+    file_path = Path(parsed_url.path)
+else:
+    file_path = Path(file)

Alternatively, if you want a simpler approach that works for both URLs and paths:

-file_path = Path(file)
+# Extract base filename from URL or local path
+file_path = Path(file.split('?')[0].split('#')[0])
268-295: Consider adding error handling for conversion and file operations.

The conversion loop lacks error handling, which means a single failed conversion will stop processing all remaining files. Consider wrapping operations in try-except blocks to handle errors gracefully and continue processing other files.

 for file in files:
-    # Convert the file
-    print(f"Converting {file}...")
-
-    result = converter.convert(file)
-    document = result.document
-    dictionary = document.export_to_dict()
-
-    # Calculate conversion confidence
-    confidence_reports[file] = result.confidence
-
-    file_path = Path(file)
-
-    # Export the document to JSON
-    json_output_path = (output_dir / f"{file_path.stem}.json")
-    with open(json_output_path, "w", encoding="utf-8") as f:
-        json.dump(dictionary, f)
-    print(f"✓ Path of JSON output is: {json_output_path.resolve()}")
-
-    # Export the document to Markdown
-    md_output_path = output_dir / f"{file_path.stem}.md"
-    with open(md_output_path, "w", encoding="utf-8") as f:
-        markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
-        f.write(markdown)
-    print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
+    try:
+        # Convert the file
+        print(f"Converting {file}...")
+
+        result = converter.convert(file)
+        document = result.document
+        dictionary = document.export_to_dict()
+
+        # Calculate conversion confidence
+        confidence_reports[file] = result.confidence
+
+        file_path = Path(file)
+
+        # Export the document to JSON
+        json_output_path = (output_dir / f"{file_path.stem}.json")
+        with open(json_output_path, "w", encoding="utf-8") as f:
+            json.dump(dictionary, f)
+        print(f"✓ Path of JSON output is: {json_output_path.resolve()}")
+
+        # Export the document to Markdown
+        md_output_path = output_dir / f"{file_path.stem}.md"
+        with open(md_output_path, "w", encoding="utf-8") as f:
+            markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
+            f.write(markdown)
+        print(f"✓ Path of markdown output is: {md_output_path.resolve()}")
+    except Exception as e:
+        print(f"✗ Error processing {file}: {e}")
+        continue
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
notebooks/use-cases/document-conversion-standard.ipynb (1 hunks)
🧰 Additional context used
🪛 Ruff (0.13.3)
notebooks/use-cases/document-conversion-standard.ipynb
66-69: Avoid specifying long messages outside the exception class
(TRY003)
110-110: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (7)
notebooks/use-cases/document-conversion-standard.ipynb (7)
32-32: LGTM! The installation cell correctly installs Docling with quiet output.

62-65: LGTM! The file list configuration is clear and mixes URLs and local paths appropriately. As noted in past reviews, URL stability is a separate concern to be addressed later.

85-90: LGTM! The output directory creation properly uses `parents=True` and `exist_ok=True` for safe directory creation.

116-168: LGTM! The pipeline configuration structure is well-organized with:
- A factory function for standard options following DRY principles
- Separate OCR configuration building on the standard options
- A selection function with clear error handling
This addresses previous feedback about refactoring converter creation.
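A sketch of what that selection function might look like, assuming the option objects defined earlier; the names here follow the review discussion and are not copied from the notebook:

def get_pipeline_options(pipeline: str) -> PdfPipelineOptions:
    """Map a user-facing pipeline name to its options object."""
    options_by_name = {
        "standard": standard_pipeline_options,
        "ocr": ocr_pipeline_options,
    }
    if pipeline not in options_by_name:
        # Fail loudly on misconfiguration instead of silently falling back
        raise ValueError(f"Unknown pipeline {pipeline!r}; expected 'standard' or 'ocr'")
    return options_by_name[pipeline]

pipeline_options = get_pipeline_options(pipeline_to_use)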
190-195: LGTM! The pipeline selection mechanism properly addresses past feedback by using a configuration variable rather than hardcoding the converter in the loop.

226-233: LGTM! Enrichment configuration is clear, with all options disabled by default and helpful comments about the performance trade-offs.

317-330: LGTM with suggestions above. The confidence reporting logic correctly computes per-file metrics and identifies low-scoring pages by comparing against the mean. The page number adjustment (+1) properly converts from 0-indexed to user-facing 1-indexed numbering.
alimaredia left a comment
Looks like all comments were addressed
Description
https://issues.redhat.com/browse/RHAIENG-1045
Adds a notebook for standard Docling conversions. Supports standard (default) configs, OCR, and enrichments.
How Has This Been Tested?
Merge criteria: