369 changes: 369 additions & 0 deletions notebooks/use-cases/document-conversion-standard.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "7c83150f-fa8b-42a1-8974-ce9483912fba",
"metadata": {},
"source": [
"# Data Processing: Document Conversion with Standard Docling\n",
"\n",
"This notebook uses **standard (non-VLM)** [Docling](https://docling-project.github.io/docling/) techniques to convert PDF documents into markdown and the [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) format, a structured representation of the original document that can be exported as JSON.\n",
"\n",
"Conversions using the standard pipeline options generally yield good, fast results for most documents. In some cases, however, alternative conversion pipelines can lead to better outcomes. For instance, forcing OCR is effective for scanned documents or images that contain text to be extracted and analyzed. In cases where relevant information is contained within formulas, code, or pictures, enrichments might be useful. All these use cases are supported by this notebook."
]
},
{
"cell_type": "markdown",
"id": "d58fb60e",
"metadata": {},
"source": [
"## 📦 Installation\n",
"\n",
"Install the [Docling](https://docling-project.github.io/docling/) package into this notebook environment. Run this once per session; it may take a minute. If you restart the kernel or change runtimes, re-run this cell before continuing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cf4340c-cfd4-418c-955b-be8c0d544e67",
"metadata": {},
"outputs": [],
"source": [
"!pip install -qq docling"
]
},
{
"cell_type": "markdown",
"id": "6f5e2659-e626-4235-97fc-f311adf8f5b7",
"metadata": {},
"source": [
"## 🔧 Configuration\n",
"\n",
"### Set files to convert\n",
"\n",
"Set the list of PDF files to convert. You can mix public web URLs and local file paths; each entry will be processed in order. Replace the examples with your own documents as needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8d12166-8b40-4c46-9147-27cfc1c8b09a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"files = [\n",
" \"https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf\",\n",
" \"https://github.com/docling-project/docling/raw/v2.43.0/tests/data/pdf/2203.01017v2.pdf\"\n",
"]"
]
},
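{
"cell_type": "markdown",
"id": "a1f2c3d4-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"### Optional: validate inputs before converting\n",
"\n",
"The next cell is a small sanity check (an addition to the original flow, using only standard-library calls) that each entry in `files` is either a reachable URL or an existing local path. Feel free to skip it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f2c3d4-1111-4111-8111-000000000002",
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"from pathlib import Path\n",
"\n",
"for f in files:\n",
"    if f.startswith((\"http://\", \"https://\")):\n",
"        # A HEAD request is enough to confirm the URL is reachable\n",
"        request = urllib.request.Request(f, method=\"HEAD\")\n",
"        with urllib.request.urlopen(request) as response:\n",
"            print(f\"✓ Reachable ({response.status}): {f}\")\n",
"    else:\n",
"        assert Path(f).is_file(), f\"Local file not found: {f}\"\n",
"        print(f\"✓ Found local file: {f}\")"
]
},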
{
"cell_type": "markdown",
"id": "20fa56da",
"metadata": {},
"source": [
"### Set output directory\n",
"\n",
"Choose where to save results. This notebook creates the folder if it doesn’t exist and writes one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` file per source file, using the source's base name."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46dd4e6e",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"output_dir_name = \"document-conversion-standard/output\"\n",
"\n",
"output_dir = Path(output_dir_name)\n",
"output_dir.mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"id": "39d4e3ef-2fdb-45a3-a217-b07638a35363",
"metadata": {},
"source": [
"### Configure conversion pipelines\n",
"\n",
"Next we create the configuration options for the conversion pipelines supported by this notebook. \n",
"\n",
"The next cell configures two combinations of pipeline options: the **default (standard)** options, and a variant that **forces OCR** on the entire document. In a later cell you'll choose either the `standard` or `ocr` pipeline options depending on which conversion technique you'd like to use.\n",
"\n",
"Note: OCR requires the Tesseract binary to run. Please refer to the Docling [installation](https://docling-project.github.io/docling/installation/) docs if you're not running this notebook from a Workbench image that has it installed already. \n",
"\n",
"For additional customization and a complete reference of Docling's conversion pipeline configuration, check the [official documentation](https://docling-project.github.io/docling/examples/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9be47eb2-8e2d-445c-a5de-fcdd17ef7097",
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n",
"from docling.document_converter import DocumentConverter, PdfFormatOption\n",
"from docling.datamodel.base_models import InputFormat\n",
"from docling.datamodel.pipeline_options import (\n",
" TesseractOcrOptions,\n",
" PdfPipelineOptions,\n",
")\n",
"from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\n",
"\n",
"def create_standard_options() -> PdfPipelineOptions:\n",
" \"\"\"Create base pipeline options with standard settings.\"\"\"\n",
" pipeline_options = PdfPipelineOptions()\n",
" pipeline_options.generate_picture_images = True\n",
" pipeline_options.do_table_structure = True\n",
" pipeline_options.table_structure_options.do_cell_matching = True\n",
" pipeline_options.accelerator_options = AcceleratorOptions(\n",
" num_threads=4, device=AcceleratorDevice.AUTO\n",
" )\n",
" return pipeline_options\n",
"\n",
"# Standard converter\n",
"standard_options = create_standard_options()\n",
"\n",
"# OCR converter: force OCR on the entire page\n",
"# Tesseract needs TESSDATA_PREFIX to locate its language data; set it only if\n",
"# the default Workbench image location exists, otherwise leave the environment as-is\n",
"import os\n",
"\n",
"tessdata_path = \"/usr/share/tesseract/tessdata\"\n",
"if os.path.isdir(tessdata_path):\n",
"    os.environ[\"TESSDATA_PREFIX\"] = tessdata_path\n",
"ocr_options = create_standard_options()\n",
"ocr_options.do_ocr = True\n",
"ocr_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n",
"\n",
"def get_pipeline_options(pipeline_options_name: str = \"standard\") -> PdfPipelineOptions:\n",
" \"\"\"Get the configured pipeline options based on name.\n",
"\n",
" Args:\n",
" pipeline_options_name: One of \"standard\" or \"ocr\"\n",
" \n",
" Returns:\n",
" PdfPipelineOptions instance\n",
" \n",
" Raises:\n",
" ValueError: If pipeline_options_name is not recognized\n",
" \"\"\"\n",
" pipeline_options = {\n",
" \"standard\": standard_options,\n",
" \"ocr\": ocr_options\n",
" }\n",
"\n",
" if pipeline_options_name not in pipeline_options:\n",
" raise ValueError(\n",
" f\"Unknown pipeline options name: '{pipeline_options_name}'. \"\n",
" f\"Choose from {list(pipeline_options.keys())}\"\n",
" )\n",
" \n",
" return pipeline_options[pipeline_options_name]"
]
},
{
"cell_type": "markdown",
"id": "89396e34",
"metadata": {},
"source": [
"### Choose a conversion pipeline\n",
"\n",
"Next we choose the conversion pipeline to be used in the conversion.\n",
"\n",
"The **standard** pipeline generally yields good, fast results for the majority of documents. However, if the standard pipeline didn't produce good results, or if you're converting scanned documents or ones that contain relevant text within images, consider using the **ocr** pipeline. Just set `pipeline_to_use` to either `standard` or `ocr` accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "621f644c",
"metadata": {},
"outputs": [],
"source": [
"# Set the pipeline to use (either \"standard\" or \"ocr\")\n",
"pipeline_to_use = \"standard\"\n",
"\n",
"pipeline_options = get_pipeline_options(pipeline_to_use)\n",
"\n",
"print(f\"✓ Using '{pipeline_to_use}' pipeline\")"
]
},
{
"cell_type": "markdown",
"id": "b286e6f2",
"metadata": {},
"source": [
"### Configure enrichments\n",
"\n",
"Depending on the characteristics of the documents being converted, you may benefit from the use of enrichments.\n",
"\n",
"Docling supports the enrichment of conversion pipelines with additional steps that'll process specific document components like code blocks, formulas, and pictures.\n",
"\n",
"All enrichment features are disabled by default, but you may enable them individually by setting one or more of the enrichment options in the next cell to `True`. Note that the extra steps usually require additional model executions, which may increase processing time considerably.\n",
"\n",
" * `do_code_enrichment`: Performs advanced parsing of code blocks and sets the code language of each block accordingly.\n",
" * `do_formula_enrichment`: Analyzes equations and extracts their LaTeX representation.\n",
" * `do_picture_description`: Generates captions for pictures with a vision model.\n",
" * `do_picture_classification`: Classifies pictures, for example chart types, flow diagrams, logos, or signatures.\n",
" \n",
"For additional customization and a complete reference of Docling's enrichments, check the [official documentation](https://docling-project.github.io/docling/usage/enrichments/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d5bf3df",
"metadata": {},
"outputs": [],
"source": [
"# Enable individual enrichments (code, formula, picture description/classification) by setting the corresponding option to True\n",
"pipeline_options.do_code_enrichment = False\n",
"pipeline_options.do_formula_enrichment = False\n",
"pipeline_options.do_picture_description = False\n",
"pipeline_options.do_picture_classification = False\n",
"\n",
"# If you enable enrichments, you may benefit from increasing the image scale (e.g. to 2)\n",
"pipeline_options.images_scale = 1\n"
]
},
{
"cell_type": "markdown",
"id": "78ab035e-e05e-41e8-be90-23527b5d4bc4",
"metadata": {},
"source": [
"## ✨ Conversion\n",
"\n",
"Finally, use the pipeline options we configured to convert every document into one `json` file ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` (Markdown) file, both stored in the output directory configured earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2a45e16-8fa0-4223-9890-7d75f6869aeb",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from docling_core.types.doc import ImageRefMode\n",
"\n",
"# Create the document converter\n",
"converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=pipeline_options,\n",
" backend=DoclingParseV4DocumentBackend,\n",
" )\n",
" }\n",
" )\n",
"\n",
"confidence_reports = {}\n",
"\n",
"if not files:\n",
" raise ValueError(\"No input files specified. Please set the 'files' list above.\")\n",
"\n",
"for file in files:\n",
" # Convert the file\n",
" print(f\"Converting {file}...\")\n",
"\n",
" result = converter.convert(file)\n",
" document = result.document\n",
" dictionary = document.export_to_dict()\n",
"\n",
" # Calculate conversion confidence\n",
" confidence_reports[file] = result.confidence\n",
"\n",
" file_path = Path(file)\n",
"\n",
" # Export the document to JSON\n",
" json_output_path = (output_dir / f\"{file_path.stem}.json\")\n",
" with open(json_output_path, \"w\", encoding=\"utf-8\") as f:\n",
" json.dump(dictionary, f)\n",
" print(f\"✓ Path of JSON output is: {json_output_path.resolve()}\")\n",
"\n",
" # Export the document to Markdown\n",
" md_output_path = output_dir / f\"{file_path.stem}.md\"\n",
" with open(md_output_path, \"w\", encoding=\"utf-8\") as f:\n",
" markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)\n",
" f.write(markdown)\n",
" print(f\"✓ Path of markdown output is: {md_output_path.resolve()}\")"
]
},
{
"cell_type": "markdown",
"id": "13802403-9342-4e33-96a9-b04b4fedd070",
"metadata": {},
"source": [
"### Conversion confidence\n",
"\n",
"When converting a document, Docling can calculate how confident it is in the quality of the conversion. This *confidence* is expressed as both a *score* and a *grade*. The score is a numeric value between 0 and 1, and the grade is a label that can be **poor**, **fair**, **good**, or **excellent**. If Docling is unable to calculate a confidence grade, the value will be marked as *unspecified*.\n",
"\n",
"If your document receives a low score (for example, below 0.8) and a grade of *poor* or *fair*, you'll probably benefit from using a different conversion technique. In that case, go back to the *Choose a conversion pipeline* section, select a different approach (e.g. forcing OCR), re-run the conversion, and compare the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32624c0e-f32c-48c9-85ba-6ab6a3d4cebc",
"metadata": {},
"outputs": [],
"source": [
"for file, confidence_report in confidence_reports.items():\n",
" print(f\"✓ Conversion confidence for {file}:\")\n",
" \n",
" print(f\"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})\")\n",
" \n",
" low_score_pages = []\n",
" for page in confidence_report.pages:\n",
" page_confidence_report = confidence_report.pages[page]\n",
" if page_confidence_report.mean_score < confidence_report.mean_score:\n",
" low_score_pages.append(page)\n",
"\n",
" print(f\"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages) or 'none'}\")\n",
" \n",
" print()"
]
},
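{
"cell_type": "markdown",
"id": "b2e3d4f5-2222-4222-8222-000000000003",
"metadata": {},
"source": [
"### Optional: reload a converted document\n",
"\n",
"As a minimal sketch (an addition to the original flow), the exported JSON files can be loaded back into [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) objects via the pydantic `model_validate` API of `docling_core`. This assumes `output_dir` still points at the directory configured earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2e3d4f5-2222-4222-8222-000000000004",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from docling_core.types.doc import DoclingDocument\n",
"\n",
"for json_file in sorted(output_dir.glob(\"*.json\")):\n",
"    with open(json_file, encoding=\"utf-8\") as f:\n",
"        data = json.load(f)\n",
"    # DoclingDocument is a pydantic model, so model_validate rebuilds it from the dict\n",
"    doc = DoclingDocument.model_validate(data)\n",
"    print(f\"✓ {json_file.name}: {len(doc.texts)} text items, {len(doc.tables)} tables\")"
]
},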
{
"cell_type": "markdown",
"id": "56dc2c21",
"metadata": {},
"source": [
"## 🍩 Additional resources\n",
"\n",
"For additional example notebooks related to Data Processing, check the [Open Data Hub Data Processing](https://github.com/opendatahub-io/odh-data-processing/) repository on GitHub.\n",
"\n",
"### Any Feedback?\n",
"\n",
"We'd love to hear if you have any feedback on this or any other notebook in this series! Please [open an issue](https://github.com/opendatahub-io/odh-data-processing/issues) and help us improve our demos."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}