369 changes: 369 additions & 0 deletions notebooks/use-cases/document-conversion-standard.ipynb
{
"cells": [
{
"cell_type": "markdown",
"id": "7c83150f-fa8b-42a1-8974-ce9483912fba",
"metadata": {},
"source": [
"# Data Processing: Document Conversion with Standard Docling\n",
"\n",
"This notebook uses **standard (non-VLM)** [Docling](https://docling-project.github.io/docling/) techniques to convert PDF documents into markdown and the [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) format, a structured representation of the original document that can be exported as JSON.\n",
"\n",
"Conversions using the standard pipeline options generally yield good, fast results for most documents. In some cases, however, alternative conversion pipelines can lead to better outcomes. For instance, forcing OCR is effective for scanned documents or images that contain text to be extracted and analyzed. In cases where relevant information is contained within formulas, code, or pictures, enrichments might be useful. All these use cases are supported by this notebook."
]
},
{
"cell_type": "markdown",
"id": "d58fb60e",
"metadata": {},
"source": [
"## 📦 Installation\n",
"\n",
"Install the [Docling](https://docling-project.github.io/docling/) package into this notebook environment. Run this once per session; it may take a minute. If you restart the kernel or change runtimes, re-run this cell before continuing."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cf4340c-cfd4-418c-955b-be8c0d544e67",
"metadata": {},
"outputs": [],
"source": [
"!pip install -qq docling"
]
},
{
"cell_type": "markdown",
"id": "6f5e2659-e626-4235-97fc-f311adf8f5b7",
"metadata": {},
"source": [
"## 🔧 Configuration\n",
"\n",
"### Set files to convert\n",
"\n",
"Set the list of PDF files to convert. You can mix public web URLs and local file paths; each entry will be processed in order. Replace the examples with your own documents as needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8d12166-8b40-4c46-9147-27cfc1c8b09a",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": [
"parameters"
]
},
"outputs": [],
"source": [
"files = [\n",
" \"https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf\",\n",
" \"https://github.com/docling-project/docling/raw/v2.43.0/tests/data/pdf/2203.01017v2.pdf\"\n",
"]"
]
},
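{
"cell_type": "markdown",
"id": "a1f2c3d4-1111-4111-8111-000000000001",
"metadata": {},
"source": [
"### Optional: validate inputs before converting\n",
"\n",
"The next cell is a small sanity check (an addition to the original flow, using only standard-library calls) that each entry in `files` is either a reachable URL or an existing local path. Feel free to skip it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1f2c3d4-1111-4111-8111-000000000002",
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"from pathlib import Path\n",
"\n",
"for f in files:\n",
"    if f.startswith((\"http://\", \"https://\")):\n",
"        # A HEAD request is enough to confirm the URL is reachable\n",
"        request = urllib.request.Request(f, method=\"HEAD\")\n",
"        with urllib.request.urlopen(request) as response:\n",
"            print(f\"✓ Reachable ({response.status}): {f}\")\n",
"    else:\n",
"        assert Path(f).is_file(), f\"Local file not found: {f}\"\n",
"        print(f\"✓ Found local file: {f}\")"
]
},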
{
"cell_type": "markdown",
"id": "20fa56da",
"metadata": {},
"source": [
"### Set output directory\n",
"\n",
"Choose where to save results. This notebook creates the folder if it doesn’t exist and writes one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` file per source file, using the source's base name."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46dd4e6e",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"output_dir_name = \"document-conversion-standard/output\"\n",
"\n",
"output_dir = Path(output_dir_name)\n",
"output_dir.mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "markdown",
"id": "39d4e3ef-2fdb-45a3-a217-b07638a35363",
"metadata": {},
"source": [
"### Configure conversion pipelines\n",
"\n",
"Next we create the configuration options for the conversion pipelines supported by this notebook. \n",
"\n",
"The next cell configures two combinations of pipeline options: the **default (standard)** options, and a variant that **forces OCR** on the entire document. In a later cell you'll choose either the `standard` or `ocr` pipeline options depending on which conversion technique you'd like to use.\n",
"\n",
"Note: OCR requires the Tesseract binary to run. Please refer to the Docling [installation](https://docling-project.github.io/docling/installation/) docs if you're not running this notebook from a Workbench image that has it installed already. \n",
"\n",
"For additional customization and a complete reference of Docling's conversion pipeline configuration, check the [official documentation](https://docling-project.github.io/docling/examples/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9be47eb2-8e2d-445c-a5de-fcdd17ef7097",
"metadata": {},
"outputs": [],
"source": [
"from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions\n",
"from docling.document_converter import DocumentConverter, PdfFormatOption\n",
"from docling.datamodel.base_models import InputFormat\n",
"from docling.datamodel.pipeline_options import (\n",
" TesseractOcrOptions,\n",
" PdfPipelineOptions,\n",
")\n",
"from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\n",
"\n",
"def create_standard_options() -> PdfPipelineOptions:\n",
" \"\"\"Create base pipeline options with standard settings.\"\"\"\n",
" pipeline_options = PdfPipelineOptions()\n",
" pipeline_options.generate_picture_images = True\n",
" pipeline_options.do_table_structure = True\n",
" pipeline_options.table_structure_options.do_cell_matching = True\n",
" pipeline_options.accelerator_options = AcceleratorOptions(\n",
" num_threads=4, device=AcceleratorDevice.AUTO\n",
" )\n",
" return pipeline_options\n",
"\n",
"# Standard converter\n",
"standard_options = create_standard_options()\n",
"\n",
"# OCR converter: force OCR on the entire page\n",
"# Tesseract needs TESSDATA_PREFIX to locate its language data; set it only if\n",
"# the default Workbench image location exists, otherwise leave the environment as-is\n",
"import os\n",
"\n",
"tessdata_path = \"/usr/share/tesseract/tessdata\"\n",
"if os.path.isdir(tessdata_path):\n",
"    os.environ[\"TESSDATA_PREFIX\"] = tessdata_path\n",
"ocr_options = create_standard_options()\n",
"ocr_options.do_ocr = True\n",
"ocr_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n",
"\n",
"def get_pipeline_options(pipeline_options_name: str = \"standard\") -> PdfPipelineOptions:\n",
" \"\"\"Get the configured pipeline options based on name.\n",
"\n",
" Args:\n",
" pipeline_options_name: One of \"standard\" or \"ocr\"\n",
" \n",
" Returns:\n",
" PdfPipelineOptions instance\n",
" \n",
" Raises:\n",
" ValueError: If pipeline_options_name is not recognized\n",
" \"\"\"\n",
" pipeline_options = {\n",
" \"standard\": standard_options,\n",
" \"ocr\": ocr_options\n",
" }\n",
"\n",
" if pipeline_options_name not in pipeline_options:\n",
" raise ValueError(\n",
" f\"Unknown pipeline options name: '{pipeline_options_name}'. \"\n",
" f\"Choose from {list(pipeline_options.keys())}\"\n",
" )\n",
" \n",
" return pipeline_options[pipeline_options_name]"
]
},
{
"cell_type": "markdown",
"id": "89396e34",
"metadata": {},
"source": [
"### Choose a conversion pipeline\n",
"\n",
"Next we choose the conversion pipeline to be used in the conversion.\n",
"\n",
"The **standard** pipeline generally yields good, fast results for the majority of documents. However, if the standard pipeline didn't produce good results, or if you're converting scanned documents or ones that contain relevant text within images, consider using the **ocr** pipeline. Just set `pipeline_to_use` to either `standard` or `ocr` accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "621f644c",
"metadata": {},
"outputs": [],
"source": [
"# Set the pipeline to use (either \"standard\" or \"ocr\")\n",
"pipeline_to_use = \"standard\"\n",
"\n",
"pipeline_options = get_pipeline_options(pipeline_to_use)\n",
"\n",
"print(f\"✓ Using '{pipeline_to_use}' pipeline\")"
]
},
{
"cell_type": "markdown",
"id": "b286e6f2",
"metadata": {},
"source": [
"### Configure enrichments\n",
"\n",
"Depending on the characteristics of the documents being converted, you may benefit from the use of enrichments.\n",
"\n",
"Docling supports the enrichment of conversion pipelines with additional steps that'll process specific document components like code blocks, formulas, and pictures.\n",
"\n",
"All enrichment features are disabled by default, but you may enable them individually by setting one or more of the enrichment options in the next cell to `True`. Note that the extra steps usually require additional model executions, which may increase processing time considerably.\n",
"\n",
" * `do_code_enrichment`: Performs advanced parsing of code blocks and sets the code language of each block accordingly.\n",
" * `do_formula_enrichment`: Analyzes equations and extracts their LaTeX representation.\n",
" * `do_picture_description`: Generates captions for pictures with a vision model.\n",
" * `do_picture_classification`: Classifies pictures, for example chart types, flow diagrams, logos, or signatures.\n",
" \n",
"For additional customization and a complete reference of Docling's enrichments, check the [official documentation](https://docling-project.github.io/docling/usage/enrichments/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d5bf3df",
"metadata": {},
"outputs": [],
"source": [
"# Enable individual enrichments (code, formula, picture description/classification) by setting the corresponding option to True\n",
"pipeline_options.do_code_enrichment = False\n",
"pipeline_options.do_formula_enrichment = False\n",
"pipeline_options.do_picture_description = False\n",
"pipeline_options.do_picture_classification = False\n",
"\n",
"# If you enable enrichments, you may benefit from increasing the image scale (e.g. to 2)\n",
"pipeline_options.images_scale = 1\n"
]
},
{
"cell_type": "markdown",
"id": "78ab035e-e05e-41e8-be90-23527b5d4bc4",
"metadata": {},
"source": [
"## ✨ Conversion\n",
"\n",
"Finally, use the pipeline options we configured to convert every document into one `json` file ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` (Markdown) file, both stored in the output directory configured earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2a45e16-8fa0-4223-9890-7d75f6869aeb",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from docling_core.types.doc import ImageRefMode\n",
"\n",
"# Create the document converter\n",
"converter = DocumentConverter(\n",
" format_options={\n",
" InputFormat.PDF: PdfFormatOption(\n",
" pipeline_options=pipeline_options,\n",
" backend=DoclingParseV4DocumentBackend,\n",
" )\n",
" }\n",
" )\n",
"\n",
"confidence_reports = {}\n",
"\n",
"if not files:\n",
" raise ValueError(\"No input files specified. Please set the 'files' list above.\")\n",
"\n",
"for file in files:\n",
" # Convert the file\n",
" print(f\"Converting {file}...\")\n",
"\n",
" result = converter.convert(file)\n",
" document = result.document\n",
" dictionary = document.export_to_dict()\n",
"\n",
" # Calculate conversion confidence\n",
" confidence_reports[file] = result.confidence\n",
"\n",
" file_path = Path(file)\n",
"\n",
" # Export the document to JSON\n",
" json_output_path = (output_dir / f\"{file_path.stem}.json\")\n",
" with open(json_output_path, \"w\", encoding=\"utf-8\") as f:\n",
" json.dump(dictionary, f)\n",
" print(f\"✓ Path of JSON output is: {json_output_path.resolve()}\")\n",
"\n",
" # Export the document to Markdown\n",
" md_output_path = output_dir / f\"{file_path.stem}.md\"\n",
" with open(md_output_path, \"w\", encoding=\"utf-8\") as f:\n",
" markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)\n",
" f.write(markdown)\n",
" print(f\"✓ Path of markdown output is: {md_output_path.resolve()}\")"
]
},
{
"cell_type": "markdown",
"id": "13802403-9342-4e33-96a9-b04b4fedd070",
"metadata": {},
"source": [
"### Conversion confidence\n",
"\n",
"When converting a document, Docling can calculate how confident it is in the quality of the conversion. This *confidence* is expressed as both a *score* and a *grade*. The score is a numeric value between 0 and 1, and the grade is a label that can be **poor**, **fair**, **good**, or **excellent**. If Docling is unable to calculate a confidence grade, the value will be marked as *unspecified*.\n",
"\n",
"If your document receives a low score (for example, below 0.8) and a grade of *poor* or *fair*, you'll probably benefit from using a different conversion technique. In that case, go back to the *Choose a conversion pipeline* section, select a different approach (e.g. forcing OCR), re-run the conversion, and compare the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32624c0e-f32c-48c9-85ba-6ab6a3d4cebc",
"metadata": {},
"outputs": [],
"source": [
"for file, confidence_report in confidence_reports.items():\n",
" print(f\"✓ Conversion confidence for {file}:\")\n",
" \n",
" print(f\"Average confidence: {confidence_report.mean_grade.name} (score {confidence_report.mean_score:.3f})\")\n",
" \n",
" low_score_pages = []\n",
" for page in confidence_report.pages:\n",
" page_confidence_report = confidence_report.pages[page]\n",
" if page_confidence_report.mean_score < confidence_report.mean_score:\n",
" low_score_pages.append(page)\n",
"\n",
" print(f\"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages) or 'none'}\")\n",
" \n",
" print()"
]
},
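{
"cell_type": "markdown",
"id": "b2e3d4f5-2222-4222-8222-000000000003",
"metadata": {},
"source": [
"### Optional: reload a converted document\n",
"\n",
"As a minimal sketch (an addition to the original flow), the exported JSON files can be loaded back into [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) objects via the pydantic `model_validate` API of `docling_core`. This assumes `output_dir` still points at the directory configured earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2e3d4f5-2222-4222-8222-000000000004",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"from docling_core.types.doc import DoclingDocument\n",
"\n",
"for json_file in sorted(output_dir.glob(\"*.json\")):\n",
"    with open(json_file, encoding=\"utf-8\") as f:\n",
"        data = json.load(f)\n",
"    # DoclingDocument is a pydantic model, so model_validate rebuilds it from the dict\n",
"    doc = DoclingDocument.model_validate(data)\n",
"    print(f\"✓ {json_file.name}: {len(doc.texts)} text items, {len(doc.tables)} tables\")"
]
},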
{
"cell_type": "markdown",
"id": "56dc2c21",
"metadata": {},
"source": [
"## 🍩 Additional resources\n",
"\n",
"For additional example notebooks related to Data Processing, check the [Open Data Hub Data Processing](https://github.com/opendatahub-io/odh-data-processing/) repository on GitHub.\n",
"\n",
"### Any Feedback?\n",
"\n",
"We'd love to hear if you have any feedback on this or any other notebook in this series! Please [open an issue](https://github.com/opendatahub-io/odh-data-processing/issues) and help us improve our demos."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}