|
9 | 9 | "\n", |
10 | 10 | "This notebook uses **standard (non-VLM)** [Docling](https://docling-project.github.io/docling/) techniques to convert PDF documents into markdown and the [Docling Document](https://docling-project.github.io/docling/concepts/docling_document/) format, a structured representation of the original document that can be exported as JSON.\n", |
11 | 11 | "\n", |
12 | | - "The standard pipeline options generally yield good and fast results for most documents. In some cases, however, alternative conversion pipelines can lead to better outcomes. For instance, forcing OCR is effective for scanned documents or images that contain text to be extracted and analyzed. In cases where relevant information is contained within formulas, code, or pictures; enrichment and picture description and classification might be useful. All these use cases are supported by this notebook." |
| 12 | + "Conversions using the standard pipeline options generally yield good and fast results for most documents. In some cases, however, alternative conversion pipelines can lead to better outcomes. For instance, forcing OCR is effective for scanned documents or images that contain text to be extracted and analyzed. In cases where relevant information is contained within formulas, code, or pictures; enrichments might be useful. All these use cases are supported by this notebook." |
13 | 13 | ] |
14 | 14 | }, |
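For orientation, the cells below build on the core Docling conversion flow, which in its simplest form looks like the following minimal sketch (the input path is a placeholder; this notebook configures a customized converter rather than the default one):

```python
from docling.document_converter import DocumentConverter

# Minimal sketch of the core Docling flow; "sample.pdf" is a placeholder path.
converter = DocumentConverter()           # default conversion pipeline
result = converter.convert("sample.pdf")

document = result.document
markdown = document.export_to_markdown()  # markdown export
as_dict = document.export_to_dict()       # DoclingDocument as a JSON-serializable dict
print(markdown[:300])
```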
15 | 15 | { |
|
84 | 84 | "source": [ |
85 | 85 | "from pathlib import Path\n", |
86 | 86 | "\n", |
87 | | - "output_dir = Path(\"document-conversion-standard/output\")\n", |
| 87 | + "output_dir_name = \"document-conversion-standard/output\"\n", |
| 88 | + "\n", |
| 89 | + "output_dir = Path(output_dir_name)\n", |
88 | 90 | "output_dir.mkdir(parents=True, exist_ok=True)" |
89 | 91 | ] |
90 | 92 | }, |
|
93 | 95 | "id": "39d4e3ef-2fdb-45a3-a217-b07638a35363", |
94 | 96 | "metadata": {}, |
95 | 97 | "source": [ |
96 | | - "### Configure conversion pipeline\n", |
| 98 | + "### Configure conversion pipelines\n", |
97 | 99 | "\n", |
98 | | - "Next we set the configuration options for our conversion pipeline. \n", |
| 100 | + "Next we create the configuration options for the conversion pipelines supported by this notebook. \n", |
99 | 101 | "\n", |
100 | | - "The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another one which enables code, formula, and picture enrichments. Later in the *Conversion* section, you'll set the converter to either `standard_converter`, `ocr_converter`, or `enrichment_converter` depending on which conversion technique you'd like to use.\n", |
| 102 | + "The next cell configures two combinations of pipeline options: the **default (standard)** options, and a variant that **forces OCR** on the entire document. In a later cell you'll choose either the `standard` or `ocr` pipeline options depending on which conversion technique you'd like to use.\n", |
101 | 103 | "\n", |
102 | 104 | "Note: OCR requires the Tesseract binary to run. Please refer to the Docling [installation](https://docling-project.github.io/docling/installation/) docs if you're not running this notebook from a Workbench image that has it installed already. \n", |
103 | 105 | "\n", |
|
118 | 120 | " TesseractOcrOptions,\n", |
119 | 121 | " PdfPipelineOptions,\n", |
120 | 122 | ")\n", |
121 | | - "from docling.pipeline.vlm_pipeline import VlmPipeline\n", |
122 | 123 | "from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend\n", |
123 | 124 | "\n", |
124 | | - "# Standard pipeline options\n", |
125 | | - "standard_pipeline_options = PdfPipelineOptions()\n", |
126 | | - "standard_pipeline_options.generate_picture_images = True\n", |
127 | | - "standard_pipeline_options.do_table_structure = True\n", |
128 | | - "standard_pipeline_options.table_structure_options.do_cell_matching = True\n", |
129 | | - "standard_converter = DocumentConverter(\n", |
130 | | - " format_options={\n", |
131 | | - " InputFormat.PDF: PdfFormatOption(\n", |
132 | | - " pipeline_options=standard_pipeline_options,\n", |
133 | | - " backend=DoclingParseV4DocumentBackend,\n", |
134 | | - " )\n", |
135 | | - " }\n", |
136 | | - ")\n", |
| 125 | + "def create_standard_options() -> PdfPipelineOptions:\n", |
| 126 | + " \"\"\"Create base pipeline options with standard settings.\"\"\"\n", |
| 127 | + " pipeline_options = PdfPipelineOptions()\n", |
| 128 | + " pipeline_options.generate_picture_images = True\n", |
| 129 | + " pipeline_options.do_table_structure = True\n", |
| 130 | + " pipeline_options.table_structure_options.do_cell_matching = True\n", |
| 131 | + " pipeline_options.accelerator_options = AcceleratorOptions(\n", |
| 132 | + " num_threads=4, device=AcceleratorDevice.AUTO\n", |
| 133 | + " )\n", |
| 134 | + " return pipeline_options\n", |
137 | 135 | "\n", |
138 | | - "# Force OCR on the entire page\n", |
| 136 | + "# Standard converter\n", |
| 137 | + "standard_options = create_standard_options()\n", |
| 138 | + "\n", |
| 139 | + "# OCR converter: force OCR on the entire page\n", |
139 | 140 | "%env TESSDATA_PREFIX=/usr/share/tesseract/tessdata\n", |
140 | | - "ocr_pipeline_options = PdfPipelineOptions()\n", |
141 | | - "ocr_pipeline_options.generate_picture_images = True\n", |
142 | | - "ocr_pipeline_options.do_table_structure = True\n", |
143 | | - "ocr_pipeline_options.table_structure_options.do_cell_matching = True\n", |
144 | | - "ocr_pipeline_options.do_ocr = True\n", |
145 | | - "ocr_pipeline_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n", |
146 | | - "ocr_pipeline_options.accelerator_options = AcceleratorOptions(\n", |
147 | | - " num_threads=4, device=AcceleratorDevice.AUTO\n", |
148 | | - ")\n", |
149 | | - "ocr_converter = DocumentConverter(\n", |
150 | | - " format_options={\n", |
151 | | - " InputFormat.PDF: PdfFormatOption(\n", |
152 | | - " pipeline_options=ocr_pipeline_options,\n", |
153 | | - " backend=DoclingParseV4DocumentBackend,\n", |
154 | | - " )\n", |
| 141 | + "ocr_options = create_standard_options()\n", |
| 142 | + "ocr_options.do_ocr = True\n", |
| 143 | + "ocr_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)\n", |
| 144 | + "\n", |
| 145 | + "def get_pipeline_options(pipeline_options_name: str = \"standard\") -> PdfPipelineOptions:\n", |
| 146 | + " \"\"\"Get the configured pipeline options based on name.\n", |
| 147 | + "\n", |
| 148 | + " Args:\n", |
| 149 | + " pipeline_options_name: One of \"standard\" or \"ocr\"\n", |
| 150 | + " \n", |
| 151 | + " Returns:\n", |
| 152 | + " PdfPipelineOptions instance\n", |
| 153 | + " \n", |
| 154 | + " Raises:\n", |
| 155 | + " ValueError: If pipeline_options_name is not recognized\n", |
| 156 | + " \"\"\"\n", |
| 157 | + " pipeline_options = {\n", |
| 158 | + " \"standard\": standard_options,\n", |
| 159 | + " \"ocr\": ocr_options\n", |
155 | 160 | " }\n", |
156 | | - ")\n", |
157 | 161 | "\n", |
158 | | - "# Code and formula enrichments and picture description and classification\n", |
159 | | - "enrichment_pipeline_options = PdfPipelineOptions()\n", |
160 | | - "enrichment_pipeline_options.generate_picture_images = True\n", |
161 | | - "enrichment_pipeline_options.do_table_structure = True\n", |
162 | | - "enrichment_pipeline_options.table_structure_options.do_cell_matching = True\n", |
163 | | - "enrichment_pipeline_options.do_code_enrichment = True\n", |
164 | | - "enrichment_pipeline_options.do_formula_enrichment = True\n", |
165 | | - "enrichment_pipeline_options.do_picture_description = True\n", |
166 | | - "enrichment_pipeline_options.images_scale = 2\n", |
167 | | - "enrichment_pipeline_options.do_picture_classification = True\n", |
168 | | - "enrichment_pipeline_options.accelerator_options = AcceleratorOptions(\n", |
169 | | - " num_threads=4, device=AcceleratorDevice.AUTO\n", |
170 | | - ")\n", |
171 | | - "enrichment_converter = DocumentConverter(\n", |
172 | | - " format_options={\n", |
173 | | - " InputFormat.PDF: PdfFormatOption(\n", |
174 | | - " pipeline_options=enrichment_pipeline_options,\n", |
175 | | - " backend=DoclingParseV4DocumentBackend,\n", |
| 162 | + " if pipeline_options_name not in pipeline_options:\n", |
| 163 | + " raise ValueError(\n", |
| 164 | + " f\"Unknown pipeline options name: '{pipeline_options_name}'. \"\n", |
| 165 | + " f\"Choose from {list(pipeline_options.keys())}\"\n", |
176 | 166 | " )\n", |
177 | | - " }\n", |
178 | | - ")" |
| 167 | + " \n", |
| 168 | + " return pipeline_options[pipeline_options_name]" |
| 169 | + ] |
| 170 | + }, |
| 171 | + { |
| 172 | + "cell_type": "markdown", |
| 173 | + "id": "89396e34", |
| 174 | + "metadata": {}, |
| 175 | + "source": [ |
| 176 | + "### Choose a conversion pipeline\n", |
| 177 | + "\n", |
| 178 | + "Next we choose the conversion pipeline to be used in the conversion.\n", |
| 179 | + "\n", |
| 180 | + "The **standard** pipeline generally yield good and fast results for the majority of documents. However, if you didn't get good results with the standard pipeline, or if you're converting scanned documents or ones that contain relevant text information within images, consider using the **ocr** pipeline. Just set `pipeline_to_use` to either `standard` or `ocr` accordingly." |
| 181 | + ] |
| 182 | + }, |
| 183 | + { |
| 184 | + "cell_type": "code", |
| 185 | + "execution_count": null, |
| 186 | + "id": "621f644c", |
| 187 | + "metadata": {}, |
| 188 | + "outputs": [], |
| 189 | + "source": [ |
| 190 | + "# Set the pipeline to use (either \"standard\" or \"ocr\")\n", |
| 191 | + "pipeline_to_use = \"standard\"\n", |
| 192 | + "\n", |
| 193 | + "pipeline_options = get_pipeline_options(pipeline_to_use)\n", |
| 194 | + "\n", |
| 195 | + "print(f\"✓ Using '{pipeline_to_use}' pipeline\")" |
| 196 | + ] |
| 197 | + }, |
| 198 | + { |
| 199 | + "cell_type": "markdown", |
| 200 | + "id": "b286e6f2", |
| 201 | + "metadata": {}, |
| 202 | + "source": [ |
| 203 | + "### Configure enrichments\n", |
| 204 | + "\n", |
| 205 | + "Depending on the characteristics of the documents being converted, you may benefit from the use of enrichments.\n", |
| 206 | + "\n", |
| 207 | + "Docling supports the enrichment of conversion pipelines with additional steps that'll process specific document components like code blocks, formulas, and pictures.\n", |
| 208 | + "\n", |
| 209 | + "All enrichment features are disabled by default, but you may enable them individually by setting one or more of the enrichment options in the next cell to `True`. Note that the extra steps usually require extra models executions, which may increase the processing time consistently.\n", |
| 210 | + "\n", |
| 211 | + " * `do_code_enrichment`: Performs advanced parsing of code blocks and sets the code language of each block accordingly.\n", |
| 212 | + " * `do_formula_enrichment`: Analyzes equations and extracts their LaTeX representation.\n", |
| 213 | + " * `do_picture_description`: Generates captions for pictures with a vision model.\n", |
| 214 | + " * `do_picture_classification`: Classifies pictures, for example chart types, flow diagrams, logos, or signatures.\n", |
| 215 | + " \n", |
| 216 | + "For additional customization and a complete reference of Docling's enrichments, check the [official documentation](https://docling-project.github.io/docling/usage/enrichments/)." |
| 217 | + ] |
| 218 | + }, |
| 219 | + { |
| 220 | + "cell_type": "code", |
| 221 | + "execution_count": null, |
| 222 | + "id": "9d5bf3df", |
| 223 | + "metadata": {}, |
| 224 | + "outputs": [], |
| 225 | + "source": [ |
| 226 | + "# Sets code and formula enrichments and picture description and classification\n", |
| 227 | + "pipeline_options.do_code_enrichment = False\n", |
| 228 | + "pipeline_options.do_formula_enrichment = False\n", |
| 229 | + "pipeline_options.do_picture_description = False\n", |
| 230 | + "pipeline_options.do_picture_classification = False\n", |
| 231 | + "\n", |
| 232 | + "# If you enable enrichments, you may benefit from increasing the image scale (e.g. to 2)\n", |
| 233 | + "pipeline_options.images_scale = 1\n" |
179 | 234 | ] |
180 | 235 | }, |
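If you enable picture description or classification, the generated annotations are attached to the picture items of the converted document. The sketch below shows one way to inspect them after running the Conversion section; the annotation class names and import path are assumptions based on docling-core and may need adjusting for your installed version:

```python
# Sketch only: inspect picture annotations on a converted DoclingDocument.
# Assumes `document` comes from converter.convert(...) in the Conversion section
# and that the picture enrichment flags above were set to True.
# The imported class names are assumptions; check your docling-core version.
from docling_core.types.doc.document import (
    PictureClassificationData,
    PictureDescriptionData,
)

for picture in document.pictures:
    for annotation in picture.annotations:
        if isinstance(annotation, PictureDescriptionData):
            print(f"Description: {annotation.text}")
        elif isinstance(annotation, PictureClassificationData) and annotation.predicted_classes:
            top = annotation.predicted_classes[0]
            print(f"Classification: {top.class_name} (confidence {top.confidence:.2f})")
```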
181 | 236 | { |
|
185 | 240 | "source": [ |
186 | 241 | "## ✨ Conversion\n", |
187 | 242 | "\n", |
188 | | - "Finally, convert every document into one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` (markdown). If you'd like to change the conversion technique, set `converter` to either `standard_converter`, `ocr_converter`, or `enrichment_converter`." |
| 243 | + "Finally, use the pipeline options we configured to convert every document into one `json` ([Docling Document](https://docling-project.github.io/docling/concepts/docling_document/)) and one `md` (markdown), which will be stored in the output directory configured earlier." |
189 | 244 | ] |
190 | 245 | }, |
191 | 246 | { |
|
198 | 253 | "import json\n", |
199 | 254 | "from docling_core.types.doc import ImageRefMode\n", |
200 | 255 | "\n", |
201 | | - "confidence_reports = dict()\n", |
| 256 | + "# Create the document converter\n", |
| 257 | + "converter = DocumentConverter(\n", |
| 258 | + " format_options={\n", |
| 259 | + " InputFormat.PDF: PdfFormatOption(\n", |
| 260 | + " pipeline_options=pipeline_options,\n", |
| 261 | + " backend=DoclingParseV4DocumentBackend,\n", |
| 262 | + " )\n", |
| 263 | + " }\n", |
| 264 | + " )\n", |
| 265 | + "\n", |
| 266 | + "confidence_reports = {}\n", |
| 267 | + "\n", |
| 268 | + "if not files:\n", |
| 269 | + " raise ValueError(\"No input files specified. Please set the 'files' list above.\")\n", |
202 | 270 | "\n", |
203 | 271 | "for file in files:\n", |
204 | | - " # Set the converter to use (standard_converter, ocr_converter, or enrichment_converter)\n", |
205 | | - " converter = standard_converter\n", |
206 | | - " \n", |
207 | 272 | " # Convert the file\n", |
208 | 273 | " print(f\"Converting {file}...\")\n", |
| 274 | + "\n", |
209 | 275 | " result = converter.convert(file)\n", |
210 | 276 | " document = result.document\n", |
211 | 277 | " dictionary = document.export_to_dict()\n", |
212 | 278 | "\n", |
213 | | - " file_path = Path(file)\n", |
214 | | - "\n", |
215 | 279 | " # Calculate conversion confidence\n", |
216 | 280 | " confidence_reports[file] = result.confidence\n", |
217 | 281 | "\n", |
| 282 | + " file_path = Path(file)\n", |
| 283 | + "\n", |
218 | 284 | " # Export the document to JSON\n", |
219 | 285 | " json_output_path = (output_dir / f\"{file_path.stem}.json\")\n", |
220 | 286 | " with open(json_output_path, \"w\", encoding=\"utf-8\") as f:\n", |
221 | 287 | " json.dump(dictionary, f)\n", |
222 | | - " print(f\"Path of JSON output is: {Path(json_output_path).resolve()}\")\n", |
| 288 | + " print(f\"✓ Path of JSON output is: {json_output_path.resolve()}\")\n", |
223 | 289 | "\n", |
224 | 290 | " # Export the document to Markdown\n", |
225 | 291 | " md_output_path = output_dir / f\"{file_path.stem}.md\"\n", |
226 | 292 | " with open(md_output_path, \"w\", encoding=\"utf-8\") as f:\n", |
227 | 293 | " markdown = document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)\n", |
228 | 294 | " f.write(markdown)\n", |
229 | | - " print(f\"Path of markdown output is: {Path(md_output_path).resolve()}\")" |
| 295 | + " print(f\"✓ Path of markdown output is: {md_output_path.resolve()}\")" |
230 | 296 | ] |
231 | 297 | }, |
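Because `confidence_reports` is populated during the loop above, you could optionally flag low-confidence conversions for a second pass with the **ocr** pipeline. This is a sketch rather than part of the original notebook, and the `0.8` threshold is an arbitrary assumption:

```python
# Sketch: flag low-confidence conversions for re-conversion with the "ocr" pipeline.
# The threshold is an arbitrary assumption; tune it for your documents.
LOW_CONFIDENCE_THRESHOLD = 0.8

low_confidence_files = [
    file
    for file, report in confidence_reports.items()
    if report.mean_score < LOW_CONFIDENCE_THRESHOLD
]

if low_confidence_files:
    print("Consider re-running these files with pipeline_to_use = 'ocr':")
    for file in low_confidence_files:
        print(f"  - {file}")
```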
232 | 298 | { |
|
249 | 315 | "outputs": [], |
250 | 316 | "source": [ |
251 | 317 | "for file, confidence_report in confidence_reports.items():\n", |
252 | | - " print(f\"Conversion confidence for {file}:\")\n", |
| 318 | + " print(f\"✓ Conversion confidence for {file}:\")\n", |
253 | 319 | " \n", |
254 | 320 | " print(f\"Average confidence: \\x1b[1m{confidence_report.mean_grade.name}\\033[0m (score {confidence_report.mean_score:.3f})\")\n", |
255 | 321 | " \n", |
|
281 | 347 | ], |
282 | 348 | "metadata": { |
283 | 349 | "kernelspec": { |
284 | | - "display_name": "Python 3 (ipykernel)", |
| 350 | + "display_name": ".venv", |
285 | 351 | "language": "python", |
286 | 352 | "name": "python3" |
287 | 353 | }, |
|