Replies: 2 comments 1 reply
-
@JoelAuren when the text is extracted from the PDF file, it is saved in a TXT file inside the workspace. Did you compare the two TXT files generated by the two different approaches? If you are using Azure Blobs, the workspace will be a virtual folder under the container dedicated to KM.
-
This is a great question; I'll be watching for the answer too. In addition, I believe that Document Intelligence retains tables (in a pseudo-HTML format) while the PdfDecoder does not seem to. Text-generation models have a harder time responding correctly when such tables are not preserved. @JoelAuren, have you encountered this issue?
-
I am using Kernel Memory and have overridden the PdfDecoder (https://github.com/microsoft/kernel-memory/blob/main/service/Core/DataFormats/Pdf/PdfDecoder.cs) to handle scanned documents. Instead of using ContentOrderTextExtractor.GetText(page), in these cases I use Document Intelligence Read to extract the text with OCR. I then index the content into chunks in the same way:
result.Sections.Add(new Chunk(pageContent, page.Number, Chunk.Meta(sentencesAreComplete: false)));
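For context, a minimal sketch of what the per-page branch in such an override might look like. This is illustrative only, not the actual override: the PageLooksScanned helper is hypothetical, and the Document Intelligence call assumes the Azure.AI.FormRecognizer.DocumentAnalysis client with the "prebuilt-read" model.

```csharp
// Sketch of the per-page logic inside an overridden PdfDecoder (assumptions:
// "client" is a configured DocumentAnalysisClient, "pageImageStream" is a
// rendering of the page, PageLooksScanned is a hypothetical heuristic such as
// "the page yields no extractable text").
string pageContent;
if (PageLooksScanned(page))
{
    // OCR the scanned page with Document Intelligence Read
    AnalyzeDocumentOperation op = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed, "prebuilt-read", pageImageStream);
    pageContent = op.Value.Content;
}
else
{
    // Digital page: use the library's default text extraction
    pageContent = ContentOrderTextExtractor.GetText(page);
}
result.Sections.Add(new Chunk(pageContent, page.Number,
    Chunk.Meta(sentencesAreComplete: false)));
```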
The problem I encounter is that when I ask Kernel Memory questions about these OCR-processed documents, I sometimes get INFO NOT FOUND: the chunks containing the relevant information are apparently not retrieved, so the AI cannot answer.
Interestingly, if I use the original method ContentOrderTextExtractor.GetText(page) on the same documents and ask the same questions, I do get correct answers.
I suspect there may be differences in the format of the text extracted with OCR (e.g., line breaks, spacing, everything flattened onto a single line, etc.), which could be affecting the indexing and subsequent retrieval of information.
Could the extracted text format affect how chunks are indexed and later retrieved? Is there any recommendation to normalize the text before indexing it to improve search accuracy?
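If line-break noise is indeed the cause, one option worth trying is a normalization pass on the OCR output before creating the chunk. This is a hedged sketch, not a Kernel Memory API: it unwraps hard line breaks inside paragraphs and collapses repeated whitespace so sentences stay intact within chunks.

```csharp
using System.Text.RegularExpressions;

// Illustrative OCR-text normalization before chunking (an assumption, not
// part of Kernel Memory): join hard-wrapped lines into paragraphs and
// collapse whitespace runs so embeddings see complete sentences.
public static string NormalizeOcrText(string text)
{
    text = text.Replace("\r\n", "\n");                 // normalize line endings
    text = Regex.Replace(text, @"\n{2,}", "\u0000");   // protect paragraph breaks
    text = Regex.Replace(text, @"\n", " ");            // unwrap single line breaks
    text = text.Replace("\u0000", "\n\n");             // restore paragraph breaks
    text = Regex.Replace(text, @"[ \t]{2,}", " ");     // collapse spaces/tabs
    return text.Trim();
}
```

Calling something like NormalizeOcrText(pageContent) before result.Sections.Add(...) would make the OCR output resemble the text produced by ContentOrderTextExtractor.GetText(page), which may be why the default path retrieves correctly.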
Here is my override of PdfDecoder:
I appreciate any guidance on how to address this issue.