Replies: 2 comments 1 reply
-
@JoelAuren when the text is extracted from the PDF file, it is saved in a TXT file inside the workspace. Did you compare the two TXT files generated by the two different approaches? If you are using Azure Blobs, the workspace will be a virtual folder under the container dedicated to KM.
-
This is a great question; I'll be watching for the answer too. In addition, I believe that Document Intelligence retains tables (in a pseudo-HTML format) while the PdfDecoder does not seem to. Text-generation models have a harder time responding correctly when such tables are not preserved. @JoelAuren, have you encountered this issue?
-
I am using Kernel Memory and have overridden the PdfDecoder (https://github.com/microsoft/kernel-memory/blob/main/service/Core/DataFormats/Pdf/PdfDecoder.cs) to handle scanned documents. Instead of using ContentOrderTextExtractor.GetText(page), in these cases I use Document Intelligence Read to extract the text with OCR. I then index the content into chunks in the same way:
result.Sections.Add(new Chunk(pageContent, page.Number, Chunk.Meta(sentencesAreComplete: false)));
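For context, a minimal sketch of what the per-page branch in such an override might look like. This is illustrative only, not the actual override: the PageLooksScanned helper is hypothetical, and the Document Intelligence call assumes the Azure.AI.FormRecognizer.DocumentAnalysis client with the "prebuilt-read" model.

```csharp
// Sketch of the per-page logic inside an overridden PdfDecoder (assumptions:
// "client" is a configured DocumentAnalysisClient, "pageImageStream" is a
// rendering of the page, PageLooksScanned is a hypothetical heuristic such as
// "the page yields no extractable text").
string pageContent;
if (PageLooksScanned(page))
{
    // OCR the scanned page with Document Intelligence Read
    AnalyzeDocumentOperation op = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed, "prebuilt-read", pageImageStream);
    pageContent = op.Value.Content;
}
else
{
    // Digital page: use the library's default text extraction
    pageContent = ContentOrderTextExtractor.GetText(page);
}
result.Sections.Add(new Chunk(pageContent, page.Number,
    Chunk.Meta(sentencesAreComplete: false)));
```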
The problem I encounter is that when I ask Kernel Memory questions about these OCR-processed documents, I sometimes get INFO NOT FOUND: the chunks containing the relevant information are apparently not retrieved, so the AI cannot answer.
Interestingly, if I use the original method ContentOrderTextExtractor.GetText(page) on the same documents and ask the same questions, I do get correct answers.
I suspect there may be differences in the format of the text extracted with OCR (e.g., line breaks, spacing, everything flattened onto a single line, etc.), which could be affecting the indexing and subsequent retrieval of information.
Could the extracted text format affect how chunks are indexed and later retrieved? Is there any recommendation to normalize the text before indexing it to improve search accuracy?
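If line-break noise is indeed the cause, one option worth trying is a normalization pass on the OCR output before creating the chunk. This is a hedged sketch, not a Kernel Memory API: it unwraps hard line breaks inside paragraphs and collapses repeated whitespace so sentences stay intact within chunks.

```csharp
using System.Text.RegularExpressions;

// Illustrative OCR-text normalization before chunking (an assumption, not
// part of Kernel Memory): join hard-wrapped lines into paragraphs and
// collapse whitespace runs so embeddings see complete sentences.
public static string NormalizeOcrText(string text)
{
    text = text.Replace("\r\n", "\n");                 // normalize line endings
    text = Regex.Replace(text, @"\n{2,}", "\u0000");   // protect paragraph breaks
    text = Regex.Replace(text, @"\n", " ");            // unwrap single line breaks
    text = text.Replace("\u0000", "\n\n");             // restore paragraph breaks
    text = Regex.Replace(text, @"[ \t]{2,}", " ");     // collapse spaces/tabs
    return text.Trim();
}
```

Calling something like NormalizeOcrText(pageContent) before result.Sections.Add(...) would make the OCR output resemble the text produced by ContentOrderTextExtractor.GetText(page), which may be why the default path retrieves correctly.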
Here is my override of PdfDecoder:
I appreciate any guidance on how to address this issue.