ishanmehta2/Newspaper-Text-Extraction

 
 


CS224V Final Project - Agents That Read Complex Documents With Vision and Language Models

This project aims to extract text accurately from century-old newspaper clippings using tools such as GPT-4o and Tesseract. In keeping with the agentic focus of the course, we built a pipeline that calls several agents, preserving autonomy and improving the accuracy of our solution. We began by running four baselines, two with Tesseract and two with GPT-4o, which showed how much accurate text segmentation matters. Both tools were very accurate at text extraction when given a cropped image, reaching F1 scores over 0.89. Our pipeline calls both tools, with Tesseract primarily handling segmentation and GPT-4o focused on extracting and validating the text. On average, we detected about 73 percent of all articles in a given newspaper, and conditioned on those detected articles, we achieved an F1 score of around 0.6. This paper reviews related work in this field, describes the tools we used in more detail, explains our exact methods, and presents our results and analysis.
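As an illustration of how the two headline numbers relate, here is a minimal sketch of a token-level F1 score and of an average that is conditioned on detected articles. This is not the project's evaluation code: the token-level definition of F1, the function names `token_f1` and `summarize`, and the use of `None` to mark an undetected article are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Optional, Tuple


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between extracted text and ground truth (assumed metric)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; exactly one empty scores 0.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def summarize(scores: List[Optional[float]]) -> Tuple[float, float]:
    """Return (detection rate, mean F1 over detected articles only).

    Hypothetical convention: None marks an article the pipeline missed.
    """
    detected = [s for s in scores if s is not None]
    detection_rate = len(detected) / len(scores) if scores else 0.0
    conditional_f1 = sum(detected) / len(detected) if detected else 0.0
    return detection_rate, conditional_f1
```

Under this convention, missed articles lower the detection rate but do not pull down the conditional F1, which is why the two numbers are reported separately.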

To run this program, first download and install everything listed in dependencies.py, then run main.py in the full_pipeline folder.
