GitHub - ishanmehta2/Newspaper-Text-Extraction

CS224V Final Project - Agents That Read Complex Documents With Vision and Language Models

This project performs text extraction on century-old newspaper clippings using GPT-4o and Tesseract. In keeping with the agentic focus of the course, we built a pipeline that makes calls to several agents, preserving autonomy and improving the accuracy of our solution. We began by running four baselines, two with Tesseract and two with GPT-4o, which showed how important accurate text segmentation is: both tools were very accurate at text extraction when given a cropped image, with F1 scores over 0.89. Our pipeline calls both tools, with Tesseract primarily handling segmentation and GPT-4o focused on extracting and validating the text. On average, we detected about 73 percent of all articles in a given newspaper and, when we condition our results on this detection rate, achieved an F1 score of around 0.6. The writeup (Writeup.pdf) surveys related work in this field, describes the tools we used in more detail, explains our exact methods, and presents our results and analysis.
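For readers unfamiliar with how an F1 score can be applied to extracted text, here is a minimal sketch. It assumes a token-level F1 computed over whitespace-separated, lowercased words. That is an assumption made for illustration; this README does not state how the project's F1 scores were actually computed. The function name `token_f1` is hypothetical and is not taken from the repository.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between extracted text and ground-truth text.

    Assumption: tokens are lowercased, whitespace-separated words. The
    project's own metric may tokenize or normalize differently.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of extracted tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were recovered
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an extraction that reproduces the reference exactly scores 1.0, while one that recovers half of the reference tokens and is itself half noise scores 0.5.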

To run this program, first download and install everything in dependencies.py, then run main.py in the full_pipeline folder.

baselines
data
full_pipeline
newspaper_images
README.md
Writeup.pdf
dependencies.py
