GitHub - BigData-Fall2024-Team4/Assignment2

Assignment2

The goal of this project is to automate the extraction of text from PDF files and provide a user-friendly interface for interacting with the extracted data. The project integrates automated data extraction using Airflow, PyPDF2, and Azure Document Intelligence, with a client-facing application built using Streamlit and FastAPI.

Key technologies involved include:

Streamlit: Interactive UI framework for data exploration and user input.
FastAPI: Backend framework for managing user authentication, API communication, and serving processed data.
PyPDF2: An open-source library for extracting text from simple PDF files.
Azure Document Intelligence: Used for advanced text extraction from complex PDFs containing tables and structured data.
Airflow: Orchestrates the PDF processing pipelines, automating extraction and data uploads.
Google Cloud Platform (GCP):
- Google Cloud Storage (GCS): Stores extracted text and processed files.
- GCP SQL (MySQL Database): Stores metadata of processed PDFs and user credentials for authentication.

Project Resources

Google Colab notebook: https://colab.research.google.com/drive/1H78r4BZynBK4jxVNtZfMn8BgzBlUjeuh?usp=sharing

Google Codelab: https://codelabs-preview.appspot.com/?file_id=1r22xjHpWOK1GBYjrgJxDEu2EVpfbIeNmms62R1VADps#0

Demo Video URL: https://drive.google.com/file/d/15uZEUIzM380tWLgTcy5BQN5SA_6WAFyi/view?usp=drive_link

Airflow: http://35.243.155.116:8080

FastAPI: http://34.138.117.80:8000

Streamlit: http://34.138.117.80:8501

Tech Stack

Architecture diagram

Project Flow

Part 1: Automating Text Extraction and Database Population

Airflow Pipelines:
- clone_repo: Clones the repository containing the GAIA dataset and processing scripts.
- filter_pdfs: Filters relevant PDF files from the dataset for processing.
- process_pdfs: Uses PyPDF2 and Azure Document Intelligence to extract text from the PDFs.
- upload_files: Uploads the extracted text and results to Google Cloud Storage (GCS).
Storage in GCP SQL:
- Extracted metadata is stored in a MySQL database hosted on GCP SQL, allowing efficient querying and data management.
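
The four pipeline steps above can be sketched as plain Python functions chained in the listed order. This is a minimal illustration, not the repository's DAG: in Airflow each step would be its own task, and the `extract` and `upload` callables here are hypothetical stand-ins for the PyPDF2/Azure Document Intelligence extraction and the GCS upload. The `<pdf stem>.txt` naming is also an assumption. `clone_repo` is assumed to have already produced the list of paths.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def filter_pdfs(paths: Iterable[str]) -> List[Path]:
    """Keep only files whose extension is .pdf (case-insensitive)."""
    return sorted(Path(p) for p in paths if Path(p).suffix.lower() == ".pdf")


def process_pdfs(pdfs: Iterable[Path], extract: Callable[[Path], str]) -> Dict[str, str]:
    """Run the supplied extractor on each PDF and key the text by file name."""
    return {pdf.name: extract(pdf) for pdf in pdfs}


def upload_files(texts: Dict[str, str], upload: Callable[[str, str], None]) -> List[str]:
    """Send each extracted text to storage as '<pdf stem>.txt' (assumed naming)."""
    names = []
    for pdf_name, text in texts.items():
        object_name = Path(pdf_name).with_suffix(".txt").name
        upload(object_name, text)
        names.append(object_name)
    return names


def run_pipeline(
    paths: Iterable[str],
    extract: Callable[[Path], str],
    upload: Callable[[str, str], None],
) -> List[str]:
    """Chain filter -> process -> upload, mirroring the task order listed above."""
    return upload_files(process_pdfs(filter_pdfs(paths), extract), upload)
```

Passing the extractor and uploader as arguments keeps the sketch runnable without Airflow, PyPDF2, or cloud credentials; for a dry run, a dictionary's `__setitem__` can serve as the uploader.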

Part 2: Client-Facing Application using Streamlit and FastAPI

FastAPI:
- Handles user registration and login using JWT authentication.
- Serves as the backend for processing user queries and interacting with OpenAI.
- Exposes API endpoints for managing PDF processing and fetching results.
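
To make the JWT login flow concrete, here is a minimal sketch of issuing and verifying an HS256-signed token using only the Python standard library. It is an illustration of the mechanism, not the project's code: a FastAPI backend would more likely rely on a dedicated JWT library, and the function names, claim set, and one-hour lifetime below are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    """HMAC-SHA256 over 'header.payload' with the shared secret."""
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256).digest()


def create_token(username: str, secret: str, ttl_seconds: int = 3600,
                 now: Optional[float] = None) -> str:
    """Issue a token carrying the username (sub), issue time (iat), and expiry (exp)."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": username, "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token: str, secret: str, now: Optional[float] = None) -> dict:
    """Return the payload if the signature and expiry check out; raise ValueError otherwise.

    Verification always uses HS256 and ignores the header's alg field.
    """
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # compare_digest avoids leaking timing information about the signature.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if current >= payload["exp"]:
        raise ValueError("token expired")
    return payload
```

In this sketch a login endpoint would call `create_token` after checking the stored credentials, and protected endpoints would call `verify_token` on the token sent by the client.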
Streamlit Application:
- Provides a registration and login interface.
- Allows users to select PDF files for analysis and view extracted content.
- Facilitates comparison of OpenAI responses with the expected answers from the GAIA dataset.
- Displays visualizations to assess the accuracy of model responses.
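
One simple way to score model responses against the expected GAIA answers is exact match after light normalization. The sketch below is an assumption about how such a comparison could work; the README does not state the project's actual scoring rule, and the function names are hypothetical.

```python
import re
from typing import Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def exact_match_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (model_answer, expected_answer) pairs that match after normalization."""
    pairs = list(pairs)
    if not pairs:
        return 0.0  # no questions answered yet
    hits = sum(normalize(model) == normalize(expected) for model, expected in pairs)
    return hits / len(pairs)
```

The resulting fraction is the kind of value an accuracy chart in the Streamlit app could plot per run or per question category.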
Deployment:
- Both Streamlit and FastAPI are containerized using Docker for consistent deployment.
- The applications are hosted on a public cloud, ensuring scalability and ease of access.

Contributions

Name	Percentage Contribution
Sarthak Somvanshi	33% (Basic chatbot development, Agentic architecture design, Document handler, Canvas Post Agent integration & testing, Final system testing)
Yuga Kanse	33% (OpenAI Integration, FastAPI Integration, Summary and Submit Page)
Tanvi Inchanalkar	33% (PyPDF2 and Azure AI Document Intelligence Airflow Pipelines, Documentation)

Additional Notes

WE ATTEST THAT WE HAVEN’T USED ANY OTHER STUDENTS’ WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.

Folders and files (113 commits)

.devcontainer
.vscode
Airflow
Diagrams
data_ingestion
fastapi
streamlit
.gitignore
PDF Extraction API Evaluation Template_Team4.docx
README.md
docker-compose.yaml
