
This Azure Document Intelligence extraction tool converts unstructured documents into structured data. It accurately extracts text and tables, making your information ready for LLMs and immediate use.


mfbcat/data-extraction-tool-with-AI-document-intelligence


Structify-AI

Overview

Unstructured Data Extraction Tool. Last updated June 16th 2025 - V 0.2 - Streamlit App

This service applies machine learning to extract text, key-value pairs, tables, and document structure automatically and accurately. The prototype converts unstructured data into a structured format suitable for LLMs, so that effort goes into acting on information rather than compiling it. It explores two approaches: adopting external tools or building a custom solution tailored to your documents. A key design goal is compatibility with diverse data types and formats, so the tool integrates smoothly with existing systems, both on-premises and in the cloud.
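To make "a structured format suitable for LLMs" concrete, here is a minimal sketch of one possible target format: extracted table cells rendered as a Markdown table. The `Cell` class and `cells_to_markdown` function are hypothetical names chosen for illustration and are not taken from this repository; the sketch assumes the extractor reports a row index, a column index, and the text of each cell.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One extracted table cell (hypothetical shape, not the project's API)."""
    row: int
    col: int
    text: str


def cells_to_markdown(cells: list[Cell]) -> str:
    """Render extracted cells as a Markdown table; row 0 is treated as the header."""
    if not cells:
        return ""
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    # Start from an empty grid so cells the extractor missed stay blank.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Escape pipes so cell text cannot break the table layout.
        grid[c.row][c.col] = c.text.strip().replace("|", "\\|")
    lines = ["| " + " | ".join(grid[0]) + " |"]
    lines.append("| " + " | ".join("---" for _ in range(n_cols)) + " |")
    for row in grid[1:]:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


cells = [
    Cell(0, 0, "Item"), Cell(0, 1, "Amount"),
    Cell(1, 0, "Consulting"), Cell(1, 1, "1200.00"),
]
print(cells_to_markdown(cells))
```

Markdown is only one option here; the same grid could just as well be serialized to JSON or CSV, depending on what the downstream system expects.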

Architecture

The prototype is built around a core processing engine that leverages advanced NLP techniques to maximize the accuracy of information extraction. Machine learning algorithms are incorporated to enhance the tool's adaptability and optimize its performance over time.

Data Ingestion: Handles various document types and formats.

Extraction Core: The engine that uses NLP and ML models to extract text, key-value pairs, and structures. This can be configured to use prebuilt models or custom models.

Structuring & Output: Formats the raw extracted information into clean, usable data suitable for LLMs and other downstream systems.

Frontend: A Streamlit application provides the user interface for interaction.
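The components above can be sketched as a small pipeline in which the extractor is a pluggable function, so a prebuilt or a custom model could be swapped in. This is an illustrative sketch, not the project's actual code: every name (`ingest`, `stub_extractor`, `structure`, `run_pipeline`, `SUPPORTED_SUFFIXES`) is an assumption, and the stub extractor simply parses "Key: value" lines instead of calling a real model.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

# Assumed list of accepted file types; the real tool may support others.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".txt"}


@dataclass
class RawExtraction:
    """Raw output of the extraction core (hypothetical shape)."""
    text: str
    key_values: dict[str, str] = field(default_factory=dict)


# Any callable with this signature can act as the extraction core.
Extractor = Callable[[bytes], RawExtraction]


def ingest(filename: str, data: bytes) -> bytes:
    """Data Ingestion: accept only supported, non-empty documents."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported document type: {suffix or filename}")
    if not data:
        raise ValueError("empty document")
    return data


def stub_extractor(data: bytes) -> RawExtraction:
    """Stand-in for a prebuilt or custom model: parses 'Key: value' lines."""
    text = data.decode("utf-8", errors="replace")
    pairs: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            pairs[key.strip()] = value.strip()
    return RawExtraction(text=text, key_values=pairs)


def structure(raw: RawExtraction) -> str:
    """Structuring & Output: emit JSON for an LLM prompt or downstream system."""
    payload = {"text": raw.text.strip(), "fields": raw.key_values}
    return json.dumps(payload, indent=2, ensure_ascii=False)


def run_pipeline(filename: str, data: bytes,
                 extractor: Extractor = stub_extractor) -> str:
    """Chain the three stages: ingest, extract, structure."""
    return structure(extractor(ingest(filename, data)))


print(run_pipeline("invoice.txt", b"Invoice No: 42\nTotal: 1200.00 EUR\n"))
```

Passing the extractor as a parameter keeps ingestion and structuring independent of the model in use, which is one way to support the "prebuilt or custom" configuration described above.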

Branching model

This project utilizes the Gitflow branching model for a structured development workflow. The main branch reflects the latest stable release. All development happens on the develop branch. When a release is planned, a release/* branch is created from develop for final preparations. Once ready, it is merged into both main and back into develop. Feature development occurs in separate feature/* branches.

Technologies

Python: Core programming language.

Streamlit: For the user interface application.

Natural Language Processing (NLP): For information extraction.

Machine Learning: To enhance adaptability and performance. Can be used with SDKs for integration.

Setup

Clone the repository:

    git clone https://github.com/mfbcat/structify-AI.git

Navigate to the project directory and install dependencies:

    cd structify-AI
    pip install -r requirements.txt

Run the Streamlit application:

    streamlit run app.py

Testing

The application includes unit tests to ensure the quality and stability of the extraction engine.

Unit Tests: Located in the src/test/java/ or tests/ directories. They focus on testing individual components.

To run the tests:

1. Navigate to the directory containing the tests.
2. Run the test suite via the designated test runner (e.g., pytest).
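As an illustration of what a component-level unit test might look like, here is a minimal pytest-style sketch. The function under test, `normalize_key`, is a hypothetical helper invented for this example and is not part of the repository; it is defined inline so the example is self-contained.

```python
def normalize_key(raw_key: str) -> str:
    """Hypothetical helper under test: 'Invoice No.:' becomes 'invoice_no'."""
    # Replace every non-alphanumeric character with a space, then join words.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in raw_key)
    return "_".join(cleaned.lower().split())


def test_normalize_key_strips_punctuation():
    assert normalize_key("Invoice No.:") == "invoice_no"


def test_normalize_key_collapses_whitespace():
    assert normalize_key("  Total   Amount ") == "total_amount"


def test_normalize_key_handles_empty_input():
    assert normalize_key("") == ""
```

Saved under tests/ with a file name starting with test_ (for example tests/test_normalize.py), these functions would be collected automatically when pytest is run, since pytest discovers functions whose names begin with test_.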

Contributing

Contributions are welcome! Please follow these guidelines:

1. Fork the repository.
2. Create a new branch for your feature or bug fix:

    git checkout -b feature/my-new-feature

3. Make your changes and commit them with clear and descriptive commit messages.
4. Test your changes thoroughly.
5. Push your branch to your forked repository:

    git push origin feature/my-new-feature

6. Create a pull request to the develop branch of the original repository.

License

MIT License

Copyright (c) 2025 Marc Farssac, email: [email protected]

https://www.mfb.cat
