Unstructured Data Extraction Tool. Last updated June 16th 2025 - V 0.2 - Streamlit App
This service applies advanced machine learning to extract text, key-value pairs, tables, and structures from documents automatically and accurately. The prototype focuses on converting unstructured data into a structured format suitable for LLMs, turning documents into usable data and shifting the focus to acting on information rather than compiling it. It explores either adopting external tools or building a custom solution tailored to your documents. A key feature is ensuring compatibility with diverse data types and formats for smooth integration within existing systems, both on-premises and in the cloud.
The prototype is built around a core processing engine that leverages advanced NLP techniques to maximize the accuracy of information extraction. Machine learning algorithms are incorporated to enhance the tool's adaptability and optimize its performance over time.
- Data Ingestion: Handles various document types and formats.
- Extraction Core: The engine that uses NLP and ML models to extract text, key-value pairs, and structures. It can be configured to use prebuilt or custom models.
- Structuring & Output: Formats the raw extracted information into clean, usable data suitable for LLMs and other downstream systems.
- Frontend: A Streamlit application provides the user interface.
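The flow through the first three components can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `ingest`, `extract`, `structure`, `run_pipeline`, and `ExtractionResult` are hypothetical and not taken from the repository, and a simple regular expression stands in for the NLP/ML models that the real extraction core would use.

```python
import json
import re
from dataclasses import dataclass, field


@dataclass
class ExtractionResult:
    """Hypothetical container for what an extraction core might return."""
    text: str
    key_values: dict = field(default_factory=dict)


def ingest(raw: bytes, encoding: str = "utf-8") -> str:
    """Data Ingestion stand-in: decode bytes and normalize line endings."""
    return raw.decode(encoding, errors="replace").replace("\r\n", "\n").strip()


# Matches lines shaped like "Key: value". This is only a placeholder for
# a real NLP/ML model, which the README describes but does not specify.
_KEY_VALUE = re.compile(
    r"^[ \t]*([A-Za-z][\w ]*?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def extract(text: str) -> ExtractionResult:
    """Extraction Core stand-in: collect key-value pairs from the text."""
    pairs = {key: value for key, value in _KEY_VALUE.findall(text)}
    return ExtractionResult(text=text, key_values=pairs)


def structure(result: ExtractionResult) -> str:
    """Structuring & Output stand-in: serialize the result as JSON."""
    payload = {"text": result.text, "fields": result.key_values}
    return json.dumps(payload, indent=2, sort_keys=True)


def run_pipeline(raw: bytes) -> str:
    """Chain the three stages: ingest, extract, structure."""
    return structure(extract(ingest(raw)))


if __name__ == "__main__":
    sample = b"Invoice Number: 1234\r\nTotal: 56.00 EUR\r\nThank you."
    print(run_pipeline(sample))
```

Keeping each stage a plain function with a narrow input and output makes it possible to swap the regex placeholder for a prebuilt or custom model without touching ingestion or output formatting.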
This project uses the Gitflow branching model for a structured development workflow. The main branch reflects the latest stable release. Ongoing work is integrated on the develop branch, and each feature is developed in its own feature/* branch. When a release is planned, a release/* branch is created from develop for final preparations. Once ready, it is merged into main and back into develop.
- Python: Core programming language.
- Streamlit: For the user interface application.
- Natural Language Processing (NLP): For information extraction.
- Machine Learning: To enhance adaptability and performance.

The tool can be used with SDKs for integration.
1. Clone the repository:

    git clone https://github.com/mfbcat/structify-AI.git

2. Navigate to the project directory and install dependencies:

    cd structify-AI
    pip install -r requirements.txt

3. Run the Streamlit application:

    streamlit run app.py
The application includes unit tests to ensure the quality and stability of the extraction engine. The tests are located in the src/test/java/ or tests/ directories and focus on individual components.

To run the tests:

1. Navigate to the directory containing the tests.
2. Run the test suite with the designated test runner (e.g., pytest).
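As an illustration of what a component-level test might look like under pytest, here is a minimal sketch. The helper `normalize_key` is hypothetical and defined inline so the example is self-contained; it is not a function from this repository, and the real tests may be organized differently.

```python
def normalize_key(label: str) -> str:
    """Hypothetical helper: turn an extracted label into a snake_case key."""
    return "_".join(label.strip().lower().split())


def test_normalize_key_collapses_whitespace():
    # Leading, trailing, and repeated inner whitespace should not leak into the key.
    assert normalize_key("  Invoice   Number ") == "invoice_number"


def test_normalize_key_handles_single_word():
    assert normalize_key("Total") == "total"
```

Saved under a name such as tests/test_normalize.py (an assumed file name), this would be picked up by pytest's default discovery of `test_*.py` files and `test_*` functions when `pytest` is run from the project root.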
Contributions are welcome! Please follow these guidelines:

1. Fork the repository.
2. Create a new branch for your feature or bug fix:

    git checkout -b feature/my-new-feature

3. Make your changes and commit them with clear, descriptive commit messages.
4. Test your changes thoroughly.
5. Push your branch to your forked repository:

    git push origin feature/my-new-feature

6. Create a pull request to the develop branch of the original repository.
MIT License
Copyright (c) 2025 Marc Farssac, email: [email protected]
https://www.mfb.cat