SPARKNLP-1260 Introducing Reader2Table Annotator #14640

danilojsl · 2025-07-31T17:42:51Z

Description

This PR introduces the new Reader2Table annotator to Spark NLP. It provides a single interface for working with Spark NLP readers and for plugging them into Spark NLP pipelines.

Key Improvements:

Simplifies integration with Spark NLP readers through a unified interface
Adds flexibility by enabling more reader-specific configurations
Enhances the maintainability and scalability of data loading workflows

Supported formats include:

HTML
Word (.doc/.docx)
Excel (.xls/.xlsx)
PowerPoint (.ppt/.pptx)
Markdown (.md)
CSV (.csv)

Motivation and Context

The current approach to interfacing with Spark NLP readers is fragmented and lacks flexibility, often requiring custom code for handling various input sources and options. This makes onboarding harder for new users and hinders reuse across pipelines.

The Reader2Table component abstracts these complexities by:

Unifying access patterns for multiple readers
Reducing boilerplate code in reader configuration
Making it easier to scale and switch between different data sources
Outputting only the content found inside tables

This feature is part of the requirements described in issue Reader Annotators #14624.
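
To illustrate the idea of unifying access patterns and emitting only table content, here is a minimal sketch in plain Python. It is not the Reader2Table implementation and does not use the Spark NLP API. The names `read_tables`, `READERS`, `html_tables`, `csv_tables`, and `markdown_tables` are hypothetical, and only three of the listed formats are covered because the Office formats would need third-party parsers.

```python
import csv
import io
from html.parser import HTMLParser
from pathlib import PurePath


class _TableRows(HTMLParser):
    """Collects the cell text of every <tr>; text outside cells is ignored."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def html_tables(text):
    parser = _TableRows()
    parser.feed(text)
    parser.close()
    return parser.rows


def csv_tables(text):
    # A CSV file is treated as one table; blank lines are skipped.
    return [row for row in csv.reader(io.StringIO(text)) if row]


def markdown_tables(text):
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # prose outside a table is dropped
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= set("-: ") for c in cells):
            continue  # header separator such as |---|:--:|
        rows.append(cells)
    return rows


# Hypothetical registry: one entry point, one reader per file extension.
READERS = {
    ".html": html_tables,
    ".htm": html_tables,
    ".csv": csv_tables,
    ".md": markdown_tables,
}


def read_tables(filename, text):
    """Pick a reader from the file extension and return table rows only."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return READERS[suffix](text)
```

With a registry like this, switching data sources means changing the file passed in rather than the calling code, which is the kind of boilerplate reduction the list above describes.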

How Has This Been Tested?

Screenshots (if appropriate):

Local Tests
Google Colab notebook
Databricks notebook

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

danilojsl self-assigned this Jul 31, 2025

danilojsl added 5 commits July 31, 2025 12:44

[SPARKNLP-1260] Adding non-ASCII and XML support to MarkdownReader

813f405

[SPARKNLP-1260] Introducing Reader2Table Annotator

d361966

[SPARKNLP-1260] Adding outputformat parameter and demo notebook

40068aa

[SPARKNLP-1260] Adding support for mixed files to Reader2Doc and Reader2Table

fa40424

[SPARKNLP-1260] Updating demo notebooks to Reader2Doc and Reader2Table

3b21d4d

danilojsl force-pushed the feature/SPARKNLP-1260-Implement-Reader2Table-Annotator branch from c8543a7 to 3b21d4d July 31, 2025 17:50

[SPARKNLP-1260] Fix python test files path

e3da196

danilojsl marked this pull request as ready for review August 1, 2025 13:13

danilojsl requested review from DevinTDHa and mehmetbutgul August 1, 2025 13:13

danilojsl added the new-feature (Introducing a new feature) label Aug 2, 2025

DevinTDHa changed the base branch from master to release/611-release-candidate August 4, 2025 15:37

DevinTDHa approved these changes Aug 5, 2025

DevinTDHa merged commit aaca781 into release/611-release-candidate Aug 5, 2025
4 checks passed

DevinTDHa mentioned this pull request Aug 5, 2025

Spark NLP Release 6.1.1 #14643

Merged

10 tasks
