@danilojsl
Contributor

Description

This PR introduces the new Reader2Table annotator to Spark NLP. It provides a streamlined, user-friendly interface to the Spark NLP readers and integrates with Spark NLP pipelines.

Key Improvements:

  • Simplifies integration with Spark NLP readers through a unified interface
  • Adds flexibility by enabling more reader-specific configurations
  • Enhances the maintainability and scalability of data loading workflows

Supported formats include:

  • HTML
  • Word (.doc/.docx)
  • Excel (.xls/.xlsx)
  • PowerPoint (.ppt/.pptx)
  • Markdown (.md)
  • CSV (.csv)

Motivation and Context

The current approach to interfacing with Spark NLP readers is fragmented and lacks flexibility, often requiring custom code for handling various input sources and options. This makes onboarding harder for new users and hinders reuse across pipelines.

The Reader2Table component abstracts these complexities by:

  • Unifying access patterns for multiple readers
  • Reducing boilerplate code in reader configuration
  • Making it easier to scale and switch between different data sources
  • Outputting only the content found inside tables

This feature is part of the requirements described in issue Reader Annotators #14624.
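
As a rough illustration of two ideas from the list above (one entry point for several formats, and output restricted to table content), here is a minimal standard-library sketch. It is not the Reader2Table implementation and does not use the Spark NLP API. The names `read_tables`, `html_tables`, `csv_tables`, and `EXTRACTORS` are hypothetical, and only HTML and CSV are covered.

```python
import csv
import io
from html.parser import HTMLParser


class _TableOnlyParser(HTMLParser):
    """Collects cell text from <table> elements and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows, each row a list of cells
        self._depth = 0    # greater than 0 while inside a <table>
        self._row = None   # row being built, or None
        self._cell = None  # text fragments of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append([])
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside a table cell (headings, paragraphs) is dropped.
        if self._cell is not None:
            self._cell.append(data)


def html_tables(content):
    """Return every top-level table in an HTML string (well-formed markup assumed)."""
    parser = _TableOnlyParser()
    parser.feed(content)
    parser.close()
    return parser.tables


def csv_tables(content):
    """Treat a CSV document as a single table."""
    return [list(csv.reader(io.StringIO(content)))]


# Hypothetical registry: one extractor per content type.
EXTRACTORS = {"text/html": html_tables, "text/csv": csv_tables}


def read_tables(content, content_type):
    """Single entry point: choose the extractor from the content type."""
    try:
        extractor = EXTRACTORS[content_type]
    except KeyError:
        raise ValueError(f"unsupported content type: {content_type}") from None
    return extractor(content)


sample_html = (
    "<h1>Report</h1><p>Intro text.</p>"
    "<table><tr><th>id</th><th>name</th></tr>"
    "<tr><td>1</td><td>Ada</td></tr></table>"
)
print(read_tables(sample_html, "text/html"))        # [[['id', 'name'], ['1', 'Ada']]]
print(read_tables("id,name\n1,Ada\n", "text/csv"))  # [[['id', 'name'], ['1', 'Ada']]]
```

In this sketch the caller changes only the content type to switch sources, and the heading and paragraph around the HTML table never reach the output.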

How Has This Been Tested?

Screenshots (if appropriate):

  • Local Tests
  • Google Colab notebook
  • Databricks notebook

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl self-assigned this Jul 31, 2025
@danilojsl danilojsl force-pushed the feature/SPARKNLP-1260-Implement-Reader2Table-Annotator branch from c8543a7 to 3b21d4d Compare July 31, 2025 17:50
@danilojsl danilojsl marked this pull request as ready for review August 1, 2025 13:13
@danilojsl danilojsl added the new-feature Introducing a new feature label Aug 2, 2025
@DevinTDHa DevinTDHa changed the base branch from master to release/611-release-candidate August 4, 2025 15:37
@DevinTDHa DevinTDHa merged commit aaca781 into release/611-release-candidate Aug 5, 2025
4 checks passed
@DevinTDHa DevinTDHa mentioned this pull request Aug 5, 2025
10 tasks