
Problems with various getters and setters in the Python Spark NLP API #952

@C-K-Loan

Description


The following Spark NLP annotators have buggy behavior when getting and setting certain parameters.

Steps to Reproduce

  1. Open a terminal and start a Python shell
  2. Copy and paste the following imports into your Python shell
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *
  3. Create a SentenceDetector; calling a getter before anything has been set throws an error
spark = sparknlp.start()
sentence_detector = SentenceDetector()
sentence_detector.getInputCols()   # crash
sentence_detector.getOutputCols()  # crash
sentence_detector.getSplitLength() # returns nothing, intended behaviour?
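
As a possible workaround while the dedicated getters are broken, the values can be read through PySpark's generic Params API. This is only a sketch and assumes the annotator exposes inputCols as a standard pyspark.ml Param:

# Sketch of a workaround: read params through PySpark's generic Params API
# instead of the Spark NLP getters (assumes standard pyspark.ml Param objects).
if sentence_detector.isDefined(sentence_detector.inputCols):
    print(sentence_detector.getOrDefault(sentence_detector.inputCols))
else:
    print("inputCols not set")

# extractParamMap() lists every param that has a set or default value
print(sentence_detector.extractParamMap())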
  4. After using the setter methods to set the input and output columns, the getters return None
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
sentence_detector.getInputCols()   # now returns None
sentence_detector.getOutputCols()  # now returns None
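
One way to tell whether the values set from Python at least reach the JVM is to call the Scala getters on the wrapped Java object directly. This is only a diagnostic sketch: it assumes the standard pyspark.ml JavaParams wrapper (the _java_obj attribute) and the Scala-side getInputCols/getOutputCol methods:

# Diagnostic sketch: ask the wrapped Scala annotator for the values directly.
# Assumes the pyspark JavaParams wrapper (_java_obj) and the Scala getters exist.
java_annotator = sentence_detector._java_obj
print(java_annotator.getInputCols())  # value on the JVM side, or an exception if it never arrived
print(java_annotator.getOutputCol())  # value on the JVM side, or an exception if it never arrived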

  5. The same behavior of the getters/setters for the input and output columns can be seen for the following annotators and can be reproduced in exactly the same way as demonstrated with the SentenceDetector (a scripted check is sketched after this item):
  • Tokenizer, NorvigSweetingApproach, SpellChecker, ContextSpellChecker, DependencyParser, TypedDependencyParser, SentimentDetector, ViveknSentimentDetector, POSTagger, DeepSentenceDetector, SentenceDetector, DateMatcher, NGramGenerator, Chunker, TextMatcher, RegexMatcher, StopWordsCleaner, Lemmatizer, Stemmer, Normalizer
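
A quick way to sweep several of the annotators listed above in one go is a small loop. This is only a sketch covering a subset of the affected classes; all of them are assumed to be importable from sparknlp.annotator and constructible without arguments:

# Sketch: reproduce the broken getters for several annotators in one loop
# (only a subset of the affected classes)
for annotator_cls in [SentenceDetector, Tokenizer, Normalizer, Stemmer, Lemmatizer]:
    annotator = annotator_cls()
    try:
        print(annotator_cls.__name__, annotator.getInputCols(), annotator.getOutputCols())
    except Exception as e:
        print(annotator_cls.__name__, "crashed:", e)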
  6. Alternative buggy behavior:
  • Some annotators don't crash when getting their parameters, but they still return None after setting them.
  • elmo: Getting a parameter that has not been set returns None, and after setting the parameters the getters still return None.

elmo = ElmoEmbeddings.pretrained()
print(elmo.getInputCols())   # None before setting
print(elmo.getOutputCol())   # None before setting
elmo.setInputCols(["token", "document"]).setOutputCol("elmo")
print(elmo.getInputCols())   # still None after setting
print(elmo.getOutputCol())   # still None after setting

  • Exactly the same behavior as demonstrated with elmo has been tested and reproduced with Xlnet, Bert, Albert, UniversalSentenceEncoder, SentenceEmbeddings, ChunkEmbeddings, ClassifierDL, SentimentDL, and the language detector (a scripted sweep over some of these is sketched below).
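
For completeness, the elmo check can be scripted over a few of the other pretrained annotators. This is only a sketch: the class names below (BertEmbeddings, AlbertEmbeddings, XlnetEmbeddings) come from sparknlp.annotator, and each .pretrained() call downloads the default model, so those defaults are assumed to be available:

# Sketch: repeat the elmo check for other pretrained annotators
# (each .pretrained() call downloads a model, so this is slow)
for embeddings_cls in [BertEmbeddings, AlbertEmbeddings, XlnetEmbeddings]:
    model = embeddings_cls.pretrained()
    model.setInputCols(["token", "document"]).setOutputCol("embeddings")
    print(embeddings_cls.__name__, model.getInputCols(), model.getOutputCol())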

Third alternative behavior:

  • The NER CRF tagger returns None when getting the input columns and crashes when getting the output column.
  • The NER DL annotator crashes on both the input- and output-column getters and returns None even after setting them (a short reproduction is sketched after this list).
  • I tested every annotator.
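
A minimal reproduction for the NER case; this sketch assumes that the classes meant above are NerCrfApproach and NerDLApproach from sparknlp.annotator:

# Sketch: reproduce the NER getter behavior (assumes the reporter's
# "NER CRF Tagger" / "NER DL" map to NerCrfApproach / NerDLApproach)
ner_crf = NerCrfApproach()
print(ner_crf.getInputCols())   # returns None
ner_crf.getOutputCol()          # crash

ner_dl = NerDLApproach()
ner_dl.setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
print(ner_dl.getInputCols())    # still None after setting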

Annotators unaffected by this bug:

  • DocumentAssembler
  • Pretrained pipelines were not tested.

Your Environment

  • Spark NLP version: tested on 2.5.1 and 2.5.2
  • Java version (java -version): openjdk version "1.8.0_252"
  • Setup and installation (Pypi, Conda, Maven, etc.): via pip. Also tested on Databricks
  • Operating System and version: Manjaro Linux
