Description
The following Spark NLP annotators have buggy behavior when getting and setting certain parameters.
Steps to Reproduce
- Open a terminal and start a Python shell
- Paste the following imports into the Python shell
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.base import *
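- Create a DocumentAssembler and a SentenceDetector; calling the SentenceDetector getters before anything has been set throws an error
spark = sparknlp.start()
document_assembler = DocumentAssembler()
sentence_detector = SentenceDetector()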
sentence_detector.getInputCols() # crash
sentence_detector.getOutputCols() # crash
sentence_detector.getSplitLength() # returns nothing, intended behaviour?
- After using the setter methods to set the input and output columns, the getters return None
sentence_detector = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
sentence_detector.getInputCols() # now returns None
sentence_detector.getOutputCols() # now returns None
- The same getter/setter behavior for InputCols and OutputCols occurs with the following annotators and can be reproduced exactly as shown above for the sentence detector:
- Tokenizer, NorvigSweetingApproach, SpellChecker, ContextSpellChecker, DependencyParser, TypedDependencyParser, SentimentDetector, ViveknSentimentDetector, POSTagger, DeepSentenceDetector, SentenceDetector, DateMatcher, NGramGenerator, Chunker, TextMatcher, RegexMatcher, StopWordsCleaner, Lemmatizer, Stemmer, Normalizer
- Alternative buggy behavior:
- Some annotators don't crash when a parameter is read, but the getter still returns None after the parameter has been set.
- ELMo: the getters return None before the parameters are set, and they still return None after the parameters are set
elmo = ElmoEmbeddings.pretrained()
print(elmo.getInputCols())
print(elmo.getOutputCol())
elmo.setInputCols(["token", "document"]).setOutputCol("elmo")
print(elmo.getInputCols())
print(elmo.getOutputCol())
- Exactly the same behavior as shown with ELMo has been reproduced with XLNet, BERT, ALBERT, UniversalSentenceEncoder, SentenceEmbeddings, ChunkEmbeddings, ClassifierDL, SentimentDL, and LanguageDetector
Third variant of the behavior:
- NER CRF Tagger: getting the input columns returns None, and getting the output column crashes.
- NER DL: getting either the input or the output columns crashes, and the getters return None even after the columns have been set.
I tested every annotator
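A sweep like this could be automated with a small helper. The sketch below is only an illustration: `probe_getters` and `FakeAnnotator` are hypothetical names that are not part of Spark NLP, and the fake class merely mimics the three behaviors reported above (getter raises, getter returns None, getter works) so the helper can be run without a Spark session.

```python
from typing import Any, Dict, Iterable, Tuple


def probe_getters(annotator: Any, getter_names: Iterable[str]) -> Dict[str, Tuple[str, Any]]:
    """Call each named getter and record how it behaves.

    Returns a dict mapping getter name -> (status, detail), where status is
    "raised" (detail is the exception text), "none" (the getter returned
    None), or "ok" (detail is the returned value).
    """
    report = {}
    for name in getter_names:
        try:
            value = getattr(annotator, name)()
        except Exception as exc:  # also covers getters that do not exist
            report[name] = ("raised", f"{type(exc).__name__}: {exc}")
            continue
        report[name] = ("none", None) if value is None else ("ok", value)
    return report


class FakeAnnotator:
    """Hypothetical stand-in that imitates the reported behaviors."""

    def getInputCols(self):
        # Imitates a getter that crashes.
        raise KeyError("inputCols")

    def getOutputCol(self):
        # Imitates a getter that returns None.
        return None

    def getWorking(self):
        # Imitates a getter that behaves as expected.
        return ["document"]


if __name__ == "__main__":
    names = ["getInputCols", "getOutputCol", "getWorking"]
    for name, (status, detail) in probe_getters(FakeAnnotator(), names).items():
        print(f"{name}: {status} ({detail})")
```

With Spark NLP installed, the same helper could presumably be pointed at a real annotator, for example `probe_getters(SentenceDetector(), ["getInputCols", "getOutputCol"])`, once before and once after calling the setters.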
Annotators unaffected by this bug:
- DocumentAssembler
- Pretrained pipelines were not tested
Your Environment
- Spark NLP version: tested on 2.5.1 and 2.5.2
- Java version (java -version): openjdk version "1.8.0_252"
- Setup and installation (Pypi, Conda, Maven, etc.): via pip. Also tested on Databricks
- Operating System and version: Manjaro Linux