Is there any optimal size for the data to be processed while using Spark NLP for Healthcare? #302
I have got my pipeline running with
Replies: 1 comment
Can you try writing the raw clinical_note_df to disk as parquet, then reading it back as parquet, and then transforming with the pipeline to get the resolutions? The most expensive part of your pipeline is the sbert_embedder stage, where you collect the embeddings from sbert. We just released much lighter versions of the sbert embedders tonight, but there is no compatible resolver released yet. So please try writing the raw clinical_note_df to disk as parquet first, and then monitor your CPU usage while resolving (transform).
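Put concretely, the suggested flow might look roughly like the sketch below. This is a minimal PySpark sketch, not the exact code from the thread: `clinical_note_df` comes from the question, while the parquet paths and `resolver_pipeline_model` (a fitted PipelineModel containing the sbert_embedder and resolver stages) are placeholder names.

```python
# Write the raw notes to disk as parquet, then read them back so the expensive
# resolver transform starts from a materialized dataset instead of a long lineage.
clinical_note_df.write.mode("overwrite").parquet("clinical_notes_raw.parquet")

notes_from_disk = spark.read.parquet("clinical_notes_raw.parquet")

# Transform with the fitted pipeline (placeholder name) and monitor CPU usage
# while this runs -- the sbert_embedder stage is the most expensive part.
resolutions_df = resolver_pipeline_model.transform(notes_from_disk)
resolutions_df.write.mode("overwrite").parquet("clinical_notes_resolved.parquet")
```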