Is there any optimal size for the data to be processed while using Spark NLP for Healthcare? #302
I have got my pipeline running with
Replies: 1 comment
Can you try writing the raw clinical_note_df to disk as parquet, then reading it back as parquet, and then transforming with the pipeline to get the resolutions? The most expensive part of your pipeline is the sbert_embedder stage, where you collect the embeddings from sbert. We just released much lighter versions of the sbert embedders tonight, but there is no compatible resolver released yet. So please try writing the raw clinical_note_df to disk as parquet first, and then monitor your CPU usage while resolving (transform).
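Put concretely, the suggested flow might look roughly like the sketch below. This is a minimal PySpark sketch, not the exact code from the thread: `clinical_note_df` comes from the question, while the parquet paths and `resolver_pipeline_model` (a fitted PipelineModel containing the sbert_embedder and resolver stages) are placeholder names.

```python
# Write the raw notes to disk as parquet, then read them back so the expensive
# resolver transform starts from a materialized dataset instead of a long lineage.
clinical_note_df.write.mode("overwrite").parquet("clinical_notes_raw.parquet")

notes_from_disk = spark.read.parquet("clinical_notes_raw.parquet")

# Transform with the fitted pipeline (placeholder name) and monitor CPU usage
# while this runs -- the sbert_embedder stage is the most expensive part.
resolutions_df = resolver_pipeline_model.transform(notes_from_disk)
resolutions_df.write.mode("overwrite").parquet("clinical_notes_resolved.parquet")
```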