[SPARK-10809] [MLlib] Single-document topicDistributions method for LocalLDAModel #9484
Conversation
Test build #45087 has finished for PR 9484 at commit
Can you please remove the doc ID? It's not necessary for a single doc, and removing it will make this more Java-friendly.
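For context, a rough sketch of the doc-ID-free, Java-friendly shape being asked for here. The trait and exact method name are illustrative assumptions, not the PR's final API:

```scala
// Illustrative only: a single-document signature with no doc ID, so Java callers
// can pass a plain Vector instead of a Tuple2 of (id, vector).
import org.apache.spark.mllib.linalg.Vector

trait SingleDocTopicInference {
  /** Topic mixture for one document; takes and returns plain Vectors, which keeps it easy to call from Java. */
  def topicDistribution(document: Vector): Vector
}
```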
Test build #45202 has finished for PR 9484 at commit
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA (https://issues.apache.org/jira/browse/SPARK-5565), with one major change:

* I eliminated doc IDs. These are not necessary with DataFrames, since the user can add an ID column as needed.

Note: This will conflict with #9484, but I'll try to merge #9484 first and then rebase this PR.

CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6. CC: mengxr

Author: Joseph K. Bradley <[email protected]>

Closes #9513 from jkbradley/lda-pipelines.

(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <[email protected]>
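To illustrate the "user can add an ID column as needed" point, here is a rough spark.ml-style sketch. The column names, parameter values, and the input DataFrame `docs` are assumptions for illustration, not taken from this PR:

```scala
// Sketch only: assumes a spark.ml LDA estimator as described in SPARK-5565,
// and an input DataFrame with a "features" column of term-count vectors.
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

def fitWithUserIds(docs: DataFrame): DataFrame = {
  // No doc ID is required by the API; the caller attaches one only if needed.
  val docsWithId = docs.withColumn("docId", monotonically_increasing_id())

  val lda = new LDA().setK(10).setMaxIter(20).setFeaturesCol("features")
  val model = lda.fit(docsWithId)

  // transform() adds a per-document topic-distribution column; the "docId"
  // column rides along, so results can still be joined back by ID downstream.
  model.transform(docsWithId)
}
```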
Update this line (no doc ID)
@hhbyyh Sorry again for the delay, but we can get this merged now.
@jkbradley It's quite all right. Thanks for reviewing. Update sent.
Test build #48895 has finished for PR 9484 at commit
The Scala doc for this line is not generated correctly. Can you try removing the argument and just writing [[topicDistributions]] instead?
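For reference, the kind of Scaladoc link being suggested: a bare `[[topicDistributions]]` with no argument list, so the cross-reference resolves cleanly. This is an illustrative snippet, not the exact patch:

```scala
import org.apache.spark.mllib.linalg.Vector

object ScaladocSketch {
  /**
   * Predicts the topic mixture ("theta") for a single document.
   *
   * Equivalent to calling [[topicDistributions]] on a one-document RDD,
   * but without the RDD overhead. Note the link carries no argument list.
   */
  def topicDistribution(document: Vector): Vector = ???
}
```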
Sorry for the late response. Update sent.
Jenkins, retest this please.
Test build #49109 has finished for PR 9484 at commit
Getting many TimeoutExceptions.
Test build #49124 has finished for PR 9484 at commit
LGTM
jira: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow quick queries that avoid RDD operations. Currently, the user must pass an RDD of documents.
This also adds some missing asserts.
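A rough sketch of the intended usage and of the kind of input check mentioned above. The method name `topicDistribution` and the specific `require` are assumptions based on the description, not a quote of the final patch:

```scala
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.Vector

// Given an already-trained model and one document's term-count vector,
// query the topic mixture directly, without building an RDD.
def querySingleDoc(model: LocalLDAModel, document: Vector): Vector = {
  // The kind of missing check being added: the document must match the model's vocabulary size.
  require(document.size == model.vocabSize,
    s"Document vector size ${document.size} does not match model vocabSize ${model.vocabSize}")
  // Single-document inference, avoiding an RDD round-trip.
  model.topicDistribution(document)
}
```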