[SPARK-19791] [ML] Add doc and example for fpgrowth #17130
Conversation
| val itemset = items.toSet
| brRules.value.flatMap(rule =>
| if (items != null && rule._1.forall(item => itemset.contains(item))) {
| brRules.value.flatMap { rule =>
Nit, while we're here -- why change this bit?
Or if simplifying, what about
brRules.value.filter(_._1.forall(itemset.contains)).flatMap(_._2.filter(!itemset.contains(_)))
The change was about a style comment from the original PR that I missed. But it's great to see your suggestion. I'll run some tests to confirm the performance.
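To make the comparison concrete, here is a small self-contained sketch of the two formulations on plain Scala collections rather than the broadcast `brRules` value (names and sample data are illustrative and runnable in a Scala REPL; this is not the exact FPGrowthModel source):

```scala
// Rules as (antecedent, consequent) pairs, mirroring the element type of brRules.value.
val rules: Array[(Array[String], Array[String])] = Array(
  (Array("1", "2"), Array("5")),
  (Array("5"), Array("3"))
)
val items: Seq[String] = Seq("1", "2")
val itemset = items.toSet

// Original shape: flatMap with an explicit if/else per rule.
val predictionA =
  if (items != null) {
    rules.flatMap { rule =>
      if (rule._1.forall(itemset.contains)) rule._2.filter(!itemset.contains(_))
      else Array.empty[String]
    }.distinct
  } else {
    Array.empty[String]
  }

// Suggested simplification: keep only the applicable rules, then collect their new consequents.
val predictionB = rules
  .filter(_._1.forall(itemset.contains))
  .flatMap(_._2.filter(!itemset.contains(_)))
  .distinct

assert(predictionA.sameElements(predictionB)) // both yield Array("5")
```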
Test build #73719 has finished for PR 17130 at commit
Just some minor comments. I also ran the examples and output looks good.
| * }}}
| */
| object FPGrowthExample {
|
nit: remove blank line
| object FPGrowthExample {
|
| def main(args: Array[String]): Unit = {
|
nit: remove blank line
Thanks, I'll remove this line.
| import spark.implicits._
|
| // $example on$
| // Loads data.
I think this comment is pretty obvious, you can probably remove it
| "1 2 5", | ||
| "1 2 3 5", | ||
| "1 2") | ||
| ).map(t => t.split(" ")).toDF("features") |
I think it is better to explicitly declare the data instead of manipulating strings, that way it is very clear what the input data is for the example. On second thought, never mind this comment - it's pretty clear the way it is
| val fpgrowth = new FPGrowth().setMinSupport(0.5).setMinConfidence(0.6)
| val model = fpgrowth.fit(dataset)
|
| // get frequent itemsets.
should say "Display frequent itemsets."
| // get frequent itemsets.
| model.freqItemsets.show()
|
| // get generated association rules.
same as comment above
| // transform examines the input items against all the association rules and summarize the
| // consequents as prediction
| model.transform(dataset).show()
|
nit: remove blank line
| /**
| * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
| * partition number of the input dataset is used.
| * Number of partitions (positive) used by parallel FP-growth. By default the param is not set,
Why change to "positive"? I think it was clearer before.
I presume it's to fix a javadoc error, because angle brackets are read as opening an HTML tag.
Yes, that's the reason. But I'm still getting some javadoc errors after merging the code. Looking into it.
I don't think the error is related to this change.
Let's just use ">=1" but figure out how to escape the characters for javadoc. We'll want to do that long-term.
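For reference, a sketch of two standard ways to keep ">= 1" from being read as the start of an HTML tag in the generated javadoc (which spelling the final patch used is not shown in this thread; the object name is just for illustration):

```scala
object JavadocEscapeSketch {
  /**
   * Number of partitions (&gt;= 1) used by parallel FP-growth.
   * An equivalent javadoc-side spelling is {@literal >= 1}; both render as ">= 1"
   * in the generated API docs without tripping the HTML parser.
   */
  val numPartitionsDoc: String = "numPartitions"
}
```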
| "1 2") | ||
| ).map(t => t.split(" ")).toDF("features") | ||
|
|
||
| // Trains a FPGrowth model. |
nit: technically it is just the line below calling fit that trains the model, I would move this comment down or just take it out
ping @jkbradley since we're changing the FPGrowth

Test build #73790 has finished for PR 17130 at commit

Test build #74383 has started for PR 17130 at commit

The updated transform looks good; thanks for pinging!

Test build #3599 has finished for PR 17130 at commit

Thanks for the review. I'll wait for #17283 to be merged first and resolve the conflict.

Test build #74565 has finished for PR 17130 at commit

Test build #74636 has finished for PR 17130 at commit

Refined some comments and minor things. This should be ready for review. Thanks.

Test build #74994 has finished for PR 17130 at commit

I'll be happy to help get this merged now that the column renaming is done

Noting here: Please check out the "Issue this PR brought up" here: #17218 It may affect this PR. Thanks!

Test build #75112 has finished for PR 17130 at commit
A few remarks regarding the documentation; otherwise it looks good 👍
docs/ml-frequent-pattern-mining.md
| After the second step, the frequent itemsets can be extracted from the FP-tree.
| In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
| as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
| PFP distributes the work of growing FP-trees based on the suffices of transactions,
suffixes
docs/ml-frequent-pattern-mining.md
| In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
| as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
| PFP distributes the work of growing FP-trees based on the suffices of transactions,
| and hence more scalable than a single-machine implementation.
is more scalable
docs/ml-frequent-pattern-mining.md
| * `minSupport`: the minimum support for an itemset to be identified as frequent.
| For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
| * `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
| for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
It might be good to give an example for confidence as well since one has been given for support
also, there are two commas after itemsets
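For instance, a sketch of the kind of confidence example being asked for (the counts are illustrative, not taken from the PR): if the itemset {1, 2} appears in 4 out of 5 transactions and {1, 2, 5} appears in 3 out of 5, then

```latex
% Confidence of an association rule X => Y, using the illustrative counts above.
\[
  \mathrm{confidence}(X \Rightarrow Y)
    = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}
    = \frac{3/5}{4/5}
    = 0.75
\]
```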
docs/ml-frequent-pattern-mining.md
| * `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
| for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
| * `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
| partition number of the input dataset is used.
the number of partitions of the input dataset
docs/ml-frequent-pattern-mining.md
| * `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
| DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
| * `transform`: The transform method examines the input items in `itemsCol` against all the association rules and
| summarize the consequents as prediction. The prediction column has the same data type as the
I don't think this really explains what transform does, or maybe it's just me?
I would have said something like:
The transform method will produce a `predictionCol` containing all the consequents of the association rules containing the items in `itemsCol` as their antecedents. The prediction column...
Thanks for the suggestion. I do wish to have a better illustration here, but the two "containing"s in your version make it not that straightforward, and actually it should be that the items in `itemsCol` contain the antecedents of the association rules.
I extended it to a longer version:
For each record in `itemsCol`, the transform method will compare its items against the antecedents of each association rule. If the record contains all the antecedents of a specific association rule, the rule is considered applicable and its consequents will be added to the prediction result. The transform method will summarize the consequents from all the applicable rules as the prediction.
even better 👍
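To make that description concrete, a minimal self-contained Scala sketch of the fit/transform flow on the toy transactions used elsewhere in this PR (assumes Spark 2.2+; the object name is illustrative, and the exact rows shown depend on the chosen thresholds):

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

object FPGrowthTransformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FPGrowthTransformSketch").getOrCreate()
    import spark.implicits._

    // Same toy transactions as the example under review.
    val dataset = Seq("1 2 5", "1 2 3 5", "1 2")
      .map(_.split(" "))
      .toDF("items")

    val model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.5)
      .setMinConfidence(0.6)
      .fit(dataset)

    // A rule applies to a row when the row's items contain the rule's full antecedent;
    // transform unions the consequents of all applicable rules (minus items already
    // present) into the prediction column.
    model.associationRules.show()
    model.transform(dataset).show()

    spark.stop()
  }
}
```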
Test build #75409 has finished for PR 17130 at commit
| fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
| fpGrowthModel = fpGrowth.fit(df)
|
Can we add an `associationRules` example? After all, this is the biggest advantage for the Python user.
definitely. Thanks.
Test build #75714 has finished for PR 17130 at commit

right, how are we on this? let's get this ready soon and merge?
| Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.fpm.FPGrowth) for more details.
|
| {% include_example python/ml/fpgrowth_example.py %}
| </div>
add R please
Sure. Added a reference to the R example. Manually checked the generated doc.
Test build #75947 has finished for PR 17130 at commit
docs/ml-frequent-pattern-mining.md
| * `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
| * `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
| DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
| * `transform`: For each transaction in itemsCol, the `transform` method will compare its items against the antecedents
itemsCol?
Please refer to https://issues.apache.org/jira/browse/SPARK-19899
I mean style it as code with backtick
sure.
| /**
| * Minimal confidence for generating Association Rule.
| * Note that minConfidence has no effect during fitting.
| * Minimal confidence for generating Association Rule. MinConfidence will not affect the mining
lower case minConfidence?
ping
got it.
| Seq.empty
| }).distinct
| brRules.value.filter(_._1.forall(itemset.contains))
| .flatMap(_._2.filter(!itemset.contains(_))).distinct
Are we aware of the code changes here?
so we don't need to handle null item?
Hi @felixcheung, can you be more specific about the null item that concerns you? Thanks.
right, 2 things - first, just calling out that while the PR says doc changes, there is this one code change here.
second, this code was previously checking items != null; do we not need to consider that now?
items != null is already checked two lines above.
Please refer to the comments in the PR for the history of the code change. I can update the title to include the code change.
let's update the PR/JIRA if a code change is required for the doc change.
otherwise, let's leave the code change to a separate PR?
I guess that's the right way. I will revert the code change.
@felixcheung, reverted the code change.

LGTM, thanks for adding this! @felixcheung OK with merging?

Test build #76287 has finished for PR 17130 at commit
LGTM
merged to master/2.2
## What changes were proposed in this pull request?
Add a new section for fpm
Add Example for FPGrowth in scala and Java
updated: Rewrite transform to be more compact.

## How was this patch tested?
local doc generation.

Author: Yuhao Yang <[email protected]>

Closes #17130 from hhbyyh/fpmdoc.

(cherry picked from commit add9d1b)
Signed-off-by: Felix Cheung <[email protected]>
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-20670
As suggested by Sean Owen in apache#17130, the transform code in FPGrowthModel can be simplified. As I tested on some public datasets (http://fimi.ua.ac.be/data/), the performance of the new transform code is on par with or better than the old implementation.

## How was this patch tested?
Existing unit test.

Author: Yuhao Yang <[email protected]>

Closes apache#17912 from hhbyyh/fpgrowthTransform.
What changes were proposed in this pull request?
Add a new section for fpm.
Add examples for FPGrowth in Scala and Java.
Updated: rewrite transform to be more compact.
How was this patch tested?
Local doc generation.