
Conversation

@hhbyyh
Contributor

@hhbyyh hhbyyh commented Mar 1, 2017

What changes were proposed in this pull request?

Add a new section for fpm
Add Example for FPGrowth in scala and Java

updated: Rewrite transform to be more compact.

How was this patch tested?

local doc generation.

val itemset = items.toSet
brRules.value.flatMap(rule =>
if (items != null && rule._1.forall(item => itemset.contains(item))) {
brRules.value.flatMap { rule =>
Member

Nit, while we're here -- why change this bit?
Or if simplifying, what about

brRules.value.filter(_._1.forall(itemset.contains)).flatMap(_._2.filter(!itemset.contains(_)))

Contributor Author

The change addresses a style comment from the original PR that I missed. But it's great to see your suggestion. I'll run some tests to confirm the performance.
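The two transform variants under discussion can be sketched outside Spark in plain Python (a hypothetical, simplified model of the logic; `rules` stands in for the broadcast `brRules.value` as a list of (antecedent, consequent) pairs, and `items` is one row of the items column):

```python
# Hypothetical, Spark-free sketch of the two transform variants discussed
# above; not the actual FPGrowthModel implementation.

def predict_verbose(items, rules):
    """Original style: explicit null guard, then flatMap over matching rules."""
    if items is None:
        return []
    itemset = set(items)
    preds = []
    for antecedent, consequent in rules:
        # A rule applies only if the record contains every antecedent item.
        if all(item in itemset for item in antecedent):
            preds.extend(c for c in consequent if c not in itemset)
    return sorted(set(preds))

def predict_compact(items, rules):
    """Simplified style: filter applicable rules, then collect consequents."""
    itemset = set(items or [])
    applicable = [r for r in rules if itemset.issuperset(r[0])]
    return sorted({c for _, cons in applicable for c in cons} - itemset)

rules = [(["1", "2"], ["5"]), (["5"], ["3"])]
print(predict_verbose(["1", "2"], rules))  # ['5']
print(predict_compact(["1", "2"], rules))  # ['5']
```

Both variants agree on the output; the compact form simply moves the applicability check into a filter before collecting consequents.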

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73719 has finished for PR 17130 at commit fdce240.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaFPGrowthExample

Member

@BryanCutler BryanCutler left a comment

Just some minor comments. I also ran the examples and output looks good.

* }}}
*/
object FPGrowthExample {

Member

nit: remove blank line

object FPGrowthExample {

def main(args: Array[String]): Unit = {

Member

nit: remove blank line

Contributor Author

Thanks, I'll remove this line.

import spark.implicits._

// $example on$
// Loads data.
Member

I think this comment is pretty obvious, you can probably remove it

"1 2 5",
"1 2 3 5",
"1 2")
).map(t => t.split(" ")).toDF("features")
Member

@BryanCutler BryanCutler Mar 2, 2017

I think it is better to explicitly declare the data instead of manipulating strings, that way it is very clear what the input data is for the example. On second thought, never mind this comment - it's pretty clear the way it is

val fpgrowth = new FPGrowth().setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(dataset)

// get frequent itemsets.
Member

should say "Display frequent itemsets."

// get frequent itemsets.
model.freqItemsets.show()

// get generated association rules.
Member

same as comment above

// transform examines the input items against all the association rules and summarize the
// consequents as prediction
model.transform(dataset).show()

Member

nit: remove blank line

/**
* Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
* partition number of the input dataset is used.
* Number of partitions (positive) used by parallel FP-growth. By default the param is not set,
Member

why change to "positive", I think it was clearer before

Member

I presume to fix a javadoc error because angle brackets are read as opening an HTML tag

Contributor Author

Yes, that's the reason. But I'm still getting some javadoc errors after merging the code. Looking into it...

Contributor Author

I don't think the error is related to this change.

Member

Let's just use ">=1" but figure out how to escape the characters for javadoc. We'll want to do that long-term.

"1 2")
).map(t => t.split(" ")).toDF("features")

// Trains a FPGrowth model.
Member

nit: technically it is just the line below calling fit that trains the model, I would move this comment down or just take it out

@hhbyyh
Contributor Author

hhbyyh commented Mar 2, 2017

ping @jkbradley since we're changing the FPGrowth transform.
Sean made a great suggestion to simplify transform code.

@SparkQA

SparkQA commented Mar 2, 2017

Test build #73790 has finished for PR 17130 at commit ca12877.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 11, 2017

Test build #74383 has started for PR 17130 at commit 9ce0093.

@jkbradley
Member

The updated transform looks good; thanks for pinging!

@SparkQA

SparkQA commented Mar 13, 2017

Test build #3599 has finished for PR 17130 at commit 9ce0093.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Mar 14, 2017

Thanks for the review. I'll wait for #17283 to be merged first and resolve the conflict.

@SparkQA

SparkQA commented Mar 14, 2017

Test build #74565 has finished for PR 17130 at commit fa4c734.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2017

Test build #74636 has finished for PR 17130 at commit de1bfc8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Mar 16, 2017

Refined some comments and minor things. This should be ready for review. Thanks.

@hhbyyh
Contributor Author

hhbyyh commented Mar 17, 2017

Please hold on merging this until #17321 is resolved.
Updated with the ItemsCol.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74994 has finished for PR 17130 at commit 9fef280.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I'll be happy to help get this merged now that the column renaming is done

@jkbradley
Member

Noting here: Please check out the "Issue this PR brought up" here: #17218

It may affect this PR. Thanks!

@SparkQA

SparkQA commented Mar 23, 2017

Test build #75112 has finished for PR 17130 at commit 9e908d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@BenFradet BenFradet left a comment

a few remarks regarding the documentation otherwise it looks good 👍

After the second step, the frequent itemsets can be extracted from the FP-tree.
In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffices of transactions,
Contributor

suffixes

In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffices of transactions,
and hence more scalable than a single-machine implementation.
Contributor

is more scalable

* `minSupport`: the minimum support for an itemset to be identified as frequent.
For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
* `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
Contributor

It might be good to give an example for confidence as well since one has been given for support
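To sketch such an example (hypothetical, not part of the PR), support and confidence can be computed by hand on the three example transactions used in this documentation:

```python
# Hypothetical sketch: support and confidence computed directly on the
# three example transactions from this PR's FPGrowth example.

transactions = [
    {"1", "2", "5"},
    {"1", "2", "3", "5"},
    {"1", "2"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if t >= set(itemset))
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent union consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"1", "2"}))  # 1.0: {1, 2} appears in all 3 transactions
print(support({"5"}))       # ~0.667: {5} appears in 2 of 3
# Rule {1, 2} => {5} has confidence 2/3, so it survives minConfidence = 0.6.
print(confidence({"1", "2"}, {"5"}))
```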

Contributor

also, there are two commas after itemsets

* `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
* `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
partition number of the input dataset is used.
Contributor

the number of partitions of the input dataset

* `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
* `transform`: The transform method examines the input items in `itemsCol` against all the association rules and
summarize the consequents as prediction. The prediction column has the same data type as the
Contributor

I don't think this really explains what transform does or maybe it's just me?

I would have said something like:

The transform method will produce predictionCol containing all the consequents of the association rules containing the items in itemsCol as their antecedents. The prediction column...

Contributor Author

Thanks for the suggestion. I do wish to have a better illustration here. But the two "containing"s in your version make it less straightforward, and actually it should be that the items in itemsCol contain the antecedents of the association rules.

I extend it to a longer version,

For each record in itemsCol, the transform method will compare its items against the antecedents of each association rule. If the record contains all the antecedents of a specific association rule, the rule will be considered as applicable and its consequents will be added to the prediction result. The transform method will summarize the consequents from all the applicable rules as prediction.
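That description can be sketched as plain Python (hypothetical and Spark-free; the rule set below is illustrative, not taken from a real fitted model):

```python
# Hypothetical sketch of the transform semantics described above: for each
# record, collect the consequents of every applicable association rule.

rules = [
    (frozenset({"1", "2"}), ["5"]),  # antecedent -> consequent
    (frozenset({"5"}), ["3"]),
]

def transform(records):
    out = []
    for items in records:
        itemset = set(items)
        prediction = []
        for antecedent, consequent in rules:
            # A rule applies only if the record contains ALL antecedents;
            # consequents already present in the record are not predicted.
            if antecedent <= itemset:
                prediction.extend(c for c in consequent if c not in itemset)
        out.append((items, sorted(set(prediction))))
    return out

for items, pred in transform([["1", "2", "5"], ["1", "2", "3", "5"], ["1", "2"]]):
    print(items, "->", pred)
# ['1', '2', '5'] -> ['3']
# ['1', '2', '3', '5'] -> []
# ['1', '2'] -> ['5']
```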

Contributor

even better 👍

@SparkQA

SparkQA commented Mar 31, 2017

Test build #75409 has finished for PR 17130 at commit e9b090a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
fpGrowthModel = fpGrowth.fit(df)

Member

Can we add an associationRules example? After all, this is the biggest advantage for the Python user.

Contributor Author

definitely. Thanks.

@SparkQA

SparkQA commented Apr 11, 2017

Test build #75714 has finished for PR 17130 at commit 0fb5a87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

right, how are we on this? let's get this ready soon and merge?
could you also add a reference to the R example which was merged yesterday.

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.fpm.FPGrowth) for more details.

{% include_example python/ml/fpgrowth_example.py %}
</div>
Member

add R please

Contributor Author

Sure. Added reference to R example. Manually checked on generated doc.

@SparkQA

SparkQA commented Apr 19, 2017

Test build #75947 has finished for PR 17130 at commit 2b1efb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
* `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
* `transform`: For each transaction in itemsCol, the `transform` method will compare its items against the antecedents
Member

itemsCol?

Contributor Author

Member

I mean style it as code with backtick

Contributor Author

sure.

/**
* Minimal confidence for generating Association Rule.
* Note that minConfidence has no effect during fitting.
* Minimal confidence for generating Association Rule. MinConfidence will not affect the mining
Member

lower case minConfidence?

Member

ping

Contributor Author

got it.

Seq.empty
}).distinct
brRules.value.filter(_._1.forall(itemset.contains))
.flatMap(_._2.filter(!itemset.contains(_))).distinct
Member

are we aware of code changes here

Member

so we don't need to handle null item?

Contributor Author

Hi @felixcheung , can you be more specific about the null item that concerns you? Thanks.

Member

Right, 2 things - first, just calling out that while the PR says doc changes, there is this one code change here.
Second, this code was checking items != null before; do we not need to consider that now?

Contributor Author

items != null is already checked two lines above.
Please refer to the comments in the PR for the history of the code change. I can update the title to include the code change.

Member

let's update the PR/JIRA if code change is required for the doc change.
otherwise, let's leave code change as a separate PR?

Contributor Author

I guess that's the right way. I will revert the code change.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Apr 28, 2017

@felixcheung, reverted the code change of transform as requested. Please check the update. Thanks.

@jkbradley
Copy link
Member

LGTM, thanks for adding this! @felixcheung OK with merging?

@SparkQA

SparkQA commented Apr 29, 2017

Test build #76287 has finished for PR 17130 at commit ea3b973.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@felixcheung felixcheung left a comment

LGTM

@felixcheung
Member

merged to master/2.2

asfgit pushed a commit that referenced this pull request Apr 29, 2017
## What changes were proposed in this pull request?

Add a new section for fpm
Add Example for FPGrowth in scala and Java

updated: Rewrite transform to be more compact.

## How was this patch tested?

local doc generation.

Author: Yuhao Yang <[email protected]>

Closes #17130 from hhbyyh/fpmdoc.

(cherry picked from commit add9d1b)
Signed-off-by: Felix Cheung <[email protected]>
@asfgit asfgit closed this in add9d1b Apr 29, 2017
ghost pushed a commit to dbtsai/spark that referenced this pull request May 10, 2017
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20670
As suggested by Sean Owen in apache#17130, the transform code in FPGrowthModel can be simplified.

As I tested on some public dataset http://fimi.ua.ac.be/data/, the performance of the new transform code is even or better than the old implementation.

## How was this patch tested?

Existing unit test.

Author: Yuhao Yang <[email protected]>

Closes apache#17912 from hhbyyh/fpgrowthTransform.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-20670
As suggested by Sean Owen in apache#17130, the transform code in FPGrowthModel can be simplified.

As I tested on some public dataset http://fimi.ua.ac.be/data/, the performance of the new transform code is even or better than the old implementation.

## How was this patch tested?

Existing unit test.

Author: Yuhao Yang <[email protected]>

Closes apache#17912 from hhbyyh/fpgrowthTransform.