Commit e9b090a

docs update
1 parent 8d0ccb1 commit e9b090a

3 files changed: 20 additions & 12 deletions

docs/ml-frequent-pattern-mining.md

Lines changed: 13 additions & 8 deletions
@@ -27,27 +27,32 @@ explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
-PFP distributes the work of growing FP-trees based on the suffices of transactions,
-and hence more scalable than a single-machine implementation.
+PFP distributes the work of growing FP-trees based on the suffixes of transactions,
+and hence is more scalable than a single-machine implementation.
 We refer users to the papers for more details.
 
 `spark.ml`'s FP-growth implementation takes the following (hyper-)parameters:
 
 * `minSupport`: the minimum support for an itemset to be identified as frequent.
 For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
-* `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
-for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
+* `minConfidence`: minimum confidence for generating Association Rule. Confidence is an indication of how often an
+association rule has been found to be true. For example, if in the transactions itemset `X` appears 4 times, `X`
+and `Y` co-occur only 2 times, the confidence for the rule `X => Y` is then 2/4 = 0.5. The parameter will not
+affect the mining for frequent itemsets, but specify the minimum confidence for generating association rules
+from frequent itemsets.
 * `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
-partition number of the input dataset is used.
+number of partitions of the input dataset is used.
 
 The `FPGrowthModel` provides:
 
 * `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
 * `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
 DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
-* `transform`: The transform method examines the input items in `itemsCol` against all the association rules and
-summarize the consequents as prediction. The prediction column has the same data type as the
-`itemsCol` and does not contain existing items in the `itemsCol`.
+* `transform`: For each transaction in itemsCol, the `transform` method will compare its items against the antecedents
+of each association rule. If the record contains all the antecedents of a specific association rule, the rule
+will be considered as applicable and its consequents will be added to the prediction result. The transform
+method will summarize the consequents from all the applicable rules as prediction. The prediction column has
+the same data type as `itemsCol` and does not contain existing items in the `itemsCol`.
 
 
 **Examples**
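
The support and confidence arithmetic described in the updated `minSupport` and `minConfidence` entries can be checked with a small stand-alone sketch. This is plain Python rather than Spark code: the helper names `support` and `confidence` and the sample `transactions` are hypothetical, chosen only so that the numbers match the 3/5 and 2/4 examples in the doc text.

```python
# Hypothetical toy data: 5 transactions, chosen to reproduce the doc's numbers.
transactions = [
    ["x", "y", "w"],
    ["x", "y", "w"],
    ["x", "w"],
    ["x", "z"],
    ["z"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if the antecedent never occurs."""
    ante = frozenset(antecedent)
    both = ante | frozenset(consequent)
    n_ante = sum(1 for t in transactions if ante <= set(t))
    n_both = sum(1 for t in transactions if both <= set(t))
    return n_both / n_ante if n_ante else 0.0


# "w" appears in 3 of 5 transactions.
print(support(["w"], transactions))            # 0.6
# "x" appears 4 times; "x" and "y" co-occur 2 times.
print(confidence(["x"], ["y"], transactions))  # 0.5
```

Under these definitions, an itemset is kept as frequent when its support is at least `minSupport`, and a rule is generated only when its confidence is at least `minConfidence`.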

docs/mllib-frequent-pattern-mining.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
-PFP distributes the work of growing FP-trees based on the suffices of transactions,
+PFP distributes the work of growing FP-trees based on the suffixes of transactions,
 and hence more scalable than a single-machine implementation.
 We refer users to the papers for more details.
 
mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala

Lines changed: 6 additions & 3 deletions
@@ -227,9 +227,12 @@ class FPGrowthModel private[ml] (
 
 /**
 * The transform method first generates the association rules according to the frequent itemsets.
-* Then for each association rule, it will examine the input items against antecedents and
-* summarize the consequents as prediction. The prediction column has the same data type as the
-* input column(Array[T]) and will not contain existing items in the input column. The null
+* Then for each transaction in itemsCol, the transform method will compare its items against the
+* antecedents of each association rule. If the record contains all the antecedents of a
+* specific association rule, the rule will be considered as applicable and its consequents
+* will be added to the prediction result. The transform method will summarize the consequents
+* from all the applicable rules as prediction. The prediction column has the same data type as
+* the input column(Array[T]) and will not contain existing items in the input column. The null
 * values in the itemsCol columns are treated as empty sets.
 * WARNING: internally it collects association rules to the driver and uses broadcast for
 * efficiency. This may bring pressure to driver memory for large set of association rules.
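
The rule-matching behaviour that the revised doc comment describes can be illustrated with a minimal sketch. It is plain Python, not the Scala implementation touched by this commit: the `predict` helper and the sample `rules` are hypothetical, and the choices to de-duplicate consequents and keep them in rule order are assumptions of the sketch rather than something stated in the comment.

```python
# Hypothetical rules as (antecedent, consequent) pairs.
rules = [
    (["a"], ["b"]),
    (["a", "b"], ["c"]),
    (["d"], ["e"]),
]


def predict(items, rules):
    """Collect consequents of every rule whose antecedent is fully contained
    in `items`, skipping items the transaction already has."""
    present = set(items or [])  # a null (None) transaction is treated as empty
    prediction = []
    for antecedent, consequent in rules:
        if set(antecedent) <= present:  # rule is applicable
            for item in consequent:
                # Assumption of this sketch: no duplicates, rule order kept.
                if item not in present and item not in prediction:
                    prediction.append(item)
    return prediction


print(predict(["a"], rules))       # ['b']
print(predict(["a", "b"], rules))  # ['c']
print(predict(["a", "d"], rules))  # ['b', 'e']
print(predict(None, rules))        # []
```

In the second call the rule `a => b` is applicable, but `b` is already in the transaction, so only `c` is predicted.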
