@@ -27,27 +27,32 @@ explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
 In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
-PFP distributes the work of growing FP-trees based on the suffices of transactions,
-and hence more scalable than a single-machine implementation.
+PFP distributes the work of growing FP-trees based on the suffixes of transactions,
+and hence is more scalable than a single-machine implementation.
 We refer users to the papers for more details.
 
 `spark.ml`'s FP-growth implementation takes the following (hyper-)parameters:
 
 * `minSupport`: the minimum support for an itemset to be identified as frequent.
   For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
-* `minConfidence`: minimum confidence for generating Association Rule. The parameter will not affect the mining
-  for frequent itemsets,, but specify the minimum confidence for generating association rules from frequent itemsets.
+* `minConfidence`: the minimum confidence for generating association rules. Confidence is an indication of how often an
+  association rule has been found to be true. For example, if the itemset `X` appears in 4 transactions and `X`
+  and `Y` co-occur in only 2 of them, the confidence for the rule `X => Y` is 2/4 = 0.5. The parameter does not
+  affect the mining of frequent itemsets, but specifies the minimum confidence for generating association rules
+  from frequent itemsets.
 * `numPartitions`: the number of partitions used to distribute the work. By default the param is not set, and
-  partition number of the input dataset is used.
+  the number of partitions of the input dataset is used.
 
 The `FPGrowthModel` provides:
 
 * `freqItemsets`: frequent itemsets in the format of DataFrame("items"[Array], "freq"[Long])
 * `associationRules`: association rules generated with confidence above `minConfidence`, in the format of
   DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
-* `transform`: The transform method examines the input items in `itemsCol` against all the association rules and
-  summarize the consequents as prediction. The prediction column has the same data type as the
-  `itemsCol` and does not contain existing items in the `itemsCol`.
+* `transform`: For each transaction in `itemsCol`, the `transform` method will compare its items against the antecedents
+  of each association rule. If the record contains all the antecedents of a specific association rule, the rule
+  will be considered applicable and its consequents will be added to the prediction result. The transform
+  method will summarize the consequents from all the applicable rules as the prediction. The prediction column has
+  the same data type as `itemsCol` and does not contain existing items in `itemsCol`.
 
 
 **Examples**
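
As a rough illustration of the parameters and model accessors described in the changed text above, here is a minimal Scala sketch using `spark.ml`'s `FPGrowth`. The toy transactions and the 0.5 / 0.6 thresholds are assumptions made purely for the example, not part of this change.

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("FPGrowthSketch").getOrCreate()
import spark.implicits._

// Toy transaction data: each row is one transaction, "items" holds an array of item ids.
val dataset = spark.createDataset(Seq(
  "1 2 5",
  "1 2 3 5",
  "1 2")
).map(_.split(" ")).toDF("items")

// Thresholds chosen for illustration only.
val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)      // keep itemsets that appear in at least half of the transactions
  .setMinConfidence(0.6)   // keep rules whose confidence is at least 0.6

val model = fpgrowth.fit(dataset)

// Frequent itemsets: DataFrame("items"[Array], "freq"[Long])
model.freqItemsets.show()

// Association rules: DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double])
model.associationRules.show()

// transform: for each transaction, predicts the consequents of all applicable rules,
// excluding items already present in that transaction.
model.transform(dataset).show()
```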