[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API #3091
davies commented Nov 4, 2014
cc @mengxr
Test build #22882 has started for PR 3091 at commit
Test build #22882 has finished for PR 3091 at commit
Test FAILed.
Test build #22886 has started for PR 3091 at commit
Test build #22886 has finished for PR 3091 at commit
Test PASSed.
What happens if `r` is a `JavaArray` or `JavaList` but is not pickleable? Are we expecting that downstream can handle it?
The caller will handle it. A `JavaArray`/`JavaList` is iterable in Python, so the caller can access the objects inside the array or list.
Test build #22913 has started for PR 3091 at commit
Test build #22913 has finished for PR 3091 at commit
Test PASSed.
LGTM. Merged into master and branch-1.2. Thanks @davies!
```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or against the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed
                     categorical counts/relative frequencies, or the
                     contingency matrix (containing either counts or
                     relative frequencies), or an RDD of LabeledPoint
                     containing the labeled dataset with categorical
                     features. Real-valued features will be treated as
                     categorical for each distinct value.
    :param expected: Vector containing the expected categorical
                     counts/relative frequencies. `expected` is rescaled
                     if the `expected` sum differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic,
             degrees of freedom, p-value, the method used, and the
             null hypothesis.
```

Author: Davies Liu <[email protected]>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API

(cherry picked from commit c8abddc)
Signed-off-by: Xiangrui Meng <[email protected]>
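
For readers who want to see what the goodness-of-fit branch of the docstring describes, here is a minimal pure-Python sketch of Pearson's chi-squared statistic with the uniform default and the rescaling of `expected`. It is not the Spark implementation and does not call PySpark. The function name `chi_sq_goodness_of_fit` and its input checks are assumptions made for illustration, and it returns only the statistic and degrees of freedom (no p-value).

```python
def chi_sq_goodness_of_fit(observed, expected=None):
    """Hypothetical helper (not part of PySpark): Pearson goodness-of-fit.

    Returns (statistic, degrees_of_freedom) for a list of observed
    counts or relative frequencies.
    """
    observed = [float(x) for x in observed]
    n = len(observed)
    if n == 0:
        raise ValueError("observed must not be empty")
    if any(x < 0 for x in observed):
        raise ValueError("observed cannot contain negative values")
    total = sum(observed)
    if total <= 0:
        raise ValueError("observed must have a positive sum")

    if expected is None:
        # Uniform default: every category gets a 1 / len(observed) share.
        expected = [total / n] * n
    else:
        expected = [float(x) for x in expected]
        if len(expected) != n:
            raise ValueError("observed and expected must have the same length")
        if any(x <= 0 for x in expected):
            # Assumption of this sketch: require strictly positive entries.
            raise ValueError("expected entries must be positive")
        # Rescale expected so that its sum matches the observed sum.
        scale = total / sum(expected)
        expected = [x * scale for x in expected]

    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, n - 1


if __name__ == "__main__":
    # Three categories tested against the uniform distribution.
    print(chi_sq_goodness_of_fit([4, 6, 5]))
    # The same test with an explicit, unscaled expected vector.
    print(chi_sq_goodness_of_fit([4, 6, 5], expected=[1, 1, 1]))
```

Both calls in the usage example should yield the same statistic, because `[1, 1, 1]` is rescaled to match the observed total before the comparison.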