Skip to content

Conversation

davies
Copy link
Contributor

@davies davies commented Nov 4, 2014

pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or againt the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.

@davies
Copy link
Contributor Author

davies commented Nov 4, 2014

cc @mengxr

@davies davies changed the title [SPARK-3694] [MLlib] [PySpark] add Hypothesis test Python API [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API Nov 4, 2014
@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22882 has started for PR 3091 at commit 5097d54.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22882 has finished for PR 3091 at commit 5097d54.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Matrices(object):
    • class ChiSqTestResult(JavaModelWrapper):

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22882/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22886 has started for PR 3091 at commit 0ab0764.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22886 has finished for PR 3091 at commit 0ab0764.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Matrices(object):
    • class ChiSqTestResult(JavaModelWrapper):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22886/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if r is JavaArray or JavaList but not pickleable? Are we expecting that downstream can handle it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller will handle it. The JavaArray/JavaList is iterable in Python, caller can access the internal objects in this array/list.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22913 has started for PR 3091 at commit 145d16c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 5, 2014

Test build #22913 has finished for PR 3091 at commit 145d16c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Matrices(object):
    • class ChiSqTestResult(JavaModelWrapper):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22913/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Nov 5, 2014

LGTM. Merged into master and branch-1.2. Thanks @davies !

asfgit pushed a commit that referenced this pull request Nov 5, 2014
```
pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or againt the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
```

Author: Davies Liu <[email protected]>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API

(cherry picked from commit c8abddc)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in c8abddc Nov 5, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants