Skip to content

Conversation

@mengxr
Copy link
Contributor

@mengxr mengxr commented May 21, 2015

The previous default is {gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}. The default pattern is hard to understand. This PR changes the default to {gaps: true, pattern: "\\s+"}. @jkbradley

@SparkQA
Copy link

SparkQA commented May 21, 2015

Test build #33281 has finished for PR 6330 at commit 5ee7cde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM
My only comment is that I'd prefer to have the Python doc (including doc embedded in Param) match the Scala doc, rather than being a short version of it. I also like having the embedded doc match the Scala/Python generated doc so that explainParam prints out more info.

asfgit pushed a commit that referenced this pull request May 22, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <[email protected]>

Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings

(cherry picked from commit f5db4b4)
Signed-off-by: Xiangrui Meng <[email protected]>
@asfgit asfgit closed this in f5db4b4 May 22, 2015
@mengxr
Copy link
Contributor Author

mengxr commented May 22, 2015

Agree. Right now it has huge overhead of keeping the docs in sync. It would be nice if this could be done automatically, e.g., geting the Scala Param.doc via Py4J. I merged this into master and branch-1.4.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <[email protected]>

Closes apache#6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants