Adding PennTreeBank Tokenizer #3

yooper · 2013-08-12T22:39:16Z

Hello,

I have a NLP project in PHP. I want to combine our efforts into building a complete suite of NLP tools in PHP. My github project can be found at https://github.com/yooper/text-analysis. I will gladly contribute to your project since your project is much more mature. This is one of many pull requests I can put in to further expand the functionality and feature richness of the PHP NLP Tools project.

angeloskath · 2013-08-13T09:45:55Z

Hi,

First of all thanks a ton for the contribution, I am glad you want to help out.

Do not take what follows the wrong way, I really appreciate your work and your help, I just want to discuss what I feel I should merge and why, not because I know better but because I don't and I want your opinion.

Topics

Flexible design
PennTreeBank tokenizer implementation
Exceptions
PHP unit

Flexible Design

I propose the creation of a base class RegexTokenizer that could accept two parameters. An array of regexes and
a constant defining how it will tokenize based on the array of regexes. For example it could be replacements then passed from another tokenizer, keeping all the matches or splitting on the matches.

PennTreeBank tokenizer implementation

Most of the following do not make any sense if we are to follow the above proposition (which I believe we should).

Pattern-replacement pairs should be initialized in the constructor. Firstly the cost of initialization is incurred only once and secondly the pairs are not added twice, thrice, ... in the array $patternsAndReplacements
Why not move the implementation of the execute function inside the tokenize function and mark the two step procedure with comments // apply replacements .... // tokenize on whitespace
See exceptions below
I believe we should enrich our comments a bit. E.g.Calls internal functions to handle data processing -> Applies the PennTreeBank replacements and the tokenizes on whitespace.
@param type $string should be @param string $string
Although simple, we could add descriptions of the parameters e.g. @param string $string The text to be tokenized

Exceptions

Now this is a bit I am totally confused about what my position should be. I have not seen the need for custom exceptions in the library yet. I am developing in an as needed basis. I believe that since all patterns are applied to any text that there is no need for strict error checking (since the error would show on testing), let alone a custom exception for it.

I would really like to hear your position about it.

PHP unit

This is a much needed addition and I believe that you are pushing the library to the right direction (thanks).

I am not a PHP unit expert but I am not sure we need a phpunit.xml file or even a bootstrap one. I just see them as another file to maintain while each test case could include the autoloader provided with the library and phpunit would gladly execute all files ending in Test found in a directory.

Conclusion

I hope I am not being a kill joy. I would like to hear your opinion.

Thanks,
Angelos

angeloskath · 2013-08-13T10:25:55Z

I just wrote the regex tokenizer I proposed above. If you'd like I will add some tests, push it and then you can base your Tokenizer on it.

yooper · 2013-08-13T12:23:46Z

Where is the implementation of the Regex Tokenizer, you implemented?

Thank you.

angeloskath · 2013-08-13T13:47:20Z

I just pushed it.

For the PennTreeBank implementation I suggest extending RegexTokenizer and simply passing all the regexes to the parent constructor and as a last regex the WhitespaceTokenizer::PATTERN constant.

yooper · 2013-08-13T15:00:41Z

Thank you for the feedback!

Flexible Design

I agree with you and your approach.

PennTreeBank tokenizer implementation

I agree with you. That needs to be fixed.
That is a reasonable request. I tend to break up functionality into small code blocks because they are easier to isolate and test. For this case, I will combine the methods.
Since the project makes heavy use of regular expressions it would be nice to have a wrapper to deal with situations where the regex and the replacement are not tested prior to use. An example might be:
- A set/list of regular expressions are used by a class to modify, extract or, parse wordforms. The exception will accurately identify which regular expression generated the error. Whereas without the exception an error with no description will be returned
I agree with you, I will update that.
I agree with you, I will update that.
I agree with you, I will update that.

PHPUnit

The phpunit.xml file is required by phpunit to run. The bootstrap file performs any required setup, ie

sets up an autoloader for test files
sets up an autoloader for the classes that are being tested
pre-loads larger datasets that get used across the unit tests
cleans up any leftover testing artifacts before running the unit tests

These files are standard in all projects that use phpunit.

The point you make "while each test case could include the autoloader" violates the Don't Repeat Yourself (DRY) principle in software engineering. Using a bootstrap, that implements an autoloader removes extraneous require statements from the test classes.

angeloskath · 2013-08-13T17:05:09Z

I agree with you about PHPUnit. It is easier to maintain a config file than to remember to call the command correctly.

Also totally agree about the bootstrap file. Sometimes I seem to forget tests are also code. :-)

I need to move all of the library's testing to PHPUnit and the bootstrap file and phpunit.xml will be added to that commit. If it is not too much of a trouble change your test to work without them (like I did with mine RegexTokenizerTest.php).

Regarding the exceptions, since the regexes are actually run in the RegexTokenizer whether or not we throw an exception on error should be done in that class. I tend to be quiet about errors throughout the library which is not very good. In any way I believe an STL RuntimeException would do since we have no custom functionality in InvalidExpression exception.

yooper · 2013-08-13T18:05:58Z

I accept the decision to exclude the custom exception, InvalidExpressionException. I do not accept the structure of the unit tests. You need to namespace your unit tests just like you namespace your code libraries. This keeps everything nicely organized. If another developer wants to see how to use a specific class then they can easily locate its equivalent in the testing code. Also, you are bound to run into naming issues eventually, it is just better to namespace everything, to avoid naming collisions. It is not necessary to have the require statement to your autoloader code in each test file, the bootloader.php code fixes this issue if your test code is properly namespaced.

angeloskath · 2013-08-13T18:48:18Z

I agreed with you on that matter. I just proposed it be a part of the whole
shift of the test code to PHPUnit.

To simplify, I will change the testing code to use the phpunit.xml config
file and of course a bootstrap file. Just not on this commit. I would like
this commit to stay conceptually about the addition, which is the
tokenizer, and another commit to be responsible for the changes in the
testing code.
On Aug 13, 2013 9:05 PM, "yooper" [email protected] wrote:

I accept the decision to exclude the custom exception,
InvalidExpressionException. I do not accept the structure of the unit
tests. You need to namespace your unit tests just like you namespace your
code libraries. This keeps everything nicely organized. If another
developer wants to see how to use a specific class then they can easily
locate its equivalent in the testing code. Also, you are bound to run into
naming issues eventually, it is just better to namespace everything, to
avoid naming collisions. It is not necessary to have the require statement
to your autoloader code in each test file, the bootloader.php code fixes
this issue if your test code is properly namespaced.

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/3#issuecomment-22585124
.

yooper · 2013-08-14T10:06:23Z

Thanks. I should have some time this evening to make the changes to my pull request.

yooper · 2013-08-15T10:08:40Z

I am going to stick with the route of internalizing the preg_replace functionality within the PennTreeBank tokenizer and extending the WhitespaceTokenizer. The RegexTokenizer does not keep it simple (KIS). It packs 3 different uses,

Split
Match
Replace

It should only offer the ability to use a regular expressions to split a string into tokens, see https://github.com/yooper/text-analysis/blob/master/src/TextAnalysis/Tokenizers/RegexTokenizer.php as an example.

After further review, I insist we need to use exceptions and not rely on errors, because they do not provide enough information.

angeloskath · 2013-08-16T05:45:58Z

You are right regarding the RegexTokenizer, it doesn't keep it simple. It was a rushed move from my part.

I will be merging your code in Monday or Tuesday. I want to go about it more cautiously and I am not at a place right now where I have the time and stable internet connection to do it.

A couple of last things.

Your code follows the PSR-2 style guide, which is a good thing, but accepting it as is, means I need to change the rest of the code to follow the style guide too in order to keep it consistent.
Why do you add the slashes and 's' modifier to the pattern each time (line 42) and not once in addPatternAndReplacement? In addition to concatenating each time, this way in the error reporting we wouldn't know for example that the s modifier was used.

Finally, after this first merge I believe we will be able to work together a lot faster in next contributions, it is just that there were a lot of things to be sorted out (styles, preferences etc).

angeloskath · 2013-08-16T05:53:34Z

Double posting but I just though about it.

Is the only reason for not passing two arrays to preg_replace (patterns and replacements) the fact that we want to know which pattern failed if it ever fails?

yooper · 2013-08-19T12:26:36Z

I just prefer to use the objects instead of arrays.

DEPRECATED preg_split(): Passing null to parameter angeloskath#3 ($limit) of type int is deprecated in vendor/nlp-tools/nlp-tools/src/NlpTools/Tokenizers/WhitespaceTokenizer.php on line 17.

yooper added 2 commits August 12, 2013 18:33

Adding in PennTreeBank tokenizer

4d9455f

updating author name

a86873c

yooper added 2 commits August 14, 2013 07:36

Merge remote-tracking branch 'upstream/master'

100123d

Made changes based on feedback.

e596416

angeloskath added a commit that referenced this pull request Aug 19, 2013

Merged pull request #3 from yooper/master.

576c698

angeloskath merged commit e596416 into angeloskath:master Aug 19, 2013

mferrara mentioned this pull request Mar 27, 2023

Fix for deprecated use of null as parameter where int is expected #80

Open

mferrara mentioned this pull request Dec 1, 2023

Fix for deprecated use of null as parameter where int is expected mferrara/php-nlp-tools#1

Merged

Adding PennTreeBank Tokenizer #3

Adding PennTreeBank Tokenizer #3

Uh oh!

Conversation

yooper commented Aug 12, 2013

Uh oh!

angeloskath commented Aug 13, 2013

Topics

Flexible Design

PennTreeBank tokenizer implementation

Exceptions

PHP unit

Conclusion

Uh oh!

angeloskath commented Aug 13, 2013

Uh oh!

yooper commented Aug 13, 2013

Uh oh!

angeloskath commented Aug 13, 2013

Uh oh!

yooper commented Aug 13, 2013

Flexible Design

PennTreeBank tokenizer implementation

PHPUnit

Uh oh!

angeloskath commented Aug 13, 2013

Uh oh!

yooper commented Aug 13, 2013

Uh oh!

angeloskath commented Aug 13, 2013

Uh oh!

yooper commented Aug 14, 2013

Uh oh!

yooper commented Aug 15, 2013

Uh oh!

angeloskath commented Aug 16, 2013

Uh oh!

angeloskath commented Aug 16, 2013

Uh oh!

yooper commented Aug 19, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants