-
Notifications
You must be signed in to change notification settings - Fork 155
Adding PennTreeBank Tokenizer #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi, First of all thanks a ton for the contribution, I am glad you want to help out. Do not take what follows the wrong way, I really appreciate your work and your help, I just want to discuss what I feel I should merge and why, not because I know better but because I don't and I want your opinion. Topics
Flexible DesignI propose the creation of a base class RegexTokenizer that could accept two parameters. An array of regexes and PennTreeBank tokenizer implementationMost of the following do not make any sense if we are to follow the above proposition (which I believe we should).
ExceptionsNow this is a bit I am totally confused about what my position should be. I have not seen the need for custom exceptions in the library yet. I am developing in an as needed basis. I believe that since all patterns are applied to any text that there is no need for strict error checking (since the error would show on testing), let alone a custom exception for it. I would really like to hear your position about it. PHP unitThis is a much needed addition and I believe that you are pushing the library to the right direction (thanks). I am not a PHP unit expert but I am not sure we need a phpunit.xml file or even a bootstrap one. I just see them as another file to maintain while each test case could include the autoloader provided with the library and phpunit would gladly execute all files ending in Test found in a directory. ConclusionI hope I am not being a kill joy. I would like to hear your opinion. Thanks, |
I just wrote the regex tokenizer I proposed above. If you'd like I will add some tests, push it and then you can base your Tokenizer on it. |
Where is the implementation of the Regex Tokenizer, you implemented? Thank you. |
I just pushed it. For the PennTreeBank implementation I suggest extending RegexTokenizer and simply passing all the regexes to the parent constructor and as a last regex the WhitespaceTokenizer::PATTERN constant. |
Thank you for the feedback! Flexible DesignI agree with you and your approach. PennTreeBank tokenizer implementation
PHPUnitThe phpunit.xml file is required by phpunit to run. The bootstrap file performs any required setup, ie
These files are standard in all projects that use phpunit. The point you make "while each test case could include the autoloader" violates the Don't Repeat Yourself (DRY) principle in software engineering. Using a bootstrap, that implements an autoloader removes extraneous require statements from the test classes. |
I agree with you about PHPUnit. It is easier to maintain a config file than to remember to call the command correctly. Also totally agree about the bootstrap file. Sometimes I seem to forget tests are also code. :-) I need to move all of the library's testing to PHPUnit and the bootstrap file and phpunit.xml will be added to that commit. If it is not too much of a trouble change your test to work without them (like I did with mine RegexTokenizerTest.php). Regarding the exceptions, since the regexes are actually run in the RegexTokenizer whether or not we throw an exception on error should be done in that class. I tend to be quiet about errors throughout the library which is not very good. In any way I believe an STL RuntimeException would do since we have no custom functionality in InvalidExpression exception. |
I accept the decision to exclude the custom exception, InvalidExpressionException. I do not accept the structure of the unit tests. You need to namespace your unit tests just like you namespace your code libraries. This keeps everything nicely organized. If another developer wants to see how to use a specific class then they can easily locate its equivalent in the testing code. Also, you are bound to run into naming issues eventually, it is just better to namespace everything, to avoid naming collisions. It is not necessary to have the require statement to your autoloader code in each test file, the bootloader.php code fixes this issue if your test code is properly namespaced. |
I agreed with you on that matter. I just proposed it be a part of the whole To simplify, I will change the testing code to use the phpunit.xml config
|
Thanks. I should have some time this evening to make the changes to my pull request. |
I am going to stick with the route of internalizing the preg_replace functionality within the PennTreeBank tokenizer and extending the WhitespaceTokenizer. The RegexTokenizer does not keep it simple (KIS). It packs 3 different uses,
It should only offer the ability to use a regular expressions to split a string into tokens, see https://github.com/yooper/text-analysis/blob/master/src/TextAnalysis/Tokenizers/RegexTokenizer.php as an example. After further review, I insist we need to use exceptions and not rely on errors, because they do not provide enough information. |
You are right regarding the RegexTokenizer, it doesn't keep it simple. It was a rushed move from my part. I will be merging your code in Monday or Tuesday. I want to go about it more cautiously and I am not at a place right now where I have the time and stable internet connection to do it. A couple of last things.
Finally, after this first merge I believe we will be able to work together a lot faster in next contributions, it is just that there were a lot of things to be sorted out (styles, preferences etc). |
Double posting but I just though about it. Is the only reason for not passing two arrays to preg_replace (patterns and replacements) the fact that we want to know which pattern failed if it ever fails? |
I just prefer to use the objects instead of arrays. |
DEPRECATED preg_split(): Passing null to parameter angeloskath#3 ($limit) of type int is deprecated in vendor/nlp-tools/nlp-tools/src/NlpTools/Tokenizers/WhitespaceTokenizer.php on line 17.
Hello,
I have a NLP project in PHP. I want to combine our efforts into building a complete suite of NLP tools in PHP. My github project can be found at https://github.com/yooper/text-analysis. I will gladly contribute to your project since your project is much more mature. This is one of many pull requests I can put in to further expand the functionality and feature richness of the PHP NLP Tools project.