MRudolph

This pull request extends the command line tool for training and evaluation (iitb.Segment.Segment) by making downcasing of tokens optional.

Downcasing seems to be a destructive action, because it is done before the features are generated.
Some languages (e.g. German) depend on capitalisation to distinguish words, so it can be valuable information that should not be removed.

To avoid breaking existing setups, there are new methods that handle the optional downcasing.
Downcasing is on by default, but can be switched off by adding "lowercase=false" to the configuration.
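
As a rough illustration of the behaviour described above (not the actual code in this pull request), the option could be read and applied along these lines. The class `LowercaseOption` and its methods are hypothetical names made up for this sketch; only the `lowercase` key and its default of "on" come from the description.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical sketch: none of these names are taken from iitb.Segment.Segment.
public class LowercaseOption {

    // Reads the "lowercase" option; a missing key means downcasing stays on.
    static boolean shouldLowercase(Properties options) {
        return Boolean.parseBoolean(options.getProperty("lowercase", "true"));
    }

    // Downcases the token only when the option is enabled.
    static String normalizeToken(String token, boolean lowercase) {
        return lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }

    public static void main(String[] args) {
        Properties options = new Properties();
        options.setProperty("lowercase", "false"); // same effect as "lowercase=false"

        boolean lowercase = shouldLowercase(options);
        // With downcasing switched off, the capitalised German noun is preserved.
        System.out.println(normalizeToken("Morgen", lowercase)); // prints "Morgen"
    }
}
```

Defaulting to `true` when the key is absent is what keeps existing configurations behaving as before.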

Tests are included and pass (they are modified copies of the original tests).

Running the applications with the sample dataset also seems to work fine.


witgo commented Oct 21, 2014

@MRudolph
Sorry for the late reply.
This is a big change. I need to do more testing and may need more time to review the code.


nit: Four spaces.
