tpeng (Contributor) commented May 26, 2014

Sometimes the training data may be plain text. Instead of using python-crfsuite or any other CRF package, I still prefer to use webstruct because it has an sklearn pipeline and some evaluation tools out of the box.

The annotated input text is similar to GATE, e.g. `this is a <NER>test</NER>`: the entities are surrounded by <> tags. The rest of the change just moves the generic code to a more proper place.
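As a rough illustration of the annotation format described above, here is a minimal sketch that turns tagged plain text into (token, label) pairs. It is not the code from this pull request: `parse_annotated_text` is a hypothetical helper, and the whitespace tokenization and the IOB-style `B-`/`I-`/`O` labels are assumptions made for the example.

```python
import re

# Hypothetical helper, not part of webstruct: matches <TAG>text</TAG> spans.
_ENTITY_RE = re.compile(r"<(?P<tag>\w+)>(?P<text>.*?)</(?P=tag)>", re.DOTALL)


def parse_annotated_text(text):
    """Split annotated text into (token, label) pairs.

    Assumptions: tokens are whitespace-separated, entities are not nested,
    and labels use an IOB-style scheme ("O" marks tokens outside entities).
    """
    pairs = []
    pos = 0
    for m in _ENTITY_RE.finditer(text):
        # Plain text before the entity: every token is labelled "O".
        pairs.extend((tok, "O") for tok in text[pos:m.start()].split())
        # Tokens inside the entity get B-/I- prefixed labels.
        for i, tok in enumerate(m.group("text").split()):
            prefix = "B" if i == 0 else "I"
            pairs.append((tok, f"{prefix}-{m.group('tag')}"))
        pos = m.end()
    # Trailing plain text after the last entity.
    pairs.extend((tok, "O") for tok in text[pos:].split())
    return pairs


pairs = parse_annotated_text("this is a <NER>test</NER>")
```

The resulting pairs could then be split into token and label sequences for whatever feature extraction and CRF training steps follow.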

Member

The data is not necessarily annotated: HtmlLoader is used to load raw data.

kmike (Member) commented May 26, 2014

My main concern is the Token class and the TextTokenizer thing. Creating Token instances looks like total overkill - why would anyone need to wrap a text token in a Token instance and keep a reference to all the other tokens in the text there? Also, there is already a text_tokenizers module, so this adds to the confusion.
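One possible reading of this concern, sketched under assumptions: if tokens stay plain strings in a list, a feature function can reach neighbouring tokens by index, so no wrapper object holding references to the other tokens is needed. `token_features` and the feature names are hypothetical and are not taken from webstruct or from this pull request.

```python
def token_features(tokens, i):
    """Features for tokens[i]; neighbours are looked up by index in the shared list."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        # Context comes from the list itself, not from a per-token wrapper.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = "this is a test".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
```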
