chekunkov
Contributor

@chekunkov chekunkov commented Jun 7, 2014

The tokenizer from #15 had issues, such as not splitting the period at the end of a sentence into a separate token:

40006,40007c40017
< community
< .
---
> community.
41148,41149c41158
< Reserved
< .
---
> Reserved.

Now this issue should be fixed.
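For illustration only, here is a minimal regex-based sketch of the intended behavior, where a sentence-final period becomes its own token. It is not the tokenizer from this PR: the `TOKEN_RE` pattern and the `tokenize` function are hypothetical names chosen for the example.

```python
import re

# Hypothetical pattern, not the one used in this PR: either a word
# (optionally with internal hyphens or apostrophes) or a single
# punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Return tokens, emitting punctuation such as a final '.' separately."""
    return TOKEN_RE.findall(text)


print(tokenize("Join our community."))
print(tokenize("All Rights Reserved."))
```

A pattern this simple also splits abbreviations such as "U.S." at every period, so a real tokenizer would need extra rules for those cases.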

I've also refactored the code and added a span_tokenize method (@kmike, I remember you said it would be nice to have this method).
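As a rough sketch of what a span-based method typically looks like (assuming the common convention of yielding `(start, end)` character offsets, as in NLTK's `span_tokenize`), the token strings can be derived from the spans. The pattern and function bodies below are assumptions for illustration, not the code in this PR.

```python
import re

# Hypothetical pattern for the example: a run of word characters
# or a single punctuation character.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def span_tokenize(text):
    """Yield (start, end) character offsets for each token in text."""
    for match in WORD_OR_PUNCT.finditer(text):
        yield match.span()


def tokenize(text):
    """Build the token strings by slicing text with the spans."""
    return [text[start:end] for start, end in span_tokenize(text)]


for start, end in span_tokenize("All Rights Reserved."):
    print(start, end, "All Rights Reserved."[start:end])
```

Defining `tokenize` in terms of `span_tokenize` keeps the two in sync, and the offsets make it possible to map each token back to its position in the original text.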

Performance wasn't hurt:

X, y = webstruct.HtmlTokenizer().tokenize(trees)

CPU times: user 3.42 s, sys: 32 ms, total: 3.46 s
Wall time: 3.45 s

kmike and others added 26 commits May 21, 2014 14:59
Dropping it gives a nice speedup because computations are now in Cython.
…simplify code and make it faster.

If needed, it can be implemented as a global feature.
@kmike
Member

kmike commented Nov 25, 2016

@chekunkov do you by chance recall why this PR wasn't merged?

@chekunkov
Contributor Author

@kmike nope, have no idea why.
