As ChemDataExtractor processes documents, it adds each unique word that it encounters to the
Lexicon as a Lexeme. Each
Lexeme stores various word features, so they don't have to be recalculated for every occurrence of that word.
You can access the Lexeme for a token using the lex property:

>>> s = Sentence('Sulphur and Oxygen.')
>>> s.tokens[0]
Token('Sulphur', 0, 7)
>>> s.tokens[0].lex.normalized
'sulfur'
>>> s.tokens[0].lex.is_hyphenated
False
>>> s.tokens[0].lex.cluster
'11011101100110'
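
The caching idea described above can be illustrated with a minimal sketch. The SimpleLexeme and SimpleLexicon classes below are hypothetical stand-ins written for this example, not ChemDataExtractor's actual implementation, and the features they compute are deliberately simplified assumptions.

```python
class SimpleLexeme:
    """Hypothetical, simplified record of features for one unique word."""

    def __init__(self, text):
        # Features are computed once, when the word is first seen.
        self.text = text
        self.lower = text.lower()
        self.length = len(text)
        self.is_hyphenated = '-' in text


class SimpleLexicon:
    """Hypothetical cache mapping each unique word to its SimpleLexeme."""

    def __init__(self):
        self._lexemes = {}
        self.computed = 0  # how many times features were actually computed

    def __getitem__(self, text):
        lexeme = self._lexemes.get(text)
        if lexeme is None:
            # First occurrence: compute the features and store them.
            lexeme = SimpleLexeme(text)
            self._lexemes[text] = lexeme
            self.computed += 1
        return lexeme

    def __len__(self):
        return len(self._lexemes)


lexicon = SimpleLexicon()
words = ['Sulphur', 'and', 'Oxygen', 'and', 'Sulphur']
lexemes = [lexicon[word] for word in words]

# Five tokens were looked up, but features were computed only three times.
print(len(words), lexicon.computed)
# Repeated words share the same cached object.
print(lexemes[0] is lexemes[4])
```

Keying the cache on the exact word text means that every later occurrence of a word is a dictionary lookup rather than a recomputation, which is the saving the Lexicon is described as providing.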