ChemDataExtractor is a tool for automatically extracting chemical information from scientific documents, created by Matthew Swain at the University of Cambridge.
Give it a journal article, and it will extract chemical names, properties, and spectra from the text so they can be imported into a database or spreadsheet.
ChemDataExtractor uses state-of-the-art natural language processing algorithms to interpret the English-language text that makes up the majority of scientific documents.
Machine-learning methods such as conditional random fields are used in combination with custom dictionaries and rule-based parsing grammars to extract valuable information from each sentence.
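To make the combination of dictionaries and rule-based grammars concrete, here is a minimal sketch using only the Python standard library. It is not ChemDataExtractor's implementation or API: the `CHEMICAL_NAMES` dictionary, the `MELTING_POINT` pattern, and the `extract_from_sentence` function are all hypothetical names invented for illustration, and the machine-learning component (for example a conditional random field tagger) is left out entirely.

```python
import re

# Hypothetical mini-dictionary of chemical names (illustrative only).
CHEMICAL_NAMES = {"benzophenone", "aspirin", "toluene"}

# Hand-written rule for phrases such as "melting point of 48 K"
# or "m.p. 135-136 K". \u2013 is an en dash, often used in ranges.
MELTING_POINT = re.compile(
    r"(?:melting point|m\.p\.)\s*(?:of|:|=)?\s*"
    r"(?P<value>\d+(?:\.\d+)?(?:\s*[-\u2013]\s*\d+(?:\.\d+)?)?)\s*"
    r"(?P<units>\u00b0\s?C|K)",
    re.IGNORECASE,
)


def extract_from_sentence(sentence):
    """Return dictionary-matched names and a rule-matched melting point."""
    # Dictionary lookup: keep word-like tokens that appear in the name list.
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*", sentence)
    names = [t for t in tokens if t.lower() in CHEMICAL_NAMES]

    # Rule-based parse: look for a melting point phrase in the same sentence.
    match = MELTING_POINT.search(sentence)
    melting_point = None
    if match:
        melting_point = {
            "value": match.group("value"),
            "units": match.group("units"),
        }
    return {"names": names, "melting_point": melting_point}
```

Calling `extract_from_sentence("Benzophenone has a melting point of 321 K.")` in this sketch yields the name `Benzophenone` together with a value of `321` and units of `K`. A real system would need far richer dictionaries and grammars, plus a trained tagger for names that no dictionary contains.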
By processing each document as a whole, ChemDataExtractor is able to resolve data interdependencies, for example by determining when different names and identifiers refer to the same compound.
As a result, it produces a full compound record containing identifiers, properties, and spectra for each unique chemical entity in the document.
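One way to picture this document-level resolution is as merging partial records that share an identifier. The sketch below is an assumption about how such merging could work, not a description of ChemDataExtractor's internals; the `merge_records` function and the record layout (an `identifiers` list and a `properties` dictionary) are hypothetical.

```python
def merge_records(records):
    """Merge partial records that share at least one identifier.

    Each input record is a dict with an "identifiers" list (names or
    labels) and a "properties" dict. Groups already in `merged` are kept
    pairwise disjoint, so one pass per new record is enough.
    """
    merged = []
    for record in records:
        combined = {
            "identifiers": set(record.get("identifiers", [])),
            "properties": dict(record.get("properties", {})),
        }
        remaining = []
        for existing in merged:
            if existing["identifiers"] & combined["identifiers"]:
                # Same compound: pool identifiers and properties.
                combined["identifiers"] |= existing["identifiers"]
                # On conflict, the value seen earlier in the document wins.
                combined["properties"] = {
                    **combined["properties"],
                    **existing["properties"],
                }
            else:
                remaining.append(existing)
        remaining.append(combined)
        merged = remaining
    return merged
```

For example, if one sentence introduces "salicylic acid (1a)" and a later sentence reports a melting point for "1a", the two partial records share the label `1a` and collapse into a single record carrying both the name and the property.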
Huge amounts of important data are locked away in document tables.
ChemDataExtractor provides specialized parsers that extract data from tables and integrate it with information from the rest of the document.
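As a rough illustration of what table parsing involves, the sketch below turns a simple table (a header row plus data rows) into per-compound records. It assumes the first column holds a compound label and that units appear in parentheses in the column headers; real tables are much messier, and this is not ChemDataExtractor's table parser. The names `HEADER` and `table_to_records` are hypothetical.

```python
import re

# Split a header cell such as "Yield (%)" into a name and optional units.
HEADER = re.compile(r"^(?P<name>[^()]+?)\s*(?:\((?P<units>[^()]+)\))?$")


def table_to_records(header_row, data_rows):
    """Convert table rows into records keyed by the first-column label."""
    columns = []
    for cell in header_row:
        text = cell.strip()
        match = HEADER.match(text)
        if match:
            columns.append((match.group("name"), match.group("units")))
        else:
            columns.append((text, None))

    records = []
    for row in data_rows:
        label = row[0].strip()
        properties = {}
        # Pair each remaining cell with its column's name and units.
        for (name, units), cell in zip(columns[1:], row[1:]):
            value = cell.strip()
            if value and value != "-":  # skip empty or placeholder cells
                properties[name] = {"value": value, "units": units}
        records.append({"identifiers": [label], "properties": properties})
    return records
```

Because each record carries the compound label from the first column, it can then be matched against the same label wherever it appears in the running text of the document.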
ChemDataExtractor is available as an open-source Python package that you can download and use for free.
Check out the documentation for help getting started.