Species name recognition and normalization software.
LINNAEUS is general-purpose dictionary-matching software that can process several document formats used in the biomedical domain (MEDLINE, PMC, BMC, OTMI, plain text, etc.). It can produce several types of output (XML, HTML, tab-separated values, or direct storage in a database). It can also act as a server (including load balancing across several servers), allowing clients to request matching over a network. A package of files for recognizing and identifying species names is available for LINNAEUS; it shows 94% recall and 97% precision when evaluated against the LINNAEUS-species-corpus.
LINNAEUS can be run in two different ways: using an internal dictionary, or using an external dictionary. The external dictionaries are available for download below. The internal dictionaries (subsets of the external dictionaries, containing the 10,000 most frequently mentioned species in MEDLINE, representing ~99% of mentions) are contained in the Java .jar archive, and do not need any configuration. Due to the small size of the internal dictionaries, they require very little memory.
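As a rough illustration of how a frequency-ranked subset can cover most mentions, the sketch below picks the N most frequently mentioned species from a table of mention counts and reports the fraction of all mentions they account for. The function name `top_n_coverage` and the counts are hypothetical and only illustrate the idea; this is not the procedure used to build the internal dictionaries.

```python
from collections import Counter


def top_n_coverage(mention_counts, n):
    """Return the n most mentioned species and the share of mentions they cover."""
    counts = Counter(mention_counts)
    total = sum(counts.values())
    top = counts.most_common(n)  # sorted by count, highest first
    covered = sum(count for _, count in top)
    coverage = covered / total if total else 0.0
    return [species for species, _ in top], coverage


# Made-up counts, for illustration only.
example_counts = {"human": 90, "mouse": 8, "rat": 1, "yeast": 1}
species, coverage = top_n_coverage(example_counts, 2)
print(species, coverage)
```

Because mention frequencies are typically very skewed, a small top-N subset can account for most mentions, which is consistent with the ~99% figure quoted above.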
LINNAEUS is the subject of the following paper: Gerner, M., Nenadic, G. and Bergman, C. M. (2010) LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 11:85.
For questions, suggestions or bug reports, please contact Martin Gerner.
The files on this webpage can also be accessed from the project's SourceForge page.
- July 22nd, 2011: Release of LINNAEUS 2.0 (adding, among other things, internal dictionaries) and updated species (v. 1.2) and species-proxy (v. 1.2) entity packs.
- September 10th, 2010: Release of genera-species-proxy-1.0: an entity pack for mapping genus names (e.g. Drosophila) to their most frequently mentioned species (e.g. Drosophila melanogaster), useful for e.g. gene normalization.
- May 11th, 2010: Release of LINNAEUS 1.5, adding new input document parsers and output formats, as well as performance improvements.
- linnaeus-2.0.tar.gz (5.8 MB): LINNAEUS software .jar archives for performing species NER dictionary matching. Also contains source code, example guides, javadoc documentation and libraries required for compilation.
- Javadoc: documentation for the source code. The main class is uk.ac.man.entitytagger.EntityTagger.
- LINNAEUS UIMA wrapper (14 KB): A wrapper for using LINNAEUS within the UIMA tool integration system. Thanks to Móra György for providing it.
Entity type dictionary packs
- species-1.2.tar.gz (20 MB): Dictionary and post-processing files for identifying species names in biomedical texts.
- species-proxy-1.2.tar.gz (20 MB): Dictionary and post-processing files for identifying species names and species clues, useful for e.g. gene NER. For example, 'Homo sapiens', 'HeLa cells' and 'patient' will all be linked to human. Cell-line data from Sarntivijai et al. (2008).
- genera-species-proxy-1.0.tar.gz (171 kB): Genus names, linked to the most frequently mentioned member species (in MEDLINE), useful for e.g. gene NER. For example, 'Drosophila' will be linked to D. melanogaster.
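The proxy idea described in the entries above can be sketched as a small dictionary matcher that links species names, species clues and genus names to a taxonomy identifier. This is a minimal sketch under stated assumptions: the dictionary `PROXY_TERMS`, the function names and the use of plain NCBI Taxonomy IDs are illustrative choices, and they do not reflect the file format of the entity packs or the matching algorithm used by LINNAEUS.

```python
import re

# Toy proxy dictionary: surface term -> NCBI Taxonomy ID (illustrative entries only).
PROXY_TERMS = {
    "homo sapiens": 9606,
    "hela cells": 9606,               # cell-line clue for human
    "patient": 9606,                  # species clue for human
    "drosophila melanogaster": 7227,
    "drosophila": 7227,               # genus linked to its most mentioned species
}


def build_matcher(terms):
    """Compile one case-insensitive pattern, trying longer terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def tag(text, terms=PROXY_TERMS):
    """Return (start, end, matched text, taxonomy ID) for each dictionary hit."""
    matcher = build_matcher(terms)
    return [
        (m.start(), m.end(), m.group(0), terms[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


for hit in tag("HeLa cells from a patient; Drosophila melanogaster and Drosophila."):
    print(hit)
```

Ordering the alternatives by length makes the full name 'Drosophila melanogaster' win over the bare genus 'Drosophila' when both could match at the same position.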
Remote web service availability
LINNAEUS can be accessed remotely through its web service, either as a SOAP endpoint (WSDL) or as a RESTful service by posting data as a 'text' argument to this location. The XML output should be self-explanatory, but for any questions, don't hesitate to contact me.
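A minimal client sketch for the RESTful route might look like the following. The service address is not reproduced on this page, so `ENDPOINT` is a placeholder that must be replaced. The form-encoded request body, the UTF-8 response decoding and the helper names `build_request` and `tag_remote` are likewise assumptions, not documented behaviour of the service.

```python
import urllib.parse
import urllib.request

# Placeholder only: substitute the real address of the web service.
ENDPOINT = "http://example.org/linnaeus"


def build_request(text, endpoint=ENDPOINT):
    """Build a POST request carrying the document as a form-encoded 'text' argument."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def tag_remote(text, endpoint=ENDPOINT, timeout=30):
    """Send the text to the service and return the raw XML response as a string."""
    request = build_request(text, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `tag_remote("Homo sapiens and HeLa cells")` against the real endpoint would return the XML output mentioned above; building the request separately makes it possible to inspect it without any network access.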
- manual-corpus-species-1.0.tar.gz (1 MB): A set of open access documents in text format, manually annotated for species mention tags.
- The other evaluation corpora used in the project are available on request, subject to licensing restrictions being fulfilled.
Pre-tagged article tag sets
- tags-pmcoa.tsv.gz (38 MB): Species mention tags for all articles in the open access subset of PubMed Central, published up to December 31st, 2008.
- tags-medline.csv.gz: Species mention tags for all articles in MEDLINE, published up to December 31st, 2008. (available on request, subject to licensing restrictions being fulfilled)
- pmc-sample.tar.gz (15 MB): A sample set of 961 PMC open access documents, useful for testing the software, together with the XML DTD files required to parse them.
- Taxongrab: rule-based species name recognizer.
- Whatizit: dictionary-based entity recognizer and normalizer (matches, among other things, species).
- uBio tools: a set of species recognition and normalization tools hosted by the uBio project.
- OrganismTagger: GATE-based species and strain name recognizer.
- NCBI Taxonomy: The database used for normalization of species mentions.
- uBio: An initiative for creating a catalog of all living (or extinct) species names.
- Catalogue of Life: Another database aiming to catalogue species and their names, similar in style to uBio.
Document sources
- MEDLINE: >18 million article entries, ~10 million abstracts.
- PubMed Central: ~1 million open access full-text articles.
- Open text mining initiative: Initiative to bypass copyright issues when releasing full-text articles for text-mining purposes (currently not maintained).
Last updated: August 20th, 2011.