SPECIES: a standalone command-line application that identifies taxonomic mentions in documents and maps them to the corresponding NCBI Taxonomy database entries.
Given a folder of plain text files, SPECIES uses its dictionary of taxonomic names and synonyms to report each taxonomic mention: its start and end position in the document, the detected term, and the corresponding NCBI Taxonomy database record identifier.
Besides binomials following the Linnaean naming convention, recognised taxonomic mentions include acronyms, common names and abbreviations, as well as misspellings and the other name types supported by the NCBI Taxonomy.
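The kind of output described above (start position, end position, detected term, NCBI Taxonomy identifier) can be illustrated with a minimal Python sketch of dictionary-based matching. This is not the SPECIES tagger itself, whose implementation and output format are not described here. The function name `tag_text`, the tiny `TOY_DICTIONARY`, and the choice of 0-based start offsets with exclusive end offsets are all assumptions made for illustration only.

```python
import re

# Hypothetical toy dictionary (illustration only): lower-cased name or
# synonym -> NCBI Taxonomy identifier. The real dictionaries are far larger.
TOY_DICTIONARY = {
    "homo sapiens": 9606,
    "human": 9606,
    "escherichia coli": 562,
    "e. coli": 562,
    "mus musculus": 10090,
}


def tag_text(text, dictionary=TOY_DICTIONARY):
    """Return a list of (start, end, matched_term, taxid) tuples.

    Offsets are 0-based with an exclusive end (an assumption of this sketch).
    """
    # Try longer names first so a long name wins over a shorter one
    # that starts at the same position.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        # The lookarounds keep matches from starting or ending inside a word.
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for hit in tag_text("Escherichia coli was isolated from human samples."):
        print(hit)
```

Run on the example sentence, the sketch reports `(0, 16, 'Escherichia coli', 562)` and `(35, 40, 'human', 9606)`. A synonym such as "E. coli" maps to the same identifier as the full binomial, which is the role the synonym dictionary plays in the description above.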
Described in: Pafilis E, Frankild SP, Fanini L, Faulwetter S, Pavloudi C, et al. (2013) The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text. PLoS ONE 8(6): e65390.
S800 Corpus: a novel abstract-based manually annotated corpus. S800 comprises 800 PubMed abstracts in which organism mentions were identified and mapped to the corresponding NCBI Taxonomy identifiers.
To increase the diversity of taxonomic mentions in the corpus, the S800 abstracts were collected by selecting 100 abstracts from each of the following 8 categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology and zoology. S800 has been annotated with a focus on the species level; however, mentions of higher taxa (such as genera, families and orders) have also been considered.
Availability: the tagger software (under BSD license), its species-level and complete taxonomic dictionaries, and the associated stopword lists (both under CC-BY license) are available here. The species-level S800 corpus (subject to Medline restrictions) can be downloaded from here.
Sister Projects: ORGANISMS, a web resource providing access to the tagging results for all abstracts in the Medline database, covering all taxonomic levels.
Team: Evangelos Pafilis#, Sune Frankild*, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Katerina Vasileiadou, Christos Arvanitidis, Lars Juhl Jensen*# (*: main software developers, #: correspondence)
Maintained: at the Novo Nordisk Foundation Center for Protein Research (NNFCPR), Denmark, and the Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Crete, Greece