Documentation for S800 corpus revision
Revision steps
1. Decouple recognition from normalization, address overall consistency
The original S800 corpus data only annotated mentions that could be normalized to a particular (2013?) version of the NCBI Taxonomy. From the perspective of recognizing mentions of species names (regardless of normalization), this caused the annotation to appear incomplete in places. In the first revision step, annotation was added for scientific and common names of species regardless of whether they could be normalized, a revision pass addressing the overall consistency of annotation was performed, and annotated names of genera, families and other levels of taxonomy above species were marked Out-of-scope.
Relevant commits: 59fe5be, 0165f52
2. Revise annotated spans for boundary consistency
The original evaluation using the S800 corpus data applied relaxed boundary matching criteria:
[…] we used flexible boundary matching of species names, meaning that taggers would receive a true positive if it produced a tag that overlapped with an annotated substring and had the correct assigned taxonomic identifier.
Because of this matching criterion, the exact boundaries of annotated species mentions had little impact on the evaluation, and the annotation effort did not establish precise guidelines detailing which tokens should be included in annotated spans. Consequently, from the perspective of exact matching – the criterion used e.g. by the popular conlleval evaluation script and most recent machine learning experiments on the corpus – the boundaries of annotated mentions were inconsistently annotated in many places. To address this issue, we created detailed guidelines on how to determine annotation boundaries and made a revision pass addressing related issues in the data. This revision also included a focused review of the annotation of virus mentions, which had comparatively frequent annotation boundary issues.
Relevant commits: 112f0b2, 7012e42
3. Separate strain mentions from species mentions
The original S800 corpus annotation only involves a single annotated mention type (species) that is used to annotate mentions of species names as well as mentions of more specific names such as those of strains. In the revision, we introduced a separate Strain type and revised all strain name mentions to use this type, also revising the spans of species annotations to exclude strain names when the two occurred together in text.
Relevant commits: 5212baf, 489fdf1
4. Introduce annotation for upper taxonomic ranks
The original S800 corpus annotation included partial annotation for mentions of names at taxonomic ranks above species, in particular in a number of cases where these names were used in an imprecise way to refer to species (e.g. Drosophila for Drosophila melanogaster). As the initial revision step addressing annotation consistency had marked these as Out-of-scope, the coverage of the revised annotation was in some aspects reduced from that of the original corpus. In this revision step, annotation for mentions of names at taxonomic ranks above species was reintroduced in a systematic way, adding e.g. Genus as a distinct annotated type.
Relevant commits: 20ccfb7
Original Curator Guidelines for S800 corpus
The annotation guidelines as described in the original publication of the S800 corpus
The guidelines to curators were to annotate all substrings that can meaningfully be identified as referring to a taxon. While the main focus was on annotating species mentions, strings referring to any taxonomic level (e.g. kingdoms, orders, genera, strains) were also considered. These data were very likely never released.
The main guidelines were:
- All document substrings must be evaluated and all mentions including repetitions should be listed in the order of appearance in the text.
- The annotated name types among others should include: Linnaean binomials, common names, strain names, author defined acronyms.
- For each annotated string, curators must record the name as it appeared in text and report the corresponding NCBI Taxonomy database identifier.
- Special cases of adjectives being used to indicate a taxon, misspellings, typographic or other errors and enumerations were indicated as such.
- Taxonomic mentions that did not correspond to an existing NCBI Taxonomy database entry were also indicated.
Inconsistencies detected that initially seem to contrast the above guidelines
- Some fairly obvious species names lack annotation for unidentified reasons. For examples, see mentions of Escherichia coli in 20971900 and 21324422, Pseudomonas aeruginosa in 21075931, and Flavobacterium sinopsychrotolerans in 20118284.
- Inconsistencies in tagging of strains (e.g. strains are not tagged in 20118285).
- Genera etc. tended not to have any annotation, e.g. in 4 of the first 7 documents (in numeric order).
- There appears to be some inconsistency around the tagging of upper-level taxonomical terms such as virus and viral, which are tagged in 20980513 but not in over 100 other cases.
- The adjectival form pneumococcal is annotated in 7/12 cases in S800 although the general rule appears to be that adjectival forms are not annotated (e.g. murine, bovine, cyanobacterial) in contrast to guidelines.
- There is also an apparently inconsistent treatment of common species names, with e.g. mouse, swine, trout and rat partially annotated, but e.g. goat, horse, crab, and rats never receiving annotation in S800. (rat vs. rats being a particularly clear case)
A complete list of inconsistencies can be viewed in this table if it is filtered down to terms annotated in the S800 corpus. See also Comparison of annotations between S800 and LINNAEUS.
Other inconsistencies in the annotation
These are mainly annotation boundary issues. Correct annotation boundaries were not a requirement in the original corpus annotation guidelines and were not taken into account when reporting F-scores on the datasets (“For the mention-level statistics we used flexible boundary matching of species names, meaning that taggers would receive a true positive if it produced a tag that overlapped with an annotated substring and had the correct assigned taxonomic identifier. For example, if the string “E. coli K12” is annotated in S800 and the tagger matches only the string “E. coli”, it will be counted as a true positive (provided the taxonomic identifier is also correct).”).
- The expressions sp. nov. and gen. nov., sp. nov. following a species name are included in annotated spans in some but not all cases. For example, these are part of the annotated spans at the start of documents 20118285, 20139281, and 20154326, but not 19667393, 20139283 and 20139284.
- The string (T) appearing at the end of strain names (superscript T in the original text) is included in annotated spans in some but not all cases. For example, compare 19667393 and 20118285.
- When the name of the person who first named a species is mentioned in parentheses after the species name (e.g. Diabrotica virgifera virgifera (LeConte)), this is annotated in some but not all cases. The same question arises for non-parenthesized forms such as Pseudacteon tricuspis Borgmeier.
- tomato vs. tomato plant(s)
- HCV vs. anti-HCV
- niger vs. A. niger strain
- galaxias vs. native galaxias
- Salmonella enterica vs. Salmonella enterica serovar Typhimurium
- (this example is extra) Influenza vs. Influenza vaccine
Additional guidelines for reannotation of the S800 corpus
- The first resource trusted to resolve issues is NCBI Taxonomy. If it does not provide enough information, the Catalogue of Life should be consulted.
Span consistency guidelines
- Do not include the expressions sp. nov. and gen. nov., sp. nov. in the species name, since these are supposedly used only the first time a genus and/or a species/subspecies is described to denote that it’s new, so they are not part of the scientific name and shouldn’t be found anywhere else other than the first paper describing them.
- An annotated span should not end with sp. or spp.
- Superscript T to denote type strain should not be included in species’ names
- The person’s name should not be included in the species name, especially when it is in parentheses. The non-parenthesized form is a bit more complex (at least in the example above, Pseudacteon tricuspis Borgmeier is a valid name shown as a synonym for Pseudacteon tricuspis in NCBI Taxonomy). For annotation consistency the suggestion is to drop these names in all appearances. (Confusion with subspecies can be avoided because of the capital letter at the start of the third word, e.g. Ursus arctos arctos is easy to distinguish from Ursus arctos Linnaeus, and the name is then dropped for the latter.)
- Do not include common head nouns such as “plants” in annotation span
- Do not include adjectival premodifiers such as “native” in annotation span
- Model words like SCID mouse should be excluded from annotations
- “species complex” should not be part of a species name, e.g. from 20682355:
The splicing activity of the PRP8 intein from the B. dermatitidis, E. parva and P. brasiliensis species complex was demonstrated in a non-native protein context in Escherichia coli.
T5 Species 50 65 B. dermatitidis
T6 Species 67 75 E. parva
T7 Species 80 95 P. brasiliensis
T8 Species 164 180 Escherichia coli
- f. sp. (forma specialis) should be included in the annotated mention (e.g. Blumeria graminis f. sp. tritici)
- Do not include nouns identifying levels of taxonomy in annotation spans. For example, the words strain, serotype, serovar, and serogroup should be excluded from the spans of annotated Strain mentions, e.g. from 20154326:
Strain GSW-R14(T) exhibited 97.6 % 16S rRNA gene sequence similarity ...
T1 Strain 7 14 GSW-R14
- Annotate antibody mentions, e.g. anti-HCV, with a Species annotation for the organism (HCV) and Note: “anti-” prefix
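Several of the span rules above amount to trimming excluded material from the end of an annotated span. The following is a minimal sketch of that idea, not the script actually used on the corpus: the function name `trim_span` and the regular expressions are assumptions made for illustration, and they cover only sp. nov., gen. nov., sp. nov., (T), and trailing sp. / spp.

```python
import re

# Trailing expressions that the span guidelines exclude (illustrative patterns,
# assumed for this sketch; the longest pattern is listed first on purpose).
_TRAILING = [
    re.compile(r"\s*gen\.\s*nov\.,\s*sp\.\s*nov\.?$"),
    re.compile(r"\s*sp\.\s*nov\.?$"),
    re.compile(r"\s*\(T\)$"),
    re.compile(r"\s+spp?\.$"),
]


def trim_span(text, start, end):
    """Return (start, end) with excluded trailing expressions removed."""
    changed = True
    while changed:
        changed = False
        for pattern in _TRAILING:
            match = pattern.search(text[start:end])
            # Never trim a span down to nothing.
            if match and match.start() > 0:
                end = start + match.start()
                changed = True
    return start, end


text = "Strain GSW-R14(T) exhibited 97.6 % 16S rRNA gene sequence similarity"
print(trim_span(text, 7, 17))  # span originally covering "GSW-R14(T)"
```

A span ending in f. sp. would need special handling, since forma specialis is kept in the annotated mention; this sketch does not attempt that.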
General annotation guidelines
- Preprocessing errors (e.g. &amp;) should be fixed
- Clade mentions will receive Clade normalizations and will be assigned type according to nearest non-Clade ancestor
- Similarly, no rank mentions will receive no rank normalizations and will be assigned type according to nearest ranked ancestor
- Mentions that are not monophyletic (e.g. fish) should be annotated as OOS with Note: not monophyletic
- Adjectival forms like murine (taxid:10090), bovine (taxid:9913), pneumococcal (taxid:1313) that map to a specific species should be annotated as such
- Adjectival forms of Phyla (e.g. cyanobacterial: taxid:1117) can only be annotated as OOS or not be annotated at all
- Adjectival forms of Kingdoms (e.g. viral, bacterial) can only be annotated as OOS or not be annotated at all
- Introduce a flag-attribute for cannot be normalized for cases that are not full names (but only understandable as references in context) e.g. strips of types O, A and Asia 1
- Non-name mentions (e.g. woman) and species clues (e.g. patients, children, men, women) should not be annotated. This includes the non-name mention man which should not be annotated as a synonym for Homo sapiens (taxid: 9606)
- The role in which common species names are mentioned should not be taken into account and all species names mentions should be annotated (e.g. rice mentioned as food or tobacco as cigarettes should still be annotated).
- Genus or higher level mentions (e.g. Arabidopsis, yeast) should only be annotated at their real taxonomic level (i.e. genus, phylum) and not as synonyms of species names (e.g. Arabidopsis should be annotated as Genus and assigned the genus taxid:3701):
The second face of a known player: Arabidopsis silencing suppressor AtXRN4 acts organ-specifically
T1 Genus 35 46 Arabidopsis
N1 Reference T1 Taxonomy:3701 Arabidopsis
- Former Species annotations that belong to the following taxonomic ranks: Class, Order, Family and Genus have been annotated as the latter in the corpus. Ranks higher than Class (e.g. Phylum, Kingdom) should receive an Out-of-scope annotation (NCBI Taxonomy ranking adopted from Schoch et al., 2020).
- For annotations above Species only the “coarse” ranks should be considered, thus mapping mentions at fine-grained levels to their coarse equivalents, e.g. Subgenus -> Genus, Subfamily -> Family etc. Some examples are given below:
- Subfamily: Plusiinae -> Family
- Superfamily: butterflies -> Family
- Infraorder: snakes -> Order
- Infraorder: planthoppers -> Order
- Suborder: true bugs -> Order
- Suborder: Heteroptera -> Order
- Infraorder: dragonflies -> Order
- Tribe and Subtribe are also normalized to Family, while Cohort is normalized to Class
- For Subspecies mentions: when a subspecies name immediately follows a species name the entire mention is simply annotated as one slightly longer Species mention, e.g. Phocoenoides dalli dalli annotated as Species + taxid: 9745 (Rank: Subspecies).
- Biotypes should be treated the same way as Subspecies
- “common name (scientific name)” mentions should be annotated as two mentions, e.g. from 21054435:
We studied seasonal dynamics in delta^1^3C of CO2 efflux (delta^1^3C(E)) from non-leafy branches, upper and lower trunks and coarse roots of adult trees, comparing deciduous Fagus sylvatica (European beech) with evergreen Picea abies (Norway spruce).
T1 Species 174 189 Fagus sylvatica
T2 Species 191 205 European beech
T3 Species 222 233 Picea abies
T4 Species 235 248 Norway spruce
- Forms identified by place names, like ecotype, are not annotated.
For investigating cadmium uptake, we incubated protoplasts obtained from leaves of Thlaspi caerulescens (Ganges ecotype) with a Cd-specific fluorescent dye.
T1 Species 83 103 Thlaspi caerulescens
- Cultivars should be annotated as OOS.
The physiological traits underlying the apparent drought resistance of 'Tomatiga de Ramellet' (TR) cultivars.
T1 Out-of-scope 72 92 Tomatiga de Ramellet
- Rootstocks should be annotated as OOS e.g. in 20837155
- Non-taxonomic groupings such as Gram-positive/negative bacteria, marine bacteria or enteric bacteria should not be annotated. e.g.
The redox-sensitive transcription factor SoxR in enteric bacteria senses and regulates the cellular response to superoxide and nitric oxide.
- The last rule includes cases like the following from 21037015
Oscillochloris trichoides is a mesophilic, filamentous, photoautotrophic, nonsulfur, diazotrophic bacterium which is capable of carbon dioxide fixation via the reductive pentose phosphate cycle and possesses no assimilative sulfate reduction.
T1 Species 0 25 Oscillochloris trichoides
- tree and bush are non-taxonomic mentions and thus not annotated or annotated as OOS + Note: non-taxonomic
- Species names in noun phrase premodifier positions (e.g. Arabidopsis EDR1, Aspergillus nidulans cells) also in cases where they appear as part of the name of an entity of a non-organism type (e.g. human epidermal growth factor receptor 2 (HER2)) are annotated.
- Species names are annotated when they are part of hyphenated compound words (e.g. human-infecting) but NOT when they appear as a substring in a word not separated by a boundary such as a hyphen (e.g. nonhuman)
- Abbreviations are marked if the abbreviation stands for an organism mention in scope of the annotation, but not if the full form merely includes an organism mention e.g. in modifier position. For example, the H in HER2 is not annotated despite it standing for human.
- Standalone alga (algae, microalgae, macroalgae): OOS + Note: non-taxonomic e.g. Algae is an informal term for a large and diverse group of photosynthetic eukaryotic organisms.
- protist (any eukaryotic organism that is not an animal, plant, or fungus) is a non-taxonomic expression and will be annotated as OOS + Note: non-taxonomic
- protozoa will also be annotated as OOS + Note: non-taxonomic
- methanotroph is a non-taxonomic expression and will be annotated as OOS + no taxid
- methanogen(s) (over 50 Archaea species): annotated as OOS with Note: not monophyletic
- prokaryotes includes Bacteria and Archaea in the current three-domain system, so this will be annotated as OOS + no taxid, despite the fact that eukaryotes will be annotated as OOS + taxid:2759
- heterokonts and alveolates are clades of microorganisms and will be annotated as OOS + taxid:33634 and taxid:33630 respectively
- cyanobacteria, eubacteria and the like should be annotated as OOS unless it’s clear from context that the reference is definitely to the genus Cyanobacterium or Eubacterium respectively.
- Young animals (e.g. chicks, calves etc.) should NOT receive an annotation or should receive an OOS annotation
- Non-taxonomic groupings of organisms by their behaviour (e.g. herbivores, predators, parasites) are OOS in the annotation
- actinorhiza(l), mycorrhiza(l), ectomycorrhiza: OOS + Note: non-taxonomic
- species complex and clonal complex rank: OOS
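The rank rules in the list above (fine-grained ranks mapped to their coarse equivalents, Subspecies and Biotype treated as Species, ranks above Class marked Out-of-scope) can be summarized as a lookup. This is a minimal sketch under stated assumptions: the function name `annotation_type` is hypothetical, the "subclass" entry is inferred from the mite example (Class for the Acari subclass), and Clade / no rank entries return None because they need the lineage to be resolved.

```python
# Fine-grained NCBI ranks mapped to the coarse ranks used in the annotation.
COARSE_RANK = {
    "subgenus": "genus",
    "subfamily": "family",
    "superfamily": "family",
    "tribe": "family",
    "subtribe": "family",
    "suborder": "order",
    "infraorder": "order",
    "cohort": "class",
    "subclass": "class",  # assumption, inferred from the mite example
}

ANNOTATED_ABOVE_SPECIES = {"genus", "family", "order", "class"}
OUT_OF_SCOPE_RANKS = {"subphylum", "phylum", "kingdom", "superkingdom"}


def annotation_type(ncbi_rank):
    """Map an NCBI Taxonomy rank to the annotation type used in the corpus."""
    rank = ncbi_rank.lower()
    if rank in ("clade", "no rank"):
        # Typed according to the nearest ranked ancestor, which needs the lineage.
        return None
    rank = COARSE_RANK.get(rank, rank)
    if rank in ("species", "subspecies", "biotype"):
        return "Species"
    if rank == "strain":
        return "Strain"
    if rank in ANNOTATED_ABOVE_SPECIES:
        return rank.capitalize()
    if rank in OUT_OF_SCOPE_RANKS:
        return "Out-of-scope"
    raise ValueError(f"rank not covered by this sketch: {ncbi_rank}")


print(annotation_type("infraorder"))
```

Ranks not named in the guidelines raise an error here rather than being guessed.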
Strains
- Strain aliases such as CC-12301(T) (=DSM 45298(T) =CCM 7727(T)) should be annotated in all instances as type Strain.
- Mentions of the form [species name] [strain name] should be annotated as two mentions, Species + Strain, e.g. from 20154326:
Strain GSW-R14(T) exhibited 97.6 % 16S rRNA gene sequence similarity to F. gelidilacus LMG 21477(T) and similarities of 91.2-95.2 % to other members of the genus Flavobacterium
T1 Strain 7 14 GSW-R14
T2 Species 72 86 F. gelidilacus
T3 Strain 87 96 LMG 21477
- mentions of the form [Genus] sp. [Strain] should have separate Genus and Strain annotations
- descriptive references to Strains using gene names are not annotated as organisms e.g. 21097612
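The examples in these guidelines show text-bound annotations as an identifier, a type, start and end character offsets, and the mention text. A quick consistency check is to confirm that the offsets actually reproduce the mention. The sketch below assumes single-span annotations in that whitespace-separated layout; the helper names are hypothetical, and discontinuous annotations are not handled.

```python
def parse_textbound(line):
    """Parse 'T1 Species 7 14 mention text' into its five parts."""
    ann_id, ann_type, start, end, mention = line.split(None, 4)
    return ann_id, ann_type, int(start), int(end), mention


def check_offsets(text, lines):
    """Return the ids of annotations whose offsets do not match the mention."""
    bad = []
    for line in lines:
        ann_id, _type, start, end, mention = parse_textbound(line)
        if text[start:end] != mention:
            bad.append(ann_id)
    return bad


text = ("Strain GSW-R14(T) exhibited 97.6 % 16S rRNA gene sequence "
        "similarity to F. gelidilacus LMG 21477(T)")
annotations = [
    "T1 Strain 7 14 GSW-R14",
    "T2 Species 72 86 F. gelidilacus",
    "T3 Strain 87 96 LMG 21477",
]
print(check_offsets(text, annotations))
```

Because the split is on any whitespace, the same parser also accepts tab-separated annotation lines.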
Viruses
- Viruses (or other taxonomic units) that have species level of entry as “unidentified” (e.g. “retrovirus” taxid:31931 (“unidentified retrovirus” equivalent: “retrovirus”) or “adenovirus” taxid:10535 (“unidentified adenovirus” equivalent: “adenovirus”)) should NOT be annotated in the species level, but should be annotated at the higher taxonomic rank that better describes them (e.g. family rank for Retroviridae, Adenoviridae etc)
- “virus”/”viral” OOS+taxid:10239 “Viruses” superkingdom
- “retrovirus” Family+taxid:11632 “Retroviridae” family
- “influenza virus” Family+taxid:11308 “Orthomyxoviridae” family
- “herpesvirus” Family+taxid:10292 “Herpesviridae” family
- “adenovirus” Family+taxid:10508 “Adenoviridae” family
- “baculovirus” Family+taxid:10442 “Baculoviridae” family
- “reovirus” Family+taxid:10880 “Reoviridae” family
- “norovirus” Genus+taxid:142786 “Norovirus” genus
- “ebola virus” Genus+taxid:186536 “Ebolavirus” genus
- “cytomegalovirus” Genus+taxid:10358 “Cytomegalovirus” genus
- dengue: dengue is synonym for dengue fever (disease), annotate as OOS + no taxid unless dengue virus is mentioned when it should be annotated as taxid:12637 (species)
- smallpox: smallpox is synonym for smallpox disease, annotate as OOS + no taxid unless smallpox virus is mentioned when it should be annotated as taxid:10255 (species)
- influenza: influenza is synonym for the flu (disease), annotate as OOS + no taxid unless influenza X virus is mentioned when it should be annotated as Species. EXCEPTION: standalone influenza may be marked when organism sense is clear from context (e.g. influenza strains)
- human adenovirus (or similar cases): when a mention cannot be normalized to an “identified” virus species it should be annotated e.g. as Species+taxid:9606 (Homo sapiens) for human and Family+taxid:10508 (Adenoviridae) for adenovirus
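The generic virus names listed above form a small fixed mapping from a mention to an annotation type and taxid, which can be kept as a lookup table. This is only a sketch: the table copies the entries given in the list, while the function name `virus_annotation` and the naive plural handling are assumptions added for illustration.

```python
# Generic virus names and the (type, taxid) given for them in the guidelines.
VIRUS_RULES = {
    "virus": ("Out-of-scope", 10239),
    "viral": ("Out-of-scope", 10239),
    "retrovirus": ("Family", 11632),
    "influenza virus": ("Family", 11308),
    "herpesvirus": ("Family", 10292),
    "adenovirus": ("Family", 10508),
    "baculovirus": ("Family", 10442),
    "reovirus": ("Family", 10880),
    "norovirus": ("Genus", 142786),
    "ebola virus": ("Genus", 186536),
    "cytomegalovirus": ("Genus", 10358),
}


def virus_annotation(mention):
    """Return (type, taxid) for a generic virus name, or None if not listed."""
    key = mention.lower()
    if key not in VIRUS_RULES and key.endswith("es"):
        key = key[:-2]  # naive plural handling: "viruses" -> "virus"
    return VIRUS_RULES.get(key)


print(virus_annotation("Retroviruses"))
```

Names such as dengue, smallpox and standalone influenza are deliberately absent, since their treatment depends on context.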
Yeasts
- Discontinuous entities should be annotated as such (e.g. http://ann.turkunlp.org:8088/index.xhtml#/S800/20933017?focus=610~643)
- all text spans including “yeast” should have an Out-of-scope annotation if the taxonomy level is higher than Species:
- standalone yeast: OOS+taxid:147537 (“true yeast” subphylum) (Note: an even higher level may be included)
- black yeast: Order+taxid:34395 (“black yeast” order)
- budding yeast: Order+taxid:4892 (“budding yeasts” order)
- fission yeast: Family+taxid:4894 (“fission yeasts” family)
- truffle: Genus + taxid:36048 (Tuber genus)
Amoebae
- All amoebae instances have been revised to resolve confusion of non-taxonomical expression amoebae (type of cell or unicellular organism which has the ability to alter its shape), of taxid:554915 (OOS: Clade: Amoebozoa), and taxid:55774 (Genus: Amoebae). Most of the cases were non-taxonomical expressions (OOS + no taxid)
- testate amoebae: very common combination of mentions, which means shelled amoebae, which explains the form of microorganism(s): OOS + no taxid
- Interesting article 21112814, where both non-taxonomical and Genus amoebae are mentioned (only one real “amoebae” Genus in the corpus)
Common names
- In general, when a species and a higher-level entry in the taxonomy (e.g. genus) share a common name or synonym, the species interpretation should be preferred when it is not clear from context which is intended.
- Common names like human, goat, horse, and rats should always be annotated.
- Common names that should not be annotated in the species level:
- fire ant: Genus and taxid:13685 (Solenopsis). Note: red fire ant, little fire ant, black fire ant etc. should be tagged as the corresponding species
- ant(s): Family+taxid:36668 (Formicidae)
- insect(s): when standalone assign Class+taxid:50557 (Insecta)
- sunflower: Genus+taxid:4231 (Helianthus)
- galaxias : Genus+taxid:51242 (Galaxias)
- mite: Class+taxid:6933 (Acari subclass)
- trout: several species of fish, annotate as OOS + no taxid
- leafminer and leaf miner: insects that eat the tissue of plants, annotate as OOS + Note: non-taxonomic
- fishes: OOS (Clade-like concept, non-tetrapoda vertebrata)
- bug: OOS + Note: non-taxonomic
- field cricket: OOS + Note: non-taxonomic
- mirid bug: Family+taxid:30084 (Miridae)
- clownfish: Family+taxid:30863 (Pomacentridae)
- elephant: 3 species, not monophyletic (both Elephas and Loxodonta genera), annotate as OOS + no taxid
- crab: infraorder containing 850 species, so it should be annotated as Order + taxid:6752 (Brachyura)
- grass: Family+taxid:4479 (Poaceae)
- seabird(s): OOS with Note: non-taxonomic
- marsupial (animals that carry their young in a pouch) is a mammalian clade and will be annotated as OOS + taxid:9263
- coral(s): Hexacorallia + Octocorallia, but paraphyletic because sea anemones are also part of Hexacorallia: annotated as OOS with Note: not monophyletic
- DNA viruses, RNA viruses map to no rank entries: annotated as Kingdom + normalization to 2080735 and 2559587, respectively
- dsRNA mycoviruses: OOS with Note: non-taxonomic
- cereal: OOS with Note: non-taxonomic
- kittiwake: OOS and Note: non-taxonomic
- Common names that should be annotated in the species level (but could be annotated in a higher taxonomic level)
- rat: synonym for Rattus norvegicus and Rattus. Should be annotated as Rattus norvegicus (taxid:10116), unless explicitly referring to a different taxonomic unit (e.g. cotton rat: Genus + taxid:42414 (Sigmodon))
- fruit fly: synonym for Drosophila melanogaster and Drosophila genus and Tephritidae family. Should be annotated as Drosophila melanogaster (taxid:7227), unless explicitly referring to a different taxonomic unit
- bee: synonym for Apis mellifera, and Apoidea superfamily. Should be annotated as Apis mellifera (taxid:7460), unless explicitly referring to a different taxonomic unit (e.g. bumble bee)
- duck: synonym for Anas platyrhynchos, but can be a synonym for other Anatidae. Should be annotated as Anas platyrhynchos (taxid:8839), unless explicitly referring to a different taxonomic unit
- midge: synonym for Chironomus thummi, but can refer to several species of flies. Should be annotated as Chironomus thummi (taxid:7154), unless explicitly referring to a different taxonomic unit
Very specific distinctions
- 4 mentions of “astomes” in this document 21398102 are OOS
- Astome ciliates in this document 21398102 are also OOS
- But some types of astome ciliates had been established as an order Astomatida
- FGSC should not be annotated as it refers to something that’s out of scope, namely Fusarium graminearum complex 22004876
- Mentions of carnivores in 21323921 have been annotated as OOS, interpreting these to refer generally to meat-eating animals rather than the mammalian order Carnivora
- human and primates in a context of non-human primates are annotated as two mentions 21295520
- Dictyoptera in 19257902 is a clade including two orders Blattodea (cockroaches) and Mantodea (mantids). This has been annotated as OOS + taxid:6970 (Dictyoptera clade)
- termite in 19257902 might refer to the family Termitoidae (termites), even though that is no rank in NCBI Taxonomy. Below the Blattoidea superfamily, the other sister nodes are families. I decided to annotate this as Family + taxid:1912919, following the taxonomy presented in the Catalogue of Life
- 2435057 discusses retroviruses, but the terminology there is quite old (published in 1987). The taxonomy history of the ICTV (International Committee on Taxonomy of Viruses) was used to determine how those viruses were named and classified at that time.
- GII.4 in 20980508 has been annotated as species, following the general rule about Clade mentions
- arbuscular mycorrhizal fungi (AMF) e.g. in 20880038 OOS + Note: non-taxonomic
- tropical japonica rice (e.g. 20946420): following rule about no rank entries: normalization to 1736656 and type Species
Guidelines pending to be applied/decided upon
- mutant species annotation 21054438
- Enterohemorrhagic Escherichia coli in 21148732
- panther, panthers
Corpus expansion (S1000)
Positive classes
- Use UniProt/Swiss-Prot annotations to identify categories of articles aligning with the original S800 categories that mention at least one genus or species that is not already annotated in S800. This will also include all genera of species in the S800 corpus which will be retrieved from mapping of species to their parental ranks in NCBI taxonomy.
- Process for filtering out “known” species/genera, to get the unique taxids and their corresponding NCBI Taxonomy scientific names from the current iteration of the annotation
wget 'http://ann.turkunlp.org:8088/ajax.cgi?action=downloadCollection&collection=%2FS800%2F&include_conf=1&protocol=1' -O S800.tar.gz
tar xvzf S800.tar.gz
cat S800/*.ann | egrep '^N' | cut -f 2 | perl -pe 's/^Reference T\d+ Taxonomy:// or die' | sort -n | uniq > unique-taxids.txt
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cut -f 1,3,7 names.dmp | egrep $'\t''scientific name' | cut -f 1,2 > scientific_names.tsv
cut -f 1,5 nodes.dmp > ranks.tsv
paste scientific_names.tsv ranks.tsv | cut -f 1,2,4 > scientific_names_and_ranks.tsv
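#anchor the pattern at the line start so that a taxid does not also match longer taxids (or names) ending in the same digits
egrep '^('$(tr '\n' '|' < unique-taxids.txt | perl -pe 's/\|$//')')'$'\t' scientific_names_and_ranks.tsv > unique_annotated_names_and_ranks.tsv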
- Four taxids in the data were not found in this release of the taxonomy: 27380, 67004, 891394, and 891400. These have been included in the final list (unique_annotated_names_and_ranks.tsv).
- Filter down to species and genus
#get species and genera from that list (match the whole rank field, so that e.g. subspecies is not included)
egrep $'\t''(species|genus)$' unique_annotated_names_and_ranks.tsv > unique_annotated_names_and_ranks_only_species_genus.tsv
#get species mentions from the above list
egrep $'\t''species$' unique_annotated_names_and_ranks.tsv > unique_annotated_names_and_ranks_only_species.tsv
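The lookup and filtering steps above can also be expressed with exact field comparisons instead of regular expressions, which sidesteps partial matches on taxids or ranks. The following is a sketch only, assuming the standard taxdump layout (fields separated by tab, pipe, tab); the function names are hypothetical and the sample lines are made up for illustration.

```python
def parse_dmp_line(line):
    """Split one taxdump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def scientific_names(names_lines):
    """Map taxid -> scientific name from names.dmp lines."""
    names = {}
    for line in names_lines:
        taxid, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[taxid] = name
    return names


def ranks(nodes_lines):
    """Map taxid -> rank from nodes.dmp lines."""
    return {f[0]: f[2] for f in map(parse_dmp_line, nodes_lines)}


def names_and_ranks(taxids, names, rank_of, keep=("species", "genus")):
    """Return (taxid, name, rank) for annotated taxids with a kept rank."""
    return [(t, names[t], rank_of[t])
            for t in sorted(taxids, key=int)
            if t in names and rank_of.get(t) in keep]


# Illustrative sample lines in taxdump layout.
names_lines = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "9605\t|\tHomo\t|\t\t|\tscientific name\t|\n",
    "63221\t|\tHomo sapiens neanderthalensis\t|\t\t|\tscientific name\t|\n",
]
nodes_lines = [
    "9606\t|\t9605\t|\tspecies\t|\tHS\t|\n",
    "9605\t|\t207598\t|\tgenus\t|\t\t|\n",
    "63221\t|\t9606\t|\tsubspecies\t|\t\t|\n",
]
result = names_and_ranks({"9606", "9605", "63221"},
                         scientific_names(names_lines), ranks(nodes_lines))
print(result)
```

Taxids missing from the taxonomy release are silently skipped here, so cases like the four taxids noted above would still need to be added by hand.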
- The mapping between categories in S800 publication and NCBI Taxonomy
| Category | NCBI Taxonomy Name (NCBI TaxID) |
|---|---|
| Protistology | All eukaryotes that are not Metazoa (includes Insects), Fungi* and Plants** |
| Entomology | Insecta (50557) - Rank: Class |
| Virology | Viruses (10239) - Rank: Superkingdom |
| Bacteriology | Bacteria (2) - Rank: Superkingdom |
| Zoology | Metazoa (33208) excluding those of Class Insecta (50557) - Rank: Kingdom |
| Mycology | Fungi (4751) - Rank: Kingdom |
| Botany | Viridiplantae (33090) - Rank: Kingdom |

* All organisms of the clade Opisthokonta, apart from Metazoa and Fungi, are treated as Protists.
** Chlorophyta and Streptophyta are phyla of Viridiplantae, so they would go to Botany and not to Protists.
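The category mapping above can be read as an ordered set of lineage tests: a taxon belongs to the first category whose root taxid appears among its ancestors, and any remaining eukaryote is treated as a protist. The sketch below is one possible reading, not the Perl scripts used for the corpus; the function name `category` is hypothetical, and it assumes the lineage (the taxon plus all its ancestor taxids) has already been computed from nodes.dmp.

```python
# Checked in order: Insecta must be tested before Metazoa.
CATEGORY_RULES = [
    ("Virology", 10239),      # Viruses
    ("Bacteriology", 2),      # Bacteria
    ("Entomology", 50557),    # Insecta
    ("Zoology", 33208),       # Metazoa (non-insects reach this rule)
    ("Mycology", 4751),       # Fungi
    ("Botany", 33090),        # Viridiplantae
]
EUKARYOTA = 2759


def category(lineage):
    """Return the S800 category for a lineage of taxids, or None."""
    ancestors = set(lineage)
    for name, root in CATEGORY_RULES:
        if root in ancestors:
            return name
    if EUKARYOTA in ancestors:
        return "Protistology"  # eukaryotes outside Metazoa, Fungi and plants
    return None


# Abbreviated lineages, listing only the taxids relevant to the rules.
print(category([2759, 33208, 50557, 7227]))  # Drosophila melanogaster
print(category([2759, 33208, 9606]))         # Homo sapiens
```

Archaea have no category in the table, so lineages that match none of the rules return None.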
- Perl scripts and results are on Puhti
/scratch/project_2001426/stringdata/week_39
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar xvzf taxdump.tar.gz
cut -f1,3,5 nodes.dmp > id_parent_type.dmp
cut -f1,3,7 names.dmp | egrep $'\t''scientific name' | cut -f 1,2 > scientific_names.tsv
paste id_parent_type.dmp scientific_names.tsv > id_parent_type_with_name.tsv
awk -F"\t" 'NR==FNR{a[$1];next}{if($1 in a){print $0}}' unique_annotated_names_and_ranks_only_species.tsv id_parent_type_with_name.tsv> species_in_s800.tsv
cut -f2 species_in_s800.tsv > parents_of_species_in_s800.tsv
awk -F"\t" 'NR==FNR{a[$1];next}{if($1 in a && $3=="genus"){print $0}}' parents_of_species_in_s800.tsv id_parent_type_with_name.tsv> genera_names.tsv
awk -F"\t" '{printf("%s\t%s\t%s\n", $1,$5,$3)}' genera_names.tsv > genera_of_species_in_s800.tsv
#add this in unique_annotated_names_and_ranks_only_species_genus.tsv
cat unique_annotated_names_and_ranks_only_species_genus.tsv genera_of_species_in_s800.tsv > unique_annotated_species_genus_and_species_genera.tsv
sort -u unique_annotated_species_genus_and_species_genera.tsv > tmp && mv tmp unique_annotated_species_genus_and_species_genera.tsv
#get uniprot text file
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
gzip -d uniprot_sprot.dat.gz
#run perl script to generate file with PMIDs per organism
#Input files: unique_annotated_species_genus_and_species_genera.tsv, uniprot_sprot.dat
perl get_organisms_from_swissprot.pl
#Output files: uniprot_organisms_fields_papers.tsv
#get results per genus
cut -f2-4 uniprot_organisms_fields_papers.tsv > uniprot_species-taxid_category_PMIDs.tsv
sort -u uniprot_species-taxid_category_PMIDs.tsv > tmp && mv tmp uniprot_species-taxid_category_PMIDs.tsv
awk -F"\t" 'NR==FNR{a[$1]=$2;next}{if($1 in a){printf("%s\t%s\n",a[$1],$0)}}' id_parent_type_with_name.tsv uniprot_species-taxid_category_PMIDs.tsv > uniprot_parent_taxid_species-taxid_category_PMIDs.tsv
sort -u uniprot_parent_taxid_species-taxid_category_PMIDs.tsv > tmp && mv tmp uniprot_parent_taxid_species-taxid_category_PMIDs.tsv
#run perl scripts to get final list:
perl group_by_genera.pl
perl get_unique_elements_last_column.pl
Negative class
- Use the tagger outputs to sample articles in which (presumably) no species mentions occur and then tag them with a recent model to assess whether the models have a tendency to overtag due to the training data being enriched for species mentions (by comparison to a random sample of PubMed)
- Create a sample of PMIDs without organism mentions on Puhti here: /scratch/project_2001426/stringdata/week_31_2/no-organism-mention-docs/no-organism-mention-pmid-sample.txt
- Process:
sinteractive -A project_2001426 # not on a login node!
cd /scratch/project_2001426/stringdata/week_31_2
mkdir no-organism-mention-docs
cd no-organism-mention-docs
cut -f 1,7 ../all_matches.tsv | egrep $'\t''-2$' | cut -f 1 | uniq > organism-mention-pmids.txt
cut -f 1 ../database_documents.tsv > all-pmids.txt
split -l 1000000 all-pmids.txt all-pmids-
for f in all-pmids-*; do echo $f; sort $f > sorted-$f; done
sort -m sorted-all-pmids-* > sorted-all-pmids.txt
rm all-pmids-* sorted-all-pmids-*
split -l 1000000 organism-mention-pmids.txt organism-mention-pmids-
for f in organism-mention-pmids-*; do echo $f; sort $f > sorted-$f; done
sort -m sorted-organism-mention-pmids-* > sorted-organism-mention-pmids.txt
rm organism-mention-pmids-* sorted-organism-mention-pmids-*
comm -2 -3 sorted-all-pmids.txt sorted-organism-mention-pmids.txt > no-organism-mention-pmids.txt
wc -l all-pmids.txt organism-mention-pmids.txt no-organism-mention-pmids.txt
31257225 all-pmids.txt
10936532 organism-mention-pmids.txt
20320693 no-organism-mention-pmids.txt
- i.e. about 65% (roughly 2/3) of the documents have no organism mentions detected by the tagger(!)
perl -pe '$_ = "" unless(rand()<0.001)' no-organism-mention-pmids.txt > no-organism-mention-pmid-sample.txt
wc -l no-organism-mention-pmid-sample.txt
20406 no-organism-mention-pmid-sample.txt
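The process above is a set difference followed by independent per-line sampling at rate 0.001. With 20,320,693 candidate PMIDs the expected sample size is about 20,321, so the observed 20,406 is well within normal sampling variation. A minimal in-memory sketch of the same two steps follows; the function names are hypothetical, and unlike the shell pipeline it assumes the PMID lists fit in memory.

```python
import random


def pmids_without_mentions(all_pmids, mention_pmids):
    """Keep PMIDs that never appear among the organism-mention PMIDs."""
    mentioned = set(mention_pmids)
    return [pmid for pmid in all_pmids if pmid not in mentioned]


def bernoulli_sample(items, rate, seed=None):
    """Keep each item independently with probability `rate`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < rate]


candidates = pmids_without_mentions(["101", "102", "103", "104"], ["102"])
sample = bernoulli_sample(candidates, 0.5, seed=0)
print(candidates, len(sample))
```

Passing a seed makes the sample reproducible, which the original perl one-liner does not do.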
Experiments to automatically correct inconsistencies on the corpus
Based on a quick check, it appears that (T) is included in the span of 65 annotations in the corpus (~1.7%) and excluded from the span of 79 (~2.1%). sp. nov. is included in the span of 30 annotations and excluded from the span of 21. Sampo wrote a script that adds (T) to annotations and removes sp. nov. from them where present, which led to a 2 percentage point improvement in F-score for the deep learning model:
Original: train_batch_size 4 learning_rate 2e-5 num_train_epochs 3: 78.77
Updated: train_batch_size 24 learning_rate 2e-5 num_train_epochs 4: 80.76
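The kind of span adjustment described above can be illustrated as follows. This is not Sampo's script, which is not reproduced here, but a minimal sketch under several assumptions: the function name `fix_span_ends` is hypothetical, the document text is held on a single line in a file, annotations are given as tab-separated 0-based start and end offsets (end exclusive), and (T) directly follows the span when it is to be added.

```shell
# Hypothetical helper (an illustration, not the script used for the corpus).
# $1: file holding the document text on one line.
# stdin: tab-separated "start<TAB>end" offsets, 0-based, end exclusive.
# stdout: adjusted "start<TAB>end<TAB>mention text".
fix_span_ends() {
    awk -F '\t' -v OFS='\t' '
    NR == FNR { text = $0; next }    # first input file: the document text
    {
        s = $1 + 0; e = $2 + 0
        mention = substr(text, s + 1, e - s)
        # Trim a trailing "sp. nov." (or "sp. nov") and the space before it.
        if (match(mention, /[ ]+sp\. nov\.?$/)) e = s + RSTART - 1
        # Extend the span over a "(T)" that directly follows it.
        if (substr(text, e + 1, 3) == "(T)") e += 3
        print s, e, substr(text, s + 1, e - s)
    }' "$1" -
}
```

For the text `Bacillus foo sp. nov. and strain X1(T) grew`, the spans 0-21 and 33-35 would become 0-12 (`Bacillus foo`) and 33-38 (`X1(T)`).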
A quick experiment compared S800 performance under exact span matching with performance under relaxed criteria. It uses the original version of the corpus to maintain comparability:
- exact matching:
precision 69.95% (603/862) recall 78.62% (603/767) F 74.03%
- left boundary:
precision 77.03% (664/862) recall 86.57% (664/767) F 81.52%
- right boundary:
precision 72.51% (625/862) recall 81.49% (625/767) F 76.73%
- overlap:
precision 80.16% (691/862) recall 88.92% (682/767) F 84.31%
Here, “left/right boundary” only requires the left/right boundaries of the predicted and gold spans to match, and “overlap” accepts any overlap between a predicted span and a gold span as a match. A few notes:
- “overlap” most closely resembles the criterion applied by Pafilis et al., and it’s good to note that we’re outperforming them for this criterion.
- The 10 percentage point absolute difference (nearly 40% of the F-score error) between the exact and overlap results suggests that errors very frequently relate to mismatched entity boundaries (rather than fully missing an entity mention or predicting an entirely extra one).
- “left boundary” gives nearly as high performance as “overlap”, whereas “right boundary” is closer to the “exact” result, indicating that most of the boundary issues are at the end of the spans. This matches our manual examination, which identified (T), sp. nov. and similar span-final elements as sources of inconsistency.
- The difference between these relaxed matching results and the exact matching results for our quick revision of the corpus suggests that there may be systematic differences in spans that we have not yet noticed.
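The four matching criteria can be made concrete with a small scorer. This is a sketch under assumptions, not the evaluation code used for the numbers above: the function name `span_match` and the input layout (one tab-separated line per span, with `G` or `P` for gold or predicted, a document ID, and 0-based start and end offsets with exclusive end) are hypothetical.

```shell
# Hypothetical scorer (not the evaluation script used for the results above).
# stdin: lines "G|P<TAB>doc<TAB>start<TAB>end", end offset exclusive.
# $1: matching mode, one of exact, left, right, overlap.
span_match() {
    awk -F '\t' -v mode="$1" '
    function is_match(a1, a2, b1, b2) {
        if (mode == "exact")   return a1 == b1 && a2 == b2
        if (mode == "left")    return a1 == b1
        if (mode == "right")   return a2 == b2
        if (mode == "overlap") return b1 < a2 && a1 < b2
        return 0
    }
    $1 == "G" { ng++; gdoc[ng] = $2; gs[ng] = $3 + 0; ge[ng] = $4 + 0 }
    $1 == "P" { np++; pdoc[np] = $2; ps[np] = $3 + 0; pe[np] = $4 + 0 }
    END {
        # Quadratic comparison of all gold and predicted spans: fine for a
        # sketch, slow for large corpora.
        for (i = 1; i <= ng; i++)
            for (j = 1; j <= np; j++)
                if (gdoc[i] == pdoc[j] && is_match(gs[i], ge[i], ps[j], pe[j])) {
                    ghit[i] = 1; phit[j] = 1
                }
        for (i in ghit) gtp++    # gold spans matched by some prediction
        for (j in phit) ptp++    # predictions matching some gold span
        p = np ? ptp / np : 0; r = ng ? gtp / ng : 0
        f = (p + r) ? 2 * p * r / (p + r) : 0
        printf "precision %.2f%% (%d/%d) recall %.2f%% (%d/%d) F %.2f%%\n",
               100 * p, ptp, np, 100 * r, gtp, ng, 100 * f
    }'
}
```

The sketch counts matched predictions for precision and matched gold spans for recall separately. That is one way the two numerators could differ under “overlap”, as with 691 and 682 above, although whether the original evaluation counted matches this way is an assumption.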
Comparison of annotations between S800 and LINNAEUS
- Annotated in LINNAEUS, one or more appearances in S800, never annotated in S800: patients, patient, people, women, fly, children, flies, salmon, moth, child, rodent, murine, cyanobacterial, man, infants, participants, Patients, shrimp, crab, rats, bovine, Potato, pea, person, infant, Penaeus monodon, men, persons, calf, goat, horse, Children, Bacillus cereus group, smallpox, sorghum, guinea pigs, fission yeast, Sorghum bicolor, Blumeria graminis
- Annotated one or more times in S800, but not in all cases (manual selection from a large list): human, Escherichia coli, HIV, yeast, rice, mice, HCV, maize, mouse, tomato, wheat, HDV, VZV, HBV, tobacco, Arabidopsis, C. neoformans, E. coli, NDV, Cryptococcus neoformans, Pseudomonas aeruginosa, FGSC, Drosophila, galaxias, pneumococcal, corn, influenza virus, U. maydis, Zea mays, Fusarium graminearum, Bdellovibrio, trout, P. aeruginosa, elephant, Solenopsis invicta, Salmonella enterica, swine, influenza A H1N1, fire ant
- Species clues (e.g. patients, children, men, women) are annotated in LINNAEUS, but not annotated in S800.
LINNAEUS corpus tag filtering
In addition to the primary annotation file tags.tsv, the distribution includes the file filtered_tags.tsv, described in the documentation as follows:
- filtered_tags.tsv – a filtered version of tags.csv where only NCBI-taxonomy-names are included (i.e. “species clues” such as “patient”, “woman”, etc. have been removed)
Comparing this with tags.tsv gives us an idea of what the LINNAEUS authors/annotators tagged but did not consider an NCBI taxonomy name:
diff tags.tsv filtered_tags.tsv | egrep '^<' | cut -f 5 | sort | uniq -c | sort -rn
437 patients
159 patient
112 people
111 women
80 men
71 participants
66 murine
62 children
51 persons
31 calf
27 calves
22 Patients
19 person
18 guinea-pigs
18 Participants
18 MRSA
17 participant
12 guinea-pig
12 chick
10 boy
5 child
5 Women
5 Persons
5 Children
4 woman
4 People
3 Patient
3 Guinea-pigs
2 infants
2 girls
2 boys
2 Murine
2 Men
2 Child
2 Calves
1 text
1 schoolchildren
1 peoples
1 infant
1 girl
1 Person
1 Participant
1 Calf
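Because the counts above are case-sensitive, variants such as "patients" and "Patients" are listed separately. A case-folded variant of the same pipeline might look as follows. This is a sketch: the function name `count_removed` is hypothetical, and it assumes, as the command above does, that the mention text is in the fifth tab-separated column.

```shell
# Hypothetical helper: count mentions present in $1 but absent from $2,
# ignoring case. Assumes the mention text is in tab-separated column 5.
count_removed() {
    diff "$1" "$2" | grep '^<' | cut -f 5 |
        tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}
```

It would be called as `count_removed tags.tsv filtered_tags.tsv`.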
External links to LINNAEUS and S800 corpora
Repositories for conversion to CoNLL format
Other repositories relevant to this project
For information on Annodoc, see http://spyysalo.github.io/annodoc/.