
Assigning species with Speciator

Species assignment tool in Pathogenwatch


Last updated 4 months ago

About

Speciator is an in-house tool for assigning a species to an assembled genome. It combines the approach developed by Anthony Underwood (bactinspector) for searching the NCBI RefSeq database using Mash with the curated library developed by Kat Holt et al. for Kleborate and Bacsort. Speciator accurately assigns species for the majority of genome assemblies in just a few seconds.

Even in projects without a great diversity of species, species assignment is useful as a validation step, and it is a required input to some other analyses (e.g. cgMLST).

Library Structure

Curated Library

A manually constructed set of reference genomes that provides a very accurate assignment of species. The library is based on the Kleborate library with some in-house modifications; for a description please see the source notes. This library currently best covers Klebsiella and other Enterobacteriaceae species, as well as SARS-CoV-2. Signatures: 2,557. Species: 302.

Genus Finder Library

A library of references, derived from the NCBI RefSeq genome database, used for identifying the genus of an uploaded genome. For each species in a genus, one reference is randomly selected and added to the library. Signatures: 35,654.
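The one-reference-per-species selection described above can be sketched in a few lines of Python. This is only an illustration, not the actual library build script: the record layout `(accession, genus, species)`, the function name `pick_genus_finder_references` and the placeholder accessions are all assumptions made for the example.

```python
import random
from collections import defaultdict


def pick_genus_finder_references(records, seed=None):
    """Pick one random reference per (genus, species) pair.

    `records` is an iterable of (accession, genus, species) tuples.
    This layout is hypothetical and chosen only for the sketch.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for accession, genus, species in records:
        by_species[(genus, species)].append(accession)
    # Sort the keys so that a fixed seed gives a reproducible selection.
    return {key: rng.choice(accs) for key, accs in sorted(by_species.items())}


# Placeholder accessions, not real RefSeq identifiers.
example_records = [
    ("ref_A1", "Klebsiella", "Klebsiella pneumoniae"),
    ("ref_A2", "Klebsiella", "Klebsiella pneumoniae"),
    ("ref_B1", "Klebsiella", "Klebsiella oxytoca"),
    ("ref_C1", "Neisseria", "Neisseria gonorrhoeae"),
]
selected = pick_genus_finder_references(example_records, seed=1)
```

Each species contributes exactly one signature to the library regardless of how many references it has, which keeps heavily sequenced species from dominating the genus search.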

Virus/Fungus/Genus-specific Libraries

A set of libraries representing viruses, fungi and each bacterial genus, constructed from the reference genomes available in RefSeq (March 2020). Signatures: 196,277. Genera: 2,842 + Virus/Fungus. Species: 39,268. NB this includes a significant number of singleton species that are likely to be merged into another species on review.

"No Genus" Library

A number of RefSeq genomes have not yet been assigned to a species within a known genus. This can be for a variety of reasons, including newly identified species that have yet to be classified. They can also include genomes that are in fact part of a known species where this has not yet been recognised in the database. These genomes are collected into a single library, built in the same fashion as a genus library. Signatures: 2,845.

Method

  1. The query genome is searched against the curated library with a distance threshold (-d) of 0.04 (the Kleborate default), and the nearest match is used to assign the species.

  2. If no match is found, the genome is assigned to a kingdom or bacterial genus by searching the Genus Finder library with a distance threshold (-d) of 0.15 and the top 20 matches used to identify the genus.

  3. The selected genus (-d 0.05) or kingdom (-d 0.075) library is searched and the top 20 matches used to identify the species.

  4. If no genus is identified in step 2 or no species is identified in step 3 then a final search is carried out against the No Genus library with a distance threshold (-d) of 0.05 and the top 20 matches used to identify the species.

  5. If no species is assigned in the previous steps, then it is considered "unclassified".
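The five steps above can be sketched as a small Python cascade. This is a minimal illustration under stated assumptions, not Speciator's actual code: the `search(library, max_distance)` callback, the library names, the `KINGDOMS` labels and the majority-vote rule used to reduce the top 20 matches to one label are all hypothetical, since the text does not specify how the top 20 matches are combined. Only the thresholds are taken from the steps.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Hit = Tuple[str, float]                      # (label, Mash distance)
Search = Callable[[str, float], List[Hit]]   # hypothetical search interface

KINGDOMS = {"Viruses", "Fungi"}              # hypothetical kingdom labels


def consensus(hits: List[Hit], top_n: int = 20) -> Optional[str]:
    """Most common label among the top_n nearest hits (assumed voting rule)."""
    nearest = sorted(hits, key=lambda h: h[1])[:top_n]
    labels = [label for label, _ in nearest]
    return Counter(labels).most_common(1)[0][0] if labels else None


def assign_species(search: Search) -> str:
    # Step 1: curated library, nearest match within 0.04 wins.
    curated = search("curated", 0.04)
    if curated:
        return min(curated, key=lambda h: h[1])[0]
    # Step 2: find a kingdom or bacterial genus within 0.15.
    group = consensus(search("genus_finder", 0.15))
    # Step 3: search the selected genus (0.05) or kingdom (0.075) library.
    if group is not None:
        threshold = 0.075 if group in KINGDOMS else 0.05
        species = consensus(search(group, threshold))
        if species is not None:
            return species
    # Step 4: fall back to the "No Genus" library within 0.05.
    species = consensus(search("no_genus", 0.05))
    # Step 5: nothing matched anywhere.
    return species if species is not None else "unclassified"


def make_search(libraries) -> Search:
    """Build a fake search over in-memory hit lists, for demonstration only."""
    def search(library: str, max_distance: float) -> List[Hit]:
        return [h for h in libraries.get(library, []) if h[1] <= max_distance]
    return search
```

For example, `assign_species(make_search({"curated": [("Klebsiella pneumoniae", 0.01)]}))` stops at step 1, while an empty set of libraries falls through every step and yields "unclassified".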

Figure: flow diagram for species assignment in Speciator.

Validation

Pathogenwatch Public Collections

All public collections in Pathogenwatch contain only verified members of the specified species, including more than 10,000 genomes each for Staphylococcus aureus, Salmonella Typhi and Neisseria gonorrhoeae. Furthermore, we have tested against examples from a range of other species, including Candida auris, Zika virus and Renibacterium salmoninarum, as well as several thousand SARS-CoV-2 and non-SARS-CoV-2 genomes. Speciator identifies the correct species for these genomes with 100% accuracy, without any specific interventions in the software to achieve this.

Note for PATH-SAFE initiative: Speciator has been confirmed to effectively identify Salmonella species.

SPARK Klebsiella/Raoultella Collection

EuSCAPE non-Kpn genomes

Speciator now also assigns 100% of the non-Kpn (non-K. pneumoniae) part of the EuSCAPE Klebsiella survey correctly. The previous version handled the K. pneumoniae part of the collection but was nearer 50% correct on the other species.

Caveats

  • Not all species are well defined, and references can be incorrectly classified or the classification out of date in RefSeq.

  • Contaminated samples (assemblies containing more than one species) will get a single species assignment. This tool is not intended for metagenomics.

  • There are many species that we haven't been able to test in depth and there's no gold standard data set covering the more unusual species to any depth.

References

How to cite

There isn't currently a specific publication on Speciator. It was first introduced in the following paper:

Gladstone RA, Lo SW, Goater R, et al. Visualizing variation within Global Pneumococcal Sequence Clusters (GPSCs) and country population snapshots to contextualize pneumococcal isolates. Microb Genom. 2020;6(5):e000357. doi:10.1099/mgen.0.000357

Code Repository

Mash is used for all searches between query genomes and reference libraries, with a k-mer size (-k) of 21 and a sketch size (-s) of 1000.
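To give a feel for what these two parameters control, the following Python sketch builds bottom-s MinHash sketches from 21-mers and converts the estimated Jaccard index j into a Mash distance using the formula from the Mash paper, D = -(1/k) ln(2j / (1 + j)). It is a toy illustration, not Mash itself: the BLAKE2 hash, the function names and the synthetic sequences are assumptions made for the example (Mash uses its own hashing and file formats).

```python
import hashlib
import math
import random

K = 21    # k-mer size (-k)
S = 1000  # sketch size (-s)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def sketch(seq, k=K, s=S):
    """Keep the s smallest hash values over all canonical k-mers of seq."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(canonical(seq[i:i + k]).encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return sorted(hashes)[:s]


def mash_distance(a, b, k=K, s=S):
    """Estimate the Jaccard index from two sketches and convert it to a distance."""
    set_a, set_b = set(a), set(b)
    bottom = sorted(set_a | set_b)[:s]
    if not bottom:
        return 1.0
    shared = sum(1 for h in bottom if h in set_a and h in set_b)
    j = shared / len(bottom)
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


# Synthetic sequences: a random "genome", a copy with one substitution
# every 100 bases, and an unrelated random sequence.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
mutated = list(genome)
for pos in range(50, len(mutated), 100):
    mutated[pos] = "ACGT"[("ACGT".index(mutated[pos]) + 1) % 4]
mutated = "".join(mutated)
unrelated = "".join(rng.choice("ACGT") for _ in range(5000))

d_same = mash_distance(sketch(genome), sketch(genome))
d_close = mash_distance(sketch(genome), sketch(mutated))
d_far = mash_distance(sketch(genome), sketch(unrelated))
```

A larger sketch size reduces the variance of the Jaccard estimate at the cost of bigger signatures, while the k-mer size sets how specific each shared k-mer is.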

The species assignment is then passed on to downstream tools that require a species identifier, such as MLST, or that are species-specific, such as Genotyphi.

We were kindly given the opportunity to test our species assignments against the curated manual assignments of the SPARK collection of diverse Klebsiella. Speciator gets 100% of assignments correct.

Mash: Ondov, BD. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology (2016). doi:10.1186/s13059-016-0997-x

Kleborate: Lam, MMC. et al. A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex. Nature Communications (2021).

Yersiniabactin and colibactin (ICEKp): Lam, MMC. et al. Genetic diversity, mobilisation and spread of the yersiniabactin-encoding mobile element ICEKp in Klebsiella pneumoniae populations. Microbial Genomics (2018).

Aerobactin and salmochelin: Lam, MMC. et al. Tracking key virulence loci encoding aerobactin and salmochelin siderophore synthesis in Klebsiella pneumoniae. Genome Medicine (2018).

Kaptive for capsule (K) serotyping: Wyres, KL. et al. Identification of Klebsiella capsule synthesis loci from whole genome data. Microbial Genomics (2016).

Kaptive for O antigen (LPS) serotyping: Wick, RR. et al. Kaptive Web: user-friendly capsule and lipopolysaccharide serotype prediction for Klebsiella genomes. Journal of Clinical Microbiology (2018).

The software is available under an OSS licence: https://github.com/pathogenwatch-oss/speciator