next.pathogen.watch docs

Core Filter

About

Two filtering steps are applied to remove loci that can be problematic for tree building. Firstly, paralogues are added to the filter, and then loci that show unexpectedly high variance when compared to the nearest reference are removed.

Filtering Process

Paralog Filter

  • Any core gene that has more than one match in the core profile is added to the filter.
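As a rough illustration of the rule above, the sketch below flags any core gene family that appears more than once. The input shape (a flat list of family identifiers, one per match in the core profile) and the function name `paralog_filter` are assumptions made for this example, not Pathogenwatch's actual data model.

```python
from collections import Counter

def paralog_filter(matched_families):
    """Return the set of core gene families with more than one match.

    `matched_families` is a hypothetical input: one family ID per match
    found in the core profile.
    """
    counts = Counter(matched_families)
    return {family for family, n in counts.items() if n > 1}

# Hypothetical profile in which family "geneA" is matched twice.
filtered = paralog_filter(["geneA", "geneB", "geneA", "geneC"])
```

Here `filtered` contains only `geneA`, the single family with two matches.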

Variance Filter

Principle

The variation filter is used to identify and remove loci that show an unusually large (or, in more distant comparisons, unusually small) number of variant sites given the mutation rate over the rest of the genome. For this we assume that the distribution of mutations amongst loci should approximate a Poisson distribution, and exclude loci that fall outside a predetermined probability threshold. "Excessively" variant loci are likely to be so due to either (a) erroneous assembly, which is not unlikely when dealing with large numbers of genomes, or (b) lateral gene transfer. In both cases, including the locus in tree building can lead to errors in branch lengths and in the neighbour-joining algorithm.

Determining the probability threshold

To determine a probability threshold for marking a locus as unexpectedly variant, the following approach is applied. We assume that the two genomes in a given pair are equally diverged from a (close) common ancestor, and so we should observe twice as many variants as have occurred in a single genome. Thus the first calculation is 1 / (2 x core families). At this point it would be expected to use the number of comparisons in the calculation to further lower the threshold, since in carrying out many comparisons we would expect to see rare events occurring. However, this makes calculations between collections not directly comparable, so we use a fixed number of comparisons that takes into account the large number of comparisons we expect to run in Pathogenwatch. Thus the final threshold calculation is:

1 / (1000000 x 2 x C) where C is the number of core families.

It should be noted that this filter is very conservative and only removes extremely divergent alleles. In the vast majority of genomes we don’t observe any filtered loci.
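To give a sense of scale, the threshold formula can be evaluated directly. This is a minimal sketch: the function name `variation_threshold` and the value of 2000 core families are hypothetical, chosen only to illustrate the arithmetic.

```python
def variation_threshold(core_families, comparisons=1_000_000):
    """Threshold = 1 / (comparisons x 2 x C), with C the number of core families.

    The default of 1,000,000 is the fixed comparison count from the formula.
    """
    return 1.0 / (comparisons * 2 * core_families)

# Hypothetical species with 2000 core families.
threshold = variation_threshold(2000)
```

With these assumed inputs the threshold is 2.5e-10, which shows why only extremely divergent loci are removed.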

Creating The Variation Filter

During the reference assignment task, the number of differences at each locus, the total number of differences, and the total number of nucleotides are counted. An overall mutation rate is calculated as differences / total nucleotides.

  1. Then, for each locus, an expected number of mutations is determined by multiplying the mutation rate by the locus length in nucleotides. If the expected number is below 1, a minimum value of 1 is used.

  2. The expected value is used as the mean of a Poisson distribution, and the cumulative probability of the observed number of mutations or more is determined. If the number of mutations is below the mean (i.e. fewer are observed than expected), the inverse is determined instead, though in practice this never occurs as the mutation rate is normally low.

  3. Loci that fail to meet the threshold are noted and excluded from further comparisons.
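The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Pathogenwatch implementation: the input shape (a dict of locus name to differences and length), the function names, and the reading of "the inverse" as the lower tail P(X <= k) are all assumptions. The upper tail is summed directly rather than computed as 1 minus the CDF, since the thresholds involved are far smaller than floating-point precision near 1.

```python
import math

def poisson_pmf(k, mu):
    # Evaluated in log space so large k does not overflow.
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def upper_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu), summed term by term."""
    if k <= 0:
        return 1.0
    total, i = 0.0, k
    while True:
        term = poisson_pmf(i, mu)
        total += term
        # Past the mean the terms only shrink; stop once they are negligible.
        if i > mu and term <= total * 1e-16:
            return total
        i += 1

def lower_tail(k, mu):
    """P(X <= k) for X ~ Poisson(mu); assumed meaning of 'the inverse'."""
    return sum(poisson_pmf(i, mu) for i in range(k + 1))

def variation_filter(loci, core_families, comparisons=1_000_000):
    """Return names of loci whose tail probability is below the threshold.

    `loci` is a hypothetical mapping: name -> (differences, length in nt).
    """
    threshold = 1.0 / (comparisons * 2 * core_families)
    total_diffs = sum(d for d, _ in loci.values())
    total_nt = sum(n for _, n in loci.values())
    rate = total_diffs / total_nt  # overall mutation rate
    filtered = []
    for name, (diffs, length) in loci.items():
        expected = max(1.0, rate * length)  # minimum expected value of 1
        if diffs >= expected:
            p = upper_tail(diffs, expected)
        else:
            p = lower_tail(diffs, expected)
        if p < threshold:
            filtered.append(name)
    return filtered

# Hypothetical data: 50 ordinary loci and one with far more differences.
loci = {f"locus{i}": (2, 1000) for i in range(50)}
loci["odd"] = (60, 1000)
flagged = variation_filter(loci, core_families=len(loci))
```

In this made-up data set only `odd` is flagged; the ordinary loci sit close to the expected count and fall well inside the threshold.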


Last updated 5 months ago