
Tree Construction


Last updated 5 months ago

About

To generate a score suitable for clustering related genomes, Pathogenwatch compares the variant positions of every pair of loci found in both genomes, except for those excluded by the previously calculated variation filter (see Core Filter). So that incomplete genomes can also be analysed, the score over the region that could be compared is scaled to the expected core size, which approximates the score the pair would have had if both genomes were complete.

Scoring Genome Pairs

  1. Extract substitutions for each locus. Indels are excluded because they are often the result of assembly or sequencing error, and our testing found that the noise from these events could overwhelm the true distances between closely related genomes.

  2. Pair up the loci of each core family. In the vast majority of cases each genome has a single locus for the family, so these are trivially paired. If there is more than one locus, the most similar loci are paired together. Unmatched loci are ignored.

  3. During the comparison, the number of compared nucleotides is tracked. At the end, this count is used to scale the score to the "expected number of nucleotides", which is calculated as the sum of the lengths of the reference sequences used to identify the core.

  4. The total number of variant sites between the pair of genomes is calculated and then scaled to the expected number of nucleotides, as described above.
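
The four steps above can be sketched in Python. This is an illustrative sketch only, not Pathogenwatch's code: the function names, the representation of each locus as a reference-aligned string with "-" for gaps, the greedy pairing strategy, and the exact scaling formula (differences multiplied by expected nucleotides and divided by compared nucleotides) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def count_substitutions(a: str, b: str) -> Tuple[int, int]:
    """Return (substitutions, compared positions) for two aligned loci.

    Assumes both strings are aligned to the same reference; positions
    with a gap ("-") in either sequence are skipped, mirroring the
    exclusion of indels.
    """
    diffs = 0
    compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs, compared


def pair_loci(seqs_a: List[str], seqs_b: List[str]) -> List[Tuple[str, str]]:
    """Greedily pair the most similar loci; unmatched loci are dropped."""
    candidates = sorted(
        (count_substitutions(a, b)[0], i, j)
        for i, a in enumerate(seqs_a)
        for j, b in enumerate(seqs_b)
    )
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((seqs_a[i], seqs_b[j]))
    return pairs


def scaled_score(
    genome_a: Dict[str, List[str]],
    genome_b: Dict[str, List[str]],
    expected_nt: int,
) -> float:
    """Scale the observed substitution count to the expected core size.

    Each genome maps a core family name to its list of loci. The scaling
    formula used here is an assumption for illustration.
    """
    total_diffs = 0
    total_compared = 0
    for family, seqs_a in genome_a.items():
        for a, b in pair_loci(seqs_a, genome_b.get(family, [])):
            diffs, compared = count_substitutions(a, b)
            total_diffs += diffs
            total_compared += compared
    if total_compared == 0:
        raise ValueError("no comparable nucleotides between the two genomes")
    return total_diffs * expected_nt / total_compared
```

For example, if two genomes share only one 4-nucleotide locus with a single substitution and the expected core is 8 nucleotides, this sketch returns a scaled score of 2.0 rather than the raw count of 1.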

The Dendrogram

  1. A dendrogram is then constructed by writing all scaled pairwise scores to a distance matrix and running the neighbour-joining implementation from the APE package (Paradis et al).

  2. The resulting tree is then midpoint rooted using the phangorn package (KP Schliep).
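
To illustrate the first step, the sketch below fills a symmetric matrix from pairwise scores and runs a textbook neighbour-joining procedure in pure Python, returning an unrooted Newick string. Pathogenwatch uses the APE R package for this step, so the code is a stand-in for explanation rather than the production implementation; the function names and input layout are hypothetical, and midpoint rooting is not included.

```python
from typing import Dict, List, Sequence, Tuple


def build_matrix(
    names: Sequence[str], scores: Dict[Tuple[str, str], float]
) -> List[List[float]]:
    """Write pairwise scores into a symmetric matrix with a zero diagonal."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for (a, b), score in scores.items():
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = score
    return matrix


def neighbour_joining(names: Sequence[str], matrix: List[List[float]]) -> str:
    """Textbook neighbour joining; returns an unrooted Newick string."""
    if len(names) < 3:
        raise ValueError("neighbour joining needs at least three genomes")
    labels = list(names)
    d = [row[:] for row in matrix]
    while len(labels) > 3:
        n = len(labels)
        r = [sum(row) for row in d]
        # Pick the pair that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the joined pair to their new parent node.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        joined = f"({labels[i]}:{li:.4f},{labels[j]}:{lj:.4f})"
        # Distances from the new node to every remaining node.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] for a in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        labels = [labels[k] for k in keep] + [joined]
    # Resolve the final three nodes around a central trifurcation.
    a, b, c = labels
    la = 0.5 * (d[0][1] + d[0][2] - d[1][2])
    lb = 0.5 * (d[0][1] + d[1][2] - d[0][2])
    lc = 0.5 * (d[0][2] + d[1][2] - d[0][1])
    return f"({a}:{la:.4f},{b}:{lb:.4f},{c}:{lc:.4f});"
```

With scores that are exactly additive on a tree, neighbour joining recovers that tree and its branch lengths; real scaled scores are only approximately additive, so the result is an estimate.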

Core Filter
Paradis et al
KP Schliep