Tree Construction

About

To generate a score suitable for clustering related genomes, Pathogenwatch compares variant positions across all pairs of loci found in the two genomes, except those excluded by the previously calculated variation filter (see Core Filter). So that incomplete genomes can also be analysed, the score over the partial region is scaled to an expected core size, giving an approximation of what the score would be if the genome were complete.

Scoring Genome Pairs

Extract substitutions for each locus. Indels are excluded because they are often the result of assembly or sequencing error, and our testing found that the noise from these events could overwhelm the true distances between closely related genomes.
In the vast majority of cases each genome has only a single locus for a given family, so these are trivially paired up. If there is more than one locus, the most similar loci are paired together. Unmatched loci are ignored.
During the comparison the number of compared nucleotides is tracked. At the end this is used to scale the score to the "expected number of nucleotides", which is calculated as the sum of the lengths of the reference sequences used to identify the core.
The total number of variant sites between the pair of genomes is calculated and then scaled to the expected number of nucleotides, as described above.
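
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Pathogenwatch implementation: the `Locus` structure, the function names, the use of the shorter locus length as the count of compared nucleotides, and the linear scaling formula (raw score multiplied by expected nucleotides and divided by compared nucleotides) are all assumptions made for the example. It also assumes loci have already been paired one-to-one per family and that filtered positions have already been removed.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Locus:
    """Hypothetical container for one core locus in one genome."""
    length: int  # number of reference nucleotides covered by this locus
    # Substitutions only (no indels): reference position -> observed base.
    substitutions: Dict[int, str] = field(default_factory=dict)


def count_differences(a: Locus, b: Locus) -> int:
    """Count positions where the two loci carry different bases."""
    positions = set(a.substitutions) | set(b.substitutions)
    return sum(
        1 for p in positions
        if a.substitutions.get(p) != b.substitutions.get(p)
    )


def scaled_score(genome_a: Dict[str, Locus],
                 genome_b: Dict[str, Locus],
                 expected_nucleotides: int) -> float:
    """Raw variant-site count scaled to the expected core size."""
    compared = 0
    differences = 0
    # Families present in only one genome are unmatched and ignored.
    for family in genome_a.keys() & genome_b.keys():
        a, b = genome_a[family], genome_b[family]
        compared += min(a.length, b.length)  # assumed overlap length
        differences += count_differences(a, b)
    if compared == 0:
        raise ValueError("no shared loci to compare")
    return differences * expected_nucleotides / compared


genome_a = {"f1": Locus(100, {10: "A", 20: "T"}), "f2": Locus(100)}
genome_b = {"f1": Locus(100, {10: "A", 30: "G"}), "f3": Locus(50)}
# Only f1 is shared: 2 differing sites over 100 compared nucleotides,
# scaled to an expected core of 1000 nucleotides.
print(scaled_score(genome_a, genome_b, expected_nucleotides=1000))  # 20.0
```

In this toy case two genomes that share only a tenth of the expected core differ at 2 sites, which is scaled up to a score of 20 for the full core.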

The Dendrogram

A dendrogram is then constructed by writing all scaled pairwise scores to a matrix and running the neighbour-joining implementation from the APE package (Paradis et al.).
The resulting tree is then midpoint-rooted using the phangorn package (KP Schliep).
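
The matrix step can be illustrated as follows. This is a sketch under stated assumptions: the function names and the CSV layout are hypothetical, and the neighbour-joining and midpoint-rooting steps themselves are left to APE and phangorn as described above. It only shows how a set of scaled pairwise scores could be arranged into the symmetric, zero-diagonal matrix that a neighbour-joining routine expects.

```python
import csv
import io
from typing import Dict, List, Tuple


def build_distance_matrix(names: List[str],
                          scores: Dict[Tuple[str, str], float]
                          ) -> List[List[float]]:
    """Arrange pairwise scores into a symmetric matrix with a zero diagonal."""
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), score in scores.items():
        i, j = index[a], index[b]
        # A pairwise score is undirected, so fill both triangles.
        matrix[i][j] = score
        matrix[j][i] = score
    return matrix


def matrix_to_csv(names: List[str], matrix: List[List[float]]) -> str:
    """Serialise the matrix with genome names as row and column labels."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([""] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + row)
    return buffer.getvalue()


names = ["g1", "g2", "g3"]
scores = {("g1", "g2"): 20.0, ("g1", "g3"): 35.0, ("g2", "g3"): 15.0}
matrix = build_distance_matrix(names, scores)
print(matrix_to_csv(names, matrix))
```

A labelled matrix like this can then be loaded by whichever tool performs the tree building.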

