S. enterica SNP tree

Developing the SNP tree method

Salmonella enterica scheme description

The scheme consists of a manually curated library of core sequence loci and a reference set of complete genomes. The library was derived from one previously created for Salmonella enterica serovar Typhi (Argimon, Yeats et al., 2021). The core library consists of 2,837 contigs covering 3,168,347 nucleotides, approximately 60% of a typical S. enterica genome, with the proportion ranging from 53% to 70% depending on the size of the accessory genome in a given strain.

The core is built using a defined set of rules, as described in previous publications. In brief, the core library is iteratively tested against a set of complete reference genomes to define a consistent set of regions common to all genomes. Families with paralogues or partial matches are filtered out, while overlapping genes are linked into single segments. Core contigs are normally expected to be found in all references. However, S. enterica is unusually diverse and, with the very large number of complete genomes now available, rare gene deletions are frequently observed. Applying a strict criterion (presence in every genome) led to the exclusion of the majority of candidate core regions. Instead, a “soft-core” approach was used, which required a contig to be observed in at least 95% of the 1,235 complete genomes.
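The difference between the strict and soft-core rules can be illustrated with a minimal Python sketch. It is not the code used to build the library: the function name, the contig and genome identifiers, and the toy presence data are all hypothetical, and the paralogue and partial-match filters are left out.

```python
import math


def soft_core(presence, n_genomes, threshold=0.95):
    """Return contig ids observed in at least `threshold` of the genomes.

    `presence` maps a contig id to the set of genome ids in which it was found.
    (Hypothetical data structure, chosen only for illustration.)
    """
    return sorted(
        contig
        for contig, genomes in presence.items()
        if len(genomes) / n_genomes >= threshold
    )


# Toy data: 20 hypothetical genomes and three hypothetical candidate contigs.
genomes = [f"genome_{i}" for i in range(20)]
presence = {
    "contig_A": set(genomes),       # present in all 20 genomes
    "contig_B": set(genomes[:19]),  # missing from one genome (95%)
    "contig_C": set(genomes[:18]),  # missing from two genomes (90%)
}

strict_core = soft_core(presence, len(genomes), threshold=1.0)
relaxed_core = soft_core(presence, len(genomes), threshold=0.95)

# Minimum number of genomes implied by a 95% threshold over 1,235 genomes.
min_genomes = math.ceil(0.95 * 1235)
```

On this toy input the strict rule keeps only contig_A, while the 95% rule also keeps contig_B. For 1,235 genomes, a 95% threshold means a contig must be observed in at least 1,174 of them.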

The reference set of complete genomes is used by the distance calculation method to identify and filter out badly assembled or otherwise unreliable regions before the overall distance is calculated. While using more representatives helps ensure that an optimal reference is found, each additional reference also has a significant impact on the speed and cost of running the software. To select a diverse but representative group, the distance matrix between 1,235 candidate references was calculated and clustered using Affinity Propagation to select representatives. This process selected a total of 87 references.
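To show how Affinity Propagation can pick representatives from a distance matrix, the following is a small pure-Python sketch of the standard message-passing algorithm. It is not the pipeline's implementation. The choices made here are assumptions that the text above does not specify: similarity is taken as the negative distance, the preference is the median similarity, and the damping and iteration count are arbitrary. The genome names and distances are invented toy data.

```python
from statistics import median


def affinity_propagation(distances, damping=0.5, iterations=200):
    """Return (exemplars, labels) for a square, symmetric distance matrix."""
    n = len(distances)
    # Assumption: similarity is the negative distance (larger = more alike).
    S = [[-float(d) for d in row] for row in distances]
    # Assumption: every genome gets the median similarity as its preference.
    preference = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        S[i][i] = preference

    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how well k suits i compared with i's best alternative.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: how much support k has gathered from other genomes.
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    target = support
                else:
                    target = min(0.0, R[k][k] + support - positive[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * target

    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        raise ValueError("no exemplars found; try more iterations or damping")
    # Each genome joins its most similar exemplar; exemplars label themselves.
    labels = [
        i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
        for i in range(n)
    ]
    return exemplars, labels


# Toy data: six hypothetical genomes forming two well-separated groups.
names = [f"ref_{i}" for i in range(6)]
positions = [0, 1, 2, 10, 11, 12]
distances = [[abs(p - q) for q in positions] for p in positions]

exemplars, labels = affinity_propagation(distances)
representatives = [names[k] for k in exemplars]
```

On this toy matrix the central genome of each group (ref_1 and ref_4) is chosen as the representative. The number of clusters is not set in advance: it follows from the preference value, which is one reason the method suits picking a reference set of unknown size.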

Validation of SNP-based clustering for PATH-SAFE

The full Validation Report can be found HERE.

Code repository

See https://github.com/pathogenwatch-oss/core-fp and https://github.com/pathogenwatch-oss/tasks/tree/main/tree
