Finding HierCC codes with hclink
HierCC = Hierarchical Clustering of cgMLST (used in EnteroBase), which yields codes representing population assignments
cgMLST = core genome Multi-Locus Sequence Typing
Purpose
This tool robustly links Pathogenwatch cgMLST profiles to HierCC clusters from EnteroBase. Its purpose is two-fold:
to bring the HierCC labelling into Pathogenwatch
to allow end users to discover potential new outbreak members via EnteroBase
New genomes are matched to the nearest EnteroBase cgMLST profile, and HierCC (Zhou, Charlesworth, Achtman; 2021) cluster identifiers are inferred from the profile distances. HierCC annotations provide both a familiar cluster naming scheme and an approach to linking genomes that complements the Pathogenwatch clustering.
About hclink
Each version of hclink includes the complete set of cgMLST profiles and linked HierCC codes available on the day that version was created. Given an input profile (a genome's cgMLST profile), hclink carries out a rapid search of all the available profiles to find the nearest one according to the raw number of differences. The corrected HierCC distance score is then calculated against this nearest profile and used to infer the HierCC cluster codes up to the threshold indicated by the corrected distance.
Note that this process is heuristic and may select a non-optimal profile when the input has a large number of missing alleles. Since the distances in such cases would generally be larger, this should not affect outbreak cluster identification, which is generally limited to low cluster thresholds.
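The search-then-infer procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not hclink's actual code: the function names, the toy profiles and cluster codes, the form of the distance correction (scaling the raw count by the fraction of loci compared), and the rule that a code is inherited when the distance is at or below the threshold are all hypothetical.

```python
from typing import Dict, List, Optional

# A profile is one allele number per locus; None marks a missing allele.
Profile = List[Optional[int]]


def raw_differences(a: Profile, b: Profile) -> int:
    """Count loci where both alleles are present and differ."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)


def corrected_distance(a: Profile, b: Profile) -> float:
    """ASSUMED correction: scale the raw count by total loci / loci compared."""
    shared = sum(1 for x, y in zip(a, b) if x is not None and y is not None)
    if shared == 0:
        return float(len(a))  # nothing comparable: treat as maximally distant
    return raw_differences(a, b) * len(a) / shared


def nearest_profile(query: Profile, references: Dict[str, Profile]) -> str:
    """Pick the reference with the fewest raw differences (the heuristic step)."""
    return min(references, key=lambda name: raw_differences(query, references[name]))


def infer_codes(distance: float,
                reference_codes: Dict[int, str]) -> Dict[int, Optional[str]]:
    """ASSUMED rule: inherit the reference's code at every threshold >= distance."""
    return {threshold: (code if distance <= threshold else None)
            for threshold, code in reference_codes.items()}


# Toy data (hypothetical): two reference profiles of 10 loci each.
references = {
    "ref_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ref_b": [1, 2, 3, 4, 5, 9, 9, 9, 9, 9],
}
# Hypothetical HierCC codes for ref_a, keyed by threshold (HC0, HC2, HC5).
ref_a_codes = {0: "101", 2: "57", 5: "12"}

query = [1, 2, 3, 4, 5, 6, 7, 8, None, 11]  # one missing allele, one novel allele

best = nearest_profile(query, references)
distance = corrected_distance(query, references[best])
codes = infer_codes(distance, ref_a_codes)
print(best, round(distance, 3), codes)
```

In this toy case the query inherits codes only at thresholds at or above its corrected distance and is left unassigned at the lower threshold. This matches the description above, where codes are inferred "up to the threshold indicated by the corrected distance".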
Validation of the HierCC linking tool for PATH-SAFE
The full validation report can be found HERE.
Code Repository