Reference Assignment

About

Each genome is linked to the nearest reference genome by comparing the substitutions in its core profile with those in each of the reference core profiles. The reference assignment is then used to identify potentially unreliable loci in the query genome, according to the variation filter method described in the Core Filter section.

For some species (e.g. Salmonella Typhi), genomes with the same reference assignment will be clustered to provide a more fine-grained view, useful for large collections in the Collection View.

Method

Creating the Reference Variance Profile

  1. The core profile is generated for each reference genome.

  2. All substitutions, excluding those with non-ATCG characters, are extracted and aggregated into a single list of variant locations per gene family.
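The two steps above can be sketched roughly as follows. This is an illustrative sketch only, not the actual implementation: the nested dictionary layout (reference name, then gene family, then position mapped to the substituted base) and the function name `build_variance_profile` are assumptions made for the example.

```python
from collections import defaultdict

# Only unambiguous nucleotide calls count as substitutions.
VALID_BASES = set("ACGT")


def build_variance_profile(reference_profiles):
    """Aggregate variant locations per gene family across all references.

    reference_profiles is a hypothetical structure:
        {reference_name: {family_id: {position: base}}}
    """
    sites = defaultdict(set)
    for profile in reference_profiles.values():
        for family_id, substitutions in profile.items():
            for position, base in substitutions.items():
                # Skip substitutions with non-ATCG characters (e.g. N or gaps).
                if base.upper() in VALID_BASES:
                    sites[family_id].add(position)
    # One sorted list of variant locations per gene family.
    return {family_id: sorted(positions) for family_id, positions in sites.items()}
```

In this sketch a position is recorded once per gene family, however many references carry a substitution there, and a gene family whose only substitutions are ambiguous does not appear in the result.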

Querying the Variance Profile

  1. Each genome is compared against each reference at all the sites in the species profile, excluding sites outside the boundaries of any fragment matches.

  2. The number of sites in common is divided by the total number of compared sites to generate a similarity score.

  3. The query genome is then assigned to the subgroup identified by the name of the most similar reference. If two references have the same score, alphabetical order is used to break the tie.
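A minimal sketch of the scoring and assignment steps above, under stated assumptions: each genome is represented as a hypothetical mapping from `(family_id, position)` to the observed base, a site missing from either mapping is treated as falling outside any fragment match, and ties go to the alphabetically first reference name. The function names are illustrative, not part of any real API.

```python
def similarity(query_bases, reference_bases, sites):
    """Fraction of compared sites at which query and reference agree."""
    compared = 0
    shared = 0
    for site in sites:
        query_base = query_bases.get(site)
        reference_base = reference_bases.get(site)
        # Assumption: a missing site lies outside any fragment match,
        # so it is excluded from the comparison altogether.
        if query_base is None or reference_base is None:
            continue
        compared += 1
        if query_base == reference_base:
            shared += 1
    # No comparable sites: report zero similarity rather than dividing by zero.
    return shared / compared if compared else 0.0


def assign_reference(query_bases, references, sites):
    """Return (name, score) of the most similar reference.

    references is a hypothetical structure: {reference_name: {site: base}}.
    """
    scores = {
        name: similarity(query_bases, bases, sites)
        for name, bases in references.items()
    }
    # Highest score first; equal scores fall back to alphabetical order
    # (assumed here to mean the alphabetically first name wins).
    best = min(scores, key=lambda name: (-scores[name], name))
    return best, scores[best]
```

For example, a query that matches a reference at 2 of 3 compared sites scores 2/3 against it. Sites excluded from the comparison reduce the denominator as well as the numerator, so incomplete coverage does not by itself lower the score.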
