Assembling genomes with EToKi
Specialised assembly pipeline available to the PATH-SAFE Initiative
This assembly pipeline is used for PATH-SAFE genomes being uploaded to PATH-SAFE Powered by Pathogenwatch. The assembly occurs outside of Pathogenwatch, and is included here as a reference for PATH-SAFE users.
After sequencing reads and metadata have been uploaded by PATH-SAFE team members, the genomes are automatically assembled using the EToKi pipeline.
For each assembled genome, a summary set of statistics is calculated to give insight into the quality and completeness of the genome assembly. These statistics are displayed in the Interactive Collection View and the .
The length of the genome in nucleotide pairs, calculated by summing the lengths of the individual contigs.
The N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. Better assemblies, in which the core genome has been assembled into a small number of contigs, will have a larger N50. The closer the N50 comes to the size of a gene, the more likely it is that core genes have been only partially or incorrectly assembled.
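As an illustration of the N50 described above, the following is a minimal sketch, not the code used by the pipeline or by Pathogenwatch. The function name `n50` and the example contig lengths are hypothetical, and it assumes the common convention that the N50 is reached once the running total covers at least half of the assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig to the shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running total
        # to at least half of the total assembly length.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))  # prints 400
```

Here the two longest contigs (500 and 400) together cover 900 of 1,500 nucleotides, so the N50 is the length of the second of them.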
The number of contigs in the assembly. Ideally this would match the number of chromosomes and plasmids in the genome, though tens or hundreds of contigs are more typical. An assembly with a well-formed core can still contain many small contigs, so it is best to use this number in conjunction with the N50 when making quality judgements.
This is the number of non-ATCG characters in the assembly; 'N', for an uncertain nucleotide, is a common occurrence. Again, the ideal is for none to be present. Their impact is minimal for most analyses, but more than a few hundred could indicate an issue with sequencing or assembly.
The percentage of the nucleotides that are either guanine or cytosine. A significant deviation from the expected value (e.g. 52.1% for S. enterica) might indicate contamination or missing parts of the assembly.
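The remaining statistics (genome length, number of contigs, non-ATCG characters and GC content) can be sketched in a few lines of Python. This is an illustrative sketch only, not the calculation used by Pathogenwatch: the function names, the dictionary keys and the example FASTA text are hypothetical, and it assumes that GC content is taken as a percentage of all characters in the assembly, including any non-ATCG ones.

```python
def read_contigs(fasta_text):
    """Parse FASTA-formatted text into a list of upper-cased contig sequences."""
    contigs, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous contig, if any.
            if current:
                contigs.append("".join(current))
            current = []
        else:
            current.append(line.upper())
    if current:
        contigs.append("".join(current))
    return contigs


def assembly_stats(contigs):
    """Summarise a list of contig sequences."""
    total = sum(len(c) for c in contigs)
    gc = sum(c.count("G") + c.count("C") for c in contigs)
    acgt = sum(c.count(base) for c in contigs for base in "ACGT")
    return {
        "genome_length": total,        # sum of the contig lengths
        "contig_count": len(contigs),
        "non_atcg": total - acgt,      # e.g. 'N' for an uncertain nucleotide
        # Assumption: the denominator is the full assembly length.
        "gc_percent": round(100 * gc / total, 1) if total else 0.0,
    }


# Hypothetical two-contig assembly, for illustration only.
example_fasta = ">contig_1\nATGCNN\n>contig_2\nGGCC\n"
stats = assembly_stats(read_contigs(example_fasta))
print(stats)
```

For a real assembly, the text of the FASTA file would be passed to `read_contigs` in place of `example_fasta`.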