Assembling genomes with EToKi

Specialised assembly pipeline available to the PATH-SAFE Initiative

After sequencing reads and metadata have been uploaded to CLIMB by PATH-SAFE team members, the genomes are automatically assembled using the EToKi pipeline.

For each assembled genome, a summary set of statistics is calculated to give insight into the quality and completeness of the assembly. These statistics are displayed in the Interactive Collection View data tables and the Genome Reports.

Assembly stats

Genome length

The length of the genome in base pairs, calculated by summing the lengths of the individual contigs.
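As a minimal sketch of this calculation, the Python below sums contig lengths from FASTA-formatted text. The function names `read_contigs` and `genome_length` and the example sequences are hypothetical, chosen for illustration; this is not EToKi's own code.

```python
def read_contigs(fasta_text):
    """Parse FASTA-formatted text into a list of contig sequences."""
    contigs = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new contig; store the previous one.
            if current:
                contigs.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        contigs.append("".join(current))
    return contigs


def genome_length(contigs):
    """Genome length is the sum of the individual contig lengths."""
    return sum(len(seq) for seq in contigs)


# Hypothetical two-contig assembly (12 bp + 4 bp).
example = ">contig_1\nATGCATGC\nATGC\n>contig_2\nGGCC\n"
contigs = read_contigs(example)
print(genome_length(contigs))  # 16
```

The length of the list returned by `read_contigs` is also the number of contigs in the assembly.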

N50

The N50 (Wikipedia) is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. Put another way, at least half of the assembled sequence lies in contigs of this length or longer. Better assemblies, in which the core genome has been assembled into a small number of contigs, will have a larger N50. The closer the N50 comes to the size of a gene, the more likely it is that core genes have been only partially or incorrectly assembled.
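The N50 can be computed by sorting contigs from longest to shortest and accumulating lengths until half of the total is reached. The sketch below assumes the common "at least half" convention; the function name `n50` and the example lengths are illustrative only, and the pipeline's own implementation may differ in detail.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the contig that takes the running total to at least
        # half of the assembly; its length is the N50.
        if running * 2 >= total:
            return length
    return 0


# Hypothetical assembly of 290 bp in six contigs: 80 + 70 = 150 >= 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70

# The same 290 bp split into 29 equal contigs gives a much smaller N50.
print(n50([10] * 29))  # 10
```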

No. contigs

The number of contigs in the assembly. Ideally this would match the number of chromosomes and plasmids in the genome, though tens or hundreds of contigs are more typical. An assembly with a well-formed core can still contain many small contigs, so it is best to use this number in conjunction with the N50 when judging quality.

Non-ATCG

The number of non-ATCG characters in the assembly; 'N', for an uncertain nucleotide, is a common example. Ideally there would be none. Their impact is minimal for most analyses, but more than a few hundred could indicate an issue with sequencing or assembly.
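A minimal sketch of this count is shown below. It assumes that lowercase bases should be treated the same as uppercase ones, which is an assumption made here rather than something stated by the pipeline; the function name `count_non_atcg` is hypothetical.

```python
def count_non_atcg(contigs):
    """Count characters other than A, T, C or G across all contigs."""
    count = 0
    for seq in contigs:
        # Assumption: case is ignored, so 'a' counts as 'A'.
        for base in seq.upper():
            if base not in "ATCG":
                count += 1
    return count


# Two N characters in the first contig, R and Y in the second.
print(count_non_atcg(["ATGCNNAT", "ggccRY"]))  # 4
```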

GC content

The percentage of nucleotides that are either guanine or cytosine. A significant deviation from the expected value (e.g. 52.1% for S. enterica) might indicate contamination or missing parts of the assembly.
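The calculation can be sketched as follows. This version assumes that ambiguous characters such as 'N' are left out of the denominator; whether the pipeline does the same is not stated here, so treat that choice, and the function name `gc_content`, as assumptions.

```python
def gc_content(contigs):
    """Return GC content as a percentage of unambiguous bases."""
    gc = 0
    atcg = 0
    for seq in contigs:
        s = seq.upper()
        g_and_c = s.count("G") + s.count("C")
        gc += g_and_c
        # Assumption: only A, T, C and G contribute to the denominator.
        atcg += g_and_c + s.count("A") + s.count("T")
    return 100.0 * gc / atcg if atcg else 0.0


# 6 of the 8 bases in this hypothetical assembly are G or C.
print(gc_content(["ATGC", "GGCC"]))  # 75.0
```

The result can then be compared with the expected value for the species to spot unusual assemblies.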
