Hash-based core genome multi-locus sequence typing for Clostridium difficile.
Eyre DW., Peto TEA., Crook DW., Walker AS., Wilcox MH.
Background: Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging.

Methods: We describe a refinement to core-genome multi-locus sequence typing (cgMLST) where alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralised database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to mapping-based approaches in Clostridium difficile using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals.

Results: Hash-cgMLST provided the same results as standard cgMLST with minimal performance penalty. Comparing 272 replicate sequence pairs using reference-based mapping, there were 0, 1 or 2 SNPs between 262 (96%), 5 (2%) and 1 (<1%) pairs, respectively. Using hash-cgMLST, 218 (80%) replicate pairs assembled with SPAdes had zero gene differences, and 31 (11%), 5 (2%) and 18 (7%) pairs had 1, 2 and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies, but were reduced using the SKESA assembler. Of 412 pairs of infections within ≤2 SNPs of each other, i.e. consistent with recent transmission, 376 (91%) had ≤2 gene differences and 16 (4%) had ≥4. Comparing a genome to 100,000 others took <1 minute using hash-cgMLST.

Conclusion: Hash-cgMLST is an effective surveillance tool for rapidly identifying clusters of related genomes. However, cgMLST/hash-cgMLST generates more false variants than mapping-based approaches. Follow-up mapping-based analyses are likely required to precisely define close genetic relationships.
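
The hashing step described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of SHA-256 as the hash function, the upper-casing of sequences, and the decision to skip genes missing from either genome are all assumptions made for illustration. The sketch shows why no central allele database is needed: identical allele sequences yield identical hashes wherever they are computed, so two genomes can be compared by counting the genes whose hashes differ.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """Return a reproducible hash for one allele sequence.

    Assumption: sequences are upper-cased before hashing, and SHA-256 is
    used; the abstract does not name a specific hash function.
    """
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def hash_profile(alleles: dict[str, str]) -> dict[str, str]:
    """Map each gene name to the hash of its allele sequence."""
    return {gene: allele_hash(seq) for gene, seq in alleles.items()}


def gene_differences(profile_a: dict[str, str], profile_b: dict[str, str]) -> int:
    """Count genes present in both profiles whose allele hashes differ.

    Assumption: genes missing from either genome are ignored rather than
    counted as differences.
    """
    shared_genes = profile_a.keys() & profile_b.keys()
    return sum(1 for gene in shared_genes if profile_a[gene] != profile_b[gene])


# Hypothetical toy genomes; real cgMLST schemes cover far more genes.
genome_a = hash_profile({"gene1": "ATGC", "gene2": "ATGA", "gene3": "TTGA"})
genome_b = hash_profile({"gene1": "atgc", "gene2": "ATGG"})

print(gene_differences(genome_a, genome_b))
```

In this toy comparison only gene2 counts as a difference: gene1 matches after case normalisation and gene3 is absent from the second genome. Because each comparison reduces to string equality checks over a fixed set of genes, screening one genome against many stored profiles is cheap.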