
Background. Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging.

Methods. We describe a refinement to core-genome multi-locus sequence typing (cgMLST) in which the allele at each gene is reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralised database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to mapping-based approaches in Clostridium difficile, using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals.

Results. Hash-cgMLST provided the same results as standard cgMLST with minimal performance penalty. Among 272 replicate sequence pairs compared using reference-based mapping, 262 (96%), 5 (2%) and 1 (<1%) had 0, 1 and 2 SNPs between them, respectively. Using hash-cgMLST, 218 (80%) replicate pairs assembled with SPAdes had zero gene differences, while 31 (11%), 5 (2%) and 18 (7%) pairs had 1, 2 and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies, but were reduced using the SKESA assembler. Of 412 pairs of infections within ≤2 SNPs of each other, i.e. consistent with recent transmission, 376 (91%) had ≤2 gene differences and 16 (4%) had ≥4. Comparing a genome to 100,000 others took <1 minute using hash-cgMLST.

Conclusion. Hash-cgMLST is an effective surveillance tool for rapidly identifying clusters of related genomes. However, cgMLST/hash-cgMLST generates more false variants than mapping-based approaches. Follow-up mapping-based analyses are likely required to precisely define close genetic relationships.
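The hashing step described in the Methods can be illustrated with a short sketch. This is not the authors' implementation: the choice of MD5, the upper-casing of sequences, the rule that genes missing from either genome are ignored, and all function and gene names (`allele_hash`, `gene_differences`, `close_matches`, `geneA`, and so on) are assumptions made for illustration only. The point it shows is that identical allele sequences yield identical hashes on any machine, so two profiles can be compared without a shared database of numbered alleles.

```python
from __future__ import annotations

import hashlib


def allele_hash(sequence: str) -> str:
    """Return a reproducible hash for one allele sequence.

    MD5 is an assumed choice; any deterministic hash would serve the same role.
    """
    normalised = sequence.strip().upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def hash_profile(alleles: dict[str, str]) -> dict[str, str]:
    """Map each gene name to the hash of its allele sequence."""
    return {gene: allele_hash(seq) for gene, seq in alleles.items()}


def gene_differences(a: dict[str, str], b: dict[str, str]) -> int:
    """Count genes present in both profiles whose hashes differ.

    Ignoring genes missing from either profile is an assumption of this sketch.
    """
    shared = a.keys() & b.keys()
    return sum(1 for gene in shared if a[gene] != b[gene])


def close_matches(query: dict[str, str],
                  database: dict[str, dict[str, str]],
                  max_diff: int = 2) -> list[str]:
    """Return names of stored profiles within max_diff gene differences."""
    return sorted(name for name, profile in database.items()
                  if gene_differences(query, profile) <= max_diff)


# Toy allele sequences (hypothetical, far shorter than real genes).
profile_a = hash_profile({"geneA": "ATGAAA", "geneB": "ATGCCC", "geneC": "ATGGGG"})
profile_b = hash_profile({"geneA": "atgaaa", "geneB": "ATGCCT", "geneC": "ATGGGG"})
profile_c = hash_profile({"geneA": "ATGAAA", "geneB": "ATGCCC"})  # geneC not assembled

database = {"isolate_b": profile_b, "isolate_c": profile_c}
matches = close_matches(profile_a, database, max_diff=0)
```

Here `profile_a` and `profile_b` differ only at `geneB`, so `gene_differences(profile_a, profile_b)` is 1, and with `max_diff=0` only `isolate_c` is returned. Because each comparison is a simple equality check on short strings per gene, scanning a large collection of stored profiles stays cheap, which is in keeping with the fast comparison times reported in the Results.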

Journal article


Journal of Clinical Microbiology

Big Data Institute, University of Oxford