Read alignment is the central step of many analytic pipelines that perform variant calling. To reduce error, it is common practice to pre-process raw sequencing reads to remove low-quality bases and residual adapter contamination, a procedure collectively known as ‘trimming’. Trimming is widely assumed to increase the accuracy of variant calling, although there are relatively few systematic evaluations of its effects and no clear consensus on its efficacy. As sequencing datasets increase both in number and size, it is worthwhile reappraising computational operations of ambiguous benefit, particularly when the scope of many analyses now routinely incorporates thousands of samples, increasing the time and cost required. Using a curated set of 17 Gram-negative bacterial genomes, this study initially evaluated the impact of four read-trimming utilities (Atropos, fastp, Trim Galore and Trimmomatic), each used with a range of stringencies, on the accuracy and completeness of three bacterial SNP-calling pipelines. It was found that read trimming made only small, and statistically insignificant, increases in SNP-calling accuracy even when using the highest-performing pre-processor in this study, fastp.
To extend these findings, >6500 publicly archived sequencing datasets from Escherichia coli, Mycobacterium tuberculosis and Staphylococcus aureus were re-analysed using a common analytic pipeline. Of the approximately 125 million SNPs and 1.25 million indels called across all samples, the same bases were called in 98.8 and 91.9 % of cases, respectively, irrespective of whether raw reads or trimmed reads were used. Nevertheless, the proportion of mixed calls (i.e. calls where <100 % of the reads support the variant allele; considered a proxy of false positives) was significantly reduced after trimming, which suggests that while trimming rarely alters the set of variant bases, it can affect the proportion of reads supporting each call. It was concluded that read quality- and adapter-trimming add relatively little value to a SNP-calling pipeline and may only be necessary if small differences in the absolute number of SNP calls, or the false call rate, are critical. Broadly similar conclusions can be drawn about the utility of trimming to an indel-calling pipeline. Read trimming remains routinely performed prior to variant calling likely out of concern that doing otherwise would typically have negative consequences. While historically this may have been the case, the data in this study suggests that read trimming is not always a practical necessity.
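
The two quantities compared in the abstract, agreement between raw-read and trimmed-read calls and the proportion of mixed calls, can be illustrated with a minimal Python sketch. This is not the study's pipeline: the `Call` record, its field names, the helper functions and the toy data are all assumptions made for illustration, and the study's exact definition of "the same bases were called" may differ from the simple per-site comparison used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One variant call. All field names are illustrative assumptions."""
    position: int     # reference coordinate of the call
    alt: str          # called variant base
    alt_reads: int    # reads supporting the variant allele
    total_reads: int  # reads covering the site


def is_mixed(call: Call) -> bool:
    """A call is 'mixed' when fewer than 100 % of reads support the variant."""
    return call.alt_reads < call.total_reads


def mixed_fraction(calls: list[Call]) -> float:
    """Proportion of calls that are mixed (0.0 for an empty call set)."""
    if not calls:
        return 0.0
    return sum(is_mixed(c) for c in calls) / len(calls)


def concordance(raw: list[Call], trimmed: list[Call]) -> float:
    """Fraction of sites called in either set where both sets call the same base."""
    raw_by_pos = {c.position: c.alt for c in raw}
    trimmed_by_pos = {c.position: c.alt for c in trimmed}
    sites = set(raw_by_pos) | set(trimmed_by_pos)
    if not sites:
        return 1.0
    same = sum(1 for s in sites if raw_by_pos.get(s) == trimmed_by_pos.get(s))
    return same / len(sites)


# Invented toy data: the same four sites are called with and without trimming,
# but trimming removes the non-supporting reads at one of the two mixed sites.
raw = [Call(100, "A", 30, 30), Call(250, "T", 18, 20),
       Call(400, "G", 9, 12), Call(900, "C", 25, 25)]
trimmed = [Call(100, "A", 28, 28), Call(250, "T", 17, 17),
           Call(400, "G", 9, 11), Call(900, "C", 24, 24)]

print(concordance(raw, trimmed))  # 1.0
print(mixed_fraction(raw))        # 0.5
print(mixed_fraction(trimmed))    # 0.25
```

In this toy case the set of called bases is identical with and without trimming, while the mixed-call proportion falls, which mirrors the qualitative pattern the abstract describes.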

Original publication

Journal article

Microbial Genomics

Microbiology Society

Publication Date