Methods for virus classification and the challenge of incorporating metagenomic sequence data.
Simmonds P.
The division of viruses into orders, families, genera and species provides a classification framework that seeks to organize and make sense of the diversity of viruses infecting animals, plants and bacteria. Classifications are based on similarities in genome structure and organization and on the presence of homologous genes and sequence motifs and, at lower levels such as species, on host range, nucleotide and antigenic relatedness, and epidemiology. Classification below the level of family must also be consistent with phylogeny and virus evolutionary histories. Recently developed methods such as PASC, DEMaRC and NVR offer alternative strategies for genus and species assignments that are based purely on degrees of divergence between genome sequences. They offer the possibility of automating classification of the vast number of novel virus sequences being generated by next-generation metagenomic sequencing. However, distance-based methods struggle to deal with the complex evolutionary history of virus genomes that are shuffled by recombination and reassortment, and with taxonomic lineages that evolve at different rates. In biological terms, classifications based on sequence distances alone are also arbitrary, whereas the current system of virus taxonomy is useful precisely because it is primarily based upon phenotypic characteristics. Nevertheless, a separate system is clearly needed by which virus variants that lack biological information can be incorporated into the ICTV classification, even if they are assigned solely on the basis of sequence relationships to existing taxa. For these, simplified taxonomic proposals and naming conventions represent a practical way to expand the existing virus classification and catalogue our rapidly increasing knowledge of virus diversity.
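
The idea of assigning taxa purely from degrees of sequence divergence can be illustrated with a minimal sketch. It is not the PASC, DEMaRC or NVR algorithm: the function names `p_distance` and `cluster_by_threshold`, the toy sequences and the 0.2 threshold are all hypothetical, and the sequences are assumed to be pre-aligned. It simply groups sequences whose uncorrected pairwise distance falls below one fixed cut-off.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def cluster_by_threshold(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Single-linkage grouping: any pair closer than `threshold` shares a group."""
    groups = {name: {name} for name in seqs}
    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) < threshold and groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for name in merged:
                groups[name] = merged
    unique: list[set[str]] = []
    for group in groups.values():
        if group not in unique:
            unique.append(group)
    return unique


# Hypothetical toy alignment; the 0.2 cut-off is an illustrative assumption.
toy = {"v1": "ACGTACGTAC", "v2": "ACGTACGTAT", "v3": "TTGTCCGAAC"}
clusters = cluster_by_threshold(toy, threshold=0.2)
```

Because a single cut-off is applied to every pair, a sketch like this has the weaknesses noted above: it takes no account of lineages that evolve at different rates, and a recombinant genome can give different distances depending on which region is compared.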