Section 3.5. Methods of Sequencing Data Analysis (from DOI: 10.3390/v12020211)
From publication: "Current Trends in Diagnostics of Viral Infections of Unknown Etiology", Viruses 2020 Feb 14; 12(2); DOI: https://doi.org/10.3390/v12020211
Whenever metagenomics is used for the detection of pathogens, it is crucial to use reliable bioinformatic tools and specialized databases that help determine whether the discovered microorganisms indeed caused the infection or are merely artifacts. Such processing normally demands substantial computing power as well as knowledge and skills in bioinformatics. Typically, data analysis mainly involves comparing the obtained reads with reference genomes. Quite a few algorithms have been developed for this purpose, with BLAST being the most widely used. However, BLAST is slow on NGS data, and processing can take several days, or even weeks, especially when calculations are performed at the amino acid level. Other often-used programs are Bowtie and BWA (Burrows-Wheeler Aligner), which are usually employed as filtration tools, and DIAMOND as an alternative to BLAST.
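To give a concrete sense of such a comparison step, the sketch below runs a translated DIAMOND search of nucleotide reads against a viral protein set via Python's subprocess module. The file names (viral_proteins.faa, reads.fasta) are hypothetical placeholders, and the e-value cutoff is merely a common default rather than a recommendation from any of the pipelines discussed here.

```python
import subprocess

# Build a DIAMOND database from a viral protein FASTA
# (file names are hypothetical placeholders).
subprocess.run(
    ["diamond", "makedb", "--in", "viral_proteins.faa", "--db", "viral_proteins"],
    check=True,
)

# Translated (blastx-style) search of nucleotide reads against the protein
# database; --outfmt 6 produces BLAST-like tabular output.
subprocess.run(
    [
        "diamond", "blastx",
        "--query", "reads.fasta",
        "--db", "viral_proteins",
        "--out", "viral_hits.tsv",
        "--outfmt", "6",
        "--evalue", "1e-5",  # a common, adjustable significance cutoff
    ],
    check=True,
)
```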
As for the published bioinformatic pipelines, they generally adopt one of two strategies: (1) first removing the host's reads by mapping them to the host's or another non-target reference genome and then analyzing the remaining sequences; (2) assembling short reads into larger contigs first, and only then comparing the assembled sequences to reference genomes, including those of viruses (Figure 3).
Filtering out host reads first (Figure 3a) works well only when the host's genome has been thoroughly studied and described in detail, which is not a problem for the detection of human viral infections. As HTS technologies advance, more complete genome sequences become publicly available. Another problem is that viruses may contain nucleotide fragments similar to certain regions of the human (or other host) genome, leading to false negative results. The second approach requires high coverage to work, significantly increasing both the processing time and the cost of the experiment. Moreover, whenever input data are insufficient, gaps may form between reads, impairing their assembly into contigs and thus halting virus classification. In principle, both strategies rely on the same concept of comparing the sequences that "survived" filtering against reference viral genomes. Depending on the sample type and expected contaminants, the filtering step may also include rRNA, mtRNA, mRNA, bacterial or fungal sequences, or non-human host genomes.
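A minimal sketch of the first strategy (Figure 3a), assuming a prebuilt Bowtie2 index of the human genome named GRCh38_index and an input file sample_reads.fq.gz (both hypothetical names), could look as follows: Bowtie2's --un-gz option writes out exactly the reads that failed to map to the host, which are then carried forward as the candidate non-host fraction.

```python
import subprocess

# Map reads against a prebuilt host genome index (name is a placeholder);
# reads that fail to align are written to a separate file for downstream
# analysis, while the host alignments themselves are discarded.
subprocess.run(
    [
        "bowtie2",
        "-x", "GRCh38_index",              # prebuilt host index (assumed)
        "-U", "sample_reads.fq.gz",        # unpaired input reads (assumed)
        "--un-gz", "nonhost_reads.fq.gz",  # reads that did NOT map to the host
        "-S", "/dev/null",                 # host alignments are not needed
    ],
    check=True,
)
```

Note that, as discussed above, any viral read resembling a host region would be discarded at this point, which is precisely the source of the false negatives mentioned earlier.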
Apart from the methodological difference between filtering out host reads first or leaving them in, there are other serious challenges, such as the proper taxonomic assignment of viral reads or contigs. Homology- or alignment-based similarity search methods are most often used, but a composition-based search, in which oligonucleotide frequencies or k-mer counts are matched to reference sequences, is also an option. However, a composition-based search requires the program to be "trained" on reference data and is not widely used in viral genomics.
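For illustration, a composition-based comparison can be reduced to matching normalized k-mer frequency vectors. The toy sketch below (with invented sequences, and far shorter profiles than any real classifier would be trained on) assigns a query to the reference with the highest cosine similarity.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Normalized k-mer frequency vector of a nucleotide sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return Counter()
    return Counter({kmer: n / total for kmer, n in counts.items()})

def cosine(p: Counter, q: Counter) -> float:
    """Cosine similarity between two sparse k-mer profiles."""
    dot = sum(p[kmer] * q[kmer] for kmer in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Real classifiers learn reference profiles from curated genomes;
# these are toy stand-ins.
references = {
    "virus_A": kmer_profile("ATGCGTACGTTAGCATGCGTACGTTAGC"),
    "virus_B": kmer_profile("GGCCGGCCTTAAGGCCGGCCTTAAGGCC"),
}
query = kmer_profile("ATGCGTACGTTAGCATGCGT")
best = max(references, key=lambda name: cosine(query, references[name]))
print(best)  # -> virus_A
```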
Removing the host's sequences is an important step that has been incorporated into numerous processing algorithms, such as VirusFinder, VirusSeq and Vy-PER. This step helps remove false positives caused by similarities between some regions of the human and viral genomes.
One popular pipeline for the analysis of viral reads is ezVIR by Petty et al., which was designed to process HTS data from any sequencing platform. It initially compares raw reads with the reference human genome, then removes them and analyzes the remaining sequences against a viral database.
VirusFinder, for instance, first utilizes Bowtie2 to detect and remove sequences derived from the human genome. Next, it engages BLAT (BLAST-like alignment tool) to align the remaining sequences to a database of viral genomes. In the final step, the supposedly viral short reads are assembled into longer sequences (contigs) using Trinity. VirusFinder, developed primarily to identify viral integration sites within the human genome, produces the best results when given sequencing reads with the maximum depth possible; in the original article, coverage varies between 31x and 121x. The acquired contigs are then used for phylogenetic analysis.
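Assuming host reads have already been removed, as in the earlier Bowtie2 sketch, the two remaining VirusFinder-style steps could be approximated by the calls below. The file names are placeholders, and these are not VirusFinder's actual internal commands, only an outline of the tools it chains together.

```python
import subprocess

# Align host-depleted reads to a viral genome FASTA with BLAT; the default
# PSL output lists candidate viral reads (file names are placeholders).
subprocess.run(
    ["blat", "viral_genomes.fa", "nonhost_reads.fa", "viral_hits.psl"],
    check=True,
)

# Assemble the candidate viral reads into contigs with Trinity for
# downstream phylogenetic analysis.
subprocess.run(
    [
        "Trinity",
        "--seqType", "fa",
        "--single", "candidate_viral_reads.fa",
        "--max_memory", "4G",
        "--output", "trinity_out",
    ],
    check=True,
)
```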
VirusHunter, on the other hand, utilizes BLASTn to filter out human-related sequences after quality evaluation. Sequences that pass filtration are taxonomically classified with the BLASTn and BLASTx algorithms. Thus, VirusHunter requires a high-quality host genome for sorting the sequences and significant computational power to run.
VirusSeq is intended for the detection of viral sequences in tumor tissues. First, it removes human sequences by comparing them against the reference. The MOSAIK program is used both before and after filtration to ensure quality sequence detection. VirusSeq sets limits on the minimally acceptable number of reads and coverage based on the size of the viral genome; for example, it demands at least 1000 reads per virus provided that the sequencing depth is at least 30x. Although the threshold can be adjusted, this tool has been developed for reads with high coverage and is consequently not recommended for processing data with a low percentage of viral reads.
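The idea of scaling a read-count cutoff with genome size and sequencing depth can be illustrated by a toy function such as the one below. This is an assumption-laden sketch (arbitrary read length and detection fraction), not VirusSeq's actual formula.

```python
def min_read_threshold(genome_len_bp: int,
                       read_len_bp: int = 100,
                       min_depth: float = 30.0,
                       required_fraction: float = 0.01) -> int:
    """Toy detection threshold: a fixed fraction of the reads expected
    when a genome of this size is covered at `min_depth`.

    Illustrative only -- not VirusSeq's actual formula.
    """
    expected_reads = min_depth * genome_len_bp / read_len_bp
    return max(1, round(required_fraction * expected_reads))

# A ~10 kb virus at 30x depth with 100 bp reads implies ~3000 reads in
# total; demanding 1% of that sets the cutoff at about 30 reads.
print(min_read_threshold(10_000))
```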
Vy-PER is another bioinformatic instrument that utilizes the reference human genome to filter out reads of host DNA. Sequences not dismissed during this step are compared with the data in the NCBI database using the BLAT tool. Although the experiments set up to test Vy-PER used samples with rather high coverage (80x for samples and 40x for controls), such coverage is not mandatory; it merely lowers the risk of false positives.
PathSeq is a powerful computational tool for analyzing the non-host part of sequencing data that is able to detect the presence of both known and new pathogens. The PathSeq approach begins with a subtraction phase, in which reads are aligned to the human reference genome and subsequently excluded, in an attempt to concentrate pathogen-derived data. This is followed by an analytical phase, in which the remaining reads are aligned to microbial reference sequences and assembled de novo. The formation of large contigs that include several unmapped reads with no significant alignment to any sequence in the reference databases may suggest a previously undetected organism.
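As an illustration of that last step, the hypothetical helper below scans BLAST/DIAMOND tabular output (-outfmt 6) and returns the contigs that lack any significant hit, i.e., the candidates for a previously undetected organism. It mirrors the logic described above rather than PathSeq's actual code.

```python
import csv

def unexplained_contigs(all_contigs: set[str],
                        blast_tsv: str,
                        max_evalue: float = 1e-5) -> set[str]:
    """Contigs with no significant database hit (illustrative helper).

    `blast_tsv` is BLAST/DIAMOND tabular output (-outfmt 6), in which
    column 1 is the query (contig) id and column 11 is the e-value.
    """
    explained = set()
    with open(blast_tsv, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if float(row[10]) <= max_evalue:
                explained.add(row[0])
    return all_contigs - explained
```

Contigs returned by such a filter would then be candidates for closer inspection, e.g., by more sensitive protein-level searches.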
SURPI ("sequence-based ultrarapid pathogen identification") is one more example of the pipelines for complex metagenomic NGS data generated from clinical samples that first filter out the non-target reads using 29 databases and then identify viral, bacterial, parasitic and fungal reads, which also involves de novo contig assembly.
There are many other tools available; we refer the reader to the recent reviews by Nooij et al. (2018) and Fonseca et al.
The search for viral pathogens is often impeded by their genetic variability, caused by a multitude of factors: gene duplications and exchanges, frequent single-nucleotide mutations, large gene insertions, and the rapid adaptation of viruses to new hosts. These become a particular nuisance whenever large sets of samples from various organisms are handled. Firstly, reliable reference genomes have been assembled for only a limited number of organisms, although, as previously mentioned, this issue is being addressed. Secondly, the amount of usable viral nucleic acid in a sample depends on numerous factors, including the stage of the pathogen's life cycle and the overall quality and type of the material, all of which might bias the results towards certain viral species and reduce the amount of usable viral DNA and RNA. This complicates the assembly of contigs, which requires the maximum possible number of reads.
All search methods rely on genome reference databases, such as NCBI GenBank, RefSeq, or the BLAST nucleotide (nt) and non-redundant protein (nr) databases. Poor quality of reference sequences is a major obstacle to data processing. Because of this, specialized databases such as ViPR, RVDB and viruSITE are compiled manually to ensure strict quality control. However, they are limited for the same reason that they are reliable: only a small fraction of all published sequences ever makes it into these databases. As a consequence, reads obtained from supposedly new strains are frequently left out. In contrast, the vast NCBI GenBank database is brimming with viral sequences, both complete and partial; however, the price for this quantity is the quality of the assembled data. Even so, GenBank is far from being truly comprehensive, although this is merely a question of time. Protein databases, such as Pfam and UniProt, are also used. Protein-level searches can usually detect more distant homology thanks to sensitive amino acid similarity matrices, which improves the detection of divergent viruses, although untranslated regions of the genome remain unused.
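The advantage of amino acid similarity matrices can be demonstrated with Biopython's pairwise aligner: under BLOSUM62, conservative substitutions (e.g., I/L or K/R) still receive positive scores, so diverged homologs remain recognizable at the protein level even when their nucleotide sequences have drifted apart. The peptide fragments and gap penalties below are invented for illustration.

```python
from Bio.Align import PairwiseAligner, substitution_matrices

# Score two diverged peptide fragments with BLOSUM62; biochemically
# similar substitutions contribute positive scores to the alignment.
aligner = PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

seq1 = "MKTAYIAKQRQISFVKSHFSRQ"  # invented example peptides
seq2 = "MRSAYLAKQRQLSFIKAHFSRQ"
print(aligner.score(seq1, seq2))
```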
Figure 3: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077230/bin/viruses-12-00211-g003.jpg
Figure 3 caption:
Pipelines for processing reads utilize two main strategies for filtering out "junk" data: (a) mapping raw reads onto the reference genome of the host and removing them, while preserving unmapped sequences for further analysis; (b) assembling short reads into contigs and comparing them against the host's reference genome. Annotation follows both strategies.