From publication: "Current Trends in Diagnostics of Viral Infections of Unknown Etiology", Viruses 2020, 12(2); DOI: https://doi.org/10.3390/v12020211

Section 3.2. Problems of Metagenomic Approach

Even though the cost per raw megabase of DNA sequence has plummeted over the years, HTS remains a costly and time-consuming method, requiring a complex sample preparation procedure, expensive equipment and at least some bioinformatic training for laboratory personnel, rendering it unsuitable for large-scale screening studies. For example, library preparation in the aforementioned study by Graf et al. took no less than 14 h, and the sequencing run on a HiSeq 2500 (Illumina, San Diego, CA, USA) lasted 11 consecutive days. Hypothetically, a metagenomic approach could be adapted to clinical lab routine, provided that library preparation and run processing are optimized to deliver results as fast as conventional tests and to present them in an easily interpretable form. Outsourcing of data processing is one option to solve this problem. A range of bioinformatics tools facilitates data analysis, e.g., Kraken, Kaiju and VirusFinder; different workflows have recently been benchmarked and comprehensively overviewed elsewhere.
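For instance, the classification step with such a tool can be scripted into a pipeline. Below is a minimal sketch in Python, assuming Kraken2 is installed with a locally built database; the database path, file names and thread count are hypothetical placeholders:

import subprocess

# Hypothetical paths: a prebuilt Kraken2 database and quality-trimmed reads.
db_path = "viral_refseq_db"
reads = "sample.fastq"

# Run the classifier; --report writes a per-taxon summary,
# --output writes per-read assignments.
subprocess.run(
    [
        "kraken2",
        "--db", db_path,
        "--threads", "8",
        "--report", "sample.kreport",
        "--output", "sample.kraken",
        reads,
    ],
    check=True,
)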

In-house analysis is another option, provided that the sequence-processing algorithms include criteria that return a reliable interpretation of the assessed reads. Ideally, such algorithms would not only filter out artefacts and map reads to reference genomes, but also suggest the most represented pathogen in each sample, narrowing down the spectrum of considered pathogens. This would create a fast, definitive NGS-based diagnostic tool.
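A minimal sketch of that reporting step, assuming a Kraken2-style tab-separated report (percent of reads, clade read count, direct read count, rank code, taxid, name); the report file name and the host exclusion list are illustrative assumptions:

def most_represented_pathogen(report_path, exclude=("Homo sapiens",)):
    """Return (percent, clade_reads, name) for the top species-level taxon."""
    best = None
    with open(report_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue
            percent, clade_reads, _direct, rank, _taxid, name = fields[:6]
            name = name.strip()
            # Keep species-level assignments only; skip host reads.
            if rank != "S" or name in exclude:
                continue
            if best is None or float(percent) > best[0]:
                best = (float(percent), int(clade_reads), name)
    return best

print(most_represented_pathogen("sample.kreport"))

Returning the top non-host, species-level taxon narrows the candidate list exactly as described above, while leaving the full report available for manual review.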

Library preparation uses total DNA/RNA from a sample, which implies the presence of nucleic acids from other organisms. Direct shotgun sequencing (i.e., without contaminant nucleic acid removal) yields mostly host reads, whereas target reads constitute a mere fraction of the total data, leading to a significant drop in the method's sensitivity. Researchers are forced to search for target sequences amongst a mass of "junk" reads, expending time and computing power on this "needle in a haystack" sort of challenge.

A few strategies can tackle this methodological nuisance. In principle, they are based on one or both of the following approaches: (1) in silico separation of host sequences based on their similarity to the host's reference genome; (2) taxonomic classification of the whole set of sequences against comprehensive databases that include sequences of both the target and the host genome. The former option is very demanding in terms of computational power, since millions of reads undergo a complex analysis. The latter is subject to faults of its own: partial or low-quality reference sequences, false mapping, and high genetic diversity resulting in numerous polymorphisms.
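A minimal sketch of the first approach (in silico host subtraction) follows. Production pipelines typically align reads to the host reference with a dedicated mapper and discard the mapped reads; the toy exact k-mer lookup below only illustrates the principle, and the k-mer length and hit threshold are assumptions:

K = 31  # typical k-mer length for exact-match screening

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_host_index(host_sequences, k=K):
    # Union of all k-mers drawn from the host reference sequences.
    index = set()
    for seq in host_sequences:
        index |= kmers(seq, k)
    return index

def filter_host(reads, host_index, min_hits=2, k=K):
    # A read sharing several exact k-mers with the host reference is
    # presumed host-derived and removed; the rest are kept for analysis.
    kept = []
    for read in reads:
        hits = sum(1 for km in kmers(read, k) if km in host_index)
        if hits < min_hits:
            kept.append(read)
    return kept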

A pathogen's content in a sample is unknown a priori and can only be approximated, for example, from the severity of disease onset. Consequently, it is difficult to evaluate the minimally required number of sequencing reads per sample (SRPS) sufficient for a reliable in-depth analysis. Underestimation of SRPS results in a low yield of target sequences; reported target-read fractions are, for example, 0.008% for Epstein-Barr virus (EBV), 0.0003% for Lassa virus and 0.3% for Zika virus. Considering that metagenomics deals with relative quantities of nucleic acids rather than absolute counts, establishing appropriate thresholds for positive and negative results becomes challenging. For cell cultures and tissues, calculating copies per cell (CPC) has been suggested; however, this parameter does not work for other types of samples, e.g., swabs and washes. A high content of host nucleic acids might artificially lower CPC, falsely attributing negative results to samples. As mentioned above, sequencing depth is another aspect of this issue, because shallow sequencing provides little data with poor scalability, to the point where an extra round of sequencing is required to verify results based on low-SRPS data. Therefore, taking these factors into account and designing a functional SRPS assessment tool is an important objective for further research.
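To illustrate the arithmetic behind SRPS estimation, here is a worked sketch using the target-read fractions quoted above; the desired on-target yield of 1,000 reads is an arbitrary assumption:

import math

def required_srps(target_fraction, desired_target_reads=1_000):
    # Total reads needed so that the expected on-target yield
    # (total * fraction) reaches the desired count.
    return math.ceil(desired_target_reads / target_fraction)

for virus, fraction in [
    ("EBV", 0.008 / 100),           # 0.008% of reads on target
    ("Lassa virus", 0.0003 / 100),  # 0.0003%
    ("Zika virus", 0.3 / 100),      # 0.3%
]:
    print(f"{virus}: ~{required_srps(fraction):,} total reads")
# Prints roughly 12.5 million, 333 million and 333 thousand reads, respectively.

The three-orders-of-magnitude spread between Lassa and Zika virus illustrates why a single fixed run depth cannot suit all samples.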

Furthermore, DNA/RNA samples might be subject to contamination, which creates artifacts; it is therefore crucial to keep track of any possible contamination within the laboratory. It would also help to scan scientific journals periodically for reports of new microorganisms in the normal human microbiota, so that sequence data can be adjusted accordingly.

Metagenomics was initially suggested for expanding knowledge of existing virus families rather than discovering new ones, which naturally limits its application to broad-range searches. In practice, reads very often fail to map to reference genomes owing to the absence of even minimally homologous sequences. This situation is to be expected for organisms that have not been screened for viral infections before, or whenever strains of known viruses in the sample possess unique sequences that have not yet been described. Thus, a better understanding of the virosphere requires not only expansive sets of samples, but also improved data processing algorithms. Paradoxically, the more we explore the biology of viruses, the more apparent it becomes that only a negligible part of their taxonomy is being studied and, to make matters worse, mainly that of closely related viruses rather than principally new ones. In this way, while rapidly gaining depth, our knowledge of viruses is progressively narrowing in breadth.

Standardization of the method is yet another issue. Because metagenomics aims to describe a multitude of microorganisms in a sample, and because of preparation biases and the abundance of host-cell DNA and RNA, robust references and normalization methods have to be developed. A possible solution has been proposed by Wilder et al., who suggested sequencing reference metagenomic libraries alongside unknown samples. This approach seems logical and reliable when an abundance of a particular virus is expected: comparing reference samples with a clinical sample in which a particular viral species is over-represented would indicate a possible on-going infection. Nevertheless, confidence intervals have to be set to ensure that the differences are significant, and more samples must be studied for an accurate evaluation, so further research is required. It is harder, however, to conceptualize standards for the search for new viruses, because there is no apparent candidate for a standard. In this case, limits on sequencing quality can be imposed to help differentiate between sequencing artifacts and the actual discovery of a new pathogen. Validation could include cultivation with supplementary PCR and, for new pathogens, further whole genome sequencing.
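A minimal sketch of such a significance check, comparing one species' read counts in a clinical sample against a pooled reference library; the two-proportion z-test and the ~99% threshold are simplifying assumptions for illustration, not the published method of Wilder et al.:

from math import sqrt

def excess_is_significant(sample_viral, sample_total,
                          ref_viral, ref_total, z_crit=2.58):
    """One-sided two-proportion z-test: does the sample's viral read
    fraction significantly exceed the reference fraction (~99% level)?"""
    p_sample = sample_viral / sample_total
    p_ref = ref_viral / ref_total
    # Pooled proportion under the null hypothesis of equal fractions.
    p_pooled = (sample_viral + ref_viral) / (sample_total + ref_total)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / sample_total + 1 / ref_total))
    z = (p_sample - p_ref) / se
    return z > z_crit, z

# Hypothetical counts: 450 viral reads per million in the sample
# versus 120 per million in the reference library.
print(excess_is_significant(450, 1_000_000, 120, 1_000_000))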

Finally, complex sample preparation techniques significantly reduce reproducibility owing to their numerous steps and variations between protocols, resulting in higher error rates. This is a major stumbling block for the clinical application of HTS, because there is little room for error in the medical field, where people's lives are at stake.

In summary, the most notable deterrents for clinical metagenomics are: (1) the complexity and high cost of sample preparation; (2) the requirement for special bioinformatic education and skills; (3) the need for a powerful data processing infrastructure; and (4) possible inconsistency between results owing to uneven pathogen distribution in the body and/or the quality of sampling. Nevertheless, these limitations can potentially be alleviated by advances in sequencing technologies. It is worth remembering that nearly 40 years ago Sanger sequencing appeared cost-ineffective and overly complex for clinical application; today it is one of the most widely celebrated diagnostic tools in healthcare.

In most NGS-based pathogen studies, clutter reads are a problem. Whenever clinical samples are used, the presence of host nucleic acids is inevitable, requiring additional computational power to filter them out and, consequently, adding to the gross sequencing cost. To make matters worse, target DNA/RNA usually constitutes only a small fraction of the total DNA/RNA. However, target amplification combined with enrichment can be a game-changer, allowing both concentration of target templates and removal of unwanted sequences.