When one of our customers requests NGS sequencing services from Macrogen Europe, one of the first questions we try to get answered is the required sequencing depth. This is commonly referred to as the required “read depth” or “coverage” of a specific target. In other words, how many sequencing reads or giga base pairs (Gbp) does a customer require for their specific application? This is a question that depends on multiple factors and on which we happily provide a free consultation.
For applications such as RNA sequencing (WTS), we commonly work with “read depth”, meaning the overall number of reads in millions, and mappable coverage is less relevant. When the service concerns genomic reads such as whole genome sequencing (WGS), whole exome sequencing (WES), panel sequencing, or epigenome sequencing such as whole genome bisulfite sequencing (WGBS), the question becomes more complex, because there is both a “raw coverage or read depth” and a “mapped coverage or read depth”.
For example, when re-sequencing a germline sample for human whole genome sequencing (hWGS) using short reads (Illumina), the most commonly requested coverage is 30X. Coverage refers to the number of times a nucleotide is read during sequencing, so 30X means that each sequenceable base position in the genome is read around 30 times on average.
The haploid human genome is around 3 giga base pairs in size and can be sequenced with short or long reads (Illumina and PacBio, respectively), resulting in a complete reading of the full genome at varying levels, depending on overall sequencing depth (coverage)*. Doing simple math (3 Gbp x 30), 30X average coverage corresponds to approximately 90Gbp of data. However, this gives only a theoretical average coverage of 30X per base, referred to as “Raw” 30X coverage, because it does not take into account the efficiency of the genome alignment and quality filtering processes.
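The raw coverage arithmetic above can be written out as a minimal Python sketch. The function names are illustrative only, and the 3 Gbp genome size is the approximate figure used in the text.

```python
HUMAN_GENOME_GBP = 3.0  # approximate haploid genome size, as stated in the text


def raw_data_for_coverage(target_coverage: float, genome_size_gbp: float) -> float:
    """Raw data (Gbp) needed for a theoretical average coverage."""
    return target_coverage * genome_size_gbp


def raw_coverage(total_data_gbp: float, genome_size_gbp: float) -> float:
    """Theoretical (raw) average coverage obtained from a given amount of data."""
    return total_data_gbp / genome_size_gbp


print(raw_data_for_coverage(30, HUMAN_GENOME_GBP))  # 90.0
print(raw_coverage(90, HUMAN_GENOME_GBP))           # 30.0
```

Both values are raw figures: they say nothing yet about how many of those bases survive alignment and filtering.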
For re-sequencing experiments, for which a reference genome is available, data is mapped/aligned to the reference genome, as is the case for hWGS. This allows for the identification of variants in a sample compared to a reference genome. Currently, the most widely used human reference genome is the Genome Reference Consortium Human Build 38 (GRCh38, also known as hg38), but the reference may differ per application.
If part of the raw sequencing reads is discarded or lost during the alignment process, the post-alignment “mapped read depth” will be lower than the pre-alignment “raw read depth”.
This means you will not reach 30X “mapped” coverage with 90Gbp of data. Reads can be discarded because of duplication (e.g. PCR duplicates), non-mappable reads, contamination, low sequencing base quality (Phred score), and other factors.
After mapping, several statistics are used to assess the quality of the mapped data. In most data reports, histograms are used to show the coverage range and uniformity of sequencing depth.
By displaying how many bases in the reference genome are covered by a certain number of reads in a dataset, you can visualize the distribution of coverage and assess the mean (average) depth in the mapped dataset. Most customers care about this “mean mapped sequencing depth” or “mapped mean depth”.
In other words, mapped read depth refers to the total number of bases sequenced AND aligned at a given reference base position.
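As a rough illustration of how a mean mapped depth follows from such a coverage distribution, here is a small Python sketch. The histogram is a made-up toy example mapping each depth to the number of reference bases covered at that depth. It is not taken from a real report, and the function names are hypothetical.

```python
def mean_depth(histogram: dict[int, int]) -> float:
    """Mean mapped depth from a {depth: number_of_reference_bases} histogram."""
    total_positions = sum(histogram.values())
    if total_positions == 0:
        raise ValueError("histogram is empty")
    # Each position at depth d contributes d aligned bases.
    total_aligned_bases = sum(depth * count for depth, count in histogram.items())
    return total_aligned_bases / total_positions


def fraction_at_least(histogram: dict[int, int], min_depth: int) -> float:
    """Fraction of reference positions covered at min_depth or deeper."""
    total_positions = sum(histogram.values())
    if total_positions == 0:
        raise ValueError("histogram is empty")
    covered = sum(count for depth, count in histogram.items() if depth >= min_depth)
    return covered / total_positions


# Toy data (hypothetical): 10 reference positions in total.
toy_histogram = {0: 1, 28: 2, 30: 4, 32: 2, 35: 1}

print(mean_depth(toy_histogram))             # 27.5
print(fraction_at_least(toy_histogram, 30))  # 0.7
```

The second function shows why the mean alone can mislead: a dataset can have a reasonable mean while a noticeable share of positions sits below the depth needed for the analysis.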
In a sequencing coverage histogram, the read depths are displayed on the x-axis, while the total number of reference bases covered at that read depth is displayed on the left-hand y-axis. If sequencing quality is good, the plot will take the form of a roughly bell-shaped, Poisson-like distribution with as small a standard deviation as possible, as seen in the sample histogram image taken from a basic Macrogen hWGS analysis report below. The actual distribution varies based on species, application, sample source, sequencing depth, and other parameters, and may not always follow a clean bell-shaped Poisson distribution.
In order to guarantee 30X mean mapped data in hWGS for germline samples, we generally recommend a minimum of 110Gbp of data.
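To make the relationship between raw data and mapped coverage concrete, the sketch below computes the fraction of raw bases that must remain usable after alignment and filtering for a given amount of data to reach a target mean mapped depth. The function names and the idea of a single "usable fraction" are simplifying assumptions for illustration. The result is only the break-even fraction implied by the figures in the text, not a measured mapping rate.

```python
GENOME_GBP = 3.0  # approximate haploid human genome size used in the text


def mapped_coverage(raw_gbp: float, usable_fraction: float,
                    genome_gbp: float = GENOME_GBP) -> float:
    """Mean mapped coverage if only usable_fraction of raw bases is retained."""
    return raw_gbp * usable_fraction / genome_gbp


def implied_usable_fraction(target_mapped_coverage: float, raw_gbp: float,
                            genome_gbp: float = GENOME_GBP) -> float:
    """Smallest usable fraction at which raw_gbp still reaches the target depth."""
    return target_mapped_coverage * genome_gbp / raw_gbp


# With 110 Gbp of raw data, 30X mapped is still reached if about 82% of bases survive.
fraction = implied_usable_fraction(30, 110)
print(round(fraction, 2))  # 0.82

# With exactly 90 Gbp, every single raw base would have to be usable.
print(implied_usable_fraction(30, 90))  # 1.0
```

In other words, the extra data on top of the theoretical 90 Gbp acts as a safety margin for duplicates, unmapped reads, and low-quality bases.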
However, various alternative applications or sample sources require very different metrics, which depend on multiple factors.
For example, FFPE- or saliva-derived DNA requires considerably deeper sequencing in order to reach the same mean mapped results. This has to do with DNA quality (FFPE) or contamination with DNA from other sources (saliva), both of which reduce the number of mapped reads, often requiring >120Gbp of data to reach the desired mean mapped depth of 30X for hWGS.
Another example is whole exome or targeted sequencing. The actual required mean depth depends on the capture efficiency of the probes, the target size of the probes, the off-target effect, input quantity/quality, and PCR cycles, among other factors. As a general rule of thumb, mapped data is about 50% of raw data for targeted approaches. For example, 100X raw coverage = 50X mapped coverage.
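The targeted-sequencing rule of thumb can be expressed as a one-line calculation. The sketch below uses a hypothetical helper name and defaults to the 50% figure from the text. The real fraction depends on the kit and sample, so treat the default as a planning assumption only.

```python
def raw_coverage_needed(target_mapped_coverage: float,
                        mapped_fraction: float = 0.5) -> float:
    """Raw coverage to order so that the expected mapped coverage hits the target.

    mapped_fraction is the assumed share of raw data that ends up mapped
    on target (0.5 is the rule of thumb for targeted approaches).
    """
    if not 0 < mapped_fraction <= 1:
        raise ValueError("mapped_fraction must be in the interval (0, 1]")
    return target_mapped_coverage / mapped_fraction


print(raw_coverage_needed(50))   # 100.0
print(raw_coverage_needed(100))  # 200.0
```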
An even more challenging example is whole-genome bisulfite sequencing to assess genome-wide methylation. We generally recommend making two separate libraries and sequencing at a depth of >180Gbp to reach 30X mean mapped coverage. This is due to the unique process of the DNA treatment and library preparation.
Depending on the application and source material different advice will be given. The actual required mean mapped coverage depends on the client’s needs and questions and is mostly unique per project.
For example, for certain somatic mutation analyses in cancer, a mean mapped coverage as high as 1000X may be recommended for adequate variant allele frequency detection, whilst copy number variations may be picked up with shallow sequencing as low as 0.1X.
No matter the need or variable (for example application, kit choice, re-sequencing or de novo assembly, long or short reads, quality and quantity, species, ploidy, and source), our consultants are highly trained to help you decide on the best and most cost-effective coverage for your specific needs.
Reference: Illumina - Sequencing Coverage for NGS