Recent research on differential gene expression has moved away from microarrays toward sequencing-based methods. The cost per base pair sequenced favors the Illumina platform for such research, and 100-bp paired-end reads seem to be the choice du jour. But why? Specifically, why not 100-bp single-end reads? What does the extra read provide in the majority of research projects, where the sole purpose is to generate differential expression data and not to analyze splice variants?
Paired-end reads are particularly useful for understanding splicing and splice variants. For that reason, they are essential for de novo assembly of transcriptome reads to generate a whole shotgun transcriptome, or for using reads to identify genes for eukaryotic gene annotation. In my own research, we frequently use paired reads to identify recent lateral gene transfers, even in transcriptome projects, and I appreciate the plethora of paired-end reads available. But for organisms where gene models are known, and for projects where splice variants are not the focus, why are paired-end reads favored?
One might argue that it can’t hurt. The two reads in a pair are not independent measurements, so we use FPKM (fragments per kilobase per million mapped fragments) to compensate, not RPKM (reads per kilobase per million mapped reads). For statistical purposes, a pair of 100-bp reads counts the same as one 100-bp read. What you gain with paired reads is an increased likelihood that your fragment will map uniquely. But in the vast majority of organisms on earth, it doesn’t seem like that would make the difference very often. The argument is that two reads allow you to map better. True. If you can’t map read 1 alone, you might be able to map it with the help of read 2. But if you can map read 2, you’ve already counted that fragment in the FPKM calculation, so who cares? Paired reads only help when neither read 1 nor read 2 can be mapped uniquely on its own, but the pair can be mapped uniquely in combination with the insert size distribution. But how often is that really the case? And is it worth the cost? Many experiments using paired-end reads lack a suitable number of replicates. Wouldn’t the cost savings of single-end reads be better put toward sequencing more replicates, particularly biological replicates? I suspect so.
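To make the counting argument concrete, here is a minimal Python sketch of fragment-based counting. It is an illustration under stated assumptions, not the procedure of any particular tool: the helper names (`count_fragments`, `fpkm`), the fragment and gene identifiers, and the gene length are all hypothetical, and real pipelines work from alignment files rather than a list of tuples.

```python
from collections import defaultdict


def count_fragments(alignments):
    """Count unique fragments per gene.

    `alignments` is an iterable of (fragment_id, gene_id) tuples, one per
    mapped mate. Both mates of a pair share a fragment_id, so a pair adds
    one count, exactly as a single-end read would.
    """
    fragments_by_gene = defaultdict(set)
    for fragment_id, gene_id in alignments:
        fragments_by_gene[gene_id].add(fragment_id)
    return {gene: len(ids) for gene, ids in fragments_by_gene.items()}


def fpkm(fragment_count, gene_length_bp, total_fragments):
    """Fragments per kilobase of gene per million mapped fragments."""
    return fragment_count * 1e9 / (gene_length_bp * total_fragments)


# Hypothetical toy data: frag1 and frag3 have both mates mapped,
# frag2 has only one mate mapped.
alignments = [
    ("frag1", "geneA"), ("frag1", "geneA"),
    ("frag2", "geneA"),
    ("frag3", "geneB"), ("frag3", "geneB"),
]

counts = count_fragments(alignments)
total_fragments = sum(counts.values())

# Assumed gene length of 1,000 bp, purely for illustration.
gene_a_fpkm = fpkm(counts["geneA"], 1000, total_fragments)
```

Five mapped mates yield only three counted fragments here, and frag2 is counted once even though just one of its mates mapped, which is the sense in which the second read adds no extra count.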
There may be other factors at play. For instance, you need a whole flow cell of reads of the same length and type to realize these cost savings. Can you or your sequencing center fill a flow cell with enough projects using 100-bp or 50-bp single-end reads? Probably not, unless more researchers are willing to test the waters.
Originally posted at: https://allthingsg.squarespace.com/blog/2014/7/8/why-not-single-read-differential-expression-analysis