Why not single read differential expression analysis?

Recent research focusing on differential gene expression has moved away from microarrays to sequencing-based methods. The cost per base pair sequenced favors the use of the Illumina platform for such research. In particular, 100-bp paired end reads seem to be the platform du jour. But why? In particular, why not 100-bp single end reads? What does the extra read provide in the majority of research projects, where the sole purpose is generating differential expression data rather than analyzing splice variants?

Paired end reads are particularly useful in understanding splicing and splice variants. For those reasons, they are essential for de novo assembly of transcriptome reads to generate a whole shotgun transcriptome, or for using reads to identify genes for eukaryotic gene annotation. In my own research, we frequently use paired reads to identify recent lateral gene transfers even in transcriptome projects, and I appreciate the plethora of paired end reads for my own research. But for organisms where gene models are known, and projects where splice variants are not the focus, why are paired end reads favored?

One might argue that it can't hurt. The two measurements are not independent, so we compensate by using FPKM, not RPKM. For statistical purposes, two 100-bp paired end reads count the same as one 100-bp read. What you gain with paired reads is an increased likelihood that your read will map uniquely. But in the vast majority of organisms on earth, this doesn't seem like it would matter very often. The argument is that two reads allow you to map better. True. If you can't map read 1 alone, you might be able to map read 1 using read 2. But if you can map read 2, you've already counted that fragment in the FPKM calculation, so who cares? Paired reads only help when neither read 1 nor read 2 can be mapped uniquely on its own, but the pair can be mapped uniquely in combination with the insert size distribution. But how often is that really the case? And is it worth the cost? Many experiments using paired end reads lack a suitable number of replicates. Wouldn't the cost savings of single end reads be better put toward sequencing more replicates, particularly biological replicates? I suspect so.
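The FPKM arithmetic above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical numbers, not output from any particular aligner or counting tool; the point is that a properly paired set of two reads counts as one fragment, so it carries no more statistical weight than a single read from the same fragment.

```python
def fpkm(gene_fragments, gene_length_bp, total_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    gene_fragments: fragments mapped to the gene (a read pair = 1 fragment)
    gene_length_bp: gene/transcript length in base pairs
    total_fragments: total mapped fragments in the library
    """
    return (gene_fragments * 1e9) / (gene_length_bp * total_fragments)

# Hypothetical gene: 500 fragments mapped, 2,000 bp long,
# in a library of 20 million mapped fragments.
# Whether each fragment was sequenced as one 100-bp single end read or as
# a 100-bp read pair, it contributes exactly one count, so the FPKM is
# identical either way.
single_end_fpkm = fpkm(gene_fragments=500, gene_length_bp=2000,
                       total_fragments=20_000_000)
paired_end_fpkm = fpkm(gene_fragments=500, gene_length_bp=2000,
                       total_fragments=20_000_000)
assert single_end_fpkm == paired_end_fpkm  # 12.5 FPKM in both cases
```

With RPKM, by contrast, each read would be counted separately, double-counting paired-end fragments; FPKM exists precisely to remove that double counting.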

There may be other factors at play. For instance, you need a whole flow cell with the same read length and type to realize these cost savings. Can you or your sequencing center fill a flow cell with enough projects using 100-bp or 50-bp single end reads? Probably not, unless more researchers are willing to test the waters.

Originally posted at: https://allthingsg.squarespace.com/blog/2014/7/8/why-not-single-read-differential-expression-analysis

Secondary Data Usage

I know over the years many researchers have commented both positively and negatively on the NIH and NIAID data sharing policies. They are set to change again, such that everyone who receives NIH funding will need to comply with the policy. Comments can be made on the new policy for at least another week, so now is the time to speak up! Although, do it soon: the comment period has almost expired. The new policy can be found at: http://www.gpo.gov/fdsys/pkg/FR-2013-09-20/pdf/2013-22941.pdf.

In summary, projects are not expected to release raw data such as instrument image data. However, within 6 months of data generation, they are expected to release the initial sequence reads, data after alignment and QC (e.g., BAM files), and analyses like expression profiling and variant calling. If a manuscript is published within those 6 months, the data needs to be released upon acceptance of the publication. In addition, all analyses relating genomic data to phenotypes or other biological states must be released upon publication. It reads as if there are no embargo dates.

Having spent a significant amount of time on both sides of secondary data usage, as both a generator and a heavy user, I provided input on three issues in my comments on the policy change:

1. I think data generators need protected time from when data is released to when they can publish, something akin to an embargo date. Such an embargo date should be made clear to secondary users at the time the data is acquired and should not change. An embargo date is important because I don't think it will be long before groups automatically download new data, write a low-quality paper, and publish it in a lower-quality journal, making it impossible for the generators to have the time to do the validation and follow-up studies needed to send the work to a high-quality journal. Allowing data generators time to publish their research and ideas in a high-quality journal is essential to the future of this type of science. I know others disagree with this and believe that everything should be open access. But today sequencing is cheaper than preparing the samples, so it is no longer a precious commodity available only to a limited few. One solution would be for the data generators to outline their particular focus. While that focus area would be off limits, other research would be fair game to conduct and publish without embargo. Of course, reviewers and editors would need to do their due diligence to enforce such a system, and the scope allowed would need to be limited.

2. Currently, raw human data is not required to be deposited; it is exempted, and only alignments following cleaning are required. However, cleaning is not defined. Since cleaning could remove microbial reads, this is a problem for my research. I think that if raw data is not provided, it needs to be stipulated that alignments include ALL reads. In addition, I think that if data generators provide alignments with ALL reads, they should not have to deposit FASTQ files either.

3. There needs to be a system for retracting data and notifying users; there currently is not. For instance, one data generator I rely on retracted multiple pieces of data because the metadata said the sequence data was from a man when it was clearly from a woman. They also retracted data because three samples that should have been different were genetically identical. This is good and important; it ensures that high-quality secondary data is available. Yet they did not notify users who had already downloaded the data that it had been retracted. This is a major problem for secondary users. The short time frames required for deposition can make it difficult to identify all such problems, making the data less useful to secondary users.

Just some thoughts, in the hope that more of you will comment. It is nice to be given an opportunity to improve the system. The best system will arise from considering the variety of thoughts and opinions put forth in the comments.

Originally posted at: https://allthingsg.squarespace.com/blog/2013/11/14/secondary-data-use

Writing Checklist

Check out the writing checklist that I made for students, postdoctoral fellows, and staff in my group.


Welcome to my new web page! Stay tuned for updates and news from the Dunning Hotopp group.