Wolbachia Taxonomy

At the Wolbachia meeting, Matt Chung (Ph.D. candidate in the Dunning Hotopp group) presented our analysis using whole genome nucleotide alignments and sequence identity matrices to recapitulate classically defined bacterial taxa, as described in “Using Core Genome Alignments to Assign Bacterial Species” by Chung, Munro, and Dunning Hotopp (doi: https://doi.org/10.1101/328021). This manuscript includes an analysis of Wolbachia strains and their supergroup designations that largely supports the ANI-based and DDH-based analyses by Ramirez-Puebla et al. in 2015, although it illustrates how nucleotide identity thresholds need to be coupled with a phylogenetic approach to ensure the accuracy of the taxonomic assignment. The whole genome alignment-based method used is readily amenable to a combined percent-identity and phylogenetic approach; we would recommend its use, but are open to alternatives. Given these results and a lengthy discussion, it seemed there was broad support for defining Wolbachia species through a working group approach.
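To make the identity-matrix idea concrete, here is a minimal sketch (not the published pipeline) of how pairwise whole-genome percent identities could be grouped into putative species with single-linkage clustering at a threshold. The strain names, identity values, and the 96% cutoff are illustrative assumptions, not values from the manuscript; in practice, any such clusters would also be checked against a core genome phylogeny, as argued above.

# Minimal sketch: group strains whose pairwise whole-genome percent identity
# exceeds an assumed species-level threshold. All values are illustrative.
from itertools import combinations

identities = {
    ("wMel", "wRi"): 98.7,   # hypothetical identity values
    ("wMel", "wPip"): 92.1,
    ("wRi", "wPip"): 92.3,
    ("wPip", "wBm"): 88.5,
    ("wMel", "wBm"): 88.2,
    ("wRi", "wBm"): 88.4,
}
THRESHOLD = 96.0  # assumed species-level identity cutoff

strains = sorted({s for pair in identities for s in pair})
parent = {s: s for s in strains}

def find(s):
    # Union-find with path halving.
    while parent[s] != s:
        parent[s] = parent[parent[s]]
        s = parent[s]
    return s

def union(a, b):
    parent[find(a)] = find(b)

# Single-linkage clustering: join any pair above the threshold.
for a, b in combinations(strains, 2):
    pid = identities.get((a, b), identities.get((b, a)))
    if pid is not None and pid >= THRESHOLD:
        union(a, b)

clusters = {}
for s in strains:
    clusters.setdefault(find(s), []).append(s)
for members in clusters.values():
    print("putative species cluster:", members)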

Subsequent discussions have led us to propose a five-pronged approach:

  1. Formalize a draft method for assigning Wolbachia species, as determined by Wolbachia researchers working in a small working group. This could be ANI, dDDH, phylogeny, and/or whole genome alignment. Regardless, there should be a workflow and specific recommendations. It should clarify the criteria for creating a new name, but also the criteria for when an existing name is applied to a strain and when Wolbachia sp. is used (a sketch of this decision logic appears after the list); the latter two would not necessarily require whole genome sequencing but could rely on MLST, keeping in mind that new, more species-specific MLST schemes may be warranted.
  2. Take contributed genomes and apply the method to these genomes.
  3. Formalize a final method for assigning Wolbachia species by this small working group.
  4. Develop a manuscript to circulate the results of this small working group and identify community-supported species names, possibly using bioRxiv to distribute the manuscript to facilitate community engagement. The recommendation that seemed to be supported at the meeting was to avoid the names of living scientists actively working in the Wolbachia field, as was suggested previously. However, please see the survey questions below.
  5. Apply the names and submit the publication. I would suggest using a consortium as an author for the publication and then listing individual authors, as is frequently done in genomics papers. There could be recognition of the small working group members as well, through the author contributions section. I would suggest including as many people as possible as authors, in order to demonstrate the strong community support for the proposal/study/analyses.
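As referenced in step 1, here is a hedged sketch of what such decision logic might look like. The 96% identity cutoff and the phylogeny check are placeholders for whatever criteria the working group ultimately adopts; they are not agreed-upon values.

# Hedged sketch of the step 1 decision logic; thresholds are placeholders.
def assign_species(best_named_match, best_identity, monophyletic_with_match):
    """Return a provisional assignment for a query strain.

    best_named_match: name of the closest already-named species
    best_identity: whole-genome (or MLST-based) percent identity to it
    monophyletic_with_match: whether the query falls within that species'
        clade in a core-genome phylogeny
    """
    if best_identity >= 96.0 and monophyletic_with_match:
        return best_named_match        # apply the existing name
    if best_identity >= 96.0:
        return "Wolbachia sp."         # identity is high but phylogeny disagrees
    return "candidate new species"     # below the cutoff for any named species

print(assign_species("Wolbachia pipientis", 98.2, True))
print(assign_species("Wolbachia pipientis", 96.5, False))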

If you would like to be either in the small working group or an author on the final paper, please e-mail WolbachiaTaxonomy@gmail.com. To express your thoughts about Wolbachia taxonomy anonymously and inform the decisions to be made, please respond to the survey at: https://www.surveymonkey.com/r/5FKVJ67.

Quick Look at the Two Manuscripts on Tardigrade LGT

Tardigrades.  Never heard of them?  Me neither.  But lateral/horizontal gene transfer, I’ve heard of.

In the interest of full disclosure, I did not review the Boothby paper in PNAS. But I was asked to provide a comment for one of the two articles authored by Ed Yong. However, as I’m just coming off maternity leave, I was unable to provide one in time due to a lack of bandwidth and sleep, as well as a fear that I might be unfairly biased. As I told a colleague on November 24, “I thought their methodology was flawed and couldn’t put together [even] a coherent sentence as a comment. I think there is transfer but I think they severely overestimate it. I could be biased though by the fact they claim this is the largest, but [I think] my Drosophila has more extensive transfer.”

Now having given it time and thought, I think I can summarize these two manuscripts objectively and coherently from the perspective of someone who has spent a great deal of time trying to assess the levels of lateral gene transfer in animal genomes. In the interest of making this more tractable, I am only going to focus on the claim of massive LGT, and not on any aspect related to functionality or other aspects of either manuscript. Since you can have massive LGT that is not functional, I am going to ignore all of the RNA-Seq and EST data, and the conclusions derived from them. They are essentially irrelevant for establishing the claim of massive LGT.

First, the Boothby manuscript and why I suspected a flawed methodology. The red flag came with the statement, “our tardigrade cultures are fed algae, not bacteria, and although our algal cultures are not axenic, we would expect little to no bacterial contamination in our sequencing data.” If they are not axenic, you can expect a great deal of bacterial contamination; talk to any microbiologist.

However, and despite this clear bias, the authors did perform several analyses to try to address contamination. The first was an examination of coverage, but the details are lacking. It is difficult to ascertain much from the figures referenced (Figure S2 C and D), and the focus is on the SD (presumably standard deviation) of the coverage, not the mean. The second piece of evidence concerns bacterial rRNA genes. However, it is quite likely all of the bacterial rRNA sequences are collapsed into single contigs. This has plagued bacterial genome sequencing projects from the dawn of genome sequencing, except those with a single rRNA operon; the multiple rRNA operons collapse into a single repeat that does not typically assemble with the rest of the genome. Therefore, this analysis is largely uninformative. Third, it is mentioned that general contamination is minimal; for example, human sequences are not found. Dataset S1 is referenced, which appears to be an annotation of contigs and does not fully address human contamination. Given the size of the human genome and the amount of data generated, contaminating human sequences are unlikely to assemble into any sizable contig. The true extent of human contamination should be assessed at the read level. Besides, comparing bacterial and human contamination is like comparing apples and oranges (heck, apples and oranges are at least both fruit).
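For what it is worth, here is a minimal sketch of the kind of per-contig coverage summary discussed above: it computes the mean depth per contig from the output of "samtools depth -a" and flags contigs whose mean departs strongly from the assembly-wide mean. The file name and the threefold cutoff are illustrative assumptions, not anything from the manuscript.

# Minimal sketch: summarize mean depth per contig from samtools depth output
# and flag contigs whose mean coverage deviates from the assembly-wide mean.
from collections import defaultdict

totals = defaultdict(int)   # summed depth per contig
lengths = defaultdict(int)  # positions counted per contig

with open("depth.txt") as fh:              # assumed output of: samtools depth -a
    for line in fh:
        contig, _pos, depth = line.split()
        totals[contig] += int(depth)
        lengths[contig] += 1

means = {c: totals[c] / lengths[c] for c in totals}
overall = sum(totals.values()) / sum(lengths.values())

for contig, mean in sorted(means.items(), key=lambda kv: kv[1]):
    # Contigs far below or above the assembly-wide mean deserve scrutiny:
    # unfixed or contaminant sequence tends to show divergent coverage.
    if mean < overall / 3 or mean > overall * 3:
        print(f"{contig}\tmean={mean:.1f}\tassembly mean={overall:.1f}")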

The manuscript then moves to the phylogeny of these sequences, establishing that they are truly of bacterial origin. There are over 380 pages of supplementary figures analyzing this. I’m sure it was a tremendous amount of work, one that is often requested by reviewers, but it is actually just one aspect of LGT. It’s unfortunate more space wasn’t devoted to the other aspects. This is followed by an analysis of domains, which finds that genes of foreign origin contribute many unique protein domains. But this seems premature; are these LGTs actually in the genome?

To establish this, the authors perform PCR to test the physical linkage of gene pairs. They obtain PCR products for 104 of 107 randomly selected genes. However, it isn’t reported how these genes were randomly selected. It is hard to be truly random, and I suspect these represent the best 107 contigs, not random ones. But kudos to the authors for actually providing images of the gels for all 107 amplifications. Yet only roughly half (58) support lateral gene transfer, since they are the only ones that bridge between the metazoan and foreign gene. The remaining 46 amplify from foreign gene to foreign gene, which cannot exclude contamination. Furthermore, there is a lack of specificity in numerous amplification reactions, with products of different sizes. Frequently, the one of the correct size doesn’t even appear to be the most abundant product. That said, a remarkable number do show strong amplification at the correct size, at least for the proportion I checked. Still, it is difficult to be confident of the correct size given the limited migration in the gels shown and the highlighted boxes, which can distort perception. Combined, it could be easy to conclude that a region was amplified when it was not amplified specifically. The solution would have been to end-sequence the PCR products to verify that the correct product was amplified. There is no indication this was undertaken.

PacBio sequencing was conducted to further support these LGTs. First, low coverage PacBio is not a great method for LGT validation, since it has steps in library construction that make it prone to chimeras. This is a known problem we have published on that is not yet widely appreciated. However, LGTs that are recovered in both the PacBio dataset and the Illumina dataset should be real, as you wouldn’t expect such random events to occur repeatedly across two platforms. One figure is shown in the manuscript that is used to demonstrate the congruity. But congruity is expected whether or not these are real LGTs, since most of the sequence is from the tardigrade. Furthermore, the PacBio assembly is <60 Mbp compared to the >200 Mbp Illumina assembly. This means that only about a quarter of the data in the Illumina assembly is found in the PacBio assembly. Therefore, a lot of data is missing, quite possibly including these LGTs. If that was examined more closely, I could not find where it was presented; as it stands, the PacBio data do not seem to support the hypothesis.
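One way to probe this, sketched below under stated assumptions, would be to ask what fraction of each candidate LGT contig from the Illumina assembly is actually recovered in the PacBio assembly. The sketch assumes alignments in PAF format (for example from minimap2, with the PacBio assembly as the target and the candidate contigs as queries); the file name and the 80% coverage cutoff are illustrative.

# Hedged sketch: fraction of each candidate Illumina LGT contig covered by
# alignments to the PacBio assembly, parsed from a PAF file.
from collections import defaultdict

covered = defaultdict(int)   # aligned bases per candidate contig
length = {}                  # contig lengths from the PAF query columns

with open("aln.paf") as fh:
    for line in fh:
        f = line.split("\t")
        qname, qlen, qstart, qend = f[0], int(f[1]), int(f[2]), int(f[3])
        length[qname] = qlen
        covered[qname] += qend - qstart   # rough, ignores overlapping hits

for contig, qlen in length.items():
    frac = covered[contig] / qlen
    status = "recovered" if frac >= 0.8 else "missing/partial"
    print(f"{contig}\t{frac:.2f}\t{status}")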

I’m going to stop critiquing the Boothby manuscript at this point. I am curious to understand how Moleculo sequencing may or may not yield LGT-like sequences, as I make it a point to understand this for all the common sequencing platforms. Given the claim, I’m disappointed at the lack of text about LGT in animals and references to the literature, including my own work. I wish all the points I’ve raised here had been addressed in the review of this manuscript, but clearly some key points were overlooked for whatever reasons. Unfortunately, it happens. But I do feel the review system failed these authors.

So that leaves the Koutsovoulos bioRxiv preprint, which has not been peer-reviewed and may have been put together quite rapidly. So kudos to the authors! I’ll try to give these authors the benefit of the doubt. For full disclosure, though, that means trying to ignore the parenthetical jab at my own paper, which really needs to be supported by either an argument or a citation.

Largely, my concern here is genome cleansing. Genome cleansing by assembly experts, database curators, and scientists has led to the erroneous removal of LGT from genomes time and time again. It is clearly the largest cause of LGT underestimation. Here, the authors use blobplots to identify contigs with abnormal GC content and coverage. These plots are then used to remove “contaminant” sequences through an iterative assembly process. The problem arises because this assumes that LGTs should have a composition similar to the host genome and similar coverage. This is true for LGTs that have been in the genome for large spans of time and are fixed in the population. However, in organisms that acquire LGTs frequently, you would expect to have LGTs of all ages, including those that have not acquired the compositional biases of the host DNA. You would also expect that some are not fixed in the population, yielding abnormally low coverage distributions. (Although I’ll note that in neither paper could I determine whether the population sequenced was inbred.) We’ve even demonstrated in both insects and nematodes that recent transfers from bacterial Wolbachia endosymbionts can be extensively duplicated, so you might even expect abnormally high coverage distributions. Essentially, these criteria aren’t necessarily good at distinguishing contamination from LGT. In fact, even contigs with the same coverage and compositional biases may not be LGT. Therefore, the test employed also does not adequately test the hypothesis. Even setting that aside, a large number of Bacteroidetes sequences with the same coverage and compositional bias seem to have been removed through the process, and this isn’t explained or transparent.
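To illustrate what such a blobplot-style screen boils down to, here is a minimal sketch that joins per-contig GC content (from the assembly FASTA) with per-contig coverage (from a two-column table) and flags contigs outside an assumed host window. The file names and the GC and coverage windows are made-up assumptions; the point above is precisely that recent LGTs can legitimately fall outside such windows.

# Minimal blobplot-style sketch: flag contigs outside an assumed host window
# of GC content and coverage. Flagged contigs are only *candidate* contaminants.
def read_fasta(path):
    name, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line.upper())
    if name:
        yield name, "".join(seq)

coverage = {}
with open("coverage.tsv") as fh:          # assumed format: contig<TAB>mean_coverage
    for line in fh:
        contig, cov = line.split()
        coverage[contig] = float(cov)

for contig, seq in read_fasta("assembly.fasta"):
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    cov = coverage.get(contig, 0.0)
    # Assumed host window: 40-55% GC and 20-200x coverage. Anything outside is
    # a candidate contaminant, but could equally be a recent, unfixed LGT.
    if not (0.40 <= gc <= 0.55 and 20 <= cov <= 200):
        print(f"{contig}\tGC={gc:.2f}\tcov={cov:.0f}\tflagged")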

So, is there LGT in the tardigrade genome? Both papers suggest yes; it is the extent that is in question. I suspect that Boothby et al. applied criteria that are too liberal, while Koutsovoulos et al. applied criteria that are too conservative. Reality may lie in the vast expanse between the two estimates. An analysis of the coverage of junction-spanning read pairs (JSPRs) may prove informative. Chimeras should occur randomly in a standard Illumina or PacBio run; therefore, chimeras in the assembly will only be supported by a single pair of reads. Breaking contigs at regions supported by a single pair of reads and eliminating resulting contigs that lack a metazoan gene may yield a better estimate of the true extent of LGT, although it will still be an estimate.
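As a hedged sketch of the JSPR idea, the snippet below counts read pairs that physically span a candidate metazoan-bacterial junction in the assembly using pysam (assumed to be installed, with an indexed BAM); junctions supported by only a single pair would be the ones to treat as suspect. The BAM path, contig name, and junction coordinate are placeholders.

# Hedged sketch: count read pairs with one mate on each side of a junction.
import pysam

def pairs_spanning(bam_path, contig, junction, window=1000):
    """Count properly paired reads straddling the junction coordinate."""
    n = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(contig, max(0, junction - window), junction):
            if read.is_unmapped or read.mate_is_unmapped:
                continue
            if not read.is_proper_pair or read.next_reference_name != contig:
                continue
            # This mate ends before the junction; its partner starts after it.
            if (read.reference_end is not None
                    and read.reference_end <= junction
                    and read.next_reference_start >= junction):
                n += 1
    return n

# Placeholder inputs for illustration only.
support = pairs_spanning("tardigrade.bam", "contig_123", 45210)
print("read pairs spanning the junction:", support)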

So how can we know more definitively? Ultimately, the experiment needs to be designed from the beginning to test this, minimizing contamination and using a strategy that minimizes artefactual chimeras. I’m guessing neither group set out to examine LGT in the genome, so they ultimately didn’t devise an experiment where it can be established well. PacBio sequencing to obtain a complete genome from a homozygous inbred line, with validation of metazoan-bacterial junctions by amplification and subsequent end-sequencing verification, should answer the question. Making it a homozygous line reared as aseptically as possible would be even better, possibly using antibiotics for multiple generations to remove bacterial contaminants.

Is that possible? I don’t know; what’s a tardigrade? (Actually, they seem fascinating and I’m glad to have learned more about them).

A few other points. I’ve seen criticism of the UNC authors (Boothby et al.) for not having their data already available at NCBI. There can be many reasons for this, not least of which is that genomes containing LGT typically take a long time to make it through the NCBI submission process. One of the steps is a “contaminant” screen that in effect removes all LGTs, whether they are real or not and whether they are experimentally validated or not. This needs to be remedied.

The UNC authors were also criticized for not comparing their genome to the genome from the Blaxter group. However, I think this was the right call. First, the Blaxter group should be given the first opportunity to publish any large genome comparisons. Second, the Blaxter group is best positioned to present the comparison because they understand how their genome was assembled. Too often, scientists consider genomes to be static pieces of DNA that are unambiguous. Often, this is far from the case.

UPDATE (12/7/15): The UNC/Boothby data was posted online upon numerous requests at http://weatherby.genetics.utah.edu/seq_transf/ between Nov. 30 and Dec. 4. Also, the genes were picked using a random number generator as described in the supplementary information.

Why not single read differential expression analysis?

Recent research focusing on differential gene expression has moved away from microarrays to sequencing-based methods. The cost per base pair sequenced favors the use of the Illumina platform for such research. In particular, 100-bp paired end reads seem to be the configuration du jour. But why? In particular, why not 100-bp single end reads? What does the extra read provide in the majority of research projects, where the sole purpose of the project is generating differential expression data and not an analysis of splice variants?

Paired end reads are particularly useful in understanding splicing and splice variants. For those reasons, they are essential for de novo assembly of transcriptome reads to generate a whole shotgun transcriptome, or for using reads to identify genes for eukaryotic gene annotation. In my own research, we frequently use paired reads to identify recent lateral gene transfers even in transcriptome projects, and I appreciate the plethora of paired end reads for my own research. But for organisms where gene models are known, and for projects where splice variants are not the focus, why are paired end reads favored?

One might argue that it can’t hurt. The two measurements are not independent, so we use FPKM, not RPKM, to compensate. Two 100-bp paired end reads, for statistical purposes, count the same as one 100-bp read. What you gain with paired reads is an increased likelihood that your read will map uniquely, but in the vast majority of organisms on earth, that doesn’t seem like it would matter very often. The argument is that two reads allow you to map better. True. If you can’t map read 1 alone, you might be able to map read 1 using read 2. But if you can map read 2, you’ve already counted that fragment with the FPKM calculation, so who cares? Paired reads only help when neither read 1 nor read 2 can be mapped uniquely on its own, but the fragment can be mapped uniquely in combination with the insert size distribution. But how often is that really the case? And is it worth the cost? Many experiments using paired end reads lack a suitable number of replicates. Wouldn’t the cost savings of single end reads be better put toward sequencing more replicates, particularly biological replicates? I suspect so.
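A small worked example of the FPKM point, with made-up gene lengths and counts: fragments, not reads, are what get counted, so a properly paired set of two reads contributes exactly one fragment, the same as one single-end read.

# FPKM = fragments * 1e9 / (gene length in bp * total mapped fragments)
def fpkm(fragments, gene_length_bp, total_fragments):
    return fragments * 1e9 / (gene_length_bp * total_fragments)

total_fragments = 20_000_000          # library size in fragments (made up)
genes = {
    # gene: (fragments mapped, length in bp) -- illustrative numbers
    "geneA": (500, 2_000),
    "geneB": (500, 4_000),
}
for gene, (frags, length) in genes.items():
    print(gene, round(fpkm(frags, length, total_fragments), 2))
# Whether those 500 fragments came from 500 single-end reads or 500 read
# pairs, the FPKM is identical; the pairs only help with ambiguous mapping.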

There may be other factors at play. For instance, you need to have a whole flow cell of the same size and type of reads to realize this cost savings. Can you or your sequencing center fill a flow cell with enough projects with 100-bp or 50-bp single end reads? Probably not, unless more researchers are willing to test the waters.

Originally posted at: https://allthingsg.squarespace.com/blog/2014/7/8/why-not-single-read-differential-expression-analysis

Secondary Data Usage

I know over the years many researchers have commented both positively and negatively on the NIH and NIAID’s data sharing policy. They are set to change it again such that everyone who gets NIH funding will need to comply with the policy. Comments can be made on the new policy for at least another week. So now is the time to speak up! Though do it today, as the comment period has almost expired. The new policy can be found at: http://www.gpo.gov/fdsys/pkg/FR-2013-09-20/pdf/2013-22941.pdf.

In summary, projects are not expected to release raw data such as instrument image data. However, within 6 months of data generation they are expected to release the initial sequence reads, data after alignment and QC (e.g., BAM files), and analyses like expression profiling and variant calling. If a manuscript is published within those 6 months, the data needs to be released upon acceptance of the publication. In addition, all analyses relating genomic data to phenotypes or other biological states must be released upon publication. It reads as if there are no embargo dates.

Having spent a significant amount of time on both sides of secondary data usage, as both a generator and a heavy user, I provided input on three issues in my comments on the policy change:

1. I think data generators need protected time from when data is released to when they can publish, something akin to an embargo date. Such an embargo date should be made clear to secondary users at the time data is acquired and should not change. An embargo date is important because I don’t think it will be long before groups automatically download new data, write a low quality paper, and publish it in a lower quality journal, making it impossible to have the time to do the validation and follow-up studies needed to send it to a high quality journal. Allowing data generators time to publish their research and ideas in a high quality journal is essential to the future of this type of science. I know others disagree with this and believe that everything should be open access. But today sequencing is cheaper than preparing the samples, so it is no longer a precious commodity available only to a limited few. One solution would be for the data generators to outline their particular focus. While that focus area would be off limits, other research would be fair game to conduct and publish without embargo. Of course, reviewers and editors would need to do their due diligence to enforce such a system, and the scope allowed would need to be limited.

2. Currently, raw human data is not required to be deposited; it is exempted, and only alignments following cleaning are required. However, cleaning is not defined. Since this could remove microbial reads, this is a problem for my research. I think that if raw data is not provided, it needs to be stipulated that the alignments include ALL reads (see the sketch after this list). In addition, I think that if data generators provide alignments with ALL reads, they should not have to deposit FASTQ files either.

3. There needs to be a system for retracting data and notifying users; there currently is not. For instance, one data generator I rely on retracted multiple pieces of data because the metadata said the sequence data was from a man when it was clearly from a woman. They also retracted data because three samples that should have been different were genetically identical. This is good and important, ensuring that high quality secondary data is available. Yet they did not notify users who had already downloaded the data that it was retracted. This is a major problem for secondary users. The short time frames required for deposition can make it difficult to identify all the problems, making the data less useful to secondary users.
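On point 2, a secondary user can check fairly easily whether a deposited alignment retains ALL reads, including the unmapped (potentially microbial) ones. Below is a minimal sketch using pysam; the file name is a placeholder, and samtools flagstat gives a similar summary on the command line.

# Hedged sketch: count mapped vs. unmapped primary reads in a deposited BAM.
import pysam

mapped = unmapped = 0
with pysam.AlignmentFile("deposited.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):   # until_eof includes unmapped reads
        if read.is_secondary or read.is_supplementary:
            continue
        if read.is_unmapped:
            unmapped += 1
        else:
            mapped += 1

total = mapped + unmapped
print(f"mapped={mapped} unmapped={unmapped} ({unmapped / max(total, 1):.1%} unmapped)")
# A BAM with essentially zero unmapped reads suggests "cleaning" that would
# discard exactly the microbial reads discussed above.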

Just some thoughts, and the hope that more of you will comment. It is nice to be given an opportunity to improve the system. The best system will arise from the consideration of a variety of thoughts and opinions put forth in the comments.

Originally posted at: https://allthingsg.squarespace.com/blog/2013/11/14/secondary-data-use