Quick Look at the Two Manuscripts on Tardigrade LGT

Tardigrades.  Never heard of them?  Me neither.  But lateral/horizontal gene transfer, I’ve heard of.

In the interest of full disclosure, I did not review the Boothby paper in PNAS. But I was asked to provide a comment for one of the two articles authored by Ed Yong. However, as I’m just coming off maternity leave, I was unable to provide one in time due to a lack of bandwidth and sleep as well as a fear that I might be unfairly biased.  As I told a colleague on November 24, “I thought their methodology was flawed and couldn’t put together [even] a coherent sentence as a comment. I think there is transfer but I think they severely over estimate it. I could be biased though by the fact they claim this is the largest, but [I think] my Drosophila has more extensive transfer.”

Now having given it time and thought, I think I can summarize these two manuscripts objectively and coherently from the perspective of someone who has spent a great deal of time trying to assess the levels of lateral gene transfer in animal genomes. In the interest of making this more tractable, I am only going to focus on the claim of massive LGT, and not on any aspect related to functionality or other aspects of either manuscript. Since you can have massive LGT that is not functional, I am going to ignore all of the RNASeq and EST data, and conclusions derived from them.  They are essentially irrelevant for establishing the claim of massive LGT.

First, the Boothby manuscript and why I suspected a flawed methodology. The red flag came with the statement, “our tardigrade cultures are fed algae, not bacteria, and although our algal cultures are not axenic, we would expect little to no bacterial contamination in our sequencing data.” If they are not axenic, you can expect a great deal of bacterial contamination; talk to any microbiologist.

However, and despite this clear bias, the authors did perform several analyses to try to address this. The first was an examination of coverage, but the details are lacking. It is difficult to ascertain much from the figures referenced (Figure S2 C and D) and the focus is on the SD (presumably standard deviation) of the coverage, not the mean. The second piece of evidence is about bacterial rRNA genes. However, it is quite likely all of the bacterial rRNA sequences are collapsed into single contigs. This has plagued all bacterial genome sequence projects from the dawn of genome sequencing, except those with a single rRNA operon. The multiple rRNA operons collapse into a single repeat that does not typically assemble with the rest of the genome. Therefore, this analysis is actually largely uninformative. Third, it is mentioned that general contamination is minimal. For example, human sequences are not found. Dataset S1 is referenced, which appears to be annotation of contigs, and does not fully address human contamination. Given the size of the human genome and the amount of data generated, contaminating human sequences are unlikely to be assembled into any sizable contig. The true extent of human contamination should be assessed on the read level to be more accurate. Besides, comparing bacterial and human contamination is like comparing apples and oranges (heck, apples and oranges are at least both fruit).

The manuscript then moves to look at the phylogeny of these sequences, establishing they are truly of bacterial origin. There are over 380 pages of supplementary figures analyzing this. I’m sure it was a tremendous amount of work, one that is often requested by reviewers, but it actually is just one aspect of LGT. It’s unfortunate more space wasn’t devoted to the other aspects. This is followed by an analysis of domains that finds genes of foreign origin contribute many unique protein domains. But this seems premature; are these LGTs actually in the genome?

To establish this, the authors perform PCR to test physical linkage of gene pairs. They obtain PCRs for 104 of 107 randomly selected genes. However, it isn’t reported how these genes were randomly selected. It is hard to truly be random and I suspect these represent the best 107 contigs, not random ones. But kudos to the authors for actually providing images of the gels for all 107 amplifications. Yet, only roughly half (58) support lateral gene transfer since they are the only ones that bridge between the metazoan and foreign gene. The remaining 46 amplify from foreign gene to foreign gene, which cannot exclude contamination. Furthermore, there is a lack of specificity in numerous amplification reactions, with products of different sizes. Frequently, the one of the correct size doesn’t even appear to be the most abundant product. That said, a remarkable number of those I checked do show strong amplification at the correct size. Even so, it is difficult to be confident of the correct size given the limited migration in the gels shown and the highlighted boxes that can distort perception. Combined, it could be easy to conclude that the region was amplified when it was not amplified specifically. The solution would have been to end sequence verify the PCR products to ensure the correct product was amplified. There is no indication this was undertaken.

PacBio sequencing was conducted to further support these LGTs. First, low coverage PacBio is not a great method for LGT validation since it has steps in library construction that make it prone to chimeras. This is a known problem we have published on that is not yet widely appreciated. However, LGTs that are recovered in both the PacBio dataset and the Illumina dataset should be real as you wouldn’t expect such random events to occur repeatedly across two platforms. One figure is shown in the manuscript that is used to demonstrate the congruity. Congruity is expected, whether or not these are real LGTs, since most of the sequence is from tardigrade. Furthermore, the PacBio assembly is <60 Mbp compared to the >200 Mbp Illumina assembly. This means that only about a quarter of the data in the Illumina assembly is found in the PacBio assembly. Therefore, a lot of data is missing; quite possibly these LGTs. If that was examined more closely, I couldn’t find where it was presented. However, it does not seem to support the hypothesis.

I’m going to stop critiquing the Boothby manuscript at this point. I am curious to understand how Moleculo sequencing may or may not yield LGT-like sequences, as I make it a point to understand this for all the common sequencing platforms. Given the claim, I’m disappointed at the lack of text about LGT in animals and of references to the literature, including my own work. I wish all the points I’ve raised here had been addressed in the review of this manuscript, but clearly some key points were overlooked for whatever reasons. Unfortunately, it happens. But I do feel the review system failed for these authors.

So that leaves the Koutsovoulos bioRxiv preprint, which has not been peer-reviewed and may have been put together quite rapidly. So Kudos to the authors! I’ll try to give these authors the benefit of the doubt.  For full disclosure though, that means trying to ignore the parenthetical jab at my own paper that really needs to be supported by either an argument or citation.

Largely, my concern here is genome cleansing. Genome cleansing by assembly experts, database curators, and scientists has led to the erroneous removal of LGTs from genomes time and time again. It is clearly the largest cause of LGT under-estimation. Here, the authors use blobplots to identify contigs with abnormal GC content and coverage. These plots are then used to remove “contaminant” sequences through an iterative assembly process. The problem arises because this assumes that LGTs should have a similar composition to the host genome and similar coverage. This is true for LGTs that have been in the genome for large spans of time and are fixed in the population. However, in organisms that acquire LGTs frequently, you would expect to have LGTs of all ages, including those that have not acquired the compositional biases of the host DNA. You would also expect that they are not fixed in the population, yielding abnormally low coverage distributions. (Although I’ll note that in neither paper could I grasp if the population sequenced was inbred.) We’ve even demonstrated in both insects and nematodes that recent transfers from bacterial Wolbachia endosymbionts can be extensively duplicated, so you might even expect abnormally high coverage distributions. Essentially, these criteria aren’t necessarily good at distinguishing contamination from LGT. In fact, even contigs with the same coverage and compositional biases may not be LGT. Therefore, the test employed also does not adequately test the hypothesis. Even setting that aside, a large number of Bacteroidetes sequences seem to have been removed through the process that had the same coverage and compositional bias, and this isn’t explained or transparent.
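
To make the concern concrete, here is a minimal Python sketch of a GC/coverage screen. It is not the blobplot pipeline used in the preprint; the function names, contig names, sequences, coverage values, and thresholds are all invented for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_outliers(contigs, host_gc, host_cov, gc_tol=0.10, cov_fold=3.0):
    """Return names of contigs whose GC or coverage departs from the host.

    contigs: dict of name -> (sequence, mean coverage)
    host_gc, host_cov: assumed typical values for the host genome
    gc_tol, cov_fold: hypothetical thresholds, chosen only for this example
    """
    flagged = []
    for name, (seq, cov) in contigs.items():
        gc_off = abs(gc_fraction(seq) - host_gc) > gc_tol
        cov_off = cov < host_cov / cov_fold or cov > host_cov * cov_fold
        if gc_off or cov_off:
            flagged.append(name)
    return flagged


# Toy data (invented): a host-like contig, a contaminant-like contig,
# and a hypothetical recent, unfixed LGT that still looks bacterial.
contigs = {
    "host_like": ("ATATGCATATGCATATATGC", 100.0),
    "contaminant": ("GCGCGGCCGCGCGGCCGCGC", 5.0),
    "recent_lgt": ("GCGCGGCCGCATGGCCGCGC", 40.0),
}

flagged = flag_outliers(contigs, host_gc=0.30, host_cov=100.0)
print(flagged)
```

In this toy case both the contaminant and the hypothetical recent LGT are flagged. A screen based only on composition and coverage has no way to tell them apart.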

So, is there LGT in the tardigrade genome? Both papers suggest yes; it is the extent that is in question. I suspect that Boothby et al. applied criteria that are too liberal while Koutsovoulos et al. applied criteria that are too conservative. Reality may lie in the vast expanse between the two estimates. An analysis of the coverage of junction-spanning read pairs (JSPRs) may prove informative. Chimeras should occur randomly in a standard Illumina or PacBio run. Therefore, chimeras in the assembly will only be supported by a single pair of reads. Breaking regions of contigs supported by a single pair of reads and eliminating resulting contigs that lack metazoan sequence may yield a better estimate of the true extent of LGT, although it will still be an estimate.
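
The junction-support idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fragment coordinates and junction positions are invented, the function names are hypothetical, and a real analysis would start from read alignments and would need to deal with duplicate reads first.

```python
def spanning_pairs(pairs, junction):
    """Count read pairs whose fragment spans a junction coordinate.

    pairs: list of (fragment_start, fragment_end) on one contig, where the
           fragment runs from the start of one mate to the end of the other
    junction: coordinate of the boundary between metazoan-like and
              bacterial-like sequence
    """
    return sum(1 for start, end in pairs if start < junction < end)


def weak_junctions(pairs, junctions, min_support=2):
    """Return junctions supported by fewer than min_support read pairs."""
    return [j for j in junctions if spanning_pairs(pairs, j) < min_support]


# Toy data (invented): fragment coordinates on a single contig.
pairs = [(100, 450), (300, 650), (380, 720), (900, 1250), (1500, 1850)]
junctions = [400, 1000, 1400]

weak = weak_junctions(pairs, junctions)
print(weak)
```

Junctions returned by `weak_junctions` are the candidates for breaking the contig. The `min_support=2` cutoff is an assumption that simply encodes "more than a single pair"; the right value would depend on coverage.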

So how can we know more definitively? Ultimately, the experiment needs to be designed from the beginning to test that, minimizing contamination and using a strategy that minimizes artefactual chimeras. I’m guessing neither group set out to examine LGT in the genome, so they ultimately didn’t devise an experiment where it can be established well. PacBio sequencing to obtain a complete genome on a homozygous inbred line with validation of metazoan-bacterial junctions by amplification and subsequent end sequencing verification should answer the question. Making it a homozygous line reared as aseptically as possible would be even better, possibly using antibiotics for multiple generations to remove bacterial contaminants.

Is that possible? I don’t know; what’s a tardigrade? (Actually, they seem fascinating and I’m glad to have learned more about them).

A few other points. I’ve seen criticism of the UNC authors (Boothby et al.) for not having their data already available at NCBI. There can be many reasons for this, not least of which is that genomes containing LGT typically take a long time to make it through the NCBI submission process. One of the steps is a “contaminant” screen that in effect removes all LGTs, whether they are real or not, whether they are experimentally validated or not. This needs to be remedied.

The UNC authors were also criticized for not comparing their genome to the genome from the Blaxter group. However, I think this was the right call. First, the Blaxter group should be given the first opportunity to publish any large genome comparisons. Second, the Blaxter group demonstrates the value in having them present the comparison because they understand how the genome was assembled. Too often scientists consider genomes to be static pieces of DNA that are unambiguous. Often, this is far from the case.

UPDATE (12/7/15): The UNC/Boothby data was posted online upon numerous requests at http://weatherby.genetics.utah.edu/seq_transf/ between Nov. 30 and Dec. 4. Also, the genes were picked using a random number generator as described in the supplementary information.

21 responses to “Quick Look at the Two Manuscripts on Tardigrade LGT”

  1. Michael Nitabach

    Chimeras should occur randomly in a standard Illumina or PacBio run. Therefore, chimeras in the assembly will only be supported by a single pair of reads.

    Is there a PCR step after adapter ligation in the preparation of genomic Illumina libraries? If so, then there could be multiple reads spanning chimeric contigs. Maybe the best way to exclude technical chimeras is to prepare multiple independent libraries, and sequence them separately. Then only assemble contigs based on read coverage in both libraries. Random technical chimeras will thus be excluded, even if they are amplified prior to sequencing in one of the libraries.

  2. Yes, there is. But there are tools that can identify these. The forward and reverse reads both start in the same place, which should technically be a rare event. So unless you have an absurd amount of coverage they can be removed and seen as a single read. However, I don’t know of many people who do this, particularly for assemblies. We do in our LGT/HGT work using prinseq or Picard, depending on the data sets available. Of note, depending on the parameters used, read pairs with a sequencing error may sneak through.
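
    A minimal Python sketch of the coordinate-based idea follows. It is not how prinseq or Picard implement duplicate removal; the function name and the read pairs are invented for illustration, and it keys only on mapped start positions.

```python
def collapse_duplicates(pairs):
    """Keep one read pair per (contig, forward start, reverse start) key.

    pairs: list of (contig, forward_start, reverse_start, read_id)
    The first pair seen for each key is kept; later ones are treated as
    likely PCR duplicates.
    """
    seen = set()
    kept = []
    for contig, fwd, rev, read_id in pairs:
        key = (contig, fwd, rev)
        if key not in seen:
            seen.add(key)
            kept.append((contig, fwd, rev, read_id))
    return kept


# Toy data (invented): r2 starts in the same places as r1.
pairs = [
    ("contig1", 100, 450, "r1"),
    ("contig1", 100, 450, "r2"),
    ("contig1", 120, 450, "r3"),
    ("contig2", 100, 450, "r4"),
]

kept = collapse_duplicates(pairs)
print([p[3] for p in kept])
```

    Because the key is positional only, a duplicate whose mapped start shifts (for example through a sequencing error) would not be collapsed by this sketch.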

  3. “First, low coverage PacBio is not a great method for LGT validation since it has steps in library construction that makes it prone to chimeras. This is a known problem we have published on that is not yet widely appreciated”

    Could you say a bit more about this potential problem with Pacbio libraries and perhaps link to the papers you mention?

    • Of course. Here is a link to the paper (open-access): http://www.biomedcentral.com/1471-2164/15/788. Essentially since PacBio sequencing requires the ligation of adapters, chimeras can form. The PacBio library construction (at least the last time I checked) merely relies on the stoichiometric overabundance of the adapters and diluteness of the DNA to try to limit their formation. Each should be unique, so given enough coverage they should be removed from the assembly. But in low-coverage assemblies they are a problem, albeit one that can be improved with the use of Quiver.

  4. Hi Julie… thanks for the blog. Apologies if you felt there was a “parenthetical jab” – this was not the intention (as you note, the MS was written very quickly).
    I was merely repeating the case made previously that the Dirofilaria immitis HGT figured in your Science paper (http://www.ncbi.nlm.nih.gov/pubmed/17761848 Fig 2C) is unlikely to be real. This HGT, where Wolbachia fragments form the introns and only the introns in an antigen gene (accession D23689.1 http://www.ncbi.nlm.nih.gov/nuccore/475860), certainly would be interesting – but complete replacement of introns (and only introns) by Wolbachia HGT needs some special mechanism.
    We saw this fragment when doing our original Wolbachia-filarial HGT analyses on Onchocerca ochengi/volvulus (http://www.ncbi.nlm.nih.gov/pubmed/17040125), but rejected it because it was so far beyond any likely plausible mechanism.
    We didn’t see this construct in the genome when we sequenced D. immitis (http://www.ncbi.nlm.nih.gov/pubmed/22889830), but did see a perfectly normal version of the gene, with the same exons as D23689.1 but without Wolbachia neatly replacing each intron. In addition the last part of the last “intron” in the D23689.1 sequence is neither Wolbachia nor D. immitis but cloning vector. Bases 2971 to 3190 match a series of plasmids, and are ~98% identical to a region that I think is a commonly-used replication origin. The Wolbachia fragments derive from (and are ~100% identical to) “live” wDi sequences (in your MS you could only compare to wBm from B. malayi). The “normal” version of the locus was present in both D. immitis we screened – from Northern Italy and Southern USA.
    It is obviously impossible to say for sure, but I think it is likely it is a lab or computer artefact.

    • Fair enough. But I think it is important not to be so emphatic such that it becomes a fact. When I present it, I always point out that it has never been found again, and that is very unusual. But so was the integration of an entire Wolbachia genome when we first reported it, and it is now a repeated observation. The fact that we can’t explain the D. immitis gene with our current knowledge, doesn’t mean it isn’t real. Of course, it may not be real. But I like to think of it as an open question, although one I hardly consider anymore. So many other great instances to discuss that the D. immitis one can be relegated to history. In fact, it took me a good five minutes to even figure out what the reference was referring to. Unfortunately, we’ve contacted the authors several times asking for materials or the clone, but have never received a response. But as you said, you put it together quite quickly.

  5. Really interesting and informative! Perhaps you should publish THIS on bioRxiv!

  6. Hi Julie,

    I wanted to ask for some clarification as I am not an LGT/HGT person.

    You mention you aren’t concerned with issues of function or expression, but given the claims of the Boothby et al paper this would seem absolutely central to assessing their claims. How could HGTs play formative evolutionary roles (enabling prolonged asexuality, underpinning stress resistance) if they aren’t functioning? Is it thought that just extra DNA influences genome regulation globally? I hope you can clarify. I have been scampering through the literature but have not found discussion of this yet.

    Kindest regards

    Aziz

    • It’s not that I’m not concerned. I just wanted to focus on just one aspect; the one that was generating the controversy. It was long and time consuming to do just that and at least try to do it fairly.

      We already know there are massive transfers that are unlikely to be fully functional. Mark Blaxter’s group has published a few, so has mine and so has Fukatsu’s group, among others. For instance, in Drosophila ananassae we know that multiple bacterial genomes have been integrated. If there is function, it isn’t likely to involve the entire genome. We honestly have no idea why it happened or if it does anything. But now 20% of one chromosome, and 2% of the fly’s genetic material is of bacterial origin (http://www.biomedcentral.com/1471-2164/15/1097). So could that have happened in the tardigrade? Maybe. Maybe their reproductive style prevents it. I honestly know next to nothing about tardigrades.

      The other question though is do you need to produce a protein to be functional? There is an idea that has been proposed in the related numt literature that numts are healing double-stranded breaks. Could that happen in tardigrades where double stranded breaks seem relevant? The transfer would then have a function, at least initially, even if the gene didn’t produce a protein.

  7. Great review, and hits many interesting points about HGT that I only suspected might be an issue :).

    One clarification regarding my own comments about the data: IMO, the raw data should have been available at the time of publication. Period. Frankly, the reviewers should have insisted on seeing it, if only to verify that it was there.

    If NCBI submission was delaying things, then there are other sites – including the authors’ own Web sites, or figshare, or Amazon S3 – where raw sequencing data can be deposited.

    There’s simply no excuse for not making raw data available for genomic papers, especially when big claims are being made from genome assemblies. We all should know by now that sequencing and assemblers are error prone and challenging and that the raw data needs to be included for validation and comparison of assembly techniques, methodologies, and results.

    –titus

    • I didn’t need raw data to review these two papers. I trust that both sets of authors know what to do with it; just like I trust a biochemist to be able to do a binding assay without seeing their raw data. That isn’t why the sequencing repositories exist. The repositories exist for secondary data usage to test OTHER hypotheses or generate OTHER results. So yes, the data needs to get there, but it really didn’t matter that it wasn’t there. I just think focusing on whether the data is in the repositories is a huge distraction from other more pertinent issues.

      • Well, now I’m just confused… The PNAS guidelines say:

        “To allow others to replicate and build on work published in PNAS, authors must make materials, data, and associated protocols available to readers.”

        So replication is one of the primary goals of making data available.

  8. Interesting post. However I’m not sure about the following statement: “I suspect that Boothby et al. applied criteria that are too liberal while Koutsovoulos et al. applied criteria that are too conservative.” I also think your scope isn’t wide enough.
    If you pick the ten longest contigs from Boothby et al. (http://weatherby.genetics.utah.edu/seq_transf/tg.genome.fsa.gz) and blast the Open Reading Frames larger than 500 nt, 99% to 100% of the best hits come from bacteria. In a 16 kb N50 assembly, having a 1 Megabase contig is suspicious enough (your typical eukaryotic genome being full of repeated regions breaking the assembly contiguity, why would there be a repeat-free 1 Mb region?). If your “anomaly” contig only yields bacterial hits, there is no way it can originate from horizontal gene transfer! The only conclusion is that there is massive contamination from bacteria in this assembly (more so if you get twice the expected genome size). The first thing to do is purge the assembly of this contamination, and then go into more details about true HGT (where you have much experience it seems). But here the problem comes way before this fine analysis: Boothby et al. checked that there was no chimera, but it’s simpler than that! On one hand you have tardigrade DNA (in small fragmented contigs totalling 110+ Mb) and on the other hand bacterial DNA (the remainder, in larger contigs with only bacterial genes). The debate should be on the few genes coming from bacteria that are colinear to eukaryotic genes, on contigs unequivocally originating from tardigrade DNA (and validated through RNAseq expression). It’s quite puzzling that nobody (authors, reviewers, pundits?) seems to be doing the “quick look” of just blasting a few ORFs and concluding that if it looks like bacteria, smells like bacteria, tastes like bacteria, and behaves like bacteria, well, maybe it’s bacteria after all?

    • We agree with Adrien. I had made the github gist below a few days ago, and posted on twitter, but this seems to be a better place for it:

      https://gist.github.com/sujaikumar/4ddd79eb53c6c4c79528

      Summary:

      “For the first 15 longest scaffolds, all the predicted proteins, in order, hit Bacteria according to UNC’s own GFF (which was eventually uploaded to http://weatherby.genetics.utah.edu/seq_transf/ on Friday Dec 4).

      A handful of random hits are to eukaryotic organisms like Danio/Arabidopsis/Homo/Gallus/Bos… but if you’ve ever blast-ed against NR, you know how those hits turn up in any search.

      This is not intended to be comprehensive. It is just a quick extraction of UNC’s own best hits that show quite strongly that the longest scaffolds in their assembly are ALL bacterial ALL the way through.”

      • Adrien – thank you for describing the test that we had also started to do as soon as we got access to the UNC gene model GFF on Dec 4 – i.e. testing pairs of eukaryote-like and noneukaryote-like genes adjacent to each other on the same scaffold.

        Note: This does not take into account our additional RNAseq data or our genomic (lack of) coverage data as described in biorxiv.org/content/early/2015/12/01/033464 . For now, I just wanted to see how many eukaryote-noneukaryote adjacent pairs from UNC’s own list of genes could be verified using their own PacBio data.

        Because I did not have access to the UNC team’s assignment of genes as Metazoan/Bacterial/Fungal etc, I had to redo the taxonomic assignment of all genes so it is possible that my final number will be slightly different from theirs. Rather than pick gene pairs at random, I selected *all* instances where a eukaryote-like gene was next to a non-eukaryote like gene on the same scaffold.

        Only 294 gene pairs matched these requirements (euk-noneuk adjacent on same scaffold), and of those, only 10 were fully spanned by a PacBio scaffold.

        Repeating this for Metazoa-Nonmetazoa pairs, 713 such pairs remain (some of the euk-euk pairs become meta-nonmeta, and so the number of meta-nonmeta pairs went up). Of those, only 26 were verified by PacBio scaffolds.

        Details of scripts/commands are at https://github.com/sujaikumar/tardigrade (I hope I’ve not made any mistakes as it is 5am! If someone could double check my numbers I’d be very grateful).

        In summary, although UNC claim that the PacBio data verify their assembly, the reality (assuming my analysis is correct) is that only 10 euk-noneuk (or 26 metazoa-nonmetazoa) gene pairs are actually verified by their own data.

        As Julie points out in her excellent blog post analysis above, the PacBio assembly was very short (it summed to 120 Mb though, not 60 Mb as stated above). Therefore it is possible that it missed true HGT events.

        This is where Adrien’s “first thing to do” point comes in. Before even doing the HGT checks, we should be throwing out the scaffolds that only hit bacterial sequences in the public databases, and have such low genomic coverage that they could not possibly be from the nuclear genome of the tardigrade (unless one argues that only 1 bacterial cell/fragment was incorporated as an HGT for every 100 tardigrade cells). That analysis is ongoing and we are writing it up for formal peer review (although as someone on twitter pointed out – our biorxiv preprint will end up receiving much greater scrutiny anyway)

      • Good catch, I didn’t see the .gff when I did my quick “checking before talking” after reading the biorxiv preprint. To be honest, the evidence is pretty damning, you can even find a 16S rRNA gene from an “Armatimonadetes” in one of these megabase contigs…
        With the first 15 contigs being (without a doubt) bacteria, we remove 10 Mb from the assembly. Assuming 500 genes / Mb, approximately 5000 genes out of the original 38000 can be safely cut out of the tardigrade genome annotation. From there I don’t really know what’s left from the original point of the paper…

  9. Two updates: The UNC/Boothby data was posted online upon numerous requests at http://weatherby.genetics.utah.edu/seq_transf/ between Nov. 30 and Dec. 4. Also, the genes were picked at random, using a random number generator as described in the supplementary information.

  10. i know nothing about tardigrades and next to nothing about lgt. therefore, my comment has nothing to do with either the investigators or the pnas paper/the biorxiv preprint or the science contained in both manuscripts. kudos to both the teams and julie to set the ball rolling in an open discussion forum and this is exactly the way science should go. no matter who is correct, both the teams and science are victorious at the end.

    my comment is to do with raw data availability. and there, i would like to respectfully disagree with julie about the timing of raw data deposit and agree 100% what titus brown said. in the case of tardigrade genome, you know the two groups that published the pnas paper and the biorxiv preprint. therefore, no points in looking at the raw data as you trust that they know what they are doing. but suppose you get the same paper from an unknown group to review, let’s say from our group, who you have never heard of, what do you do when you hear the title that says “…extensive horizontal gene transfer…..”? as an established researcher in the field, perhaps the first thing you would do is to ask your grad student or postdoc to download the data and do a quick and dirty analysis on whether the claims are correct or not. and thats why getting the raw data submitted at the same time is as crucial as getting the paper out. additionally, getting the raw data out gives an investigator who other established ones have not heard of a chance to be correct before their papers are rejected, if indeed they claim big things. binay panda

  11. I haven’t paid a ton of attention to this controversy but in a general sense of assessing LGT (as well as genome assembly!) it seems obvious that HiC/proximity-ligation approaches would clear things up. If LGT events weren’t PCR duplicates of chimeras, then you’d see diminishing interaction frequencies between those prokaryotic regions and, say, tardigrade sequence with increasing linear physical distance. This approach would also generate multiple independent points by which to confirm LGTs. Contamination from other genomes is extremely unlikely to be involved in proximity ligation under these conditions; with a good experiment you only see about 8-10% ligation across chromosomes within fixed nuclei, let alone nucleic acids outside those nuclei. Any reason why folks don’t do this more often?

