Our PacBio throughput and read lengths have been improving steadily over the past year and may have just taken yet another big step forward. We upgraded our PacBio sequencer to RSII in mid-May and we are seeing significant increases in per-cell yield and improved read lengths with our longer libraries. The most notable change in the upgrade from RSI to RSII is the doubling of the number of simultaneously observable sequencing reactions on the SMRTcell, allowing throughput to be effectively doubled as well. Let’s take a look at some examples:
In this comparison of an 8kb Mycobacterium library that was run both before and after the upgrade, we see an almost 3x increase in total yield per-SMRTcell, while read lengths remain about the same.
Below is a comparison of per-SMRTcell stats from multiple libraries across multiple organisms, including both 8kb and 14kb libraries from Mycobacterium sp., Plasmodium falciparum, Saccharomyces cerevisiae and Candida albicans. Driven by the longer libraries, we see both dramatically higher yield and longer read lengths. On one recent 8 SMRTcell run of a 14kb library, we saw an average per-SMRTcell yield of 417 Mbp!
Here is a read length plot comparing the runs from the table above:
Although we are early in our use and optimization of the new PacBio RSII, we are encouraged by the increase in both yield and read length. We expect continued improvement in our PacBio data, which should in turn improve data analysis and genome assembly.
16S amplicon sequencing has proven to be an important tool for identifying and quantifying microbes present in metagenomic samples. Several researchers here at IGS have used this approach for several years to analyze organismal and environmental communities.
Together with these researchers, the GRC has been working over the past year to transition high-throughput sequencing of 16S rRNA regions amplified from metagenomic samples from the 454 platform to the Illumina platform. With its increased read length (2x250bp), the MiSeq is now well suited to generating 16S data for a fraction of the cost of generating data on the 454 FLX.
A typical 16S amplicon run on the 454 produces ~1M reads with an average read length of ~500 bp, which enables deep profiling of 100-200 samples. A paired-end MiSeq run generates 500 bp of sequence per amplicon and produces an average of 12M read pairs per run. We are now routinely profiling a minimum of 400 samples per run with even greater depth than possible on 454 for less than half the per-sample cost.
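To make the depth comparison concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the approximate figures quoted above (about 1M reads per 454 run, about 12M read pairs per MiSeq run) and assumes, for illustration only, that reads are spread evenly across samples, which real runs never quite achieve. The function name is ours, not part of any pipeline.

```python
def reads_per_sample(reads_per_run, samples_per_run):
    """Average depth per sample, assuming reads are split evenly (an idealization)."""
    return reads_per_run / samples_per_run


# Approximate figures taken from the text above.
runs = {
    "454 FLX, 100 samples": (1_000_000, 100),
    "454 FLX, 200 samples": (1_000_000, 200),
    "MiSeq, 400 samples": (12_000_000, 400),
}

for label, (reads, samples) in runs.items():
    depth = reads_per_sample(reads, samples)
    print(f"{label}: ~{depth:,.0f} reads (or read pairs) per sample")
```

Under these assumptions the 454 gives roughly 5,000 to 10,000 reads per sample, while the MiSeq gives roughly 30,000 read pairs per sample even at 400 samples per run.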
Please contact us for more information about our 16S profiling service using the Illumina MiSeq.
We’ve spent some time recently testing a new way to assemble PacBio data called HGAP, which stands for “hierarchical genome assembly process”. Unlike previous assemblers of PacBio data, which relied on Illumina reads and/or PacBio CCS reads for error correction of PacBio long reads, HGAP uses multiple alignments of all reads to perform the corrections, potentially eliminating the need for other libraries and data types. The corrected reads are assembled with an overlap-layout-consensus assembler (in this case Celera Assembler) to form contigs. More details about HGAP can be found here: https://github.com/PacificBiosciences/DevNet/wiki/Hierarchical-Genome-Assembly-Process-%28HGAP%29
We have evaluated HGAP on several of our projects and compared it to our assemblies of Illumina-corrected PacBio reads assembled with Celera Assembler. So far, the results have been very encouraging and we have seen significant improvement in many cases. The chart below shows several examples:
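The idea of correcting reads from a multiple alignment can be illustrated with a toy example. The sketch below is not HGAP's actual algorithm. It simply assumes the reads have already been aligned to one another (equal-length strings with "-" for gaps) and takes a majority vote in each column, which is enough to show why random errors in individual reads tend to wash out when enough reads overlap. The function name and example reads are made up for illustration.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over pre-aligned reads (toy illustration only)."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:  # a column that is mostly gaps contributes nothing
            consensus.append(base)
    return "".join(consensus)


# Hypothetical aligned reads covering the same region.
reads = [
    "ACGTTAGC",
    "ACGATAGC",  # substitution error in the fourth column
    "ACGTTAGC",
    "ACGTT-GC",  # deletion error in the sixth column
    "ACGTTAGC",
]
print(column_consensus(reads))
```

Each error appears in only one of the five reads, so the vote recovers ACGTTAGC in this example. Real long-read correction has to build the alignments first and cope with much higher, indel-heavy error rates.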
So the assemblies are more contiguous, but are the corrections good enough to generate accurate consensus sequence? In an attempt to verify the consensus accuracy of these HGAP assemblies for several Bordetella genomes, we aligned >240x coverage of 250bp Illumina MiSeq data to the HGAP-generated contigs and looked for discrepancies and SNPs using GATK. We found no cases of high-quality, passed-filter variants, which supports a highly accurate consensus sequence generated by the HGAP assembly. We continue to test and compare HGAP with other PacBio assembly methods but are encouraged by initial results.
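The final check described above, looking for high-quality, passed-filter variants, can be sketched as a small script that scans VCF text. This is a minimal illustration and not the pipeline we ran: the quality cutoff of 30 is an assumed value, the function name is hypothetical, and the example records are invented. It relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
def count_pass_variants(vcf_lines, min_qual=30.0):
    """Count VCF records with FILTER == PASS and QUAL >= min_qual.

    min_qual is an assumed threshold for illustration, not a GATK default.
    """
    count = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        qual, filt = fields[5], fields[6]
        if filt != "PASS" or qual == ".":
            continue
        if float(qual) >= min_qual:
            count += 1
    return count


# Invented example records: two low-quality calls, no passing variants.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t1042\t.\tA\tG\t12.5\tLowQual\t.",
    "contig1\t88310\t.\tC\tT\t18.0\tLowQual\t.",
]
print(count_pass_variants(example_vcf))
```

A count of zero across all contigs is the outcome that would correspond to the result reported above.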
Ken Dewar, McGill University, highlights how PacBio Circular Consensus Sequencing could be used to sequence ‘Rhino’viruses:
Sometimes it is not possible to come up with the amount of DNA or RNA required for a standard Illumina library prep. We are frequently asked what the options are when there is just not enough sample available.
There are several kits on the market now that allow Illumina libraries to be prepared from minimal amounts of starting material. We have processed clinical samples, metagenomic samples, and samples from FFPE tissues that yielded extremely low amounts of RNA or DNA.
For RNA samples, we have generated linearly-amplified cDNA with the Nugen Ovation v2 kit. An advantage of this kit is that the amplification of rRNA is somewhat suppressed, increasing the percentage of usable data. Starting with sub-nanogram amounts of RNA, we are able to generate micrograms of cDNA. We’ve tested various library preparation methods with the amplified cDNA, and we have found that the Illumina TruSeq prep works best for us.
The Illumina Nextera system is an option when DNA amounts are limiting. The Nextera XT DNA Sample Prep Kit calls for 1 ng of input material (best for plasmids or small genomes), and the Nextera DNA Sample Prep Kit calls for 50 ng of DNA. Library fragmentation is accomplished via transposon insertion events. We skip the normalization/denaturation portion of the protocol and determine the quality and quantity of the libraries following our standard procedures. We have found that library size distributions tend to vary and can be much wider than those of our traditional Illumina DNA libraries, but this is still a great option when there is very little material available.
Contact us if you have questions or would like additional information.
At AGBT a couple of weeks ago, I presented a poster with an overview of methods developed by GRC members to sequence and assemble viral genomes from clinical samples. To view the poster, follow the link below:
AGBT Rhinovirus Poster
IGS also presented a poster about custom capture at this year’s AGBT meeting. The poster below presents data demonstrating that custom capture can be an effective way to sequence entire genomes of obligate intracellular parasites that cannot be grown independently, including such organisms isolated from field samples.
One of the posters IGS presented at AGBT 2013 involved a thorough comparison of sequencing platforms and assembly strategies across five microbial species that vary in both genome size and GC content. The conclusions from this study aim to inform future large-scale microbial projects and aid in efficiency and project design.
Please click on the link below to view a PDF image of the actual poster.
AGBT 2013 Assembly Comparison Poster
A new feature that was added with the recent PacBio upgrade is something called ‘Stage Start’. This allows for data collection to start earlier than it did previously. When this option is used, data collection begins immediately after the polymerase is activated, resulting in longer reads.
Below are the results from a quick test we performed. We sequenced two libraries with and without the ‘Stage Start’ feature turned on.
The libraries sequenced were about 8kb in length, and were sequenced using the Magbead Standard Seq v1 protocol. One 90-minute movie was taken of each SMRTcell. Standard Polymerase Binding and Sequencing kits were used (not the newer ‘XL’ version of the kits).
Over the past few months, several members of the GRC bioinformatics team have been working diligently on testing a variety of assemblers and analyzing results. The assembler testing is intended to help critically evaluate the results and performance of some of the more popular de novo assemblers. Similar studies have been done before (such as: http://gage.cbcb.umd.edu/), but we aim to expand upon those studies by testing on different organisms and data types. To that end, WGS data generated at IGS, from many samples and across multiple species (such as E. coli, V. cholerae, S. aureus and M. massiliense), have been assembled at multiple coverage levels using assemblers such as Celera Assembler, MSRCA, Velvet, SOAPdenovo and ABySS. In addition, the samples have been sequenced on various NGS platforms, including Illumina HiSeq, Illumina MiSeq and PacBio. These data types will be assembled in different combinations and as stand-alone assemblies to gauge the effects of combining data types in hybrid assemblies. We hope to have lots of stats compiled in the very near future.
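To give a sense of how quickly such a test matrix grows, the sketch below enumerates every stand-alone and hybrid combination of the three platforms named above and pairs each one with each assembler. This is only an illustration of the bookkeeping, not our actual test plan: the function names are hypothetical, and in practice not every assembler can take every data type, so the real grid is smaller. Coverage levels would multiply the count further.

```python
from itertools import combinations

# Platform and assembler names as listed in the text above.
PLATFORMS = ["Illumina HiSeq", "Illumina MiSeq", "PacBio"]
ASSEMBLERS = ["Celera Assembler", "MSRCA", "Velvet", "SOAPdenovo", "ABySS"]


def data_type_combinations(platforms):
    """All non-empty subsets: stand-alone data sets plus every hybrid mix."""
    combos = []
    for size in range(1, len(platforms) + 1):
        combos.extend(combinations(platforms, size))
    return combos


def assembly_plan(platforms, assemblers):
    """Upper bound on runs: every assembler paired with every data combination."""
    return [
        (assembler, combo)
        for assembler in assemblers
        for combo in data_type_combinations(platforms)
    ]


combos = data_type_combinations(PLATFORMS)
plan = assembly_plan(PLATFORMS, ASSEMBLERS)
print(f"{len(combos)} data combinations, {len(plan)} assembler runs per sample")
for combo in combos:
    print(" + ".join(combo))
```

With three platforms there are seven data combinations, so five assemblers give at most 35 assemblies per sample before coverage levels are taken into account.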