PacBio Pipeline Off to a Strong Start for 2014

It has been a busy January for our PacBio RSII instrument. We are excited to report a new record yield from a single SMRT cell – 896,457,524 passed-filter bases! It seems we are not far off from hitting 1 Gb.

 

Some more stats from this cell:

                Mean Read Length: 8391 bp

                P50 Subread Length: 6187 bp

                P90 Subread Length: 12314 bp

                P95 Subread Length: 14032 bp

                Maximum Subread Length: 24585 bp

 

We have come a long way in the past year. Here is a comparison of yields and mean read lengths of our top 20 SMRT cells in January 2013, compared to our top 20 SMRT cells so far in 2014:

The increases in both SMRT cell yields and read lengths are making PacBio an attractive option for sequencing and finishing microbial genomes. We are excited to see where 2014 will take us!

For more information on our full range of sequencing and analysis services, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions.

Finishing Genomes with the PacBio RS II – Read our Core Lab Profile

The GRC, which offers services from sequencing library prep through genome assembly and downstream analysis, is generating complete bacterial genome sequences and methylation profiles using PacBio SMRT sequencing on the RS II. Several advancements in the library prep, sequencer, sequencing protocols, and data analysis software have all contributed to this.

To learn more about these breakthroughs and other emerging applications of SMRT sequencing, please read the PacBio Core Lab Profile showcasing the research performed at GRC and IGS here.

GRC and IGS offer not only cutting-edge sequencing, but a complete menu of services including assembly, annotation, and custom analyses. For more information about services offered, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions.

 

Increasing PacBio RS II SubRead Lengths

Although the latest SMRTcell has been designed to shift the loading bias towards larger read lengths, when working with long insert libraries (10-20 kb), the preferential loading of smaller fragments often limits the potential of these libraries.

A solution to this is to remove small fragments from the libraries. We have evaluated the Blue Pippin (Sage Science, Inc., Beverly MA), an automated electrophoresis system that separates and collects DNA fragments based upon their size, for this purpose.

In order to measure the increase in subread length, long insert libraries were prepared with fragments larger than 4 kb or 7 kb isolated using the Blue Pippin and a 0.75% Agarose Gel Cassette (BLF7510) and compared to a library without Blue Pippin size selection. As shown below, the removal of smaller library fragments prior to sequencing increases the average length of the library fragments loaded into ZMWs on the SMRTcell.

In addition to longer subreads, there is also a boost to the amount of data generated per ZMW. As the fragment length increases, the percentage of SMRTbell adapter sequence decreases and the percentage of library insert increases. The graph below shows the average number of passed-filter bases per active ZMW versus the average fragment length of each library. Using Blue Pippin size selection, we have achieved yields of >500 M passed filter bases from individual SMRTcells.

Below are the sequencing and assembly results of four genomes sequenced from long-insert, Blue Pippin size-selected libraries. Using only PacBio long subread data, we were able to assemble complete microbial genomes for three of the four isolates. Even with only a single under-loaded and low-yield SMRTcell, the remaining isolate still resulted in a nearly complete genome assembly with 10 total contigs and >60% of the genome assembled in the largest contig.


GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms

Researchers from the Genomics Resource Center were significant contributors to the recently published paper “GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms” which can be found here:

http://bioinformatics.oxfordjournals.org/content/early/2013/05/10/bioinformatics.btt273.long

Following the standards set by the original GAGE assembly comparison (Salzberg, et al., 2012), GAGE-B (Genome Assembly Gold-standard Evaluation for Bacteria) evaluates how genome assemblers compare on a spectrum of bacterial genomes sequenced by the newest sequencing technologies.

The need to contain DNA preparation costs, particularly in comparison to sequencing costs, often results in the creation of only a single sequencing library, which frequently poses challenges during genome assembly. GAGE-B evaluates the following open source genome assemblers:

• Abyss v1.3.4 (Simpson, et al., 2009)
• Cabog v7.0 (Miller, et al., 2008)
• Mira v3.4.0 (Chevreux, et al., 2004)
• MaSuRCA v1.8.3 (Zimin, et al., 2013)
• SGA v0.9.34 (Simpson and Durbin, 2012)
• SoapDenovo2 v2.04 (including GapCloser) (Li, et al., 2010)
• SPAdes v2.3.0 (Bankevich, et al., 2012)
• Velvet v1.2.08 (Zerbino and Birney, 2008)

Here we highlight some exciting results using the data provided in the paper.

First, let’s take a look at which assembler generates the best assemblies of bacterial species from a single whole-genome shotgun library. The HiSeq reads were 100 bp paired-end, with coverage levels ranging from 100x to 300x; the MiSeq reads were 250 bp paired-end, with 100x coverage for all samples.

Comparison of corrected N50 contig sizes for assemblies where the finished reference genome was identical or near-identical.

Comparison of N50 contig sizes for assemblies where the sequenced strain was too divergent to compute a corrected N50 value. All genomes shown here were assembled from 100bp HiSeq reads.

Overall, MaSuRCA and SPAdes produced the best assemblies across these twelve bacterial organisms. MaSuRCA had the largest contig sizes, measured by either N50 or corrected N50 values, for ten of the twelve genomes. The SPAdes assembler came in first for the other two genomes, and was a close second for an additional two organisms.
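For readers less familiar with the metric, here is a minimal sketch of how an N50 contig size can be computed from a list of contig lengths. The contig lengths in the example are made up for illustration and are not taken from the paper. The corrected N50 used in GAGE-B is the same calculation applied after contigs have been broken at misassemblies identified against the reference, a step this sketch does not attempt.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half of the assembly size."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running to total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp (illustration only).
example_contigs = [400_000, 300_000, 150_000, 100_000, 50_000]
print(n50(example_contigs))
```

In this toy assembly the two longest contigs together cover more than half of the total length, so the N50 is the length of the second-longest contig.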

Next, we compared the assemblies produced by the high-coverage, one-library strategy to the best assemblies created by a two-library sequencing strategy. In this experiment, one set of single-library sequence data consists of 101 bp paired-end HiSeq reads with 210x coverage, while the other consists of MiSeq 251 bp paired-end reads at 100x coverage. The two-library data set from the original GAGE study comprises 101 bp reads generated by sequencing one 180 bp fragment library and one 3000 bp jumping library (intended to span the repetitive areas of a genome) with the Illumina Genome Analyzer II. The GAGE ALLPATHS-LG assembly was generated with 31x sequence coverage of the short insert library and 29x sequence coverage of the jumping library; the GAGE MaSuRCA assembly was generated with 45x sequence coverage of the short insert library and 9x coverage from the jumping library.
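The coverage levels quoted above relate to read counts through a simple formula: coverage equals the total number of sequenced bases divided by the genome size. The sketch below shows that arithmetic for paired-end data. The genome size is an assumed round figure of about 4.6 Mb used only for illustration, and the function names are ours rather than part of any tool.

```python
import math


def coverage(read_pairs, read_bp, genome_bp):
    """Fold coverage from paired-end data (two reads per pair)."""
    return 2 * read_pairs * read_bp / genome_bp


def read_pairs_for_coverage(target_coverage, read_bp, genome_bp):
    """Smallest number of read pairs that reaches the target coverage."""
    return math.ceil(target_coverage * genome_bp / (2 * read_bp))


# ASSUMPTION: approximate genome size in bp, for illustration only.
GENOME_BP = 4_600_000

# Read lengths and coverage targets as quoted in the text.
for label, read_bp, target in (("HiSeq", 101, 210), ("MiSeq", 251, 100)):
    pairs = read_pairs_for_coverage(target, read_bp, GENOME_BP)
    print(f"{label}: about {pairs:,} read pairs for {target}x coverage")
```

The same two functions can be used in reverse to check the coverage actually delivered by a run once the number of passed-filter read pairs is known.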

Assemblies of R. sphaeroides using one versus two libraries.


The contigs created by both MaSuRCA and SPAdes from a single deep-coverage library were considerably larger than those from the two-library data, which was at lower coverage (100X).

However, scaffold analysis shows the other side of the coin:

The lack of long “jumping” pairs from a second library makes a very significant difference in the size of scaffolds, primarily because a single library of paired reads from relatively short fragments is not sufficient to span many of the repetitive sequences in a genome. In essence, the best scaffolds for the one-library assembly were less than 10% of the length of the biggest scaffold generated from the two-library strategy.

So what’s the takeaway? Overall, the results support a conclusion that, with deep sequence coverage, the latest genome assemblers can produce extraordinarily good de novo bacterial assemblies using sequence data from just a single, short-fragment DNA library. We always run multiple assemblies using multiple assemblers. For a single assembly attempt, MaSuRCA appears to be the best option for now. But, genome assemblers are rapidly evolving with the changing sequencing landscape, and the best approach can change quickly. We are constantly testing, evaluating, and developing over here and will be sure to keep you posted…

PacBio RSII producing encouraging early results

Our PacBio throughput and read lengths have been improving steadily over the past year and may have just taken yet another big step forward.  We upgraded our PacBio sequencer to RSII in mid-May and we are seeing significant increases in per-cell yield and improved read lengths with our longer libraries.  The most notable change in the upgrade from RSI to RSII is the doubling of the number of simultaneously observable sequencing reactions on the SMRTcell, allowing throughput to be effectively doubled as well.  Let’s take a look at some examples:

In this comparison of an 8kb Mycobacterium library that was run both before and after the upgrade, we see an almost 3x increase in total yield per-SMRTcell, while read lengths remain about the same.

Below is a comparison of per-SMRTcell stats from multiple libraries across multiple organisms, including both 8kb and 14kb libraries from Mycobacterium sp., Plasmodium falciparum, Saccharomyces cerevisiae and Candida albicans.  Driven by the longer libraries, we see both dramatically higher yield and longer read lengths. On one recent 8 SMRTcell run of a 14kb library, we saw an average per-SMRTcell yield of 417 Mbp!

Here is a read length plot comparing the runs from the table above:

Although we are early in our use and optimization of the new PacBio RSII, we are encouraged by the increase in both yield and read length, and expect continued improvement in our PacBio data, subsequently improving data analysis and genome assembly.

Highly Multiplexed 16S Sequencing on MiSeq

16S amplicon sequencing has proven to be an important tool for identifying and quantifying microbes present in metagenomic samples. We have several researchers here at IGS who have used this to analyze organismal and environmental communities for several years.

Together with these researchers, the GRC has been working over the past year to transition high-throughput sequencing of 16S rRNA regions amplified from metagenomic samples from the 454 platform to the Illumina platform. With the increased read length (2x250bp) on the MiSeq, it is now well suited to generate 16S data for a fraction of the cost of generating data on the 454 FLX.

A typical 16S amplicon run on the 454 produces ~1M reads with an average read length of ~500 bp, which enables deep profiling of 100-200 samples. A paired-end MiSeq run generates 500 bp of sequence per amplicon and produces an average of 12M read pairs per run. We are now routinely profiling a minimum of 400 samples per run with even greater depth than possible on 454 for less than half the per-sample cost.
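The depth comparison above comes down to dividing reads per run by samples per run. Here is a quick sketch of that arithmetic using the figures quoted in this post. It assumes reads are spread evenly across samples, which real multiplexed runs only approximate, and the function name is ours.

```python
def reads_per_sample(reads_per_run, samples_per_run):
    """Average reads (or read pairs) per sample, assuming even multiplexing."""
    return reads_per_run // samples_per_run


# 454: ~1M reads per run, shared among 100 to 200 samples.
depth_454 = [reads_per_sample(1_000_000, n) for n in (100, 200)]

# MiSeq: ~12M read pairs per run, shared among 400 samples.
depth_miseq = reads_per_sample(12_000_000, 400)

print("454 reads per sample:", depth_454)
print("MiSeq read pairs per sample:", depth_miseq)
```

Even at 400 samples per run, the average MiSeq depth per sample works out to several times the depth of a 454 run carrying 100 samples.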

Please contact us for more information about our 16S profiling service using the Illumina MiSeq.

Complete microbial genomes using only PacBio data? Testing HGAP…

We’ve spent some time recently testing a new way to assemble PacBio data called HGAP, which stands for “hierarchical genome assembly process”. Unlike previous assemblers of PacBio data, which have relied on Illumina and/or PacBio CCS reads for error correction of PacBio long reads, HGAP uses multiple alignments of all reads to perform the corrections, potentially eliminating the need for other libraries and data types. The corrected reads are assembled with an overlap-layout-consensus assembler (in this case Celera Assembler) to form contigs. More details about HGAP can be found here: https://github.com/PacificBiosciences/DevNet/wiki/Hierarchical-Genome-Assembly-Process-%28HGAP%29

We have evaluated HGAP on several of our projects and compared it to our assemblies of Illumina-corrected PacBio reads built with Celera Assembler. So far, the results have been very encouraging and we have seen significant improvement in many cases. The chart below shows several examples:

So the assemblies are more contiguous, but are the corrections good enough to generate accurate consensus sequence? In an attempt to verify the consensus accuracy of these HGAP assemblies for several Bordetella genomes, we aligned >240x coverage of 250bp Illumina MiSeq data to the HGAP-generated contigs and looked for discrepancies and SNPs using GATK. We found no cases of high-quality, passed-filter variants, which supports a highly accurate consensus sequence generated by the HGAP assembly.  We continue to test and compare HGAP with other PacBio assembly methods but are encouraged by initial results.
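As an illustration of the final check described above, the sketch below counts high-quality, passed-filter records in VCF-formatted variant calls. It is not our production pipeline: the quality threshold and the example records are hypothetical, and it assumes the calls have already been run through a filtering step that writes PASS into the FILTER column.

```python
def count_confident_variants(vcf_lines, min_qual=30.0):
    """Count VCF records with FILTER == PASS and QUAL >= min_qual.

    min_qual is a hypothetical threshold chosen for illustration.
    """
    count = 0
    for line in vcf_lines:
        # Skip blank lines and header lines, which start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qual, filt = fields[5], fields[6]  # fixed VCF columns 6 and 7
        if filt != "PASS" or qual == ".":
            continue
        if float(qual) >= min_qual:
            count += 1
    return count


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "contig1\t1042\t.\tA\tG\t55.2\tPASS\t.",
    "contig1\t2210\t.\tC\tT\t48.0\tLowQual\t.",
    "contig2\t87\t.\tG\tA\t12.5\tPASS\t.",
]
print(count_confident_variants(example_vcf))
```

A count of zero across all contigs corresponds to the result reported above: no high-quality, passed-filter discrepancies between the Illumina reads and the HGAP consensus.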

Options When Starting Material is Limiting

Sometimes it is not possible to come up with the amount of DNA or RNA required for a standard Illumina library prep. We are frequently asked what the options are when there is just not enough sample available.

There are several kits on the market now that allow Illumina libraries to be prepared from minimal amounts of starting material. We have processed clinical samples, metagenomic samples, and samples from FFPE tissues that yielded extremely low amounts of RNA or DNA.

For RNA samples, we have generated linearly-amplified cDNA with the Nugen Ovation v2 kit. An advantage of this kit is that the amplification of rRNA is somewhat suppressed, increasing the percentage of usable data. Starting with sub-nanogram amounts of RNA, we are able to generate micrograms of cDNA. We’ve tested various library preparation methods with the amplified cDNA, and we have found the Illumina TruSeq prep to work the best for us.

The Illumina Nextera system is an option available when DNA amounts are limiting. The Nextera XT DNA Sample Prep Kit requires exactly 1 ng of input material (best for plasmids or small genomes), and the Nextera DNA Sample Prep Kit requires exactly 50 ng of DNA. The library fragmentation is accomplished via transposon insertion events. We skip the normalization/denaturation portion of the protocol, and determine the quality and quantity of the libraries following our standard procedures. We have found that the library size distributions tend to vary, and can be much wider than those of our traditional Illumina DNA libraries, but this is still a great option when there is very little material available.

Contact us if you have questions or would like additional information.