GRC Spring 2015 Update

The Genomics Resource Center (GRC) continues to expand its capabilities and project portfolio. As part of our contract with the U.S. Food and Drug Administration (FDA) to sequence, assemble, and annotate pathogens in support of the development and expansion of a comprehensive, curated public reference database, we are developing a new pipeline for Ebola virus sequencing and analysis. We have also initiated several new projects to sequence large animal and plant genomes using the Pacific Biosciences platform. These larger projects were made possible by our recent upgrade to the new P6-C4 chemistry. This new chemistry, combined with improved software, has increased read lengths by more than 30% and doubled overall throughput. In June, we will host the Pacific Biosciences East Coast User Group Meeting for the third consecutive year. Please join us to hear about this exciting technology and its expanding applications.

Our Illumina platform continues to improve as well. In April, we will take delivery of our first HiSeq 4000. This sequencer, the newest announced by Illumina, will increase throughput by 50% while reducing run time by an additional 50%. Each HiSeq 4000 will be capable of sequencing 24 human genomes per week. We have also expanded our MiSeq repertoire with the installation of a MiSeq Dx in our CLIA facility for clinical sequencing applications.

The GRC will be hosting a booth at the annual American Society for Microbiology (ASM) general meeting in New Orleans from May 30 – June 2, 2015. If you’re there, please stop by to visit and learn more about our services and capabilities!

Q&A with the Co-Directors of the GRC

How do I initiate a project with GRC?

It’s easy! Contact us via our website (www.igs.umaryland.edu/grc) or email (grc-info@som.umaryland.edu) and we will set up an initial consultation with you. During this consultation, we will discuss your project goals and expectations and advise on experimental design. From there, we develop a project plan that includes sample requirements, timelines, cost estimates, and deliverables. For large, long-term projects, we schedule additional discussions to finalize the project plan and monitor progress.

How long does it take? How much will it cost?

These are the most common questions we hear, but they are often difficult to answer. Depending on the scope and scale of the project, the timeline can vary from a few weeks to months or even years. Similarly, costs can fall within a wide range. We treat each project separately and develop the best estimates of cost and timeline as part of our consultation with each investigator.

Do you offer analysis, or only sequencing?

We do it all – from project design through sequencing and analysis. We have bioinformatics teams specialized in genome assembly, variant analysis, metagenomics, transcriptomics, and epigenomic analysis. If you are interested in analysis, we include that as part of the project consultation and project plan.

Click here to find the full IGS Spring 2015 newsletter as well as previous editions.

Automated Preparation of Long Insert PacBio Libraries

The GRC sequences thousands of microbial samples each year. The high throughput of our sequencers and the small genome sizes of these samples mean that many libraries are needed to keep our sequencers running at full capacity. Automating library preps is key to keeping instruments busy, reducing error, and maximizing the productivity of our lab staff.

We have used automation to prepare large batches of libraries for Illumina sequencing for several years. Earlier this year, we began testing preparation of long-insert libraries for our PacBio RS II platform.

Our Biomek FXP (Beckman Coulter) is a Dual Multi-channel Span-8 with an integrated thermal cycler. The combination of a 96-channel pipetting head and a Span-8 pipetting head (which has eight independently controlled channels) allows us to prepare up to 96 libraries at a time while minimizing master mix dead volumes.

A script for preparing SMRTbell libraries, developed by Todd Hartley at NCI/SAIC-Frederick, was downloaded from the Pacific Biosciences SMRT Community. We modified the protocol to accommodate the specific tips and reagent tubes our lab uses, and to optimize reaction mixing.

 

Deck Layout

[Figure: Biomek deck layout (PB_Biomek_Deck)]

Prior to being loaded on the Biomek, samples are sheared with g-TUBEs (Covaris, Woburn, MA), targeting an average fragment size of 20kb. Master mixes for each reaction are prepared and placed in the robot. The following steps are performed by the Biomek:

  • DNA damage repair
  • End Repair
  • Ampure clean up
  • SMRTbell Ligation
  • Exonuclease
  • Ampure clean up

Once the run is complete, the libraries are removed and size-selection is performed using the BluePippin (Sage Science, Beverly, MA).

                                            Manual preps   Robotic preps
Number of libraries (n)                     91             27
Average input amount (ng of sheared gDNA)   4945           4951
Average library size (bp)                   18507          18910
Average library concentration (nM)          5.1            3.5
Average recovery (ng)                       902.8          707.4
Average recovery (%)                        18.7           14.2

The data above compare libraries prepared manually and on the Biomek from March 2014 to date. Although yields from the Biomek preps are somewhat lower than those from manual preps (14.2% versus 18.7% average recovery), they are sufficient for sequencing multiple SMRT cells per library.
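For readers who want to reproduce the recovery figures on their own preps, here is a minimal sketch of the calculation. The function name is hypothetical, and the example simply divides the table's average recovery by its average input. That ratio of averages need not match the table's "Average recovery (%)" row exactly, which was presumably averaged per library.

```python
def percent_recovery(input_ng: float, recovered_ng: float) -> float:
    """Return library recovery as a percentage of the input DNA mass."""
    if input_ng <= 0:
        raise ValueError("input_ng must be positive")
    return 100.0 * recovered_ng / input_ng


# Ratio of the averages reported in the table (not per-library values).
manual = percent_recovery(4945, 902.8)
robotic = percent_recovery(4951, 707.4)
print(f"manual: {manual:.1f}%  robotic: {robotic:.1f}%")
```

Computing the percentage per library and then averaging is usually the more informative summary, because it weights each prep equally regardless of input amount.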

For more information on our full range of sequencing and analysis services, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions.

 

Single Molecule Sequencing and Genome Assembly of a Clinical Specimen of Loa loa

Scientists Apply Successful Single Molecule Sequencing and de novo Genome Assembly to a Parasitic Worm that Infects Human Eyes and Skin

Investigators at the Institute for Genome Sciences (IGS) at the University of Maryland School of Medicine and the Laboratory of Parasitic Diseases at the National Institute of Allergy and Infectious Diseases (NIAID) at the National Institutes of Health (NIH) used the long-read, single-molecule Pacific Biosciences platform to sequence and assemble de novo the genome of Loa loa roundworms from a clinical sample. Their research, which generated the most complete genome sequence of a filarial nematode produced to date, provides a more comprehensive reference genome for this parasite, with the aim of developing better molecular diagnostics to decrease morbidity from filarial nematodes. Their findings appear in today’s issue of BMC Genomics.

Click here to access the abstract and complete article.

GRC Awarded Contract to Expand FDA Microbial Genome Database

IGS and the GRC have been awarded a contract to assist the U.S. Food and Drug Administration (FDA) in the expansion and curation of a public database of microbial genome sequences and associated metadata. This database will serve as a valuable reference for evaluating and assessing diagnostic devices based on high-throughput sequencing. In addition to all publicly available microbial genome sequences, the database will include more than 550 newly sequenced, assembled, and annotated genomes from under-represented branches of the phylogenetic tree. For more information on the project, please click here or contact the GRC.

GRC Posters Presented at AGBT 2014

This year, our posters highlight some of the work we’ve done over the past twelve months.

The first poster provides an overview of how changes to our PacBio pipeline have increased our sequencing yields and read lengths, resulting in finished, high-quality microbial genomes assembled using only PacBio data.

The second poster demonstrates how next-generation sequencing can be used to investigate host and pathogen associations in cases of pulmonary non-tuberculous mycobacterial (PNTM) infections.


PacBio Pipeline Off to a Strong Start for 2014

It has been a busy January for our PacBio RS II instrument. We are excited to report a new record yield from a single SMRT cell: 896,457,524 passed-filter bases! It seems we are not far off from hitting 1 Gb.

 

Some more stats from this cell:

                Mean Read Length: 8391 bp

                P50 Subread Length: 6187 bp

                P90 Subread Length: 12314 bp

                P95 Subread Length: 14032 bp

                Maximum Subread Length: 24585 bp
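Summary statistics like these can be computed from a list of read lengths. The sketch below is only an illustration: it assumes that P50, P90, and P95 denote plain percentiles of the length distribution (the post does not say how they were computed), it uses the nearest-rank percentile definition, and the input lengths are made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of values <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(lengths):
    """Return yield and read-length statistics for one SMRT cell."""
    return {
        "total_bases": sum(lengths),
        "mean": sum(lengths) / len(lengths),
        "p50": percentile(lengths, 50),
        "p90": percentile(lengths, 90),
        "p95": percentile(lengths, 95),
        "max": max(lengths),
    }


# Made-up lengths (bp), for illustration only.
lengths = [1200, 3400, 5100, 6200, 7800, 9500, 12300, 14000, 18000, 24500]
for name, value in summarize(lengths).items():
    print(f"{name}: {value}")
```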

 

We have come a long way in the past year. Here is a comparison of yields and mean read lengths of our top 20 SMRT cells in January 2013, compared to our top 20 SMRT cells so far in 2014:

The increases in both SMRT cell yields and read lengths are making PacBio an attractive option for sequencing and finishing microbial genomes. We are excited to see where 2014 will take us!


Finishing Genomes with the PacBio RS II – Read our Core Lab Profile

The GRC, which offers services from sequencing library prep through genome assembly and downstream analysis, is generating complete bacterial genome sequences and methylation profiles using PacBio SMRT sequencing on the RS II. Several advancements in the library prep, sequencer, sequencing protocols, and data analysis software have all contributed to this.

To learn more about these breakthroughs and other emerging applications of SMRT sequencing, please read the PacBio Core Lab Profile showcasing the research performed at GRC and IGS here.

GRC and IGS offer not only cutting-edge sequencing, but a complete menu of services including assembly, annotation, and custom analyses. For more information about services offered, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions.

 

Increasing PacBio RS II SubRead Lengths

Although the latest SMRTcell was designed to shift the loading bias toward longer fragments, preferential loading of smaller fragments still often limits the potential of long-insert libraries (10-20 kb).

One solution is to remove small fragments from the libraries. For this purpose, we evaluated the Blue Pippin (Sage Science, Inc., Beverly, MA), an automated electrophoresis system that separates and collects DNA fragments by size.

In order to measure the increase in subread length, long insert libraries were prepared with fragments larger than 4 kb or 7 kb isolated using the Blue Pippin and a 0.75% Agarose Gel Cassette (BLF7510) and compared to a library without Blue Pippin size selection. As shown below, the removal of smaller library fragments prior to sequencing increases the average length of the library fragments loaded into ZMWs on the SMRTcell.

In addition to longer subreads, there is also a boost to the amount of data generated per ZMW. As the fragment length increases, the percentage of SMRTbell adapter sequence decreases and the percentage of library insert increases. The graph below shows the average number of passed-filter bases per active ZMW versus the average fragment length of each library. Using Blue Pippin size selection, we have achieved yields of >500 M passed filter bases from individual SMRTcells.
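The relationship between fragment length and adapter overhead can be sketched with a simple model. Assume that one lap of the polymerase around a SMRTbell template reads the insert twice (once per strand) and each of the two adapters once. The 45 bp adapter length used here is a placeholder for illustration, not a value from this post.

```python
ADAPTER_LENGTH_BP = 45  # assumed adapter length, for illustration only


def insert_fraction(insert_bp: int, adapter_bp: int = ADAPTER_LENGTH_BP) -> float:
    """Fraction of sequenced bases that come from the insert rather than adapters.

    One lap around the template covers 2 * insert + 2 * adapter bases,
    which simplifies to insert / (insert + adapter).
    """
    if insert_bp <= 0 or adapter_bp < 0:
        raise ValueError("insert_bp must be positive and adapter_bp non-negative")
    return insert_bp / (insert_bp + adapter_bp)


for size in (1000, 4000, 7000, 20000):
    print(f"{size:>6} bp insert: {100 * insert_fraction(size):.2f}% insert bases")
```

Under this model the insert fraction always rises with fragment length, though the gain flattens out once inserts are much longer than the adapter.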

Below are the sequencing and assembly results of four genomes sequenced from long-insert, Blue Pippin size-selected libraries. Using only PacBio long subread data, we were able to assemble complete microbial genomes for three of the four isolates. Even with only a single under-loaded and low-yield SMRTcell, the remaining isolate still resulted in a nearly complete genome assembly with 10 total contigs and >60% of the genome assembled in the largest contig.


GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms

Researchers from the Genomics Resource Center were significant contributors to the recently published paper “GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms” which can be found here:

http://bioinformatics.oxfordjournals.org/content/early/2013/05/10/bioinformatics.btt273.long

Following the standards set by the original GAGE assembly comparison (Salzberg, et al., 2012), GAGE-B (Genome Assembly Gold-standard Evaluation for Bacteria) evaluates how genome assemblers compare on a spectrum of bacterial genomes sequenced by the newest sequencing technologies.

The need to contain DNA preparation costs, particularly in comparison to sequencing costs, often results in the creation of only a single sequencing library, which frequently poses challenges during genome assembly. GAGE-B evaluates the following open source genome assemblers:

  • ABySS v1.3.4 (Simpson, et al., 2009)
  • CABOG v7.0 (Miller, et al., 2008)
  • MIRA v3.4.0 (Chevreux, et al., 2004)
  • MaSuRCA v1.8.3 (Zimin, et al., 2013)
  • SGA v0.9.34 (Simpson and Durbin, 2012)
  • SOAPdenovo2 v2.04 (including GapCloser) (Li, et al., 2010)
  • SPAdes v2.3.0 (Bankevich, et al., 2012)
  • Velvet v1.2.08 (Zerbino and Birney, 2008)

Here we highlight some exciting results using the data provided in the paper.

First, let’s take a look at which assembler generates the best assemblies of bacterial species from a single whole genome shotgun library. The HiSeq reads were 100 bp paired-end, with coverage levels ranging from 100-300x; the MiSeq reads were 250 bp paired-end, with 100x coverage for all samples.

Comparison of corrected N50 contig sizes for assemblies where the finished reference genome was identical or near-identical.

Comparison of N50 contig sizes for assemblies where the sequenced strain was too divergent to compute a corrected N50 value. All genomes shown here were assembled from 100bp HiSeq reads.

Overall, MaSuRCA and SPAdes produced the best assemblies across these twelve bacterial organisms. MaSuRCA had the largest contig sizes, measured by either N50 or corrected N50 values, for ten of the twelve genomes. The SPAdes assembler came in first for the other two genomes, and was a close second for an additional two organisms.
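For readers unfamiliar with the metric, N50 is the contig length at which contigs of that size or larger contain at least half of the assembled bases. A minimal sketch of the standard calculation follows; the function name and the example lengths are illustrative. The corrected N50 used in GAGE applies the same calculation after contigs have been split at assembly errors identified against the reference genome.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines N50.
        if running >= half_total:
            return length


# Made-up contig lengths (bp), for illustration only.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

Because a few long contigs can dominate the sum, N50 rewards contiguity rather than correctness, which is why the comparison uses the corrected value wherever a suitable reference exists.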

Next, we compared the assemblies produced by the high-coverage, one-library strategy to the best assemblies created by a two-library sequencing strategy. In this experiment, one set of single-library sequence data consists of 101 bp paired-end HiSeq reads at 210x coverage, while the other consists of 251 bp paired-end MiSeq reads at 100x coverage. The two-library data set from the original GAGE study comprises 101 bp reads generated by sequencing one 180 bp fragment library and one 3000 bp jumping library (intended to span the repetitive areas of a genome) with the Illumina Genome Analyzer II. The GAGE ALLPATHS-LG assembly was generated with 31x sequence coverage of the short insert library and 29x sequence coverage of the jumping library; the GAGE MaSuRCA assembly was generated with 45x sequence coverage of the short insert library and 9x coverage from the jumping library.
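The coverage figures quoted here follow from a simple relationship: coverage equals total sequenced bases divided by genome size. The sketch below illustrates it under a stated assumption: the 4.6 Mb genome size is an approximate placeholder for R. sphaeroides, not a figure from this post, and the function names are hypothetical.

```python
import math

GENOME_SIZE_BP = 4_600_000  # assumed approximate genome size, for illustration


def coverage(n_reads: int, read_len: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Average depth of coverage from a number of reads of equal length."""
    return n_reads * read_len / genome_size


def read_pairs_needed(target_cov: float, read_len: int,
                      genome_size: int = GENOME_SIZE_BP) -> int:
    """Smallest number of read pairs that reaches the target coverage."""
    return math.ceil(target_cov * genome_size / (2 * read_len))


hiseq_pairs = read_pairs_needed(210, 101)
miseq_pairs = read_pairs_needed(100, 251)
print(f"HiSeq 101 bp pairs for 210x: {hiseq_pairs:,}")
print(f"MiSeq 251 bp pairs for 100x: {miseq_pairs:,}")
print(f"Check: {coverage(2 * hiseq_pairs, 101):.1f}x")
```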

Assemblies of R. sphaeroides using one versus two libraries.


The contigs created by both MaSuRCA and SPAdes from a single deep-coverage library were considerably larger than those from the two-library data, which was at lower coverage (100X).

However, scaffold analysis shows the other side of the coin:

The lack of long “jumping” pairs from a second library makes a very significant difference in the size of scaffolds, primarily because a single library of paired reads from relatively short fragments is not sufficient to span many of the repetitive sequences in a genome. In essence, the best scaffolds for the one-library assembly were less than 10% of the length of the biggest scaffold generated from the two-library strategy.

So what’s the takeaway? Overall, the results support the conclusion that, with deep sequence coverage, the latest genome assemblers can produce extraordinarily good de novo bacterial assemblies using sequence data from just a single, short-fragment DNA library. We always run multiple assemblies using multiple assemblers. For a single assembly attempt, MaSuRCA appears to be the best option for now. However, genome assemblers are evolving rapidly with the changing sequencing landscape, and the best approach can change quickly. We are constantly testing, evaluating, and developing over here and will be sure to keep you posted…