IGS to Host Pacific Biosciences User Group Meeting

The Institute for Genome Sciences is pleased to once again host the PacBio Americas East Coast UGM and Workshops in Baltimore from June 16-18, 2015.

The UGM is a day-long event on June 17th. An optional half-day bioinformatics workshop will be held on the afternoon of June 16th and the morning of June 18th, and an afternoon sample prep workshop will be held on June 16th. The bioinformatics workshop will cover large genome assembly with FALCON and analysis for the Iso-Seq™ method, among other topics. Special guests Jon Badalamenti and Sergey Koren will host on Tuesday and Thursday, respectively. The sample prep workshop will include talks on ultra-long DNA fragments, barcoding, targeted sequencing with solution-based capture, the Iso-Seq method, an intro to the SMRT Portal, and more.

View the agenda and register to attend.

GRC a PacBio Certified Service Provider; Co-sponsoring SMRTest Microbe Grant Program at ASM 2015

We are pleased to announce that the GRC is the first PacBio certified service provider on the East Coast. This recently announced program is a partnership between PacBio and select sequencing providers that have completed the certification process and offer the highest quality sequencing and analysis services using PacBio technology. We offer a full range of PacBio services, including whole genome sequencing, transcriptome sequencing via Iso-Seq, targeted amplicon sequencing, and other customized applications. Our analysis team has expertise in genome assembly and annotation, variant analysis, transcriptome analysis, and base modification detection. We look forward to continuing our strong partnership with PacBio and offering the highest quality sequencing and analysis to our customers and collaborators.

As part of this new partnership, the GRC is proud to co-sponsor the SMRTest Microbe Grant Program. One lucky winner will receive sequencing and analysis services from the GRC. To enter, submit a short grant application detailing your project and how it would benefit from the long reads and high consensus accuracy of SMRT Sequencing. The deadline for submissions is June 27, 2015.

For more information on our full range of sequencing and analysis services, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions or would like a quote.

If you are attending ASM 2015 next week, please stop by the IGS booth (#776) to learn more about the grant program and all of our sequencing and analysis services. See you in New Orleans!

Over 100 Publications Strong!

Less than eight years after beginning in an empty lab, our sequencing and analysis work has contributed to more than 100 peer-reviewed publications. The first, published in Nature, was a comparison of the gut microbiome in obese and lean twins (DOI: 10.1038/nature07540). Since then, data generated by the GRC have been used for many other types of studies. Check out our list of publications.

If you have a publication resulting from data generated by the GRC that isn’t listed here, please let us know!

Thanks to all of our collaborators on these exciting projects for making the GRC a success. We look forward to the next 100 publications!

GRC Posters Presented at AGBT 2014

Our posters this year highlight some of the work we’ve done over the past twelve months.

The first poster provides an overview of how changes to our PacBio pipeline have increased our sequencing yields and read lengths, resulting in finished, high-quality microbial genomes, assembled using only PacBio data.

The second poster demonstrates how Next Gen sequencing can be used to investigate host and pathogen associations in cases of pulmonary non-tuberculous mycobacterial (PNTM) infections.

For more information on our full range of sequencing and analysis services, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions.

Finishing Genomes with the PacBio RS II – Read our Core Lab Profile

The GRC, which offers services from sequencing library prep through genome assembly and downstream analysis, is generating complete bacterial genome sequences and methylation profiles using PacBio SMRT sequencing on the RS II. Several advancements in the library prep, sequencer, sequencing protocols, and data analysis software have all contributed to this.

To learn more about these breakthroughs and other emerging applications of SMRT sequencing, please read the PacBio Core Lab Profile showcasing the research performed at GRC and IGS here.

GRC and IGS offer not only cutting-edge sequencing, but a complete menu of services including assembly, annotation, and custom analyses. For more information about services offered, visit our Laboratory Services and Analysis Services pages. Please contact us if you have any questions.


GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms

Researchers from the Genomics Resource Center were significant contributors to the recently published paper “GAGE-B: An Evaluation of Genome Assemblers for Bacterial Organisms” which can be found here:

Following the standards set by the original GAGE assembly comparison (Salzberg, et al., 2012), GAGE-B (Genome Assembly Gold-standard Evaluation for Bacteria) evaluates how genome assemblers compare on a spectrum of bacterial genomes sequenced by the newest sequencing technologies.
The need to contain DNA preparation costs, particularly in comparison to sequencing costs, often results in the creation of only a single sequencing library, which frequently poses challenges during genome assembly. GAGE-B evaluates the following open source genome assemblers:

• ABySS v1.3.4 (Simpson, et al., 2009)
• CABOG v7.0 (Miller, et al., 2008)
• MIRA v3.4.0 (Chevreux, et al., 2004)
• MaSuRCA v1.8.3 (Zimin, et al., 2013)
• SGA v0.9.34 (Simpson and Durbin, 2012)
• SOAPdenovo2 v2.04 (including GapCloser) (Li, et al., 2010)
• SPAdes v2.3.0 (Bankevich, et al., 2012)
• Velvet v1.2.08 (Zerbino and Birney, 2008)

Here we highlight some exciting results using the data provided in the paper.

First, let’s take a look at which assembler generates the best assemblies of bacterial species from a single whole genome shotgun library. The HiSeq reads were 100 bp paired-end, with coverage levels ranging from 100x to 300x; the MiSeq reads were 250 bp paired-end, with 100x coverage for all samples.

Comparison of corrected N50 contig sizes for assemblies where the finished reference genome was identical or near-identical.

Comparison of N50 contig sizes for assemblies where the sequenced strain was too divergent to compute a corrected N50 value. All genomes shown here were assembled from 100bp HiSeq reads.

Overall, MaSuRCA and SPAdes produced the best assemblies across these twelve bacterial organisms. MaSuRCA had the largest contig sizes, measured by either N50 or corrected N50 values, for ten of the twelve genomes. The SPAdes assembler came in first for the other two genomes, and was a close second for an additional two organisms.
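For readers less familiar with the metric, here is a minimal sketch of how an N50 contig size can be computed from a list of contig lengths. The function name and the example lengths are ours and purely illustrative; they are not taken from the GAGE-B data. As we understand it, a corrected N50 is calculated the same way after contigs have been split at detected misassemblies, a step this sketch does not perform.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, for illustration only.
contigs = [500_000, 300_000, 100_000, 60_000, 40_000]
print(n50(contigs))
```

Because N50 depends only on the sorted lengths, it rewards a few long contigs and says nothing about whether they are correct, which is why the corrected value is the preferred comparison when a finished reference is available.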

Next, we compared the assemblies produced by the high-coverage, one-library strategy to the best assemblies created by a two-library sequencing strategy. In this experiment, one set of single-library sequence data consists of 101 bp paired-end HiSeq reads with 210x coverage, while the other consists of MiSeq 251 bp paired-end reads at 100x coverage. The two-library data set from the original GAGE study is composed of 101 bp reads generated with the Illumina Genome Analyzer II from one 180 bp fragment library and one 3000 bp jumping library (intended to span the repetitive areas of a genome). The GAGE ALLPATHS-LG assembly was generated with 31x sequence coverage of the short insert library and 29x sequence coverage of the jumping library; the GAGE MaSuRCA assembly was generated with 45x sequence coverage of the short insert library and 9x coverage from the jumping library.

Assemblies of R. sphaeroides using one versus two libraries.

The contigs created by both MaSuRCA and SPAdes from a single deep-coverage library were considerably larger than those from the two-library data, which was at lower coverage (100X).

However, scaffold analysis shows the other side of the coin:

The lack of long “jumping” pairs from a second library makes a very significant difference in the size of scaffolds, primarily because a single library of paired reads from relatively short fragments is not sufficient to span many of the repetitive sequences in a genome. In essence, the best scaffolds for the one-library assembly were less than 10% of the length of the biggest scaffold generated from the two-library strategy.

So what’s the takeaway? Overall, the results support the conclusion that, with deep sequence coverage, the latest genome assemblers can produce extraordinarily good de novo bacterial assemblies using sequence data from just a single, short-fragment DNA library. We always run multiple assemblies using multiple assemblers. For a single assembly attempt, MaSuRCA appears to be the best option for now. But genome assemblers are rapidly evolving with the changing sequencing landscape, and the best approach can change quickly. We are constantly testing, evaluating, and developing over here and will be sure to keep you posted…

Complete microbial genomes using only PacBio data? Testing HGAP…

We’ve spent some time recently testing a new way to assemble PacBio data called HGAP, which stands for “hierarchical genome assembly process”. Unlike previous assemblers of PacBio data, which have relied on Illumina and/or PacBio CCS reads for error correction of PacBio long reads, HGAP uses multiple alignments of all reads to perform the corrections, potentially eliminating the need for other libraries and data types. The corrected reads are assembled with an overlap-layout-consensus assembler (in this case Celera Assembler) to form contigs. More details about HGAP can be found here: https://github.com/PacificBiosciences/DevNet/wiki/Hierarchical-Genome-Assembly-Process-%28HGAP%29

We have evaluated HGAP on several of our projects and compared it to our assembly of Illumina-corrected PacBio reads assembled with Celera Assembler. So far, the results have been very encouraging and we have seen significant improvement in many cases. The chart below shows several examples:

So the assemblies are more contiguous, but are the corrections good enough to generate accurate consensus sequence? To verify the consensus accuracy of these HGAP assemblies for several Bordetella genomes, we aligned >240x coverage of 250 bp Illumina MiSeq data to the HGAP-generated contigs and looked for discrepancies and SNPs using GATK. We found no high-quality, passed-filter variants, which indicates that the HGAP assemblies produce a highly accurate consensus sequence. We continue to test and compare HGAP with other PacBio assembly methods but are encouraged by initial results.
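The last step of a check like this amounts to counting passed-filter records in the variant calls. The sketch below is not our production pipeline: it assumes the calls are available as plain-text VCF lines, the function name is ours, and the example records are made up for illustration. It relies only on the VCF convention that the seventh tab-separated column (FILTER) reads PASS for records that pass all filters.

```python
def count_pass_variants(vcf_lines):
    """Count VCF records whose FILTER column (7th field) is PASS."""
    count = 0
    for line in vcf_lines:
        # Skip blank lines, meta-information lines (##) and the header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            count += 1
    return count


# Made-up records for illustration; not real calls from our Bordetella data.
example_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "contig1\t1042\t.\tA\tG\t12.3\tLowQual\t.\n",
    "contig1\t5310\t.\tC\tT\t8.9\tLowQual\t.\n",
]
print(count_pass_variants(example_vcf))
```

A count of zero against deep short-read coverage is the outcome described above: no position where the Illumina data confidently disagrees with the assembled consensus.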

AGBT Whole Genome Capture Poster

IGS also presented a poster about custom capture at this year’s AGBT meeting. The poster below presents data demonstrating that custom capture can be an effective way to sequence entire genomes of obligate intracellular parasites that cannot be grown independently, including such organisms isolated from field samples.

AGBT Assembler Comparisons Poster

One of the posters IGS presented at AGBT 2013 involved a thorough comparison of sequencing platforms and assembly strategies across 5 microbial species that vary in both genome size and GC content. The conclusions from this study aim to inform future large-scale microbial projects and aid in efficiency and project design.

Please click on the link below to view a PDF image of the actual poster.

AGBT 2013 Assembly Comparison Poster


Assembler Comparisons

Over the past few months, several members of the GRC bioinformatics team have been working diligently on testing a variety of assemblers and analyzing results. The assembler testing is intended to help critically evaluate the results and performance of some of the more popular de novo assemblers. Similar studies have been done before (such as: http://gage.cbcb.umd.edu/), but we aim to expand upon those studies by testing on different organisms and data types. To that end, WGS data generated at IGS, from many samples and across multiple species (such as E. coli, V. cholerae, S. aureus and M. massiliense), have been assembled at multiple coverage levels using assemblers such as Celera Assembler, MSRCA, Velvet, SOAPdenovo and ABySS. In addition, the samples have been sequenced using various NGS platforms, including Illumina HiSeq, Illumina MiSeq and PacBio. These data types will be assembled in different combinations and as stand-alone assemblies to gauge the effects of hybrid assemblies of different data types and combinations. We hope to have lots of stats compiled in the very near future.
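Assembling the same sample at several coverage levels generally means randomly downsampling the reads to each target depth first. The sketch below shows one way this could be done; it is not a description of our actual tooling. The function names, the 5 Mb genome size, and the read counts are hypothetical. Coverage is estimated as number of reads times read length divided by genome size.

```python
import random


def sampling_fraction(n_reads, read_length, genome_size, target_coverage):
    """Fraction of reads to keep so that expected coverage equals the target."""
    current_coverage = n_reads * read_length / genome_size
    if target_coverage > current_coverage:
        raise ValueError("target coverage exceeds the available coverage")
    return target_coverage / current_coverage


def subsample(reads, fraction, seed=0):
    """Keep each read (or read pair) independently with the given probability."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return [read for read in reads if rng.random() < fraction]


# Hypothetical numbers: 10 million 100 bp reads from a 5 Mb genome is 200x.
fraction = sampling_fraction(10_000_000, 100, 5_000_000, target_coverage=50)
print(fraction)

# Toy stand-in for real reads; each item would be one read pair in practice.
kept = subsample(range(1000), fraction)
print(len(kept))
```

For paired-end data, each item passed to subsample should represent a whole pair so that mates are kept or dropped together.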