Bioinformatics - Projects

Project | Principal Investigator
NeMO: Neuroscience Multi-Omic Archive | Seth Ament PhD, Owen White PhD
GCID: Genome Center for Infectious Disease | Claire M. Fraser PhD, David Rasko PhD, Owen White PhD
ECO: Evidence and Conclusion Ontology | Michelle Gwinn-Giglio PhD
Analysis Engine | Michelle Gwinn-Giglio PhD
Transformative Research Award (TR01) | Julie Dunning Hotopp PhD
NSF Funded: ABI: Development | Julie Dunning Hotopp PhD
SpeciateIT: A Fast Clustering-Free 16S Ribosomal RNA Gene Sequence Taxonomic Assignment Tool | Jacques Ravel PhD
VIRGO, a comprehensive non-redundant gene catalog | Jacques Ravel PhD
DO: Disease Ontology | Lynn Schriml PhD
DCPPC: The NIH Data Commons Pilot Phase Consortium | Owen White PhD
HMP DACC | Owen White PhD

NeMO: Neuroscience Multi-Omic Archive

As part of the NIH BRAIN Initiative, researchers at IGS have developed the Neuroscience Multi-Omic (NeMO) Archive, which focuses on the storage and dissemination of omic data from the BRAIN Initiative and related brain research projects. The repository provides a catalog of omics data from cells in the mammalian brain together with rich metadata about the studies. The NeMO Portal provides a query interface for users to explore this vast store of data.


GCID: Genome Center for Infectious Disease

This project explores the dynamic interactions between pathogens, hosts, microbiota, the immune system, and the environment, with the goal of providing a comprehensive understanding of the determinants of infectious disease. The project includes work on bacteria (Escherichia coli), fungi (Candida, Aspergillus, and the species that cause mucormycosis), and eukaryotic parasites (Plasmodium and Brugia).

ECO: Evidence and Conclusion Ontology

The Evidence and Conclusion Ontology (ECO) contains terms that describe the types of evidence used in the process of biocuration. Capturing this information systematically with an ontology allows tracking of annotation provenance, establishment of quality control measures, and querying of evidence. ECO contains over 1500 terms and is in use by many leading biological resources, including the Gene Ontology, UniProt, and several model organism databases. ECO is continually being expanded and revised based on the needs of the biocuration community.

Analysis Engine

It has become relatively easy to acquire the genome sequence of prokaryotic organisms. However, there are still few options available for doing systematic, complete annotation of the whole genome using a robust annotation pipeline. The IGS Analysis Engine provides comprehensive automated annotation along with all underlying search data as well as tools for visualization and (optional) manual curation. In addition to single genome annotation, we also offer comparative analysis of multiple genomes with an associated visualization tool. These services are provided on a fee-for-service basis.

Transformative Research Award (TR01): Extent & Significance of Bacterial DNA Integrations in the Human Cancer Genome

The integration of exogenous DNA into the human genome can cause somatic mutations associated with oncogenesis. For example, the insertion of HPV DNA into human chromosomes is the single most important event leading to tumorigenesis in cervical cancer, and it is now preventable with vaccines against HPV. In contrast to viral DNA integrations, the instances and repercussions of bacterial DNA integration into the somatic human genome are less clear. This project has three objectives aimed at addressing this gap in knowledge. First, virtual machines will be developed for LGTSeek and LGTView, bioinformatics tools that we have used previously to detect bacterial DNA integrations in human genome sequencing projects. LGTSeek and LGTView will be used to further interrogate publicly available cancer genome data from tissues where such integrations are likely to occur because they are exposed to the microbiome (e.g., colon). Second, genome and transcriptome sequencing will be undertaken on new stomach adenocarcinoma and acute myeloid leukemia samples in order to reproduce previous results that suggest the presence of bacterial DNA integrations. This sequencing will include control samples spiked with exogenous bacterial nucleic acids in order to quantify the formation of chimeras by modern sequencing techniques. Third, the effect that previously detected bacterial DNA integrations have on transcription will be interrogated using luciferase reporter constructs and the CRISPR/Cas9 system. Collectively, this research is expected to improve our understanding of the extent and significance of bacterial DNA integrations in the somatic human genome.


NSF Funded: ABI: Development: Cloud-based Identification and Visualization of Lateral Gene Transfers in Genome Data

All genomes accumulate mutations that are both beneficial and detrimental to the organism. The best understood mutations are those that involve alteration, insertion, or deletion of a single base pair, for which there are numerous tools for identifying and validating such changes. Yet in many organisms, it is increasingly appreciated that large, even massive, insertions of DNA from other organisms can occur. These insertions, termed lateral gene transfers, have the potential to profoundly affect the organism, either detrimentally or beneficially. For example, large insertional mutations led to the transition of endosymbionts to organelles like mitochondria and chloroplasts. This project seeks to improve tools previously developed to identify such lateral gene transfers from genome sequencing data, and to make these tools available to the research community after ensuring that they are more robust and user friendly. In addition, this project seeks to develop YouTube whiteboard videos to educate the general public about these mutations, genomics, and the tools developed in this project.


SpeciateIT: A Fast Clustering-Free 16S Ribosomal RNA Gene Sequence Taxonomic Assignment Tool

Clustering of sequences into Operational Taxonomic Units (OTUs) has become a mainstream approach to facilitate taxonomic classification of large numbers of 16S rRNA gene sequences. This is partly due to the high computational requirements for processing each sequence in increasingly large datasets. A primary focus of the field has been development and improvement of OTU-based sequence clustering methods that rely on distances between each pair of sequences in a dataset. Following OTU-based clustering, representative sequences are commonly classified using tools such as the RDP Naïve Bayesian Classifier (Wang et al., 2007), and the resulting classification is transitively assigned to all sequences comprising that OTU. However, problems with this strategy exist (Nguyen et al., 2016). We have developed speciateIT, a novel per-sequence taxonomic assigner that quickly and accurately classifies millions of 16S rRNA gene sequences using higher-order Markov chain models built from a user-specified set of reference sequences, and therefore does not require OTU clustering.
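The general idea of per-sequence classification with higher-order Markov chain models can be illustrated with a minimal Python sketch. This is not the speciateIT implementation: the function names, the model order, the pseudocount smoothing, and the toy reference sequences below are all assumptions made for illustration. One model is trained per taxon from its reference sequences, and each query is assigned to the taxon whose model gives it the highest log-likelihood, with no clustering step.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_model(sequences, order=3, pseudocount=1.0):
    """Estimate log P(base | previous `order` bases) from reference sequences.

    Hypothetical helper for illustration; smoothing with a fixed pseudocount
    is an assumption, not a documented speciateIT setting.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][base] += 1.0

    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
        model[context] = {
            b: math.log((base_counts.get(b, 0.0) + pseudocount) / total)
            for b in ALPHABET
        }
    return model


def log_likelihood(seq, model, order=3):
    """Sum the log transition probabilities of `seq` under one taxon's model."""
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if base not in ALPHABET:
            continue
        total += model.get(context, {}).get(base, uniform)
    return total


def classify(seq, models, order=3):
    """Return the taxon whose model assigns `seq` the highest log-likelihood."""
    return max(models, key=lambda taxon: log_likelihood(seq, models[taxon], order))


# Toy reference set (made-up sequences, not real 16S data).
reference = {
    "taxon_a": ["ACACACACACACACACACAC"],
    "taxon_b": ["GGTTGGTTGGTTGGTTGGTT"],
}
models = {taxon: train_markov_model(seqs, order=2) for taxon, seqs in reference.items()}
print(classify("ACACACACAC", models, order=2))
```

Because each query is scored independently against the trained models, the cost grows linearly with the number of sequences, which is the property that makes a clustering-free approach attractive for large datasets.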


VIRGO, a comprehensive non-redundant gene catalog

Analysis of metagenomic and metatranscriptomic data is complicated and typically requires extensive computational resources. Leveraging a curated reference database of genes encoded by members of the target microbiome can make these analyses more tractable. Unfortunately, there is no such reference database available for the vaginal microbiome. In this project, we assembled a comprehensive human vaginal non-redundant gene catalog (VIRGO) from 264 vaginal metagenomes and 416 genomes of urogenital bacterial isolates. VIRGO includes 0.95 million non-redundant genes compiled from a total of 5.5 million genes belonging to 318 unique bacterial species. We show that VIRGO covers more than 95% of the vaginal bacterial gene content in metagenomes from North American, African, and Chinese women. The gene catalog was extensively functionally annotated using 17 diverse protein databases, and, importantly, taxonomy was assigned through in silico binning of genes derived from metagenomic assemblies. To further enable focused analyses of individual genes and proteins, we also clustered the non-redundant genes into vaginal orthologous groups (VOG). The gene-centric design of VIRGO and VOG provides an easily accessible tool to comprehensively characterize the structure and function of vaginal metagenome and metatranscriptome datasets. To highlight the utility of VIRGO, we analyzed 1,507 additional vaginal metagenomes, uncovering a previously undetected high degree of intraspecies diversity within and across vaginal microbiota. VIRGO offers a convenient reference database and toolkit that will facilitate a more in-depth understanding of the role of vaginal microorganisms in women’s health and reproductive outcomes.


DO: Disease Ontology

The Disease Ontology (DO) has been developed as a standardized ontology for human disease. Its purpose is to provide the biomedical community with consistent, reusable, and sustainable descriptions of human disease terms, phenotype characteristics, and related medical vocabulary disease concepts. It is a collaborative effort of researchers at Northwestern University, Center for Genetic Medicine and the University of Maryland School of Medicine, Institute for Genome Sciences. The Disease Ontology semantically integrates disease and medical vocabularies through extensive cross-mapping of DO terms to MeSH, ICD, the NCI Thesaurus, SNOMED, and OMIM.


DCPPC: The NIH Data Commons Pilot Phase Consortium

Massive quantities of high-throughput biological data of many types have been generated. Currently, it is difficult for most researchers to combine, query, and analyze data from diverse sources. The NIH Data Commons will provide a cloud-based platform to house data and analysis workflows from NIH-funded projects. Ultimately, this resource will provide a storage and computing environment in which researchers can store, share, access, and analyze data, resulting in new hypothesis generation and discovery. IGS is part of the NIH Data Commons Pilot Phase Consortium, which is charged with producing an initial implementation of this cloud resource. As a member of the DCPPC, we produce a public-facing web resource for the project and contribute to infrastructure and metadata harmonization efforts.


HMP DACC

The NIH Common Fund initiated the Human Microbiome Project (HMP) to explore the microbial communities of the human host and characterize their role in human health and disease. The initial five-year phase of the effort established a baseline of data from a large sample of healthy subjects, explored changes in community composition with disease states, and provided resources for the community to use in human microbiome research. A second phase of the effort, called the Integrated Human Microbiome Project (iHMP), focused on three particular conditions in human health: onset of type 2 diabetes, inflammatory bowel disease, and pregnancy/pre-term birth. In contrast to the initial phase of the HMP, in this phase many different types of omics approaches were carried out on both the microbiome and the host in order to provide a more systems-level view of human-microbe interactions. IGS is the data coordination center for both of these HMP efforts.