eCOMPASS

evaluative Comparison of Multiple Protein Alignments by Statistical Score

Protein analyses based on statistical and machine learning methods require large and accurate multiple sequence alignments (MSAs). MSA methods are typically evaluated against benchmark alignments, but these usually contain very few sequences and are therefore inadequate for evaluating methods that align large numbers of sequences. eCOMPASS addresses this problem with a statistical measure of relative MSA accuracy based on direct coupling analysis (DCA). The underlying principle is that, to maintain structural integrity over evolutionary time, a substitution at one residue position typically results in compensating substitutions at other positions. Accordingly, eCOMPASS computes the statistical significance of the agreement between the most directly coupled residue pairs in an MSA and 3D contacts in corresponding structures. This provides a direct measure of MSA quality inasmuch as the agreement between pairs with high direct coupling (DC) scores and 3D contacts depends upon proper alignment of homologous residues.
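The idea of scoring agreement between top-ranked coupled pairs and structural contacts can be illustrated with a small Python sketch. This is not the eCOMPASS statistic or its code: as a simplified stand-in it uses a hypergeometric tail probability, and the function names, the input layout (a dict of pair scores and a set of contact pairs) and the toy values are all assumptions made for illustration.

```python
from math import comb


def hypergeom_tail(n_pairs, n_contacts, n_top, n_hits):
    """P(X >= n_hits) when n_top pairs are drawn at random from n_pairs,
    of which n_contacts are 3D contacts (hypergeometric upper tail)."""
    total = comb(n_pairs, n_top)
    upper = min(n_contacts, n_top)
    favourable = sum(
        comb(n_contacts, i) * comb(n_pairs - n_contacts, n_top - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / total


def contact_agreement(dc_scores, contacts, n_top):
    """Count how many of the n_top highest-scoring pairs are contacts and
    return (hits, tail probability). Both inputs are hypothetical:
    dc_scores maps (i, j) column pairs to a coupling score, and
    contacts is the set of (i, j) pairs in contact in a structure."""
    ranked = sorted(dc_scores, key=dc_scores.get, reverse=True)
    hits = sum(1 for pair in ranked[:n_top] if pair in contacts)
    n_contacts = sum(1 for pair in dc_scores if pair in contacts)
    return hits, hypergeom_tail(len(dc_scores), n_contacts, n_top, hits)


# Toy example: 4 alignment columns give 6 pairs, 2 of which are contacts.
toy_scores = {
    (0, 1): 0.9, (2, 3): 0.8, (0, 2): 0.2,
    (0, 3): 0.1, (1, 2): 0.3, (1, 3): 0.05,
}
toy_contacts = {(0, 1), (2, 3)}
hits, p_value = contact_agreement(toy_scores, toy_contacts, n_top=2)
print(hits, p_value)  # 2 hits, p = 1/15
```

Under this simplification, a better alignment of the same sequences would be expected to place more true contacts among its top-scoring pairs and so yield a smaller tail probability. The measure described in the eCOMPASS references is more elaborate than this sketch.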

eCOMPASS source code (v0.9):

Example input MSAs & structures

MBL-fold metallohydrolase domain

(cd06262)

(PF00753)

mbl_msa1.tar

mbl_msa2.tar

mbl_pdb.tar

Rhodanese Homology Domain

(cd00158)

Pleckstrin homology (PH) domain

(cd00900)

C2 domain

(cd00030)

PTS_IIB domain

(cd00133)

Sulfatase Domain

Cupredoxin domain

(cd00920)

Haloacid dehalogenase-like hydrolase domain

Contact aneuwald@som.umaryland.edu to compare CDD-derived MSAs to JackHMMER-generated MSAs for the following domains:

AAAplus AAT_1 ABHF ActinLike AmyAc C2 Cupredoxin DEATH DNA_BREc EEP FRIZZLED GLOBINS GNAT GTPASES GT_A HAD HATPase HDAC HELICASE_Ct LPLAT MBL NP_I NR_LBD PARB PH PKc PPBP PRTaseII PTP PTS PeptidS8S53 Peptidase_C19 RHOD RMS RPA2_OBF SORTASE UDG_LIKE ZnMP ZnMc tRNA_Synth

References:

Neuwald, A.F., Kolaczkowski, B.D. & Altschul, S.F. 2021. eCOMPASS: evaluative comparison of multiple protein alignments by statistical score. Bioinformatics btab374. doi: 10.1093/bioinformatics/btab374.

Neuwald, A.F. & Altschul, S.F. 2018. Statistical Investigations of Protein Residue Direct Couplings. PLoS Computational Biology 14(12):e1006237.

Neuwald, A.F., Lanczycki, C.J., Hodges, T.K. & Marchler-Bauer, A. 2020. Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments. Database. doi: 10.1093/database/baaa042.

Seemayer, S., Gruber, M. & Söding, J. 2014. CCMpred–fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 30(21):3128-30.

Funding:

National Institutes of Health, National Institute of General Medical Sciences grant R01GM125878