evaluative Comparison of Multiple Protein Alignments by Statistical Score

Protein analyses based on statistical and machine learning methods require large and accurate multiple sequence alignments (MSAs). MSA methods are typically evaluated against benchmark alignments, but these usually contain very few sequences and are therefore inadequate for evaluating methods that align large numbers of sequences. eCOMPASS addresses this problem with a statistical measure of relative MSA accuracy based on direct coupling analysis (DCA). DCA rests on the observation that, to maintain structural integrity over evolutionary time, substitutions at one residue position are typically accompanied by compensating substitutions at other positions. Based on this principle, eCOMPASS computes the statistical significance of the agreement between the most strongly directly coupled residue pairs in an MSA and the 3D contacts in corresponding structures. This provides a direct measure of MSA quality, because the agreement between high DC-scoring pairs and 3D contacts depends on proper alignment of homologous residues.
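The idea of scoring the agreement between top-ranked DC pairs and 3D contacts can be illustrated with a minimal Python sketch. This is not the statistic implemented in eCOMPASS (see the references for that); it substitutes a plain hypergeometric tail probability, and the function name contact_enrichment, its arguments, and the toy data are hypothetical. It also assumes that every column pair is equally likely to be a contact a priori and that the ranked pairs are distinct.

```python
from math import comb


def contact_enrichment(ranked_pairs, contacts, n_columns, top_n):
    """Return (k, p): the number of 3D contacts among the top_n DC-ranked
    column pairs, and the hypergeometric probability of seeing at least
    k contacts by chance.

    Simplified stand-in for illustration only; NOT the eCOMPASS statistic.
    """
    def canon(pair):
        # Treat (i, j) and (j, i) as the same column pair.
        return (min(pair), max(pair))

    contact_set = {canon(p) for p in contacts}
    top = [canon(p) for p in ranked_pairs[:top_n]]

    k = sum(1 for p in top if p in contact_set)  # observed hits
    N = comb(n_columns, 2)                       # all possible column pairs
    K = len(contact_set)                         # pairs in 3D contact
    n = len(top)                                 # pairs drawn from the ranking

    # P(X >= k) for X ~ Hypergeometric(N, K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)


if __name__ == "__main__":
    # Toy data (hypothetical): 6 aligned columns, 3 structural contacts.
    ranked = [(0, 1), (2, 1), (4, 5), (0, 5)]  # best DC score first
    contacts = [(0, 1), (1, 2), (2, 3)]
    hits, p_value = contact_enrichment(ranked, contacts, n_columns=6, top_n=3)
    print(hits, p_value)
```

Under this simplified model, a better alignment of the same sequences should place more true contacts among its top-ranked pairs and so yield a smaller probability, which is the sense in which such a score can compare two MSAs of the same domain.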

eCOMPASS executable:

ecompass_v0_9

For the eCOMPASS C++ source code, download the source code tarball from the bottom of the software webpage:

main source code tarball

Example input MSAs & structures

MBL-fold metallohydrolase domain (cd06262) (PF00753): mbl_msa1.tar, mbl_msa2.tar, mbl_pdb.tar

Rhodanese Homology Domain (cd00158) (PF00581): rhod_msa.tar, rhod_pdb.tar

Pleckstrin homology (PH) domain (cd00900) (PF00169): ph_msa.tar, ph_pdb.tar

C2 domain (cd00030) (PF00168): c2_msa.tar, c2_pdb.tar

PTS_IIB domain (cd00133) (PF02302): pts_msa.tar, pts_pdb.tar

Sulfatase Domain (cd00016) (PF00884): sfts_msa1.tar, sfts_msa2.tar, sfts_pdb.tar

Cupredoxin domain (cd00920) (PF07732): cudxn_msa.tar, cudxn_pdb.tar

Haloacid dehalogenase-like hydrolase domain (cd01427) (PF00702): had_msa1.tar, had_msa2.tar, had_pdb.tar

To compare CDD-derived MSAs to JackHMMER-generated MSAs for the following domains, contact aneuwald@som.umaryland.edu:

AAAplus AAT_1 ABHF ActinLike AmyAc C2 Cupredoxin DEATH DNA_BREc EEP FRIZZLED GLOBINS GNAT GTPASES GT_A HAD HATPase HDAC HELICASE_Ct LPLAT MBL NP_I NR_LBD PARB PH PKc PPBP PRTaseII PTP PTS PeptidS8S53 Peptidase_C19 RHOD RMS RPA2_OBF SORTASE UDG_LIKE ZnMP ZnMc tRNA_Synth

References:

Neuwald, A.F., Kolaczkowski, B.D. & Altschul, S.F. 2021. eCOMPASS: evaluative comparison of multiple protein alignments by statistical score. Bioinformatics btab374. doi: 10.1093/bioinformatics/btab374.

Neuwald, A.F. & Altschul, S.F. 2018. Statistical Investigations of Protein Residue Direct Couplings. PLoS Computational Biology 14(12):e1006237.

Neuwald, A.F., Lanczycki, C.J., Hodges, T.K. & Marchler-Bauer, A. 2020. Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments. Database. doi: 10.1093/database/baaa042.

Seemayer, S., Gruber, M. & Söding, J. 2014. CCMpred–fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 30(21):3128-3130.

Funding:

National Institutes of Health, National Institute of General Medical Sciences grant R01GM125878