A comprehensive machine learning research project implementing probabilistic approaches for protein secondary structure prediction using Hidden Markov Models and Conditional Random Fields. Built on the CB513 dataset with sophisticated feature engineering and advanced model architectures targeting the fundamental challenge of predicting protein folding patterns from amino acid sequences.
This research addresses the critical bioinformatics challenge of predicting protein secondary structures from amino acid sequences, with direct implications for drug design and disease mechanism understanding. Protein structure prediction remains fundamentally difficult because identical amino acid subsequences can adopt different structures based on molecular environment, requiring sophisticated modeling approaches that capture both local and global structural determinants.
This work investigates two complementary probabilistic frameworks: generative Hidden Markov Models and discriminative Conditional Random Fields.
The project advances understanding of sequence-structure relationships through feature engineering contributions such as beta-sheet-specific interaction patterns, evolutionary conservation analysis, and state balance mechanisms. Using the CB513 dataset containing 514 non-homologous protein sequences, we developed specialized techniques for handling the complex interdependencies inherent in protein folding while maintaining numerical stability for long sequences up to 700 residues.
Hidden Markov Models (HMM) approach protein structure prediction as a generative sequence modeling problem. The algorithm takes amino acid sequences and models them as probabilistic state transitions between structural elements (helix, sheet, coil). At each position, the HMM asks: "What structural state am I likely in, given the amino acids I've seen and the transitions I've learned?" The math builds emission probabilities (how likely each amino acid is in each structure) and transition probabilities (how structures flow into each other).
Our implementation uses mixture-of-Gaussians to capture complex amino acid patterns within each structural state, essentially learning multiple "flavors" of helices, sheets, and coils from the data.
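As a minimal sketch of this emission model (assuming diagonal covariances; the function names and dimensions here are illustrative, not the project's actual API), each structural state scores a residue's feature vector under its own three-component mixture:

```python
import numpy as np

def log_gaussian_diag(x, mean, var):
    """Log-density of x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def emission_log_likelihood(x, weights, means, variances):
    """Log p(x | state) for one state's three-component Gaussian mixture.

    weights:   (3,)   mixture weights, e.g. [0.46, 0.35, 0.19]
    means:     (3, D) component means over the D-dim feature vector
    variances: (3, D) diagonal covariances
    """
    component_logs = np.array([
        np.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ])
    # log-sum-exp over components keeps the computation numerically stable
    peak = component_logs.max()
    return peak + np.log(np.sum(np.exp(component_logs - peak)))

# Example: score a 42-dim feature vector under one state's mixture
rng = np.random.default_rng(0)
x = rng.normal(size=42)
weights = np.array([0.46, 0.35, 0.19])
means = rng.normal(size=(3, 42))
variances = np.ones((3, 42))
print(emission_log_likelihood(x, weights, means, variances))
```

Per-component responsibilities for the E-step follow by normalizing the component terms before the log-sum-exp.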
Conditional Random Fields (CRF) take a discriminative approach, directly modeling the probability of structure sequences given amino acid input. Instead of generating sequences like HMMs, CRFs ask: "Given this specific amino acid sequence, what's the most likely structure pattern?" The algorithm builds feature functions that capture relationships between amino acids and structures, then uses global optimization to find the best structural labeling for the entire sequence.
The fundamental difference: HMMs learn how protein sequences are "generated" from structures, while CRFs learn how to "discriminate" between different structural possibilities given a sequence. This explains why our CRF achieved 67.17% accuracy with balanced predictions while the HMM suffered from state collapse—discriminative models handle protein structure's complex interdependencies more effectively than generative assumptions.
- HMM: Models sequence generation (generative approach)
- CRF: Models structure discrimination (discriminative approach)
- HMM: Builds emission + transition probabilities, asks "what state am I in?"
- CRF: Builds feature functions, asks "what's the best structure for this sequence?"
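In symbols (our summary notation, not taken from the project), the two objectives are:

```latex
% HMM: generative -- joint probability factorized into transitions and emissions
p(x, y) = \pi_{y_1}\, b_{y_1}(x_1) \prod_{t=2}^{T} a_{y_{t-1} y_t}\, b_{y_t}(x_t)

% CRF: discriminative -- global log-linear model of the conditional
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} w_k\, f_k(y_{t-1}, y_t, x, t) \Big)
```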
- HMM Processing Flow (a log-space forward-pass sketch follows this list):
  1. Preprocessing: raw amino acid sequence (e.g., `MKLLLL...`)
  2. Feature extraction (PSSM/one-hot): 42-dim vectors per residue
  3. Forward pass (likelihood computation): α probabilities
  4. Backward pass (posterior calculation): β probabilities
  5. Mixture responsibilities (GMM): per-component weights
  6. State statistics (balance check): state occupancy counts
  7. Parameter updates (update rules): new emission/transition parameters
  8. Viterbi decode: final predicted states
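A minimal log-space forward pass corresponding to step 3 of the HMM pipeline above (shapes and names are illustrative assumptions, not the project's code):

```python
import numpy as np
from scipy.special import logsumexp

def forward_log(log_pi, log_A, log_B):
    """Log-space forward recursion for an HMM.

    log_pi: (S,)    log initial state probabilities
    log_A:  (S, S)  log transition matrix, log_A[i, j] = log p(j | i)
    log_B:  (T, S)  per-position emission log-likelihoods
    Returns the (T, S) log-alpha table and the sequence log-likelihood.
    """
    T, S = log_B.shape
    log_alpha = np.empty((T, S))
    log_alpha[0] = log_pi + log_B[0]
    for t in range(1, T):
        # logsumexp over predecessor states keeps 700-residue sequences stable
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0) + log_B[t]
    return log_alpha, logsumexp(log_alpha[-1])
```

The backward pass mirrors this recursion in reverse; adding the two tables (in log space) yields the posteriors that feed the mixture responsibilities.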
- CRF Processing Flow (a log-partition sketch follows this list):
  1. Preprocessing: raw amino acid sequence (e.g., `MKLLLL...`)
  2. Feature engineering: base + β-sheet enhanced features
  3. Context build: 13-position window features
  4. Feature function evaluation: 22 specialized functions
  5. Forward pass: message passing over the factor graph
  6. Backward pass: belief propagation messages
  7. Gradient computation: log-linear gradient (gradient ascent)
  8. Parameter updates: new feature weights
  9. Decoding: final structure labels
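A hedged sketch of the CRF forward pass and training objective, collapsing the 22 feature functions into precomputed unary and transition scores (an illustrative simplification, not the project's implementation):

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_partition(unary, transition):
    """Log Z(x) for a linear-chain CRF via the forward recursion.

    unary:      (T, S) per-position scores w . f(y_t, x, t)
    transition: (S, S) pairwise scores w . f(y_{t-1}, y_t)
    """
    T, S = unary.shape
    log_alpha = unary[0].copy()
    for t in range(1, T):
        log_alpha = logsumexp(log_alpha[:, None] + transition, axis=0) + unary[t]
    return logsumexp(log_alpha)

def crf_log_likelihood(unary, transition, labels):
    """log p(y | x) = score(y, x) - log Z(x).  Its gradient w.r.t. the
    weights is empirical feature counts minus model-expected counts,
    which is exactly what the gradient-ascent step climbs."""
    score = unary[np.arange(len(labels)), labels].sum()
    score += transition[labels[:-1], labels[1:]].sum()
    return score - crf_log_partition(unary, transition)
```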
Hidden Markov Model Implementation
Architecture features three-component Gaussian mixtures per state with mixture weights stabilizing at 0.46, 0.35, and 0.19 after initial fluctuation, designed to capture multi-modal amino acid patterns while balancing model capacity with computational efficiency.
Core Architectural Components:
- Modified Baum-Welch Algorithm: State-specific constraints with biological priors, dataset-derived initialization replacing random starts for improved convergence
- Enhanced Viterbi Decoding: Log-space computations with state transition constraints, numerical stability for sequences up to 700 residues through adaptive scaling (a minimal sketch follows this list)
- Adaptive State Balance System: Dynamic probability thresholds (min: 0.016, max: 0.047) derived from dataset analysis to prevent model collapse
- Component Specialization Tracking: Monitors mixture evolution during training, identifies structural motif preferences, prevents component collapse through adaptive M-step weighting
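A minimal log-space Viterbi sketch, as referenced in the list above (the `forbidden`-transition mechanism is our illustration of how constraints could be enforced, not the project's exact rule set):

```python
import numpy as np

def viterbi_log(log_pi, log_A, log_B, forbidden=()):
    """Log-space Viterbi decoding with optional transition constraints.

    forbidden: iterable of (i, j) state transitions to disallow,
               e.g. biologically implausible jumps (illustrative).
    """
    log_A = log_A.copy()
    for i, j in forbidden:
        log_A[i, j] = -np.inf  # hard constraint: transition never chosen
    T, S = log_B.shape
    delta = np.empty((T, S))
    backptr = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (S, S): from state i to j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    # Trace back the highest-scoring state path
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```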
Critical Performance Insights: Despite sophisticated engineering, the HMM revealed fundamental limitations of generative modeling for protein structures. The model exhibited severe state collapse, with state distributions degrading from [0.000013, 0.011012, 0.988836] early in training to the even more heavily skewed final distribution [0.000000, 0.004259, 0.995602].
This collapse persisted across multiple configurations including feature-specific learning rates (0.094353), extensive feature engineering (one-hot: 42%, PSSM: 39%, auxiliary: 19%), and advanced balance enforcement mechanisms. Training dynamics showed initial gradient norms of 208.573±246.648 stabilizing to 36.074 post-warmup, with emission means ranging [-0.893, 1.246] and covariance stability measures averaging 0.285.
The conditional independence assumptions prove incompatible with protein structure's highly interdependent nature.
The CRF implementation achieved significant success through its discriminative framework, reaching 67.17% accuracy with balanced state predictions [0.364, 0.289, 0.347] across helix, sheet, and coil structures. The model excelled at beta-sheet detection through specialized N→N+3 residue interaction scoring, producing biologically meaningful structure transitions without explicit enforcement while approaching the established 70% benchmark.
Advanced Feature Engineering Pipeline:
- 258-Dimensional Feature Space: Base features (45D) including one-hot encoding, PSSM scores, position features; enhanced features (24D) with 22 specialized beta-sheet characteristics; context features (189D) through window-based analysis
- Multi-Scale Conservation Analysis: Structure-specific boost factors (helix: 1.2x, sheet: 1.0x, coil: 0.8x) derived from PSSM evaluation
- Beta-Sheet Pattern Recognition: Specialized N→N+3 residue interaction scoring with distance-weighted scaling, crucial for capturing long-range interactions missed by local feature windows (a simplified sketch follows this list)
- Structure Transition Framework: 9-state transition analysis revealing stability patterns (H→H: 0.91, E→E: 0.67, C→C: 0.39)
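A simplified sketch of the N→N+3 interaction scoring and window-context ideas from the list above (the propensity rule, distance weighting, and padding scheme are placeholders, not the project's calibrated values):

```python
import numpy as np

# Placeholder propensity: hydrophobic residue pairs are common in adjacent
# beta strands (real scores would be estimated from the training data).
HYDROPHOBIC = set("AVILMFWC")

def pair_propensity(a, b):
    """Toy propensity score for a residue pair (illustrative only)."""
    return 1.0 if (a in HYDROPHOBIC and b in HYDROPHOBIC) else 0.2

def beta_interaction_scores(seq, max_offset=3):
    """Score each position by its N->N+k interactions, k <= max_offset,
    with a simple 1/k distance weighting."""
    scores = np.zeros(len(seq))
    for i in range(len(seq)):
        for k in range(1, max_offset + 1):
            if i + k < len(seq):
                scores[i] += pair_propensity(seq[i], seq[i + k]) / k
    return scores

def window_features(features, half_width=6):
    """Stack a 13-position window (half_width=6) around each residue,
    zero-padded at the sequence ends -- the context-build step above."""
    T, D = features.shape
    padded = np.vstack([np.zeros((half_width, D)), features,
                        np.zeros((half_width, D))])
    return np.hstack([padded[i:i + T] for i in range(2 * half_width + 1)])

seq = "MKVLAWFIG"
print(beta_interaction_scores(seq))
print(window_features(np.eye(len(seq), 5)).shape)  # (9, 65)
```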
Training Architecture & Results: The enhanced CRF achieved 67.17% accuracy through intelligent gradient management and adaptive feature weighting that automatically balanced evolutionary information (PSSM) with structural indicators. Model confidence steadily improved from 0.458 to 0.653 during training, demonstrating robust learning dynamics. Performance highlights include:
- Helix prediction: 70.5% F1-score with strong precision-recall balance
- Sheet detection: 63.7% F1-score using specialized beta-sheet features
- Overall improvement: substantial gains over baseline while approaching the established 70% benchmark
The implementation successfully integrated complex biological patterns into a discriminative framework, though persistent beta-sheet challenges reveal opportunities for capturing longer-range structural dependencies.
Protein secondary structure prediction represents a fundamental bottleneck in computational biology, directly impacting drug discovery timelines and costs. Current experimental methods like X-ray crystallography and NMR spectroscopy require weeks to months. The core challenge we observed is the many-to-many nature of sequence-structure relationships, where:
- Local Context Limitations: Simple sequence windows fail to capture critical long-range interactions necessary for accurate prediction
- Evolutionary Constraints: Structural patterns are preserved through evolution but manifest differently across protein families
- State Interdependencies: Secondary structure elements exhibit complex transition probabilities that violate traditional independence assumptions
- Data Complexity: High-dimensional feature spaces (39,900 dimensions per sequence) require sophisticated dimensionality handling