Feature extraction for improved Profile HMM based biological sequence analysis

Thomas Plötz and Gernot A. Fink
Proc. Int. Conf. on Pattern Recognition, pages 315-318, 2004.


Abstract

State-of-the-art systems for biological sequence analysis employ statistical modeling techniques, most notably so-called Profile HMMs. However, all approaches still rely on a purely symbolic sequence representation, which severely limits their ability to describe weak similarities between remotely homologous members of sequence families. Therefore, we propose a multi-channel, signal-like sequence representation based on a combination of several numerically encoded biochemical properties of the individual residues. From this representation, features that capture relevant local sequence properties are extracted by applying wavelet and principal component analysis. Evaluation results on a challenging sequence family classification task show that Profile HMMs trained on the feature-based sequence representation significantly outperform discrete models.
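
The pipeline outlined in the abstract (numeric encoding, wavelet analysis, principal component analysis) might be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: the two property channels (Kyte-Doolittle hydropathy and a simplified net charge), the Haar wavelet, the window width of 8, and the number of retained components are all chosen here for the example, and the function names are hypothetical.

```python
import numpy as np

# Kyte-Doolittle hydropathy values; using this scale is an assumption.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified net charge at neutral pH (illustrative assumption).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def encode(seq):
    """Map a residue string to a (length, channels) numeric signal."""
    return np.array([[HYDROPATHY[r], CHARGE.get(r, 0.0)] for r in seq])


def haar_step(x):
    """One level of the orthonormal Haar transform along axis 0."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)


def window_features(signal, width=8):
    """Full Haar decomposition of every sliding window (width: power of 2)."""
    feats = []
    for start in range(len(signal) - width + 1):
        approx = signal[start:start + width]
        coeffs = []
        while len(approx) > 1:
            approx, detail = haar_step(approx)
            coeffs.append(detail)
        coeffs.append(approx)
        # Flatten the coefficients of all channels into one vector.
        feats.append(np.concatenate(coeffs).ravel())
    return np.array(feats)


def pca_project(feats, n_components=4):
    """Project centered feature vectors onto the leading principal axes."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Arbitrary example sequence, not taken from the paper.
signal = encode("MKTAYIAKQRQISFVKSHFSRQ")
wavelet_feats = window_features(signal)
features = pca_project(wavelet_feats)
```

Each row of `features` is a continuous vector for one window position. In a setup like the one the abstract describes, such vectors would replace the discrete residue symbols as observations of the Profile HMM.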