How to detect and use sequence signals

Signals in protein sequences regulate much of the chemistry in a cell. For instance, there are signals that determine whether a protein can be cleaved at a particular position, whether it can be glycosylated, and so on. Further, there are signals that predict whether a particular protein is exported from the cell. Some of these signals are easy to detect; others are very subtle.

Detecting signals

It is often not possible to detect a signal until a substantial number of proteins belonging to the same family have been sequenced and annotated. When the proteins of a family are aligned correctly, it is possible to calculate the bias for a certain residue at a certain position. If such a bias exists, a sequence logo is an efficient way to visualize it.
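As a minimal sketch of how such a positional bias could be quantified, the following Python computes per-column residue frequencies and the information content (in bits) of each column of a gapless alignment. This is the quantity that a sequence logo usually shows as the height of each letter stack. The toy alignment, the function names, the uniform background and the absence of any small-sample correction are assumptions made for illustration only.

```python
import math
from collections import Counter


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        freqs.append({res: n / total for res, n in counts.items()})
    return freqs


def information_content(freqs, alphabet_size=20):
    """Bits per column: log2(alphabet size) minus the Shannon entropy."""
    max_bits = math.log2(alphabet_size)
    result = []
    for col in freqs:
        entropy = -sum(p * math.log2(p) for p in col.values())
        result.append(max_bits - entropy)
    return result


# Hypothetical toy alignment, invented for illustration only.
toy_alignment = ["GSSMAP", "SASMKP", "GTSMLP", "GLSMEP"]

freqs = column_frequencies(toy_alignment)
bits = information_content(freqs)
for pos, (col, b) in enumerate(zip(freqs, bits), start=1):
    top = max(col, key=col.get)
    print(f"position {pos}: most common {top} ({col[top]:.2f}), {b:.2f} bits")
```

Columns where one residue dominates get a high information content, while columns where all residues are equally common get a low one. With only a handful of sequences the estimates are unreliable, which is one reason a family needs many members before a signal becomes visible.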

Prediction methods

PROSITE patterns

Often it is not possible to detect a conserved motif until many members of a protein family can be compared. If a motif is specific enough, it can be used to search for new members of the family. The largest collection of motifs is available from the PROSITE database. In PROSITE every motif is described by an expression that indicates which amino acids are allowed at each position of the motif. An example is shown below.

DE Signal peptidases I serine active site (PS00501).

PA [GS]-x-S-M-x-P-[AT]-[LF]

NR   /TOTAL=34(34); /POSITIVE=19(19); /UNKNOWN=0(0); /FALSE_POS=15(15); /FALSE_NEG=0

The motif consists of a glycine (G) or a serine (S) followed by an arbitrary amino acid, a serine, a methionine, another arbitrary amino acid, a proline, an alanine or threonine, and a leucine or phenylalanine. In a recent version of Swiss-Prot there are 19 known "Signal peptidases I serine active site" proteins and 15 other proteins that contain this motif (false positives). PROSITE contains many hundreds of patterns of this type.
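A pattern of this kind maps quite directly onto a regular expression. The sketch below translates a common subset of the PROSITE pattern syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, and < or > as sequence-end anchors outside brackets) into a Python regular expression and searches a sequence with it. The function names and the test sequence are hypothetical, and the translation is an illustrative assumption rather than the way PROSITE's own tools work.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (common syntax subset) into a regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."
        elif re.fullmatch(r"\[[A-Z]+\]", core):
            pass                      # [GS] is already valid regex
        elif re.fullmatch(r"\{[A-Z]+\}", core):
            core = "[^" + core[1:-1] + "]"
        elif not re.fullmatch(r"[A-Z]", core):
            raise ValueError(f"unsupported pattern element: {element!r}")
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Pattern PS00501 from the entry above; the sequence is made up.
pattern = "[GS]-x-S-M-x-P-[AT]-[LF]"
hits = find_motif(pattern, "MKTGDSMEPTLAAA")
print(prosite_to_regex(pattern))
print(hits)
```

Note that `finditer` reports only non-overlapping matches. The false positives listed in the PROSITE entry are a reminder that a match to such a pattern is a hint, not proof, of family membership.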

Profile matrices

An extension of PROSITE patterns is the weight (or profile) matrix. A weight matrix does not specify which amino acids are allowed at a certain position but rather how probable it is to find a certain amino acid at that position.

Frequencies:

base/position      1      2      3      4      5
A               0.21   0.86   0.10   0.07   0.13
C               0.02   0.12   0.75   0.05   0.10
G               0.54   0.01   0.11   0.64   0.08
T               0.22   0.01   0.04   0.14   0.69

Log-odds scores (the values correspond to ln(frequency / 0.25)):

base/position      1      2      3      4      5
A              -0.17   1.24  -0.92  -1.27  -0.65
C              -2.53  -0.73   1.10  -1.61  -0.92
G               0.77  -3.22  -0.82   0.94  -1.14
T              -0.13  -3.22  -1.83  -0.58   1.02

query: AACGGTGACGTGAAGTGC

result: 1.97; 0.24; -8.29; -3.99; -5.46; -7.02; 5.07; -3.44; ....

A simple weight matrix corresponding to the DNA sequence GACGT.
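The scores listed for the query can be reproduced by sliding the matrix along the sequence and summing one log-odds value per position of each window. The sketch below does this in Python. It assumes that the log-odds values are natural logarithms of the frequency divided by a uniform background of 0.25, which is inferred from the numbers in the tables rather than stated in the text. The function names are illustrative. Because the code works with unrounded logarithms, its scores may differ from the listed ones in the last decimal.

```python
import math

# Frequencies copied from the first table above (columns = positions 1-5).
FREQUENCIES = {
    "A": [0.21, 0.86, 0.10, 0.07, 0.13],
    "C": [0.02, 0.12, 0.75, 0.05, 0.10],
    "G": [0.54, 0.01, 0.11, 0.64, 0.08],
    "T": [0.22, 0.01, 0.04, 0.14, 0.69],
}


def log_odds(frequencies, background=0.25):
    """Convert frequencies to natural-log odds against a uniform background
    (the background value is an assumption inferred from the tables)."""
    return {
        base: [math.log(f / background) for f in row]
        for base, row in frequencies.items()
    }


def scan(matrix, sequence):
    """Score every window of the sequence that is as wide as the matrix."""
    width = len(next(iter(matrix.values())))
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(matrix[base][i] for i, base in enumerate(window)))
    return scores


matrix = log_odds(FREQUENCIES)
scores = scan(matrix, "AACGGTGACGTGAAGTGC")
best = max(range(len(scores)), key=scores.__getitem__)

print("; ".join(f"{s:.2f}" for s in scores))
print("best window starts at position", best + 1)
```

The highest score falls on the window GACGT, the sequence the matrix was built to recognize. Unlike a pattern, the matrix returns a graded score, so a cutoff has to be chosen to decide what counts as a hit.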

Neural networks

With the increasing amount of sequence data being produced, it has become even more important to increase the sensitivity of a given method. One of the best ways to do this is to use a neural network or another machine learning approach.


Arne Elofsson
Last modified: Fri Oct 9 15:54:24 CEST 1998