Signals in protein sequences regulate much of the chemistry in a cell. For instance, there are signals that determine whether a particular protein can be cleaved at a particular position, whether it can be glycosylated, and so on. There are also signals that determine whether a particular protein is exported from the cell. Some of these signals are easy to detect; others are very subtle.
It is often not possible to detect a signal until a substantial number of proteins belonging to the same family have been sequenced and annotated. When the proteins of a family are aligned correctly, it is possible to calculate the bias for a certain residue at a certain position. If such a bias exists, a sequence logo is an efficient way to visualize it.
If a motif is specific enough, it can be used to search for new members of the family. The largest collection of motifs is available from the PROSITE database. In PROSITE every motif is described by a pattern that indicates which amino acids are allowed at each position of the motif. An example is shown below.
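The bias calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the method of any particular logo program: the toy alignment is made up, the function names `column_frequencies` and `information_content` are hypothetical, gaps are not handled, and no small-sample correction is applied. It computes the frequency of each residue in every column and the information content of the column in bits, which is the quantity a logo typically uses for the total height of a stack of letters.

```python
import math
from collections import Counter

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        freqs.append({res: n / total for res, n in counts.items()})
    return freqs

def information_content(freq, alphabet_size=4):
    """Bits in one column: log2(alphabet size) minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freq.values() if p > 0)
    return math.log2(alphabet_size) - entropy

# Hypothetical toy alignment (made-up DNA sequences, for illustration only).
alignment = ["GACGT", "GACGT", "AACGT", "GACAT"]
for pos, freq in enumerate(column_frequencies(alignment), start=1):
    print(pos, freq, round(information_content(freq), 2))
```

A fully conserved column gives the maximum of 2 bits for DNA, and a column in which all four bases are equally frequent gives 0 bits. For protein alignments, `alphabet_size=20` would be the corresponding choice.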
DE Signal peptidases I serine active site (PS00501).
PA [GS]-x-S-M-x-P-[AT]-[LF]
NR /TOTAL=34(34); /POSITIVE=19(19); /UNKNOWN=0(0); /FALSE_POS=15(15); /FALSE_NEG=0
The motif consists of a glycine (G) or a serine (S), followed by an arbitrary amino acid (x), a serine (S), a methionine (M), another arbitrary amino acid, a proline (P), an alanine (A) or threonine (T), and a leucine (L) or phenylalanine (F). In a recent version of Swiss-Prot there are 19 known "Signal peptidases I serine active site" proteins and 15 other proteins that contain this motif. PROSITE contains many hundreds of different patterns.
An extension of PROSITE patterns is weight (or profile) matrices. In a weight matrix one does not state which amino acids are allowed at a certain position, but rather how probable it is to find a certain amino acid at that position.
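Searching a sequence with such a pattern can be sketched by translating the pattern into a regular expression. The snippet below is an illustration only: the helper `prosite_to_regex` is hypothetical, it covers only the common pattern elements (single residues, `x`, `[...]`, `{...}`, repeat counts and the `<` and `>` anchors), and the test sequence is made up.

```python
import re

# One pattern element: optional '<', a residue / x / [allowed] / {forbidden},
# an optional repeat count such as (2) or (2,4), and an optional '>'.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any amino acid except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(regex)

regex = re.compile(prosite_to_regex("[GS]-x-S-M-x-P-[AT]-[LF]"))

# Hypothetical sequence, made up for illustration only.
sequence = "MKTAGDSMEPTLAAA"
for hit in regex.finditer(sequence):
    print(hit.start() + 1, hit.group())   # 1-based start position and matched segment
```

Note that `finditer` reports non-overlapping matches only, so a scan that must report overlapping occurrences would need a different loop.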
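Frequency of each base at each position:

| base/position | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A | 0.21 | 0.86 | 0.10 | 0.07 | 0.13 |
| C | 0.02 | 0.12 | 0.75 | 0.05 | 0.10 |
| G | 0.54 | 0.01 | 0.11 | 0.64 | 0.08 |
| T | 0.22 | 0.01 | 0.04 | 0.14 | 0.69 |

Corresponding scores (the values are consistent with the natural logarithm of frequency/0.25, i.e. log-odds against a uniform background):

| base/position | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| A | -0.17 | 1.24 | -0.92 | -1.27 | -0.65 |
| C | -2.53 | -0.73 | 1.10 | -1.61 | -0.92 |
| G | 0.77 | -3.22 | -0.82 | 0.94 | -1.14 |
| T | -0.13 | -3.22 | -1.83 | -0.58 | 1.02 |
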
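Results: 1.97; 0.24; -8.29; -3.99; -5.46; -7.02; 5.07; -3.44; ....
A simple weight matrix corresponding to the DNA sequence GACGT. Summing one score per position gives the score of a five-base window; an exact match to GACGT scores 0.77 + 1.24 + 1.10 + 0.94 + 1.02 = 5.07, the highest value in the results above.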
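How such a matrix is used can be sketched as follows. The snippet assumes, as the tabulated values suggest, that each score is the natural logarithm of the frequency divided by a uniform background of 0.25, rounded to two decimals. The function names `log_odds` and `scan` and the scanned sequence are hypothetical; the sequence behind the results listed above is not given in the text, so they are not reproduced here.

```python
import math

# Base frequencies per position, copied from the first table.
FREQUENCIES = {
    "A": [0.21, 0.86, 0.10, 0.07, 0.13],
    "C": [0.02, 0.12, 0.75, 0.05, 0.10],
    "G": [0.54, 0.01, 0.11, 0.64, 0.08],
    "T": [0.22, 0.01, 0.04, 0.14, 0.69],
}
BACKGROUND = 0.25  # assumption: all four bases equally likely in the background

def log_odds(frequencies, background=BACKGROUND):
    """Convert frequencies to natural-log odds scores, rounded as in the table."""
    return {
        base: [round(math.log(p / background), 2) for p in row]
        for base, row in frequencies.items()
    }

def scan(sequence, matrix):
    """Score every window of the sequence that is as wide as the matrix."""
    width = len(next(iter(matrix.values())))
    return [
        sum(matrix[sequence[i + j]][j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

matrix = log_odds(FREQUENCIES)

# Hypothetical sequence, made up for illustration only.
sequence = "TTGACGTAA"
scores = scan(sequence, matrix)
best = max(range(len(scores)), key=scores.__getitem__)
print(best + 1, round(scores[best], 2))   # 1-based start of the best-scoring window
```

In practice a threshold on the score decides which windows are reported as hits; choosing that threshold trades false positives against false negatives in the same way as making a PROSITE pattern more or less strict.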
With the increasing amount of sequence data being produced, it has become even more important to increase the sensitivity of a given method. One way to do this is to use a neural network or another machine learning approach.
Arne Elofsson
Stockholm Bioinformatics Center, Department of Biochemistry, Arrheniuslaboratoriet, Stockholms Universitet, 10691 Stockholm, Sweden
Tel: +46-(0)8/161553 Fax: +46-(0)8/158057 Hem: +46-(0)8/6413158 Email: arne@sbc.su.se WWW: /~arne/