
Bioinformatics for protein sequence, structure and function


How to detect and use sequence signals


Signals in protein sequences regulate much of the chemistry in a cell. For instance, there are signals that determine whether a particular protein can be cleaved at a particular position, whether it can be glycosylated, and so on. There are also signals that determine whether a particular protein is exported from the cell. Some of these signals are easy to detect; others are very subtle.

Detecting signals

It is often not possible to detect a signal until a substantial number of proteins belonging to the same family have been sequenced and annotated. When the proteins of a family are aligned correctly, it is possible to calculate the bias for a certain residue in a certain position. If such a bias exists, a sequence logo is an efficient way to visualize it.
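The positional bias described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular logo program: the alignment is a made-up toy example, the function names are hypothetical, and gaps and small-sample corrections are ignored. The information content of a column is taken as log2 of the alphabet size minus the Shannon entropy, and the height of each letter in a logo is its frequency times that information content.

```python
import math
from collections import Counter

# Hypothetical toy alignment (not real data): four aligned sequences of equal length.
ALIGNMENT = ["GASMT", "GLSMA", "SASMT", "GVSMS"]

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    n = len(alignment)
    return [{res: count / n for res, count in Counter(col).items()}
            for col in zip(*alignment)]

def information_content(freqs, alphabet_size=20):
    """Information in bits: log2(alphabet size) minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(freqs, alphabet_size=20):
    """Height of each letter in a logo column: frequency times information."""
    info = information_content(freqs, alphabet_size)
    return {res: p * info for res, p in freqs.items()}

columns = column_frequencies(ALIGNMENT)
for pos, freqs in enumerate(columns, start=1):
    print(pos, round(information_content(freqs), 2), logo_heights(freqs))
```

A fully conserved column (here the third one, always S) reaches the maximum of log2(20) bits, while a column with several different residues scores lower.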

Prediction methods

PROSITE patterns

If a motif is specific enough, it can be used to search for new members of the family. The largest collection of motifs is available from the PROSITE database. In PROSITE every motif is described by a pattern that indicates which amino acids are allowed in each position of the motif. An example is shown below.

DE Signal peptidases I serine active site (PS00501).

PA [GS]-x-S-M-x-P-[AT]-[LF]

NR   /TOTAL=34(34); /POSITIVE=19(19); /UNKNOWN=0(0); /FALSE_POS=15(15); /FALSE_NEG=0

The motif consists of a glycine (G) or a serine (S), followed by an arbitrary amino acid (x), a serine (S), a methionine (M), another arbitrary amino acid, a proline (P), an alanine (A) or threonine (T), and finally a leucine (L) or phenylalanine (F). In a recent version of Swiss-Prot there are 19 known "Signal peptidases I serine active site" proteins and 15 other proteins that contain this motif (false positives). PROSITE contains many hundreds of different patterns.
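Because a PROSITE pattern is essentially a restricted regular expression, searching a sequence for it can be sketched with Python's standard `re` module. The translation below is a simplified sketch that covers only the common pattern elements (single residues, `x`, `[..]` sets, `{..}` exclusions, repeat counts and the `<`/`>` anchors); the function names and the test sequence are hypothetical and are not taken from PROSITE or Swiss-Prot.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into an equivalent Python regular expression."""
    regex = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        repeat = ""
        if element.endswith(")"):        # x(3) -> .{3}, x(2,4) -> .{2,4}
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":               # any amino acid
            core = "."
        elif element.startswith("{"):    # {P} means any residue except P
            core = "[^" + element[1:-1] + "]"
        else:                            # a single residue or a [..] set
            core = element
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)

def find_motif(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]

# Pattern from the PS00501 entry; the sequence is made up for illustration.
PATTERN = "[GS]-x-S-M-x-P-[AT]-[LF]"
SEQUENCE = "MKTLGASMMPALAAGSSMYPTF"
print(prosite_to_regex(PATTERN))
print(find_motif(PATTERN, SEQUENCE))
```

Note that `re.finditer` reports only non-overlapping matches, which is sufficient for a short, specific pattern like this one.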

Profile matrices

An extension of PROSITE patterns is the weight (or profile) matrix. A weight matrix does not specify which amino acids are allowed in a certain position, but rather how probable it is to find each amino acid in that position.

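base/position      1      2      3      4      5
A               0.21   0.86   0.10   0.07   0.13
C               0.02   0.12   0.75   0.05   0.10
G               0.54   0.01   0.11   0.64   0.08
T               0.22   0.01   0.04   0.14   0.69

The same matrix expressed as log-odds scores, ln(p/0.25), where 0.25 is the frequency expected for each base under a uniform background:

base/position      1      2      3      4      5
A              -0.17   1.24  -0.92  -1.27  -0.65
C              -2.53  -0.73   1.10  -1.61  -0.92
G               0.77  -3.22  -0.82   0.94  -1.14
T              -0.13  -3.22  -1.83  -0.58   1.02

query: AACGGTGACGTGAAGTGC

results (one score per 5-base window, starting at positions 1, 2, 3, ...): 1.97; 0.24; -8.29; -3.99; -5.46; -7.02; 5.07; -3.44; ...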

A simple weight matrix corresponding to the DNA sequence GACGT.
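The scoring can be reproduced with a short Python sketch. It assumes a uniform background frequency of 0.25 per base and natural logarithms, which is what the numbers in the tables are consistent with; the function names are hypothetical. Each window of five bases is scored by summing one log-odds weight per position.

```python
import math

# Frequencies copied from the first table; each list holds positions 1 to 5.
FREQ = {
    "A": [0.21, 0.86, 0.10, 0.07, 0.13],
    "C": [0.02, 0.12, 0.75, 0.05, 0.10],
    "G": [0.54, 0.01, 0.11, 0.64, 0.08],
    "T": [0.22, 0.01, 0.04, 0.14, 0.69],
}
BACKGROUND = 0.25  # assumption: uniform base composition
QUERY = "AACGGTGACGTGAAGTGC"

def log_odds(freq, background=BACKGROUND):
    """ln(p / background), rounded to two decimals like the second table."""
    return {base: [round(math.log(p / background), 2) for p in row]
            for base, row in freq.items()}

def scan(sequence, weights):
    """Score every window of motif width by summing one weight per position."""
    width = len(next(iter(weights.values())))
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(weights[base][i] for i, base in enumerate(window))
        scores.append(round(total, 2))
    return scores

WEIGHTS = log_odds(FREQ)
scores = scan(QUERY, WEIGHTS)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores[:8])
print(best + 1, QUERY[best:best + 5], scores[best])
```

The highest score belongs to the window that starts at position 7 of the query, which reads GACGT, the sequence with the most probable base at every position of the matrix.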

Neural networks

With the increasing amount of sequence data being produced, it has become even more important to increase the sensitivity of prediction methods. One way to do this is to use a neural network or another machine learning approach.


Arne Elofsson
Last modified: Wed Oct 27 16:06:57 CEST 1999
Arne Elofsson
Stockholm Bioinformatics Center,
Department of Biochemistry,
Arrheniuslaboratoriet
Stockholms Universitet
10691 Stockholm, Sweden
Tel: +46-(0)8/161553
Fax: +46-(0)8/158057
Home: +46-(0)8/6413158
Email: arne@sbc.su.se
WWW: /~arne/