PROSITE documentation PDOC50099
Sequence regions enriched in a particular amino acid profiles


Description

Many proteins contain compositionally biased sequence regions, also called low-complexity regions [1]. Typically, such regions are highly enriched in one or a few amino acids. We have included one profile for each of the 20 amino acids in order to search for regions that are significantly enriched in that amino acid (alanine-, cysteine-, aspartic acid-, glutamic acid-, phenylalanine-, glycine-, histidine-, isoleucine-, lysine-, leucine-, methionine-, asparagine-, proline-, glutamine-, arginine-, serine-, threonine-, valine-, tryptophan- and tyrosine-rich region profiles). The behaviour of each profile is controlled by two parameters, the match score and the mismatch score. These parameters were chosen such that the "target frequency" of the corresponding amino acid, computed according to the Karlin-Altschul theory [2] from the average residue composition of Swiss-Prot, is approximately 35% (between 31.2% and 39.2%, see the table below).

   Amino     Average    Match    Mismatch   Target
   acid      freq. (%)  score    score      freq. (%)

   Ala (A)   7.55          4       -1       38.5
   Cys (C)   1.69          7       -1       36.8
   Asp (D)   5.30          5       -1       35.1
   Glu (E)   6.32          5       -1       32.4
   Phe (F)   4.07          6       -1       31.9
   Gly (G)   6.84          5       -1       31.2
   His (H)   2.24          7       -1       33.6
   Ile (I)   5.72          5       -1       34.0
   Lys (K)   5.93          5       -1       33.4
   Leu (L)   9.33          4       -1       34.7
   Met (M)   2.35          7       -1       33.1
   Asn (N)   4.52          5       -1       37.4
   Pro (P)   4.92          5       -1       36.2
   Gln (Q)   4.02          6       -1       32.1
   Arg (R)   5.15          5       -1       35.5
   Ser (S)   7.22          4       -1       39.2
   Thr (T)   5.74          5       -1       33.9
   Val (V)   6.52          5       -1       32.0
   Trp (W)   1.25          8       -1       34.9
   Tyr (Y)   3.19          6       -1       35.1
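
The relation between the columns of this table can be illustrated with a short calculation. The following is a minimal sketch, not the procedure used to build the profiles: it assumes the standard two-letter Karlin-Altschul setting, in which a residue either matches (background probability p, score s_match) or does not (probability 1 - p, score s_mismatch). The scale parameter lambda is the positive solution of p*exp(lambda*s_match) + (1 - p)*exp(lambda*s_mismatch) = 1, and the target frequency is p*exp(lambda*s_match). The function name target_frequency and the bisection bounds are illustrative choices.

```python
import math


def target_frequency(background, match, mismatch=-1.0):
    """Target frequency of the matching residue (as a fraction).

    background: average frequency of the residue, as a fraction (e.g. 0.0755)
    match, mismatch: scores for a matching / non-matching residue
    """
    # Karlin-Altschul statistics require a negative expected score.
    if background * match + (1.0 - background) * mismatch >= 0:
        raise ValueError("expected score per residue must be negative")

    def excess(lam):
        # Zero exactly when lam satisfies the Karlin-Altschul equation.
        return (background * math.exp(lam * match)
                + (1.0 - background) * math.exp(lam * mismatch)
                - 1.0)

    # excess() is negative just above 0 and positive for large lam, so the
    # unique positive root can be bracketed and refined by bisection.
    # The lower bound 1e-6 is an assumption that the root is not tiny.
    lo, hi = 1e-6, 1.0
    while excess(hi) <= 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    return background * math.exp(lam * match)


if __name__ == "__main__":
    # Two rows of the table: (residue, average frequency in %, match score)
    for name, freq_percent, match in [("Ala", 7.55, 4), ("Trp", 1.25, 8)]:
        q = target_frequency(freq_percent / 100.0, match, -1)
        print(f"{name}: target frequency = {100.0 * q:.1f}%")
```

Under these assumptions the two example rows should come out close to the tabulated values (38.5 for Ala, 34.9 for Trp). Because match scores are whole numbers, the target frequency can only be brought near 35%, not set to it exactly.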

The normalisation parameters given within each profile convert raw scores into per-residue log expectation values. They were derived empirically by fitting an extreme value distribution to the score distribution obtained from a randomized database that conserves the length distribution and global amino acid composition of Swiss-Prot, but not the composition of the individual sequences.
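
To give an intuition for what a raw score means under a match/mismatch scheme, the following sketch finds the highest-scoring contiguous segment of a sequence for one residue type. It is an illustration only: it is not the PROSITE profile search software, it ignores gaps, and it does not apply the normalisation parameters stored in the profiles. The function name best_segment and the example sequence are hypothetical.

```python
def best_segment(seq, residue, match, mismatch=-1):
    """Return (raw_score, start, end) of the best segment, seq[start:end].

    Each position scores `match` if it equals `residue`, else `mismatch`.
    A single left-to-right scan (Kadane's algorithm) keeps the running
    score of the segment ending at the current position.
    """
    best_score, best_start, best_end = 0, 0, 0
    run_score, run_start = 0, 0
    for i, aa in enumerate(seq):
        if run_score <= 0:
            # A non-positive prefix cannot improve a segment: restart here.
            run_score, run_start = 0, i
        run_score += match if aa == residue else mismatch
        if run_score > best_score:
            best_score, best_start, best_end = run_score, run_start, i + 1
    return best_score, best_start, best_end


if __name__ == "__main__":
    seq = "MKTLPPAPPPLPPSGE"  # made-up sequence, for illustration only
    score, start, end = best_segment(seq, "P", match=5, mismatch=-1)
    print(score, seq[start:end])
```

With a mismatch score of -1, a few non-matching residues inside an enriched stretch lower the raw score only slightly, so the reported region need not be a pure homopolymer.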

Note:

These profiles do not characterize biologically defined objects. As the underlying definition is purely statistical, it is not possible to speak of true or false matches to these profiles, nor is it possible to assign a false-negative status to a sequence.

Expert(s) to contact by email:

Bucher P.

Last update:

February 2024 / Text revised.

-------------------------------------------------------------------------------


Technical section

PROSITE methods (with tools and information) covered by this documentation:

ALA_RICH, PS50310; Alanine-rich region profile  (MATRIX with a high probability of occurrence!)

ARG_RICH, PS50323; Arginine-rich region profile  (MATRIX with a high probability of occurrence!)

ASN_RICH, PS50321; Asparagine-rich region profile  (MATRIX with a high probability of occurrence!)

ASP_RICH, PS50312; Aspartic acid-rich region profile  (MATRIX with a high probability of occurrence!)

CYS_RICH, PS50311; Cysteine-rich region profile  (MATRIX with a high probability of occurrence!)

GLN_RICH, PS50322; Glutamine-rich region profile  (MATRIX with a high probability of occurrence!)

GLU_RICH, PS50313; Glutamic acid-rich region profile  (MATRIX with a high probability of occurrence!)

GLY_RICH, PS50315; Glycine-rich region profile  (MATRIX with a high probability of occurrence!)

HIS_RICH, PS50316; Histidine-rich region profile  (MATRIX with a high probability of occurrence!)

ILE_RICH, PS50317; Isoleucine-rich region profile  (MATRIX with a high probability of occurrence!)

LEU_RICH, PS50319; Leucine-rich region profile  (MATRIX with a high probability of occurrence!)

LYS_RICH, PS50318; Lysine-rich region profile  (MATRIX with a high probability of occurrence!)

MET_RICH, PS50320; Methionine-rich region profile  (MATRIX with a high probability of occurrence!)

PHE_RICH, PS50314; Phenylalanine-rich region profile  (MATRIX with a high probability of occurrence!)

PRO_RICH, PS50099; Proline-rich region profile  (MATRIX with a high probability of occurrence!)

SER_RICH, PS50324; Serine-rich region profile  (MATRIX with a high probability of occurrence!)

THR_RICH, PS50325; Threonine-rich region profile  (MATRIX with a high probability of occurrence!)

TRP_RICH, PS50327; Tryptophan-rich region profile  (MATRIX with a high probability of occurrence!)

TYR_RICH, PS50328; Tyrosine-rich region profile  (MATRIX with a high probability of occurrence!)

VAL_RICH, PS50326; Valine-rich region profile  (MATRIX with a high probability of occurrence!)


References

[1] Authors: Wootton J.C., Federhen S.
    Title: Analysis of compositionally biased regions in sequence databases.
    Source: Methods Enzymol. 266:554-571(1996).
    PubMed ID: 8743706

[2] Authors: Karlin S., Bucher P., Brendel V., Altschul S.F.
    Title: Statistical methods and insights for protein and DNA sequences.
    Source: Annu. Rev. Biophys. Biophys. Chem. 20:175-203(1991).
    PubMed ID: 1867715



PROSITE is copyrighted by the SIB Swiss Institute of Bioinformatics and distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0) License, see prosite_license.html.
