The PROSITE database of protein domains, families and functional sites - User Manual

Table of contents

What is PROSITE ?

Contact information
Copyright notice
Introduction
History up to Release 20.0
Distributed files
Citation
Feedback

Prosite methodology

Patterns development

Patterns from the literature
Steps in the development of a new pattern

Profiles development
Repeats identification

Database conventions

General structure
Documentation file structure
Data file structure

Structure of a data entry
Example of a pattern entry
Example of a profile (matrix) entry

Auxiliary file structure

The different line types for PROSITE.DAT and PROSITE.AUX files

PURL
The ID line
The AC line
The DT line
The DE line
The PA line
The MA line
The PP line
The NR line
The CC line

The /TAXO-RANGE qualifier
The /MAX-REPEAT qualifier
The /SITE qualifier
The /SKIP-FLAG qualifier
The /VERSION qualifier
The /MATRIX_TYPE qualifier
The /SCALING_DB qualifier
The /AUTHOR qualifier
The /FT_KEY and /FT_DESC qualifiers

The DR line
The 3D line
The PR line
The DO line
The termination line

I. What is PROSITE ?

I.A. Contact Information

Swiss-Prot group
SIB Swiss Institute of Bioinformatics
Centre Medical Universitaire (CMU)
1, rue Michel Servet
1211 Geneva 4
Switzerland
Fax: +41-22-702 58 58
Email: prosite.expasy.org/contact
www server: https://prosite.expasy.org/
ftp server: https://ftp.expasy.org/databases/prosite/

I.B. Copyright notice

PROSITE is copyright.
There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see the prosite license. The copyright notice also applies to this user manual as well as to any other PROSITE document.

I.C. Introduction

The PROSITE database is a collection of protein families, domains and/or motifs. Its patterns and profiles can help determine the function of uncharacterized proteins translated from genomic or cDNA sequences. With appropriate computational tools, patterns and/or profiles can rapidly and reliably identify which known protein family (if any) a new sequence belongs to and/or which domain(s) and/or sites it contains.

In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. To paraphrase Orwell, in Animal Farm, we can say that "some regions of a protein sequence are more equal than others" !

The use of protein sequence patterns (or motifs) to determine the function(s) of proteins is rapidly becoming one of the essential tools of sequence analysis. This reality has been recognized by many authors, as illustrated by the following quotations from two of the best-known experts in protein sequence analysis, R.F. Doolittle and A.M. Lesk:

"There are many short sequences that are often (but not always) diagnostics of certain binding properties or active sites. These can be set into a small subcollection and searched against your sequence (1)".

"In some cases, the structure and function of an unknown protein which is too distantly related to any protein of known structure to detect its affinity by overall sequence alignment may be identified by its possession of a particular cluster of residues types classified as a motifs. The motifs, or templates, or fingerprints, arise because of particular requirements of binding sites that impose very tight constraint on the evolution of portions of a protein sequence (2)."

Based on these observations we decided, in 1988, to actively pursue the development of a database of patterns which would be used to search against sequences of unknown function. This database, called PROSITE, contains a few patterns which have been published in the literature, but the majority have been developed by the author over the last ten years. Originally this dictionary was conceived as part of the author's doctoral dissertation as well as an integral part of the PROSITE program in the PC/Gene sequence analysis software package. But, as many people have expressed their interest in this project, we have decided to make this work available on computer media.

There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence; the use of techniques based on weight matrices (also known as profiles) allows the detection of such proteins or domains. In 1994 we started a collaborative project with Philipp Bucher to introduce profiles in PROSITE. Currently, most of the new PROSITE entries are centered around profiles and are developed by the PROSITE collaborators at the SIB Swiss Institute of Bioinformatics in Geneva and Lausanne.

References

1) Doolittle R.F.
(In) Of URFs and ORFs: a primer on how to analyze derived amino acid sequences, University Science Books, Mill Valley, California, (1986).

2) Lesk A.M.
(In) Computational Molecular Biology, Lesk A.M., Ed., pp17-26, Oxford University Press, Oxford (1988).

I.D. History up to Release 20.0

PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids.

The following table shows the growth of the database since its creation in 1989 up to release 20.0 in 2010:

Release	Date	Documentation entries	Patterns/profiles entries	Note
1.0	03/89	58	60	Only released in PC/Gene (Version 5.16)
2.0	03/89	129	132	Only released in PC/Gene (Version 6.00)
3.0	05/89	?	160
4.0	10/89	?	202	Printed release (EMBL Biocomputing document)
5.0	04/90	296	338
6.0	11/90	375	433
7.0	05/91	441	508
8.0	11/91	530	605
9.0	06/91	580	689
10.0	12/92	635	803
11.0	10/93	715	927
12.0	06/94	785	1029	First release to include profiles
13.0	11/95	889	1167
14.0	12/97	997	1335
15.0	06/98	1014	1352
16.0	07/99	1034	1374
17.0	12/01	1108	1501
18.0	07/03	1200	1639
19.0	04/05	1344	1841
20.0	04/10	1449	2004	Introduction of the ProRule section

I.E. Distributed files

old_releases	Contains, for each former release of PROSITE, prosite.dat, prosite.aux (as of release 2024_03), prodoc.dat and prorule.dat (as of release 20.0)
ps_scan	Perl program used to scan one or several patterns and/or profiles from PROSITE against one or several protein sequences
README	Architecture and list of files present in https://ftp.expasy.org/databases/prosite/
README.license	url to PROSITE license conditions document ( prosite_license.html )
README.pftools	url to the pftools (collection of programs supporting the generalized profile format and search method of PROSITE)
README.prosuser	url to PROSITE user manual document (this document)
_COPYRIGHT_NOTICE_	PROSITE copyright notice
evaluator.dat	Miniprofiles data file (all profiles used to evaluate pattern matches)
jourlist.txt	List of abbreviations for journals cited in prosite.doc
profile.txt	Syntax for PROSITE profiles
prorule.dat	PROSITE ProRule file (all rules)
prosite.aux	PROSITE auxiliary information file
prosite.dat	PROSITE data file (all patterns and profiles)
prosite.doc	PROSITE documentation file (all documentation entries)
prosite.lis	List of all documentation entries
prosite_alignments.tar.gz	Alignment of all UniProtKB/Swiss-Prot true positive hits for each pattern or profile entry
prosite_alignments_for_logos.tar.gz	Alignment of all UniProtKB/Swiss-Prot true positive hits for each pattern or profile entry, same as the ones contained in prosite_alignments.tar.gz but without any insertion.
ps_reldt.txt	Current PROSITE release number and date
psdelac.txt	List of all entries accession numbers that have been deleted from prosite.dat or prosite.doc
sequence_logos.tar.gz	Logo for each alignment present in prosite_alignments.tar.gz
unirule.pdf	Syntax for the rules in ProRule
prosuser.html *	User manual (this document)
prosite_license.html *	PROSITE license conditions

* These files are not in the prosite ftp

I.F. Citation

Persistent URL (PURL):

If you want to refer to a specific PROSITE motif or documentation entry in publications or data records, you should use the persistent URL (PURL) of that entry. For a documentation entry the PURL is in the form: https://purl.expasy.org/prosite/documentation/PDOCXXXXX where XXXXX is the documentation entry number (e.g. https://purl.expasy.org/prosite/documentation/PDOC00022). For a motif entry, the PURL is in the form: https://purl.expasy.org/prosite/signature/PSXXXXX where XXXXX is the motif number (e.g. https://purl.expasy.org/prosite/signature/PS51092).
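
As a small illustration of the two PURL forms above, the following Python sketch builds the PURL for an accession number. The function name `prosite_purl` is hypothetical and not part of any PROSITE tool; the accession shapes (PDOC or PS followed by five digits) are assumed from the examples given above.

```python
import re

PURL_BASE = "https://purl.expasy.org/prosite"

def prosite_purl(accession: str) -> str:
    """Return the persistent URL for a PDOCnnnnn or PSnnnnn accession."""
    if re.fullmatch(r"PDOC\d{5}", accession):
        return f"{PURL_BASE}/documentation/{accession}"
    if re.fullmatch(r"PS\d{5}", accession):
        return f"{PURL_BASE}/signature/{accession}"
    # Anything else is rejected rather than guessed at.
    raise ValueError(f"not a PROSITE accession: {accession!r}")

print(prosite_purl("PDOC00022"))
print(prosite_purl("PS51092"))
```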

To reference a specific PROSITE publication, please go to PROSITE References.

I.G. Feedback

We welcome any feedback. If you find errors or omissions, or if you want to suggest new sites, patterns or profiles to be added to this database, please let us know. You can contact us at https://prosite.expasy.org/contact.

II. PROSITE methodology

II.A. Patterns development

In this section we will explain how we selected or developed the signature patterns described in this compilation. Our first and most important criterion is that a good signature pattern must be as short as possible, should detect all or most of the sequences it is designed to describe and should not give too many false positive results. In other words it must exhibit both high sensitivity and high specificity.

II.A.1. Patterns from the literature

A number of the patterns described in this dictionary have been published. We have tested those patterns on UniProtKB/Swiss-Prot to see if the signature pattern was still specific to the group or family of proteins since the paper was published. If this was the case we used the published pattern as such; otherwise we updated the pattern using methods similar to those used to develop a new pattern, which are described in the following sub-section.

II.A.2. Steps in the development of a new pattern

We generally start by studying review(s) on a group or family of proteins. We build an alignment table of the proteins discussed in that review. If necessary we add to this table new published sequences relevant to the subject under consideration. Using such alignment tables we pay particular attention to the residues and regions thought or proved to be important to the biological function of that group of proteins. These biologically significant regions or residues are generally:

- Enzyme catalytic sites.
- Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin, etc).
- Amino acids involved in binding a metal ion.
- Cysteines involved in disulfide bonds.
- Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.

We then try to find a short (not more than four or five residues long) conserved sequence which is part of a region known to be important or which includes biologically significant residue(s). We call the pattern(s) created at this stage the 'core' pattern(s). The most recent version of UniProtKB/Swiss-Prot is then scanned with these core pattern(s). If a core pattern detects all the proteins under consideration and none (or very few) of the other proteins, we can stop at this stage and use the core pattern as a bona fide signature. In most cases we are not so lucky and we pick up a lot of extra sequences which clearly do not belong to the group of proteins under consideration. A further series of scans, involving a gradual increase in the size of the pattern, is then necessary. In some cases we never manage to find a good pattern and we have to retry with a core pattern from a different part of the sequence. It must also be noted that we pay particular attention to avoiding 'false' patterns. We will use an example to describe what we call a 'false' pattern:

Let us assume that we have a partial alignment of three sequences around an active site residue (in this example a histidine whose position is marked with an asterisk) as shown below:


                    *
             ALRDFATHDDF
             SMTAEATHDSI
             ECDQAATHEAS

Here we would start scanning with a core pattern with the sequence A-T-H-[D or E]. This pattern is small and would probably pick up too many false positive results. According to the procedure outlined above, we would then have to extend the core pattern. But in this case, any extension would be artificial and group together residues which have different properties and which are represented only once in a given position of the alignment. For example, we could scan with the pattern [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H-[D or E]. This pattern would probably only pick up the sequences which are in the alignment, but it would be biologically meaningless; there is no consensus in the first three positions of the pattern and the pattern does not even group residues with identical physicochemical properties. Consequently, this pattern would probably fail to detect a new sequence containing the same active site but having a different N-terminal sequence.
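
The behaviour described above can be reproduced with a short Python sketch. It translates a subset of the pattern notation into a regular expression and scans the three aligned sequences plus one extra sequence. The function `prosite_to_regex` and the sequence `NEW_MEMBER` are made up for this illustration, and the patterns are written in the compact bracket notation used in the consensus patterns of this manual ([DE] instead of [D or E]).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of the pattern notation into a Python regex.

    Supported elements (separated by '-'): 'x' for any residue, a single
    residue letter, '[ABC]' for alternatives, and an optional '(n)' or
    '(n,m)' repeat suffix. Other notation is rejected.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        regex = "." if core == "x" else core
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

ALIGNED = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
NEW_MEMBER = "GKWPLATHDGV"  # hypothetical: same site, different N-terminal side

core = re.compile(prosite_to_regex("A-T-H-[DE]"))
extended = re.compile(prosite_to_regex("[RTD]-[DAQ]-[FEA]-A-T-H-[DE]"))

for seq in ALIGNED + [NEW_MEMBER]:
    # Columns: sequence, core pattern found, extended pattern found
    print(seq, bool(core.search(seq)), bool(extended.search(seq)))
```

Both patterns match the three aligned sequences, but only the core pattern matches the extra sequence, which is the failure mode of a 'false' pattern.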

II.B. Profiles development

A profile or weight matrix (the two terms are used synonymously here) is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. As with patterns, there may be several matches to a profile in one sequence, but multiple occurrences in the same sequence must be disjoint (non-overlapping) according to a specific definition included in the profile.

The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (3). Additional parameters allow representation of other motif descriptors, including the currently popular hidden Markov models. A technical description of the profile structure and of the corresponding motif search method is given in the file profile.txt included in each PROSITE release.

Profiles can be constructed by a large variety of different techniques. The classical method developed by Gribskov and co-workers (4) requires a multiple sequence alignment as input and uses a symbol comparison table to convert residue frequency distributions into weights. The profiles included in the current PROSITE release were generated by this procedure applying recent modifications described by Luethy and co-workers (5). In the future, we intend to apply additional profile construction tools including structure-based approaches and methods involving machine learning techniques. We also consider the possibility of distributing published profiles developed by others in PROSITE format along with locally produced documentation entries.

Unlike patterns, profiles are usually not confined to small regions with high sequence similarity. Rather they attempt to characterize a protein family or domain over its entire length. This can lead to specific problems not arising with PROSITE patterns. With a profile covering conserved as well as divergent sequence regions, there is a chance to obtain a significant similarity score even with a partially incorrect alignment. This possibility is taken into account by our quality evaluation procedures. In order to be acceptable, a profile must not only assign high similarity scores to true motif occurrences and low scores to false matches. In addition, it should correctly align those residues having analogous functions or structural properties according to experimental data.

Profiles are expected to be more sensitive and more robust than patterns because they provide discriminatory weights not only for the residues already found at a given position of a motif but also for those not yet found. The weights for those not yet found are extrapolated from the observed amino acid compositions using empirical knowledge about amino acid substitutability. The effect of such a procedure is exemplified below.

Shown are a short alignment without gaps and the corresponding weighting table derived with our standard method.


                  F   K   L   L   S   H   C   L   L   V
                  F   K   A   F   G   Q   T   M   F   Q
                  Y   P   I   V   G   Q   E   L   L   G
                  F   P   V   V   K   E   A   I   L   K
                  F   K   V   L   A   A   V   I   A   D
                  L   E   F   I   S   E   C   I   I   Q
                  F   K   L   L   G   N   V   L   V   C

          A     -18 -10  -1  -8   8  -3   3 -10  -2  -8
          C     -22 -33 -18 -18 -22 -26  22 -24 -19  -7
          D     -35   0 -32 -33  -7   6 -17 -34 -31   0
          E     -27  15 -25 -26  -9  23  -9 -24 -23  -1
          F      60 -30  12  14 -26 -29 -15   4  12 -29
          G     -30 -20 -28 -32  28 -14 -23 -33 -27  -5
          H     -13 -12 -25 -25 -16  14 -22 -22 -23 -10
          I       3 -27  21  25 -29 -23  -8  33  19 -23
          K     -26  25 -25 -27  -6   4 -15 -27 -26   0
          L      14 -28  19  27 -27 -20  -9  33  26 -21
          M       3 -15  10  14 -17 -10  -9  25  12 -11
          N     -22  -6 -24 -27   1   8 -15 -24 -24  -4
          P     -30  24 -26 -28 -14 -10 -22 -24 -26 -18
          Q     -32   5 -25 -26  -9  24 -16 -17 -23   7
          R     -18   9 -22 -22 -10   0 -18 -23 -22  -4
          S     -22  -8 -16 -21  11   2  -1 -24 -19  -4
          T     -10 -10  -6  -7  -5  -8   2 -10  -7 -11
          V       0 -25  22  25 -19 -26   6  19  16 -16
          W       9 -25 -18 -19 -25 -27 -34 -20 -17 -28
          Y      34 -18  -1   1 -23 -12 -19   0   0 -18

Note that at certain positions, a residue not occurring in the alignment receives a higher score than one occurring in the alignment, as a result of other residues at that position. Thus A occurring in the third column has a lower score (-1) than M (+10) not occurring there but physicochemically similar to L, I, V, F found in the other sequences. Similar extrapolation procedures are used to derive position-specific insertion and deletion scores which further enhance the selectivity of the profile.
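
To make the use of such a weighting table concrete, here is a minimal Python sketch that scores an ungapped 10-residue segment against it. This is a simplification: it ignores the insertion and deletion scores mentioned above, so it is not the alignment method described in profile.txt. Only the rows needed for the example are copied from the table, and the names `WEIGHTS` and `ungapped_score` are hypothetical.

```python
# Rows copied from the weighting table above (one weight per alignment column).
WEIGHTS = {
    "A": [-18, -10,  -1,  -8,   8,  -3,   3, -10,  -2,  -8],
    "C": [-22, -33, -18, -18, -22, -26,  22, -24, -19,  -7],
    "F": [ 60, -30,  12,  14, -26, -29, -15,   4,  12, -29],
    "H": [-13, -12, -25, -25, -16,  14, -22, -22, -23, -10],
    "K": [-26,  25, -25, -27,  -6,   4, -15, -27, -26,   0],
    "L": [ 14, -28,  19,  27, -27, -20,  -9,  33,  26, -21],
    "M": [  3, -15,  10,  14, -17, -10,  -9,  25,  12, -11],
    "S": [-22,  -8, -16, -21,  11,   2,  -1, -24, -19,  -4],
    "V": [  0, -25,  22,  25, -19, -26,   6,  19,  16, -16],
}

def ungapped_score(segment: str) -> int:
    """Sum the position-specific weights of a segment aligned without gaps."""
    if len(segment) != 10:
        raise ValueError("segment must span the 10 profile columns")
    return sum(WEIGHTS[residue][column] for column, residue in enumerate(segment))

print(ungapped_score("FKLLSHCLLV"))  # first sequence of the alignment
print(ungapped_score("FKMLSHCLLV"))  # M in column 3 (not observed there)
print(ungapped_score("FKALSHCLLV"))  # A in column 3 (observed there)
```

Replacing the L in the third column by M costs less than replacing it by A, although only A occurs at that position in the alignment.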

II.C. Repeats identification

Generally repeats possess high amino acid substitution rates and their identification is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and number of repetitive units often cannot be determined using current profile search. We have implemented a context dependent threshold that allows the detection of strongly divergent repeats when well characterized ones have already been identified.

Our approach aims to set a lower acceptance threshold for sub-optimal alignments of profiles to proteins containing repeats. This is accomplished by scanning the profile against a randomized database of sequences where the occurrence of at least one copy of the repeat has been assessed with high confidence. The computed lower acceptance threshold is then used both for the detection of additional copies of the same repeat within the protein, and for the identification of new distantly related members of the protein family.

Two complementary approaches were designed to increase the sensitivity of profiles for the detection of repeats. One approach, Repeats Detection Method 1 (RDM1), consists of computing a low acceptance threshold placed at level -1 in the profile. For simplicity we will call the level 0 cutoff the protein-threshold and the level -1 cutoff the minimal-threshold. When the profile is compared with a given sequence, a list of matches with scores greater than the minimal-threshold is collected. The matches are considered significant only if at least one hit with a score greater than the protein-threshold has been detected in the target protein. In a target sequence where the occurrence of a particular domain has been reported, the minimal-threshold represents the score above which the probability of detecting additional copies of the same domain by chance is close to zero.

However, the detection of repeats in proteins where no single domain scores above the protein-threshold remains critical. This is typically the case for more distantly related members of a protein family. To obviate this problem a second approach was devised, Repeats Detection Method 2 (RDM2). The sum of the scores of alignments with scores greater than the minimal-threshold is computed. If the sum of the individual domain scores is larger than a threshold (the sum-of-scores-threshold), these domains are considered to be true homologues. Based on the inspection of the list of positive hits found in database searches, we found that a good estimate for the sum-of-scores-threshold is the sum of the protein-threshold and the minimal-threshold. This value was chosen since it represents in theory the minimal match score that would be detected when aligning a profile to a member of a given protein family containing only two copies of a repeat.
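
The two decision rules can be summarised in a few lines of Python. This is a sketch of the logic as described in the text, not the ps_scan implementation: the function name `accept_repeats` is hypothetical, and the strict "greater than" comparisons are assumed from the wording above.

```python
def accept_repeats(scores, protein_threshold, minimal_threshold, use_rdm2=True):
    """Return (accepted_scores, method) for one profile scanned against one sequence.

    scores: raw scores of the non-overlapping matches found in the sequence.
    """
    # Only matches above the minimal-threshold (level -1) are candidates.
    candidates = [s for s in scores if s > minimal_threshold]
    # RDM1: one hit above the protein-threshold (level 0) validates all candidates.
    if any(s > protein_threshold for s in candidates):
        return candidates, "RDM1"
    # RDM2: no such hit, but the candidates together exceed the
    # sum-of-scores-threshold (protein-threshold + minimal-threshold).
    if use_rdm2 and sum(candidates) > protein_threshold + minimal_threshold:
        return candidates, "RDM2"
    return [], None

# Thresholds 246 and 158 are borrowed from the CUT_OFF example in this section.
print(accept_repeats([250, 170, 160, 100], 246, 158))
print(accept_repeats([200, 210], 246, 158))
print(accept_repeats([200, 180], 246, 158))
```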

RDM1 and RDM2 were implemented in the ps_scan PROSITE scanning program, the standalone version of ScanProsite (6). ps_scan scans a protein sequence (either from UniProtKB/Swiss-Prot or UniProtKB/TrEMBL, or provided by the user) for occurrences of the patterns and profiles stored in the PROSITE database. The modified ps_scan program applies RDM1 and/or RDM2 by default when run with profiles for repetitive domains. Profiles for repetitive domains are tagged with 'R' (LEVEL=0) and 'RR' or 'R?' (LEVEL=-1) in the TEXT field of the CUT_OFF lines of the profile. When the profile is tagged with 'RR' the two methods RDM1 and RDM2 are applied, whereas when it is tagged with 'R?' only RDM1 is applied. In the output of the program the reported matches are tagged with 'R' or with 'r' when the hits have been detected with RDM1 or RDM2 respectively.

Example:

MA   /CUT_OFF: LEVEL=0; SCORE=246; N_SCORE=8.5; MODE=1; TEXT='R';
MA   /CUT_OFF: LEVEL=-1; SCORE=158; N_SCORE=5.8; MODE=1; TEXT='RR';

MA   /CUT_OFF: LEVEL=0; SCORE=246; N_SCORE=8.5; MODE=1; TEXT='R';
MA   /CUT_OFF: LEVEL=-1; SCORE=158; N_SCORE=5.8; MODE=1; TEXT='R?';
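
A program that wants to honour these tags first has to read them from the CUT_OFF lines. The Python sketch below parses such a line and maps the TEXT tag of the LEVEL=-1 cut-off to the methods it enables, following the rule stated above. The helper names `parse_cut_off` and `repeat_methods` are hypothetical, and the field layout is assumed from the example lines only.

```python
import re

def parse_cut_off(line: str) -> dict:
    """Parse one 'MA   /CUT_OFF: ...' line into a dict of field name -> string."""
    body = line.split("/CUT_OFF:", 1)[1]
    # Each field looks like NAME=value; where value may be quoted with '...'.
    fields = re.findall(r"(\w+)=('[^']*'|[^;]+);", body)
    return {name: value.strip("'") for name, value in fields}

def repeat_methods(level_minus1_text: str) -> list:
    """Map the TEXT tag of the LEVEL=-1 cut-off to the repeat methods it enables."""
    return {"RR": ["RDM1", "RDM2"], "R?": ["RDM1"]}.get(level_minus1_text, [])

line = "MA   /CUT_OFF: LEVEL=-1; SCORE=158; N_SCORE=5.8; MODE=1; TEXT='R?';"
cut_off = parse_cut_off(line)
print(cut_off)
print(repeat_methods(cut_off["TEXT"]))
```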

References

3) Gribskov M., McLachlan A.D., Eisenberg D.
Proc. Natl. Acad. Sci. U.S.A. 84:4355-4358(1987).

4) Gribskov M., Luethy R., Eisenberg D.
Meth. Enzymol. 183:146-159(1990).

5) Luethy R., Xenarios I., Bucher P.
Protein Sci. 3:139-146(1994).

III. Database conventions

III.A. General structure

Since release 2024_03 of 29 May 2024 the PROSITE database is composed of three ASCII (text) files. The first file (PROSITE.DOC) contains textual information that fully documents protein families, domains and/or sites, describing (if known) their biological functions, taxonomic distributions and structures. The second file (PROSITE.DAT) is a computer readable file made of patterns and profiles. It contains all the information programs need to scan sequence(s) for membership of one of these protein families or for the presence of these protein domains or sites. The third file (PROSITE.AUX) contains auxiliary information, previously stored in the PROSITE.DAT file, that is recomputed at each release based on PROSITE motif matches in UniProtKB/Swiss-Prot and PDB.
We strongly urge software developers to build software tools that make use of all three files. A list of patterns or profiles present in a sequence is not very useful to biologists without the relevant documentation and auxiliary information.

III.B. Documentation file structure

The PROSITE documentation file is an ASCII file. The maximum line length has been set to 78 characters. The general format of a documentation entry is the following:


  {PDOCnnnnn}
  {PSmmmmm; ENTRY_NAME}
  ..
  {BEGIN}
  Documentation text lines
  .
  ..
  {END}

The first line '{PDOCnnnnn}', where 'nnnnn' is a five digit number is the documentation entry accession number.
The following lines '{PSmmmmm; ENTRY_NAME}' list the accession number and entry name of the PROSITE data file entry (or entries) that correspond to the documentation entry.
The documentation text lines are in ordinary English and are free-format. The only restriction is that they do not start with the character '{'.
References to other PROSITE documentation entries are indicated as follows:
(see <PDOC00100>)
References to PDB entries are indicated as follows:

      (see <PDB:1A4B>)
       or
      (see  <PDB:1J5E; M>)  where M is the name of a chain.

As an example, we show here a section of the documentation file that contains two entries.


   {PDOC00082}
   {PS00087; SOD_CU_ZN_1}
   {PS00332; SOD_CU_ZN_2}
   {BEGIN}
   ***********************************************
   * Copper/Zinc superoxide dismutase signatures *
   ***********************************************

   Copper/Zinc superoxide dismutase (EC 1.15.1.1) (SODC) [1] is  one of the three
   forms of an enzyme that catalyzes the dismutation of superoxide radicals. SODC
   binds one atom each  of zinc and copper.  Various forms  of  SODC are known: a
   cytoplasmic  form in  eukaryotes, an additional chloroplast form in plants, an
   extracellular form in some  eukaryotes, and a periplasmic form in prokaryotes.
   The metal binding sites are conserved in all the known SODC sequences [2].

   We derived two signature  patterns for this family of enzymes:  the  first one
   contains two  histidine residues that  bind the copper atom; the second one is
   located in the C-terminal section of  SODC  and  contains a  cysteine which is
   involved in a disulfide bond.

   -Consensus pattern: [GA]-[IMFAT]-H-[LIVF]-H-x(2)-[GP]-[SDG]-x-[STAGDE]
                       [The two H's are copper ligands]
   -Sequences known to belong to this class detected by the pattern: ALL.
   -Other sequence(s) detected in UniProtKB/Swiss-Prot: 5.

   -Consensus pattern: G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
                       [C is involved in a disulfide bond]
   -Sequences known to belong to this class detected by the pattern: ALL.
   -Other sequence(s) detected in UniProtKB/Swiss-Prot: NONE.

   -Note: these patterns will not detect proteins related to SODC, but which have
    lost their catalytic activity, such as Vaccinia virus protein A45.

   -Last update: July 1999 / Patterns and text revised.

   [ 1] Bannister J.V., Bannister W.H., Rotilio G.
        CRC Crit. Rev. Biochem. 22:111-154(1987).
   [ 2] Smith M.W., Doolittle R.F.
        J. Mol. Evol. 34:175-184(1992).
   {END}
   {PDOC00083}
   {PS00088; SOD_MN}
   {BEGIN}
   ******************************************************
   * Manganese and iron superoxide dismutases signature *
   ******************************************************

   Manganese  superoxide dismutase (EC 1.15.1.1) (SODM)  [1] is  one of the three
   forms of an enzyme that catalyzes the dismutation  of superoxide radicals. The
   four  ligands of  the manganese atom  are  conserved in  all  the  known  SODM
   sequences.  These metal ligands are also conserved in the related iron form of
   superoxide  dismutases [2,3].  We selected, as  a signature, a short conserved
   region which includes two of the four ligands: an aspartate and a histidine.

   -Consensus pattern: D-x-W-E-H-[STA]-[FY](2)
                       [D and H are manganese/iron ligands]
   -Sequences known to belong to this class detected by the pattern: ALL.
   -Other sequence(s) detected in UniProtKB/Swiss-Prot: NONE.
   -Last update: June 1992 / Text revised.

   [ 1] Bannister J.V., Bannister W.H., Rotilio G.
        CRC Crit. Rev. Biochem. 22:111-154(1987).
   [ 2] Parker M.W., Blake C.C.F.
        FEBS Lett. 229:377-382(1988).
   [ 3] Smith M.W., Doolittle R.F.
        J. Mol. Evol. 34:175-184(1992).
   {END}
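
The documentation entry layout described in this section is simple enough to be read with a small state machine. The following Python sketch is one possible reader, not an official parser: the function name `parse_prodoc` and the returned dictionary keys are hypothetical, and the sample text is a shortened version of the two entries shown above.

```python
import re

def parse_prodoc(text: str) -> list:
    """Split documentation-file text into entries.

    Each entry is a dict with keys 'doc_ac', 'signatures' and 'text'.
    """
    entries, current, in_text = [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        if in_text:
            # Inside {BEGIN}..{END}: everything is free text until {END}.
            if line == "{END}":
                entries.append(current)
                current, in_text = None, False
            else:
                current["text"].append(raw.rstrip())
        elif re.fullmatch(r"\{PDOC\d{5}\}", line):
            current = {"doc_ac": line[1:-1], "signatures": [], "text": []}
        elif line == "{BEGIN}" and current is not None:
            in_text = True
        elif current is not None:
            m = re.fullmatch(r"\{(PS\d{5}); (\w+)\}", line)
            if m:
                current["signatures"].append((m.group(1), m.group(2)))
    return entries

SAMPLE = """\
{PDOC00082}
{PS00087; SOD_CU_ZN_1}
{PS00332; SOD_CU_ZN_2}
{BEGIN}
Copper/Zinc superoxide dismutase signatures (text shortened)
{END}
{PDOC00083}
{PS00088; SOD_MN}
{BEGIN}
Manganese and iron superoxide dismutases signature (text shortened)
{END}
"""

for entry in parse_prodoc(SAMPLE):
    print(entry["doc_ac"], entry["signatures"])
```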

III.C. Data file structure

III.C.1. Structure of a data entry

The entries in the database data file (PROSITE.DAT) are structured so as to be usable by human readers as well as by computer programs. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. The general structure of a line is the following:


   Characters   Content
   ----------   ----------------------------------------------------------
   1 to 2       Two-character line code. Indicates the type of information
                contained in the line.
   3 to 5       Blank
   6 up to 128  Data

The currently used line types, along with their respective line codes, are listed below:


   ID  Identification                       (Begins each entry; 1 per entry)
   AC  Accession number                     (1 per entry)
   DT  Date                                 (1 per entry)
   DE  Short description                    (1 per entry)
   PA  Pattern                              (>=0 per entry)
   MA  Matrix/profile                       (>=0 per entry)
   PP  Post-processing                      (>=0 per entry)
   CC  Comments                             (>=0 per entry)
   PR  Reference to associated ProRule      (>=0 per entry)
   DO  Reference to the documentation file  (1 per entry)
   //  Termination line                     (Ends each entry; 1 per entry)

Lines do not extend over 78 characters, with the exception of "MA" lines whose length has no limit.
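
The fixed column layout makes the data file easy to split into line codes and data. The Python sketch below is a minimal reader under that assumption; the names `split_line` and `read_entries` are hypothetical, and the sample lines are taken from the pattern entry shown in section III.C.2.

```python
def split_line(line: str):
    """Return (line_code, data): columns 1-2 are the code, data starts at column 6."""
    if line.startswith("//"):
        return "//", ""
    return line[:2], line[5:].rstrip()

def read_entries(lines):
    """Group lines into entries; each entry is a list of (code, data) pairs."""
    entry = []
    for line in lines:
        code, data = split_line(line.rstrip("\n"))
        if code == "//":
            # The termination line closes the current entry.
            yield entry
            entry = []
        elif code.strip():
            # Blank lines carry no line code and are skipped.
            entry.append((code, data))

SAMPLE = [
    "ID   CUTINASE_1; PATTERN.",
    "AC   PS00155;",
    "DE   Cutinase, serine active site.",
    "PA   P-x-[STA]-x-[LIV]-[IVT]-x-[GS]-G-Y-S-[QL]-G.",
    "DO   PDOC00140;",
    "//",
]

for entry in read_entries(SAMPLE):
    for code, data in entry:
        print(code, "|", data)
```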

III.C.2. Example of a pattern entry


ID   CUTINASE_1; PATTERN.
AC   PS00155;
DT   APR-1990 (CREATED); NOV-1997 (DATA UPDATE); MAR-2005 (INFO UPDATE).
DE   Cutinase, serine active site.
PA   P-x-[STA]-x-[LIV]-[IVT]-x-[GS]-G-Y-S-[QL]-G.
NR   /RELEASE=46.4,178022;
NR   /TOTAL=20(20); /POSITIVE=20(20); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
CC   /SITE=11,active_site;
DR   P63880, CUT1_MYCBO , T; P63879, CUT1_MYCTU , T; P63882, CUT2_MYCBO , T;
DR   P63881, CUT2_MYCTU , T; P0A537, CUT3_MYCBO , T; P0A536, CUT3_MYCTU , T;
DR   P00590, CUTI1_FUSSO, T; Q96UT0, CUTI2_FUSSO, T; Q96US9, CUTI3_FUSSO, T;
DR   P41744, CUTI_ALTBR , T; P29292, CUTI_ASCRA , T; P52956, CUTI_ASPOR , T;
DR   Q00298, CUTI_BOTCI , T; P10951, CUTI_COLCA , T; P11373, CUTI_COLGL , T;
DR   Q8X1P1, CUTI_ERYGR , T; Q99174, CUTI_FUSSC , T; P30272, CUTI_MAGGR , T;
DR   Q8TGB8, CUTI_MONFR , T; Q9Y7G8, CUTI_PYRBR , T;
3D   1AGY; 1CEX; 1CUA; 1CUB; 1CUC; 1CUD; 1CUE; 1CUF; 1CUG; 1CUH; 1CUS; 1CUU;
3D   1CUV; 1CUW; 1CUY; 1CUZ; 1FFA; 1FFB; 1FFC; 1FFD; 1FFE; 1OXM; 1XZA; 1XZB;
3D   1XZC; 1XZD; 1XZE; 1XZF; 1XZG; 1XZH; 1XZJ; 1XZK; 1XZL; 1XZM; 2CUT;
DO   PDOC00140;
//
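
The NR, CC and DR lines of the example above carry structured data that a program will usually want as key/value pairs or tuples. The Python sketch below shows one way to split them, assuming only the layout visible in this example; the helper names `parse_qualifiers` and `parse_dr` are hypothetical, and no meaning is attached to the values here.

```python
import re

def parse_qualifiers(data: str) -> dict:
    """Parse '/NAME=value; /NAME=value;' data (as on NR and CC lines) into a dict."""
    return dict(re.findall(r"/([\w-]+)=([^;]*);", data))

def parse_dr(data: str) -> list:
    """Parse DR data into (accession, entry_name, flag) triples."""
    return [
        tuple(part.strip() for part in item.split(","))
        for item in data.split(";")
        if item.strip()
    ]

print(parse_qualifiers("/TOTAL=20(20); /POSITIVE=20(20); /UNKNOWN=0(0); /FALSE_POS=0(0);"))
print(parse_qualifiers("/TAXO-RANGE=??EP?; /MAX-REPEAT=1;"))
print(parse_dr("P63880, CUT1_MYCBO , T; P63879, CUT1_MYCTU , T; P63882, CUT2_MYCBO , T;"))
```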

III.C.3. Example of a profile (matrix) entry


ID   HSP20; MATRIX.
AC   PS01031;
DT   JUN-1994 (CREATED); DEC-2001 (DATA UPDATE); MAR-2005 (INFO UPDATE).
DE   Heat shock hsp20 proteins family profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=88;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=83;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=-0.7971325; R2=0.0157729; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=590; N_SCORE=8.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=463; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: M0=-8; D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='D'; M=-10,26,-29,38,34,-34,-14,-2,-33,7,-24,-23,8,-6,8,-4,0,-9,-27,-33,-19,21;
MA   /M: SY='I'; M=-8,-31,-23,-35,-28,7,-32,-27,27,-24,15,13,-27,-26,-24,-23,-20,-9,25,-4,2,-27;
MA   /M: SY='R'; M=-11,-12,-26,-12,-1,-13,-23,-1,-8,1,-7,-3,-8,-11,-2,8,-9,-6,-8,-22,-3,-4;
MA   /M: SY='E'; M=-11,17,-27,23,29,-24,-15,-3,-27,1,-22,-20,9,-1,6,-6,3,-4,-25,-32,-17,17;
MA   /M: SY='D'; M=-7,10,-23,11,2,-25,0,-6,-26,-4,-23,-18,7,-6,-5,-8,7,7,-20,-31,-17,-2;
MA   /I: I=-4; MD=-22;
MA   /M: SY='D'; M=-8,17,-27,25,19,-30,-13,-5,-28,6,-25,-20,7,3,4,-1,0,-7,-24,-30,-19,10; D=-4;
MA   /I: I=-4; MI=0; MD=-22; IM=0; DM=-22;
MA   /M: SY='D'; M=-11,20,-25,24,16,-29,-12,-1,-27,14,-25,-16,14,-9,10,5,1,-6,-23,-28,-14,13; D=-4;
MA   /I: I=-4; DM=-22;
..
... Some lines omitted..
..
MA   /M: SY='K'; M=-9,-5,-25,-6,0,-22,-21,-12,-17,30,-21,-6,-3,-16,1,23,-9,-7,-6,-23,-11,0;
MA   /I: E1=0; IE=-105; DE=-105;
NR   /RELEASE=46.4,178022;
NR   /TOTAL=195(194); /POSITIVE=190(189); /UNKNOWN=5(5); /FALSE_POS=0(0);
NR   /FALSE_NEG=1; /PARTIAL=8;
CC   /MATRIX_TYPE=protein_domain;
CC   /SCALING_DB=reversed;
CC   /AUTHOR=P_Bucher;
CC   /TAXO-RANGE=A?EP?; /MAX-REPEAT=2;
CC   /FT_KEY=DOMAIN; /FT_DESC=HSP20;
DR   P0A5B8, 14KD_MYCBO , T; P0A5B7, 14KD_MYCTU , T; P46729, 18K1_MYCAV , T;
DR   P46730, 18K1_MYCIT , T; P46731, 18K2_MYCAV , T; P46732, 18K2_MYCIT , T;
DR   P12809, 18KD_MYCLE , T; P80485, ASP1_STRTR , T; O30851, ASP2_STRTR , T;
..
... Some lines omitted..
..
DR   P12812, P40_SCHMA  , T; Q06823, SP21_STIAU , T; O34321, YOCM_BACSU , T;
DR   O12987, CRYAB_COLLI, P; O12991, CRYAB_EUDEL, P; Q91518, CRYAB_TRASC, P;
DR   O12995, CRYAB_TURME, P; P81161, HS22M_LYCES, P; P30220, HS30E_XENLA, P;
DR   P81083, HSP11_PINPS, P; Q9QUK5, HSPB7_RAT  , P;
DR   P22979, HSP6C_DROME, N;
DR   Q29438, ODFP_BOVIN , ?; Q14990, ODFP_HUMAN , ?; Q61999, ODFP_MOUSE , ?;

DR   Q29077, ODFP_PIG   , ?; P21769, ODFP_RAT   , ?;

3D   1SHS;

DO   PDOC00791;

//

III.D. Auxiliary file structure

Structure of an auxiliary entry

The entries in the database auxiliary file (PROSITE.AUX) are structured so as to be usable by human readers as well as by computer programs. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. The general structure of a line is the following:


   Characters   Content

   ----------   ----------------------------------------------------------

   1 to 2       Two-character line code. Indicates the type of information

                contained in the line.

   3 to 5       Blank

   6 up to 76   Data

The currently used line types, along with their respective line codes, are listed below:


   ID  Identification                                 (Begins each entry; 1 per entry)

   AC  Accession number                               (1 per entry)

   NR  Numerical results                              (>=0 per entry)

   CC  Comments                                       (>=0 per entry)

   DR  Cross-references to UniProtKB/Swiss-Prot       (>=0 per entry)

   3D  Cross-references to PDB                        (>=0 per entry)

   //  Termination line                               (Ends each entry; 1 per entry)

Lines do not extend over 76 characters.
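As an illustration of the line layout described above, the following Python sketch splits a line into its two-character line code and its data. The function name `split_line` is hypothetical and is not part of any PROSITE tool; the sketch only assumes that the data start at column 6.

```python
def split_line(line):
    """Split a PROSITE line into (line_code, data).

    Assumes the layout described above: columns 1-2 hold the line code,
    columns 3-5 are blank and the data start at column 6.
    """
    line = line.rstrip("\n")
    code = line[:2]
    data = line[5:].strip()
    return code, data


print(split_line("AC   PS00155;"))  # ('AC', 'PS00155;')
print(split_line("//"))             # ('//', '')
```

The termination line carries no data, so its data field is returned as an empty string.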

IV. The different line types for PROSITE.DAT and PROSITE.AUX files

This section describes in detail the format of each type of line used in the database data and auxiliary files (PROSITE.DAT and PROSITE.AUX).

IV.A. The ID line

The ID (IDentification) line is found in both PROSITE.DAT and PROSITE.AUX entries. It is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME; ENTRY_TYPE.

The first item on the ID line is the entry name. This name is a useful means of identifying an entry. The entry name consists of from 2 to 21 uppercase alphanumeric characters. The characters that are allowed in an entry name are: A-Z, 0-9, and the underscore character "_".

The second item on the ID line indicates the type of PROSITE entry. Currently this can be one of the following:


 PATTERN

 MATRIX

Examples:


ID   ADH_ZINC; PATTERN.

ID   SH3; MATRIX.
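The rules above can be checked mechanically. The following Python sketch validates an ID line with a regular expression; the names `ID_LINE` and `parse_id_line` are hypothetical, and the expression encodes only the constraints stated in this section (2 to 21 characters from A-Z, 0-9 and "_", followed by PATTERN or MATRIX).

```python
import re

# Entry name: 2 to 21 characters from A-Z, 0-9 and "_"; type: PATTERN or MATRIX.
ID_LINE = re.compile(r"^ID   ([A-Z0-9_]{2,21}); (PATTERN|MATRIX)\.$")


def parse_id_line(line):
    """Return (entry_name, entry_type), or None if the line is not a valid ID line."""
    match = ID_LINE.match(line.rstrip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_id_line("ID   ADH_ZINC; PATTERN."))  # ('ADH_ZINC', 'PATTERN')
print(parse_id_line("ID   SH3; MATRIX."))        # ('SH3', 'MATRIX')
```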

IV.B. The AC line

The AC (ACcession number) line is found in both PROSITE.DAT and PROSITE.AUX entries. It lists the accession number associated with an entry. It is always the second line of an entry. Accession numbers provide a stable way of identifying entries from release to release. It is sometimes necessary for reasons of consistency to change the names of the entries between releases.

An accession number, however, never changes. Accession numbers allow unambiguous citation of database entries. Researchers who wish to cite a PROSITE entry in their publications should always cite the accession number of that entry in order to ensure that readers can find the relevant data in a subsequent release.

The format of the AC line is:

AC   PSnnnnn;

Where 'PS' stands for PROSITE and 'nnnnn' is a five digit number. Example:

AC   PS00123;

IV.C. The DT line

The DT (DaTe) line is unique to PROSITE.DAT entries. It shows the date of entry or last modification of the entry. It is always the third line of an entry. The format of the DT line is:

DT   MMM-YYYY (CREATED); MMM-YYYY (DATA UPDATE); MMM-YYYY (INFO UPDATE).

where:

MMM is the month and YYYY the year.
The first date indicates when the entry first appeared in the database.
The second date indicates when the 'primary' data of the entry was last modified. By this we mean the data relevant to the pattern or matrix being described in that entry (PA and MA lines) as well as post-processing (PP) lines.
The third date indicates when any data other than the 'primary' data was last modified.

Example:


    DT   APR-1990 (CREATED); JUL-1990 (DATA UPDATE); JUL-1998 (INFO UPDATE).
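A DT line can be split into its three dates as sketched below in Python. The names `DT_LINE` and `parse_dt_line` and the dictionary keys are hypothetical choices made for this illustration.

```python
import re

# Three MMM-YYYY dates, each followed by its fixed label.
DT_LINE = re.compile(
    r"^DT\s+([A-Z]{3}-\d{4}) \(CREATED\); "
    r"([A-Z]{3}-\d{4}) \(DATA UPDATE\); "
    r"([A-Z]{3}-\d{4}) \(INFO UPDATE\)\.$"
)


def parse_dt_line(line):
    """Return the three dates of a DT line as a dictionary of MMM-YYYY strings."""
    match = DT_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a DT line: %r" % line)
    created, data_update, info_update = match.groups()
    return {
        "created": created,
        "data_update": data_update,
        "info_update": info_update,
    }


dates = parse_dt_line(
    "DT   APR-1990 (CREATED); JUL-1990 (DATA UPDATE); JUL-1998 (INFO UPDATE)."
)
print(dates["data_update"])  # JUL-1990
```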

IV.D. The DE line

The DE (DEscription) line is unique to PROSITE.DAT entries. It provides descriptive information about the content of the entry. It is always the fourth line of an entry. The format of the DE line is:

DE   Description.

The description is given in ordinary English and is free-format.
Examples:


DE   Myb DNA-binding domain repeat signature 1.

DE   Iron-containing alcohol dehydrogenases signature.

DE   Zinc finger, C2H2 type, domain.

DE   Globins profile.

IV.E. The PA line

PA (PAttern) lines are unique to PROSITE.DAT entries. They contain the definition of a PROSITE pattern. The patterns are described using the following conventions:

The standard IUPAC one-letter codes for the amino acids are used.
The symbol 'x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a given position, between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
Each element in a pattern is separated from its neighbor by a '-'.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a '<' symbol or respectively ends with a '>' symbol. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' are considered.
A period ends the pattern.

Examples:

PA   [AC]-x-V-x(4)-{ED}.

This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

PA   <A-x-[ST](2)-x(0,1)-V.

This pattern, which must be in the N-terminal of the sequence ('<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
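The conventions above map closely onto regular expressions. The Python sketch below translates the data of a PA line into a Python regular expression. The function name `prosite_to_regex` is hypothetical, the sketch covers only the syntax described in this section, and it is not the method used by the PROSITE scanning tools.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (PA line data) into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # a period ends the pattern
    prefix = suffix = ""
    if pattern.startswith("<"):            # N-terminal anchor
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):              # C-terminal anchor
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        # Separate an optional repetition such as "(3)" or "(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                      # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"  # excluded amino acids
        elif core.startswith("[") and ">" in core:
            # "[G>]": one of the listed residues or the C-terminus.
            regex = "(?:[" + core[1:-1].replace(">", "") + "]|$)"
        else:
            regex = core                     # single residue or "[...]"
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return prefix + "".join(parts) + suffix


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))     # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V."))  # ^A.[ST]{2}.{0,1}V
```

The result can be used with `re.search` on a protein sequence written in one-letter code. Note that `re.finditer` reports only non-overlapping matches, so a scan that must find every occurrence would need additional handling.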

IV.F. The MA line

MA (MAtrix) lines are unique to PROSITE.DAT entries. They contain the definition of a PROSITE profile (or matrix) entry. The exact format of this line is fully described in a specific document (profile.txt) which is part of the PROSITE distribution files.

IV.G. The PP line

The PP line is unique to PROSITE.DAT entries.
PROSITE profiles normally use two cut-off levels, a reliable cut-off (LEVEL=0) and a low confidence cut-off (LEVEL=-1). The low level cut-off usually covers the twilight zone, where a few true positives that cannot be separated from false positives might be present. The output of the pfsearch and pfscan programs indicates strong matches (level 0) with '!' and weak matches (level -1) with '?'. This specific tagging in the match list can be used in post-processing, to validate some true positives present in the twilight zone or to eliminate some false positives detected with a significant score.

We have already started to introduce some contextual information for the detection of repeat units, where a weak match can be promoted in some particular cases (see Methodology to identify repeats) and we have now generalized this approach to other contexts. To do so, we have introduced a new line type, PP (for Post Processing), that defines the conditions to retrieve matches in post processing.

Four different types of post processing are defined as below:


PP   /COMPETES_HIT_WITH: PS50001; PS50002(2); ...;

Overlapping matches between a profile and the one(s) listed in its PP line are in competition. For each region of the protein matched by competing profiles only the match with the highest normalized score is kept. The minimal size of the region of overlap to consider the match as overlapping can be specified between parentheses. If no size is specified, no overlap is tolerated.


PP   /COMPETES_SEQ_WITH: PS50001; PS50002; ...;

For each sequence matched by the two profiles only the one that produces the highest normalized score is kept to annotate the protein.


PP   /PROMOTED_BY: PS50001; PS50002; ...;

Weak matches (?) with the profile containing the PP line are promoted by the presence in the protein of a strong match (!) with the profile(s) defined in the PP line.


PP   /DEMOTED_BY: PS50001; PS50002; ...;

Strong matches (!) with the profile containing the PP line are demoted by the presence in the protein of a match with the profile(s) defined in the PP line.

The PP line is located just after the last MA line as shown in the following example:


MA   /I: E1=0; IE=-105; DE=-105;

PP   /COMPETES_HIT_WITH: PS51192; PS51193; PS51194;

NR   /RELEASE=50.1,223100;
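A program that applies post processing first has to read the PP line. The Python sketch below extracts the qualifier and the listed accession numbers, together with the optional overlap size given between parentheses. The function name `parse_pp_line` is hypothetical, and the sketch assumes that every listed item is a PROSITE accession number of the form PSnnnnn.

```python
import re


def parse_pp_line(line):
    """Return (qualifier, [(accession, overlap_or_None), ...]) for a PP line."""
    match = re.match(r"^PP\s+/([A-Z_]+):\s*(.*)$", line.strip())
    if match is None:
        raise ValueError("not a PP line: %r" % line)
    qualifier, rest = match.groups()
    targets = []
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # text after the last ';' is empty
        entry = re.fullmatch(r"(PS\d{5})(?:\((\d+)\))?", item)
        if entry is None:
            raise ValueError("unexpected item in PP line: %r" % item)
        overlap = int(entry.group(2)) if entry.group(2) else None
        targets.append((entry.group(1), overlap))
    return qualifier, targets


print(parse_pp_line("PP   /COMPETES_HIT_WITH: PS51192; PS51193; PS51194;"))
# ('COMPETES_HIT_WITH', [('PS51192', None), ('PS51193', None), ('PS51194', None)])
```

An overlap of `None` means that no size was specified on the line for that profile.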

IV.H. The NR line

The NR (Numerical Results) line is unique to PROSITE.AUX entries since 2024_03 of 29 May 2024. It contains information relevant to the results of the scan with a pattern or profile on UniProtKB/Swiss-Prot. The format of the NR line is:


            NR   /QUALIFIER=data; /QUALIFIER=data; .......

The qualifiers that are currently defined are:

/RELEASE	UniProtKB release number and total number of sequence entries in UniProtKB/Swiss-Prot of that release.
/TOTAL	Total number of hits in UniProtKB/Swiss-Prot.
/POSITIVE	Number of hits on proteins that are known to belong to the set in consideration.
/UNKNOWN	Number of hits on proteins that could possibly belong to the set in consideration.
/FALSE_POS	Number of false hits (on unrelated proteins).
/FALSE_NEG	Number of known missed hits.
/PARTIAL	Number of partial sequences which belong to the set in consideration, but which are not hit by the pattern or profile because they are partial (fragment) sequences.

The syntax of the /RELEASE qualifier is:

/RELEASE=nn,seq_num;

where 'nn' is a UniProtKB release number and 'seq_num' the total number of UniProtKB/Swiss-Prot entries in that release.

For all other qualifiers the syntax is:

/QUALIFIER=x(y);

/QUALIFIER=y;

where 'x' represents the number of hits and 'y' the number of sequences. In the majority of pattern entries 'x' will be equal to 'y', but for those patterns that are designed to detect domains that can be repeated more than once in a given sequence (for example: zinc-fingers, EF-hand regions, kringle domain, etc.), 'x' can be larger than 'y'. Such a situation is described in the following example:


NR   /RELEASE=40.7,103373;

NR   /TOTAL=123(56); /POSITIVE=115(51); /UNKNOWN=5(2); /FALSE_POS=3(3);

NR   /FALSE_NEG=3; /PARTIAL=2;

In the above example the scan for the pattern (or profile) was done on release 40.7 of UniProtKB/Swiss-Prot, which contained 103373 sequence entries. That pattern (or profile) was found 123 times in 56 different sequences (/TOTAL). Out of those 123 'hits', 115 were produced by 51 sequences that belong to the set under consideration (/POSITIVE), 5 hits were produced by two sequences which could possibly belong to the set (/UNKNOWN) and 3 hits were produced by 3 other sequences (/FALSE_POS). That particular pattern missed 3 sequences (/FALSE_NEG) and there were two partial sequences that belong to the set under consideration but which do not include the region that contains that pattern (or profile) (/PARTIAL).

Note: for some degenerate patterns (as for example the N-glycosylation consensus pattern), the NR lines are not provided as they would not yield any useful information.
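The two syntaxes of the NR qualifiers can be read as sketched below in Python. The function name `parse_nr_lines` is hypothetical. Values of the form x(y) are returned as (hits, sequences) tuples, values of the form y as a single integer, and the release number is kept as a string.

```python
import re


def parse_nr_lines(lines):
    """Collect the qualifiers of all NR lines of an entry into a dictionary."""
    results = {}
    for line in lines:
        for name, value in re.findall(r"/([A-Z_]+)=([^;]+);", line):
            if name == "RELEASE":
                release, seq_num = value.split(",")
                results[name] = (release, int(seq_num))
                continue
            counts = re.fullmatch(r"(\d+)\((\d+)\)", value)
            if counts:
                # x(y): number of hits, number of sequences
                results[name] = (int(counts.group(1)), int(counts.group(2)))
            else:
                # y: number of sequences only
                results[name] = int(value)
    return results


nr = parse_nr_lines([
    "NR   /RELEASE=40.7,103373;",
    "NR   /TOTAL=123(56); /POSITIVE=115(51); /UNKNOWN=5(2); /FALSE_POS=3(3);",
    "NR   /FALSE_NEG=3; /PARTIAL=2;",
])
print(nr["TOTAL"])      # (123, 56)
print(nr["FALSE_NEG"])  # 3
```

In the example of this section, the /TOTAL counts equal the sums of the /POSITIVE, /UNKNOWN and /FALSE_POS counts (115+5+3 hits in 51+2+3 sequences), which a program can use as a consistency check.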

IV.I. The CC line

The CC (Comments) lines contain various types of comments. The format of the CC line is:

CC    /QUALIFIER=data; /QUALIFIER=data; .......

The qualifiers that are currently defined are:

/TAXO-RANGE	Taxonomic range.
/MAX-REPEAT	Maximum known number of repetitions of the pattern or profile in a single protein.
/SITE	Indication of an `interesting' site in a pattern.
/SKIP-FLAG	Indication of an entry that can be, in some cases, ignored by a program (because it is too unspecific).
/VERSION	The version number of a pattern or a profile.

There are 5 qualifiers specific to profile entries:

/MATRIX_TYPE	Describes the region of the protein identified by the profile.
/SCALING_DB	Scaling database used to calibrate the profile.
/AUTHOR	Author of the profile.
/FT_KEY	Feature key to describe the region covered by the profile.
/FT_DESC	Feature description of the region covered by the profile.

IV.I.1. The /TAXO-RANGE qualifier

This qualifier is unique to PROSITE.AUX entries since 2024_03 of 29 May 2024. It is used to indicate the taxonomic range of a pattern or matrix. The syntax of that qualifier is the following:

/TAXO-RANGE=ABEPV;

where:

'A' stands for archaea
'B' stands for bacteriophages
'E' stands for eukaryotes
'P' stands for prokaryotes (bacteria)
'V' stands for eukaryotic viruses

When the pattern or matrix entry has no relevance to one of the above taxonomic classes a question mark ('?') replaces the corresponding letter symbol. Example:

/TAXO-RANGE=A?E??;

would be used in an entry relevant to proteins of archaeal ('A') and eukaryotic ('E') origin.

Note: the /TAXO-RANGE qualifier does not take into account false positive hits. For example: if a pattern produces one or more false positive hit(s) on bacteriophage protein(s) but no true positive results were obtained on any bacteriophage proteins, a question mark will be present instead of the 'B' in the second position of the /TAXO-RANGE qualifier.
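The positional encoding of the taxonomic range can be decoded as in the following Python sketch. The names `TAXO_CLASSES` and `decode_taxo_range` are hypothetical; the five classes and their order are those listed above.

```python
# The five positions of the /TAXO-RANGE value, in order.
TAXO_CLASSES = (
    ("A", "archaea"),
    ("B", "bacteriophages"),
    ("E", "eukaryotes"),
    ("P", "prokaryotes (bacteria)"),
    ("V", "eukaryotic viruses"),
)


def decode_taxo_range(value):
    """Return the names of the taxonomic classes present in a /TAXO-RANGE value."""
    value = value.strip().rstrip(";")
    if len(value) != len(TAXO_CLASSES):
        raise ValueError("expected 5 characters: %r" % value)
    names = []
    for char, (letter, name) in zip(value, TAXO_CLASSES):
        if char == letter:
            names.append(name)
        elif char != "?":
            raise ValueError("unexpected character %r in %r" % (char, value))
    return names


print(decode_taxo_range("A?E??"))  # ['archaea', 'eukaryotes']
```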

IV.I.2. The /MAX-REPEAT qualifier

This qualifier is unique to PROSITE.DAT entries. It is used to indicate the maximum number of times a given pattern or profile has been found in a single protein sequence. The syntax of that qualifier is the following:

/MAX-REPEAT=nn;

For example, in the CC lines of the pattern entry to detect an EF-hand calcium-binding domain we have:

/MAX-REPEAT=8;

This indicates that up to 8 copies of the EF-hand domain are known to be present in at least one protein sequence.

Notes: One should not make the assumption that the value indicated by this qualifier is equivalent to the maximum number of hits that will be obtained by the pattern or profile being described; it is not uncommon that a pattern or a profile will not detect all occurrences of a repeated domain.

IV.I.3. The /SITE qualifier

This qualifier is unique to PROSITE.DAT entries. It is used to indicate the position of an 'interesting' site in a pattern or a profile. For example, if a pattern includes an active site residue, the /SITE qualifier will be used to indicate the position of that residue in the pattern. The syntax of this qualifier is the following:

/SITE=nn,text_description;

where 'nn' is the position in the pattern or the profile of the site being described and 'text_description' a textual description of that site. Examples:


/SITE=3,active_site;

/SITE=5,disulfide;

Notes:

For pattern entries, the position numbering is indicated in pattern element units. For example if we want to indicate that the 'C' in the pattern '<A-[ILMV]-x(2,4)-A-C-P' is involved in a disulfide bond we would indicate '/SITE=5,disulfide;', the 'C' being the fifth element in the pattern.

For profile (matrix) entries, the position numbering relates to match positions.

If necessary there can be more than one /SITE qualifier in the CC line(s) of an entry. For example in the pattern entry specific to proteins of the cytochrome c family, the pattern 'C-{CPWHF}-{CPWR}-C-H-{CFWY}' has the following /SITE qualifiers in its CC lines:

/SITE=1,heme; /SITE=4,heme; /SITE=5,heme_iron;

This indicates that the two 'C's are the residues that bind the heme group and that the 'H' is an axial ligand to the heme iron.

If the presence of a site is assumed, but experimental data is lacking, a '(?)' is appended at the end of the text description. For example if we have the pattern 'A-x(2)-C-R' and the cysteine in that pattern is thought to be involved in a disulfide bond, it would be indicated as:

/SITE=3,disulfide(?);
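The /SITE qualifiers of a pattern entry can be read and related to pattern elements as sketched below in Python. The function names `parse_sites` and `site_element` are hypothetical. The sketch applies to pattern entries only, where positions are counted in pattern element units.

```python
import re


def parse_sites(cc_data):
    """Return a list of (position, description, is_assumed) tuples."""
    sites = []
    for position, text in re.findall(r"/SITE=(\d+),([^;]+);", cc_data):
        assumed = text.endswith("(?)")  # site assumed, no experimental data
        if assumed:
            text = text[:-3]
        sites.append((int(position), text, assumed))
    return sites


def site_element(pattern, position):
    """Return the pattern element found at a 1-based /SITE position."""
    elements = pattern.strip().rstrip(".").strip("<>").split("-")
    return elements[position - 1]


print(parse_sites("/SITE=1,heme; /SITE=4,heme; /SITE=5,heme_iron;"))
# [(1, 'heme', False), (4, 'heme', False), (5, 'heme_iron', False)]
print(site_element("C-{CPWHF}-{CPWR}-C-H-{CFWY}", 5))  # H
```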

IV.I.4. The /SKIP-FLAG qualifier

This qualifier is unique to some PROSITE.DAT entries.
Some PROSITE entries such as those describing commonly found post-translational modifications (a typical example is N-glycosylation) are found in the majority of known protein sequences. While it is generally useful to note their presence, some programs may want, in some cases, to ignore those entries. For this purpose these entries are indicated with the following qualifier in their CC lines:

/SKIP-FLAG=TRUE;

IV.I.5. The /VERSION qualifier

This qualifier is unique to PROSITE.DAT entries. The version number (an integer) is incremented only when a modification takes place in PA or MA lines. Version numbers were introduced in release 19.0, where they were all set to version one. Example:

 /VERSION=1;

IV.I.6. The /MATRIX_TYPE qualifier

This qualifier is unique to PROSITE.DAT entries. It describes the type of region in the protein identified by the profile.

Example:

/MATRIX_TYPE=protein_domain;

The matrix type can be protein_domain, repeat_region, localization_signal or composition where:

Protein_domain	Describes a profile directed against a conserved region of a protein.
Repeat_region	Describes a profile directed against a run of repeat units.
Localization_signal	Describes a profile directed against a region important for the localization of protein in the cell.
Composition	Describes a profile directed against a region of low complexity or enriched in a given amino acid.

IV.I.7. The /SCALING_DB qualifier

This qualifier is unique to PROSITE.DAT entries. It indicates which database was used to calibrate the profile.

Example:

/SCALING_DB=window20_shuffled;

Scaling databases currently used are:

reversed	Is a protein database, randomized by taking the reverse sequence of each individual entry.
window20	Is a protein database, locally shuffled in windows of 20 residues.
window20_shuffled	Is a small version of a window20 protein database.
db_global	Is a protein database, globally shuffled in windows of 20 residues.

IV.I.8. The /AUTHOR qualifier

This qualifier is unique to PROSITE.DAT entries. It is used to indicate the author that created or updated the profile.

Example:

/AUTHOR=K_Hofmann, P_Bucher;

The first name is the author of the profile, the second one the author of the last update.

IV.I.9. The /FT_KEY and /FT_DESC qualifiers

These qualifiers are unique to PROSITE.DAT entries. They are used to give a computer readable short description of the region identified by the profile. They are based on the UniProtKB Feature Table key and Feature Table description currently used to define the region identified by the profile.

Example:

/FT_KEY=DOMAIN; /FT_DESC=KRINGLE;

FT_KEY can be NP_BIND, MOTIF, DOMAIN, REPEAT, DNA_BIND or ZN_FING. More details on feature keys and feature descriptions can be found in the UniProtKB user manual.

IV.J. The DR line

The DR (Database Reference) lines are unique to PROSITE.AUX entries since 2024_03 of 29 May 2024. They are used as pointers to the UniProtKB/Swiss-Prot entries that are picked up (or missed) by the pattern being described in the entry. The format of the DR line is:

DR   AC_NB, ENTRY_NAME, C; AC_NB, ENTRY_NAME, C; AC_NB, ENTRY_NAME, C;

where:

'AC_NB' is the UniProtKB/Swiss-Prot primary accession number of the entry to which reference is being made.
'ENTRY_NAME' is the UniProtKB/Swiss-Prot entry name.
'C' is a one character flag that can be one of the following:

T	For a true positive.
P	For a 'potential' hit; a sequence that belongs to the set under consideration, but which was not picked up because the region(s) that are used as a 'fingerprint' (pattern or profile) is not yet available in the database (partial sequence).
N	For a false negative; a sequence which belongs to the set under consideration, but which has not been picked up by the pattern or profile.
?	For an unknown; a sequence which possibly could belong to the set under consideration.
F	For a false positive; a sequence which does not belong to the set in consideration.

Example:


DR   O08775, VGFR2_RAT  , T; P35916, VGFR3_HUMAN, T; P35917, VGFR3_MOUSE, T;

DR   P13388, XMRK_XIPMA , T; O29592, Y665_ARCFU , T; P00527, YES_AVISY  , T;

DR   Q28923, YES_CANFA  , T; P09324, YES_CHICK  , T; P07947, YES_HUMAN  , T;

DR   Q04736, YES_MOUSE  , T; P10936, YES_XENLA  , T; P27447, YES_XIPHE  , T;

DR   Q02977, YRK_CHICK  , T; Q19238, YS3J_CAEEL , T; Q11112, YX05_CAEEL , T;

DR   P43403, ZAP70_HUMAN, T; P43404, ZAP70_MOUSE, T;

DR   P13387, EGFR_CHICK , P; P55245, EGFR_MACMU , P; Q61526, ERBB3_MOUSE, P;

DR   Q61527, ERBB4_MOUSE, P; Q29000, IGF1R_PIG  , P; Q64716, INSRR_RAT  , P;

DR   Q28516, INSR_MACMU , P; Q01621, LCK_RAT    , P;

DR   P51451, BLK_HUMAN  , N; P16277, BLK_MOUSE  , N; Q90344, EPHB2_COTJA, N;

DR   P21860, ERBB3_HUMAN, N; O61460, VAB1_CAEEL , N; P83097, WSCK_DROME , N;

DR   Q58530, GCP_METJA  , ?; O27476, GCP_METTH  , ?; Q68101, GCVK_HCMVT , ?;

DR   P18150, APHE_STRGR , F; P00551, KKA1_ECOLI , F; Q03447, KKA1_SALTY , F;

DR   P00555, KKA5_STRFR , F; P14509, KKA8_ECOLI , F; P13250, KKA9_STRRI , F;

In the above example, we have pointers to 17 UniProtKB/Swiss-Prot sequences which are true positives ('T'), eight which are potential hits ('P'), six which have been missed ('N'), three sequences that may belong to the set under consideration ('?'), and six sequences that are false positives ('F').
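The counts given for the example above can be obtained programmatically. The Python sketch below splits DR lines into (accession number, entry name, flag) triples and tallies the flags. The function names `parse_dr_line` and `count_flags` are hypothetical.

```python
from collections import Counter


def parse_dr_line(line):
    """Return the (accession, entry_name, flag) triples of one DR line."""
    references = []
    data = line.strip()[2:]  # drop the "DR" line code
    for item in data.split(";"):
        item = item.strip()
        if not item:
            continue
        # Entry names are padded with blanks, so strip every field.
        accession, entry_name, flag = (field.strip() for field in item.split(","))
        references.append((accession, entry_name, flag))
    return references


def count_flags(lines):
    """Count how many references carry each flag (T, P, N, ? or F)."""
    return Counter(flag for line in lines for _, _, flag in parse_dr_line(line))


print(parse_dr_line("DR   P43403, ZAP70_HUMAN, T; P43404, ZAP70_MOUSE, T;"))
# [('P43403', 'ZAP70_HUMAN', 'T'), ('P43404', 'ZAP70_MOUSE', 'T')]
```

Applied to all DR lines of an entry, `count_flags` gives the number of true positives, potential hits, false negatives, unknowns and false positives.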

IV.K. The 3D line

The 3D (3D-structure) line is unique to PROSITE.AUX entries since 2024_03 of 29 May 2024. It is used to list the code(s) of the Protein Data Bank (PDB) entries that contain structural data corresponding to the sequence region described in a PROSITE entry. The format of the 3D line is:

3D   name; [name2;...]

Example:

3D   7WGA; 9WGA; 1WGC; 2WGC;

IV.L. The PR line

PROSITE is now complemented with a set of rules, ProRule, which are used to give extra meaningful information when a match with a PROSITE profile or pattern is detected. Each rule is triggered by a PROSITE entry and contains information linked to the domain or protein family covered by the profile/pattern. This information can be general, e.g. always associated with the domain or protein family, or conditional, depending on the presence of particular residues in functionally or structurally critical positions. The rule(s) associated with a profile/pattern is cross-referenced in the profile/pattern entry in a new line type (PR line).
The PR line is unique to PROSITE.DAT entries.

Example:

PR   PRU00001;

The PR line is located just before the DO line as shown in the following example:


3D   1V87; 1WEO; 1WIM; 1X4J; 1Z6U; 2CSY; 2CT2;

PR   PRU00175;

DO   PDOC00449;

IV.M. The DO line

The DO (DOcumentation) line is unique to PROSITE.DAT entries. It contains a pointer to the entry in the PROSITE documentation file that describes the entry. The format of the DO line is:

DO   PDOCnnnnn;

where 'PDOC' stands for PROSITE DOCumentation and 'nnnnn' is a five digit number. Example:

DO   PDOC00128;

IV.N. The termination line

The // (terminator) line contains no data or comments. It designates the end of an entry.
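Because every entry ends with a termination line, a file can be read entry by entry. The Python sketch below groups the lines of a data or auxiliary file into entries. The function name `iter_entries` is hypothetical, and the sketch assumes the line layout described in this manual (two-character line code, data starting at column 6). Any lines that precede the first termination line, such as a file header, would be returned as a first group, and lines after the last termination line are ignored.

```python
def iter_entries(lines):
    """Yield each entry as a list of (line_code, data) tuples."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            # The termination line closes the current entry.
            if entry:
                yield entry
            entry = []
        else:
            entry.append((line[:2], line[5:].strip()))


text = """ID   CUTINASE_1; PATTERN.
AC   PS00155;
DO   PDOC00140;
//
ID   HSP20; MATRIX.
AC   PS01031;
//
"""
for entry in iter_entries(text.splitlines()):
    print(dict(entry)["AC"])
# PS00155;
# PS01031;
```

Converting an entry to a dictionary, as in the usage above, keeps only the last line of each type, so line types that can occur several times (for example MA, CC or DR) should be collected from the list instead.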