ScanProsite tool manual
The ScanProsite tool lets you scan protein sequence(s) against the PROSITE database. You can provide either UniProt Knowledgebase or Protein Data Bank (PDB) sequence identifier(s) (AC(s) and/or ID(s)), or sequence(s) in fasta or UniProtKB format. By default, the motifs searched for are PROSITE patterns and profiles. PROSITE profiles may optionally be excluded.

The ScanProsite tool also lets you search protein sequence database(s) for hits by specific motif(s). The motif(s) may be PROSITE pattern(s) and/or profile(s), or may be provided by the user (note that the program PRATT lets you generate your own patterns). By default, the protein sequence database to be scanned is UniProtKB/Swiss-Prot, including splice variants. Other protein sequence databases may also be scanned, such as UniProtKB/TrEMBL and/or PDB. Furthermore, you can narrow your search by specifying filter(s) and pattern option(s).

Finally, the ScanProsite tool lets you scan protein sequence(s) for the occurrence of motif(s).

ScanProsite can be used in two modes: a quick scan mode and an advanced scan mode.
(For programmatic access, see ScanProsite REST web service)

Quick Scan mode

You may enter up to 8 sequences to scan against PROSITE patterns and profiles.

Procedure:

Paste your sequences in the text box. You can enter either:
- UniProtKB ACs or IDs (e.g. P01621 or KV303_HUMAN), each on a separate line (up to 8)
- Sequences in fasta format (see for instance: P01621 fasta) (up to 8)
- ONE raw sequence (only amino acids, 1 letter code, no numbers); no more than one raw sequence can be entered.

Click on 'Scan'.

If you want to scan a UniProtKB (Swiss-Prot or TrEMBL)/PDB sequence against all PROSITE motifs :

Paste your sequence identifier, either: AC (e.g. P01621) or ID (e.g. KV3C_HUMAN) in the text box. Click on 'Scan'.
In both cases the sequence(s) will be scanned against all PROSITE motifs. If you want to scan a sequence against all PROSITE motifs except the ones with a high probability of occurrence, please select the corresponding check box under the scan button. Hits from PROSITE profiles with a score greater than the profile-defined cutoff score will be shown. Results will be displayed in HTML 'graphical rich view' mode.

Advanced Scan mode

Guide to the most frequent operation

If you want to scan your sequence against all PROSITE motifs :

Paste your sequence in the text box of the section 'Sequence(s) to be scanned'. The sequence must be either raw (only amino acids, 1 letter code, no numbers), in fasta format, or in UniProtKB format.
Now click on 'START THE SCAN'. Your sequence will be scanned against all PROSITE motifs except the ones with a high probability of occurrence. By default, hits from PROSITE profiles with a score greater than the profile-defined cutoff score will be shown, and results will be displayed in HTML 'graphical rich view' mode. You can change these parameters.

If you want to scan a UniProtKB (Swiss-Prot or TrEMBL)/PDB sequence against all PROSITE motifs :

Enter the sequence identifier, either: AC (e.g. P05130) or ID (e.g. ENTK_HUMAN), or paste the sequence in the text box of the section 'Sequence(s) to be scanned'.
Click on 'START THE SCAN' button...

If you want to scan the UniProt Knowledgebase with a particular PROSITE motif :

Enter the PROSITE motif identifier, either: AC (e.g. PS50240) or ID (e.g. TRYPSIN_DOM).
Click on 'START THE SCAN' button...

If you want to scan the UniProt Knowledgebase with a particular pattern :

Type your pattern in the text box of the section 'Motifs(s) to scan for'. You should use the PROSITE pattern syntax and type your pattern on one line (don't type 'enter'/'return' inside the pattern).
Click on 'START THE SCAN' button...

Scan with multiple sequences or motifs

Multiple sequences :

You can scan multiple sequences at the same time (maximum 8 sequences if the scan is against all PROSITE motifs, 16 if the scan is against more than one motif, 1000 if against a single motif). Put each UniProtKB (Swiss-Prot or TrEMBL) and/or PDB sequence identifier on a new line. If you want to scan several of your own sequences, you must enter them in fasta or UniProtKB format.

Multiple motifs :

You can scan against multiple motifs at the same time (maximum 8 motifs if the scan is against (a) protein database(s), or 16 if the scan is against (a) particular protein(s)). Separate PROSITE motif identifiers or patterns with white spaces (new line, space, tab). Hits from any of those motifs will be shown (implicit logical OR).
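
The limits stated in the two paragraphs above can be summarized in a small Python sketch. The function names are hypothetical and only restate the numbers given in this manual; they are not part of ScanProsite.

```python
def max_sequences(motif_count=None):
    """Sequence limit for one scan.

    motif_count=None stands for a scan against all PROSITE motifs
    (a convention chosen for this sketch only).
    """
    if motif_count is None:
        return 8        # scan against all PROSITE motifs
    if motif_count == 1:
        return 1000     # scan against a single motif
    return 16           # scan against more than one motif


def max_motifs(scanning_database):
    """Motif limit: 8 against protein database(s), 16 against particular protein(s)."""
    return 8 if scanning_database else 16


def within_limits(n_sequences, n_motifs):
    """Check a scan of particular sequences against particular motifs."""
    return (n_sequences <= max_sequences(n_motifs)
            and n_motifs <= max_motifs(scanning_database=False))
```

For example, within_limits(16, 2) holds, whereas within_limits(17, 2) does not.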

Explicit logical expression with motifs:

You can use logical operators: ... and ..., ... or ..., not ... (with parentheses if needed)
e.g. PS50110 and ( PS50043 or PS51294 )
e.g. PS50240 and not PS01180
n.b. Operator precedence: the (innermost) parentheses are handled first. The not operator is right-associative (the expression to the right of not is evaluated before the not is applied); and/or are left-associative (the expression to their left is evaluated first).
n.b. A root not is not allowed, e.g. not PS50240 is forbidden (it would give too many matches).
n.b. If you use parentheses, put a space before and after each of them.
n.b. When you use logical operators, all your expressions should be explicit: no mix of white space separated motifs (implicit OR) with explicit elements.
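
To make these rules concrete, here is a minimal Python sketch of an evaluator for such expressions. It is not ScanProsite's parser: the function names are hypothetical, and the sketch assumes that and/or have equal precedence and are applied left to right, which is one reading of the precedence note above. Tokens are split on white space, which is why parentheses need surrounding spaces.

```python
def parse(expression):
    """Parse a motif expression into a nested tuple tree."""
    tokens = expression.split()
    pos = 0

    def parse_expr():
        nonlocal pos
        node = parse_unary()
        # and/or: equal precedence, evaluated left to right (assumption)
        while pos < len(tokens) and tokens[pos] in ("and", "or"):
            op = tokens[pos]
            pos += 1
            node = (op, node, parse_unary())
        return node

    def parse_unary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "not":
            return ("not", parse_unary())
        if tok == "(":
            node = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if tok in (")", "and", "or"):
            raise ValueError(f"unexpected token {tok!r}")
        return ("motif", tok)

    tree = parse_expr()
    if pos != len(tokens):
        # e.g. white-space separated motifs mixed with explicit operators
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    if tree[0] == "not":
        raise ValueError("a root 'not' is not allowed")
    return tree


def evaluate(tree, hits):
    """Return True if the set of motifs hitting a protein satisfies the tree."""
    kind = tree[0]
    if kind == "motif":
        return tree[1] in hits
    if kind == "not":
        return not evaluate(tree[1], hits)
    left, right = evaluate(tree[1], hits), evaluate(tree[2], hits)
    return (left and right) if kind == "and" else (left or right)
```

With tree = parse("PS50110 and ( PS50043 or PS51294 )"), a protein hit by PS50110 and PS51294 satisfies the expression, while a protein hit by PS50110 alone does not.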

Settings

Pattern syntax

  1. The standard IUPAC one-letter codes for the amino acids are used in PROSITE.
  2. The symbol `x' is used for a position where any amino acid is accepted.
  3. Ambiguities are indicated by listing the acceptable amino acids for a given position, between square brackets `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
  4. Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
  5. Each element in a pattern is separated from its neighbor by a `-'.
  6. Repetition of an element of the pattern can be indicated by following that element with a numerical value or, if it is a gap ('x'), by a numerical range between parentheses.
    Examples:
    x(3) corresponds to x-x-x
    x(2,4) corresponds to x-x or x-x-x or x-x-x-x
    A(3) corresponds to A-A-A
    Note: You can only use a range with 'x', i.e. A(2,4) is not a valid pattern element.
  7. When a pattern is restricted to either the N- or C-terminus of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is accepted.
The following extended syntax is allowed for ScanProsite:
  • If your pattern consists of one-letter amino acid codes only, without any ambiguous residues, you need not specify the '-', i.e. you can directly copy/paste peptide sequences into the text field.
    Example: M-A-S-K-E can be written as MASKE.

  • To search for all sequences which do not contain a certain amino acid, e.g. Cys, you can use <{C}*>.
Examples : [AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

<A-x-[ST](2)-x(0,1)-V
This pattern, which must be in the N-terminal of the sequence (`<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val

<{C}*>
This pattern describes all sequences which do not contain any cysteine.

IIRIFHLRNI
This pattern describes all sequences which contain the subsequence 'IIRIFHLRNI'.
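
As an illustration of the syntax rules above, the following Python sketch translates a PROSITE-style pattern into a Python regular expression. The function name prosite_to_regex is hypothetical and is not part of ScanProsite; the sketch covers only the constructs described in this section and makes no claim about how ScanProsite itself matches patterns.

```python
import re

# One pattern element: x, a residue, [..] or {..}, optionally followed
# by (n), (n,m) or * (the * form is used in <{C}*>).
_ELEMENT = re.compile(
    r"(?P<core>x|[A-Z]|\[[A-Z>]+\]|\{[A-Z]+\})"
    r"(?:\((?P<min>\d+)(?:,(?P<max>\d+))?\)|(?P<star>\*))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip()
    prefix = suffix = ""
    if pattern.startswith("<"):          # N-terminal anchor
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):            # C-terminal anchor
        suffix, pattern = "$", pattern[:-1]

    if "-" in pattern:
        elements = pattern.split("-")
    elif re.fullmatch(r"[A-Z]+", pattern):
        elements = list(pattern)         # plain peptide such as MASKE
    else:
        elements = [pattern]             # single element such as {C}*

    parts = []
    for element in elements:
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"invalid pattern element: {element!r}")
        core = m.group("core")
        if core == "x":
            regex = "."
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"
        elif core.startswith("[") and ">" in core:
            residues = core[1:-1].replace(">", "")
            if not residues:
                raise ValueError(f"invalid pattern element: {element!r}")
            regex = "(?:[" + residues + "]|$)"   # residue or end of sequence
        else:
            regex = core
        if m.group("star"):
            regex += "*"
        elif m.group("min"):
            low, high = m.group("min"), m.group("max")
            if high is not None and core != "x":
                raise ValueError("a range can only be used with 'x'")
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return prefix + "".join(parts) + suffix
```

The resulting string can be passed to re.search; for instance, the pattern <{C}*> becomes ^[^C]*$, which matches only sequences without any C.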

Database randomization

Note: can only be used in scans against patterns (no profile allowed).

It is often useful to be able to search a pattern against a random database in order to evaluate its specificity. It is desirable that the database be not completely random, but comparable to the databases which are to be scanned in terms of amino acid frequency and local compositional bias. ScanProsite can randomize the scanned databases on the fly, using one of two methods:
  • reverse sequences - randomize by taking the reverse sequence of each individual entry
  • shuffle - randomize by local shuffling of the residues in windows of 20 residues
The reverse sequences method is generally recommended, but it is not suitable for patterns which are strongly enriched in one amino acid (e.g. C-C-C-[LIV]) or which are palindromic (e.g. M-L-L-M). Sample randomized databases and the scripts used to generate them are available at ftp://ftp.isrec.isb-sib.ch/pub/databases/shuffled/.
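
The two randomization methods can be sketched in Python as follows. This is an illustration only, not the scripts mentioned above: the function names are hypothetical, and the shuffle assumes consecutive, non-overlapping windows of 20 residues, which the manual does not specify.

```python
import random


def reverse_sequence(sequence):
    """Randomize by taking the reverse of the sequence."""
    return sequence[::-1]


def shuffle_in_windows(sequence, window=20, rng=None):
    """Shuffle residues locally, inside consecutive windows.

    Assumption: windows are non-overlapping and the last one may be shorter.
    Amino acid composition is preserved within each window.
    """
    rng = rng or random.Random()
    pieces = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)
        pieces.append("".join(chunk))
    return "".join(pieces)
```

Reversing a palindromic peptide such as MLLM gives MLLM again, which illustrates why the reverse method is unsuitable for palindromic patterns.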

Pattern matching mode

Three parameters let you fine-tune the behaviour of the pattern-matching engine. These are :

greed :
extend variable-length pattern elements as far as possible

overlap :
allow partially overlapping matches

include :
allow matches included within one another (implies overlap)

The default behavior is greedy and allows overlapping matches, but not included ones. This means that, of two overlapping matches, one is rejected if it is entirely contained within the other.

For example, consider the sequence ``ABACADAEAFA'' and the simple pattern ``A-x(1,3)-A''. The six possible combinations of the switches produce the following results:

  • greed=1, overlap=1, include=0 (default) : 4 matches

      ABACADAEAFA
      ooooo......
      ..ooooo....
      ....ooooo..
      ......ooooo

  • greed=1, overlap=1, include=1 : 5 matches

      ABACADAEAFA
      ooooo......
      ..ooooo....
      ....ooooo..
      ......ooooo
      ........ooo

  • greed=1, overlap=0 : 2 matches

      ABACADAEAFA
      ooooo......
      ......ooooo

  • greed=0, overlap=1, include=0 or 1 : 5 matches

      ABACADAEAFA
      ooo........
      ..ooo......
      ....ooo....
      ......ooo..
      ........ooo

  • greed=0, overlap=0 : 3 matches

      ABACADAEAFA
      ooo........
      ....ooo....
      ........ooo
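
The match counts in this example can be reproduced with a short Python sketch. It is not ScanProsite's matching engine: the function names are hypothetical, the semantics of the three switches are inferred from the descriptions and results above, and the pattern A-x(1,3)-A is written directly as the regular expression A.{1,3}A.

```python
import re


def scan(sequence, regex, greed=True, overlap=True, include=False):
    """Return (start, end) spans, 0-based and end-exclusive."""
    overlap = overlap or include          # include implies overlap
    compiled = re.compile(regex)

    # One candidate per start position: longest match if greedy, else shortest.
    candidates = []
    for start in range(len(sequence)):
        ends = [end for end in range(start + 1, len(sequence) + 1)
                if compiled.fullmatch(sequence, start, end)]
        if ends:
            candidates.append((start, max(ends) if greed else min(ends)))

    if not overlap:
        # Keep a match only if it starts after the previous kept match ends.
        kept, last_end = [], 0
        for start, end in candidates:
            if start >= last_end:
                kept.append((start, end))
                last_end = end
        return kept

    if not include:
        # Drop matches entirely contained within another match.
        return [m for m in candidates
                if not any(o != m and o[0] <= m[0] and m[1] <= o[1]
                           for o in candidates)]
    return candidates


def diagram(sequence, span):
    """Render a span in the 'ooo...' notation used above."""
    start, end = span
    return "." * start + "o" * (end - start) + "." * (len(sequence) - end)
```

Calling scan("ABACADAEAFA", "A.{1,3}A") with the default switches returns four spans, and diagram renders the first of them as ooooo......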

Filters

You can adjust the ScanProsite parameters by specifying filters on taxonomic lineage (OC) or species (OS). PROSITE uses the same Taxonomy database as UniProtKB. If you want to define a filter including several taxa/species, separate them with a semicolon (e.g. Eukaryota;Escherichia coli;). Filters cannot be specified for PDB sequences.

Output

Exclude motifs with a high probability of occurrence :

Does not scan against motifs with a high probability of occurrence.

In the result output, a match to a frequently occurring profile is marked as 'occurs frequently' in 'simple html' / text output, or is listed under the 'hits by frequently occurring profiles' category in the 'rich view' output. Default state: ON.
Note: If you scan against (a) particular motif(s), this setting is ignored; it is only used for scan against all PROSITE motifs.

Do not scan profiles :

Scans only against PROSITE patterns but not profiles. Default state: OFF.
Note: If you scan against (a) particular motif(s), this setting is ignored; it is only used in scans against all PROSITE motifs.

Show low level score :

Shows 'weak' hits from profiles whose score is below the normal profile cut-off score; uses the level -1 cut-off (PROSITE profiles have at least 2 score cut-offs: a confident cut-off (level 0) and a 'borderline' cut-off that produces more false positives (level -1); see the PROSITE user manual). In the result output, these weak hits are marked as 'hit with a low confidence level (-1)' in the 'rich view' output or 'low confidence' in simple HTML/text output. Default state: OFF.

Format

Graphical rich view :
HTML view with graphical representation of hits on proteins (as downloadable images) and prediction (for certain profiles) of features inside matches; see 'Rich View' manual.
Simple HTML output:
Simple HTML view of results without graphical representation of hits and feature prediction.

Plain text output:
Text-only view (without any html link).

Plain text fasta output:
Text only view, in fasta format: each hit is shown as a fasta sequence where the sequence header/name is:
[the matched protein]/[hit start]-[hit-stop]/[the matching PROSITE motif]/score (only for profiles)/confidence level tag (if any).
Note: If 'Retrieve complete sequences' is selected, the complete protein sequence will replace the matched sequence (and only 1 'hit' per matched protein will be shown).
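
As a hedged illustration, a header in the format described above could be split into its fields as follows. The function name and the example header are hypothetical; the sketch assumes that fields are separated by '/' exactly as written above, that the protein field itself contains no '/', and that a numeric trailing field is the score while a non-numeric one is the confidence tag.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    protein: str
    start: int
    stop: int
    motif: str
    score: Optional[float]   # only present for profiles
    tag: Optional[str]       # confidence level tag, if any


def parse_hit_header(header):
    """Split a fasta hit header into its fields (format assumed as above)."""
    fields = header.strip().lstrip(">").split("/")
    if len(fields) < 3:
        raise ValueError(f"unexpected header: {header!r}")
    protein, span, motif, *extra = fields
    start, stop = (int(value) for value in span.split("-"))
    score = tag = None
    for value in extra:
        if not value:
            continue
        try:
            score = float(value)   # numeric field: profile score
        except ValueError:
            tag = value            # anything else: confidence tag
    return Hit(protein, start, stop, motif, score, tag)
```

For a made-up header such as >EXAMPLE_PROT/12-34/PS99999/8.5, this returns start 12, stop 34 and score 8.5, with no tag.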

Show only sequences with at least X hit(s) :

In the results show only proteins (and their matched regions) that are hit at least X times. Default value: undefined (X = 1).

Maximum of matched sequences :

Maximum number of distinct matched proteins that can be shown in the output. Default value: 1000.
Note: If the value is set to more than 1000, results will not be shown in your web browser (a safeguard to prevent too much data being sent to your browser) and must be sent to you (as plain text) via email: please enter your email address in the 'Your e-mail' box.

Retrieve complete sequences :

Add the complete protein sequence to the information displayed for each matched protein. Default state: OFF.
Note: In plain text fasta output mode, the complete protein sequence will replace the matched sequence!
Note: If 'Graphical rich view' is the selected output mode, it will be changed to 'simple HTML' because the rich view doesn't display retrieved sequences (only scanned sequences), so that the generated HTML code is not too big.

Your e-mail :

If filled out (and valid), the results will be emailed to this address (in plain text) instead of being displayed in your browser. Default value: undefined (= no email, interactive mode).