ScanProsite - user manual


ScanProsite lets you scan proteins for matches against the PROSITE collection of motifs as well as against user-defined patterns.

At the beginning the user has to choose between three options:

Option 1 - Submit PROTEIN sequences to scan them against the PROSITE collection of motifs.
Option 2 - Submit MOTIFS to scan them against a PROTEIN sequence database.
Option 3 - Submit PROTEIN sequences and MOTIFS to scan them against each other.





Quick Scan

The Quick Scan mode of ScanProsite corresponds to a simplified version of 'Option 1 - Submit PROTEIN sequences to scan them against the PROSITE collection of motifs' that is available from the PROSITE homepage.
Enter or paste up to 10 protein sequences in the textarea.
The accepted input is:
  • UniProtKB accessions e.g. P98073 or identifiers e.g. ENTK_HUMAN
  • PDB identifiers e.g. 4DGJ
  • Sequences in FASTA format
Your input sequences will be scanned against all PROSITE motifs. Motifs with a high probability of occurrence are excluded if the checkbox below the textarea is checked and included if it is unchecked (see the 'Exclude motifs with a high probability of occurrence' option).
Once the scan has been carried out, the results are displayed in the 'Graphical view' output format.


Main operations

Submit PROTEIN sequences

You can either enter or paste protein sequences in the textarea or submit a protein database.
If you choose to enter sequences in the textarea, the accepted input is:
  • UniProtKB accessions e.g. P98073 or identifiers e.g. ENTK_HUMAN
  • PDB identifiers e.g. 4DGJ
  • Sequences in FASTA format
If you are in 'Option 1' (scan against all PROSITE motifs), the maximum number of sequences that you can submit is 10. If you are in 'Option 3' (scan against specified motifs), the maximum number of sequences you can enter is 1'000 if you submit 1 motif and 50 if you submit a combination of motifs.

If you want the scan to be carried out against your own sequence database, either enter a database code or submit a file in FASTA format (max. 16MB). Once your file is uploaded, you will receive a code that you can use for repeated scans on the database you have just submitted. The database will remain on our server for a period of 1 month.

Submit MOTIFS (Enter a MOTIF or a combination of MOTIFS)

Enter a motif or a combination of motifs in the textarea, the supported input is:
  • A PROSITE accession e.g. PS50240 or identifier e.g. TRYPSIN_DOM
  • Your own pattern e.g. P-x(2)-G-E-S-G(2)-[AS]
  • A combination of PROSITE accessions/identifiers e.g. PS50240 and PS50068, e.g. PS50240 and not ( PS00134 or PS00135 )
  • A combination of PROSITE accessions/identifiers and your own pattern e.g. PS50240 and P-x(2)-G-E-S-G(2)-[AS]
You can then modify a couple of default scanning parameters (see 'Scanning options').
Pattern syntax
  • The standard IUPAC one letter code for the amino acids is used in PROSITE.
  • The symbol 'x' is used for a position where any amino acid is accepted.
  • Ambiguities are indicated by listing the acceptable amino acids for a given position, between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
  • Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
  • Each element in a pattern is separated from its neighbor by a '-'.
  • Repetition of an element of the pattern can be indicated by following that element with a numerical value or, if it is a gap ('x'), by a numerical range between parentheses.
    Examples:
    • x(3) corresponds to x-x-x
    • x(2,4) corresponds to x-x or x-x-x or x-x-x-x
    • A(3) corresponds to A-A-A
  • When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern respectively starts with a '<' symbol or ends with a '>' symbol.
    In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' is equivalent to 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>'.
Note:
  • Ranges can only be used with 'x'; for instance 'A(2,4)' is not a valid pattern element.
  • Ranges of 'x' are not accepted at the beginning or at the end of a pattern unless the pattern is restricted/anchored to the N- or C-terminal of a sequence, respectively. For instance 'P-x(2)-G-E-S-G(2)-[AS]-x(0,200)' is not accepted but 'P-x(2)-G-E-S-G(2)-[AS]-x(0,200)>' is.
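
The syntax rules above map fairly directly onto regular expressions. The following is a minimal sketch, not the ScanProsite implementation: the function name prosite_to_regex is hypothetical, only the standard syntax is covered (not the extended ScanProsite syntax), and the rule about 'x' ranges at the ends of a pattern is not enforced.

```python
import re

# One pattern element: a residue, 'x', a [..] set or a {..} exclusion,
# optionally followed by a repeat count (n) or a range (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z>]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern: str) -> str:
    """Translate a standard PROSITE pattern into a Python regex (illustrative)."""
    pattern = pattern.strip().rstrip(".")
    prefix = suffix = ""
    if pattern.startswith("<"):          # anchored to the N-terminal
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):            # anchored to the C-terminal
        suffix, pattern = "$", pattern[:-1]
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"invalid pattern element: {element!r}")
        core, low, high = m.groups()
        if high is not None and core != "x":
            raise ValueError(f"ranges are only valid with 'x': {element!r}")
        if core == "x":                  # any amino acid
            piece = "."
        elif core.startswith("{"):       # excluded amino acids
            piece = "[^" + core[1:-1] + "]"
        elif core.startswith("[") and ">" in core:
            # '>' inside brackets: one of the residues or the C-terminal
            piece = "(?:[" + core[1:-1].replace(">", "") + "]|$)"
        else:                            # single residue or [..] set
            piece = core
        if low is not None:
            piece += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(piece)
    return prefix + "".join(parts) + suffix

print(prosite_to_regex("P-x(2)-G-E-S-G(2)-[AS]"))
```

Such a translation is only useful for checking a pattern locally; match modes (greed, overlap, include) are handled by the ScanProsite engine and are not reproduced by a plain regex search.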


Extended syntax for ScanProsite:
  • If your pattern does not contain any ambiguous residues, you don't need to specify separation with '-'.
    Example: M-A-S-K-E can be written as MASKE.
    It means that in such a case you can directly copy/paste peptide sequences into the textfield.
  • To search all sequences which do not contain a certain amino acid, e.g. Cys, you can use <{C}*>.
You can use the program PRATT to generate your own pattern.

Pattern                  Explanation
[AC]-x-V-x(4)-{ED}       [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
<A-x-[ST](2)-x(0,1)-V    Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val at the N-terminal of the sequence
<{C}*>                   No Cys from the N-terminal to the C-terminal, i.e. all sequences that do not contain any Cys
IIRIFHLRNI               Ile-Ile-Arg-Ile-Phe-His-Leu-Arg-Asn-Ile


Combination of MOTIFS

You can submit multiple motifs at the same time. The upper limit is 8 motifs for a scan against a protein database (Option 2 - Step 1) and 16 for a scan against specified sequences (Option 3 - Step 2).
You can use logical operators: 'and', 'or' and 'not' with parentheses if needed.

Examples of logical expressions
PS50240 PS50068
PS50240 and PS50068
PS50240 and P-x(2)-G-E-S-G(2)-[AS]
PS50240 and not PS50068
PS50240 and ( PS00134 or PS00135 )
PS50240 and not ( PS00134 or PS00135 )
  • The 'or' is implicit, which means that for instance 'PS50240 PS50068' is equivalent to 'PS50240 or PS50068'. If you want to look for sequences matched by both PS50240 and PS50068, you must use 'PS50240 and PS50068'.
  • (Innermost) parentheses are handled first.
  • The 'not' is right associative, which means that what's on the right of the 'not' is evaluated before the 'not'.
  • The 'and' and 'or' are left associative, which means that what's on the left of an 'and' or an 'or' is evaluated before the 'and' or 'or'.
  • A root 'not' like in 'not PS50240' is not allowed because it would give too many matches.
  • If you use parentheses, put a space before and after each of them. For instance 'PS50240 and not ( PS00134 or PS00135 )' is correct while 'PS50240 and not (PS00134 or PS00135)' is wrong.
  • If you use logical operators, all your expressions must be explicit, i.e. you cannot use white spaces standing for 'or'. For instance 'PS50240 and not ( PS00134 or PS00135 )' is correct while 'PS50240 and not ( PS00134 PS00135 )' is wrong.
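
To make these rules concrete, here is a minimal sketch of how such an expression could be evaluated for one sequence, given the set of motifs that matched it. It is not the ScanProsite parser: the function name evaluate is hypothetical, and it assumes that 'and' and 'or' have equal precedence and are applied left to right, and that 'not' applies to the next operand or parenthesized group only.

```python
def evaluate(expression, matched):
    """Return True if the set of matched motifs satisfies the expression."""
    tokens = expression.split()          # parentheses must be space-separated
    if not tokens:
        raise ValueError("empty expression")
    if not any(t in ("and", "or", "not") for t in tokens):
        # No explicit operator: white space stands for an implicit 'or'.
        return any(t in matched for t in tokens)
    if tokens[0] == "not":
        raise ValueError("a root 'not' is not allowed")
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "not":                 # evaluate what is on the right first
            return not operand()
        if tok == "(":                   # parentheses are handled first
            value = expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("expected ' )'")
            pos += 1
            return value
        if tok in ("and", "or", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in matched            # a motif accession or pattern

    def expr():
        nonlocal pos
        value = operand()
        while pos < len(tokens) and tokens[pos] in ("and", "or"):
            op = tokens[pos]
            pos += 1
            right = operand()
            value = (value and right) if op == "and" else (value or right)
        return value

    result = expr()
    if pos != len(tokens):               # e.g. an implicit 'or' mixed with operators
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result

print(evaluate("PS50240 and not ( PS00134 or PS00135 )", {"PS50240"}))
```

Splitting on white space is why a space is needed around each parenthesis: user patterns such as P-x(2)-G-E-S-G(2)-[AS] contain parentheses themselves, so a parenthesis attached to a token cannot be told apart from one belonging to a pattern.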


Select a PROTEIN sequence database

Select one of the available PROTEIN sequence databases. If you want the scan to be carried out against your own sequence database, either enter a database code or submit a file in FASTA format (max. 16MB). Once your file is uploaded, you will receive a code that you can use for repeated scans on the database you have just submitted. The database will remain on our server for a period of 1 month.

Randomized UniProtKB/Swiss-Prot
It is often useful to be able to search a pattern against a random database in order to evaluate its specificity. It is desirable for that database not to be completely random, but comparable to the databases which are to be scanned in terms of amino acid frequency and local compositional bias. ScanProsite can randomize scanned databases on the fly, using one of two methods:
  • reverse: reverse sequences - created by taking the reverse sequence of each individual entry.
  • window20: shuffled sequences - created by local shuffling of each individual sequence entry using a window width of 20 residues
The reverse sequences method is generally recommended, but it is not suited to patterns that are strongly enriched in one amino acid, e.g. C-C-C-[LIV], or to palindromic ones, e.g. M-L-L-M. Sample randomized databases and the scripts used to generate them are available at ftp://ftp.isrec.isb-sib.ch/pub/databases/shuffled/.
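
The two randomization methods can be sketched as follows. This is an illustration only, not the scripts referred to above: the function names are hypothetical, and window_shuffle assumes consecutive, non-overlapping windows of 20 residues, which may differ from the actual window20 implementation.

```python
import random

def reverse_sequence(seq):
    """'reverse' method: read the sequence from the C- to the N-terminal."""
    return seq[::-1]

def window_shuffle(seq, width=20, rng=None):
    """'window20'-like method: shuffle residues inside each local window.

    Assumption: windows are consecutive and do not overlap.
    """
    rng = rng or random.Random()
    shuffled = []
    for start in range(0, len(seq), width):
        window = list(seq[start:start + width])
        rng.shuffle(window)              # keeps the local composition intact
        shuffled.append("".join(window))
    return "".join(shuffled)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(reverse_sequence(seq))
print(window_shuffle(seq, rng=random.Random(0)))
```

Both functions preserve the overall amino acid frequency, and the windowed shuffle also preserves local compositional bias. A palindromic pattern such as M-L-L-M matches a reversed sequence exactly as often as the original one, which is why the reverse method is not suited to it.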

Note: Scanning a randomized sequence database only makes sense against patterns.

Filters

Filter: length >= than
  Usage: Specifies a minimal length. Must be a positive integer or zero, e.g. 150.
  Database application: UniProtKB (Swiss-Prot and TrEMBL) and PDB

Filter: length <= than
  Usage: Specifies a maximal length. Must be a positive integer, e.g. 500.
  Database application: UniProtKB (Swiss-Prot and TrEMBL) and PDB

Filter: Taxonomy
  Usage: Enter a taxonomical term e.g. 'Homo sapiens', e.g. 'Fungi; Arthropoda' or the corresponding NCBI TaxID e.g. 9606, e.g. '4751; 6656' that you can obtain from the NCBI or the UniProt taxonomy databases. Multiple terms must be separated by a semicolon.
  Database application: UniProtKB (Swiss-Prot and TrEMBL)

Filter: Description
  Usage: e.g. protease. Looks for the term entered in the description (DE) line of the scanned sequences.
  Database application: UniProtKB (Swiss-Prot and TrEMBL)

Filter: Tissue expression
  Usage: Choose a term in the list e.g. 'brain'. See the Bgee data. Only works on Human, Xenopus, Mouse and Zebrafish; adult stage. Does not consider splice variants of UniProtKB/Swiss-Prot.
  Database application: UniProtKB (Swiss-Prot and TrEMBL)


Scanning options

Option: Exclude motifs with a high probability of occurrence
  Description: Does not scan against motifs with a high probability of occurrence.
  Default value: On

Option: Exclude profiles
  Description: Does not scan against profiles. => Scans only against patterns.
  Default value: Off

Option: Run the scan at high sensitivity
  Description: Runs the scan at a low level (shows weak matches). Concerns profiles only.
  Default value: Off

Option: Minimal number of hits per matched sequence
  Description: Defines how many hits there must be in a sequence for the matched sequence to be displayed.
  Default value: 1

Option: Match mode
  Description: Defines the match mode for pattern matching. Concerns patterns only.
  Default value: Greedy, overlaps, no includes


Exclude motifs with a high probability of occurrence

Description: Does not scan against patterns with a high probability of occurrence. Concerns patterns only.
Default value: On

Motifs with a high probability of occurrence are in most cases patterns that are found in many protein sequences. Some of them describe, for example, commonly found post-translational modifications and others compositionally biased regions.
While it is generally useful to note their presence, some programs may want, in some cases, to ignore those entries. For this purpose these entries are indicated with the following qualifier in their CC lines: '/SKIP-FLAG=TRUE;', as in the following entry:

        ID   ASN_GLYCOSYLATION; PATTERN.
        AC   PS00001;
        DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
        DE   N-glycosylation site.
        PA   N-{P}-[ST]-{P}.
        CC   /SITE=1,carbohydrate;
        CC   /SKIP-FLAG=TRUE;
        CC   /VERSION=1;
        PR   PRU00498;
        DO   PDOC00001;
        //

Matches by frequently occurring motifs are displayed under 'hits by patterns/profiles with a high probability of occurrence' if the output format is 'Graphical view'. If the output format is 'Simple view' or 'Text', each motif accession number is tagged with '[occurs frequently]'.
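
A program that wants to ignore such entries can look for the qualifier in the CC lines. The sketch below assumes the flat-file layout of the entry shown above (a two-letter line code followed by the line content); the function names parse_entry and has_skip_flag are hypothetical and not part of any PROSITE tool.

```python
ENTRY = """\
ID   ASN_GLYCOSYLATION; PATTERN.
AC   PS00001;
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
CC   /SITE=1,carbohydrate;
CC   /SKIP-FLAG=TRUE;
CC   /VERSION=1;
//
"""

def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "//":     # skip blanks and the terminator
            continue
        fields = line.split(None, 1)     # line code, then the rest of the line
        value = fields[1] if len(fields) > 1 else ""
        entry.setdefault(fields[0], []).append(value)
    return entry

def has_skip_flag(entry):
    """True if any CC line carries the /SKIP-FLAG=TRUE qualifier."""
    return any("/SKIP-FLAG=TRUE" in cc for cc in entry.get("CC", []))

entry = parse_entry(ENTRY)
print(entry["AC"][0], has_skip_flag(entry))
```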

Exclude profiles

Description: Does not scan against profiles. => Scans only against patterns.
Default value: Off


Run the scan at high sensitivity

Description: Runs the scan at a low level (shows weak matches). Concerns profiles only.
Default value: Off

PROSITE profiles normally use two cut-off levels, a reliable cut-off (LEVEL=0) and a low confidence cut-off (LEVEL=-1).

This option runs the scan at the low confidence cut-off (LEVEL=-1) and hence shows matches that are below the reliable cut-off (LEVEL=0).
Weak hits are tagged with '[warning: hit with a low confidence level (-1)]' if the output format is 'Graphical view' and '[low confidence]' if the output format is 'Simple view' or 'Text'.

Minimal number of hits per matched sequence

Description: Defines how many hits there must be in a sequence for the matched sequence to be displayed.
Default value: 1

Match mode


Three parameters allow fine-tuning of the behaviour of the pattern-matching engine:
parameter   action
greed       extends variable-length pattern elements as far as possible
overlap     allows partially overlapping matches
include     allows matches included within one another (implies overlap)

The default behavior is greedy, allows overlaps but not included matches. This means that two overlapping matches are rejected if one is entirely contained within the other.
For example, consider the sequence "ABACADAEAFA" and the simple pattern "A-x(1,3)-A". The six possible combinations of the switches produce the following results:
  • greed=1, overlap=1, include=0 (default) : 4 matches
      ABACADAEAFA
      ooooo......
      ..ooooo....
      ....ooooo..
      ......ooooo
  • greed=1, overlap=1, include=1 : 5 matches
      ABACADAEAFA
      ooooo......
      ..ooooo....
      ....ooooo..
      ......ooooo
      ........ooo
  • greed=1, overlap=0 : 2 matches
      ABACADAEAFA
      ooooo......
      ......ooooo
  • greed=0, overlap=1, include=0 or 1 : 5 matches
      ABACADAEAFA
      ooo........
      ..ooo......
      ....ooo....
      ......ooo..
      ........ooo
  • greed=0, overlap=0 : 3 matches
      ABACADAEAFA
      ooo........
      ....ooo....
      ........ooo
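
The match counts above can be reproduced with a small simulation. This is a sketch built on Python regular expressions, not the ScanProsite engine: the function name find_matches is hypothetical, the pattern A-x(1,3)-A is written by hand as a greedy and a lazy regex, and at most one candidate match per start position is assumed.

```python
import re

def find_matches(sequence, greedy_regex, lazy_regex,
                 greed=True, overlap=True, include=False):
    """Return (start, end) spans, end exclusive, for the given match mode."""
    regex = re.compile(greedy_regex if greed else lazy_regex)
    # Collect at most one candidate per start position.
    candidates = []
    for start in range(len(sequence)):
        m = regex.match(sequence, start)
        if m:
            candidates.append((m.start(), m.end()))
    if not overlap:
        # Keep a match only if it starts after the previous kept match ends.
        kept, last_end = [], 0
        for s, e in candidates:
            if s >= last_end:
                kept.append((s, e))
                last_end = e
        return kept
    if include:
        return candidates                # included matches are allowed
    # Overlaps allowed, but drop matches entirely contained in another one.
    return [
        (s, e) for (s, e) in candidates
        if not any((s2, e2) != (s, e) and s2 <= s and e <= e2
                   for (s2, e2) in candidates)
    ]

SEQ = "ABACADAEAFA"
GREEDY, LAZY = "A.{1,3}A", "A.{1,3}?A"   # both stand for A-x(1,3)-A
for greed, overlap, include in [(1, 1, 0), (1, 1, 1), (1, 0, 0),
                                (0, 1, 0), (0, 1, 1), (0, 0, 0)]:
    spans = find_matches(SEQ, GREEDY, LAZY, bool(greed), bool(overlap), bool(include))
    print(greed, overlap, include, len(spans))
```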
    
                    


Output formats

Graphical view

HTML view with a graphical representation of hits on proteins (as downloadable images) and prediction (for certain profiles) of features inside matches.




This Web tool displays for each hit within a protein sequence: the hit sequence, the score (for hits against a profile), the PROSITE description and link. In addition, if predicted, biological features associated with each matched sequence are also indicated.
Results are separated into different kinds of hits: hits by 'profiles', 'profiles with a high probability of occurrence', 'patterns', 'patterns with a high probability of occurrence' or 'user-defined patterns'. Inside each of these categories, hits by protein are sorted by their N-terminal position, but multiple hits against a similar motif are grouped together.
In addition, for each matched protein, a graphical view in the form of a downloadable png (Portable Network Graphics) image represents all its matches (of the aforementioned type) and detected features. Profile hits are represented as colored shapes with their PROSITE name; pattern hits are shown (separated) as thin colored bars without text.
If a match overlaps with the previous one, it is shown on a different line. However, if the overlap is smaller than 10% of the match size, the match is shown on the same line: its overlapping start is truncated and replaced by a vertical red bar (indicating that there is a small overlap).

Biological features:
For certain profiles, additional biologically meaningful information about residues inside matches is defined. This additional information comes from the mapping of biologically meaningful residues to PROSITE profiles. It is used to make functional/structural predictions of profile matches more accurate (profiles show enhanced sensitivity over patterns but, because of their relaxed stringency, lose functional/structural discriminative power).
If certain conditions expected for the functional and/or structural properties associated with the domain are fulfilled, the properties are shown as 'Predicted features'. For each feature, the UniProtKB feature key, the position/range, the feature description (if any), and the condition that triggered the detection are shown.
A condition can be a specific amino acid inside the hit, a group of sub-conditions that must all be true for the group condition to be true, a case (choice) between different sub-conditions or groups, etc.
Features associated with conditions that were not fulfilled are shown as 'Absent features' in the same way as predicted ones, except that the condition here shows why the feature has not been detected (condition/case not true and/or incomplete group).
On the graphical view, features are shown on top of hits; depending on their type, they appear as bridges, horizontal bars or vertical pins.


Graphical view legend


Individual view:
For a scan of more than one sequence against all PROSITE motifs (Option 1), you can click on 'individual view' next to the graphical display so as to see only hits against the protein sequence in question.

View all PROSITE motifs hits on sequence:
For a scan of specific sequences against specific motifs (Option 3), you can click on 'View all PROSITE motifs hits on sequence' in order to see all PROSITE motif matches against the protein in question (except for the ones with a high probability of occurrence, and at a regular level of sensitivity for profile matches).

Match/sequence highlighting:
When hits for only one protein are shown and you use a Mozilla-based web browser (Mozilla, FireBird/Fox, Netscape 7), feature residues are highlighted (green for predicted features, gray for absent features) on both the match and the full protein sequence (if shown) when you move your mouse cursor over a feature line. In addition, if the full sequence of the protein is shown (if you click on 'Individual view' or 'View all PROSITE motifs hits on sequence', or if you submitted only one protein), the match region in the protein sequence is highlighted in yellow when you move your mouse cursor over that match in the graphical view or the text view.
Highlights are persistent as long as you don't move your cursor over another match/feature (note that left/right margins are immune to cursor moves).

Simple view

Simple HTML view of results without graphical representation of hits and feature prediction.

Text

Text-only view (without any html link).

FASTA

Text-only view in FASTA format. Each hit is shown as a FASTA sequence whose header/name is:
[the matched protein]/[hit start]-[hit stop]/[the matching PROSITE motif]/[the score (only for profiles)]/[the confidence level (if any)].
Note: If 'Retrieve complete sequence' is selected, the complete protein sequence replaces the matched sequence and only one hit per matched sequence is represented.

Table

Text view containing for each hit on a sequence:
[the matched protein] [hit start] [hit stop] [the matching PROSITE motif] [the score (only for profiles)] [the confidence level (if any)] [the matched region]
Note: If 'Retrieve complete sequence' is selected, the complete protein sequence replaces the matched sequence and only one hit per matched sequence is represented.

Match list

List of matches (UniProtKB accessions if you submitted UniProtKB accessions or identifiers, PDB identifiers if you submitted PDB identifiers, first space delimited word of the FASTA header if you submitted FASTA sequences).

Miniprofiles

PROSITE pattern hits are validated by automatically generated 'miniprofiles' that assign a status to pattern matches.

Most PROSITE patterns have an associated miniprofile. Miniprofiles are stored in evaluator.dat and their accession number (AC) is the same as the pattern from which they originate except for the replacement of 'PS' by 'MP'. Example: the miniprofile for 'PS00134' is 'MP00134'.
When there's a hit by a given pattern, the sequence is scanned against the pattern's associated miniprofile: if the miniprofile also matches the region matched by the pattern, credit is added to the relevance of the pattern's match.

The table below shows, for each output format, what is displayed when the pattern's hit is also matched or respectively not matched by the pattern's associated miniprofile.

Output format     matched by miniprofile     not matched by miniprofile
Graphical view    confidence level: (0)      confidence level: (-1)
Simple view       confidence level: (0)      confidence level: (-1)
Text view         confidence level: (0)      confidence level: (-1)
FASTA             (0)                        (-1)
Table             (0)                        (-1)
Matchlist         /                          /

For more information on miniprofiles, please consult "The 20 years of PROSITE".


Output options

Maximum number of displayed matches

The maximum number of distinct matched proteins that can be shown in the output.
This number is set to 10'000 by default. If you choose 100'000, the results won't be shown in your web browser, as a security measure to prevent too much data from being sent to your browser. You will then have to submit an email address for the results to be sent to you by email.

Retrieve complete sequences

Adds the complete protein sequence to the information displayed for each matched protein.
This option limits the choices of output formats to 'Simple view', 'Text', 'FASTA' and 'Table'; it also limits the 'Maximum number of displayed matches' to 1'000.
Note: For the output formats 'FASTA' and 'Table', the complete protein sequence replaces the matched sequence and only one hit per matched sequence is represented.

Email and job title

Returning results by email limits the choice of output format to 'Text', 'FASTA', 'Table' and 'Matchlist'.
If the chosen 'Maximum number of displayed matches' is 1'000, results have to be sent by email and a valid email address is then required. In other situations ScanProsite ignores what you've entered in the email textbox unless it is a valid email address.

Job title: If you've entered a valid email address and you fill in this field, the 'Job title' will appear in the subject of the email you receive for that job.


Programmatic access: REST web service

REST introduction

REST: REpresentational State Transfer

REST originally referred to a collection of architectural principles, but the acronym is now often used to describe any simple web-based interface for programmatic access that uses XML (or YAML, JSON, plain text) over HTTP without the extra abstractions of MEP-based approaches like the web services SOAP protocol.
The 'naked' data, without any envelope is retrieved as the content of the HTTP query response.
The options for the operation to be performed are part of the HTTP query parameters, the target URL representing the resource being accessed.
The REST philosophy also implies using HTTP 'verbs' (PUT, GET, POST, DELETE) to perform distinct operations (respectively: Create, Read, Update, Delete) on the target resources (url).
For more information on REST, consult the Wikipedia REST article.

For ScanProsite, as it is a scanning tool, some of the resources are provided by the users (sequences or/and patterns); to minimize the number of required queries / simplify the system, the service doesn't fully follow aforementioned REST principles (that would be e.g. PUTing the user resources on the server first, then GETing the scan results). Instead users directly POST/GET all their data to get the scan results in the response (n.b. direct system; no ticket/job id: do increase connection time-out for complex queries).
Note: in the ScanProsite service, POST is not used to update data, but like GET, just to (pass input data and parameters and) read scan result data.

REST usage for ScanProsite

Make an HTTP GET or POST query to the service; retrieve scan output data (in XML or JSON) in the HTTP response content.

e.g. (GET) just query for: http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi?seq=ENTK_HUMAN&output=xml

Service url: http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi

Parameters:

GET or POST parameters (name, description):

seq Sequence(s) to be scanned: UniProtKB accessions e.g. P98073 or identifiers e.g. ENTK_HUMAN or PDB identifiers e.g. 4DGJ or sequences in FASTA format or UniProtKB/Swiss-Prot format.
Do not repeat parameter; multiple sequences can be specified by separating them with new lines (%0A in url).
sig Motif(s) to scan against: PROSITE accession e.g. PS50240 or identifier e.g. TRYPSIN_DOM or your own pattern e.g. P-x(2)-G-E-S-G(2)-[AS].
If not specified, all PROSITE motifs are used.
Do not repeat parameter; multiple motifs can be specified by separating them with new lines (%0A in url).

db Target protein database for scans of motifs against whole protein databases: 'sp' (UniProtKB/Swiss-Prot) or 'tr' (UniProtKB/TrEMBL) or 'pdb' (PDB).
Only works if 'seq' is not defined. Parameter can be repeated; one target database per 'db' parameter.
varsplic If true (defined, non empty, non zero): includes UniProtKB/Swiss-Prot splice variants.
Only works on scans against UniProtKB/Swiss-Prot.
lineage Any taxonomical term e.g. 'Homo sapiens', e.g. 'Fungi; Arthropoda' or corresponding NCBI TaxID e.g. 9606, e.g. '4751; 6656'
Separate multiple terms with a semicolon.
Only works on scans against UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
description Description (DE) filter: e.g. protease.
Only works on scans against UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
max_x Number of X characters in a scanned sequence that can be matched by a conserved position in a pattern.
Only works if 'sig' is defined, i.e. on scans of specific sequences/protein database(s) against specific motif(s).
Only works on scans against patterns.

output Output format: 'xml' or 'json' (or 'txt')
skip If true (defined, non empty, non zero): excludes motifs with a high probability of occurrence.
Default: on.
Only works if 'seq' is defined and 'sig' is not defined, i.e. on scans of specific sequence(s) against all PROSITE motifs.
lowscore If true (defined, non empty, non zero): shows matches with low level scores.
Default: off.
Only works with PROSITE profiles.
noprofile If true (defined, non empty, non zero): does not scan against profiles.
Only works if 'seq' is defined and 'sig' is not defined, i.e. on scans of specific sequence(s) against all PROSITE motifs.
minhits Minimal number of hits per matched sequence.
Only works if 'sig' and 'db' are defined, i.e. on scans of protein database(s) against specific motif(s).
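
The parameters above can be assembled into a query with the Python standard library. This is a minimal sketch, not an official client: the helper names build_scan_url and fetch are hypothetical, the 300-second time-out is an arbitrary choice, and the response is assumed to be UTF-8 text.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SERVICE_URL = "http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi"

def build_scan_url(seq=None, sig=None, output="xml", **options):
    """Build a GET URL; multiple sequences or motifs are joined by new lines."""
    params = {}
    if seq:
        params["seq"] = "\n".join(seq)   # encoded as %0A, parameter not repeated
    if sig:
        params["sig"] = "\n".join(sig)
    params["output"] = output
    params.update(options)               # e.g. skip=1, lowscore=1, noprofile=1
    return SERVICE_URL + "?" + urlencode(params)

def fetch(url, timeout=300):
    """Run the query; there is no job id, so allow a generous time-out."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")   # assumed encoding

url = build_scan_url(seq=["P98073", "ENTK_HUMAN"], output="json")
print(url)
# result = fetch(url)  # uncomment to send the request to the service
```

For long inputs such as FASTA sequences, sending the same parameters in a POST body is likely preferable to a very long GET URL.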