
ScanProsite - user manual

ScanProsite scans proteins for matches against the PROSITE collection of motifs as well as against user-defined patterns.

At the beginning the user has to choose between three options:

Option 1 - Submit PROTEIN sequences to scan them against the PROSITE collection of motifs.
Option 2 - Submit MOTIFS to scan them against a PROTEIN sequence database.
Option 3 - Submit PROTEIN sequences and MOTIFS to scan them against each other.


Quick Scan

The Quick Scan mode of ScanProsite, available from the PROSITE homepage, is a simplified version of 'Option 1 - Submit PROTEIN sequences to scan them against the PROSITE collection of motifs'.
Enter or paste up to 10 protein sequences in the textarea.
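The accepted input is: UniProtKB accessions (e.g. P98073) or identifiers (e.g. ENTK_HUMAN)*, PDB identifiers (e.g. 4DGJ), or sequences in FASTA format.
*All UniProtKB/Swiss-Prot accessions/identifiers are accepted; UniProtKB/TrEMBL accessions/identifiers are accepted only for entries belonging to reference proteomes.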

Your input sequences are scanned against all PROSITE motifs. Motifs with a high probability of occurrence are excluded if the checkbox below the textarea is checked and included if it is unchecked (see the 'Exclude motifs with a high probability of occurrence' option).
Once the scan has been carried out, the results are displayed in the 'Graphical view' output format.


Main operations

Submit PROTEIN sequences

You can either enter or paste protein sequences in the textarea or submit a protein database.
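If you choose to enter sequences in the textarea, the accepted input is: UniProtKB accessions (e.g. P98073) or identifiers (e.g. ENTK_HUMAN)*, PDB identifiers (e.g. 4DGJ), or sequences in FASTA format.
*All UniProtKB/Swiss-Prot accessions/identifiers are accepted; UniProtKB/TrEMBL accessions/identifiers are accepted only for entries belonging to reference proteomes.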

If you're in 'Option 1' (scan against all PROSITE motifs), the maximum number of sequences that you can submit is 10. If you're in 'Option 3' (scan against specified motifs), the maximum number of sequences that you can enter is 1'000 if you submit 1 motif and 50 if you submit a combination of motifs.

If you want the scan to be carried out against your own sequence database, either enter a database code or submit a file in FASTA format (max. 16MB). Once your file is uploaded, you will receive a code that you can use for repeated scans on the database you've just submitted. The database will remain on our server for a period of 1 month.

Submit MOTIFS (Enter a MOTIF or a combination of MOTIFS)

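Enter a motif or a combination of motifs in the textarea. The supported input is: a PROSITE accession (e.g. PS50240) or identifier (e.g. TRYPSIN_DOM), or your own pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]).
You can then modify a couple of the default scanning parameters (scanning options).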
Pattern syntax
Note:

Extended syntax for ScanProsite:
You can use the program PRATT to generate your own pattern.

Pattern                  Explanation
[AC]-x-V-x(4)-{ED}       [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
<A-x-[ST](2)-x(0,1)-V    Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val at the N-terminal of the sequence
<{C}*>                   No Cys from the N-terminal to the C-terminal, i.e. all sequences that do not contain any Cys
IIRIFHLRNI               Ile-Ile-Arg-Ile-Phe-His-Leu-Arg-Asn-Ile
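
The pattern elements in the table above map fairly directly onto regular expressions. The following Python sketch illustrates that mapping for the syntax shown in the examples only (residues, x, [allowed], {forbidden}, repeat counts, '*', and the '<' and '>' anchors). The function name `prosite_to_regex` is hypothetical, and this is not the ScanProsite matching engine.

```python
import re

# One pattern element: a residue, x (any), an [allowed set] or a {forbidden set},
# optionally followed by a repeat count (n), a range (n,m) or '*'.
_ELEMENT = re.compile(
    r"(?P<core>[A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<min>\d+)(?:,(?P<max>\d+))?\)|(?P<star>\*))?"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    body = pattern.strip().rstrip(".")   # PA lines in PROSITE entries end with '.'
    at_start = body.startswith("<")      # '<' anchors the match at the N-terminal
    at_end = body.endswith(">")          # '>' anchors the match at the C-terminal
    body = body.lstrip("<").rstrip(">").replace("-", "")

    parts = []
    pos = 0
    while pos < len(body):
        m = _ELEMENT.match(body, pos)
        if m is None:
            raise ValueError(f"unsupported pattern element: {body[pos:]!r}")
        core = m.group("core")
        if core == "x":
            piece = "."                       # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"   # any amino acid except these
        else:
            piece = core                      # single residue or [allowed set]
        if m.group("star"):
            piece += "*"
        elif m.group("max"):
            piece += "{%s,%s}" % (m.group("min"), m.group("max"))
        elif m.group("min"):
            piece += "{%s}" % m.group("min")
        parts.append(piece)
        pos = m.end()

    return ("^" if at_start else "") + "".join(parts) + ("$" if at_end else "")


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V"))
```

The resulting string can be passed to `re.search`; for instance, the translation of `<{C}*>` only matches sequences that contain no Cys.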


Combination of MOTIFS

You can submit multiple motifs at the same time. The upper limit is 8 motifs for a scan against a protein database (Option 2 - Step 1) and 16 for a scan against specified sequences (Option 3 - Step 2).
You can use logical operators: 'and', 'or' and 'not' with parentheses if needed.

Examples of logical expressions
PS50240 PS50068
PS50240 and PS50068
PS50240 and P-x(2)-G-E-S-G(2)-[AS]
PS50240 and not PS50068
PS50240 and ( PS00134 or PS00135 )
PS50240 and not ( PS00134 or PS00135 )
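
To make the meaning of such expressions concrete, here is a minimal Python sketch that decides whether one sequence satisfies an expression, given the set of PROSITE accessions that hit it. It is an illustration only, not ScanProsite's implementation: the function name `expression_matches` is hypothetical, and it handles accessions with explicit 'and', 'or', 'not' and parentheses, but neither user-defined patterns nor space-separated lists without an operator.

```python
import re

_ACCESSION = re.compile(r"PS\d{5}")
_OPERATORS = {"and", "or", "not", "(", ")"}


def expression_matches(expression, hit_accessions):
    """Return True if the motifs hitting a sequence satisfy the expression."""
    tokens = expression.replace("(", " ( ").replace(")", " ) ").split()
    translated = []
    for token in tokens:
        if token in _OPERATORS:
            translated.append(token)
        elif _ACCESSION.fullmatch(token):
            # Replace each accession by whether it hit the sequence.
            translated.append(str(token in hit_accessions))
        else:
            raise ValueError(f"unsupported token: {token!r}")
    # Only True/False, the three operators and parentheses reach eval().
    return bool(eval(" ".join(translated), {"__builtins__": {}}))


hits = {"PS50240", "PS00134"}   # made-up set of motifs found on one sequence
print(expression_matches("PS50240 and not PS50068", hits))
print(expression_matches("PS50240 and not ( PS00134 or PS00135 )", hits))
```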


Select a PROTEIN sequence database

Select between these PROTEIN sequence databases:
*For UniProtKB/TrEMBL, only entries belonging to reference proteomes are included in the set.

If you want the scan to be carried out against your own sequence database, either enter a database code or submit a file in FASTA format (max. 16MB). Once your file is uploaded, you will receive a code that you can use for repeated scans on the database you've just submitted. The database will remain on our server for a period of 1 month.

Randomized UniProtKB/Swiss-Prot

It is often useful to search a pattern against a random database in order to evaluate its specificity. Ideally that database is not completely random but comparable to the databases to be scanned in terms of amino acid frequency and local compositional bias. ScanProsite can randomize scanned databases on the fly, using one of two methods:
The reverse sequences method is generally recommended, but it is not suitable for patterns that are strongly enriched in one amino acid, e.g. C-C-C-[LIV], or for palindromic ones, e.g. M-L-L-M.
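
The idea behind the reverse sequences method, and the reason palindromic patterns defeat it, can be illustrated with a small Python sketch. The toy sequences and helper names (`reverse_database`, `count_hits`) are made up for illustration; this is not how ScanProsite implements randomization.

```python
import re


def reverse_database(sequences):
    """Return a decoy set in which every sequence is reversed.

    Reversal keeps the amino acid composition of each sequence unchanged.
    """
    return {name: seq[::-1] for name, seq in sequences.items()}


def count_hits(regex, sequences):
    """Count the sequences that contain at least one match."""
    return sum(1 for seq in sequences.values() if re.search(regex, seq))


# Toy sequences, invented for this example.
toy_db = {
    "seq1": "MKTAYIAKQRMLLMGS",
    "seq2": "GSHMPEPTIDEK",
    "seq3": "AAWGKHEELRST",
}
decoy_db = reverse_database(toy_db)

directional = "W.K.E"   # regex form of W-x-K-x-E
palindromic = "MLLM"    # regex form of M-L-L-M, reads the same in both directions

print(count_hits(directional, toy_db), count_hits(directional, decoy_db))
print(count_hits(palindromic, toy_db), count_hits(palindromic, decoy_db))
```

The directional pattern loses its hit in the reversed set, whereas the palindromic pattern matches the real and the reversed set equally often, so the reversed set tells you nothing about its specificity.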

Note: Scanning a randomized sequence database only makes sense against patterns.

Filters

Each filter is listed with its usage and the databases it applies to.

length >= than
  Usage: Specifies a minimal length. Must be a positive integer or zero, e.g. 150.
  Database application: UniProtKB (Swiss-Prot and TrEMBL) and PDB
length <= than
  Usage: Specifies a maximal length. Must be a positive integer, e.g. 500.
  Database application: UniProtKB (Swiss-Prot and TrEMBL) and PDB
Taxonomy
  Usage: Enter a taxonomic term (e.g. 'Homo sapiens' or 'Fungi; Arthropoda') or the corresponding NCBI TaxID (e.g. 9606 or '4751; 6656'), which you can obtain from the NCBI or the UniProt taxonomy databases. Multiple terms must be separated by a semicolon.
  Database application: UniProtKB (Swiss-Prot and TrEMBL)


Scanning options

Each option is listed with its description and default value.

Exclude motifs with a high probability of occurrence
  Description: Does not scan against motifs with a high probability of occurrence.
  Default value: On
Exclude profiles
  Description: Does not scan against profiles. => Scans only against patterns.
  Default value: Off
Run the scan at high sensitivity
  Description: Runs the scan at a low level (shows weak matches). Concerns profiles only.
  Default value: Off
Minimal number of hits per matched sequence
  Description: Defines how many hits there must be in a sequence for the matched sequence to be displayed.
  Default value: 1
Match mode
  Description: Defines the match mode for pattern matching. Concerns patterns only.
  Default value: Greedy, overlaps, no includes


Exclude motifs with a high probability of occurrence

Description: Does not scan against patterns with a high probability of occurrence. Concerns patterns only.
Default value: On

Motifs with a high probability of occurrence are in most cases patterns that are found in many protein sequences. Some of them describe for example commonly found post-translational modifications and some others compositionally biased regions.
While it is generally useful to note their presence, some programs may, in some cases, want to ignore those entries. For this purpose these entries are flagged with the following qualifier in their CC lines: '/SKIP-FLAG=TRUE;', as in the following entry:

        ID   ASN_GLYCOSYLATION; PATTERN.
        AC   PS00001;
        DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); APR-1990 (INFO UPDATE).
        DE   N-glycosylation site.
        PA   N-{P}-[ST]-{P}.
        CC   /SITE=1,carbohydrate;
        CC   /SKIP-FLAG=TRUE;
        CC   /VERSION=1;
        PR   PRU00498;
        DO   PDOC00001;
        //

Matches by frequently occurring motifs are displayed under 'hits by patterns/profiles with a high probability of occurrence' if the output format is 'Graphical view'. If the output format is 'Simple view' or 'Text', each motif accession number is tagged with '[occurs frequently]'.

Exclude profiles

Description: Does not scan against profiles. => Scans only against patterns.
Default value: Off


Run the scan at high sensitivity

Description: Runs the scan at a low level (shows weak matches). Concerns profiles only.
Default value: Off

PROSITE profiles normally use two cut-off levels: a reliable cut-off (LEVEL=0) and a low confidence cut-off (LEVEL=-1).

This option runs the scan at the low confidence cut-off (LEVEL=-1) and hence shows matches that are below the reliable cut-off (LEVEL=0).
Weak hits are tagged with '[warning: hit with a low confidence level (-1)]' if the output format is 'Graphical view' and '[low confidence]' if the output format is 'Simple view' or 'Text'.

Minimal number of hits per matched sequence

Description: Defines how many hits there must be in a sequence for the matched sequence to be displayed.
Default value: 1

Match mode


Three parameters let you fine-tune the behaviour of the pattern-matching engine:
parameter   action
greed       extends variable-length pattern elements as far as possible
overlap     allows partially overlapping matches
include     allows matches included within one another (implies overlap)

The default behaviour is greedy and allows overlaps but not included matches. This means that, of two overlapping matches, one is rejected if it is entirely contained within the other.
For example, consider the sequence "ABACADAEAFA" and the simple pattern "A-x(1,3)-A". The six possible combinations of the switches produce the following results:
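
Independently of the switch settings, the candidate matches that the engine chooses among can be enumerated with a short Python sketch. It uses Python's `re` module rather than the ScanProsite engine, the helper names are hypothetical, and it makes no claim about which subset ScanProsite reports for a given combination of switches.

```python
import re


def candidate_matches(regex, sequence):
    """Return every (start, end, text) span of the sequence that fully matches."""
    compiled = re.compile(regex)
    spans = []
    for start in range(len(sequence)):
        for end in range(start + 1, len(sequence) + 1):
            if compiled.fullmatch(sequence, start, end):
                spans.append((start, end, sequence[start:end]))
    return spans


def drop_included(spans):
    """Keep only spans that are not entirely contained within another span."""
    return [
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    ]


# A-x(1,3)-A written as a regular expression.
spans = candidate_matches(r"A.{1,3}A", "ABACADAEAFA")
print([text for _, _, text in spans])
print([text for _, _, text in drop_included(spans)])
```

The first list contains all nine substrings that satisfy the pattern; the second keeps only the four that are not contained within a longer candidate, which shows what "included" means for this sequence.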

Output formats

Graphical view

HTML view with a graphical representation of hits on proteins (as downloadable images) and prediction (for certain profiles) of features inside matches.




This Web tool displays for each hit within a protein sequence: the hit sequence, the score (for hits against a profile), the PROSITE description and link. In addition, if predicted, biological features associated with each matched sequence are also indicated.
Results are separated into different kinds of hits: hits by 'profiles', 'profiles with a high probability of occurrence', 'patterns', 'patterns with a high probability of occurrence' or 'user-defined patterns'. Inside each of these categories, hits by protein are sorted by their N-terminal position, but multiple hits against a similar motif are grouped together.
In addition, for each matched protein, a graphical view in the form of a downloadable PNG (Portable Network Graphics) image represents all its matches (of the aforementioned type) and detected features. Profile hits are represented as colored shapes with their PROSITE name; pattern hits are shown (separated) as thin colored bars without text.
If a match overlaps the previous one, it is shown on a different line. However, if the overlap is smaller than 10% of the match size, the match is shown on the same line, with its overlapping start truncated and replaced by a vertical red bar (indicating that there is a small overlap).

Biological features:
For certain profiles, additional biologically meaningful information about residues inside matches is defined. This additional information comes from the mapping of biologically meaningful residues to PROSITE profiles. It is used to make functional/structural predictions of profile matches more accurate (profiles show enhanced sensitivity over patterns but, because of their relaxed stringency, lose functional/structural discriminative power).
If certain conditions expected for the functional and/or structural properties associated with the domain are fulfilled, the properties are shown as 'Predicted features'. For each feature, the UniProtKB feature key, the position/range, the feature description (if any), and the condition that triggered the detection are shown.
A condition can be a specific amino acid inside the hit, a group of sub-conditions that must all be true for the group condition to be true, a case between different sub-conditions/groups, etc.
Features associated with conditions that were not fulfilled are shown as 'Absent features' in the same way as predicted ones, except that the condition here shows why the feature has not been detected (condition/case not true and/or incomplete group).
On the graphical view, features are shown on top of hits as bridges, horizontal bars or vertical pins, depending on their type.


Graphical view legend


Individual view:
For a scan of more than one sequence against all PROSITE motifs (Option 1), you can click on 'individual view' next to the graphical display so as to see only hits against the protein sequence in question.

View all PROSITE motifs hits on sequence:
For a scan of specific sequences against specific motifs (Option 3), you can click on 'View all PROSITE motifs hits on sequence' in order to see all PROSITE motif matches against the protein in question (except for the ones with a high probability of occurrence, and at a regular level of sensitivity for profile matches).

Match/sequence highlighting:
When hits for only one protein are shown, and if you have a Mozilla-based web browser (Mozilla, FireBird/Fox, Netscape 7), feature residues are highlighted (green for predicted features, gray for absent features) on both the match and the full protein sequence (if shown) when you move your mouse cursor over a feature line. In addition, if the full sequence of the protein is shown (if you click on 'Individual view' or 'View all PROSITE motifs hits on sequence', or if you submitted only one protein), the match region in the protein sequence is highlighted in yellow when you move your mouse cursor over that match in the graphical view or the text view.
Highlights are persistent as long as you don't move your cursor over another match/feature (note that left/right margins are immune to cursor moves).

Simple view

Simple HTML view of results without graphical representation of hits and feature prediction.

Text

Text-only view (without any HTML link).

FASTA

Text-only view in FASTA format. Each hit is shown as a FASTA sequence whose header/name is:
[the matched protein]/[hit start]-[hit stop]/[the matching PROSITE motif]/the score (only for profiles)/the confidence level (if any).
Note: If 'Retrieve complete sequence' is selected, the complete protein sequence replaces the matched sequence and only one hit per matched sequence is represented.

Table

Text view containing for each hit on a sequence:
[the matched protein] [hit start] [hit stop] [the matching PROSITE motif] [the score (only for profiles)] [the confidence level (if any)] [the matched region]
Note: If 'Retrieve complete sequence' is selected, the complete protein sequence replaces the matched sequence and only one hit per matched sequence is represented.

Match list

List of matches (UniProtKB accessions if you submitted UniProtKB accessions or identifiers, PDB identifiers if you submitted PDB identifiers, first space delimited word of the FASTA header if you submitted FASTA sequences).

Miniprofiles

PROSITE pattern hits are validated by automatically generated 'miniprofiles' that assign a status to pattern matches.

Most PROSITE patterns have an associated miniprofile. Miniprofiles are stored in evaluator.dat and their accession number (AC) is the same as the pattern from which they originate except for the replacement of 'PS' by 'MP'. Example: the miniprofile for 'PS00134' is 'MP00134'.
When there's a hit by a given pattern, the sequence is scanned against the pattern's associated miniprofile: if the miniprofile also matches the region matched by the pattern, credit is added to the relevance of the pattern's match.
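
The accession rule described above (replace 'PS' by 'MP') amounts to a one-line string operation. The following Python sketch shows it; the function name `miniprofile_accession` is hypothetical and is not part of any PROSITE tool.

```python
def miniprofile_accession(pattern_accession: str) -> str:
    """Derive the miniprofile accession from a PROSITE pattern accession."""
    if not pattern_accession.startswith("PS"):
        raise ValueError(f"not a PROSITE accession: {pattern_accession!r}")
    # Same number, with the 'PS' prefix replaced by 'MP'.
    return "MP" + pattern_accession[2:]


print(miniprofile_accession("PS00134"))
```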

The table below shows, for each output format, what is displayed when the pattern's hit is also matched or respectively not matched by the pattern's associated miniprofile.

Output format     matched by miniprofile     not matched by miniprofile
Graphical view    confidence level: (0)      confidence level: (-1)
Simple view       confidence level: (0)      confidence level: (-1)
Text view         confidence level: (0)      confidence level: (-1)
FASTA             (0)                        (-1)
Table             (0)                        (-1)
Matchlist         /                          /

For more information on miniprofiles, please consult "The 20 years of PROSITE".


Output options

Maximum number of displayed matches

The maximum number of distinct matched proteins that can be shown in the output.
This number is set to 10'000 by default. If you choose 100'000, the results are not shown in your web browser, as a safeguard against sending too much data to your browser; you then have to submit an email address so that the results can be sent to you by email.

Retrieve complete sequences

Adds the complete protein sequence to the information displayed for each matched protein.
This option limits the choices of output formats to 'Simple view', 'Text', 'FASTA' and 'Table'; it also limits the 'Maximum number of displayed matches' to 1'000.
Note: For the output formats 'FASTA' and 'Table', the complete protein sequence replaces the matched sequence and only one hit per matched sequence is represented.

Email and job title

Returning results by email limits the choice of output format to 'Text', 'FASTA', 'Table' and 'Matchlist'.
If the chosen 'Maximum number of displayed matches' is 100'000, results have to be sent by email and a valid email address is then required. In other situations ScanProsite ignores what you've entered in the email textbox unless it is a valid email address.

Job title: If you've entered a valid email address and you fill in this field, the 'Job title' will appear in the subject of the email you receive for that job.


Programmatic access: REST web service

REST introduction

REST: REpresentational State Transfer

REST originally referred to a collection of architectural principles, but the acronym is now often used to describe any simple web-based interface for programmatic access that uses XML (or YAML, JSON, plain text) over HTTP without the extra abstractions of MEP-based approaches like the web services SOAP protocol.
The 'naked' data, without any envelope is retrieved as the content of the HTTP query response.
The options for the operation to be performed are part of the HTTP query parameters, the target URL representing the resource being accessed.
The REST philosophy also implies using HTTP 'verbs' (PUT, GET, POST, DELETE) to perform distinct operations (respectively: Create, Read, Update, Delete) on the target resources (url).
For more information on REST, consult the Wikipedia REST article.

For ScanProsite, as it is a scanning tool, some of the resources are provided by the users (sequences or/and patterns); to minimize the number of required queries / simplify the system, the service doesn't fully follow aforementioned REST principles (that would be e.g. PUTing the user resources on the server first, then GETing the scan results). Instead users directly POST/GET all their data to get the scan results in the response (n.b. direct system; no ticket/job id: do increase connection time-out for complex queries).
Note: in the ScanProsite service, POST is not used to update data, but like GET, just to (pass input data and parameters and) read scan result data.

REST usage for ScanProsite

Make an HTTP GET or POST query to the service; retrieve scan output data (in XML or JSON) in the HTTP response content.

e.g. (GET) just query for: https://prosite.expasy.org/cgi-bin/prosite/scanprosite/PSScan.cgi?seq=ENTK_HUMAN&output=xml

Service url: https://prosite.expasy.org/cgi-bin/prosite/scanprosite/PSScan.cgi
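
As an illustration of the GET usage described above, the following Python sketch builds the query URL from parameters and, optionally, fetches the result. The helper names `build_scan_url` and `run_scan`, the 120-second timeout and the UTF-8 decoding are assumptions made for this example; only the service URL and the 'seq' and 'output' parameters come from this manual.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SERVICE_URL = "https://prosite.expasy.org/cgi-bin/prosite/scanprosite/PSScan.cgi"


def build_scan_url(**params):
    """Return the GET URL for a scan with the given query parameters."""
    return SERVICE_URL + "?" + urlencode(params)


def run_scan(timeout=120, **params):
    """Send the GET query and return the response body as text.

    The service answers directly (no job id), so a generous timeout is used;
    the value and the UTF-8 decoding are assumptions.
    """
    with urlopen(build_scan_url(**params), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = build_scan_url(seq="ENTK_HUMAN", output="xml")
print(url)
```

Calling `run_scan(seq="ENTK_HUMAN", output="xml")` would perform the actual query; it is not called here so that the example runs without network access.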

Parameters:

GET or POST parameters (name, description):

Each parameter is listed by name (with the corresponding field in the ScanProsite form), followed by its description:
seq (form field: Submit PROTEIN sequences)
Sequence(s) to be scanned: UniProtKB accessions (e.g. P98073) or identifiers (e.g. ENTK_HUMAN)*, PDB identifiers (e.g. 4DGJ), or sequences in FASTA format.
Do not repeat the parameter; multiple sequences can be specified by separating them with new lines (%0A in the URL).
'seq' takes precedence over 'db', i.e. if both are specified, 'db' is ignored.

*For UniProtKB/TrEMBL accessions and identifiers, only those of entries belonging to reference proteomes are accepted.

Default: seq="" (empty)

Examples:
db (form field: Select a PROTEIN sequence database)
Target protein database for scans of motifs against whole protein databases: 'sp' (UniProtKB/Swiss-Prot), 'tr' (UniProtKB/TrEMBL reference proteome sequences) or 'pdb' (PDB).
'seq' takes precedence over 'db', i.e. if both are specified, 'db' is ignored.

Default: db=sp (if neither 'seq' nor 'db' is specified, the scan is carried out against UniProtKB/Swiss-Prot)

Examples:
varsplic (form field: Include isoforms)
If on (varsplic=1): includes UniProtKB/Swiss-Prot splice variants.
Only relevant on scans against UniProtKB/Swiss-Prot.

Default: varsplic=0 (off, UniProtKB/Swiss-Prot splice variants are not scanned)

Examples:
sig (form field: Enter a MOTIF or a combination of MOTIFS)
Motif(s) to scan against: PROSITE accession (e.g. PS50240) or identifier (e.g. TRYPSIN_DOM), or your own pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]). Combinations of motifs can also be used.
If not specified, all PROSITE motifs are used.
Do not repeat parameter; multiple motifs can be specified by separating them with new lines (%0A in url).

Default: sig="" (empty)

Examples:
lineage (form field: Filters, on taxonomy)
Any taxonomic term (e.g. 'Homo sapiens' or 'Fungi%3BArthropoda') or the corresponding NCBI TaxID (e.g. 9606 or '4751%3B6656').
Separate multiple terms with '%3B' (a URL-encoded semicolon).
Only works on scans against UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

Default: lineage="" (empty)

Examples:
max_x
Number of X characters in a scanned sequence that can be matched by a conserved position in a pattern.
Only relevant if 'sig' is defined and is a pattern.

Default: max_x=0 (no X character in a scanned sequence can be matched by a conserved position in a pattern)
output (form field: Output format)
One of: txt, xml, json, nice, html, plain, fasta, tabular, list

Default: output=plain

Examples:
skip (form field: Exclude motifs with a high probability of occurrence from the scan)
If on (defined, non empty, non zero): excludes motifs with a high probability of occurrence.
Only relevant if 'seq' is defined and 'sig' is not defined, i.e. on scans of specific sequence(s) against all PROSITE motifs.

Default: skip=1 (on, PROSITE motifs with a high probability of occurrence are excluded from the scan)

Examples:
lowscore (form field: Run the scan at a high sensitivity (show weak matches for profiles))
If on (lowscore=1): shows matches with low level scores.
Only relevant for PROSITE profiles.

Default: lowscore=0 (off, PROSITE profiles are scanned with cut-off of level 0)

Examples:
noprofile (form field: Exclude profiles from the scan)
If on (noprofile=1): does not scan against profiles.
Only works if 'seq' is defined and 'sig' is not defined, i.e. on scans of specific sequence(s) against all PROSITE motifs.

Default: noprofile=0 (off, PROSITE profiles are included in the scan)

Examples:
minhits (form field: Minimal number of hits per matched sequence)
Minimal number of hits per matched sequence.
Only works if 'sig' and 'db' are defined, i.e. on scans of protein database(s) against specific motif(s).

Default: minhits=1 (Scanned sequences with one match or more are reported in the results)

Examples:
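
The parameters above can be combined in a single query. As a minimal sketch, the following Python code prepares a POST request in which multiple sequences or motifs are joined with new lines and multiple taxonomy terms with semicolons, leaving the URL-encoding (%0A, %3B) to `urlencode`. The helper name `build_post_request` and its keyword arguments are hypothetical; that the service accepts a standard form-encoded POST body is an assumption based on the statement that parameters may be sent by GET or POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

SERVICE_URL = "https://prosite.expasy.org/cgi-bin/prosite/scanprosite/PSScan.cgi"


def build_post_request(sequences=(), motifs=(), lineage=(), **options):
    """Prepare (but do not send) a POST request for the scan service."""
    params = dict(options)   # e.g. db, output, minhits, skip, lowscore
    if sequences:
        # 'seq' must not be repeated: join entries with new lines.
        params["seq"] = "\n".join(sequences)
    if motifs:
        # Same rule for 'sig'.
        params["sig"] = "\n".join(motifs)
    if lineage:
        # Multiple taxonomy terms are separated by a semicolon.
        params["lineage"] = ";".join(str(term) for term in lineage)
    body = urlencode(params).encode("ascii")
    # Supplying a body makes urllib issue a POST instead of a GET.
    return Request(SERVICE_URL, data=body)


request = build_post_request(
    motifs=["PS50240", "PS50068"],
    lineage=["4751", "6656"],
    db="sp",
    output="json",
)
print(request.get_method(), request.data.decode("ascii"))
```

The prepared request can be sent with `urllib.request.urlopen(request, timeout=...)`; a long timeout is advisable for complex queries because the service answers directly rather than through a job id.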