PrattWWW: a WWW Interface for the Pratt pattern discovery program

Kristian Sturzrehm and Inge Jonassen

Department of Informatics
University of Bergen
Norway

1. Introduction

This report describes the result of a short project in the Autumn of 1996 at the bioinformatics group, Dept. of Informatics, University of Bergen, Norway. The project was funded by a grant from the Norwegian Research Council (grant 111032/410 Bioinformatics).

Pratt (Jonassen et al., 1995; Jonassen, 1996) is a tool for discovering patterns in sequences. The program takes as input a set of $n$ unaligned sequences and a set of parameters that constrain the class of patterns that can be discovered, and it outputs patterns matching at least some minimum number $k$ of the $n$ input sequences, where $k$ is chosen by the user. The patterns are ranked by their information content, a measure of the strength of a pattern. Pratt is able to discover patterns of the type used in the PROSITE database. Pratt has previously been made available to the research community as source code (ANSI C) via anonymous ftp. The motivation for this project was:

To let potential users try out Pratt over the WWW, making it easier for new users to get started with the tool.
To develop a more user-friendly interface to the program.

We consider the project successful in that PrattWWW (the WWW-based interface to the Pratt program developed here) seems to have the required functionality. PrattWWW lets the user supply sequences and parameters to Pratt through a form-based WWW page. PrattWWW uses the parameters to start the Pratt program, and when Pratt finishes it presents a new WWW page with the results. The presentation includes the plain-text output as produced by Pratt and a graphical presentation given by a Java applet, PatSeq. PatSeq is an interactive graphical tool for visualisation of patterns in a set of sequences. It shows the patterns and the location of each pattern in each of the sequences. One difficulty is that different patterns may have overlapping matches in a sequence; this is handled by drawing overlapping patterns vertically offset from each other. We believe that PatSeq makes it much easier to interpret the output from the Pratt program, and we plan to develop it further into a powerful and general tool for visualisation of patterns in sequences. One current limitation of PatSeq is that it cannot handle more than 20 patterns. We therefore limit the number of patterns that PrattWWW can handle to 20.

The structure of this report is as follows. In Section 2 we describe the logical structure of PrattWWW, and in Section 3 we describe the implementation. The Appendix gives more low-level technical information about the actual programs.

2. Logical structure

The figure below shows the main components of the PrattWWW system. The WWW pages are shown at the top, the programs (Pratt, the CGI script, and PatSeq) below them, and the information flow is indicated by arrows.

[Figure: main components of the PrattWWW system and the information flow between them; image not available]

2.1. Input (HTML page)

The user sets the attributes and parameters in a form. He/she should have some knowledge of the biological background (sequences, patterns), and can choose values for the following parameters:

Format (Fasta / Swissprot) - tells Pratt which format the sequences are in.
Minimal percentage (100 .. 50) - tells Pratt in what percentage of the sequences a pattern must occur.
Parameter Refinement (on / off) - tells Pratt whether or not to refine the results.
Parameter "Minimum Information content" (10 / 20) - tells Pratt how "good" a pattern has to be.
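To make the role of the minimal percentage concrete, the following Java sketch translates a PROSITE-style pattern into a regular expression, counts how many sequences contain a match, and derives a required count $k$ from a percentage. It is an illustration only and not part of Pratt or PrattWWW: the class and method names are hypothetical, the rounding rule for $k$ is an assumption, and only the basic notation elements (x, [..], {..}, repetition counts) are handled.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; not taken from Pratt or PrattWWW.
final class PatternSupport {

    // Translate PROSITE-style notation, e.g. "C-x(2,4)-[DE]-{P}", into a Java regex.
    // Anchors ('<', '>') and other special cases are deliberately not handled.
    static String toRegex(String prosite) {
        return prosite
                .replace("-", "")                      // element separators are not needed in a regex
                .replace("{", "[^").replace("}", "]")  // {P}    -> [^P]   (any residue except P)
                .replace("(", "{").replace(")", "}")   // x(2,4) -> x{2,4} (repetition range)
                .replace("x", ".");                    // x      -> .      (any residue)
    }

    // Number of sequences that contain at least one match of the pattern.
    static int countMatching(String prosite, List<String> sequences) {
        Pattern p = Pattern.compile(toRegex(prosite));
        int count = 0;
        for (String s : sequences) {
            if (p.matcher(s).find()) {
                count++;
            }
        }
        return count;
    }

    // Assumed conversion from the minimal percentage to k: round up.
    static int minSequences(int percentage, int n) {
        return (percentage * n + 99) / 100;
    }

    public static void main(String[] args) {
        List<String> seqs = List.of("AACAADKAA", "CAAEP", "GGCWWWWEG", "MMMM");
        int matched = countMatching("C-x(2,4)-[DE]-{P}", seqs);
        int k = minSequences(50, seqs.size());
        System.out.println(matched + " of " + seqs.size() + " sequences match; required k = " + k);
    }
}
```

In this toy input the pattern occurs in two of the four sequences, so it would be reported at a minimal percentage of 50 but not at 100.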

In the extended version, the user can set all Pratt parameters (Jonassen et al., 1995; Jonassen, 1996). The form fields for this exist, but the processing part does not handle them yet. Some parameters require several values to be passed on, and this remains to be programmed.

The user must also specify where Pratt should get the sequence data from. There are 3 possibilities:

Pratt takes the sequences from a file stored on the server (this option is intended for local users).
The user pastes or types sequences into a textarea field, and Pratt uses this data as its sequence input file.
The user supplies (using method 1 or 2) a Pratt output file obtained earlier, and PrattWWW visualises the results in this file using the PatSeq applet.

We also intend to let the user submit a file directly from his/her own computer (using file uploading), but this has not been implemented yet. File uploading is a new feature in HTML3; it will be added to PrattWWW once it is well documented and working. When the user has chosen all the parameters, he/she can press the "go" button, and the CGI script takes over control.

2.2. Processing

The script reads all parameter information from the Web page. If necessary, it stores the pasted sequences in a file. If it is supposed to run Pratt, it constructs a Pratt command line, saves it in a prattcommand file, and then executes this file. The next step, common to all alternatives, is to extract the data from the Pratt output file, interpret it, and store all the information needed for the presentation in arrays.

2.3. Output

The output is split into two parts: the textual display (generated by the CGI script, which is written in Perl), and the graphical presentation (generated by the Java applet PatSeq).

2.3.1. Textual Presentation

The first step of the output part is to create and print the head and the title of the output WWW page. The script then prepares the structure of the result page, which has three parts:

some general information and the graphical presentation of the result - main output
textual presentation of the result - original Pratt output
query form - possibility to send a new query

Finally, the script prints the result of the interpretation to the web browser. The general information consists of a few sentences describing the query. The original Pratt output is stored in a file on the server and is printed so that the results can be checked. With the query form, the user can modify the query or create a new one and start Pratt again.

2.3.2. Graphical Presentation - PatSeq

The CGI script hands over all relevant data about the result to the Java applet and calls it. From this information the applet creates an image with 3 parts.

The Patterns List - lists the patterns discovered by Pratt; each pattern is printed using PROSITE pattern notation, together with the status, length, shape and colour of its graphical representation. The status of a pattern is either ON or OFF, and a pattern is drawn in the sequence list only if its status is ON.
The Ruler - helps to identify the sequence position of each pattern in the sequences. A ruler is given above and below the list of sequences.
The Sequences - prints the name and the length of every sequence. For each sequence it draws a line whose length is proportional to the length of the sequence, and for each pattern occurring in the sequence it draws a box or an ellipse on top of the line at the appropriate position. The colour and shape of the pattern representation uniquely identify the pattern in the patterns list.
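The proportional drawing described for the sequences amounts to a linear mapping from sequence positions to pixels. A minimal Java sketch of such a mapping follows; it is not the PatSeq code, and the class name, the method names, the 1-based position convention and the scaling to the longest sequence are all assumptions made for illustration.

```java
// Illustrative sketch only; not taken from PatSeq.
final class SequenceScale {

    // Map a 1-based residue position to a horizontal pixel offset, assuming the
    // longest sequence in the set spans the full drawing width.
    static int toPixel(int position, int longestSequence, int widthPixels) {
        return (int) Math.round((position - 1) * (double) widthPixels / longestSequence);
    }

    // Pixel length of the line drawn for a sequence of the given length.
    static int lineLength(int sequenceLength, int longestSequence, int widthPixels) {
        return (int) Math.round(sequenceLength * (double) widthPixels / longestSequence);
    }

    // Returns {x, width} of the box for a pattern match that starts at 'start'
    // (1-based) and covers 'matchLength' residues; the box is at least 1 pixel wide.
    static int[] box(int start, int matchLength, int longestSequence, int widthPixels) {
        int x0 = toPixel(start, longestSequence, widthPixels);
        int x1 = toPixel(start + matchLength, longestSequence, widthPixels);
        return new int[] {x0, Math.max(1, x1 - x0)};
    }

    public static void main(String[] args) {
        int[] b = box(51, 10, 200, 400);
        System.out.println("line = " + lineLength(100, 200, 400) + " px, box x = " + b[0] + ", width = " + b[1]);
    }
}
```

Scaling every sequence against the longest one keeps the same position at the same horizontal offset in all sequences, which is what allows a single ruler to serve the whole list.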

The user can interact with the applet in several ways.

When the user

clicks on an icon in the patterns list, the status of this pattern is toggled. If it was previously ON, it becomes OFF and the pattern is no longer shown in the sequence list. In this way the user can choose which patterns he/she would like to see in the sequence list.
moves the mouse over a pattern in a sequence, he/she sees the pattern identification number and the name of the sequence in the status line.
clicks on a pattern in a sequence, he/she sees the name of the sequence and the pattern in PROSITE notation.

3. Implementation

The tool uses 3 different languages, which are connected: Perl generates the HTML code for the output page, and Perl controls the Java applet. The languages are:

HTML
Perl
Java

3.1. HTML (input and output page)

There are two types of HTML WWW pages:

static page (input page)
dynamically created page (output page)

The first type is an HTML page stored on the server. This is the input page (see Section 2.1). It contains a form with which the user builds the query for Pratt by setting its parameters. The form refers to the CGI script.

The second type is more complex. After the processing, the browser shows the result on an HTML page that is created during the processing. The basic structure of the result page is the same for all queries (see Section 2.3):

General information.
Graphical representation - PatSeq.
Original result.
Form for a new query.

The content of the individual parts depends on the query. The general information contains the running time of Pratt, the number of matched sequences and the file where the sequences are stored. The image with the result is presented by the PatSeq Java applet. The original result is preformatted plain ASCII text as generated by Pratt. The form is a copy of the input page. The user can also jump to the individual parts using local links.

3.2. Perl

The Perl script has a modular structure (see Appendix C1). The Main module contains the variable declarations. The script reads the parameter values from the input page (see Section 2.1). The options are stored in an array, and the other parameters in variables. This is done by the module 'ReadParse' (created by Steve E. Brenner). The next steps are executed only if necessary. Depending on the input information, that is, on where the sequences come from, the script creates the sequence file and the pratt_command file, which contain the sequence information and the Pratt command line, respectively. The script then executes the pratt_command file.
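The kind of form decoding performed at this stage can be illustrated with a short Java sketch. It is an analogue written for illustration, not the Perl 'ReadParse' module itself; it assumes the usual URL-encoded form format (name=value pairs joined by '&'), and the field names used in the example (format, minperc, seq) are hypothetical rather than the names used on the PrattWWW input page.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the 'ReadParse' module used by PrattWWW.
final class FormData {

    // Split a URL-encoded form string into decoded name/value pairs,
    // keeping the order in which the fields appear.
    static Map<String, String> parse(String query) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return fields;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String rawName = eq >= 0 ? pair.substring(0, eq) : pair;
            String rawValue = eq >= 0 ? pair.substring(eq + 1) : "";
            // URLDecoder turns '+' into a space and %XX escapes into characters.
            fields.put(URLDecoder.decode(rawName, StandardCharsets.UTF_8),
                       URLDecoder.decode(rawValue, StandardCharsets.UTF_8));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical field names; the pasted sequence arrives with an encoded '>' and newline.
        Map<String, String> f = parse("format=fasta&minperc=80&seq=%3Eseq1%0AACDEFG");
        System.out.println(f.get("format") + " / " + f.get("minperc"));
        System.out.println(f.get("seq"));
    }
}
```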

Afterwards it creates the structure of the output HTML page (see Section 2.3). Two of the parts (detailed result information, new form) are produced by calling submodules. To present the original result, the script prints the Pratt result file (output.dat) as preformatted text.

Next comes the first step of the processing part. The script loads the Pratt output file into a variable. The format of the data file is ASCII. The script splits the data into lines and stores them in an array. It then runs a 'while' loop over the lines. Each line is split into fields delimited by space characters, and the script looks for keywords among them. Each keyword marks a place where the script finds information. The script stores this information in a number of variables and arrays (see Appendix C2).
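The line-by-line keyword scan can be sketched in Java as follows. This is only an illustration of the technique: the actual script is written in Perl, the real Pratt output format is not reproduced here, and the keywords PATTERN and SEQUENCE as well as the layout of the sample text are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; keywords and input layout are hypothetical.
final class KeywordScan {

    // For every line that contains the keyword as a whitespace-delimited field,
    // collect the field that immediately follows it.
    static List<String> valuesAfter(String text, String keyword) {
        List<String> values = new ArrayList<>();
        for (String line : text.split("\\R")) {          // split the data into lines
            String[] fields = line.trim().split("\\s+"); // split the line into fields
            for (int i = 0; i + 1 < fields.length; i++) {
                if (fields[i].equals(keyword)) {
                    values.add(fields[i + 1]);
                    break;                               // at most one value per line
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String output = "PATTERN C-x(2)-D\nsome other line\nPATTERN A-[DE]\nSEQUENCE seq1 120\n";
        System.out.println(valuesAfter(output, "PATTERN"));
        System.out.println(valuesAfter(output, "SEQUENCE"));
    }
}
```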

When this part is finished, the individual output for the query is created (see Section 2.3). First the basic information about the query is printed. Then the script takes the information about the patterns and sequences and stores it in a set of new arrays (see Appendix C3). The information in these arrays is handed over to the Java applet PatSeq.

Next, the script calculates some variables that are important for the dimensioning of arrays in the Java applet and for the physical size of the individual image parts to be generated by the applet (see Appendix C4). Among other things, we need to calculate the overlaps of the patterns in each sequence in order to find out how much vertical space is needed for the graphical representation of each sequence. The script then calls the Java applet with all the necessary data; the parameter transfer is done with two loops, over the sequences and over the patterns.
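One simple way to determine the vertical space needed for a sequence is to place each match in the first row where it does not overlap a match already placed there, processing the matches in order of their start position. The Java sketch below illustrates this idea; it is an assumption about how such a calculation could be done, not the algorithm used by the PrattWWW script or by PatSeq, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the calculation used in PrattWWW.
final class RowLayout {

    // matches[i] = {start, end} (inclusive positions) of one pattern match in a sequence.
    // Returns, in input order, the row (0 = top) assigned to each match so that
    // matches sharing a row never overlap.
    static int[] assignRows(int[][] matches) {
        Integer[] order = new Integer[matches.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Process matches from left to right.
        Arrays.sort(order, (a, b) -> Integer.compare(matches[a][0], matches[b][0]));

        List<Integer> rowEnds = new ArrayList<>(); // last occupied position in each row
        int[] rows = new int[matches.length];
        for (int idx : order) {
            int row = 0;
            while (row < rowEnds.size() && rowEnds.get(row) >= matches[idx][0]) {
                row++; // this row is still occupied at the match start: push down
            }
            if (row == rowEnds.size()) {
                rowEnds.add(matches[idx][1]);
            } else {
                rowEnds.set(row, matches[idx][1]);
            }
            rows[idx] = row;
        }
        return rows;
    }

    // Number of rows, and hence the vertical space, needed for one sequence.
    static int rowsNeeded(int[][] matches) {
        int needed = 0;
        for (int row : assignRows(matches)) {
            needed = Math.max(needed, row + 1);
        }
        return needed;
    }

    public static void main(String[] args) {
        int[][] matches = {{10, 20}, {15, 25}, {30, 40}, {18, 22}};
        System.out.println(Arrays.toString(assignRows(matches)) + " rows needed: " + rowsNeeded(matches));
    }
}
```

Recomputing the layout with only the patterns that are currently shown gives the more compact view when patterns are hidden.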

3.3. Java

The Java applet draws 3 parts of the result:

An overview of the different patterns
Detailed images of all sequences
Rulers showing the dimensions of the sequences (top, bottom)

The code for the Java applet consists of 3 main parts:

the variables definition,
the variables initialisation, and
the main program execution. This can be further divided into three subparts;
- the module 'paint',
- the modules for the event handling,
- and help modules.

The variable definitions should be self-explanatory; all necessary variables are defined there, organised by purpose (Pattern, Ruler, Sequence).

In the variable initialisation, the Java applet reads all parameters handed over by the CGI script. The program executes loops controlled by running variables (see Appendix D1) to get the data about the different patterns and sequences (see Appendix D2). Flexible loops are necessary because the number of sequences and patterns can differ from query to query. The applet also needs some other variables (see Appendix D3). Initially the status of all patterns is set to ON, hence all patterns are shown in the sequences. The colours (10 different) are initialised in a separate module.
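Reading a variable number of items through named parameters can be sketched as follows. The parameter names (npatterns, pattern0, pattern1, ...) are hypothetical and not those used by PatSeq, and the lookup is passed in as a function so that the example runs without a browser; in an applet, the lookup would be the applet's getParameter method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch only; parameter names are hypothetical.
final class IndexedParams {

    // Read 'count' from the parameter named countName, then read the
    // parameters prefix0, prefix1, ... prefix(count-1) in order.
    static List<String> readIndexed(Function<String, String> getParameter,
                                    String countName, String prefix) {
        String rawCount = getParameter.apply(countName);
        int count = rawCount == null ? 0 : Integer.parseInt(rawCount.trim());
        List<String> values = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            values.add(getParameter.apply(prefix + i));
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for the parameters that the generated output page would supply.
        Map<String, String> params = new HashMap<>();
        params.put("npatterns", "2");
        params.put("pattern0", "C-x(2)-D");
        params.put("pattern1", "A-[DE]");
        System.out.println(readIndexed(params::get, "npatterns", "pattern"));
    }
}
```

Passing the count as a parameter of its own is what lets the same loop handle any number of patterns or sequences.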

One event handler changes the status of individual patterns and calls a helper module to recalculate the coordinates for drawing the patterns in the sequences. When new patterns are switched on, new overlaps might result, and consequently patterns may have to be 'pushed down'. Analogously, when patterns are switched off, less vertical space may be needed for each sequence, and hence the view becomes more compact.

References

Finding flexible patterns in unaligned protein sequences.
Inge Jonassen, John F. Collins, Desmond Higgins.
Protein Science 1995;4(8):1587-1595.

Efficient discovery of conserved patterns using a pattern graph.
Inge Jonassen.
Dept. of Informatics, Univ. of Bergen, Reports in Informatics no 118, March 1996.
