Pred2ary Protein Formats

Sequences can be loaded and saved in several formats:

Multiple sequences

Predictions made on a set of aligned homologous sequences are substantially more accurate (by 5%) than predictions made on a single sequence. All sequences must be aligned with each other before being loaded. Indicate gaps in the alignment with the period (.) character; also use periods as filler before and after sequences, so that all sequences are the same length.
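
The equal-length requirement is easy to check before loading a file. The following is a minimal Python sketch, not part of Pred2ary; the function names are hypothetical. It only adds filler at the right-hand end, because the amount of leading filler depends on where each sequence starts in the alignment, which only the alignment program knows.

```python
def pad_alignment(seqs, gap="."):
    """Strip whitespace, then right-pad each aligned sequence with gap characters."""
    cleaned = ["".join(s.split()) for s in seqs]
    width = max((len(s) for s in cleaned), default=0)
    return [s.ljust(width, gap) for s in cleaned]


def is_rectangular(seqs):
    """Return True when every sequence has the same length."""
    return len({len(s) for s in seqs}) <= 1
```

Calling is_rectangular() on the cleaned sequences before loading them catches the most common cause of a misaligned profile.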

MSF format

The GCG MSF format is a common format for saving multiple sequence alignments (it is the default output of the GCG program Pileup and of many alignment editors). For this format to be autodetected, make sure your file starts with either "!!AA_MULTIPLE_ALIGNMENT" or "MSF" followed by the name of the protein.

Here is an example of this format (some aligned sequences for BPTI):

MSF this is my profile of bpti sequences
sequence0  RPTFCNLLPE TGRCNALIPA FYYNSHLHKC QKFNYGGCGG NANNFKTIDE CQRTC...
anotherseq ....CTSPPV TGPCRAGFKR YNYNTRTKQC EPFKYGGCKG NGNRYKSEQD CLDACSG.
sequence2  .REVCSEQAE TGPCRAMISR WYFDVTEGKC APFFYGGCGG NRNNFDTEEY CMAVCGSA
morebpti   ..EVCSEQAE TGPCRAMISR WYFDVTEGKC APFFYGGCGG NRNNFDTEEY CMAVCG..
stillmore  PPDLCQLPQA RGPCKAALLR YFYNSTSNAC EPFTYGGCQG NNBNFETTEM CLPPECIR
lotsoseqs  KPDFCFLEED PGICRGYITR YFYNNQSKQC ERFKYGGCLG NLNNFESLEE CKNTCENP
You can also load sequence data with interleaved lines and line numbers (which are ignored).
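
The autodetection rule described above can be sketched in Python as follows. This is an illustration only, not Pred2ary's actual code: the function name is hypothetical, and treating the first non-blank line as the start of the file is an assumption.

```python
def looks_like_msf(text):
    """Return True if the first non-blank line carries one of the MSF markers."""
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # assumption: leading blank lines are skipped
        if tokens[0] == "!!AA_MULTIPLE_ALIGNMENT":
            return True
        # "MSF" must be followed by something (the protein name).
        return tokens[0] == "MSF" and len(tokens) > 1
    return False
```

Requiring "MSF" to be a separate word followed by a name keeps a bare sequence that happens to begin with the residues M, S, F from being mistaken for an MSF header.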

HSSP format

The HSSP format was designed by Chris Sander and Reinhard Schneider, and is used by the PredictProtein server (created by Burkhard Rost, Antoine de Daruvar, Chris Sander, and Reinhard Schneider). If you have used this server, you probably already have your data in HSSP format.

Pred2ary can't save files in this format, but it can load them correctly (except for insertions in the sequence of interest, which are ignored).

BLAST format

This isn't really meant to be a file format, so NCBI says it may change randomly in future releases of BLAST. However, Pred2ary will read BLAST output produced by versions 1.4.11 up through 2.0.7 (the current version at the time this is being written). If you tell BLAST to write output to a file using the -o option, Pred2ary can read the file directly. Also, if you use the NCBI web server and save results to a file (even in HTML), that file can be read. Files produced by "blastpgp" (i.e. PSI-BLAST) also work.

Pred2ary doesn't even try to save files in this format!

Multiple FASTA files

A profile can also be loaded as multiple FASTA format files, as long as you click on "Load together" rather than "Load separately." The key thing to remember here is that the sequences must be aligned already (with . characters), and all have to be the same length. Pad the ends with .'s if they're not. Here is the above MSF example in FASTA format:
>sequence0
RPTFCNLLPE TGRCNALIPA FYYNSHLHKC QKFNYGGCGG NANNFKTIDE CQRTC...

>anotherseq
....CTSPPV TGPCRAGFKR YNYNTRTKQC EPFKYGGCKG NGNRYKSEQD CLDACSG.

>sequence2
.REVCSEQAE TGPCRAMISR WYFDVTEGKC APFFYGGCGG NRNNFDTEEY CMAVCGSA

>morebpti
..EVCSEQAE TGPCRAMISR WYFDVTEGKC APFFYGGCGG NRNNFDTEEY CMAVCG..

>stillmore
PPDLCQLPQA RGPCKAALLR YFYNSTSNAC EPFTYGGCQG NNBNFETTEM CLPPECIR

>lotsoseqs
KPDFCFLEED PGICRGYITR YFYNNQSKQC ERFKYGGCLG NLNNFESLEE CKNTCENP
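
A file like the one above can be produced from an existing alignment with a short script. The sketch below is hypothetical (the function name and the ten-residue grouping are simply copied from the layout of the example, not taken from Pred2ary); it refuses to write anything if the sequences are not all the same length.

```python
def to_aligned_fasta(records):
    """Format (name, aligned_sequence) pairs as FASTA text in blocks of ten."""
    cleaned = [(name, "".join(seq.split())) for name, seq in records]
    lengths = {len(seq) for _, seq in cleaned}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    chunks = []
    for name, seq in cleaned:
        # Group residues in blocks of ten, as in the example above.
        blocks = " ".join(seq[i:i + 10] for i in range(0, len(seq), 10))
        chunks.append(f">{name}\n{blocks}\n")
    return "\n".join(chunks)
```

The gaps and end filler must already be present as . characters in each sequence; the function checks lengths but does not align anything.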

YAPF format

This is my own "Yet Another Profile Format"; it is much simpler than HSSP or MSF, and can store additional information, such as the real or predicted secondary structure. Information is stored in records (as in the PDB), which makes the file easy to parse and lets new kinds of information be added without breaking the format. It is meant to be really easy for people to read, even though this makes the file bigger than formats intended only to be read by computers.

Because this format can store both the profile and the predicted secondary structure, it is the default format for saving output. Here is an example of YAPF output, with my comments interleaved after the records they describe:

YAPF crambin
The file starts off with a name for the whole profile.
NALIGN 60
This line shows how many sequences are in the profile, just like in HSSP.
SEQNAME     1 emb|CAA57353| (X81709) Thionin class 1 [Tulipa gesneriana]
SEQNAME     2 bbs|85043 thionin [Hordeum jubatum, Peptide, 137 aa] >gi|246216|bbs|85042thionin [Hordeum marinum=barley, leaf, Peptide, 137 aa]
This part shows the full name of each sequence. These names get truncated in an MSF file. In YAPF format, each name is on one line, so the line might be really long; however, no information is lost. (58 more sequence names deleted for clarity)
SEQ     1 T T------------T-T---T--------T-------------------T-S---------
SEQ     2 T TSSSSSSSSSSSST-TSSSTSS-SSSSSTSSSSSSSSSSSS--T---STSS-S-------
SEQ     3 C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC--CCTCCCCCC-C-C
This is the sequence at each position for each member of the profile. (remainder of sequence deleted for clarity)
PREDSS     1 T - 0.011084 0.022133
PREDSS     2 T - 0.019588 0.056546
PREDSS     3 C - 0.024238 0.060772
PREDSS     4 C - 0.035713 0.060518
PREDSS     5 P - 0.065223 0.061858
PREDSS     6 S - 0.069242 0.057937
This shows the predicted secondary structure at each position ('-' means coil). The first column of floating point numbers contains the predicted helix probabilities (all 1-7% in this example), and the second such column contains the predicted strand probabilities. The coil probability is not shown explicitly in this file format; to calculate it, subtract the sum of the other two from 100%.

(other secondary structure predictions deleted for clarity)

END
The format ends with an 'END' record, so it's easy to store multiple predictions in one file.
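
The coil calculation described for the PREDSS records can be sketched in Python. This is a hedged illustration rather than Pred2ary's own parser: the function name is hypothetical, the field order is inferred from the example above (position, residue, predicted structure, helix probability, strand probability), and the probabilities are assumed to be stored as fractions of 1.

```python
def parse_predss(line):
    """Parse one PREDSS record and derive the coil probability."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "PREDSS":
        raise ValueError(f"not a PREDSS record: {line!r}")
    helix, strand = float(fields[4]), float(fields[5])
    return {
        "position": int(fields[1]),
        "residue": fields[2],      # assumption: third field is the residue
        "predicted": fields[3],    # '-' means coil
        "helix": helix,
        "strand": strand,
        "coil": 1.0 - helix - strand,  # not stored in the file
    }
```

For the first PREDSS record in the example, the derived coil probability is 1 - 0.011084 - 0.022133, or about 96.7%.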

Single sequence formats

This server can also do predictions on individual sequences. This is less accurate, so if possible, try to find some homologous sequences in a sequence database, and load them as a set.

EA (Estimated Accuracy) format

This is a simple format showing results; proteins can also be loaded in this format, although it doesn't support multiple sequences.

Lines beginning with * indicate the beginning of a new protein, and give the protein name.

Expected accuracy (and actual accuracy, if you loaded proteins in a format that contains the real secondary structure) are printed on the next couple of lines, in comments (comments in EA files begin with # characters).

Every subsequent line contains the sequence number, the consensus residue, the actual secondary structure (if supplied to the program; otherwise a '?' is shown), and the predicted secondary structure ('H' for helix, 'E' for extended, or strand, and '-' for coil). The final three numbers are the estimated probabilities of finding helix, strand, or coil at that position.

*9pti
# expected accuracy is 77.87%
# accuracy is 93.10%
    1 R   - 1.28% 2.13% 96.59%
    2 P   - 0.68% 0.68% 98.63%
    3 D G - 2.79% 3.91% 93.30%
    4 F G - 3.49% 16.91% 79.60%
    5 C G - 3.96% 23.74% 72.30%
    6 L G - 3.59% 34.08% 62.33%
    7 E S - 1.26% 21.38% 77.36%
    8 P   - 3.39% 7.63% 88.98%
    9 P   - 0.63% 5.03% 94.34%
   10 Y   - 2.79% 3.91% 93.30%
   (more deleted)
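
An EA file laid out as described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not Pred2ary's own reader: the function name is hypothetical, comment lines are simply skipped, and a residue line with a blank actual-structure column is assumed to split into six fields instead of seven.

```python
def parse_ea(text):
    """Parse EA-format text into a list of proteins with per-residue records."""
    proteins = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped.startswith("*"):
            current = {"name": stripped[1:].strip(), "residues": []}
            proteins.append(current)
            continue
        fields = stripped.split()
        if current is None or len(fields) not in (6, 7) or not fields[0].isdigit():
            continue  # not a residue line
        helix, strand, coil = (float(f.rstrip("%")) for f in fields[-3:])
        current["residues"].append({
            "number": int(fields[0]),
            "residue": fields[1],
            # Assumption: a blank actual-structure column leaves six fields.
            "actual": fields[2] if len(fields) == 7 else "",
            "predicted": fields[-4],
            "helix": helix, "strand": strand, "coil": coil,
        })
    return proteins
```

Reading the three probabilities from the right-hand end of the line keeps the parser working whether or not the actual secondary structure column is filled in.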

FASTA format

The easiest format to load proteins in is FASTA format. The beginning of a sequence is indicated by a new line containing the '>' character and the name of the sequence. Subsequent lines give the sequence in one-letter code. Whitespace is ignored.

Example:

>LCA_HUMAN
     MRFFVPLFLV GILFPAILAK QFTKCELSQL LKDIDGYGGI ALPELICTMF HTSGYDTQAI
     VENNESTEYG LFQISNKLWC KSSQVPQSRN ICDISCDKFL DDDITDDIMC AKKILDIKGI
     DYWLAHKALC TEKLEQWLCE KL
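
The rules above (a '>' line starts a sequence, whitespace is ignored) translate into a very small reader. The Python below is a sketch for illustration; the function name is hypothetical and this is not the code Pred2ary uses.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-format text."""
    records = []
    name, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:].strip(), []
        elif name is not None:
            # Drop all whitespace, including spaces between ten-residue blocks.
            parts.append("".join(line.split()))
    if name is not None:
        records.append((name, "".join(parts)))
    return records
```

Because gap characters are kept as part of the sequence, the same reader also works on the aligned FASTA records used for profiles.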

PDB format

Sequences can also be read from PDB files. If you use this method, you only need to include the SEQRES lines; other parts of the file are ignored.

Example:

SEQRES   1     58  ARG PRO ASP PHE CYS LEU GLU PRO PRO TYR THR GLY PRO  9PTI  32
SEQRES   2     58  CYS LYS ALA ARG ILE ILE ARG TYR PHE TYR ASN ALA LYS  9PTI  33
SEQRES   3     58  ALA GLY LEU CYS GLN THR PHE VAL TYR GLY GLY CYS ARG  9PTI  34
SEQRES   4     58  ALA LYS ARG ASN ASN PHE LYS SER ALA GLU ASP CYS MET  9PTI  35
SEQRES   5     58  ARG THR CYS GLY ALA                                  9PTI  36
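
Extracting a one-letter sequence from SEQRES lines like these can be sketched as follows. This is a hypothetical helper, not Pred2ary's code. It assumes the fixed-column PDB layout, with residue names in columns 20 to 70, and a single chain; residue names outside the twenty standard amino acids are written as 'X'.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_sequence(pdb_text):
    """Concatenate the residues on all SEQRES lines into one-letter code."""
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("SEQRES"):
            continue  # other parts of the file are ignored
        # Columns 20-70 hold residue names; later columns may hold an ID code.
        for name in line[19:70].split():
            residues.append(THREE_TO_ONE.get(name, "X"))
    return "".join(residues)
```

Slicing by column rather than splitting the whole line keeps the trailing entry code and line number (9PTI and 32 in the first line above) out of the sequence.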

Sequences can be saved to PDB files, but there is no standard way to put secondary structure predictions in PDB files. Pred2ary uses a non-standard method of inserting JMCSTR records (which contain the same data as the EA files above).