FASTA file format
Crux searches protein sequence databases given in FASTA format. The format is very simple. Every entry consists of a sequence identifier (ID), an optional comment (COMMENT), and a sequence (SEQUENCE). The format looks like this:
>ID COMMENT
SEQUENCE
The special character ">" marks the beginning of a new sequence. The ">" character is followed immediately by the sequence identifier. The rest of that line is occupied by the optional comment. Subsequent lines contain the sequence itself.
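As a minimal sketch of the header rule just described, the following Python function splits a header line into its identifier and optional comment. The function name `split_header` and the sample identifiers in the usage note are made up for illustration; this is not taken from Crux's own code.

```python
def split_header(line):
    """Split a FASTA header line into (identifier, comment).

    Illustrative helper only; not part of Crux.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # The identifier runs from just after '>' to the first whitespace;
    # whatever remains on the line is the optional comment.
    parts = line[1:].strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return seq_id, comment
```

For example, `split_header(">prot1 hypothetical kinase")` returns `("prot1", "hypothetical kinase")`, and a header with no comment, such as `">prot2"`, returns `("prot2", "")`.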
Some rules about representing sequences:
- A single protein sequence can span multiple lines. A ">" character at the beginning of a line marks the end of the current sequence and the start of the next entry.
- Case doesn't matter. Crux Suite converts everything to uppercase.
- White space (spaces and newlines) within the sequence is ignored.
- Characters should be from the amino acid alphabet, which contains twenty characters for amino acids ("ACDEFGHIKLMNPQRSTVWY") and is augmented by four more ambiguous characters ("BUXZ"):
  A  alanine                    P  proline
  B  aspartate or asparagine    Q  glutamine
  C  cysteine                   R  arginine
  D  aspartate                  S  serine
  E  glutamate                  T  threonine
  F  phenylalanine              U  any
  G  glycine                    V  valine
  H  histidine                  W  tryptophan
  I  isoleucine                 Y  tyrosine
  K  lysine                     Z  glutamate or glutamine
  L  leucine                    X  any
  M  methionine
  N  asparagine
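The rules above can be sketched as a small Python reader. This is an illustration under stated assumptions, not Crux's actual parser: the names `read_fasta` and `ALPHABET` are hypothetical, and raising an error on characters outside the alphabet is a choice made here, since the rules only say that characters "should" come from the alphabet.

```python
import io

# Twenty amino acid codes plus the four additional characters B, U, X, Z.
ALPHABET = set("ACDEFGHIKLMNPQRSTVWY" + "BUXZ")


def read_fasta(handle):
    """Yield (identifier, comment, sequence) tuples from a FASTA text stream.

    Illustrative sketch only; not part of Crux.
    """
    seq_id, comment, chunks = None, "", []
    for line in handle:
        if line.startswith(">"):
            # A '>' at the start of a line ends the previous sequence.
            if seq_id is not None:
                yield seq_id, comment, "".join(chunks)
            parts = line[1:].strip().split(None, 1)
            seq_id = parts[0] if parts else ""
            comment = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif seq_id is not None:
            # Remove all white space and convert to uppercase.
            residues = "".join(line.split()).upper()
            unexpected = set(residues) - ALPHABET
            if unexpected:
                # Assumption: reject unknown characters rather than skip them.
                raise ValueError(
                    f"unexpected characters in {seq_id}: {sorted(unexpected)}"
                )
            chunks.append(residues)
        # Lines before the first header are skipped in this sketch.
    if seq_id is not None:
        yield seq_id, comment, "".join(chunks)


# Usage with made-up data: a two-line sequence in mixed case with a space.
text = ">seq1 toy example\nacde fghi\nklmn\n>seq2\nPQRST\n"
records = list(read_fasta(io.StringIO(text)))
```

Here `records` holds two entries: `seq1` with the comment `toy example` and the joined, uppercased sequence `ACDEFGHIKLMN`, and `seq2` with an empty comment and the sequence `PQRST`.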
Here is an example of three sequences in FASTA format.