barista
Usage:
crux barista [options] <database> <fragmentation spectra> <search results>
Description:
Barista is a protein identification algorithm that combines two different tasks, peptide-spectrum match (PSM) verification and protein inference, into a single learning algorithm. The program requires three inputs: a set of MS2 spectra, a protein database, and the results of searching the spectra against the database. Barista produces as output three ranked lists (proteins, peptides and PSMs), based on how likely the proteins and peptides are to be present in the sample and how likely the PSMs are to be correct. Barista can jointly analyze the results of multiple shotgun proteomics runs, whether they come from different experiments or are replicate runs.
Barista uses a machine learning strategy that requires that the database search be carried out on target and decoy proteins. The searches may be carried out on a concatenated database or, using the --separate-searches option, separate target and decoy databases. The crux tide-index command can be used to generate a decoy database.
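As noted under Input, target and decoy proteins are told apart by a prefix on the sequence identifier (decoy_ by default). The following Python sketch illustrates that idea on FASTA text. It is not Barista's own parser: the function name is hypothetical, the example sequences are made up, and it assumes the identifier is the first whitespace-delimited token of each header line.

```python
from io import StringIO


def split_targets_and_decoys(fasta_lines, decoy_prefix="decoy_"):
    """Partition FASTA sequence identifiers into (targets, decoys) by prefix."""
    targets, decoys = [], []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # header line without an identifier
        # Assumption: the identifier is the first token after ">".
        identifier = fields[0]
        if identifier.startswith(decoy_prefix):
            decoys.append(identifier)
        else:
            targets.append(identifier)
    return targets, decoys


# Made-up two-entry database, for illustration only.
example = StringIO(
    ">sp|P1 first protein\nMKTAYR\n"
    ">decoy_sp|P1 shuffled copy\nRYATKM\n"
)
targets, decoys = split_targets_and_decoys(example)
print(targets, decoys)
```

A quick check like this can help confirm that a database really contains decoy entries with the expected prefix before running Barista.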
Barista assigns two types of statistical confidence estimates, q-values and posterior error probabilities, to identified PSMs, peptides and proteins. For more information about these values, see the documentation for assign-confidence.
More details on the Barista algorithm are provided in:
Marina Spivak, Jason Weston, Michael J. MacCoss and William Stafford Noble. "Direct maximization of protein identifications from tandem mass spectra." Molecular and Cellular Proteomics. 11(2):M111.012161, 2012.
Input:
database
– The program requires the FASTA format protein database files against which the search was performed. The protein database input may be a concatenated database or separate target and decoy databases; the latter is supported with the --separate-searches option, described below. In either case, Barista distinguishes between target and decoy proteins based on the presence of a decoy prefix on the sequence identifiers (see the --decoy-prefix option, below). The database can be provided in three different ways: (1) as a single FASTA file with suffix ".fa", ".fsa" or ".fasta", (2) as a text file containing a list of FASTA files, one per line, or (3) as a directory containing multiple FASTA files (identified via the filename suffixes ".fa", ".fsa" or ".fasta").
fragmentation spectra
– The fragmentation spectra must be provided in MS2, mzXML, or MGF format.
search results
– Search results in the tab-delimited text format produced by Crux or in SQT format. Like the spectra, the search results can be provided as a single file, a list of files or a directory of files. Note, however, that the input mode for spectra and for search results must be the same; i.e., if you provide a list of files for the spectra, then you must also provide a list of files containing your search results. When the MS2 files and tab-delimited text files are provided via a file listing, it is assumed that the order of the MS2 files matches the order of the tab-delimited files. Alternatively, when the MS2 files and tab-delimited files are provided via directories, the program will search for pairs of files with the same root name but different extensions (".ms2" and ".txt").
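The directory-based pairing rule and the order requirement for file listings can be sketched in Python as follows. This is only an illustration of the rule as described above, not Barista's code; the helper names and the example file names are hypothetical.

```python
import tempfile
from pathlib import Path


def pair_by_root(spectra_dir, results_dir):
    """Return (ms2, txt) path pairs that share the same root name."""
    results = {p.stem: p for p in Path(results_dir).glob("*.txt")}
    pairs = []
    for ms2 in sorted(Path(spectra_dir).glob("*.ms2")):
        if ms2.stem in results:
            pairs.append((ms2, results[ms2.stem]))
    return pairs


def write_file_lists(pairs, spectra_list, results_list):
    """Write two list files, one path per line, in matching order."""
    Path(spectra_list).write_text("".join(f"{ms2}\n" for ms2, _ in pairs))
    Path(results_list).write_text("".join(f"{txt}\n" for _, txt in pairs))


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Hypothetical file names; extra.txt has no matching spectra file.
    for name in ("run1.ms2", "run2.ms2", "run1.txt", "run2.txt", "extra.txt"):
        (tmp / name).touch()
    pairs = pair_by_root(tmp, tmp)
    paired_names = [(ms2.name, txt.name) for ms2, txt in pairs]
    write_file_lists(pairs, tmp / "spectra.list", tmp / "results.list")
    spectra_listing = (tmp / "spectra.list").read_text().splitlines()

print(paired_names)
```

Generating both list files from the same sequence of pairs is one way to guarantee that the two listings stay in the same order.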
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
barista.xml
– an XML file that contains information about proteins, subset proteins, peptides, and PSMs.
barista.target.proteins.txt
– a tab-delimited file containing a ranked list of groups of indistinguishable target proteins with associated Barista scores and q-values and with peptides that contributed to the identification of the protein group.
barista.target.subset-proteins.txt
– a tab-delimited file containing groups of indistinguishable proteins.
barista.target.peptides.txt
– a tab-delimited file containing a ranked list of target peptides with the associated Barista scores and q-values.
barista.target.psm.txt
– a tab-delimited file containing a ranked list of target peptide-spectrum matches with the associated Barista scores and q-values.
barista.log.txt
– a file where the program reports its progress.
barista.params.txt
– a file with the values of all the options given to the current run.
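Because the ranked lists are tab-delimited, they can be filtered with standard tools. The Python sketch below keeps the rows whose q-value is at or below a threshold. The exact column headers are not given on this page, so the column name is passed in as a parameter; the header names and rows used in the example are made up for illustration and should be replaced with those found in the real file.

```python
import csv
from io import StringIO


def rows_below_threshold(handle, q_column, threshold=0.01):
    """Return rows of a tab-delimited file whose q-value is <= threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row[q_column]) <= threshold]


# Made-up content; "protein group", "score" and "q-value" are hypothetical
# column names, not taken from an actual Barista output file.
example = StringIO(
    "protein group\tscore\tq-value\n"
    "P1\t3.2\t0.001\n"
    "P2\t1.1\t0.04\n"
)
kept = rows_below_threshold(example, q_column="q-value")
kept_groups = [row["protein group"] for row in kept]
print(kept_groups)
```

With a real file, the StringIO object would be replaced by an open file handle, for example one opened on crux-output/barista.target.proteins.txt.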
Options:
barista options
--separate-searches <string>
– If the target and decoy searches were run separately, rather than using a concatenated database, then the program will assume that the database search results provided as a required argument are from the target database search. This option then allows the user to specify the location of the decoy search results. Like the required arguments, these search results can be provided as a single file, a list of files or a directory. However, the choice (file, list or directory) must be consistent for the MS2 files and the target and decoy tab-delimited files. Also, if the MS2 and tab-delimited files are provided in directories, then Barista will use the MS2 filename (foo.ms2) to identify corresponding target and decoy tab-delimited files with names like foo*.target.txt and foo*.decoy.txt. This naming convention allows the target and decoy txt files to reside in the same directory. Default = <empty>.
--skip-cleanup T|F
– Analysis begins with a pre-processing step that creates a set of lookup tables which are then used during training. Normally, these lookup tables are deleted at the end of the analysis, but setting this option to T prevents the deletion of these tables. Subsequent analyses can then be repeated more efficiently by specifying the --re-run option. Default = false.
--re-run <string>
– Re-run a previous analysis using a previously computed set of lookup tables. For this option to work, the --skip-cleanup option must have been set to true when the program was run the first time. Default = <empty>.
--use-spec-features T|F
– Use an enriched feature set, including separate features for each ion type. Default = true.
--optimization protein|peptide|psm
– Specifies whether to do optimization at the protein, peptide or PSM level. Default = protein.
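The foo*.target.txt and foo*.decoy.txt naming convention described for --separate-searches can be checked ahead of time with a few lines of Python. This is a sketch of the wildcard matching only, under the assumption that the patterns behave like ordinary shell globs; the function name and file names are hypothetical, and the program's own matching may differ in detail.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def match_target_decoy(ms2_name, result_names):
    """Find result files named <root>*.target.txt and <root>*.decoy.txt."""
    root = Path(ms2_name).stem  # "foo.ms2" -> "foo"
    targets = sorted(
        n for n in result_names if fnmatchcase(n, f"{root}*.target.txt")
    )
    decoys = sorted(
        n for n in result_names if fnmatchcase(n, f"{root}*.decoy.txt")
    )
    return targets, decoys


# Hypothetical directory listing with results for two runs.
names = ["foo.target.txt", "foo.decoy.txt", "bar.target.txt", "bar.decoy.txt"]
foo_targets, foo_decoys = match_target_decoy("foo.ms2", names)
print(foo_targets, foo_decoys)
```

Note that a glob such as foo*.target.txt also matches a name like foobar.target.txt, so root names that are prefixes of one another may produce ambiguous matches under this reading of the convention.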
Enzymatic digestion
--enzyme no-enzyme|trypsin|trypsin/p|chymotrypsin|elastase|clostripain|cyanogen-bromide|iodosobenzoate|proline-endopeptidase|staph-protease|asp-n|lys-c|lys-n|arg-c|glu-c|pepsin-a|elastase-trypsin-chymotrypsin|lysarginase|custom-enzyme
– Specify the enzyme used to digest the proteins in silico. Available enzymes (with the corresponding digestion rules indicated in parentheses) include no-enzyme ([X]|[X]), trypsin ([RK]|{P}), trypsin/p ([RK]|[]), chymotrypsin ([FWYL]|{P}), elastase ([ALIV]|{P}), clostripain ([R]|[]), cyanogen-bromide ([M]|[]), iodosobenzoate ([W]|[]), proline-endopeptidase ([P]|[]), staph-protease ([E]|[]), asp-n ([]|[D]), lys-c ([K]|{P}), lys-n ([]|[K]), arg-c ([R]|{P}), glu-c ([DE]|{P}), pepsin-a ([FL]|{P}), elastase-trypsin-chymotrypsin ([ALIVKRWFY]|{P}), lysarginase ([]|[KR]). Specifying --enzyme no-enzyme yields a non-enzymatic digest. Warning: the resulting index may be quite large. Default = trypsin.
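To make the digestion-rule notation concrete, here is a small Python sketch that applies a rule such as [RK]|{P} to a sequence. It assumes the following reading of the notation: the part before the | constrains the residue before the cleavage site and the part after it constrains the residue that follows; square brackets list residues that permit cleavage, curly braces list residues that prevent it, and X stands for any residue. Treating empty brackets as "any residue" is an assumption made here. The sketch performs a complete digest with no missed cleavages and is not Crux's implementation.

```python
import re


def parse_rule(rule):
    """Parse a rule such as '[RK]|{P}' into two (residues, required) pairs."""
    sides = rule.split("|")
    if len(sides) != 2:
        raise ValueError(f"expected exactly one '|' in rule: {rule!r}")
    parsed = []
    for side in sides:
        match = re.fullmatch(r"\[([A-Z]*)\]|\{([A-Z]*)\}", side)
        if match is None:
            raise ValueError(f"malformed rule component: {side!r}")
        if match.group(1) is not None:
            parsed.append((match.group(1), True))   # [..]: residues required
        else:
            parsed.append((match.group(2), False))  # {..}: residues forbidden
    return parsed


def side_allows(residue, residues, required):
    """Check one residue against one side of the rule."""
    if required:
        # Assumption: an empty list or X accepts any residue.
        return residues == "" or "X" in residues or residue in residues
    return residue not in residues


def digest(sequence, rule):
    """Cleave at every position allowed by the rule (no missed cleavages)."""
    (before, before_req), (after, after_req) = parse_rule(rule)
    peptides, start = [], 0
    for i in range(1, len(sequence)):
        if side_allows(sequence[i - 1], before, before_req) and side_allows(
            sequence[i], after, after_req
        ):
            peptides.append(sequence[start:i])
            start = i
    peptides.append(sequence[start:])
    return peptides


tryptic = digest("MKPRAKG", "[RK]|{P}")  # trypsin rule from the list above
asp_n = digest("ADEDK", "[]|[D]")        # asp-n rule from the list above
print(tryptic, asp_n)
```

In the first example the K at position 2 is not cleaved because it is followed by P, which is what the {P} part of the trypsin rule expresses.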
Input and output
--decoy-prefix <string>
– Specifies the prefix of the protein names that indicates a decoy. Default = decoy_.
--fileroot <string>
– The fileroot string will be added as a prefix to all output file names. Default = <empty>.
--output-dir <string>
– The name of the directory where output files will be created. Default = crux-output.
--overwrite T|F
– Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
--pepxml-output T|F
– Output a pepXML results file to the output directory. Default = false.
--txt-output T|F
– Output a tab-delimited results file to the output directory. Default = true.
--parameter-file <string>
– A file containing parameters. See the parameter documentation page for details. Default = <empty>.
--verbosity <integer>
– Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
--list-of-files T|F
– Specify that the search results are provided as lists of files, rather than as individual files. Default = false.
--feature-file-out T|F
– Output the computed features in tab-delimited Percolator input (.pin) format. The features will be normalized, using either unit norm or standard deviation normalization (depending upon the value of the unit-norm option). Default = false.
--spectrum-parser pwiz|mstoolkit
– Specify the parser to use for reading in MS/MS spectra. The default is the ProteoWizard parser, which can read a range of MS/MS file formats. The alternative is the MSToolkit parser. If the ProteoWizard parser fails to read your files properly, you may want to try the MSToolkit parser instead. Default = pwiz.
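Putting the usage line and the options together, a command line can be assembled programmatically, which is convenient when many runs share the same settings. The Python sketch below only builds and prints the argument list, following the order given under Usage (options first, then database, spectra and search results). The helper name and the file names are hypothetical; the flags are those documented above, with Python booleans converted to the T|F form.

```python
import shlex


def barista_command(database, spectra, results, **options):
    """Build the argument list for: crux barista [options] <db> <ms2> <results>."""
    args = ["crux", "barista"]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")  # decoy_prefix -> --decoy-prefix
        if isinstance(value, bool):
            value = "T" if value else "F"     # booleans are written as T or F
        args += [flag, str(value)]
    args += [database, spectra, results]
    return args


# Hypothetical input file names and output directory.
cmd = barista_command(
    "db.fasta",
    "run.ms2",
    "run.txt",
    decoy_prefix="decoy_",
    overwrite=True,
    output_dir="barista-out",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where crux is installed.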