percolator
Usage:
crux percolator [options] <peptide-spectrum matches>
Description:
Percolator is a semi-supervised learning algorithm that dynamically learns to separate target from decoy peptide-spectrum matches (PSMs). The algorithm is described in this article:
Lukas Käll, Jesse Canterbury, Jason Weston, William Stafford Noble and Michael J. MacCoss. "Semi-supervised learning for peptide identification from shotgun proteomics datasets." Nature Methods. 4(11):923-925, 2007.
Percolator requires as input two collections of PSMs, one set derived from matching observed spectra against real ("target") peptides, and a second derived from matching the same spectra against "decoy" peptides. The output consists of ranked lists of PSMs, peptides and proteins. Peptides and proteins are assigned two types of statistical confidence estimates: q-values and posterior error probabilities.
The features used by Percolator to represent each PSM are summarized here.
Percolator also includes code from Fido, which performs protein-level inference. The Fido algorithm is described in this article:
Oliver Serang, Michael J. MacCoss and William Stafford Noble. "Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data." Journal of Proteome Research. 9(10):5346-5357, 2010.
Crux includes code from Percolator. Crux Percolator differs from the stand-alone version of Percolator in the following respects:
- In addition to the native Percolator XML file format, Crux Percolator supports additional input file formats (SQT, PepXML, tab-delimited text).
- To maintain consistency with the rest of the Crux commands, Crux Percolator uses different parameter syntax than the stand-alone version of Percolator.
- Like the rest of the Crux commands, Crux Percolator writes its files to an output directory, logs all standard error messages to a log file, and is capable of reading parameters from a parameter file.
- Reading from XML and stdin is not supported at this time.
Input:
peptide-spectrum matches
– One or more collections of target and decoy peptide-spectrum matches (PSMs). Input may be in one of four formats: PIN, SQT, pepXML, or Crux tab-delimited text. Note that if the input is provided as SQT, pepXML, or Crux tab-delimited text, then a PIN file will be generated in the output directory prior to execution. Crux determines the format of the input file by examining its filename extension.
For PIN files, target and decoy PSMs are assumed to appear in the same file. For other file types, decoy PSMs can be provided to Percolator in two ways: either as a separate file or embedded within the same file as the target PSMs. Percolator will first search for decoy PSMs in a separate file. The decoy file name is constructed from the target file name by replacing "target" with "decoy". For example, if search.target.txt is provided as input, then Percolator will search for a corresponding file named search.decoy.txt. If no decoy file is found, then Percolator will assume that the given input file contains a mix of target and decoy PSMs. Within this file, decoys are identified using a prefix (specified via --decoy-prefix) on the protein name.
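The decoy lookup described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not Crux's implementation: the function names are hypothetical, and it assumes that only the file name (not the directory) is rewritten and that every occurrence of "target" in it is replaced.

```python
from pathlib import Path


def decoy_path_for(target_path):
    """Return the decoy file name implied by a target file name.

    Assumption: only the file name, not its directory, is rewritten.
    """
    path = Path(target_path)
    return path.with_name(path.name.replace("target", "decoy"))


def is_decoy(protein_name, decoy_prefix="decoy_"):
    """Flag a PSM as a decoy from its protein name (see --decoy-prefix)."""
    return protein_name.startswith(decoy_prefix)


def resolve_inputs(target_path):
    """Return (target, decoy); decoy is None when no separate file exists,
    in which case the input is treated as a mixed target/decoy file."""
    target = Path(target_path)
    decoy = decoy_path_for(target)
    if decoy != target and decoy.exists():
        return target, decoy
    return target, None
```

With a mixed file, `is_decoy` would then be applied to each PSM's protein name to split targets from decoys.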
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
percolator.target.proteins.txt
– a tab-delimited file containing the target protein matches. See here for a list of the fields.
percolator.decoy.proteins.txt
– a tab-delimited file containing the decoy protein matches. See here for a list of the fields.
percolator.target.peptides.txt
– a tab-delimited file containing the target peptide matches. See here for a list of the fields.
percolator.decoy.peptides.txt
– a tab-delimited file containing the decoy peptide matches. See here for a list of the fields.
percolator.target.psms.txt
– a tab-delimited file containing the target PSMs. See here for a list of the fields.
percolator.decoy.psms.txt
– a tab-delimited file containing the decoy PSMs. See here for a list of the fields.
percolator.params.txt
– a file containing the name and value of all parameters for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
percolator.log.txt
– a log file containing a copy of all messages that were printed to standard error.
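As a convenience, a script that wraps a run might check which of the files listed above are present. The following is a minimal sketch with hypothetical helper names. It assumes no --fileroot prefix is in use, and it does not assume that every file appears in every run (for example, whether protein files are written when protein inference is disabled is not stated here).

```python
from pathlib import Path

# File names copied from the Output list above.
OUTPUT_NAMES = [
    "percolator.target.proteins.txt",
    "percolator.decoy.proteins.txt",
    "percolator.target.peptides.txt",
    "percolator.decoy.peptides.txt",
    "percolator.target.psms.txt",
    "percolator.decoy.psms.txt",
    "percolator.params.txt",
    "percolator.log.txt",
]


def expected_outputs(output_dir="crux-output"):
    """Return the paths the listed files would have under output_dir."""
    return [Path(output_dir) / name for name in OUTPUT_NAMES]


def missing_outputs(output_dir="crux-output"):
    """Return the listed files that are not present in output_dir."""
    return [path for path in expected_outputs(output_dir) if not path.exists()]
```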
Options:
- Percolator options
--max-charge-feature <integer>
– Specifies the maximum charge state feature. When set to zero, use the maximum observed charge state. Default = 0.
--no-terminate T|F
– Do not stop execution when encountering questionable SVM inputs or results. Default = false.
--protein-name-separator <string>
– Determines the character used to separate the protein IDs in the tab-delimited output format. Default = , (comma).
--spectral-counting-fdr <float>
– Report the number of unique PSMs and total (including shared peptides) PSMs as two extra columns in the protein tab-delimited output. Default = 0.
--train-best-positive T|F
– Enforce that, for each spectrum, at most one PSM is included in the positive set during each training iteration. Note that if the user only provides one PSM per spectrum, then this option will have no effect. Default = false.
- Protein inference options
--picked-protein <string>
– Use the picked protein-level FDR to infer protein probabilities. Provide the FASTA file as the argument to this flag. Default = <empty>.
--protein-enzyme no_enzyme|elastase|pepsin|proteinasek|thermolysin|trypsinp|chymotrypsin|lys-n|lys-c|arg-c|asp-n|glu-c|lysarginase|trypsin
– Type of enzyme. Default = trypsin.
--protein-report-duplicates T|F
– If multiple database proteins contain exactly the same set of peptides, then Percolator will randomly discard all but one of the proteins. If this option is set, then the IDs of these duplicated proteins will be reported as a comma-separated list. Not available for Fido. Default = false.
--protein-report-fragments T|F
– By default, if the peptides associated with protein A are a proper subset of the peptides associated with protein B, then protein A is eliminated and all the peptides are considered as evidence for protein B. Note that this filtering is done based on the complete set of peptides in the database, not based on the identified peptides in the search results. Alternatively, if this option is set and if all of the identified peptides associated with protein B are also associated with protein A, then Percolator will report a comma-separated list of protein IDs, where the full-length protein B is first in the list and the fragment protein A is listed second. Not available for Fido. Default = false.
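The duplicate and fragment rules above are set comparisons on each protein's peptides. The sketch below illustrates them with plain Python sets. It is not Percolator's code: the function names and the toy protein-to-peptide map are hypothetical.

```python
def group_duplicates(protein_peptides):
    """Group protein IDs that share exactly the same peptide set."""
    groups = {}
    for protein, peptides in protein_peptides.items():
        groups.setdefault(frozenset(peptides), []).append(protein)
    return [sorted(ids) for ids in groups.values()]


def fragments_of(protein_peptides, protein):
    """Return proteins whose peptide set is a proper subset of `protein`'s."""
    full = set(protein_peptides[protein])
    return sorted(
        other
        for other, peptides in protein_peptides.items()
        if set(peptides) < full  # "<" on sets tests for a proper subset
    )


# Hypothetical example: B and C are duplicates, A is a fragment of both.
example = {
    "A": {"PEPTIDEK"},
    "B": {"PEPTIDEK", "SAMPLER"},
    "C": {"PEPTIDEK", "SAMPLER"},
}
duplicate_groups = group_duplicates(example)
fragments = fragments_of(example, "B")
```

Joining a group with "," gives the kind of comma-separated ID list that --protein-report-duplicates describes.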
- Fido options
--fido-alpha <float>
– Specify the probability with which a present protein emits an associated peptide. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0.
--fido-beta <float>
– Specify the probability of the creation of a peptide from noise. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0.
--fido-empirical-protein-q T|F
– Estimate empirical p-values and q-values for proteins using target-decoy analysis. Default = false.
--fido-fast-gridsearch <float>
– Apply the specified threshold to PSM, peptide and protein probabilities to obtain a faster estimate of the alpha, beta and gamma parameters. Default = 0.
--fido-gamma <float>
– Specify the prior probability that a protein is present in the sample. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0.
--fido-gridsearch-depth <integer>
– Set depth of the grid search for alpha, beta and gamma estimation. The values considered, for each possible value of the --fido-gridsearch-depth parameter, are as follows:
- 0: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1}; gamma = {0.1, 0.25, 0.5, 0.75}.
- 1: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05}; gamma = {0.1, 0.25, 0.5}.
- 2: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.1, 0.5}.
- 3: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.5}.
Default = 0.
--fido-gridsearch-mse-threshold <float>
– Q-value threshold that will be used in the computation of the MSE and ROC AUC score in the grid search. Default = 0.05.
--fido-no-split-large-components T|F
– Do not approximate the posterior distribution by allowing large graph components to be split into subgraphs. The splitting is done by duplicating peptides with low probabilities. Splitting continues until the number of possible configurations of each subgraph is below 2^18. Default = false.
--fido-protein-truncation-threshold <float>
– To speed up inference, proteins for which none of the associated peptides has a probability exceeding the specified threshold will be assigned probability = 0. Default = 0.01.
--protein T|F
– Use the Fido algorithm to infer protein probabilities. Must be true to use any of the Fido options. Default = false.
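To get a feel for how --fido-gridsearch-depth trades thoroughness for speed, the sketch below enumerates the (alpha, beta, gamma) combinations at each depth. The grid values are copied verbatim from the list under --fido-gridsearch-depth. The names `GRIDS` and `grid_points` are illustrative only, and the sketch assumes the search visits the full Cartesian product, which is not confirmed here.

```python
from itertools import product

# depth -> (alpha values, beta values, gamma values), copied from the option text.
GRIDS = {
    0: ([0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5],
        [0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1],
        [0.1, 0.25, 0.5, 0.75]),
    1: ([0.01, 0.04, 0.09, 0.16, 0.25, 0.36],
        [0.0, 0.01, 0.15, 0.025, 0.035, 0.05],
        [0.1, 0.25, 0.5]),
    2: ([0.01, 0.04, 0.16, 0.25, 0.36],
        [0.0, 0.01, 0.15, 0.030, 0.05],
        [0.1, 0.5]),
    3: ([0.01, 0.04, 0.16, 0.25, 0.36],
        [0.0, 0.01, 0.15, 0.030, 0.05],
        [0.5]),
}


def grid_points(depth):
    """Return every (alpha, beta, gamma) triple for a given depth."""
    alphas, betas, gammas = GRIDS[depth]
    return list(product(alphas, betas, gammas))


# Number of parameter triples per depth.
sizes = {depth: len(grid_points(depth)) for depth in GRIDS}
```

Under that assumption the grid shrinks from 196 triples at depth 0 to 108, 50 and 25 at depths 1, 2 and 3.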
- General options
--only-psms T|F
– Report results only at the PSM level. This flag causes Percolator to skip the step that selects the top-scoring PSM per peptide; hence, peptide-level results are left out and only PSM-level results are reported. Default = false.
--search-input auto|separate|concatenated
– Specify the type of target-decoy search. Using 'auto', Percolator attempts to detect the search type automatically. Using 'separate' specifies two searches: one against the target and one against the decoy protein database. Using 'concatenated' specifies a single search against a concatenated target-decoy protein database. Default = auto.
--tdc T|F
– Use target-decoy competition to assign q-values and PEPs. When set to F, the mix-max method, which estimates the proportion pi0 of incorrect target PSMs, is used instead. Default = true.
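For readers unfamiliar with decoy-based q-values, here is a simplified sketch of how they can be estimated once target-decoy competition has already been applied. It is a generic textbook-style estimator, not Percolator's implementation: real tools may add corrections and handle ties differently, and the function name is hypothetical.

```python
def tdc_q_values(scores, is_decoy):
    """Estimate q-values for target PSMs from decoy counts.

    scores   -- list of PSM scores, higher is better
    is_decoy -- list of booleans, parallel to scores
    Returns (score, q_value) pairs for the targets, best score first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets = decoys = 0
    fdrs = []
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR among targets scoring at least this well.
            fdrs.append((scores[i], min(1.0, decoys / targets)))
    # The q-value is the smallest FDR at this threshold or any looser one.
    q_values = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append((score, running_min))
    q_values.reverse()
    return q_values


result = tdc_q_values([9.0, 8.0, 7.0, 6.0, 5.0], [False, False, True, False, True])
```

In this toy input the two best targets outrank every decoy, so their estimated q-value is 0, while the third target has one decoy ranked above it.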
- SVM training options
--c-neg <float>
– Penalty for mistakes made on negative examples. If not specified, then this value is set by cross validation over {0.1, 1, 10}. Default = 0.
--c-pos <float>
– Penalty for mistakes made on positive examples. If this value is set to 0, then it is set via cross validation over the values {0.1, 1, 10}, selecting the value that yields the largest number of PSMs identified at the q-value threshold set via the --test-fdr parameter. Default = 0.
--maxiter <integer>
– Maximum number of iterations for training. Default = 10.
--percolator-seed <string>
– When given an unsigned integer value, seeds the random number generator with that value. When given the string "time", seeds the random number generator with the system time. Default = 1.
--quick-validation T|F
– Quicker execution by reduced internal cross-validation. Default = false.
--static T|F
– Use the provided initial weights as a static model. If used, the --init-weights option must be specified. Default = false.
--subset-max-train <integer>
– Only train Percolator on a subset of PSMs, and use the resulting score vector to evaluate the other PSMs. Recommended when analyzing huge numbers (>1 million) of PSMs. When set to 0, all PSMs are used for training as normal. Default = 0.
--test-each-iteration T|F
– Measure performance on test set each iteration. Default = false.
--test-fdr <float>
– False discovery rate threshold used in selecting hyperparameters during internal cross-validation and for reporting the final results. Default = 0.01.
--train-fdr <float>
– False discovery rate threshold to define positive examples in training. Default = 0.01.
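The way --c-pos and --c-neg fall back to a search over {0.1, 1, 10} can be pictured as follows. This is a sketch only: `count_accepted` is a hypothetical callback standing in for "train, then count PSMs accepted at the --test-fdr threshold", and it assumes that a value of 0 means "not specified" for both penalties and that the two are searched jointly.

```python
from itertools import product

CANDIDATES = (0.1, 1, 10)


def choose_penalties(count_accepted, c_pos=0, c_neg=0):
    """Pick (c_pos, c_neg), searching the candidates for any value left at 0.

    count_accepted(c_pos, c_neg) -- hypothetical callback returning the
    number of PSMs accepted at the q-value threshold for that setting.
    """
    pos_grid = CANDIDATES if c_pos == 0 else (c_pos,)
    neg_grid = CANDIDATES if c_neg == 0 else (c_neg,)
    return max(product(pos_grid, neg_grid), key=lambda pair: count_accepted(*pair))


def toy_score(c_pos, c_neg):
    """Toy stand-in for a real evaluation; peaks at c_pos=1, c_neg=10."""
    return -abs(c_pos - 1) - abs(c_neg - 10)


best = choose_penalties(toy_score)
best_with_fixed_pos = choose_penalties(toy_score, c_pos=10)
```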
- SVM feature input options
--default-direction <string>
– In its initial round of training, Percolator uses one feature to induce a ranking of PSMs. By default, Percolator will select the feature that produces the largest set of target PSMs at a specified FDR threshold (cf. --train-fdr). This option allows the user to specify which feature is used for the initial ranking, using the name as a string from this table. The name can be preceded by a hyphen (e.g. "-XCorr") to indicate that a lower value is better. Default = <empty>.
--init-weights <string>
– Read the unnormalized initial weights from the third line of the given file. This can be the output of the --output-weights option from a previous Percolator analysis. Note that the weights must be in the same order as features in the PSM input file(s). Default = <empty>.
--klammer T|F
– Use retention time features calculated as in "Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions" by Klammer AA, Yi X, MacCoss MJ and Noble WS. (Analytical Chemistry. 2007 Aug 15;79(16):6111-8.). Default = false.
--output-weights T|F
– Output final weights to a file named "percolator.weights.txt". Default = false.
--override T|F
– By default, Percolator will examine the learned weights for each feature, and if the weight appears to be problematic, then Percolator will discard the learned weights and instead employ a previously trained, static score vector. This switch allows this error checking to be overridden. Default = false.
--unitnorm T|F
– Use unit normalization (i.e., linearly rescale each PSM's feature vector to have a Euclidean length of 1), instead of standard deviation normalization. Default = false.
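The difference between the two normalization modes selected by --unitnorm is that one works per PSM (row) and the other per feature (column). The sketch below shows both on plain lists. The function names are illustrative, and the standard deviation variant assumes mean-centering and a population standard deviation, details that the option text does not specify.

```python
import math


def unit_normalize(rows):
    """Rescale each PSM's feature vector (row) to Euclidean length 1."""
    normalized = []
    for row in rows:
        length = math.sqrt(sum(x * x for x in row))
        normalized.append([x / length for x in row] if length > 0 else list(row))
    return normalized


def stdev_normalize(rows):
    """Center each feature (column) and divide by its standard deviation.

    Assumption: population standard deviation; constant columns map to 0.
    """
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stdevs = [
        math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        for col, mean in zip(columns, means)
    ]
    return [
        [(x - mean) / sd if sd > 0 else 0.0 for x, mean, sd in zip(row, means, stdevs)]
        for row in rows
    ]


unit = unit_normalize([[3.0, 4.0]])
standardized = stdev_normalize([[1.0, 2.0], [3.0, 2.0]])
```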
- Input and output
--decoy-prefix <string>
– Specifies the prefix of the protein names that indicate a decoy. Default = decoy_.
--decoy-xml-output T|F
– Include decoys (PSMs, peptides, and/or proteins) in the XML output. Default = false.
--feature-file-out T|F
– Output the computed features in tab-delimited Percolator input (.pin) format. The features will be normalized, using either unit norm or standard deviation normalization (depending upon the value of the --unitnorm option). Default = false.
--fileroot <string>
– The fileroot string will be added as a prefix to all output file names. Default = <empty>.
--mzid-output T|F
– Output an mzIdentML results file to the output directory. Default = false.
--output-dir <string>
– The name of the directory where output files will be created. Default = crux-output.
--overwrite T|F
– Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
--parameter-file <string>
– A file containing parameters. See the parameter documentation page for details. Default = <empty>.
--pout-output T|F
– Output a Percolator pout.xml format results file to the output directory. Default = false.
--top-match <integer>
– Specify the number of matches to report for each spectrum. Default = 5.
--txt-output T|F
– Output a tab-delimited results file to the output directory. Default = true.
--verbosity <integer>
– Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
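When calling the command from a script, it can help to assemble the argument list programmatically. The helper below is a hypothetical convenience, not part of Crux. It follows the usage line (`crux percolator [options] <peptide-spectrum matches>`), uses only options documented above, and assumes each option is passed as a separate `--name value` pair with booleans written as T or F.

```python
import shlex


def percolator_command(psm_files, **options):
    """Build an argument list for `crux percolator`.

    Keyword names use underscores in place of hyphens
    (output_dir -> --output-dir); booleans become T or F.
    """
    argv = ["crux", "percolator"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"
        argv += ["--" + name.replace("_", "-"), str(value)]
    return argv + list(psm_files)


cmd = percolator_command(
    ["search.target.txt"],
    output_dir="results",
    overwrite=True,
    test_fdr=0.01,
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `crux` is installed.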