# percolator

## Usage:

crux percolator [options] <peptide-spectrum matches>

## Description:

Percolator is a semi-supervised learning algorithm that dynamically learns to separate target from decoy peptide-spectrum matches (PSMs). The algorithm is described in this article:

Lukas Käll, Jesse Canterbury, Jason Weston, William Stafford Noble and Michael J. MacCoss. "Semi-supervised learning for peptide identification from shotgun proteomics datasets." Nature Methods. 4(11):923-925, 2007.

Percolator requires as input two collections of PSMs, one set derived from matching observed spectra against real ("target") peptides, and a second derived from matching the same spectra against "decoy" peptides. The output consists of ranked lists of PSMs, peptides and proteins. Peptides and proteins are assigned two types of statistical confidence estimates: q-values and posterior error probabilities.

The features used by Percolator to represent each PSM are summarized here.

Percolator also includes code from Fido, which performs protein-level inference. The Fido algorithm is described in this article:

Oliver Serang, Michael J. MacCoss and William Stafford Noble. "Efficient marginalization to compute protein posterior probabilities from shotgun mass spectrometry data." Journal of Proteome Research. 9(10):5346-5357, 2010.

Crux includes code from Percolator. Crux Percolator differs from the stand-alone version of Percolator in the following respects:

• In addition to the native Percolator XML file format, Crux Percolator supports additional input file formats (SQT, PepXML, tab-delimited text) and output file formats (PepXML, mzIdentML, tab-delimited text). To maintain consistency with the rest of the Crux commands, Crux Percolator uses a different parameter syntax than the stand-alone version of Percolator.
• Like the rest of the Crux commands, Crux Percolator writes its files to an output directory, logs all standard error messages to a log file, and is capable of reading parameters from a parameter file.
• Reading from XML and from stdin is not supported at this time.

## Input:

• peptide-spectrum matches – A collection of target and decoy peptide-spectrum matches (PSMs). Input may be in one of five formats: PIN, SQT, pepXML, Crux tab-delimited text, or a list of files (when list-of-files=T). Note that if the input is provided as SQT, pepXML, or Crux tab-delimited text, then a PIN file will be generated in the output directory prior to execution. Crux determines the format of the input file by examining its filename extension.
Decoy PSMs can be provided to Percolator in two ways: either as a separate file or embedded within the same file as the target PSMs. Percolator will first search for decoy PSMs in a separate file. The decoy file name is constructed from the target file name by replacing "target" with "decoy". For example, if search.target.txt is provided as input, then Percolator will search for a corresponding file named search.decoy.txt. If no decoy file is found, then Percolator will assume that the given input file contains a mix of target and decoy PSMs. Within this file, decoys are identified using a prefix (specified via --decoy-prefix) on the protein name.
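
The decoy lookup described above can be illustrated with a short Python sketch. This is not Crux's own code: the function names are hypothetical, and the sketch assumes that only the first occurrence of "target" in the file name (not in the directory part) is replaced, which the text does not specify.

```python
from pathlib import Path
from typing import Optional


def decoy_path_for(target_path: str) -> Path:
    """Build the decoy file name by replacing "target" with "decoy"."""
    path = Path(target_path)
    # Assumption: rewrite only the file name, and only the first occurrence.
    return path.with_name(path.name.replace("target", "decoy", 1))


def find_separate_decoys(target_path: str) -> Optional[Path]:
    """Return the separate decoy file if it exists, otherwise None.

    None corresponds to the fallback case, where the input file is
    treated as a mix of target and decoy PSMs.
    """
    candidate = decoy_path_for(target_path)
    if candidate != Path(target_path) and candidate.is_file():
        return candidate
    return None


def is_decoy_protein(protein_id: str, decoy_prefix: str = "decoy_") -> bool:
    """Identify a decoy by the prefix on its protein name (cf. --decoy-prefix)."""
    return protein_id.startswith(decoy_prefix)
```

For example, `decoy_path_for("search.target.txt")` yields `search.decoy.txt`, matching the example in the text.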

## Output:

The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:

• percolator.target.proteins.txt – a tab-delimited file containing the target protein matches. See here for a list of the fields.
• percolator.decoy.proteins.txt – a tab-delimited file containing the decoy protein matches. See here for a list of the fields.
• percolator.target.peptides.txt – a tab-delimited file containing the target peptide matches. See here for a list of the fields.
• percolator.decoy.peptides.txt – a tab-delimited file containing the decoy peptide matches. See here for a list of the fields.
• percolator.target.psms.txt – a tab-delimited file containing the target PSMs. See here for a list of the fields.
• percolator.decoy.psms.txt – a tab-delimited file containing the decoy PSMs. See here for a list of the fields.
• percolator.params.txt – a file containing the name and value of all parameters for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
• percolator.pep.xml – a file containing the PSMs in pepXML format. This file can be used as input to some of the tools in the Trans-Proteomic Pipeline.
• percolator.mzid – a file containing the protein, peptide, and spectrum matches in mzIdentML format.
• percolator.log.txt – a log file containing a copy of all messages that were printed to standard error.
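
As a small convenience, the following Python sketch checks which of the files listed above are present after a run. The function name `output_status` is hypothetical. Some files may legitimately be absent, depending on which output options and protein inference options were enabled.

```python
from pathlib import Path
from typing import Dict

# File names copied from the list above.
OUTPUT_FILES = [
    "percolator.target.proteins.txt",
    "percolator.decoy.proteins.txt",
    "percolator.target.peptides.txt",
    "percolator.decoy.peptides.txt",
    "percolator.target.psms.txt",
    "percolator.decoy.psms.txt",
    "percolator.params.txt",
    "percolator.pep.xml",
    "percolator.mzid",
    "percolator.log.txt",
]


def output_status(output_dir: str = "crux-output") -> Dict[str, bool]:
    """Map each documented output file name to whether it exists."""
    base = Path(output_dir)
    return {name: (base / name).is_file() for name in OUTPUT_FILES}


if __name__ == "__main__":
    for name, present in output_status().items():
        print(("found    " if present else "missing  ") + name)
```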

## Options:

### Protein inference options

• --picked-protein <string> – Use the picked protein-level FDR to infer protein probabilities. Provide the FASTA file as the argument to this flag. Default = <empty>.
• --protein-enzyme no_enzyme|elastase|pepsin|proteinasek|thermolysin|trypsinp|chymotrypsin|lys-n|lys-c|arg-c|asp-n|glu-c|trypsin – Type of enzyme. Default = trypsin.
• --protein-report-fragments T|F – By default, if the peptides associated with protein A are a proper subset of the peptides associated with protein B, then protein A is eliminated and all the peptides are considered as evidence for protein B. Note that this filtering is done based on the complete set of peptides in the database, not based on the identified peptides in the search results. Alternatively, if this option is set and if all of the identified peptides associated with protein B are also associated with protein A, then Percolator will report a comma-separated list of protein IDs, where the full-length protein B is first in the list and the fragment protein A is listed second. Not available for Fido. Default = false.
• --protein-report-duplicates T|F – If multiple database proteins contain exactly the same set of peptides, then Percolator will randomly discard all but one of the proteins. If this option is set, then the IDs of these duplicated proteins will be reported as a comma-separated list. Not available for Fido. Default = false.
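
The default handling of fragments and duplicates can be illustrated with a minimal Python sketch. It is not Percolator's implementation: the function name and the input mapping (protein ID to its full set of database peptides) are assumptions, and duplicates are kept together as a group here rather than being discarded at random.

```python
from typing import Dict, FrozenSet, List, Set


def group_proteins(protein_peptides: Dict[str, Set[str]]) -> List[List[str]]:
    """Group proteins with identical peptide sets and drop fragments.

    A group is dropped when its peptide set is a proper subset of the
    peptide set of another group.
    """
    by_peptides: Dict[FrozenSet[str], List[str]] = {}
    for protein, peptides in protein_peptides.items():
        by_peptides.setdefault(frozenset(peptides), []).append(protein)

    kept = []
    for peptides, proteins in by_peptides.items():
        # "<" on frozensets tests for a proper subset.
        if any(peptides < other for other in by_peptides):
            continue
        kept.append(sorted(proteins))
    return sorted(kept)


if __name__ == "__main__":
    example = {
        "A": {"PEPK"},            # fragment of B and C
        "B": {"PEPK", "TIDER"},
        "C": {"TIDER", "PEPK"},   # duplicate of B
        "D": {"OTHERK"},
    }
    print(group_proteins(example))
```

In this toy input, protein A is eliminated as a fragment, and B and C form one duplicate group.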
### Fido options

• --protein T|F – Use the Fido algorithm to infer protein probabilities. Must be true to use any of the Fido options. Default = false.
• --fido-alpha <float> – Specify the probability with which a present protein emits an associated peptide. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0.
• --fido-beta <float> – Specify the probability of the creation of a peptide from noise. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0.
• --fido-gamma <float> – Specify the prior probability that a protein is present in the sample. Set by grid search (see --fido-gridsearch-depth parameter) if not specified. Default = 0.
• --fido-empirical-protein-q T|F – Estimate empirical p-values and q-values for proteins using target-decoy analysis. Default = false.
• --fido-gridsearch-depth <integer> – Set depth of the grid search for alpha, beta and gamma estimation. The values considered, for each possible value of the --fido-gridsearch-depth parameter, are as follows:
• 0: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1}; gamma = {0.1, 0.25, 0.5, 0.75}.
• 1: alpha = {0.01, 0.04, 0.09, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.025, 0.035, 0.05}; gamma = {0.1, 0.25, 0.5}.
• 2: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.1, 0.5}.
• 3: alpha = {0.01, 0.04, 0.16, 0.25, 0.36}; beta = {0.0, 0.01, 0.15, 0.030, 0.05}; gamma = {0.5}.
Default = 0.
• --fido-gridsearch-mse-threshold <float> – Q-value threshold that will be used in the computation of the MSE and ROC AUC score in the grid search. Default = 0.05.
• --fido-fast-gridsearch <float> – Apply the specified threshold to PSM, peptide and protein probabilities to obtain a faster estimate of the alpha, beta and gamma parameters. Default = 0.
• --fido-protein-truncation-threshold <float> – To speed up inference, proteins for which none of the associated peptides has a probability exceeding the specified threshold will be assigned probability = 0. Default = 0.01.
• --fido-no-split-large-components T|F – Do not approximate the posterior distribution by allowing large graph components to be split into subgraphs. The splitting is done by duplicating peptides with low probabilities. Splitting continues until the number of possible configurations of each subgraph is below 2^18. Default = false.
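
To get a feel for how --fido-gridsearch-depth trades thoroughness for speed, the Python sketch below encodes the value sets listed above exactly as given and counts the (alpha, beta, gamma) combinations at each depth. It assumes that every combination is evaluated, which the text does not state explicitly.

```python
from itertools import product

# Value sets copied verbatim from the --fido-gridsearch-depth description.
GRID = {
    0: {"alpha": [0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.5],
        "beta": [0.0, 0.01, 0.15, 0.025, 0.035, 0.05, 0.1],
        "gamma": [0.1, 0.25, 0.5, 0.75]},
    1: {"alpha": [0.01, 0.04, 0.09, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.025, 0.035, 0.05],
        "gamma": [0.1, 0.25, 0.5]},
    2: {"alpha": [0.01, 0.04, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.030, 0.05],
        "gamma": [0.1, 0.5]},
    3: {"alpha": [0.01, 0.04, 0.16, 0.25, 0.36],
        "beta": [0.0, 0.01, 0.15, 0.030, 0.05],
        "gamma": [0.5]},
}


def grid_points(depth: int):
    """Return every (alpha, beta, gamma) triple for the given depth."""
    values = GRID[depth]
    return list(product(values["alpha"], values["beta"], values["gamma"]))


if __name__ == "__main__":
    for depth in sorted(GRID):
        print(depth, len(grid_points(depth)))  # 196, 108, 50, 25
```

Under that assumption, the default depth of 0 covers 196 parameter triples, while depth 3 covers 25.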
### General options

• --only-psms T|F – Do not remove redundant peptides; keep all PSMs and exclude peptide level probability. Default = false.
• --tdc T|F – Use target-decoy competition to assign q-values and PEPs. When set to F, the mix-max method, which estimates the proportion pi0 of incorrect target PSMs, is used instead. Default = true.
• --search-input auto|separate|concatenated – Specify the type of target-decoy search. Using 'auto', Percolator attempts to detect the search type automatically. Using 'separate' specifies two searches: one against the target protein database and one against the decoy protein database. Using 'concatenated' specifies a single search against a concatenated target-decoy protein database. Default = auto.
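
For readers unfamiliar with q-values, the following Python sketch shows one simplified way to derive them from a ranked list of target and decoy PSMs. It estimates the FDR at each score threshold as the ratio of decoys to targets and then takes a running minimum. Percolator's own estimator may differ in its details, so treat this as an illustration of the idea only; the function name is hypothetical.

```python
from typing import List, Tuple


def tdc_q_values(psms: List[Tuple[float, bool]]) -> List[float]:
    """Compute q-values for (score, is_decoy) pairs, higher score = better.

    The returned list follows the PSMs sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)

    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Simplified FDR estimate at this threshold, capped at 1.
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # The q-value is the smallest FDR at this threshold or any looser one.
    q_values = fdrs[:]
    for i in range(len(q_values) - 2, -1, -1):
        q_values[i] = min(q_values[i], q_values[i + 1])
    return q_values
```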
### SVM training options

• --subset-max-train <integer> – Only train Percolator on a subset of PSMs, and use the resulting score vector to evaluate the other PSMs. Recommended when analyzing huge numbers (>1 million) of PSMs. When set to 0, all PSMs are used for training as normal. Default = 0.
• --c-pos <float> – Penalty for mistakes made on positive examples. If this value is set to 0, then it is set via cross validation over the values {0.1, 1, 10}, selecting the value that yields the largest number of PSMs identified at the q-value threshold set via the --test-fdr parameter. Default = 0.
• --c-neg <float> – Penalty for mistakes made on negative examples. If not specified, then this value is set by cross validation over {0.1, 1, 10}. Default = 0.
• --train-fdr <float> – False discovery rate threshold to define positive examples in training. Default = 0.01.
• --test-fdr <float> – False discovery rate threshold used in selecting hyperparameters during internal cross-validation and for reporting the final results. Default = 0.01.
• --maxiter <integer> – Maximum number of iterations for training. Default = 10.
• --quick-validation T|F – Quicker execution by reduced internal cross-validation. Default = false.
• --test-each-iteration T|F – Measure performance on test set each iteration. Default = false.
• --percolator-seed <string> – When given an unsigned integer value, seeds the random number generator with that value. When given the string "time", seeds the random number generator with the system time. Default = 1.
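
The interaction between --c-pos and --c-neg can be pictured as a small candidate grid. The Python sketch below lists the penalty pairs that would be tried, treating a value of 0 as "not specified". It assumes that the two penalties are searched jointly, which the text does not state, and the function name is hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Cross-validation values given in the --c-pos and --c-neg descriptions.
CV_VALUES = (0.1, 1.0, 10.0)


def penalty_candidates(c_pos: float = 0.0,
                       c_neg: float = 0.0) -> List[Tuple[float, float]]:
    """Return the (c_pos, c_neg) pairs to evaluate.

    A value of 0 means the penalty is chosen by cross validation;
    any other value is used as given.
    """
    pos = CV_VALUES if c_pos == 0 else (c_pos,)
    neg = CV_VALUES if c_neg == 0 else (c_neg,)
    return list(product(pos, neg))
```

With both options left at their defaults, this yields nine candidate pairs; fixing both penalties yields exactly one.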
### SVM feature input options

• --output-weights T|F – Output final weights to a file named "percolator.weights.txt". Default = false.
• --init-weights <string> – Read initial weights from the given file (one per line). Default = <empty>.
• --default-direction <string> – In its initial round of training, Percolator uses one feature to induce a ranking of PSMs. By default, Percolator will select the feature that produces the largest set of target PSMs at a specified FDR threshold (cf. --train-fdr). This option allows the user to specify which feature is used for the initial ranking, using the name as a string from this table. The name can be preceded by a hyphen (e.g. "-XCorr") to indicate that a lower value is better. Default = <empty>.
• --unitnorm T|F – Use unit normalization (i.e., linearly rescale each PSM's feature vector to have a Euclidean length of 1), instead of standard deviation normalization. Default = false.
• --override T|F – By default, Percolator will examine the learned weights for each feature, and if the weight appears to be problematic, then Percolator will discard the learned weights and instead employ a previously trained, static score vector. This switch allows this error checking to be overridden. Default = false.
• --klammer T|F – Use retention time features calculated as in "Improving tandem mass spectrum identification using peptide retention time prediction across diverse chromatography conditions" by Klammer AA, Yi X, MacCoss MJ and Noble WS. (Analytical Chemistry. 2007 Aug 15;79(16):6111-8.). Default = false.
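
The two normalization schemes mentioned under --unitnorm can be contrasted with a short Python sketch. Unit normalization follows the description above. For standard deviation normalization, the sketch assumes that each feature is centered on its mean and divided by its population standard deviation; Percolator's exact procedure is not described here and may differ.

```python
import math
from typing import List

Matrix = List[List[float]]


def unit_normalize(rows: Matrix) -> Matrix:
    """Rescale each PSM's feature vector to a Euclidean length of 1."""
    result = []
    for row in rows:
        length = math.sqrt(sum(value * value for value in row))
        # Leave all-zero vectors unchanged to avoid dividing by zero.
        result.append([value / length for value in row] if length > 0 else list(row))
    return result


def stdev_normalize(rows: Matrix) -> Matrix:
    """Center each feature (column) and divide by its standard deviation.

    Assumption: population standard deviation; constant features map to 0.
    """
    count = len(rows)
    columns = list(zip(*rows))
    means = [sum(column) / count for column in columns]
    stdevs = [
        math.sqrt(sum((value - mean) ** 2 for value in column) / count)
        for column, mean in zip(columns, means)
    ]
    return [
        [(value - mean) / stdev if stdev > 0 else 0.0
         for value, mean, stdev in zip(row, means, stdevs)]
        for row in rows
    ]
```

Note the difference in direction: unit normalization works per PSM (row), whereas standard deviation normalization works per feature (column).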
### Input and output

• --fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = <empty>.
• --output-dir <string> – The name of the directory where output files will be created. Default = crux-output.
• --overwrite T|F – Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
• --txt-output T|F – Output a tab-delimited results file to the output directory. Default = true.
• --pout-output T|F – Output a Percolator pout.xml format results file to the output directory. Default = false.
• --mzid-output T|F – Output an mzIdentML results file to the output directory. Default = false.
• --pepxml-output T|F – Output a pepXML results file to the output directory. Default = false.
• --feature-file-out T|F – Output the computed features in tab-delimited text format. Default = false.
• --list-of-files T|F – Specify that the search results are provided as lists of files, rather than as individual files. Default = false.
• --parameter-file <string> – A file containing parameters. See the parameter documentation page for details. Default = <empty>.
• --feature-file-in T|F – When set to T, interpret the input file as a PIN file. Default = false.
• --decoy-xml-output T|F – Include decoys (PSMs, peptides, and/or proteins) in the XML output. Default = false.
• --decoy-prefix <string> – Specifies the prefix of the protein names that indicate a decoy. Default = decoy_.
• --verbosity <integer> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
• --top-match <integer> – Specify the number of matches to report for each spectrum. Default = 5.
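
When Percolator is driven from a script, the usage line and the T|F convention for Boolean options can be captured in a small helper. The Python sketch below only assembles the argument list; the helper name is hypothetical, and it assumes that every option is passed as a flag followed by its value.

```python
from typing import Dict, List, Union

OptionValue = Union[bool, int, float, str]


def build_percolator_command(psm_file: str,
                             options: Dict[str, OptionValue]) -> List[str]:
    """Assemble `crux percolator [options] <peptide-spectrum matches>`."""
    argv = ["crux", "percolator"]
    for name, value in options.items():
        # Boolean options are written as T or F. This check must come
        # first, because bool is a subclass of int in Python.
        if isinstance(value, bool):
            value = "T" if value else "F"
        argv += ["--" + name, str(value)]
    argv.append(psm_file)
    return argv


if __name__ == "__main__":
    command = build_percolator_command(
        "search.target.txt",
        {"output-dir": "results", "overwrite": True, "top-match": 1},
    )
    print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where the `crux` executable is on the PATH.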