assign-confidence

Usage:

crux assign-confidence [options] <target input>

Description:

Given target and decoy scores, estimate a q-value for each target score. The q-value is analogous to a p-value but incorporates false discovery rate multiple testing correction. The q-value associated with a score threshold T is defined as the minimal false discovery rate (FDR) at which a score of T is deemed significant. In this setting, the q-value accounts for the fact that we are analyzing a large collection of scores. For confidence estimation aficionados, please note that this definition of "q-value" is independent of the notion of "positive FDR" as defined in (Storey Annals of Statistics 31:2013-2015:2003).

To estimate FDRs, assign-confidence uses one of two procedures. Both require that the input contain both target and decoy scores. The default procedure, target-decoy competition (TDC), is described in this article:

Josh E. Elias and Steve P. Gygi. "Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry." Nature Methods. 4(3):207-14, 2007.

Note that assign-confidence implements a variant of the protocol proposed by Elias and Gygi: rather than reporting a list that contains both targets and decoys, assign-confidence reports only the targets. The FDR estimate is adjusted accordingly (by dividing by 2).
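
As a rough illustration of a target-only FDR estimate, the sketch below takes the PSMs that survive the competition and, for each target score, divides the number of decoys scoring at or above that threshold by the number of targets scoring at or above it. This ratio form, the assumption that higher scores are better, the cap at 1.0, and the name tdc_fdr_estimates are all assumptions made for illustration; they are not taken from the Crux source and may differ from what assign-confidence actually computes.

```python
def tdc_fdr_estimates(psms):
    """Estimate an FDR for each target score (hypothetical helper).

    psms: iterable of (score, is_decoy) pairs that survived the competition.
    Assumes higher scores are better.
    Returns a list of (target_score, estimated_fdr), best score first.
    """
    targets = sorted((s for s, is_decoy in psms if not is_decoy), reverse=True)
    decoys = [s for s, is_decoy in psms if is_decoy]
    estimates = []
    for threshold in targets:
        # Count targets and decoys that score at least as well as the threshold.
        n_targets = sum(1 for s in targets if s >= threshold)
        n_decoys = sum(1 for s in decoys if s >= threshold)
        # Only targets are reported, so decoys estimate the false targets.
        estimates.append((threshold, min(1.0, n_decoys / n_targets)))
    return estimates


example = [(5.0, False), (4.0, False), (3.0, True), (2.0, False), (1.0, True)]
for score, fdr in tdc_fdr_estimates(example):
    print(score, round(fdr, 3))
```

The quadratic counting keeps the sketch easy to read; a real implementation would make a single pass over the sorted list.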

The alternative procedure, mix-max, is described in this article:

Uri Keich, Attila Kertesz-Farkas and William Stafford Noble. "An improved false discovery rate estimation procedure for shotgun proteomics." Journal of Proteome Research. 14(8):3148-3161, 2015.

Note that the mix-max procedure requires calibrated scores as input, such as Comet E-values or p-values produced using Tide's exact-p-value option.

The mix-max procedure requires that scores are reported from separate target and decoy searches. Thus, this approach is incompatible with a search that is run using the --concat T option to tide-search or the --decoy_search 2 option to comet. On the other hand, the TDC procedure can take as input searches conducted in either mode (concatenated or separate). If given separate search results and asked to do TDC estimation, assign-confidence will carry out the target-decoy competition as part of the confidence estimation procedure.
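
The competition step for separate search results can be sketched as follows. The example pairs target and decoy PSMs by (scan, charge), keeps whichever has the better score, and records whether the winner is a decoy. Keying on (scan, charge), treating higher scores as better, letting the target win ties, and the function name itself are assumptions for illustration, not a description of the Crux implementation.

```python
def target_decoy_competition(targets, decoys):
    """Keep the better-scoring PSM for each spectrum (hypothetical helper).

    targets, decoys: dicts mapping (scan, charge) -> score.
    Assumes higher scores are better; ties go to the target (assumption).
    Returns a dict mapping (scan, charge) -> (score, is_decoy).
    """
    winners = {}
    for key in targets.keys() | decoys.keys():
        t = targets.get(key)
        d = decoys.get(key)
        if d is None or (t is not None and t >= d):
            winners[key] = (t, False)   # target wins or has no competitor
        else:
            winners[key] = (d, True)    # decoy wins or has no competitor
    return winners


targets = {(1, 2): 3.0, (2, 2): 1.0}
decoys = {(1, 2): 2.0, (2, 2): 1.5, (3, 1): 0.7}
print(target_decoy_competition(targets, decoys))
```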

In each case, the estimated FDRs are converted to q-values by sorting the scores and then taking, for each score, the minimum of the current FDR and all of the FDRs below it in the ranked list.
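
The conversion from FDRs to q-values amounts to a running minimum taken from the bottom of the ranked list upward. A minimal sketch, assuming the FDR estimates are already listed in ranked order from best score to worst (the function name is hypothetical):

```python
def fdrs_to_qvalues(fdrs):
    """Convert ranked FDR estimates to q-values (hypothetical helper).

    fdrs: FDR estimates ordered from the best score to the worst.
    Each q-value is the minimum of the FDR at that rank and all FDRs
    below it in the ranked list.
    """
    qvalues = []
    running_min = float("inf")
    # Walk from the worst score to the best, tracking the smallest FDR seen.
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()
    return qvalues


print(fdrs_to_qvalues([0.0, 0.5, 0.33, 0.25, 0.4]))
```

One consequence is that q-values never decrease as the score gets worse, even when the raw FDR estimates fluctuate.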

If tide-index was used to create multiple decoys per target using the num-decoys-per-target option, and estimation-method is set to tdc, then assign-confidence will automatically carry out the average TDC (aTDC) procedure, which aims to reduce decoy-induced variability in the FDR estimates produced by TDC. The aTDC procedure is described in this article:

Uri Keich, Kaipo Tamura, and William Stafford Noble. "Averaging strategy to reduce variability in target-decoy estimates of false discovery rate." Journal of Proteome Research, 18(2):585-593, 2019.

A primer on multiple testing correction can be found here:

William Stafford Noble. "How does multiple testing correction work?" Nature Biotechnology. 27(12):1135-1137, 2009.

Input:

target input – One or more files, each containing a collection of peptide-spectrum matches (PSMs) in tab-delimited text, PepXML, or mzIdentML format. In tab-delimited text format, only the specified score column is required. However, if --estimation-method is tdc, then the columns "scan" and "charge" are required, as well as "protein ID" if the search was run with concat=F. Furthermore, if --estimation-method is set to peptide-level, then the column "peptide" must be included, and if --sidak is set to T, then the "distinct matches/spectrum" column must be included.
Note that multiple files can also be provided either on the command line or using the --list-of-files option.
Decoys can be provided in two ways: either as a separate file or embedded within the same file as the targets. Crux will first search the given file for decoys using a prefix (specified via --decoy-prefix) on the protein name. If no decoys are found, then Crux will search for decoys in a separate file. The decoy file name is constructed from the target file name by replacing "target" with "decoy". For example, if tide-search.target.txt is provided as input, then Crux will search for a corresponding file named "tide-search.decoy.txt".
Note that if decoys are provided in a separate file, then assign-confidence will first carry out a target-decoy competition, identifying corresponding pairs of targets and decoys and eliminating the one with the worse score. In this case, the column/tag called "delta_cn" will be eliminated from the output.
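
The two decoy lookup rules described above can be sketched as follows. Both function names are hypothetical, and replacing every occurrence of "target" in the file name (but not in the directory part) is an assumption; the text above does not say how a name containing "target" more than once is handled.

```python
from pathlib import Path


def is_decoy_protein(protein_id, decoy_prefix="decoy_"):
    """Return True if the protein name carries the decoy prefix."""
    return protein_id.startswith(decoy_prefix)


def decoy_file_for(target_path):
    """Build the decoy file path by replacing "target" with "decoy".

    Only the file name is changed, not the directory (assumption).
    """
    path = Path(target_path)
    return path.with_name(path.name.replace("target", "decoy"))


print(is_decoy_protein("decoy_sp|P12345|EXAMPLE"))
print(decoy_file_for("tide-search.target.txt").name)
```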

Output:

The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:

assign-confidence.target.txt – a tab-delimited text file that contains the targets, sorted by score. The file will contain one new column, named "<method> q-value", where <method> is either "tdc" or "mix-max".
assign-confidence.log.txt – a log file containing a copy of all messages that were printed to stderr.
assign-confidence.params.txt – a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.

Options:

assign-confidence options
- --estimation-method mix-max|tdc|peptide-level – Specify the method used to estimate q-values. The mix-max and target-decoy competition procedures apply to PSMs. The peptide-level option eliminates any PSM for which there exists a better scoring PSM involving the same peptide, and then uses decoys to assign confidence estimates. Default = tdc.
- --score <string> – Specify the column (for tab-delimited input) or tag (for XML input) used as input to the q-value estimation procedure. If this parameter is unspecified, then the program searches the input file for the score fields "xcorr score", "evalue" (comet), and "exact p-value", in this order. Default = <empty>.
- --sidak T|F – Adjust the scores using the Sidak adjustment and report them in a new column in the output file. Note that this adjustment only makes sense if the given scores are p-values, and that it requires the presence of the "distinct matches/spectrum" feature for each PSM. Default = false.
- --top-match-in <integer> – Specify the maximum rank to allow when parsing results files. Matches with ranks higher than this value will be ignored (a value of zero allows matches with any rank). Default = 0.
- --combine-charge-states T|F – Set this parameter to T in order to combine charge states with peptide sequences in peptide-centric search. Works only if estimation-method = peptide-level. Default = false.
- --combine-modified-peptides T|F – Set this parameter to T in order to treat peptides carrying different or no modifications as being the same. Works only if estimation-method = peptide-level. Default = false.
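
For readers unfamiliar with the Sidak adjustment used by --sidak, the standard formula is p_adj = 1 - (1 - p)^n, where n is the number of candidates tested. The sketch below assumes that n corresponds to the "distinct matches/spectrum" value for the PSM; the function name is hypothetical, and this is not a copy of the Crux implementation.

```python
import math


def sidak_adjust(p_value, n_candidates):
    """Apply the Sidak adjustment: 1 - (1 - p) ** n.

    p_value: a p-value in [0, 1].
    n_candidates: number of candidates tested for the spectrum
    (assumed here to be the "distinct matches/spectrum" value).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    if p_value == 1.0:
        return 1.0  # log1p(-1.0) is undefined, so handle this case directly
    # log1p/expm1 preserve precision when p_value is very small.
    return -math.expm1(n_candidates * math.log1p(-p_value))


print(sidak_adjust(0.5, 2))
print(sidak_adjust(1e-6, 1000))
```

The log1p/expm1 form is algebraically the same as the plain formula but avoids losing precision for the very small p-values that are typical of good matches.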
Input and output
- --decoy-prefix <string> – Specifies the prefix of the protein names that indicates a decoy. Default = decoy_.
- --verbosity <integer> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
- --parameter-file <string> – A file containing parameters. See the parameter documentation page for details. Default = <empty>.
- --overwrite T|F – Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
- --output-dir <string> – The name of the directory where output files will be created. Default = crux-output.
- --list-of-files T|F – Specify that the search results are provided as lists of files, rather than as individual files. Default = false.
- --fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = <empty>.