assign-confidence
Usage:
crux assign-confidence [options] <target input>
Description:
Given target and decoy scores, estimate a q-value for each target score. The q-value is analogous to a p-value but incorporates false discovery rate multiple testing correction. The q-value associated with a score threshold T is defined as the minimal false discovery rate (FDR) at which a score of T is deemed significant. In this setting, the q-value accounts for the fact that we are analyzing a large collection of scores. For confidence estimation aficionados, please note that this definition of "q-value" is independent of the notion of "positive FDR" as defined in (Storey, Annals of Statistics 31:2013-2035, 2003).
To estimate FDRs, assign-confidence uses one of two different procedures. Both require that the input contain both target and decoy scores. The default procedure, target-decoy competition (TDC), is described in this article:
Josh E. Elias and Steve P. Gygi. "Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry." Nature Methods. 4(3):207-14, 2007.
Note that assign-confidence implements a variant of the protocol proposed by Elias and Gygi: rather than reporting a list that contains both targets and decoys, assign-confidence reports only the targets. The FDR estimate is adjusted accordingly (by dividing by 2).
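As a rough illustration of a targets-only estimate, the following Python sketch counts the competition winners scoring at or above a threshold and divides the number of decoys by the number of targets. The function name, the assumption that higher scores are better, and the absence of any pseudo-count are illustrative choices, not a description of the Crux source code.

```python
def tdc_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among targets scoring at or above `threshold`.

    Hypothetical helper: both lists are assumed to hold only the
    winners of the target-decoy competition, and higher scores are
    assumed to be better.
    """
    n_targets = sum(1 for s in target_scores if s >= threshold)
    n_decoys = sum(1 for s in decoy_scores if s >= threshold)
    if n_targets == 0:
        # No targets accepted, so there is nothing to be wrong about.
        return 0.0
    # Decoy wins estimate the number of incorrect target wins.
    return min(1.0, n_decoys / n_targets)


if __name__ == "__main__":
    targets = [5.0, 4.0, 3.0, 2.0]
    decoys = [3.5, 1.0]
    print(tdc_fdr(targets, decoys, threshold=3.0))
```

Because only targets are reported, the decoy count appears once in the numerator rather than being doubled as it would be for a combined target-plus-decoy list.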
The alternative procedure, mix-max, is described in this article:
Uri Keich, Attila Kertesz-Farkas and William Stafford Noble. "An improved false discovery rate estimation procedure for shotgun proteomics." Journal of Proteome Research. 14(8):3148-3161, 2015.
Note that the mix-max procedure requires calibrated scores as input, such as Comet E-values or p-values produced using Tide's exact-p-value option.
The mix-max procedure requires that scores are reported from separate target and decoy searches. Thus, this approach is incompatible with a search that is run using the --concat T option to tide-search or the --decoy_search 2 option to comet. On the other hand, the TDC procedure can take as input searches conducted in either mode (concatenated or separate). If given separate search results and asked to do TDC estimation, assign-confidence will carry out the target-decoy competition as part of the confidence estimation procedure.
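The competition step for separate searches can be sketched in Python as follows. This is a simplified illustration, not the Crux implementation: it assumes that each spectrum is identified by a (scan, charge) pair, that higher scores are better, and that ties go to the target. The function name and the dictionary layout are hypothetical.

```python
def target_decoy_competition(targets, decoys):
    """Keep the better-scoring match for each spectrum.

    `targets` and `decoys` map a (scan, charge) key to the best score
    from the respective search. Returns a list of
    (key, score, is_decoy) tuples, one per spectrum.
    """
    winners = []
    for key in sorted(set(targets) | set(decoys)):
        t = targets.get(key)
        d = decoys.get(key)
        if d is None or (t is not None and t >= d):
            # Target wins (ties are given to the target in this sketch).
            winners.append((key, t, False))
        else:
            winners.append((key, d, True))
    return winners


if __name__ == "__main__":
    targets = {(1, 2): 3.0, (2, 2): 1.0}
    decoys = {(1, 2): 2.0, (2, 2): 1.5, (3, 1): 0.7}
    for row in target_decoy_competition(targets, decoys):
        print(row)
```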
In each case, the estimated FDRs are converted to q-values by sorting the scores and then taking, for each score, the minimum of the current FDR and all of the FDRs below it in the ranked list.
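The conversion from FDRs to q-values amounts to a running minimum taken from the bottom of the ranked list upward. A minimal Python sketch, assuming the FDR estimates are already ordered from best score to worst (the function name is illustrative):

```python
def fdr_to_qvalues(fdrs):
    """Convert ranked FDR estimates to q-values.

    `fdrs` must be ordered from the best score to the worst. Each
    q-value is the minimum of the FDR at that rank and all FDRs at
    lower ranks.
    """
    qvalues = [0.0] * len(fdrs)
    running_min = float("inf")
    # Walk from the worst score to the best, tracking the minimum so far.
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    print(fdr_to_qvalues([0.0, 0.5, 0.33, 0.25, 0.4]))
```

The resulting q-values never decrease as the score gets worse, even when the raw FDR estimates fluctuate.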
If tide-index was used to create multiple decoys per target using the num-decoys-per-target option, and the estimation-method is set to tdc, then assign-confidence will automatically carry out the average TDC (aTDC) procedure, which aims to reduce decoy-induced variability in the FDR estimates produced by TDC. The aTDC procedure is described in this article:
Uri Keich, Kaipo Tamura, and William Stafford Noble. "Averaging strategy to reduce variability in target-decoy estimates of false discovery rate." Journal of Proteome Research, 18(2):585-593, 2019.
A primer on multiple testing correction can be found here:
William Stafford Noble. "How does multiple testing correction work?" Nature Biotechnology. 27(12):1135-1137, 2009.
Input:
target input
– One or more files, each containing a collection of peptide-spectrum matches (PSMs) in tab-delimited text, PepXML, or mzIdentML format. In tab-delimited text format, only the specified score column is required. However, if --estimation-method is tdc, then the columns "scan" and "charge" are required, as well as "protein ID" if the search was run with concat=F. Furthermore, if --estimation-method is set to peptide-level, then the column "peptide" must be included, and if --sidak is set to T, then the "distinct matches/spectrum" column must be included.
Note that multiple files can also be provided either on the command line or using the --list-of-files option.
Decoys can be provided in two ways: either as a separate file or embedded within the same file as the targets. Crux will first search the given file for decoys using a prefix (specified via --decoy-prefix) on the protein name. If no decoys are found, then Crux will search for decoys in a separate file. The decoy file name is constructed from the target file name by replacing "target" with "decoy". For example, if tide-search.target.txt is provided as input, then Crux will search for a corresponding file named tide-search.decoy.txt.
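The file name substitution can be illustrated with a short Python sketch. The text above does not say which occurrence of "target" is replaced when the name contains more than one, so this sketch assumes the last occurrence in the base name; that choice and the function name are assumptions, not documented Crux behavior.

```python
from pathlib import Path


def decoy_path_for(target_path):
    """Derive the expected decoy file path from a target file path.

    Assumption: only the last occurrence of "target" in the file name
    (not in the directory part) is replaced. Returns None when the
    name does not contain "target".
    """
    path = Path(target_path)
    name = path.name
    idx = name.rfind("target")
    if idx == -1:
        return None
    decoy_name = name[:idx] + "decoy" + name[idx + len("target"):]
    return path.with_name(decoy_name)


if __name__ == "__main__":
    print(decoy_path_for("tide-search.target.txt"))
```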
Note that if decoys are provided in a separate file, then assign-confidence will first carry out a target-decoy competition, identifying corresponding pairs of targets and decoys and eliminating the one with the worse score. In this case, the column/tag called "delta_cn" will be eliminated from the output.
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
assign-confidence.target.txt
– a tab-delimited text file that contains the targets, sorted by score. The file will contain one new column, named "<method> q-value", where <method> is either "tdc" or "mix-max".
assign-confidence.log.txt
– a log file containing a copy of all messages that were printed to stderr.
assign-confidence.params.txt
– a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
Options:
assign-confidence options
--estimation-method mix-max|tdc|peptide-level
– Specify the method used to estimate q-values. The mix-max procedure and target-decoy competition apply to PSMs. The peptide-level option eliminates any PSM for which there exists a better scoring PSM involving the same peptide, and then uses decoys to assign confidence estimates. Default = tdc.
--score <string>
– Specify the column (for tab-delimited input) or tag (for XML input) used as input to the q-value estimation procedure. If this parameter is unspecified, then the program searches the input file for the "xcorr score", "evalue" (comet), and "exact p-value" score fields, in this order. Default = <empty>.
--sidak T|F
– Adjust the scores using the Sidak adjustment and report the adjusted values in a new column in the output file. Note that this adjustment only makes sense if the given scores are p-values, and that it requires the presence of the "distinct matches/spectrum" feature for each PSM. Default = false.
--top-match-in <integer>
– Specify the maximum rank to allow when parsing results files. Matches with ranks higher than this value will be ignored (a value of zero allows matches with any rank). Default = 0.
--combine-charge-states T|F
– Set this parameter to T in order to combine charge states with peptide sequences in peptide-centric search. Works only if estimation-method = peptide-level. Default = false.
--combine-modified-peptides T|F
– Set this parameter to T in order to treat peptides carrying different or no modifications as being the same. Works only if estimation-method = peptide-level. Default = false.
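Two of the options above describe simple transformations that can be sketched in Python. The first function applies the standard Sidak correction, 1 - (1 - p)^n, where n is taken to be the "distinct matches/spectrum" value; the second keeps only the best-scoring PSM per peptide, as the peptide-level method describes. Both are illustrative sketches with hypothetical names, and the second assumes that higher scores are better; neither is taken from the Crux source code.

```python
def sidak_adjust(p_value, n_tests):
    """Standard Sidak correction for `n_tests` independent tests.

    `n_tests` is assumed to correspond to the number of distinct
    candidate matches considered for the spectrum.
    """
    return 1.0 - (1.0 - p_value) ** n_tests


def best_psm_per_peptide(psms):
    """Keep the best score for each peptide.

    `psms` is an iterable of (peptide, score) pairs; higher scores are
    assumed to be better. Returns a dict mapping peptide -> best score.
    """
    best = {}
    for peptide, score in psms:
        if peptide not in best or score > best[peptide]:
            best[peptide] = score
    return best


if __name__ == "__main__":
    print(sidak_adjust(0.5, 2))
    print(best_psm_per_peptide([("PEPK", 2.0), ("PEPK", 3.1), ("AAAR", 1.2)]))
```

The Sidak formula treats the candidate matches for a spectrum as independent tests, which is why it only makes sense when the input scores are p-values.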
Input and output
--decoy-prefix <string>
– Specifies the prefix of the protein names that indicate a decoy. Default = decoy_.
--verbosity <integer>
– Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
--parameter-file <string>
– A file containing parameters. See the parameter documentation page for details. Default = <empty>.
--overwrite T|F
– Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
--output-dir <string>
– The name of the directory where output files will be created. Default = crux-output.
--list-of-files T|F
– Specify that the search results are provided as lists of files, rather than as individual files. Default = false.
--fileroot <string>
– The fileroot string will be added as a prefix to all output file names. Default = <empty>.