spectral-counts

Usage:

crux spectral-counts [options] <input PSMs>

Description:

Given a collection of scored PSMs, produce a list of proteins or peptides ranked by a quantification score. Spectral-counts supports four types of quantification: Normalized Spectral Abundance Factor (NSAF), Distributed Normalized Spectral Abundance Factor (dNSAF), Normalized Spectral Index (SI_N) and Exponentially Modified Protein Abundance Index (emPAI). The NSAF method is from Paoletti et al. (2006). The SI_N method is from Griffin et al. (2010). The emPAI method was first described in Ishihama et al. (2005). The quantification methods are defined below and in the following paper:

S McIlwain, M Mathews, M Bereman, EW Rubel, MJ MacCoss, and WS Noble. "Estimating relative abundances of proteins from shotgun proteomics data." BMC Bioinformatics. 13:308, 2012.

Protein Quantification

For each protein in a given database, the NSAF score is:
$$NSAF_N=\frac{S_N/L_N}{\sum_{i=1}^nS_i/L_i}$$
where:
- N is the protein index
- S_N is the number of peptide spectra matched to the protein
- L_N is the length of protein N
- n is the total number of proteins in the input database
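The NSAF definition above can be sketched in a few lines of Python. This is an illustrative sketch, not the Crux implementation; the function name `nsaf` and the protein IDs, counts and lengths are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Return NSAF_N = (S_N / L_N) / sum_i (S_i / L_i) for every protein.

    spectral_counts: dict mapping protein ID -> number of matched spectra (S_N)
    lengths:         dict mapping protein ID -> protein length (L_N)
    """
    # Length-normalized spectral count for each protein.
    rates = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(rates.values())
    if total == 0:
        # No spectra matched at all: every score is zero.
        return {p: 0.0 for p in rates}
    return {p: r / total for p, r in rates.items()}


# Hypothetical input: three proteins with made-up counts and lengths.
counts = {"P1": 10, "P2": 5, "P3": 5}
lengths = {"P1": 100, "P2": 50, "P3": 100}
scores = nsaf(counts, lengths)
```

Because every score is divided by the same sum, the NSAF values of all proteins add up to 1.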
For each protein in a given database, the dNSAF score is:
$$dNSAF_N=\frac{\frac{uSpc_N+(d)sSpc_N}{uL_N+sL_N}}{\sum_{i=1}^n\frac{uSpc_i+(d)sSpc_i}{uL_i+sL_i}}$$
where:
- N is the protein index
- uSpc_N is the number of unique spectra matched to protein N
- sSpc_N is the number of shared peptide spectra matched to protein N
- L_N is the length of protein N
- n is the total number of proteins in the input database
- d is the distribution factor of peptide K to protein N, given by
  $$d=\frac{uSpc_N}{\sum_{i=1}^nuSpc_i}$$
For each protein in a given database, the SI_N score is:
$$SI_N=\frac{\sum_{j=1}^{p_N}(\sum_{k=1}^{s_j}i_k)}{L_N(\sum_{j=1}^nSI_j)}$$
where:
- N is the protein index
- p_N is the number of unique peptides in protein N
- s_j is the number of spectra assigned to peptide j
- i_k is the total fragment ion intensity of spectrum k
- L_N is the length of protein N
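A minimal Python sketch of the protein-level SI_N formula follows. It assumes that the SI_j terms in the denominator are the unnormalized spectral indices (the summed fragment ion intensity of each protein); the function name and the input values are hypothetical, and this is not the Crux implementation.

```python
def normalized_spectral_index(intensities, lengths):
    """Return SI_N for every protein.

    intensities: dict mapping protein ID -> dict mapping peptide -> list of
                 total fragment ion intensities, one per assigned spectrum (i_k)
    lengths:     dict mapping protein ID -> protein length (L_N)
    """
    # Unnormalized spectral index: sum over peptides, then over their spectra.
    raw = {
        prot: sum(sum(spectra) for spectra in peptides.values())
        for prot, peptides in intensities.items()
    }
    # Assumption: the denominator sums these unnormalized indices.
    total = sum(raw.values())
    if total == 0:
        return {prot: 0.0 for prot in raw}
    return {prot: raw[prot] / (lengths[prot] * total) for prot in raw}


# Hypothetical input: two proteins, three peptides, four spectra.
intensities = {
    "P1": {"PEPTIDEA": [100.0, 300.0], "PEPTIDEB": [600.0]},
    "P2": {"PEPTIDEC": [3000.0]},
}
lengths = {"P1": 100, "P2": 150}
si_n = normalized_spectral_index(intensities, lengths)
```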
For each protein in a given database, the emPAI score is:
$$emPAI=10^{\frac{N_{observed}}{N_{observable}}}-1$$
where:
- N_observed is the number of experimentally observed peptides with scores above a specified threshold.
- N_observable is the calculated number of observable peptides for the protein given the search constraints.
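The emPAI formula above is a one-liner in Python. The sketch below is illustrative only; the function name and the peptide counts are hypothetical, and computing N_observable from the search constraints is not shown.

```python
def empai(n_observed, n_observable):
    """Return emPAI = 10^(N_observed / N_observable) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1


# Hypothetical protein: 5 of 10 observable peptides were observed.
half_observed = empai(5, 10)
# A protein with no observed peptides scores exactly 0.
none_observed = empai(0, 10)
```

The score grows exponentially with the fraction of observable peptides that were actually observed, reaching 9 when every observable peptide is seen.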

Peptide Quantification

For each peptide in a given database, the NSAF score is:
$$NSAF_N=\frac{S_N/L_N}{\sum_{i=1}^nS_i/L_i}$$
where:
- N is the peptide index
- S_N is the number of spectra matched to peptide N
- L_N is the length of peptide N
- n is the total number of peptides in the input database
For each peptide in a given database, the SI_N score is:
$$SI_N=\frac{\sum_{k=1}^{S_N}i_k}{L_N(\sum_{j=1}^nSI_j)}$$
where:
- N is the peptide index
- S_N is the number of spectra assigned to peptide N
- i_k is the total fragment ion intensity of spectrum k
- L_N is the length of peptide N

Input:

input PSMs – A PSM file in either tab-delimited text format (as produced by percolator) or pepXML format.

Output:

The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:

spectral-counts.target.txt – a tab-delimited text file containing the protein IDs and their corresponding scores, in sorted order.
spectral-counts.params.txt – a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other Crux programs.
spectral-counts.log.txt – a file containing all messages written to standard error.

Options:

spectral-counts options
- --parsimony none|simple|greedy – Perform a parsimony analysis on the proteins, and report a "parsimony rank" column in the output file. This column contains integers indicating the protein's rank in a list sorted by spectral counts. If the parsimony analysis results in two proteins being merged, then their parsimony rank is the same. In such a case, the rank is assigned based on the largest spectral count of any protein in the merged meta-protein. The "simple" parsimony algorithm only merges two proteins A and B if the peptides identified in protein A are the same as or a subset of the peptides identified in protein B. The "greedy" parsimony algorithm does additional merging, by identifying the longest protein (i.e., the protein with the most peptides) that contains one or more shared peptides. The shared peptides are assigned to the identified protein and removed from any other proteins that contain them, and the process is then repeated. Note that, with this option, some proteins end up being assigned no peptides at all; these orphan proteins are not reported in the output. Default = none.
- --threshold <float> – Only consider PSMs that pass this threshold value. By default, PSMs are filtered by q-value using the specified threshold. This behavior can be changed using the --custom-threshold-name and --custom-threshold-min parameters. Default = 0.01.
- --threshold-type none|qvalue|custom – Determines what type of threshold to use when filtering matches. none : read all matches, qvalue : use calculated q-value from percolator, custom : use --custom-threshold-name and --custom-threshold-min parameters. Default = qvalue.
- --input-ms2 <string> – MS2 file corresponding to the PSM file. Required for the SIN measure. Ignored for NSAF, dNSAF and EMPAI. Default = <empty>.
- --unique-mapping T|F – Ignore peptides that map to multiple proteins. Default = false.
- --quant-level protein|peptide – Quantification at protein or peptide level. Default = protein.
- --measure RAW|NSAF|dNSAF|SIN|EMPAI – Type of analysis to perform on the match results. With the exception of the RAW metric, the database of sequences needs to be provided using --protein-database. Default = NSAF.
- --custom-threshold-name <string> – Specify which field to apply the threshold to. The direction of the threshold (<= or >=) is governed by --custom-threshold-min. By default, the threshold applies to the q-value, as given in the "percolator q-value" or "decoy q-value (xcorr)" field. Default = <empty>.
- --custom-threshold-min T|F – When selecting matches with a custom threshold, custom-threshold-min determines whether to filter matches whose custom-threshold-name values are greater than or equal to (F) or less than or equal to (T) the threshold. Default = true.
- --mzid-use-pass-threshold T|F – Use mzid's passThreshold attribute to filter matches. Default = false.
- --protein-database <string> – The name of the file in FASTA format. Default = <empty>.
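The merge condition of the "simple" parsimony algorithm described under --parsimony can be illustrated with Python sets. This sketch only finds the pairs that satisfy the subset rule; how Crux then builds and ranks the merged meta-proteins is not shown, and the function name and protein and peptide names are hypothetical.

```python
def simple_merge_pairs(protein_peptides):
    """Return (A, B) pairs where the peptides identified in protein A are
    the same as, or a subset of, the peptides identified in protein B.

    protein_peptides: dict mapping protein ID -> set of identified peptides
    """
    pairs = []
    for a, peptides_a in protein_peptides.items():
        for b, peptides_b in protein_peptides.items():
            # "<=" on sets is the subset-or-equal test.
            if a != b and peptides_a <= peptides_b:
                pairs.append((a, b))
    return pairs


# Hypothetical identifications: P2's peptides are a subset of P1's.
identified = {
    "P1": {"PEPTIDEA", "PEPTIDEB", "PEPTIDEC"},
    "P2": {"PEPTIDEA", "PEPTIDEB"},
    "P3": {"PEPTIDED"},
}
mergeable = simple_merge_pairs(identified)
```

Two proteins with identical peptide sets satisfy the rule in both directions, so both (A, B) and (B, A) are returned in that case.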
Input and output
- --verbosity <integer> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.
- --parameter-file <string> – A file containing parameters. See the parameter documentation page for details. Default = <empty>.
- --spectrum-parser pwiz|mstoolkit – Specify the parser to use for reading in MS/MS spectra. The default ProteoWizard parser can read the MS/MS file formats listed here. The alternative is the MSToolkit parser. If the ProteoWizard parser fails to read your files properly, you may want to try the MSToolkit parser instead. Default = pwiz.
- --fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = <empty>.
- --output-dir <string> – The name of the directory where output files will be created. Default = crux-output.
- --overwrite T|F – Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
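As an illustration of how the options above combine, the following Python sketch assembles a spectral-counts command line from the usage pattern `crux spectral-counts [options] <input PSMs>`. It only builds the argument list and does not run Crux. The helper name `build_command` and all file names are hypothetical; the two checks encode the requirements stated above (--protein-database for every measure except RAW, --input-ms2 for SIN).

```python
import shlex


def build_command(psm_file, measure="NSAF", quant_level="protein",
                  protein_database=None, input_ms2=None):
    """Build the argument list for a crux spectral-counts invocation."""
    args = ["crux", "spectral-counts",
            "--measure", measure,
            "--quant-level", quant_level]
    if measure != "RAW":
        # Every measure except RAW needs the sequence database.
        if protein_database is None:
            raise ValueError("--protein-database is required unless measure is RAW")
        args += ["--protein-database", protein_database]
    if measure == "SIN":
        # SIN needs fragment ion intensities from the MS2 file.
        if input_ms2 is None:
            raise ValueError("--input-ms2 is required for the SIN measure")
        args += ["--input-ms2", input_ms2]
    args.append(psm_file)  # the input PSM file comes after the options
    return args


# Hypothetical file names, for illustration only.
cmd = build_command("psms.txt", measure="SIN",
                    protein_database="proteins.fasta", input_ms2="run1.ms2")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` if Crux is installed and on the PATH.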