search-for-xlinks
Usage:
crux search-for-xlinks [options] <ms2 file> <protein fasta file> <link sites> <link mass>
Description:
This command compares a set of spectra to cross-linked peptides derived from a protein database in FASTA format. For each spectrum, the program generates a list of candidate molecules, including linear peptides, dead-end products, self-loop products and cross-linked products, with masses that lie within a specified range of the spectrum's precursor mass. These candidate molecules are ranked using XCorr, and the XCorr scores are assigned statistical confidence estimates using an empirical curve fitting procedure.
The algorithm is described in more detail in the following article:
Sean McIlwain, Paul Draghicescu, Pragya Singh, David R. Goodlett and William Stafford Noble. "Detecting cross-linked peptides by searching against a database of cross-linked peptide pairs." Journal of Proteome Research. 2010.
In search-for-xlinks, properties of the cross-linker are specified using the two required command line arguments, <link sites> and <link mass>. In addition, mass shifts associated with mono-links can be specified using the --mono-link option. Below are suggested parameter settings for some commonly used cross-linkers:
Linker | Link Mass | Link Sites | Mono Link |
EDC | -18.0156 | K:D,E,cterm | |
BS2 | 96.0211296 | K,nterm:K,nterm | --mono-link 1K+114.0316942:T:T,1K+113.0476524:T:T |
BS3 | 138.0680742 | K,nterm:K,nterm | --mono-link 1K+156.0786:T:T,1K+155.0946278:T:T |
DSS | 138.0680796 | K,nterm:K,nterm | --mono-link 1K+156.0786:T:T,1K+155.0946278 |
AMAS | 137.011 | K,nterm:C | --mono-link 1KC+155.02156:T:T |
GMBS | 165.0422 | K,nterm:C | --mono-link 1KC+183.05276:T:T |
formaldehyde | 9.98435 | K,W,nterm:H,N,Y,K,W,R,nterm | --mono-link 1KW+12.0:T:T,1KW+30.010565:T:T |
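As an illustration of how a table row maps onto the usage line, the sketch below assembles a command line for the BS3 row. The file names sample.ms2 and proteins.fasta and the helper build_xlink_command are placeholders invented for this example; only the argument order and the --mono-link option come from this page.

```python
import shlex


def build_xlink_command(ms2_file, fasta_file, link_sites, link_mass, mono_link=None):
    """Assemble the argument list for a crux search-for-xlinks run.

    Options go first, followed by the four required positional arguments
    in the order given on the usage line.
    """
    argv = ["crux", "search-for-xlinks"]
    if mono_link:
        argv += ["--mono-link", mono_link]
    argv += [ms2_file, fasta_file, link_sites, str(link_mass)]
    return argv


# BS3 row from the table above; both file names are hypothetical.
argv = build_xlink_command(
    "sample.ms2",
    "proteins.fasta",
    link_sites="K,nterm:K,nterm",
    link_mass="138.0680742",
    mono_link="1K+156.0786:T:T,1K+155.0946278:T:T",
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run if Crux is installed; printing it first makes it easy to check the argument order.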
- Note that, unlike Tide, search-for-xlinks does not have a "decoy-format" option; instead, shuffled decoy peptides are always created. In particular, the decoy database contains three copies of each target cross-linked species: one in which both peptides are shuffled, one in which only the first peptide is shuffled, and one in which only the second peptide is shuffled. In the tab-delimited output, these different types are indicated in the "target/decoy" column as "target-target," "target-decoy," "decoy-target" or "decoy-decoy."
- In addition to the primary XCorr score, search-for-xlinks reports separate scores for the two cross-linked peptides. How these are calculated depends on whether the "xlink-top-n" parameter is set to 0. When xlink-top-n=0, the XCorr scores of the two participating peptides are computed exactly. When xlink-top-n is non-zero, each peptide's score is calculated by using the remainder of the precursor mass as the mass shift on the link site, and the cross-linked peptide score is the sum of the two peptide scores. In this approach, the true mass shift assigned to each peptide within a cross-linked pair can differ from that peptide's calculated mass shift. The mass shift affects how the ion masses are calculated around the cross-link sites and therefore affects the final XCorr score. In particular, the smaller the precursor tolerance window, the closer the "full, slower" XCorr value calculated directly from the cross-linked peptides will be to the "fast, inaccurate" XCorr value calculated by summing the individual peptide scores using the remainder precursor mass shift. This is because the pairs of peptides chosen will have a "precursor remainder" mass shift closer to the true mass shift calculated directly from the cross-linked peptide.
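The "precursor remainder" idea in the note above can be illustrated with a small arithmetic sketch. It assumes that the mass of a cross-linked pair is the sum of the two peptide masses plus the link mass; the peptide masses and the 0.02 Da precursor error are made-up numbers, and the function names are hypothetical.

```python
def true_shift(partner_mass, link_mass):
    """Mass shift on one peptide's link site implied by the actual pair."""
    return partner_mass + link_mass


def remainder_shift(precursor_mass, peptide_mass):
    """Mass shift used in the fast mode: whatever the precursor mass leaves over."""
    return precursor_mass - peptide_mass


# Made-up peptide masses (Da) and the BS3 link mass from the table.
mass_a, mass_b, link_mass = 1200.60, 950.45, 138.0680742
pair_mass = mass_a + mass_b + link_mass

# Hypothetical observed precursor mass, 0.02 Da away from the pair mass.
observed = pair_mass + 0.02

# The two shifts differ by exactly the precursor mass error, so a tighter
# precursor window bounds the discrepancy more tightly.
error = remainder_shift(observed, mass_a) - true_shift(mass_b, link_mass)
print(f"shift discrepancy: {error:.4f} Da")
```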
Input:
ms2 file
– The name of one or more files from which to parse the fragmentation spectra, in any of the file formats supported by ProteoWizard.
protein fasta file
– The name of the file in FASTA format from which to retrieve proteins.
link sites
– Specification of the two sets of amino acids that the cross-linker can connect. These are specified as two comma-separated sets of amino acids, with the two sets separated by a colon. Cross-links involving the terminus of a protein can be specified by using "nterm" or "cterm". For example, "K,nterm:Q" means that the cross-linker can attach K to Q or the protein N-terminus to Q. Note that the vast majority of cross-linkers will operate on the following reactive groups: amine (K,nterm), carboxyl (D,E,cterm), sulfhydryl (C), acyl (Q) or amine+ (K,S,T,Y,nterm).
link mass
– The mass modification of the linker when attached to a peptide.
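The link sites syntax can be made concrete with a minimal parsing sketch. The helper names parse_link_sites and linkable_pairs are hypothetical and this is not how Crux itself parses the argument; it only shows how the two comma-separated sets expand into linkable site pairs.

```python
def parse_link_sites(spec):
    """Split 'K,nterm:Q' into its two comma-separated site lists.

    Raises ValueError if the specification does not contain exactly one colon.
    """
    first, second = spec.split(":")
    return first.split(","), second.split(",")


def linkable_pairs(spec):
    """Enumerate every (site from first set, site from second set) pair."""
    first, second = parse_link_sites(spec)
    return [(a, b) for a in first for b in second]


# The example from the text: K to Q, or the protein N-terminus to Q.
print(linkable_pairs("K,nterm:Q"))
```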
Output:
The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:
search-for-xlinks.target.txt
– a tab-delimited text file containing the peptide-spectrum matches (PSMs). See the txt file format for a list of the fields.
search-for-xlinks.decoy.txt
– a tab-delimited text file containing the decoy PSMs. See the txt file format for a list of the fields.
search-for-xlinks.qvalues.txt
– a tab-delimited text file containing the top ranked PSMs with calculated q-values. See the txt file format for a list of the fields.
search-for-xlinks.params.txt
– a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
search-for-xlinks.log.txt
– a log file containing a copy of all messages that were printed to stderr.
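Because the PSM files are tab-delimited, they can be summarized with standard tools. The sketch below tallies matches by the "target/decoy" column mentioned in the description. The in-memory sample and its "scan" column are invented for the example, and count_match_types is a hypothetical helper; consult the txt file format for the actual field list.

```python
import csv
import io
from collections import Counter


def count_match_types(handle):
    """Count rows of a tab-delimited PSM file by their 'target/decoy' value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["target/decoy"] for row in reader)


# Tiny made-up stand-in for a search-for-xlinks output file.
sample = io.StringIO(
    "scan\ttarget/decoy\n"
    "1\ttarget-target\n"
    "2\tdecoy-target\n"
    "3\ttarget-target\n"
)
counts = count_match_types(sample)
print(counts)
```

For a real run, the same function could be given open("crux-output/search-for-xlinks.target.txt", newline="") instead of the in-memory sample.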
Options:
- search-for-xlinks options
--require-xlink-candidate T|F
– If there is no cross-link candidate found, then don't bother looking for linear, self-loop, and dead-link candidates. Default = false.
--xlink-top-n <integer>
– Specify the number of open-mod peptides to consider in the second pass. A value of 0 will search all candidates. Default = 250.
- Peptide properties
--min-mass <float>
– The minimum mass (in Da) of peptides to consider. Default = 200.
--max-mass <float>
– The maximum mass (in Da) of peptides to consider. Default = 7200.
--min-length <integer>
– The minimum length of peptides to consider. Default = 6.
--max-length <integer>
– The maximum length of peptides to consider. Default = 50.
--isotopic-mass average|mono
– Specify the type of isotopic masses to use when calculating the peptide mass. Default = mono.
- Amino acid modifications
--mods-spec <string>
– The general form of a modification specification has three components, as exemplified by 1STY+79.966331.
The three components are: [max_per_peptide]residues[+/-]mass_change
In the example, max_per_peptide is 1, residues are STY, and mass_change is +79.966331. To specify a static modification, the number preceding the amino acid must be omitted; i.e., C+57.02146 specifies a static modification of 57.02146 Da to cysteine. Note that Tide allows at most one modification per amino acid. Also, the default modification (C+57.02146) will be added to every mods-spec string unless an explicit C+0 is included. Also note that search-for-xlinks allows two optional Boolean parameters with each modification, indicating whether the modification will (1) prevent enzymatic cleavage at its site, and (2) prevent cross-linking. These are specified like "K+156.0786:T:T". By default, both of these Booleans are set to false. Default = C+57.02146.
--cmod <string>
– Specify a variable modification to apply to C-terminus of peptides. <mass change>:<max distance from protein c-term (-1 for no max)>. Note that this parameter only takes effect when specified in the parameter file. Default = NO MODS.
--nmod <string>
– Specify a variable modification to apply to N-terminus of peptides. <mass change>:<max distance from protein n-term (-1 for no max)>. Note that this parameter only takes effect when specified in the parameter file. Default = NO MODS.
--max-mods <integer>
– The maximum number of modifications that can be applied to a single peptide. Default = 255.
--mod-precision <integer>
– Set the precision for modifications as written to .txt files. Default = 2.
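To make the mods-spec grammar concrete, here is a minimal parsing sketch. It assumes that multiple specifications are comma-separated (as in the mono-link examples in the table near the top of this page) and that the two optional Booleans appear as ":T" or ":F" suffixes. The Mod class and parse_mods_spec function are hypothetical names, not part of Crux.

```python
import re
from dataclasses import dataclass
from typing import Optional

# [max_per_peptide]residues[+/-]mass_change, optionally followed by :T|F:T|F
_MOD = re.compile(r"^(\d*)([A-Z]+)([+-]\d+(?:\.\d+)?)(?::([TF]):([TF]))?$")


@dataclass
class Mod:
    max_per_peptide: Optional[int]  # None marks a static modification
    residues: str
    mass_change: float
    prevents_cleavage: bool = False  # first optional Boolean
    prevents_xlink: bool = False     # second optional Boolean


def parse_mods_spec(spec):
    """Parse a comma-separated mods-spec string into Mod records."""
    mods = []
    for item in spec.split(","):
        match = _MOD.match(item.strip())
        if match is None:
            raise ValueError(f"unrecognized modification: {item!r}")
        count, residues, mass, no_cleave, no_xlink = match.groups()
        mods.append(
            Mod(
                max_per_peptide=int(count) if count else None,
                residues=residues,
                mass_change=float(mass),
                prevents_cleavage=no_cleave == "T",
                prevents_xlink=no_xlink == "T",
            )
        )
    return mods


for mod in parse_mods_spec("1STY+79.966331,C+57.02146,K+156.0786:T:T"):
    print(mod)
```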
- Decoy database generation
--seed <string>
– When given an unsigned integer value, seeds the random number generator with that value. When given the string "time", seeds the random number generator with the system time. Default = 1.
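The interaction between the seed and the three decoy copies described in the notes above can be sketched as follows. This is only an illustration under stated assumptions: it shuffles every residue uniformly, whereas the actual Crux shuffling procedure is not specified on this page, and the mapping of labels to copies ("decoy-target" for a shuffled first peptide, and so on) is a guess at the natural reading. All function names are hypothetical.

```python
import random


def shuffle_peptide(peptide, rng):
    """Return a random permutation of the peptide's residues."""
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)


def decoy_copies(first, second, seed=1):
    """Build three decoy versions of one cross-linked pair.

    A fixed integer seed makes the shuffles reproducible between runs.
    """
    rng = random.Random(seed)
    return {
        "decoy-decoy": (shuffle_peptide(first, rng), shuffle_peptide(second, rng)),
        "decoy-target": (shuffle_peptide(first, rng), second),
        "target-decoy": (first, shuffle_peptide(second, rng)),
    }


for label, pair in decoy_copies("PEPTIDEK", "LINKERR").items():
    print(label, pair)
```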
- Enzymatic digestion
--enzyme no-enzyme|trypsin|trypsin/p|chymotrypsin|elastase|clostripain|cyanogen-bromide|iodosobenzoate|proline-endopeptidase|staph-protease|asp-n|lys-c|lys-n|arg-c|glu-c|pepsin-a|elastase-trypsin-chymotrypsin|lysarginase|custom-enzyme
– Specify the enzyme used to digest the proteins in silico. Available enzymes (with the corresponding digestion rules indicated in parentheses) include no-enzyme ([X]|[X]), trypsin ([RK]|{P}), trypsin/p ([RK]|[]), chymotrypsin ([FWYL]|{P}), elastase ([ALIV]|{P}), clostripain ([R]|[]), cyanogen-bromide ([M]|[]), iodosobenzoate ([W]|[]), proline-endopeptidase ([P]|[]), staph-protease ([E]|[]), asp-n ([]|[D]), lys-c ([K]|{P}), lys-n ([]|[K]), arg-c ([R]|{P}), glu-c ([DE]|{P}), pepsin-a ([FL]|{P}), elastase-trypsin-chymotrypsin ([ALIVKRWFY]|{P}), lysarginase ([]|[KR]). Specifying --enzyme no-enzyme yields a non-enzymatic digest. Warning: the resulting index may be quite large. Default = trypsin.
--custom-enzyme <string>
– Specify rules for in silico digestion of protein sequences. Overrides the enzyme option. Two lists of residues are given enclosed in square brackets or curly braces and separated by a |. The first list contains residues required/prohibited before the cleavage site and the second list is residues after the cleavage site. If the residues are required for digestion, they are in square brackets, '[' and ']'. If the residues prevent digestion, then they are enclosed in curly braces, '{' and '}'. Use X to indicate all residues. For example, trypsin cuts after R or K but not before P which is represented as [RK]|{P}. AspN cuts after any residue but only before D which is represented as [X]|[D]. To prevent the sequences from being digested at all, use {X}|{X}. Default = <empty>.
--digestion full-digest|partial-digest|non-specific-digest
– Specify whether every peptide in the database must have two enzymatic termini (full-digest) or if peptides with only one enzymatic terminus are also included (partial-digest). Default = full-digest.
--missed-cleavages <integer>
– Maximum number of missed cleavages per peptide to allow in enzymatic digestion. Default = 0.
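The digestion rule syntax and the missed-cleavages option can be illustrated with a small full-digest sketch. It treats an empty required list such as [] as "no constraint", which is an assumption based on the built-in rules listed above, and it ignores the length and mass filters. The functions parse_rule and digest are hypothetical and are not Crux's implementation.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_rule(rule):
    """Turn a rule such as '[RK]|{P}' into (before_ok, after_ok) predicates."""

    def side(text):
        m = re.fullmatch(r"([\[{])([A-Z]*)([\]}])", text)
        if m is None or (m.group(1) == "[") != (m.group(3) == "]"):
            raise ValueError(f"bad enzyme rule part: {text!r}")
        residues = AMINO_ACIDS if "X" in m.group(2) else set(m.group(2))
        if m.group(1) == "[":
            # Assumption: an empty required list places no constraint.
            if not m.group(2):
                return lambda aa: True
            return lambda aa: aa in residues
        # Curly braces: these residues prevent cleavage.
        return lambda aa: aa not in residues

    before, after = rule.split("|")
    return side(before), side(after)


def digest(sequence, rule, missed_cleavages=0):
    """Fully digest a sequence, allowing up to missed_cleavages internal sites."""
    before_ok, after_ok = parse_rule(rule)
    cuts = [0]
    for i in range(len(sequence) - 1):
        if before_ok(sequence[i]) and after_ok(sequence[i + 1]):
            cuts.append(i + 1)
    cuts.append(len(sequence))
    peptides = []
    for start in range(len(cuts) - 1):
        last = min(start + 2 + missed_cleavages, len(cuts))
        for end in range(start + 1, last):
            peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides


# Trypsin rule: cut after K or R unless the next residue is P.
print(digest("AKRPGKLM", "[RK]|{P}"))
print(digest("AKRPGKLM", "[RK]|{P}", missed_cleavages=1))
```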
- Precursor selection
--precursor-window <float>
– Tolerance used for matching peptides to spectra. Peptides must be within +/- 'precursor-window' of the spectrum value. The precursor window units depend upon precursor-window-type. Default = 50.
--precursor-window-type mass|mz|ppm
– Specify the units for the window that is used to select peptides around the precursor mass location (mass, mz, ppm). The magnitude of the window is defined by the precursor-window option, and candidate peptides must fall within this window. For the mass window-type, the spectrum precursor m+h value is converted to mass, and the window is defined as that mass +/- precursor-window. If the m+h value is not available, then the mass is calculated from the precursor m/z and provided charge. The peptide mass is computed as the sum of the average amino acid masses plus 18 Da for the terminal OH group. The mz window-type calculates the window as spectrum precursor m/z +/- precursor-window and then converts the resulting m/z range to the peptide mass range using the precursor charge. For the parts-per-million (ppm) window-type, the spectrum mass is calculated as in the mass type. The lower bound of the mass window is then defined as the spectrum mass / (1.0 + (precursor-window / 1000000)) and the upper bound is defined as spectrum mass / (1.0 - (precursor-window / 1000000)). Default = ppm.
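The three window types can be written out as a short sketch. The mass and ppm formulas follow the description above directly. The m/z-to-mass conversion, (m/z - proton mass) * charge, is an assumption, since the text only says the range is converted "using the precursor charge". The function names are hypothetical.

```python
PROTON_MASS = 1.00727646688  # Da; used only by the assumed m/z conversion


def mass_window(spectrum_mass, window):
    """mass type: spectrum mass +/- window, in Da."""
    return spectrum_mass - window, spectrum_mass + window


def ppm_window(spectrum_mass, window):
    """ppm type: bounds as defined in the option description."""
    lower = spectrum_mass / (1.0 + window / 1_000_000)
    upper = spectrum_mass / (1.0 - window / 1_000_000)
    return lower, upper


def mz_window(precursor_mz, charge, window):
    """mz type: window on m/z, then converted to a mass range.

    Assumption: neutral mass = (m/z - proton mass) * charge.
    """
    def to_mass(mz):
        return (mz - PROTON_MASS) * charge

    return to_mass(precursor_mz - window), to_mass(precursor_mz + window)


# Default settings: a 50 ppm window around a 2000 Da precursor.
low, high = ppm_window(2000.0, 50)
print(f"{low:.4f} .. {high:.4f}")
```

At 2000 Da, a 50 ppm window is roughly 0.1 Da on each side, which gives a feel for how much tighter the default is than a 1 Da mass window.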
- Search parameters
--spectrum-min-mz <float>
– The lowest spectrum m/z to search in the ms2 file. Default = 0.
--spectrum-max-mz <float>
– The highest spectrum m/z to search in the ms2 file. Default = 1e+09.
--spectrum-charge 1|2|3|all
– The spectrum charges to search. With 'all' every spectrum will be searched and spectra with multiple charge states will be searched once at each charge state. With 1, 2, or 3 only spectra with that charge state will be searched. Default = all.
--compute-sp T|F
– Compute the preliminary score Sp for all candidate peptides. Report this score in the output, along with the corresponding rank, the number of matched ions and the total number of ions. This option is recommended if results are to be analyzed by Percolator or Barista. If sqt-output is enabled, then compute-sp is automatically enabled and cannot be overridden. Note that the Sp computation requires re-processing each observed spectrum, so turning on this switch involves significant computational overhead. Default = false.
--precursor-window-weibull <float>
– Search decoy peptides within +/- precursor-window-weibull of the precursor mass. The resulting scores are used only for fitting the Weibull distribution. Default = 20.
--precursor-window-type-weibull mass|mz|ppm
– Window type to use in conjunction with the precursor-window-weibull parameter. Default = mass.
--min-weibull-points <integer>
– Keep shuffling and collecting XCorr scores until the minimum number of points for Weibull fitting (using targets and decoys) is achieved. Default = 4000.
--max-ion-charge <string>
– Predict theoretical ions up to max charge state (1, 2, ... ,6) or up to the charge state of the peptide ("peptide"). If the max-ion-charge is greater than the charge state of the peptide, then the maximum is the peptide charge. Default = peptide.
--scan-number <string>
– A single scan number or a range of numbers to be searched. Range should be specified as 'first-last' which will include scans 'first' and 'last'. Default = <empty>.
--mz-bin-width <float>
– Before calculation of the XCorr score, the m/z axes of the observed and theoretical spectra are discretized. This parameter specifies the size of each bin. The exact formula for computing the discretized m/z value is floor((x/mz-bin-width) + 1.0 - mz-bin-offset), where x is the observed m/z value. For low resolution ion trap ms/ms data 1.0005079 and for high resolution ms/ms 0.02 is recommended. Default = 0.02.
--mz-bin-offset <float>
– In the discretization of the m/z axes of the observed and theoretical spectra, this parameter specifies the location of the left edge of the first bin, relative to mass = 0 (i.e., mz-bin-offset = 0.xx means the left edge of the first bin will be located at +0.xx Da). Default = 0.4.
--mod-mass-format mod-only|total|separate
– Specify how sequence modifications are reported in various output files. Each modification is reported as a number enclosed in square brackets following the modified residue; however, the number may correspond to one of three different masses: (1) 'mod-only' reports the value of the mass shift induced by the modification; (2) 'total' reports the mass of the residue with the modification (residue mass plus modification mass); (3) 'separate' is the same as 'mod-only', but multiple modifications to a single amino acid are reported as a comma-separated list of values. For example, suppose amino acid D has an unmodified mass of 115 as well as two modifications of masses +14 and +2. In this case, the amino acid would be reported as D[16] with 'mod-only', D[131] with 'total', and D[14,2] with 'separate'. Default = mod-only.
--use-flanking-peaks T|F
– Include flanking peaks around singly charged b and y theoretical ions. Each flanking peak occurs in the adjacent m/z bin and has half the intensity of the primary peak. Default = false.
--fragment-mass average|mono
– Specify which isotopes to use in calculating fragment ion mass. Default = mono.
--isotope-windows <string>
– Provides a list of isotopic windows to search. For example, -1,0,1 will search in three disjoint windows: (1) precursor_mass - neutron_mass +/- window, (2) precursor_mass +/- window, and (3) precursor_mass + neutron_mass +/- window. The window size is defined from the precursor-window and precursor-window-type parameters. This option is only available when use-old-xlink=F. Default = 0.
--compute-p-values T|F
– Estimate the parameters of the score distribution for each spectrum by fitting to a Weibull distribution, and compute a p-value for each xlink product. This option is only available when use-old-xlink=F. Default = false.
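The discretization formula given for mz-bin-width and mz-bin-offset is short enough to write out directly. The function name mz_to_bin is hypothetical; the formula and the default values are taken from the option descriptions above.

```python
import math


def mz_to_bin(mz, bin_width=0.02, bin_offset=0.4):
    """Discretize an m/z value: floor((x / mz-bin-width) + 1.0 - mz-bin-offset)."""
    return math.floor(mz / bin_width + 1.0 - bin_offset)


# Low-resolution setting recommended for ion trap data.
for mz in (500.0, 500.3, 500.8):
    print(mz, mz_to_bin(mz, bin_width=1.0005079))

# High-resolution default: bins are only 0.02 wide.
print(100.005, mz_to_bin(100.005))
```

With the low-resolution width, 500.0 and 500.3 land in the same bin while 500.8 moves to the next one, which shows how the offset positions the bin edges between the integer m/z values where peptide fragment peaks cluster.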
- Fragment ion parameters
--use-a-ions T|F
– Consider a-ions in the search? Note that an a-ion is equivalent to a neutral loss of CO from the b-ion. Peak height is 10 (in arbitrary units). Default = false.
--use-b-ions T|F
– Consider b-ions in the search? Peak height is 50 (in arbitrary units). Default = true.
--use-c-ions T|F
– Consider c-ions in the search? Peak height is 50 (in arbitrary units). Default = false.
--use-x-ions T|F
– Consider x-ions in the search? Peak height is 10 (in arbitrary units). Default = false.
--use-y-ions T|F
– Consider y-ions in the search? Peak height is 50 (in arbitrary units). Default = true.
--use-z-ions T|F
– Consider z-ions in the search? Peak height is 50 (in arbitrary units). Default = false.
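As a rough illustration of what the ion-type switches control, the sketch below lists singly charged a-, b- and y-ion peaks for an unmodified linear peptide, using the peak heights quoted above. It uses standard monoisotopic residue masses for a handful of amino acids and the usual textbook ion formulas; it does not model cross-links, modifications, or higher charge states, and the function name is hypothetical.

```python
# Monoisotopic residue masses (Da) for a small subset of amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.007276
WATER = 18.010565
CO = 27.994915
PEAK_HEIGHT = {"a": 10, "b": 50, "y": 50}  # arbitrary units, from the options


def singly_charged_ions(peptide, use_a=False, use_b=True, use_y=True):
    """Return (label, m/z, height) for each enabled singly charged ion."""
    masses = [MONO[aa] for aa in peptide]
    n = len(peptide)
    peaks = []
    for i in range(1, n):
        b_mz = sum(masses[:i]) + PROTON
        if use_b:
            peaks.append((f"b{i}", b_mz, PEAK_HEIGHT["b"]))
        if use_a:
            # An a-ion is the b-ion after a neutral loss of CO.
            peaks.append((f"a{i}", b_mz - CO, PEAK_HEIGHT["a"]))
        if use_y:
            y_mz = sum(masses[i:]) + WATER + PROTON
            peaks.append((f"y{n - i}", y_mz, PEAK_HEIGHT["y"]))
    return peaks


for label, mz, height in singly_charged_ions("GASK", use_a=True):
    print(f"{label}\t{mz:.4f}\t{height}")
```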
- Cross-linking parameters
--use-old-xlink T|F
– Use the old version of xlink-searching algorithm. When false, a new version of the code is run. The new version supports variable modifications and can handle more complex databases. This new code is still in development and should be considered a beta release. Default = true.
--xlink-include-linears T|F
– Include linear peptides in the search. Default = true.
--xlink-include-deadends T|F
– Include dead-end peptides in the search. Default = true.
--xlink-include-selfloops T|F
– Include self-loop peptides in the search. Default = true.
--xlink-include-inter T|F
– Include inter-protein cross-link candidates within the search. Default = true.
--xlink-include-intra T|F
– Include intra-protein cross-link candidates within the search. Default = true.
--xlink-include-inter-intra T|F
– Include cross-link candidates that are both inter and intra. Default = true.
--xlink-prevents-cleavage <string>
– List of amino acids for which the cross-linker can prevent cleavage. This option is only available when use-old-xlink=F. Default = K.
--max-xlink-mods <integer>
– Specify the maximum number of modifications allowed on a cross-linked peptide. This option is only available when use-old-xlink=F. Default = 255.
--mono-link <string>
– Provides a list of amino acids and their mass modifications to consider as candidates for mono-/dead-links. Format is the same as mods-spec. Default = <empty>.
- Input and output
--file-column T|F
– Include the file column in tab-delimited output. Default = true.
--concat T|F
– When set to T, target and decoy search results are reported in a single file, and only the top-scoring N matches (as specified via --top-match) are reported for each spectrum, irrespective of whether the matches involve target or decoy peptides. Note that when used with search-for-xlinks, this parameter only has an effect if use-old-xlink=F. Default = false.
--spectrum-parser pwiz|mstoolkit
– Specify the parser to use for reading in MS/MS spectra. The default ProteoWizard parser can read any of the MS/MS file formats supported by ProteoWizard. The alternative is the MSToolkit parser. If the ProteoWizard parser fails to read your files properly, you may want to try the MSToolkit parser instead. Default = pwiz.
--use-z-line T|F
– Specify whether, when parsing an MS2 spectrum file, Crux obtains the precursor mass information from the "S" line or the "Z" line. Default = true.
--top-match <integer>
– Specify the number of matches to report for each spectrum. Default = 5.
--print-search-progress <integer>
– Show search progress by printing every n spectra searched. Set to 0 to show no search progress. Default = 1000.
--output-dir <string>
– The name of the directory where output files will be created. Default = crux-output.
--overwrite T|F
– Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
--parameter-file <string>
– A file containing parameters. See the parameter documentation page for details. Default = <empty>.
--verbosity <integer>
– Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.