# kojak

## Usage:

crux kojak [options] <input spectra> <protein database>

## Description:

Kojak is a database search tool for the identification of cross-linked peptides from mass spectra. This search engine was developed and is maintained by Michael Hoopmann at the Institute for Systems Biology. Additional information about Kojak can be found at kojak-ms.org.

If you use Kojak, please cite:

"Kojak: Efficient analysis of chemically cross-linked protein complexes." Hoopmann MR, Zelter A, Johnson RS, Riffle M, MacCoss MJ, Davis TN, Moritz RL. J Proteome Res 2015 May 1. doi: 10.1021/pr501321h

## Input:

• input spectra – The name of one or more files from which to parse the spectra. Valid formats include mzXML, mzML, mz5, raw, ms2, and cms2. Files in mzML or mzXML format may be compressed with gzip. RAW files can be parsed only under Windows, and only if the appropriate libraries were included at compile time. Multiple files can be given on the command line (space-delimited), before the name of the database.
• protein database – The name of the fasta file containing the amino acid protein sequences to be searched. Kojak can generate decoy sequences internally, or they may be in this file (see the decoy_filter option for details). It is recommended to include the full path in the name of the file.
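The argument order described above (options, then one or more spectrum files, then the database last) can be sketched as a small helper that assembles the argument list. This is only an illustration: the function name `build_kojak_command` and the file names are hypothetical, and the option names are taken from the Options section of this page.

```python
def build_kojak_command(spectra, database, options=None):
    """Assemble an argument list for `crux kojak` (illustrative helper)."""
    if not spectra:
        raise ValueError("at least one spectrum file is required")
    argv = ["crux", "kojak"]
    # Options come first, each as "--name value".
    for name, value in (options or {}).items():
        argv += [f"--{name}", str(value)]
    # Spectrum files precede the database, which is always the last argument.
    argv += list(spectra)
    argv.append(database)
    return argv


# Hypothetical file names, used only for illustration.
cmd = build_kojak_command(
    ["run1.mzML", "run2.mzML"],
    "/data/yeast.fasta",
    {"output-dir": "kojak-out", "cross_link": "nK nK 138.068074 BS3"},
)
print(cmd)
```

Passing the list to `subprocess.run(cmd)` rather than joining it into one string avoids shell-quoting problems for values that contain spaces, such as the cross-link definition.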

## Output:

The program writes files to the folder crux-output by default. The name of the output folder can be set by the user using the --output-dir option. The following files will be created:

• *.kojak.txt – a tab-delimited text file containing the PSMs. See the txt file format for a list of the fields. The "*" indicates the root of the input spectrum file(s).
• *.log – a log file containing specific information pertaining to the analysis of the corresponding spectrum file.
• kojak.params.txt – a file containing the name and value of all parameters/options for the current operation. Not all parameters in the file may have been used in the operation. The resulting file can be used with the --parameter-file option for other crux programs.
• kojak.log.txt – a log file containing a copy of all messages that were printed to standard error. These are mainly crux-specific messages.

## Options:

• ### Amino acid modifications

• --fixed_modification <string> – Specifies a mass adjustment to be applied to all indicated amino acids prior to spectral analysis. Amino acids are identified by their single letter designation. N-terminal and C-terminal fixed modifications are designated by n and c, respectively. The relative mass difference, positive or negative, is listed after the amino acid, separated by a space. If multiple fixed modification masses are desired, provide them as a comma-separated list enclosed in quotation marks. For example: "C 57.02146, nK 42.01057". Default = C 57.02146.
• --fixed_modification_protC <float> – Specifies a mass adjustment to be applied to all protein C-termini prior to spectral analysis. The relative mass difference may be any non-zero number. Default = 0.
• --fixed_modification_protN <float> – Specifies a mass adjustment to be applied to all protein N-termini prior to spectral analysis. The relative mass difference may be any non-zero number. Default = 0.
• --modification <string> – Specifies a dynamic mass adjustment to be applied to all indicated amino acids during spectral analysis. Peptides containing the indicated amino acids are tested with and without the dynamic modification mass. Amino acids are identified by their single letter designation. N-terminal and C-terminal dynamic peptide modifications are designated by n and c, respectively. The relative mass difference, positive or negative, is listed after the amino acid, separated by a space. If multiple dynamic modification masses are desired, including to the same amino acid, provide them as a comma-separated list enclosed with quotation marks. For example: "M 15.9949, STY 79.966331". Default = M 15.9949.
• --modification_protC <string> – Specifies a dynamic mass adjustment to be applied to protein C-terminal amino acids during spectral analysis. Peptides containing the protein C-terminus are tested with and without the dynamic modification mass. The relative mass difference can be any non-zero value. If multiple dynamic protein C-terminal modification masses are desired, provide them as a comma-separated list enclosed in quotation marks. For example, "56.037448, -58.005479". Default = 0.
• --modification_protN <string> – Specifies a dynamic mass adjustment to be applied to protein N-terminal amino acids during spectral analysis. Peptides containing the protein N-terminus are tested with and without the dynamic modification mass. The N-terminus includes both the leading and 2nd amino acid, in case of removal of the leading amino acid. The relative mass difference can be any non-zero value. If multiple dynamic protein N-terminal modification masses are desired, provide them as a comma-separated list enclosed in quotation marks. For example, "42.01055, 0.984016". Default = 0.
• --max_mods_per_peptide <integer> – Indicates the maximum number of differential mass modifications allowed for a peptide sequence. Default = 2.
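To see why the cap set by max_mods_per_peptide matters, the following sketch counts the modified forms of a single peptide for one dynamic modification. It assumes that each eligible residue carries at most one modification; Kojak's own enumeration may differ, and the function name is hypothetical.

```python
from itertools import combinations


def modification_site_sets(peptide, residues="M", max_mods=2):
    """List every set of modified positions, up to max_mods sites per peptide."""
    sites = [i for i, aa in enumerate(peptide) if aa in residues]
    site_sets = []
    for k in range(0, min(max_mods, len(sites)) + 1):
        site_sets.extend(combinations(sites, k))
    return site_sets


# Three methionines: 1 unmodified form + 3 singly + 3 doubly modified forms.
forms = modification_site_sets("MAMSMK", residues="M", max_mods=2)
print(len(forms))
```

Without the cap, a peptide with n eligible residues would have 2^n forms under this assumption, which is why every added dynamic modification increases search time sharply.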
• ### Enzymatic digestion

• --max_miscleavages <integer> – Number of missed enzyme cleavages allowed. If your digestion enzyme cuts at the same amino acids involved in cross-linking, then this number must be greater than 0 to identify linked peptides. In such cases, a minimum value of 2 is required to identify loop-links. Default = 0.
• --kojak_enzyme <string> – An enzyme string code is used to define amino acid cut sites when parsing protein sequences. Following the code, a separate label can be used to name the enzyme used. The rules for peptide parsing are similar to other database search engines such as X!Tandem: 1) cleavage amino acids are specified in square brackets: [], 2) a vertical line, |, indicates N- or C-terminal to the residue, 3) exception amino acids are specified in curly braces: {}. Default = [KR] Trypsin.
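The interaction between the cleavage rule and max_miscleavages can be sketched as a simple in-silico digest. The sketch assumes cleavage C-terminal to K or R (as trypsin does) with no exception residues; it does not parse the full enzyme string syntax, and the function name is hypothetical.

```python
def digest(protein, cut_after="KR", max_miscleavages=0):
    """Return peptides produced by cutting after each residue in cut_after."""
    # End positions of fully cleaved peptides.
    ends = [i + 1 for i, aa in enumerate(protein) if aa in cut_after]
    if not ends or ends[-1] != len(protein):
        ends.append(len(protein))
    starts = [0] + ends[:-1]
    peptides = []
    for i, start in enumerate(starts):
        # Extend each peptide across up to max_miscleavages uncut sites.
        last = min(i + max_miscleavages, len(ends) - 1)
        for j in range(i, last + 1):
            peptides.append(protein[start:ends[j]])
    return peptides


print(digest("MAKDERGH", max_miscleavages=0))
print(digest("MAKDERGH", max_miscleavages=1))
```

With zero missed cleavages, no peptide contains an internal K. If the cross-linker also reacts with K, a linked lysine is an uncut site inside the peptide, so such peptides appear only when max_miscleavages is at least 1.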
• ### Search parameters

• --top_count <integer> – This parameter specifies the number of top scoring peptides to store in the first pass of the Kojak analysis. A second pass follows, pairing cross-linked peptides to these top sequences to produce the final cross-linked peptide score. Setting this number too low will cause cross-linked sequences to be missed. Setting this number too high will degrade the performance of the algorithm. Optimal settings will depend on database size and the number of modifications in the search. Recommended values are between 5 and 50 (20 is probably a good start). Default = 20.
• --e_value_depth <integer> – Specifies the minimum number of tests to be present in the histogram for e-value calculations. A larger number better resolves the histogram and improves the e-value estimation for the peptide sequences in each spectrum. However, larger numbers also increase computation time. The recommended values are between 2000 and 10000. Default = 5000.

• --cross_link <string> – Specifies the sites of cross-linking and mass modification. Four values specify a cross-link. The first two values are one or more amino acid letters (uppercase only) that can be linked. These can be the same or different depending on whether the cross-linker is homobifunctional or heterobifunctional. Use lowercase ‘n’ or ‘c’ if the linker can bind the protein termini. The third value is the net mass value of the cross-linker when bound to the peptides. The mass can be any real number, positive or negative. The fourth value is an identifier, which can be any name desired for the cross-linker. If the data contain multiple cross-linkers, provide them as a comma-separated list enclosed in quotation marks. For example, a sample cross-linked with both BS3 and EDC could be specified as "nK nK 138.068074 BS3, DE nK -18.0106 EDC". Default = nK nK 138.068074 BS3.
• --mono_link <string> – Specifies the sites of incomplete cross-linking (i.e. a mono-link) and mass modification. Two values follow this parameter, separated by spaces. The first value is one or more amino acid letters (uppercase only) that can be linked. Use lowercase ‘n’ or ‘c’ if the linker can bind the protein termini. The second value is the net mass of the incomplete cross-link reaction. The mass can be any real number, positive or negative. If multiple mono-links are possible (e.g. with a heterobifunctional cross-linker), provide them as a comma-separated list enclosed in quotation marks. For example: "nK 156.0786, nK 155.0946". Default = nK 156.0786.
• --diff_mods_on_xl T|F – Searching for differential modifications increases search times exponentially. This increase in computation can be exacerbated when searching for differential modifications on cross-linked peptides. Such computation can be avoided if it is known that the cross-linked peptides should not have differential modifications. In these cases, this setting can be turned off. Default = false.
• --mono_links_on_xl T|F – When multiple sites of linkage are available on a peptide, it is possible for that peptide to be linked to a second peptide at one site and contain a mono-link at another site. If such instances are considered rare due to the experimental conditions, then this parameter can be disabled to improve computation time. Default = false.
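The four-value layout of the cross_link option can be made concrete with a small parser. This is a sketch for checking a value before starting a search, not Kojak's own parsing code; the `CrossLinker` type and the function name are hypothetical.

```python
from typing import NamedTuple


class CrossLinker(NamedTuple):
    sites_a: str   # residues (or n/c termini) on the first peptide
    sites_b: str   # residues (or n/c termini) on the second peptide
    mass: float    # net mass of the bound cross-linker
    name: str      # free-form identifier


def parse_cross_link(value):
    """Split a comma-separated cross_link value into CrossLinker records."""
    linkers = []
    for entry in value.split(","):
        fields = entry.split()
        if len(fields) != 4:
            raise ValueError(f"expected four values, got: {entry!r}")
        sites_a, sites_b, mass, name = fields
        linkers.append(CrossLinker(sites_a, sites_b, float(mass), name))
    return linkers


linkers = parse_cross_link("nK nK 138.068074 BS3, DE nK -18.0106 EDC")
for linker in linkers:
    print(linker)
```

In the example, BS3 is homobifunctional (both site lists are `nK`), while EDC is heterobifunctional (`DE` on one side, `nK` on the other) and has a negative net mass.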
• ### Database

• --decoy_filter <string> – This parameter requires two values. The first value is a short, case-sensitive string of characters that appears in the name of every decoy protein sequence in the database. The second value is either 0 or 1, where 0 indicates that these decoy sequences are already provided in the FASTA database supplied by the user, and 1 indicates Kojak should automatically generate the decoy sequences and preface the protein names with the characters supplied in the first value. If Kojak is requested to generate decoy sequences, it will save the full complement of target and decoy sequences as a fasta file in the output directory at the end of analysis. Kojak generates decoy sequences by reversing the amino acids between enzymatic cleavage sites in the protein sequence. The sites of enzymatic cleavage are determined by the rules supplied with the kojak_enzyme parameter. The leading methionine in the sequences, however, remains fixed. This approach ensures that, with very few exceptions, the number, length, and mass of the decoy peptides are identical to the target peptides. Default = DECOY_ 1.
• --max_peptide_mass <float> – Maximum peptide mass allowed when parsing the protein sequence database. Peptides exceeding this mass will be ignored in the analysis. Default = 8000.
• --min_peptide_mass <float> – Minimum peptide mass allowed when parsing the protein sequence database. Peptides with a lower mass will be ignored in the analysis. Default = 500.
• --truncate_prot_names <integer> – Exports only the specified number of characters for each protein name in the Kojak output. Otherwise, if set to 0, all characters in the protein name are exported. Default = 0.

• --threads <integer> – Number of threads to use when searching spectra. A value of 0 will automatically match the number of threads to the number of processing cores on the computer. Additionally, negative numbers can be used to specify threads equal to all but that number of cores. Default = 0.
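The decoy generation described for decoy_filter can be sketched as follows. The sketch assumes cleavage after K or R and assumes that the cleavage residues themselves stay in place while the residues between them are reversed; this is one reading of the description above, not a copy of Kojak's implementation, and the function name is hypothetical.

```python
def make_decoy(protein, cut_after="KR"):
    """Reverse residues between cleavage sites; keep a leading M fixed."""
    prefix, body = "", protein
    if protein.startswith("M"):
        prefix, body = "M", protein[1:]
    pieces = []
    segment = []
    for aa in body:
        if aa in cut_after:
            # Emit the reversed segment, then the cleavage residue in place.
            pieces.append("".join(reversed(segment)) + aa)
            segment = []
        else:
            segment.append(aa)
    pieces.append("".join(reversed(segment)))
    return prefix + "".join(pieces)


target = "MASTKDEFRGH"
decoy = make_decoy(target)
print(target, "->", decoy)
```

Because the cleavage sites do not move, each decoy peptide has the same length and amino acid composition (and therefore the same mass) as its target counterpart.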
• ### Masses

• --ppm_tolerance_pre <float> – Tolerance used when determining which peptides to search for a given MS/MS spectrum based on its precursor ion mass. The unit is parts-per-million (PPM). Default = 10.
• --auto_ppm_tolerance_pre false|warn|fail – Automatically estimate optimal value for the ppm_tolerance_pre parameter from the spectra themselves. false=no estimation, warn=try to estimate but use the default value in case of failure, fail=try to estimate and quit in case of failure. Default = false.
• --kojak_isotope_error <integer> – Allows the searching of neighboring isotope peak masses for poorly resolved precursors. Up to three alternative isotope peak masses will be searched in addition to the presumed precursor peak mass to correct for errors in monoisotopic precursor peak identification. Default = 1.
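How a ppm tolerance and an isotope error setting translate into searched mass ranges can be sketched as below. The sketch assumes that each alternative isotope peak lies one 13C-12C mass difference (about 1.00335 Da) below the previous candidate; the exact offsets Kojak uses are not stated on this page, and the function name is hypothetical.

```python
C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C, in Da


def precursor_windows(observed_mass, ppm_tolerance=10.0, isotope_error=1):
    """Return (low, high) mass windows for each candidate monoisotopic mass."""
    windows = []
    for k in range(isotope_error + 1):
        # k = 0 is the presumed monoisotopic mass; k > 0 assumes the
        # instrument picked the k-th isotope peak instead.
        mass = observed_mass - k * C13_C12_DELTA
        half_width = mass * ppm_tolerance / 1e6
        windows.append((mass - half_width, mass + half_width))
    return windows


for low, high in precursor_windows(2000.0, ppm_tolerance=10.0, isotope_error=1):
    print(f"{low:.4f} .. {high:.4f}")
```

At 10 ppm, a 2000 Da precursor is matched against peptides within 0.02 Da on either side, so the window scales with mass rather than being a fixed width.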
• ### Fragment ions

• --fragment_bin_offset <float> – Offset position to start the binning (0.0 to 1.0). Default = 0.4.
• --fragment_bin_size <float> – Determines the accuracy of the scoring algorithm, with smaller bins being more strict in determining matches between theoretical and observed spectral peaks. Low-resolution spectra require larger bin sizes to accommodate errors in mass accuracy of the observed peaks. Smaller bins also require more system memory, so caution must be exercised when setting this value for high-resolution spectra. For ion trap (low-res) MS/MS spectra, the recommended value is 1.0005. For high-res MS/MS, the recommended value is 0.03. Default = 0.03.
• --auto_fragment_bin_size false|warn|fail – Automatically estimate optimal value for the fragment_bin_size parameter from the spectra themselves. false=no estimation, warn=try to estimate but use the default value in case of failure, fail=try to estimate and quit in case of failure. Default = false.
• --ion_series_A T|F – Should A-series fragment ions be considered? Default = false.
• --ion_series_B T|F – Should B-series fragment ions be considered? Default = true.
• --ion_series_C T|F – Should C-series fragment ions be considered? Default = false.
• --ion_series_X T|F – Should X-series fragment ions be considered? Default = false.
• --ion_series_Y T|F – Should Y-series fragment ions be considered? Default = true.
• --ion_series_Z T|F – Should Z-series fragment ions be considered? Default = false.
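The effect of fragment_bin_size and fragment_bin_offset can be illustrated with one common binning convention, in which a peak's bin index is `int(mz / bin_size + (1 - bin_offset))`. That Kojak uses exactly this formula is an assumption made for illustration; the function name is hypothetical.

```python
def fragment_bin(mz, bin_size=0.03, bin_offset=0.4):
    """Map an m/z value to an integer bin index (assumed convention)."""
    return int(mz / bin_size + (1.0 - bin_offset))


# Low-resolution settings: peaks 0.3 m/z apart fall into the same bin.
print(fragment_bin(500.0, 1.0005, 0.4), fragment_bin(500.3, 1.0005, 0.4))

# High-resolution settings: the same two peaks land in different bins.
print(fragment_bin(500.0, 0.03, 0.4), fragment_bin(500.3, 0.03, 0.4))

# Number of bins needed to cover fragments up to 8000 m/z.
print(fragment_bin(8000.0, 1.0005, 0.4), fragment_bin(8000.0, 0.03, 0.4))
```

Under this convention the number of bins grows in inverse proportion to the bin size, which is consistent with the note above that smaller bins require more memory.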
• ### Spectral processing

• --enrichment <float> – Values between 0 and 1 to describe 18O atom percent excess (APE). For example, 0.25 equals 25 APE. Default = 0.
• --MS1_centroid T|F – Are the precursor ion (MS1) scans centroided? Default = true.
• --MS2_centroid T|F – Are the fragment ion (MS2) scans centroided? Default = true.
• --MS1_resolution <float> – Resolution at 400 m/z, value ignored if data are already centroided. Default = 30000.
• --MS2_resolution <float> – Resolution at 400 m/z, value ignored if data are already centroided. Default = 25000.
• --min_spectrum_peaks <integer> – Minimum number of MS/MS peaks required to proceed with analysis of a spectrum. If spectrum_processing is enabled, the peak count occurs after the spectrum is processed. Default = 12.
• --max_spectrum_peaks <integer> – Maximum number of MS/MS peaks to analyze if using the spectrum_processing parameter. Peaks are kept in order of intensity, starting with the most intense. Setting a value of 0 keeps all peaks. Default = 0.
• --precursor_refinement T|F – Some data files may filter out precursor scans to save space prior to searching. To analyze these files, the precursor analysis algorithms in Kojak must be disabled. It is also possible, though not always recommended, to disable these algorithms even when precursor scans are included in the data files. This parameter toggles the precursor analysis algorithms. Default = true.
• --prefer_precursor_pred <integer> – For some data (such as Thermo Orbitrap data), the MS/MS spectra may have a precursor mass prediction already. With this parameter, the Kojak algorithm can be set to either ignore, use, or supplement the predicted precursor information. There are three options for the parameter: 0 = Ignore all precursor mass predictions and have Kojak make new predictions using its precursor processing algorithms. 1 = Use the existing precursor mass predictions and skip further processing with Kojak. 2 = Use the existing precursor mass predictions and supplement these values with additional results of the Kojak precursor processing algorithms. The recommended value is 2. Supplementing the precursor values performs the following functions. First, the monoisotopic precursor mass may be refined to one determined from a point near the apex of the extracted ion chromatogram, with potentially better mass accuracy. Second, in cases where the monoisotopic peak mass might not have been predicted correctly in the original analysis, a second monoisotopic mass is appended to the spectrum, allowing database searching to proceed checking both possibilities. Third, in cases where there is obvious chimeric signal overlap, a spectrum will be supplemented with the potential monoisotopic peak masses of all presumed precursor ions. Regardless of this parameter setting, MS/MS spectra that do not have an existing precursor mass prediction will be analyzed to identify the monoisotopic precursor mass using functions built into Kojak. Default = 2.
• --spectrum_processing T|F – The MS/MS spectrum processing function will collapse isotope distributions to the monoisotopic peak and reduce the number of peaks to analyze to the number specified with the max_spectrum_peaks parameter. Default = false.
• --kojak_instrument <integer> – Values are: 0=Orbitrap, 1=FTICR (such as Thermo LTQ-FT). Default = 0.
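The peak filtering controlled by max_spectrum_peaks and min_spectrum_peaks can be sketched as follows. The sketch covers only the intensity-based selection and the minimum-count check; the isotope-collapsing step of spectrum_processing is omitted, and the function names are hypothetical.

```python
def keep_most_intense(peaks, max_spectrum_peaks=0):
    """Keep the most intense (mz, intensity) peaks; 0 keeps all peaks."""
    if max_spectrum_peaks <= 0 or len(peaks) <= max_spectrum_peaks:
        return list(peaks)
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_spectrum_peaks]
    return sorted(top)  # restore m/z order


def has_enough_peaks(peaks, min_spectrum_peaks=12):
    """Check whether a spectrum has enough peaks to be analyzed."""
    return len(peaks) >= min_spectrum_peaks


spectrum = [(100.1, 5.0), (200.2, 50.0), (300.3, 20.0), (400.4, 1.0)]
filtered = keep_most_intense(spectrum, max_spectrum_peaks=2)
print(filtered, has_enough_peaks(filtered, min_spectrum_peaks=2))
```

Applying the minimum-count check after filtering mirrors the note above that, when spectrum_processing is enabled, peaks are counted on the processed spectrum.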
• ### param-medic options

• --pm-min-precursor-mz <float> – Minimum precursor m/z value to use in measurement error estimation. Default = 400.
• --pm-max-precursor-mz <float> – Maximum precursor m/z value to use in measurement error estimation. Default = 1800.
• --pm-min-frag-mz <float> – Minimum fragment m/z value to use in measurement error estimation. Default = 150.
• --pm-max-frag-mz <float> – Maximum fragment m/z value to use in measurement error estimation. Default = 1800.
• --pm-min-scan-frag-peaks <integer> – Minimum fragment peaks an MS/MS scan must contain to be used in measurement error estimation. Default = 40.
• --pm-max-precursor-delta-ppm <float> – Maximum ppm distance between precursor m/z values to consider two scans potentially generated by the same peptide for measurement error estimation. Default = 50.
• --pm-charges <string> – Precursor charge states whose MS/MS spectra are considered in measurement error estimation, provided as comma-separated values. Default = 0,2,3,4.
• --pm-top-n-frag-peaks <integer> – Number of most-intense fragment peaks to consider for measurement error estimation, per MS/MS spectrum. Default = 30.
• --pm-pair-top-n-frag-peaks <integer> – Number of fragment peaks per spectrum pair to be used in fragment error estimation. Default = 5.
• --pm-min-common-frag-peaks <integer> – Number of the most-intense peaks that two spectra must share in order to potentially be generated by the same peptide, for measurement error estimation. Default = 20.
• --pm-max-scan-separation <integer> – Maximum number of scans two spectra can be separated by in order to be considered potentially generated by the same peptide, for measurement error estimation. Default = 1000.
• --pm-min-peak-pairs <integer> – Minimum number of peak pairs (for precursor or fragment) that must be successfully paired in order to attempt to estimate measurement error distribution. Default = 200.
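Two of the pairing criteria above, pm-max-precursor-delta-ppm and pm-max-scan-separation, can be sketched as a simple predicate on a pair of scans. This is a simplified illustration and not param-medic's implementation: the shared-fragment-peak criterion (pm-min-common-frag-peaks) is left out, and the function names and the `(scan_number, precursor_mz)` tuple layout are hypothetical.

```python
def ppm_difference(mz_a, mz_b):
    """Relative difference between two m/z values, in parts per million."""
    return abs(mz_a - mz_b) / mz_a * 1e6


def may_share_peptide(scan_a, scan_b, max_delta_ppm=50.0, max_scan_separation=1000):
    """Check precursor closeness and scan distance for two (scan, mz) tuples."""
    number_a, mz_a = scan_a
    number_b, mz_b = scan_b
    close_in_mz = ppm_difference(mz_a, mz_b) <= max_delta_ppm
    close_in_time = abs(number_a - number_b) <= max_scan_separation
    return close_in_mz and close_in_time


print(may_share_peptide((100, 500.00), (600, 500.01)))   # 20 ppm, 500 scans apart
print(may_share_peptide((100, 500.00), (1500, 500.01)))  # too far apart in scans
print(may_share_peptide((100, 500.00), (200, 500.05)))   # 100 ppm apart
```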
• ### Input and output

• --export_percolator T|F – Exports results in Percolator text format (PIN format). Default = true.
• --export_mzID T|F – Exports results in mzID format. Default = false.
• --export_pepXML T|F – Exports results in pepXML format. Default = false.
• --output-dir <string> – The name of the directory where output files will be created. Default = crux-output.
• --overwrite T|F – Replace existing files if true or fail when trying to overwrite a file if false. Default = false.
• --parameter-file <string> – A file containing parameters. See the parameter documentation page for details. Default = <empty>.
• --fileroot <string> – The fileroot string will be added as a prefix to all output file names. Default = <empty>.
• --verbosity <integer> – Specify the verbosity of the current processes. Each level prints the following messages, including all those at lower verbosity levels: 0-fatal errors, 10-non-fatal errors, 20-warnings, 30-information on the progress of execution, 40-more progress information, 50-debug info, 60-detailed debug info. Default = 30.