Running a Simple Search Using Tide and Percolator

Now that you have your environment set up and the two input files in your working directory, you can conduct the search. The search process compares each spectrum in demo.ms2 to peptides (subsequences of the proteins) derived from the FASTA file small-yeast.fasta. Peptides whose precursor mass is close to that of the observed spectrum are scored against that spectrum, and the top scores are reported in the output. To conduct the search, we first use tide-index to create a peptide index in the directory yeast-index/ and then execute the search using tide-search.
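The candidate-selection idea described above can be illustrated with a minimal Python sketch. This is not how tide-search is implemented; the function name, the dictionary standing in for the index, and the 0.05 Da tolerance are all assumptions made for illustration. The peptide masses and the spectrum neutral mass are taken from the example search output shown later in this tutorial.

```python
# Hypothetical illustration of candidate selection by precursor mass.
# This is a sketch of the idea only, not the tide-search implementation.

def candidate_peptides(observed_mass, peptide_masses, tolerance_da):
    """Return the peptides whose mass lies within tolerance_da of observed_mass."""
    return [
        peptide
        for peptide, mass in peptide_masses.items()
        if abs(mass - observed_mass) <= tolerance_da
    ]

# A tiny stand-in for a peptide index: sequence -> peptide mass (Da).
# Values are copied from the example tide-search output in this tutorial.
index = {
    "IALSRPNVEVVALNDPFITNDYAAYMFK": 3170.6106,
    "HEIASEVASFLNGNIIEHDVPEHFFGELAK": 3348.6411,
    "HELSSLADVYINDAFGTAHR": 2215.0659,
}

# Spectrum neutral mass from the same output; the 0.05 Da window is assumed.
print(candidate_peptides(3170.6156, index, tolerance_da=0.05))
```

Only the peptides returned by such a filter would go on to be scored against the spectrum, which is why the number of candidates per spectrum depends on the width of the mass window.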

$ crux tide-index small-yeast.fasta yeast-index

While generating the peptide index, you will see output like this:

INFO: Writing results to output directory 'crux-output'.
INFO: CPU: guanine.gs.washington.edu
INFO: Crux version: 4.0-ad795d65-2021-08-23
INFO: Tue Sep 14 20:43:36 PDT 2021
INFO: Beginning tide-index.
INFO: Running tide-index...
INFO: Writing results to output directory 'yeast-index'.
INFO: Reading small-yeast.fasta and computing unmodified peptides...
INFO: Generated 1735 targets, including duplicates.
INFO: Generated 1735 decoys.
INFO: Writing decoy fasta...
INFO: Generating 1 decoy per target
INFO: Reading proteins
INFO: Skipped 0 duplicate targets and 0 duplicate decoys.
INFO: Wrote 1735 targets and 1735 decoys.
INFO: Precomputing theoretical spectra...
INFO: Elapsed time: 0.102 s
INFO: Finished crux tide-index.
INFO: Return Code:0

This command produces the peptide index in yeast-index and also produces a directory crux-output containing the following files:

tide-index.decoy.fasta – a set of decoy proteins, derived from the proteins in the input set,
tide-index.params.txt – a record of all the parameters used to build the index, and
tide-index.log.txt – a log file containing a copy of all the messages printed to the screen during indexing.

Now you can run this command:

$ crux tide-search demo.ms2 yeast-index

While the search is running, you will see output like this:

INFO: Writing results to output directory 'crux-output'.
INFO: CPU: guanine.gs.washington.edu
INFO: Crux version: 4.1-6d021498-2021-10-19
INFO: Wed Oct 27 18:37:40 PDT 2021
INFO: Beginning tide-search.
INFO: Running tide-search...
INFO: Number of Threads: 1
INFO: Reading index yeast-index/
INFO: Read 56 target proteins
INFO: Converting demo.ms2 to spectrumrecords format
INFO: Elapsed time starting conversion: 0.0616 s
INFO: Converting ms_level 2 ...
INFO: Reading spectrum file crux-output/demo.ms2.spectrumrecords.tmp.
INFO: Read 7535 spectra.
INFO: Starting search.
INFO: 1000 spectrum-charge combinations searched, 13% complete
INFO: 2000 spectrum-charge combinations searched, 27% complete
...
INFO: 6000 spectrum-charge combinations searched, 80% complete
INFO: 7000 spectrum-charge combinations searched, 93% complete
INFO: [Thread 0]: Deleted 0 precursor, 0 isotope and 0 out-of-range peaks.
INFO: [Thread 0]: Retained 100% of peaks.
INFO: Time per spectrum-charge combination: 0.003292 s.
INFO: Average number of candidates per spectrum-charge combination: 1.065428
INFO: Elapsed time: 24.8 s
INFO: Finished crux tide-search.
INFO: Return Code:0

The crux-output directory now contains four new files containing the search results:

tide-search.target.txt – search results in tab-delimited format.
tide-search.decoy.txt – search results from a decoy database in tab-delimited format.
tide-search.params.txt – a record of all the parameters used in the search.
tide-search.log.txt – a log file containing a copy of all the messages printed to the screen during the search.

Note that the peptide-spectrum matches (PSMs) in tide-search.target.txt are sorted by the precursor m/z value associated with the spectrum. If you want to see which PSMs got the highest XCorr scores, you can sort the file by the "xcorr score" column using a tool such as Python or Excel.
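One possible way to do the sorting in Python is sketched below, using only the standard csv module. The function name sort_psms and the output file name are arbitrary choices for this example; the column name "xcorr score" is the one that appears in the header of the search output.

```python
import csv

def sort_psms(in_path, out_path, column="xcorr score"):
    """Copy a tab-delimited PSM file to out_path, sorted by `column`, highest first."""
    with open(in_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = reader.fieldnames
        # Convert the column to float so that sorting is numeric, not alphabetical.
        rows = sorted(reader, key=lambda row: float(row[column]), reverse=True)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
```

For example, calling sort_psms("crux-output/tide-search.target.txt", "crux-output/tide-search.target.sort.txt") would write a sorted copy next to the original file. Passing a different column name allows the same function to sort on any other numeric column.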

The first lines of the resulting sorted output file should look like this:

file	scan	charge	spectrum precursor m/z	spectrum neutral mass	peptide mass	delta_cn	delta_lcn	xcorr score	xcorr rank	distinct matches/spectrum	sequence	modifications	cleavage type	protein id	flanking aa	target/decoy
demo.ms2	60135	3	1057.8792	3170.6156	3170.6106	0	0	6.69412756	1	1	IALSRPNVEVVALNDPFITNDYAAYMFK		trypsin-full-digest	YGR192C(19)	RY	target
demo.ms2	60355	4	838.1677	3348.6418	3348.6411	0	0	6.36378858	1	1	HEIASEVASFLNGNIIEHDVPEHFFGELAK		trypsin-full-digest	YLR249W(27)	RG	target
demo.ms2	57701	3	1190.5835	3568.7287	3568.7202	0	0	6.23474788	1	1	GVLGYTEDAVVSSDFLGDSHSSIFDASAGIQLSPK		trypsin-full-digest	YGR192C(270)	KF	target
demo.ms2	46517	3	739.3639	2215.0698	2215.0659	0	0	6.11194706	1	1	HELSSLADVYINDAFGTAHR		trypsin-full-digest	YCR012W(150)	RA	target
demo.ms2	75478	3	975.1579	2922.4519	2922.4465	0	0	5.96129632	1	1	NMITGTSQADCAILIIAGGVGEFEAGISK	11_S_57.02	trypsin-full-digest	YBR118W(101)	KD	target
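The "spectrum neutral mass" column is related to the "spectrum precursor m/z" and "charge" columns by simple arithmetic: remove one proton mass from the m/z value and multiply by the charge. The short Python sketch below reproduces that calculation for the first row of the table and also expresses the difference between the spectrum mass and the peptide mass in parts per million. The function names are chosen for this example, and small discrepancies from the reported values are expected because the m/z values in the table are rounded.

```python
PROTON_MASS = 1.00727646688  # mass of a proton in Da

def neutral_mass(precursor_mz, charge):
    """Neutral mass implied by an observed precursor m/z and charge state."""
    return (precursor_mz - PROTON_MASS) * charge

def ppm_error(observed_mass, theoretical_mass):
    """Mass difference in parts per million, relative to the theoretical mass."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

# First row of the table: m/z 1057.8792, charge 3, peptide mass 3170.6106.
mass = neutral_mass(1057.8792, 3)
print(f"neutral mass: {mass:.4f} Da")
print(f"mass error:   {ppm_error(3170.6156, 3170.6106):.2f} ppm")
```

The computed neutral mass agrees with the reported 3170.6156 to within about 0.001 Da, and the mass error for this match is on the order of 1 to 2 ppm.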

The final step is to post-process the search results using Percolator. Each spectrum has been compared to many peptides and we would like to return only the best match for each spectrum. We also expect that some fraction of the spectra will not be identifiable as peptides (due to chemical noise, multiple peptides co-eluting, poor fragmentation, etc.). The analysis step filters out those spectra and ranks the matches by quality.
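The idea of keeping only the best match for each spectrum can be sketched in a few lines of Python. This is only an illustration, not Percolator's code: the dictionary keys and the example records are hypothetical, and the grouping key (scan number plus observed mass) is chosen to mirror the "best-scoring PSM per scan+expMass" message that appears in the Percolator log output.

```python
def best_psm_per_spectrum(psms, score_key="score"):
    """Keep the highest-scoring PSM for each (scan, neutral_mass) pair."""
    best = {}
    for psm in psms:
        key = (psm["scan"], psm["neutral_mass"])
        if key not in best or psm[score_key] > best[key][score_key]:
            best[key] = psm
    return list(best.values())

# Hypothetical records: two competing matches for scan 1, one for scan 2.
example = [
    {"scan": 1, "neutral_mass": 1500.0, "sequence": "PEPTIDEK", "score": 1.2},
    {"scan": 1, "neutral_mass": 1500.0, "sequence": "EDITPEPK", "score": 0.4},
    {"scan": 2, "neutral_mass": 2100.5, "sequence": "SAMPLER", "score": 2.0},
]

for psm in best_psm_per_spectrum(example):
    print(psm["scan"], psm["sequence"], psm["score"])
```

When target and decoy matches for the same spectrum are pooled before this step, the selection amounts to a target-decoy competition: whichever of the two scores higher is the one that survives.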


$ crux percolator --test-fdr 0.1 crux-output/tide-search.target.txt

While the analysis is running, you will see output like this:

INFO: CPU: guanine.gs.washington.edu
INFO: Crux version: 4.1-6d021498-2021-10-19
INFO: Wed Oct 27 18:56:07 PDT 2021
INFO: Beginning percolator.
INFO: Converting input to pin format.
INFO: Parsing crux-output/tide-search.target.txt
INFO: Assigning index 0 to demo.ms2.
INFO: Parsing crux-output/tide-search.decoy.txt
INFO: There are 4014 target matches and 4014 decoys
INFO: Maximum observed charge is 5.
INFO: File conversion complete.
INFO: Percolator version 3.05.nightly-137-e806a0c5, Build Date Aug 17 2021 11:04:02
INFO: Copyright (c) 2006-9 University of Washington. All rights reserved.
INFO: Written by Lukas Käll (lukall@u.washington.edu) in the
INFO: Department of Genome Sciences at the University of Washington.
INFO: Issued command:
INFO: percolator --results-peptides crux-output/percolator.target.peptides.txt --decoy-results-peptides crux-output/percolator.decoy.peptides.txt --results-psms crux-output/percolator.target.psms.txt --decoy-results-psms crux-output/percolator.decoy.psms.txt --verbose 2 --protein-decoy-pattern decoy_ --seed 1 --subset-max-train 0 --trainFDR 0.01 --testFDR 0.1 --maxiter 10 --search-input auto --no-schema-validation --protein-enzyme trypsin --post-processing-tdc crux-output/make-pin.pin
INFO: Started Wed Oct 27 18:56:08 2021
INFO:  on guanine.gs.washington.edu
INFO: Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
INFO: Reading tab-delimited input from datafile crux-output/make-pin.pin
INFO: Features:
INFO: deltLCn deltCn XCorr PepLen Charge1 Charge2 Charge3 Charge4 Charge5 enzN enzC enzInt lnNumDSP dM absdM
INFO: Found 8028 PSMs
INFO: Separate target and decoy search inputs detected, using target-decoy competition on Percolator scores.
INFO: Train/test set contains 4014 positives and 4014 negatives, size ratio=1 and pi0=1
INFO: Selecting Cpos by cross-validation.
INFO: Selecting Cneg by cross-validation.
INFO: Split 1:  Selected feature 3 as initial direction. Could separate 264 training set positives with q<0.01 in that direction.
INFO: Split 2:  Selected feature 3 as initial direction. Could separate 286 training set positives with q<0.01 in that direction.
INFO: Split 3:  Selected feature 3 as initial direction. Could separate 313 training set positives with q<0.01 in that direction.
INFO: Found 489 test set positives with q<0.1 in initial direction
INFO: Reading in data and feature calculation took 0.2100 cpu seconds or 0 seconds wall clock time.
INFO: ---Training with Cpos selected by cross validation, Cneg selected by cross validation, initial_fdr=0.01, fdr=0.01
INFO: Iteration 1:      Estimated 497 PSMs with q<0.1
INFO: Iteration 2:      Estimated 498 PSMs with q<0.1
INFO: Iteration 3:      Estimated 497 PSMs with q<0.1
INFO: Iteration 4:      Estimated 495 PSMs with q<0.1
INFO: Iteration 5:      Estimated 499 PSMs with q<0.1
INFO: Iteration 6:      Estimated 500 PSMs with q<0.1
INFO: Iteration 7:      Estimated 499 PSMs with q<0.1
INFO: Iteration 8:      Estimated 500 PSMs with q<0.1
INFO: Iteration 9:      Estimated 499 PSMs with q<0.1
INFO: Iteration 10:     Estimated 500 PSMs with q<0.1
INFO: Learned normalized SVM weights for the 3 cross-validation splits:
INFO:  Split1    Split2  Split3 FeatureName
INFO: -0.2393    0.1490 -0.9274 deltLCn
INFO:  0.1964    0.0398  0.9062 deltCn
INFO:  1.7107    2.0151  3.3339 XCorr
INFO:  0.1113   -0.5086 -0.7459 PepLen
INFO:  0.0000    0.0000  0.0000 Charge1
INFO: -0.0370   -0.1876 -0.5067 Charge2
INFO:  0.0155    0.1021  0.3208 Charge3
INFO:  0.0561    0.2259  0.5142 Charge4
INFO:  0.0188    0.0916  0.1388 Charge5
INFO:  0.0000    0.0000  0.0000 enzN
INFO:  0.0000    0.0000  0.0000 enzC
INFO: -0.0209   -0.0580 -0.0997 enzInt
INFO:  0.2998   -0.3144  0.6631 lnNumDSP
INFO: -0.0458    0.1445 -0.3350 dM
INFO:  0.1462    0.7423  0.8114 absdM
INFO: -2.1264   -3.3482 -5.0740 m0
INFO: Found 499 test set PSMs with q<0.1.
INFO: Selected best-scoring PSM per scan+expMass (target-decoy competition): 1787 target PSMs and 1168 decoy PSMs.
INFO: Multiple instantiations of Normalizer
INFO: Multiple instantiations of Normalizer
INFO: Multiple instantiations of Normalizer
INFO: Tossing out "redundant" PSMs keeping only the best scoring PSM for each unique peptide.
INFO: Calculating q values.
INFO: Final list yields 362 target peptides with q<0.1.
INFO: Calculating posterior error probabilities (PEPs).
INFO: Processing took 5.5800 cpu seconds or 4 seconds wall clock time.
INFO: Multiple instantiations of Normalizer
INFO: Multiple instantiations of Normalizer
INFO: Elapsed time: 4.9 s
INFO: Finished crux percolator.
INFO: Return Code:0

The crux-output directory will now contain eight new files:

percolator.target.psms.txt – a list of peptide-spectrum matches (PSMs), ranked by quality,
percolator.target.peptides.txt – a list of peptides, ranked by quality,
percolator.decoy.psms.txt – a ranked list of decoy PSMs,
percolator.decoy.peptides.txt – a ranked list of decoy peptides,
percolator.pout.xml – a single XML output file containing all of the Percolator results,
make-pin.pin – an intermediate tab-delimited file that is used as input to Percolator,
percolator.params.txt – parameter file, and
percolator.log.txt – log file.

As before, you might want to sort the Percolator output files, this time by the "percolator score" column.

The beginning of the resulting percolator.target.psms.sort.txt file will look like this:

file_idx	scan	charge	spectrum precursor m/z	spectrum neutral mass	peptide mass	percolator score	percolator q-value	percolator PEP	distinct matches/spectrum	sequence	protein id	flanking aa
0	57701	3	1190.5835	3568.7287	3568.7209	3.22302477	0.002617801	5.1679258e-07	1	GVLGYTEDAVVSSDFLGDSHSSIFDASAGIQLSPK	YGR192C	KF
0	28906	3	731.0363	2190.0869	2190.0817	3.05633388	0.002617801	9.7863355e-07	2	HLVHEVTSPQAFEGLENAGR	YGL009C	RK
0	60831	4	838.1685	3348.6450	3348.6414	3.01509804	0.002617801	1.1460915e-06	1	HEIASEVASFLNGNIIEHDVPEHFFGELAK	YLR249W	RG
0	22958	3	520.9552	1559.8437	1559.8420	2.99335675	0.002617801	1.2456261e-06	1	SHINVVVIGHVDSGK	YBR118W	KS
0	28872	4	548.5290	2190.0869	2190.0817	2.98298458	0.002617801	1.2961123e-06	2	HLVHEVTSPQAFEGLENAGR	YGL009C	RK

In this output, the PSMs are ranked by "percolator score," with higher scores indicating a higher quality match. The associated statistical confidence estimate is reported as a "percolator q-value," interpreted as the minimal false discovery rate threshold at which this match is deemed significant. In the list above, all of the matches have q-values of <0.003, meaning that they are highly significant. The meanings of the remaining columns are described here. Note that when you run Percolator on your own computer, the results may be somewhat different from the ones reported here. This is because Percolator involves randomly subdividing the data in a cross-validation scheme (described in detail here).
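To make the definition of a q-value concrete, the following Python sketch computes q-values from a list of scored target and decoy matches. It is a deliberately simplified illustration and not Percolator's actual procedure, which is more involved: here the false discovery rate at each score threshold is estimated simply as the number of decoys divided by the number of targets at or above that score, and the q-value is the smallest such estimate over all thresholds that would still accept the match. The function name and the example scores are hypothetical.

```python
def q_values(psms):
    """Compute simplified target-decoy q-values.

    psms: iterable of (score, is_decoy) pairs.
    Returns a list of (score, q_value) for the target PSMs, best score first.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Estimated FDR at each threshold: decoys / targets scoring at least this well.
    targets = decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the minimum FDR over this threshold and all more permissive ones.
    qs = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()

    return [(score, q) for (score, is_decoy), q in zip(ranked, qs) if not is_decoy]

# Hypothetical scores: True marks a decoy match.
example_psms = [(5.0, False), (4.0, False), (3.0, True),
                (2.0, False), (1.0, True), (0.5, False)]
for score, q in q_values(example_psms):
    print(f"score={score:.1f}  q={q:.3f}")
```

Because the q-value is a running minimum, it never decreases as one moves down the ranked list, which is why several consecutive top-ranked matches can share the same q-value.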