An improved machine learning protocol for the identification of correct Sequest search results

BMC Bioinformatics

Table 1 Features used in the machine learning formulation

Group	Name	Meaning	Origin
SEQUEST	XCorr	Rank score from the SEQUEST search.	SEQUEST
	deltaMH	Difference between mass of parent ion and identified peptide mass.	SEQUEST
	deltCn	Difference between XCorr of the highest ranked peptide and the peptide in question	SEQUEST
	SP score	Preliminary score of peptide in search procedure	SEQUEST
	SP rank	Initial rank of peptide based on SP-score	SEQUEST
	Ion fraction	Percentage of theoretical fragment ions of the peptide that could be matched to peaks in the spectrum	SEQUEST
Published	Number of tryptic	Number of tryptic cleavage sites in the peptide targets (NTT)	Calculated
	Peptide length	Residue count of the peptide	Calculated
	Summed Intensity	Sum of peak intensities in the spectrum	Calculated
	Mobile proton factor (MPF)	Measure of the proton mobility in the peptide	Calculated
	C-terminal Residue	Amino acid residue at the C-terminus (Arg = 1, Lys = 2, Other = 3)	Calculated
	Mass-window peptides	# of DB peptides within prespecified mass-window of the parent ion	Calculated
	Proline count	# of Pro residues in the peptide	Calculated
	Arginine count	# of Arg residues in the peptide	Calculated
Novel	Intensity Mean	The mean of the peak intensities	Calculated
	Intensity Std.	Std. of the peak intensities	Calculated
	Intensity bins	The distribution of intensities in 20%-bins	Calculated
	Protein Hit Count (PHC)	Probability score of observing x peptides from the parent protein	Calculated
	Potential Coverage Ratio	The potential sequence coverage	Calculated
	PTM percentage	The percentage of possible PTMs found in a peptide	Calculated

For each feature we give a brief description and indicate whether it was obtained from the output of the SEQUEST algorithm or calculated from the identified peptide, the mass spectrum, or database statistics. The features are divided into three subgroups (SEQUEST, Published, and Novel), denoting those that can be derived directly from the SEQUEST output, those used in published studies of the identification problem, and those introduced in this work, respectively.
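
Several of the calculated features in Table 1 follow directly from the peptide sequence or the peak list. The sketch below shows one possible way to compute them in Python; it is an illustration under stated assumptions, not the implementation used in this work. The function names `peptide_features` and `intensity_features` are hypothetical. The intensity bins are assumed to be the fraction of peaks falling in each 20% band of the maximum peak intensity, and the standard deviation is assumed to be the population form. The table itself does not specify either choice.

```python
from statistics import mean, pstdev


def peptide_features(peptide):
    """Sequence-derived features from Table 1 (hypothetical helper)."""
    if not peptide:
        raise ValueError("peptide sequence must not be empty")
    return {
        "peptide_length": len(peptide),
        "proline_count": peptide.count("P"),
        "arginine_count": peptide.count("R"),
        # Encoding given in Table 1: Arg = 1, Lys = 2, Other = 3
        "c_terminal_residue": {"R": 1, "K": 2}.get(peptide[-1], 3),
    }


def intensity_features(intensities, n_bins=5):
    """Spectrum-derived features from Table 1 (hypothetical helper).

    Assumption: bins are 20% bands of the maximum peak intensity and
    each bin holds the fraction of peaks that fall inside it.
    """
    if not intensities or max(intensities) <= 0:
        raise ValueError("need at least one peak with positive intensity")
    top = max(intensities)
    counts = [0] * n_bins
    for value in intensities:
        # The most intense peak would index past the end, so clamp it
        # into the last bin.
        index = min(int(value * n_bins / top), n_bins - 1)
        counts[index] += 1
    total = len(intensities)
    return {
        "summed_intensity": sum(intensities),
        "intensity_mean": mean(intensities),
        # Assumption: population standard deviation
        "intensity_std": pstdev(intensities),
        "intensity_bins": [count / total for count in counts],
    }


if __name__ == "__main__":
    print(peptide_features("PEPTIDER"))
    print(intensity_features([10.0, 30.0, 50.0, 100.0]))
```

The remaining features (for example MPF, PHC, and the mass-window peptide count) depend on definitions and database statistics that the table does not spell out, so they are not sketched here.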

ISSN: 1471-2105