"An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets" by Daniel Schwartz and Steven Gygi, Gygi lab at Harvard Medical School, published in Nature Biotechnology on November 4, 2005.
[Historical Digression] Edwin Krebs and Edmond Fischer, Nobel Laureates for discovering the regulatory role of protein phosphorylation in the 1950s. Courtesy of the UW Department of Biochemistry.
This paper presents the first method for finding motifs for phosphorylation sites in proteins. Recent refinements in strategies to identify phosphorylation sites, and the resulting increase in the number of known sites, have made this new "substrate-driven" approach possible. Using a phosphorylated peptide data set and a background peptide data set, the algorithm calculates the probability of observing residue X at position Y in the phosphorylated set, given a background probability Z calculated from the second data set. It then constructs a motif in a recursive building procedure, in which statistically significant residue/position pairs are identified (recursive motif building, step 1). Next, it removes from both the phosphorylation and background data sets all sequences containing that motif (set reduction, step 2). Steps 1 and 2 are repeated until step 1 reveals no further significant residue/position pairs, leaving a final set of significant motifs.
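The two-step loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes equal-length peptides aligned on the phosphorylated residue, it scores each residue/position pair with a binomial tail probability (this summary does not name the statistic), and the function names, the `alpha` threshold and the `min_count` cutoff are hypothetical choices made for the example.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); assumed significance score."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def matches(seq, motif):
    """True if seq carries every fixed residue of motif ({position: residue})."""
    return all(seq[pos] == res for pos, res in motif.items())


def build_motif(fg, bg, center, alpha, min_count):
    """Step 1: repeatedly fix the most significant residue/position pair,
    then continue on the subsets of both data sets that contain it."""
    motif = {}
    while fg and bg:
        best = None  # (p_value, position, residue)
        for pos in range(len(fg[0])):
            if pos == center or pos in motif:
                continue
            for res in set(s[pos] for s in fg):
                k = sum(s[pos] == res for s in fg)
                if k < min_count:
                    continue
                # Background probability Z of this residue at this position.
                z = sum(s[pos] == res for s in bg) / len(bg)
                p = binom_tail(k, len(fg), z)
                if p < alpha and (best is None or p < best[0]):
                    best = (p, pos, res)
        if best is None:
            break  # no further significant pairs
        _, pos, res = best
        motif[pos] = res
        fg = [s for s in fg if s[pos] == res]
        bg = [s for s in bg if s[pos] == res]
    return motif


def find_motifs(fg, bg, center, alpha=1e-3, min_count=2):
    """Alternate step 1 (motif building) and step 2 (set reduction)."""
    motifs = []
    while fg and bg:
        motif = build_motif(fg, bg, center, alpha, min_count)
        if not motif:
            break  # step 1 found nothing significant: stop
        motifs.append(motif)
        # Step 2: drop every sequence containing the motif from both sets.
        fg = [s for s in fg if not matches(s, motif)]
        bg = [s for s in bg if not matches(s, motif)]
    return motifs
```

Here `fg` and `bg` are lists of strings such as 7-residue windows with the phosphorylated serine at index 3 (`center=3`), and each returned motif is a dictionary mapping a position to its fixed residue. The greedy choice of the single most significant pair at each step is one simple way to realize "recursive motif building"; other selection rules would fit the description equally well.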
The biological problem addressed here is an important one, and its solution will hopefully lead to the direct discovery of potential phosphorylation sites in proteins of interest. However, the method presented is quite naive: it considers only individual residue/position pairs in the sequence. The authors seem to recognize this and describe it as a "starting point for future research" in the Outlook section. I think refinements such as taking structure into account, or grouping amino acids by property (acidic/basic, hydrophobic/hydrophilic), would be appropriate ways to improve performance.