I want to be able to predict which proteins in a proteome are more likely to be regulated by phosphorylation, ideally using mostly sequence information. This post is a quick note to show what I have tried and maybe get some feedback from people who have tried this before.
The most straightforward way to predict phospho-proteins is to use existing phospho-site predictors in some way. I have used the GPS 2.0 predictor on the S. cerevisiae proteome with the medium cutoff, including only Serine/Threonine kinases. The fraction of tyrosine phosphosites in S. cerevisiae is very low, so I decided not to try predicting tyrosine phosphorylation for now.
This produces a ranked list of 4E6 putative phosphosites for the roughly 6000 proteins, scored according to the predictor (each site is scored for multiple kinases). My question is how best to make use of these predictions if I mostly want to know which proteins are phosphorylated and not the exact sites. Using a set of known phosphorylated proteins in S. cerevisiae (mostly taken from expasy), I computed different per-protein scores as a function of all the phospho-site scores of each protein:
1) the sum
2) the highest value
3) the average
4) the sum of putative phosphosite scores above a threshold (4, 6 or 10)
5) the sum of putative phosphosite scores if they were outside ordered protein segments as defined by a secondary structure predictor and above a score threshold
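As a minimal sketch of options 1 to 4 above, the per-protein summaries could be computed along these lines. The function name `protein_scores` and the example scores are hypothetical and only for illustration; they are not GPS 2.0 output or the code actually used for this post.

```python
from statistics import mean


def protein_scores(site_scores, threshold=None):
    """Collapse the site scores of one protein into per-protein summaries.

    site_scores: iterable of numeric phosphosite scores for a single protein.
    threshold:   if given, only scores strictly above it are kept (option 4).
    """
    kept = [s for s in site_scores if threshold is None or s > threshold]
    return {
        "sum": sum(kept),                     # option 1 (or 4 with a threshold)
        "max": max(kept, default=0.0),        # option 2
        "mean": mean(kept) if kept else 0.0,  # option 3
    }


# Hypothetical site scores for one protein (made-up numbers).
example = [2.5, 4.5, 7.0, 11.0]
all_sites = protein_scores(example)
above_6 = protein_scores(example, threshold=6)
```

Proteins with no site above the threshold get a score of 0 here, which is one possible convention; dropping them from the ranking would be another.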
The results are summarized with the area under the ROC curve (known phosphoproteins were considered positives and all other proteins negatives):
In summary, the sum of all phospho-site scores is the best way I have found so far to predict which proteins are phospho-regulated. My interpretation is that phospho-regulated proteins tend to be multi-phosphorylated and/or regulated by multiple kinases, so the maximum site score will not work as well as the sum. As a side note, although there are abundance biases in mass-spec data (the source of most of the phospho-data), protein abundance is a very poor predictor of phospho-regulation (AROC=0.55).
Disregarding putative sites inside predicted secondary-structure segments did not improve the predictions as I would have expected, but I should try a few disorder predictors.
Ideas for improvements are welcome, in particular sequence-based methods. I would also like to avoid comparative genomics for now.
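For readers who want to reproduce this kind of comparison, here is a minimal sketch of the area under the ROC curve computed as the fraction of (positive, negative) protein pairs that the score ranks correctly, with ties counted as half. The function name `auroc` and the toy data are assumptions for illustration, not the code or data behind the numbers in this post.

```python
def auroc(scores, labels):
    """Area under the ROC curve from per-protein scores and 0/1 labels.

    Uses the pairwise (Mann-Whitney) formulation: the probability that a
    randomly chosen positive scores higher than a randomly chosen negative.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy example: two known phosphoproteins (1) and two other proteins (0).
toy_auc = auroc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

An AROC of 0.5 corresponds to a random ranking, which is why the value of 0.55 for protein abundance counts as a very poor predictor.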
Wednesday, May 14, 2008
Prediction of phospho-proteins from sequence
Posted by Pedro Beltrão at 12:26 AM
Labels: bioinformatics, original research
8 comments:
Is there any way to compare data from phospho.ELM or other mass spec databases with phosphorylation prediction servers as a way to train the algorithm to find things?
@Dave Bridges: This is exactly how the phospho-site predictors were created. What I need, on the other hand, is a trained predictor of phospho-proteins, and I don't think one is available. I guess that would be the next thing to try: instead of empirically testing different ways to combine the phospho-site predictions, I could train some form of machine learning algorithm. This would take me some time since I have never done it before. I guess it is a good time to learn :)
Hi Pedro,
I know a thing or two about phosphorylation sites.
As you know, most prediction tools predict sites for a specific family of kinases. The only tool I know that predicts whether a S/T/Y is phosphorylated is DisPhos. Unfortunately their webserver is not much use for batch prediction.
It is certainly worth looking at disorder as a predictor. Of the experimentally-validated sites in either phospho.ELM or UniProt, over 90% are disordered as predicted by at least one of the three DisEMBL algorithms, for instance. See these data at our web site.
Thanks Neil. I have DisEMBL running locally since I have used it before for SH3 target site filtering. I will give that a try next.
I had not noticed that you were working with Bostjan Kobe. I have actually been using Predikin predictions of yeast kinases.
Yes, Predikin has been my life these past 2 years :) The NAR server paper just came out and we have a more detailed paper in press, BMC Bioinformatics. I'll be blogging about them soon.
The Predikin website is also not great for batch prediction. If you'd like to do any genome-scale yeast analysis, let me know and we can run sequences through our local Perl scripts.
Hi Pedro,
I'm not sure that I agree with your interpretation that the sum of scores works well because proteins tend to be multi-phosphorylated. I think that there is a much simpler statistical explanation for what you observe.
Imagine that GPS predicts one site with a score x. This implies that there is a certain probability, p, that the site is actually a true phosphorylation site, and thus that there is 1-p chance that the prediction is wrong.
Now suppose that you have a different protein with two sites that both get the score x, i.e. the sum of scores is 2x. In this case there is still p chance for each of the two sites to be correct.
However, what you ask in your ROC plot is not whether each phosphorylation site is correct but whether the protein is a phosphoprotein. Given two sites with score x, the probability that at least one of them is a true phosphorylation site is 1-(1-p)*(1-p), which is greater than p (provided that 0 < p < 1).
In other words, I believe that your improved performance is simply a consequence of the fact that the more phosphorylation sites you predict in a protein, the more likely it is that at least one of those predictions is correct. But the reliability of each individual prediction is likely unchanged.
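A minimal sketch of this argument, assuming (as the formula above does) that the predictions for the different sites are independent and all have the same probability p of being correct. The function name `prob_at_least_one` and the value p = 0.2 are made up for illustration.

```python
def prob_at_least_one(p, n):
    """Probability that at least one of n independent site predictions,
    each correct with probability p, is a true phosphorylation site."""
    return 1.0 - (1.0 - p) ** n


# With a hypothetical per-site reliability of p = 0.2:
one_site = prob_at_least_one(0.2, 1)    # just p itself
two_sites = prob_at_least_one(0.2, 2)   # 1 - 0.8 * 0.8
five_sites = prob_at_least_one(0.2, 5)  # 1 - 0.8 ** 5
```

The per-protein probability grows with the number of predicted sites even though the per-site reliability p never changes.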
Hi Pedro,
I was working on a similar problem, except using mostly structural information (the short story being that structure, used in a simple feature-based classifier way, doesn't seem to add much on top of sequence). What we started thinking about towards the end of the project was whether information about kinase docking sites could be used to aid in prediction. These sites have high affinity for the kinase, are distal to the phosphorylation site(s), and usually precede phosphorylation as they bring the kinase and substrate together in the first place. I think there wasn't enough structure information to do a good analysis, but it might be worth checking whether anyone has tried to use docking sites as an additional or parallel feature in predicting phosphorylation sites.
Lars: You might be right, I have to think more about it. I also don't see an easy way to test this. One possible way has to do with spatial clustering of possible sites. Some kinases only phosphorylate "downstream" of other phosphosites. If there is further improvement from analyzing the spatial distribution of the scores across the proteins, then it would argue for multiple sites.
Shirley: I guess that was one of those examples for the t-shirts :). Yes, it would be great to use docking site specificity, but I don't have an easy way to define these. Do you know of a list of motifs or PSSMs that describe the binding specificity of known kinase docking sites in yeast? It is possible to predict them using structural analysis, and I think my previous supervisor has a collaboration on this, so I have to ask them to see if they have them already.