Wednesday, May 25, 2011

Predicting kinase specificity from phosphorylation data

Over the past few years, improvements in mass-spectrometry methods have resulted in a big increase in throughput for the identification of post-translational modifications (PTMs). It is hard even to keep up with all the phosphoproteomics papers and the accumulation of phosphorylation data. Most often, improvements in methods result in interesting challenges and opportunities. In this case, how can we make use of this explosion in PTM data ? I will try to explore a fairly straightforward idea: how to use phosphorylation data to predict kinase substrate specificity. I'll describe here the general idea and just the first stab at it, to show that I think it can work.

The inspiration for this is the work by Neduva and colleagues, who have shown that we can search for enriched motifs within proteins that interact with a domain of interest. For example, we can take a protein containing an SH3 domain and find all of its interaction partners, and we will likely see that they are enriched for proline-rich motifs of the type PXXP (X = any amino acid), which is the known binding preference for this domain. So the very obvious application to kinases would be to take the interaction partners of a kinase and find enriched peptide motifs. The advantage of looking at kinases, over any other type of peptide-binding domain, is that we can focus specifically on phosphosites.

As a test case I picked the S. cerevisiae Cdc28p (Cdk1), which is known to phosphorylate the motif [ST]PXK. I used the STRING database to identify proteins that functionally interact with Cdc28 with a cut-off of 0.9 and retrieved all currently known phosphosites within these proteins. As a quick check I used Motif-X to search for enriched motifs. The first try was somewhat disappointing, but after removing phosphosites supported by fewer than 5 MS spectra and/or experiments I got back this logo as the most enriched motif:

This was probably the easiest kinase to try since it is known that it typically phosphorylates its targets at multiple sites and it is heavily studied. Still, I think there is a lot of room for exploration here. If anyone is interested in collaborating on this let me know. If you're doing computational work I would be interested in some code/tools for motif enrichment. If you're doing experimental work let me know about your favorite kinases/species.
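
As a rough illustration of the kind of motif enrichment check described above, here is a minimal sketch. It is not the Motif-X algorithm: it simply counts how many phosphosites among the kinase's interaction partners match a candidate motif such as [ST]PXK and asks, with a hypergeometric test, whether that count is higher than expected given all phosphosites. The sequence windows and the function names are made up for the example.

```python
import re
from math import comb

def count_motif(windows, pattern):
    """Count windows in which `pattern` matches starting at the central residue."""
    regex = re.compile(pattern)
    return sum(1 for w in windows if regex.match(w, len(w) // 2))

def enrichment_pvalue(k, n, K, N):
    """Hypergeometric P(X >= k): n sites drawn from N, of which K carry the motif."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

# Hypothetical 9-residue windows centred on the phosphosite (index 4).
partner_sites = ["GGAASPAKL", "LLDDTPEKR", "AAAASPLKA", "QQQQSAAAA"]
other_sites = ["EEEESDDEE", "RRLLSVVAA", "KKRRTLLGG",
               "NNNNSPGKT", "DDEETEEDD", "PPGGSGGPP"]
all_sites = partner_sites + other_sites

motif = "[ST]P.K"  # [ST]PXK written as a regular expression
k = count_motif(partner_sites, motif)   # motif sites among the partners
K = count_motif(all_sites, motif)       # motif sites among all phosphosites
p_value = enrichment_pvalue(k, len(partner_sites), K, len(all_sites))
print(k, K, round(p_value, 3))
```

A real analysis would have to discover the motif rather than test a known one, and would need a correction for the number of motifs tried.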

Thursday, April 28, 2011

In defense of 'Omics

High-throughput studies tend to have a bad reputation. They are often derided as little more than fishing expeditions. Few have summarized these feelings as sharply as Sydney Brenner:
"So we now have a culture which is based on everything must be high-throughput. I like to call it low-input, high-throughput, no-output biology"
Having dealt with this type of data for so long, I am often in the strange position of having to defend the approaches. As I was in real need to procrastinate, I decided to try to write some of these thoughts down.

Error rates
One of the biggest complaints directed at large-scale methods is that they have very high error rates. Usually these complaints come from scientists interested in studying system X or protein Y, that dig into these datasets only to find out that their protein of interest is missing. Are the error rates high ? While this might be true for some methods it is important to note that the error rates are almost always quantified and that those developing the methods keep pushing the rates down.

When thinking about 'small-scale' studies I could equally ask - why should I trust a single western blot image ? How many westerns were put in the garbage bin before you got that really nice one that is featured in the paper ? In fact, some methods for reducing the error only become feasible when operating in high-throughput. As an example, when conducting pull-down experiments to determine protein-protein interactions at large scale, unspecific binding becomes much easier to call. This has led to the development of analysis tools that cannot be employed on single pull-down experiments.

So, by quantifying the error rates and driving these down via experimental or analysis improvements, 'omics research is, in fact, at the forefront of data quality. At the very least, you know what the error rate is and can use the information accordingly. Once the methods are improved to an extent that the errors are negligible or manageable they are quietly no longer considered "omics". The best example of this I think is genome sequencing. Even with the current issues with next-gen sequencing, few put 'traditional' genome sequencing in the same bag with the other 'omics tools, although it too has quantifiable errors.

Standardization
Related to error quantification is standardization. To put it simply, large-scale data is typically deposited in databases and is available for re-use. What is the point of having really careful experiments if they will only be available for re-use, in any significant way, when a (potentially sloppy) curator digs the info out of papers ? This availability fuels research by others who are not set up to perform the measurements. This is one of the reasons why bioinformatics thrives. The limitations become the ideas, not the experimental observations/measurements. Anyone can sit down, think of a problem and, with some luck, the required measurements (or a proxy for them) have been made by others for some unrelated purpose. This is why publications of large-scale studies are so highly cited: they are re-used over and over again.

Engineering mindset and costs
One other very common complaint about these methods is cost. It is common to feel that 'omics research is 'trendy', expensive and consumes too much of the science budgets. While the part about budget allocation might be true, the issue with costs is most certainly not. Large-scale methods are developed by people with an engineering mindset. The problems in this type of research are typically on how to make the methods work effectively, which includes making them cheaper, smaller, faster, etc. 'Omics research drives costs down.

Cataloging diversity
Besides these technical comments, the highest barrier to deal with when discussing these methods with others is a conceptual one. Is there such a thing as 'hypothesis free' research ? To address this point let me go off on a small tangent. I am currently reading a neuroscience book - Beyond Boundaries - by Miguel Nicolelis, a researcher at Duke University. I will leave a proper review for some later post but, at some point, Nicolelis talks about the work of Santiago Ramon y Cajal. Ramon y Cajal is usually referred to as the father of the neuron theory, which postulates that the nervous system is made up of fundamental discrete units (neurons). His drawings of neuronal circuits of different species are famous and easily recognizable. The amazing level of detail and effort that he put into these drawings really underscores his devotion to cataloging diversity. These observations inspired a revolution in neuroscience, much the same way Darwin's catalogs of diversity impacted biology. Should we not build catalogs of protein interactions, gene expression, post-translational modifications, etc ? I would argue that we must. Omics research drives errors and prices down and creates catalogs of easily accessible and re-usable observations that fuel research. I actually think that it frees researchers. While a few specialize in method development, others are free to dream up biological problems to solve, with the data gathering effort shortened to a digital query.

Misunderstandings
So why the negative connotations ? Part of it is simple backlash against the hype. As we know, most technologies tend to follow a hype cycle where early exaggerated excitement is usually followed by disappointment and backlash when they fail to deliver. A second important aspect is simply a lack of understanding of how to make use of the available data. This model of data generation separated from the problem solving and analysis only makes sense if researchers can query the repositories and integrate the data into their research. It is sad to note that this capacity is far from universal. While new generations are likely to bring with them a different mindset, those developing the large scale methods should also bear the responsibility of improving the re-usability of the data. 

Thursday, March 03, 2011

Structure based prediction of kinase interactions

About a year ago Ben Turk's lab published a large scale experimental effort to determine the substrate recognition preferences of most yeast kinases (Mok et al. Sci. Signal. 2010). They used a peptide screening approach to analyze 61 of about 122 known S. cerevisiae kinases in order to derive, for each one, a position specific scoring matrix (PSSM) describing their substrate recognition preference. In the figure below I show an example for the Hog1 MAPK where it is clear that this kinase prefers to phosphorylate peptides that have proline next to the S/T that is going to be phosphorylated.

Figure 1 - Example of Hog1 substrate recognition preference derived from peptide screens. Each spot in the array contains a mixture of peptides that are randomized at all positions except at the marked position (-5 to +4 relative to the phosphorylatable residue). Strong signal correlates with a preference for phosphorylating peptides containing that amino acid at the fixed position.

As was previously known, most kinases don't appear to have very striking substrate binding preferences. Still, these matrices should allow for significant predictions of kinase-site interactions. They should also allow us to benchmark previous efforts by Neil and other members of the Kobe lab on the structure-based prediction of kinase substrate recognition. For this, I obtained the predicted substrate recognition matrices from the Predikin server and known kinase-site interactions from the PhosphoGrid database. I used this data to compare the predictive power of the experimentally determined kinase matrices (Mok et al.) with the predicted matrices from Predikin. This analysis was done about a year ago when the Mok et al. paper was published, but I don't think PhosphoGrid has been significantly updated since then.
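
To make the idea of scoring a site with a matrix concrete, here is a minimal sketch of PSSM scoring: each position around the phosphosite contributes the matrix value for the amino acid found there, and the site score is the sum. The three-position matrix and its values are invented for the example (loosely mimicking a proline preference at +1); they are not taken from Mok et al. or from Predikin, and the real matrices cover more positions.

```python
def score_site(pssm, window, positions):
    """Sum the PSSM entries for each residue of `window`.

    `positions` gives the position of each residue relative to the
    phosphorylated residue (0). Residues missing from the matrix score 0.
    """
    return sum(pssm[pos].get(aa, 0.0) for pos, aa in zip(positions, window))

# Hypothetical log-odds-like values for a proline-directed kinase.
toy_pssm = {
    -1: {"L": 0.5},
    0: {"S": 1.0, "T": 1.0},
    1: {"P": 2.0},
}
positions = [-1, 0, 1]

good_site = score_site(toy_pssm, "LSP", positions)   # matches the preference
poor_site = score_site(toy_pssm, "ATG", positions)   # only the T contributes
print(good_site, poor_site)
```

The same function can be applied to known and to randomly assigned kinase-site pairs to compare the two score distributions.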

PhosphoGrid had 422 kinase-site interactions for the 61 kinases analyzed in Mok et al., of which ~50% have in-vivo evidence for kinase recognition. As expected, the known kinase-site interactions have a stronger experimental matrix score than random kinase-site assignments (Fig 2).

Figure 2 - The set of kinase-site interactions used, broken down according to the kinases with higher representation. These sites were scored using the experimental matrices along with other randomly selected phosphosites and the scores of both populations are summarized in the boxplots.


A random set of kinase-phosphosite interactions of equal size was used to quantify the predictive power of the experimental and the Predikin matrices with a ROC curve (Fig 3).
Figure 3 - Area under the ROC curve values for kinase-site predictions using both types of matrices.
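
For readers unfamiliar with the measure, the area under the ROC curve can be computed directly from two lists of scores: it is the fraction of (known, random) pairs in which the known interaction scores higher, with ties counted as half. The sketch below uses that pairwise definition with made-up scores; it illustrates the measure and is not the code used for the analysis.

```python
def auroc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical matrix scores for known and randomly assigned kinase-site pairs.
known = [3.5, 2.0, 1.0]
random_pairs = [1.5, 1.0, 0.5]
print(round(auroc(known, random_pairs), 3))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation of known from random pairs.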

Overall, the accuracy of the predicted matrices from Predikin matched reasonably well with those derived from the peptide array experiments, with only a small difference in AROC values. I broke down the predictions for individual kinases with at least 10 known sites. Benchmarking on such low numbers is very unreliable but, apart from the Cka1 kinase, the performance of the Predikin matrices matched the experimental results reasonably well.

I am assuming here that Predikin was not updated with any information from the Mok et al. study to derive their predictions. If this is true it would mean that structure-based prediction of kinase recognition preferences, as implemented in Predikin, is almost as accurate as preferences derived from peptide library approaches.

Friday, January 07, 2011

Why would you publish in Scientific Reports ?

The Nature Publishing Group (NPG) is launching a fully open access journal called Scientific Reports. Like the recently launched Nature Communications, this journal is online only and the authors cover (or can choose to cover for Nat Comm) the cost of publishing the articles in an open access format. Where 'Scientific Reports' differs most is that the journal will not reject papers based on their perceived impact. From their FAQ:
"Scientific Reports publishes original articles on the basis that they are technically sound, and papers are peer reviewed on this criterion alone. The importance of an article is determined by its readership after publication."

If that sounds familiar it should. This idea of post-publication peer reviewing was introduced by PLoS ONE and Nature appears to be essentially copying the format from this successful PLoS journal. Even the reviewing practices are the same, whereby the academic editors can choose to accept/reject based on their opinion or consult external peer reviews. In fact, if I was working at PLoS I would have walked into work today with a bottle of champagne and I would have celebrated. As they say, imitation is the sincerest form of flattery. NPG is increasing their portfolio of open access or open choice journals and hopefully they will start working on article level metrics. In all, this is a victory for the open-access movement and for science as a whole.

As I had mentioned in a previous post, PLoS has shown that to sustain the costs of open access journals with high rejection rates, a publisher also needs to publish higher-volume journals. Both BioMedCentral and more recently PLoS have also shown that high-volume open access publishing can be profitable, so Nature is now trying to get the best of both worlds: brand power from high-rejection-rate journals with a subscription model, and a nice added income from a higher-volume open access journal. If by some chance funders force a complete move to immediate open access, NPG will have a leg to stand on.

So why would you publish in Scientific Reports ? Seriously, can someone tell me ? Since the journal will not filter on perceived impact, they won't be playing the impact factor game. They did not go as far as naming it Nature X, so brand power will not be that high. It is priced similarly to PLoS ONE (until January 2012) and has less author feedback information (i.e. article metrics). I really don't see any compelling reason why I would choose to send a paper to Scientific Reports over PLoS ONE.

Friday, December 31, 2010

End of the year with chemogenomics

Taken from jurvetson at:
www.flickr.com/photos/jurvetson/3156246099/
Around this time of the year it is customary to make an assessment of the year that is ending and to make a mental list of things we wish for in the year ahead. Here is my personal (but work related :) take on this tradition.

My academic year ended with the publication of two works related to chemogenomics. Chemogenomics, or chemical genomics, tries to study the genome-wide response to a compound. Usually, collections of strains with knock-outs or over-expression of a large number of genes are grown in the presence or absence of a small molecule to assess the fitness cost (or advantage) of each perturbation to the drug response. This is what was done in these two works.

In the first one, Laura Kapitzky (a former postdoc colleague in the lab) used a collection of KO strains in both S. cerevisiae and S. pombe to assay for growth in the presence of different compounds. The objective was to study the evolution of the drug response in these distantly related fungi. In line with what was previously observed in the lab for genetic interactions and kinase-substrate interactions, we found that drug-gene functional interactions were poorly correlated across these two species. Perhaps one interesting highlight from this project was that we could combine data from both fungi to improve the prediction of the mode-of-action of the compounds.

The second project, in which I was only minimally involved, was a similar chemogenomic screen but at a much larger scale. As the title implies, "Phenotypic Landscape of a Bacterial Cell" (behind paywall) is a very comprehensive study of the response of the E. coli whole knock-out library against an array of compounds and conditions. Robert, Athanasios and other members of the Carol Gross lab did an amazing job of creating this resource and picking some of the first gems from it.

Something that I wanted to highlight here was not so much what was discovered but what I was left wanting. This sort of growth measurement tells us a lot about drug-gene relationships. We also have a growing knowledge of how genes genetically interact, either from similar growth measurements in double-mutants or from predictions (as in STRING). These should then allow us to make predictions about how drugs interact. If two drugs can act in synergy to decrease the growth of a bug we should be able to rationalize that in terms of drug-gene and gene-gene interactions. I find this a very interesting area of research. Naively, this sort of data should allow us to predict drug combinations that target a specific species (i.e. pathogen) or diseased tissue but not the host or the healthy tissue. Here is a scientific wish for 2011: that these and other related datasets will give us a handle on this interesting problem.

As for the future, I am entering the final year of my current funding source (thank you HFSP) so my attention is turning to finding either some more funds or another job. I will continue working on the evolution of signalling systems, in particular trying to find the function of post-translational modifications (aka P1). Unfortunately the project failed as an open science initiative, something that I have mostly given up on for now. I think the main reason it didn't work was the lack of collaborators with similar (open) interests and non-overlapping skill sets, as Greg and Neil were discussing in the Nodalpoint podcast a while ago.

See you all in 2011 !

Tuesday, December 21, 2010

The GABBA program

I was recently at the annual meeting of my former PhD program, the GABBA program, a Graduate Program in Areas of Basic and Applied Biology in Portugal. I realized that I never blogged about the Portuguese PhD programs and I thought I would share with you their somewhat unusual concept.

Like in other PhD programs, GABBA students start by having courses during the first semester of the program. The semester is divided into week-long courses in different subjects (think Cell-Cycle, Development, etc) with invited teachers. What is different from most other programs I know of is that students then get to use their scholarship to do their research projects anywhere in the world. GABBA students get paid to do their research in any lab that accepts them, no strings attached. No return clause, not even a requirement to inform the program of research progress. There is an annual meeting where students (and alumni) get to go to Portugal to present their work but no one is obliged to go. It is also a nice opportunity to exchange tips and in some cases even start collaborations.

The annual meeting is always organized around Christmas time so most people end up going. I kept going to the meetings after finishing my PhD mostly because I enjoy seeing the people but also because of the cool science. As you can imagine, everyone is scattered around the world in very nice labs doing research in all sorts of biomedical subjects. This year there were a lot of talks about stem cells and an unusually high number of neurobiology-related talks. Some cool research of note for me was, for example, the work of Martina Bradic (Borowsky lab at NYU) on the convergent evolution of blind cave fish and the talk by Andre Sousa (Sestan Lab at Yale) on the transcriptional profiling of human brain regions during development (http://hbatlas.org/).

The GABBA program takes international students as well but they are typically asked to do their research in Portugal. The applications are usually around June so keep an eye out if you are interested in applying. Have a look at the admissions page for more information.

Wednesday, November 24, 2010

This holiday season, make them spit in a tube

Black Friday is upon us and everyone here in the US is going consumer crazy. Along with the traditional discounts in the offline world, there are also tempting promotions in many online stores. One great example is the discount that 23andMe is offering until next Friday. If you have not heard about 23andMe, they are a direct-to-consumer genetics company that sells a SNP profiling service. You get to find out about your ancestry and genetic propensity for traits and some diseases. The analysis usually costs $499 (plus a mandatory one-year subscription at $5 per month) but they are having a $400 discount (use promo code UA3XJH). What better way to spend Christmas than having everyone spit into a little tube.