Thursday, February 23, 2012

Academic value, jobs and PLoS ONE's mission

Becky Ward from the blog "It Takes 30" just posted a thoughtful comment regarding the Elsevier boycott. I like the fact that she adds some perspective as a former editor contributing to the ongoing discussion. This also follows a recent blog post from Michael Eisen regarding academic jobs and impact factors. The title very much summarizes his position: "The widely held notion that high-impact publications determine who gets academic jobs, grants and tenure is wrong". Eisen is trying to play down the value of the "glamour" high impact factor magazines and is fighting for the success of open access journals. It should be a no-brainer really. Scientific studies are mostly paid for by public money, they are evaluated by unpaid peers and they are published and read online. There is really no reason why scientific publishing should be behind pay-walls.

Obviously it is never as simple as it might appear at first glance. If putting science online was the only role publishers played, I could just put all my work up on this blog. While I write up some results as blog posts, I can guarantee you that I would soon be out of a job if I only did that. So there must be other roles that scientific publishing plays, and even if these roles might be outdated or performed poorly, they are needed and must be replaced for us to have a real change in scientific publishing.

The value of scientific publishing

In my view there are 3 main roles that scientific journals are currently playing: filtering, publishing and providing credit. The act of publishing itself is very straightforward and these days could easily cost near zero if the publishers have access to the appropriate software. While publishing itself has benefited greatly from the shift online, filtering and credit are becoming increasingly complex in the online world.

Filtering
Moving to the digital world created a great attention crash that we are still trying to solve. What great scientific advances happened last year in my field ? What about in unrelated fields that I cannot evaluate myself ? I often hear that we should be able to read the literature and come up with answers to these questions directly, without regard to where the papers were published. However, try to imagine for a second that there were no journals. If PLoS ONE and its clones get what they are aiming for, this might be on the way. A quick check on PubMed tells me that 87134 abstracts were made available in the past 30 days. That is something like 2900 abstracts per day ! Which of these are relevant for me ? The current filtering system of tiered journals with increasing rejection rates is flawed, but I think it is clear that we cannot do away with it until we have another in place.

Credit attribution
The attribution of credit is also intimately linked to the filtering process. Instead of asking about individual articles or research ideas, credit is about giving value to researchers, departments or universities. The current system is flawed because it overvalues the impact/prestige of the journals where the research gets published. Michael Eisen claims that impact factors are not taken into account when researchers are picked for group leader positions but honestly this idea does not ring true to me. From my personal experience of applying for PI positions (more on that later), those that I see getting shortlisted for interviews tend to have papers in high-impact journals. On Twitter Eisen replied to this comment by saying "you assume interview are because of papers, whereas i assume they got papers & interviews because work is excellent". So either high impact factor journals are being incorrectly used to evaluate candidates or they are working well to filter excellent work. In either case, if we are to replace the current credit attribution system we need some other system in place.

Article level metrics
So how do we do away with the current focus on impact factors for both filtering and credit attribution? Both of those could be solved if we could focus on evaluating articles instead of journals. The mission of PLoS ONE was exactly to develop article level metrics that would allow for a post-publication evaluation system. As they claim on their webpage, they want "to provide new, meaningful and efficient mechanisms for research assessment". To their credit, PLoS has been promoting the idea and making some article level indicators easily accessible, but I have yet to see a concrete plan to provide readers with a filtering/recommendation tool. As much as I love PLoS and try to publish in their journals as much as possible, in this regard PLoS ONE has so far been a failure. If PLoS and other open access publishers want to fight Elsevier and promote open access they have to invest heavily in filtering/recommendation engines. Partner with academic groups and private companies with similar goals (ex. Mendeley ?) if need be. With PLoS ONE they are contributing to the attention crash and making (finally) a profit off of it. It is time to change your tune, stop saying how big PLoS ONE is going to be next year and start saying how you are going to get back on track with your mission of post-publication filtering.

Summary
Without replacing the current filtering and credit attribution roles of traditional journals we won't do away with the need for a tiered structure in scientific publishing. We could still have open access tiered systems, but the current trend for open access journals appears to be the creation of large journals focused on the idea of post-publication peer review, since this is economically viable. However, without filtering systems, PLoS ONE and its many clones can only contribute to the attention crash problem and do not solve the issue of credit attribution. PLoS ONE's mission demands that they work on filtering/recommendation and I hope that, if nothing else, they can focus their message, marketing efforts and partnerships on this problem.

Wednesday, February 22, 2012

The 2012 Bioinformatics Survey

I am interrupting my current blogging hiatus to point to a great initiative by Michael Barton. He is collecting some information regarding those working in the fields of bioinformatics / computational biology in this survey. This is a repeat of a similar analysis done in 2008 and I think it is really worth getting a feeling for how things have been changing. We can all benefit from the end result. So far, after 2 weeks, there have been close to 400 entries to the survey but the rate of new entries is slowing down. So, if you have not done so already, go and fill it out or bug some colleague to do so.

Wednesday, May 25, 2011

Predicting kinase specificity from phosphorylation data

Over the past few years, improvements in mass-spectrometry methods have resulted in a big increase in throughput for the identification of post-translational modifications (PTMs). It is even hard to keep up with all the phosphoproteomics papers and the accumulation of phosphorylation data. Most often, improvements in methods result in interesting challenges and opportunities. In this case, how can we make use of this explosion in PTM data ? I will try to explore a fairly straightforward idea, on how to use phosphorylation data to predict kinase substrate specificity. I'll describe here the general idea and just the first stab at it to show that I think it can work.

The inspiration for this is the work by Neduva and colleagues, who have shown that we can search for enriched motifs within proteins that interact with the domain of interest. For example, we can take a protein containing an SH3 domain, find all of its interaction partners and we will likely see that they are enriched for proline rich motifs of the type PXXP (X = any amino-acid), which is the known binding preference for this domain. So the very obvious application to kinases would be to take the interaction partners of a kinase and find enriched peptide motifs. The advantage of looking at kinases, over any other type of peptide binding domain, is that we can focus specifically on phosphosites.
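
To make the idea concrete, here is a minimal sketch of this kind of enrichment test for the PXXP example. It is not the method used by Neduva and colleagues, just a plain hypergeometric test, and the sequences, protein names and function names below are all made up for illustration.

```python
import re
from math import comb

# Lookahead-free pattern is enough here: we only ask whether a protein
# contains at least one PXXP occurrence (X = any amino acid).
PXXP = re.compile(r"P..P")


def has_motif(sequence):
    """True if the sequence contains at least one PXXP motif."""
    return PXXP.search(sequence) is not None


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the motif."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def motif_enrichment(partners, background):
    """Counts and p-value; partners must be a subset of background."""
    N = len(background)
    K = sum(has_motif(seq) for seq in background.values())
    n = len(partners)
    k = sum(has_motif(seq) for seq in partners.values())
    return k, n, K, N, hypergeom_tail(k, K, n, N)


# Toy proteome (invented sequences, not real proteins).
background = {
    "A": "MKPLLPAAG",
    "B": "MSTPAAPGG",
    "C": "MGGSSTTAA",
    "D": "MAAKKLLEE",
    "E": "MPGGPKKAA",
    "F": "MTTSSGGAA",
}
# Pretend A, B and C are the interaction partners of the SH3 protein.
partners = {name: background[name] for name in ("A", "B", "C")}

k, n, K, N, p_value = motif_enrichment(partners, background)
print(f"{k}/{n} partners vs {K}/{N} background proteins carry PXXP, p = {p_value:.2f}")
```

With a real proteome the background would be all proteins of the species, and one would also want to correct for sequence length and for testing many candidate motifs, neither of which this sketch attempts.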

As a test case I picked the S. cerevisiae Cdc28p (Cdk1), which is known to phosphorylate the motif [ST]PXK. I used the STRING database to identify proteins that functionally interact with Cdc28 with a cut-off of 0.9 and retrieved all currently known phosphosites within these proteins. As a quick check I used Motif-X to search for enriched motifs. The first try was somewhat disappointing, but after removing phosphosites that had fewer than 5 MS spectra and/or experiments supporting them I got back this logo as the most enriched motif:

This was probably the easiest kinase to try since it is known that it typically phosphorylates its targets at multiple sites and it is heavily studied. Still, I think there is a lot of room for exploration here. If anyone is interested in collaborating on this let me know. If you're doing computational work I would be interested in some code/tools for motif enrichment. If you're doing experimental work let me know about your favorite kinases/species.
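
As a starting point for that kind of tool, the filtering step described above could be sketched roughly as follows. This is not what Motif-X does; it only drops poorly supported sites and checks how many of the remaining ones sit in a [ST]PXK context. The sequences, the site records and the `n_spectra` field are invented for the example.

```python
import re
from collections import namedtuple

# One record per phosphosite; position is the 1-based index of the S/T.
# n_spectra stands in for "MS spectra and/or experiments supporting the site".
Site = namedtuple("Site", "protein position n_spectra")

# Known Cdk1 motif, anchored on the phosphoacceptor residue.
CDK_MOTIF = re.compile(r"[ST]P.K")


def filter_sites(sites, min_support=5):
    """Keep sites supported by at least min_support spectra/experiments."""
    return [site for site in sites if site.n_spectra >= min_support]


def matches_cdk_motif(sequence, position):
    """True if the residue at position starts a [ST]PXK match."""
    return CDK_MOTIF.match(sequence[position - 1:]) is not None


def motif_fraction(sites, sequences):
    """Fraction of sites that lie in a [ST]PXK context."""
    if not sites:
        return 0.0
    hits = sum(matches_cdk_motif(sequences[s.protein], s.position) for s in sites)
    return hits / len(sites)


# Toy data (made-up sequences and counts).
sequences = {
    "P1": "MASPEKLLTPAKGG",
    "P2": "MKKSAAGGTLLE",
}
sites = [
    Site("P1", 3, 12),
    Site("P1", 9, 6),
    Site("P2", 4, 7),
    Site("P2", 9, 1),
]

well_supported = filter_sites(sites)
print("all sites:", motif_fraction(sites, sequences))
print("well supported:", motif_fraction(well_supported, sequences))
```

A proper enrichment analysis would compare these fractions against a background set of phosphosites, which is left out here.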

Thursday, April 28, 2011

In defense of 'Omics

High-throughput studies tend to have a bad reputation. They are often derided as little more than fishing expeditions. Few have summarized these feelings as sharply as Sydney Brenner:
"So we now have a culture which is based on everything must be high-throughput. I like to call it low-input, high-throughput, no-output biology"
Having dealt with this type of data for so long, I am often in the strange position of having to defend the approaches. As I was in real need to procrastinate, I decided to try to write some of these thoughts down.

Error rates
One of the biggest complaints directed at large-scale methods is that they have very high error rates. Usually these complaints come from scientists interested in studying system X or protein Y, that dig into these datasets only to find out that their protein of interest is missing. Are the error rates high ? While this might be true for some methods it is important to note that the error rates are almost always quantified and that those developing the methods keep pushing the rates down.

When thinking about 'small-scale' studies I could equally ask - why should I trust a single western blot image ? How many westerns were put in the garbage bin before you got that really nice one that is featured in the paper ? In fact, some methods for reducing the error only become feasible when operating in high-throughput. As an example, when conducting pull-down experiments to determine protein-protein interactions, unspecific binding becomes much easier to call. This has led to the development of analysis tools that cannot be employed on single pull-down experiments.

So, by quantifying the error rates and driving these down via experimental or analysis improvements, 'omics research is in fact at the forefront of data quality. At the very least, you know what the error rate is and can use the information accordingly. Once the methods are improved to an extent that the errors are negligible or manageable, they are quietly no longer considered 'omics. The best example of this I think is genome sequencing. Even with the current issues with next-gen sequencing, few put 'traditional' genome sequencing in the same bag with the other 'omics tools, although it has quantifiable errors.

Standardization
Related to error quantification is standardization. To put it simply, large-scale data is typically deposited in databases and is available for re-use. What is the point of having really careful experiments if they will only be available for re-use, in any significant way, when a (potentially sloppy) curator digs the info out of papers ? This availability fuels research by others who are not set up to perform the measurements. This is one of the reasons why bioinformatics thrives. The limitations become the ideas, not the experimental observations/measurements. Anyone can sit down, think of a problem and, with some luck, the required measurements (or a proxy of them) have been made by others for some unrelated purpose. This is why publications of large-scale studies are so highly cited: they are re-used over and over again.

Engineering mindset and costs
One other very common complaint about these methods is cost. It is common to feel that 'omics research is 'trendy', expensive and consumes too much of the science budgets. While the part about budget allocation might be true, the issue with costs is most certainly not. Large-scale methods are developed by people with an engineering mindset. The problems in this type of research are typically on how to make the methods work effectively, which includes making them cheaper, smaller, faster, etc. 'Omics research drives costs down.

Cataloging diversity
Besides these technical comments, the highest barrier to deal with when discussing these methods with others is a conceptual one. Is there such a thing as 'hypothesis free' research ? To address this point let me go off on a small tangent. I am currently reading a neuroscience book - Beyond Boundaries - by Miguel Nicolelis, a researcher at Duke University. I will leave a proper review for some later post but, at some point, Nicolelis talks about the work of Santiago Ramon y Cajal. Ramon y Cajal is usually referred to as the father of the neuron theory, which postulates that the nervous system is made up of fundamental discrete units (neurons). His drawings of neuronal circuits of different species are famous and easily recognizable. The amazing level of detail and effort that he put into these drawings really underscores his devotion to cataloging diversity. These observations inspired a revolution in neuroscience, much the same way Darwin's catalogs of diversity impacted biology. Should we not build catalogs of protein-interactions, gene-expression, post-translational modifications, etc ? I would argue that we must. 'Omics research drives errors and prices down and creates catalogs of easily accessible and re-usable observations that fuel research. I actually think that it frees researchers. While a few specialize in method development, others are free to dream up biological problems to solve, with the data gathering effort shortened to a digital query.

Misunderstandings
So why the negative connotations ? Part of it is simple backlash against the hype. As we know, most technologies tend to follow a hype cycle where early exaggerated excitement is usually followed by disappointment and backlash when they fail to deliver. A second important aspect is simply a lack of understanding of how to make use of the available data. This model of data generation separated from the problem solving and analysis only makes sense if researchers can query the repositories and integrate the data into their research. It is sad to note that this capacity is far from universal. While new generations are likely to bring with them a different mindset, those developing the large scale methods should also bear the responsibility of improving the re-usability of the data. 

Thursday, March 03, 2011

Structure based prediction of kinase interactions

About a year ago Ben Turk's lab published a large scale experimental effort to determine the substrate recognition preferences of most yeast kinases (Mok et al. Sci. Signal. 2010). They used a peptide screening approach to analyze 61 of about 122 known S. cerevisiae kinases in order to derive, for each one, a position specific scoring matrix (PSSM) describing their substrate recognition preference. In the figure below I show an example for the Hog1 MAPK where it is clear that this kinase prefers to phosphorylate peptides that have proline next to the S/T that is going to be phosphorylated.

Figure 1 - Example of Hog1 substrate recognition preference derived from peptide screens. Each spot in the array contains a mixture of peptides that are randomized at all positions except at the marked position (-5 to +4 relative to the phosphorylatable residue). Strong signal correlates with a preference for phosphorylating peptides containing that amino-acid at the fixed position.
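
For readers who have not used a PSSM before, scoring a candidate site comes down to summing position-specific weights over the window around the phosphoacceptor. The sketch below shows one common way to do this (a sum of log2 weights). The weights are invented for illustration and are not the values from Mok et al.; the only feature borrowed from the figure is the strong preference for proline at +1.

```python
from math import log2

# Hypothetical weights (signal relative to the array average), keyed by
# position relative to the phosphoacceptor (0). These numbers are made up.
# Any position/residue pair not listed is treated as neutral (weight 1.0).
TOY_PSSM = {
    -2: {"P": 2.0},
    +1: {"P": 8.0},
    +3: {"K": 0.5},
}


def score_site(sequence, position, pssm, start=-5, end=4):
    """Sum of log2 weights over the -5..+4 window around a 1-based S/T position."""
    centre = position - 1
    if sequence[centre] not in "ST":
        raise ValueError("position does not point at a serine or threonine")
    score = 0.0
    for offset in range(start, end + 1):
        if offset == 0:
            continue  # the phosphoacceptor itself is not scored
        index = centre + offset
        if 0 <= index < len(sequence):  # skip positions past the protein ends
            weight = pssm.get(offset, {}).get(sequence[index], 1.0)
            score += log2(weight)
    return score


# Toy peptides: the first has P at -2 and +1 and K at +3, the second is neutral.
print(score_site("AAAPASPAKAAA", 6, TOY_PSSM))
print(score_site("AAAAASAAAAAA", 6, TOY_PSSM))
```

Positive scores mean the site looks more like a preferred substrate than an average peptide, negative scores the opposite.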

As was previously known, most kinases don't appear to have very striking substrate binding preferences. Still, these matrices should allow for significant predictions of kinase-site interactions. They should also allow us to benchmark previous efforts by Neil and other members of the Kobe lab on the structure-based prediction of kinase substrate recognition. For this, I obtained the predicted substrate recognition matrices from the Predikin server and known kinase-site interactions from the PhosphoGrid database. I used this data to compare the predictive power of the experimentally determined kinase matrices (Mok et al.) with the predicted matrices from Predikin. This analysis was done about a year ago when the Mok et al. paper was published, but I don't think PhosphoGrid has been significantly updated since then.

PhosphoGrid had 422 kinase-site interactions for the 61 kinases analyzed in Mok et al., of which ~50% have in-vivo evidence for kinase recognition. As expected, the known kinase-site interactions have a stronger experimental matrix score than random kinase-site assignments (Fig 2).

Figure 2 - The set of kinase-site interactions used, broken down according to the kinases with higher representation. These sites were scored using the experimental matrices along with other randomly selected phosphosites and the scores of both populations are summarized in the boxplots.


A random set of kinase-phosphosite interactions of equal size was used to quantify the predictive power of the experimental and the Predikin matrices with a ROC curve (Fig 3).

Figure 3 - Area under the ROC curve values for kinase-site predictions using both types of matrices.
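
For anyone wanting to repeat this kind of benchmark, the area under the ROC curve can be computed directly from the two score populations: it equals the probability that a randomly picked known interaction scores higher than a randomly picked random assignment. Below is a minimal sketch of that calculation. The scores are invented and this is not the code used for the figure.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy matrix scores (made up): known kinase-site pairs vs random pairs.
known_scores = [3.0, 2.5, 1.0, 0.5]
random_scores = [2.0, 0.5, 0.0, -1.0]

print(roc_auc(known_scores, random_scores))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation. The pairwise loop is quadratic, which is fine for a few hundred sites; a rank-based version would be preferable for much larger sets.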

Overall, the accuracy of the predicted matrices from Predikin matched reasonably well with those derived from the peptide array experiments with only a small difference in AROC values. I broke down the predictions for individual kinases with at least 10 sites known. Benchmarking of such low numbers becomes very unreliable but besides the Cka1 kinase, the performance of the Predikin matrices matched reasonably well the experimental results.

I am assuming here that Predikin was not updated with any information from the Mok et al. study to derive their predictions. If this is true it would mean that structure-based prediction of kinase recognition preferences, as implemented in Predikin, is almost as accurate as preferences derived from peptide library approaches.

Friday, January 07, 2011

Why would you publish in Scientific Reports ?

The Nature Publishing Group (NPG) is launching a fully open access journal called Scientific Reports. Like the recently launched Nature Communications, this journal is online only and the authors cover (or can choose to cover for Nat Comm) the cost of publishing the articles in an open access format. Where 'Scientific Reports' differs most is that the journal will not reject papers based on their perceived impact. From their FAQ:
"Scientific Reports publishes original articles on the basis that they are technically sound, and papers are peer reviewed on this criterion alone. The importance of an article is determined by its readership after publication."

If that sounds familiar it should. This idea of post-publication peer reviewing was introduced by PLoS ONE and Nature appears to be essentially copying the format from this successful PLoS journal. Even the reviewing practices are the same whereby the academic editors can choose to accept/reject based on their opinion or consult external peer reviews. In fact, if I was working at PLoS I would have walked into work today with a bottle of champagne and I would have celebrated. As they say, imitation is the sincerest form of flattery. NPG is increasing their portfolio of open access or open choice journals and  hopefully they will start working on article level metrics. In all, this is a victory for the open-access movement and to science as a whole.

As I had mentioned in a previous post, PLoS has shown that one way to sustain the costs of open access journals with high rejection rates is for a publisher to also publish higher-volume journals. Both BioMedCentral and more recently PLoS have also shown that high-volume open access publishing can be profitable, so Nature is now trying to get the best of both worlds: brand power from high-rejection rate journals with a subscription model and a nice added income from a higher-volume open access journal. If by some chance funders force a complete move to immediate open access, NPG will have a leg to stand on.

So why would you publish in Scientific Reports ? Seriously, can someone tell me ? Since the journal will not filter on perceived impact, they won't be playing the impact factor game. They did not go as far as naming it Nature X so brand power will not be that high. It is priced similarly to PLoS ONE (until January 2012) and has less author feedback information (i.e. article metrics). I really don't see any compelling reason why I would choose to send a paper to Scientific Reports over PLoS ONE.

Friday, December 31, 2010

End of the year with chemogenomics

Taken from jurvetson at:
www.flickr.com/photos/jurvetson/3156246099/
Around this time of the year it is customary to make an assessment of the year that is ending and to make a mental list of things we wish for in the year ahead. Here is my personal (but work related :) take on this tradition.

My academic year ended with the publication of two works related to chemogenomics. Chemogenomics or chemical genomics tries to study the genome-wide response to a compound. Usually, collections of knock-outs or over-expression strains for a large number of genes are grown in the presence or absence of a small molecule to assess the fitness cost (or advantage) of that perturbation to the drug response. This is what was done in these two works.

In the first one, Laura Kapitzky (a former postdoc colleague in the lab) used a collection of KO strains both in S. cerevisiae and S. pombe to assay for growth in the presence of different compounds. The objective was to study the evolution of the drug response in these distantly related fungi. In line with what was previously observed in the lab for genetic-interactions and kinase-substrate interactions, we found that drug-gene functional interactions were poorly correlated across these two species. Perhaps one interesting highlight from this project was that we could combine data from both fungi to improve the prediction of the mode-of-action of the compounds.

The second project, in which I was only minimally involved, was a similar chemogenomic screen but at a much larger scale. As the title implies, "Phenotypic Landscape of a Bacterial Cell" (behind paywall) is a very comprehensive study of the response of the E. coli whole knock-out library against an array of compounds and conditions. Robert, Athanasios and other members of the Carol Gross lab did an amazing job of creating this resource and picking some of the first gems from it.

Something that I wanted to highlight here was not so much what was discovered but what I was left wanting. These sorts of growth measurements tell us a lot about drug-gene relationships. We also have a growing knowledge of how genes genetically interact, either from similar growth measurements in double-mutants or from predictions (as in STRING). These should then allow us to make predictions about how drugs interact. If two drugs can act in synergy to decrease the growth of a bug, we should be able to rationalize that in terms of drug-gene and gene-gene interactions. I find this a very interesting area of research. Naively, these sorts of data should allow us to predict drug combinations that target a specific species (i.e. pathogen) or diseased tissue but not the host or the healthy tissue. Here is a scientific wish for 2011: that these and other related datasets will give us a handle on this interesting problem.

As for the future, I am entering the final year of my current funding source (thank you HFSP) so my attention is turning to finding either some more funds or another job. I will continue working on the evolution of signalling systems, in particular trying to find the function of post-translational modifications (aka P1). Unfortunately the project failed as an open science initiative, something that I have mostly given up on for now. I think the main reason it didn't work was the lack of collaborators with similar (open) interests and non-overlapping skill sets, as Greg and Neil were discussing in the Nodalpoint podcast a while ago.

See you all in 2011 !