Saturday, July 29, 2006

The likelihood that two proteins interact might depend on the proteins' age - part 2

Abstract
It has been previously shown [1] that S. cerevisiae proteins preferentially interact with proteins of the same estimated time of origin. Using a similar approach but focusing on a less broad evolutionary time span I observed that the likelihood for protein interactions depends on the proteins’ age. I had shown this previously for the interactome of S. cerevisiae [2] and here I extend the analysis to show that the same is also observed for the interactome of H. sapiens. Importantly, the observation does not depend on the experimental method used, since removing the yeast-two-hybrid interactions does not alter the result.

Methods and Results
Protein-protein interactions for H. sapiens were obtained from the Human Protein Reference Database and from two high-throughput studies, excluding any interactions derived from protein complexes. I considered only proteins that were represented in this interactome (i.e. with one or more interactions).
As before I created groups of H. sapiens proteins with different average age using the reciprocal best blast hit method to determine the most likely ortholog in eleven other eukaryotic species (see figure 1 for species names). For a more detailed description of the group selection and the construction of the phylogenetic tree please see the previous post [2].
It is important to note that the placement of C. familiaris does not correspond with other published phylogenetic trees. This might be due to the proteins selected for the tree construction. I should consider using different combinations of ancestral proteins to check the robustness of the tree.

In table 1 we can see the likelihood for protein interactions to occur within the ancestral proteins of group A and between the ancestral proteins and other groups of decreasing average age. As published by Qin et al. [1] and as I had observed before for S. cerevisiae, interactions within a group of the same age (group A) are more likely than interactions between groups of proteins with different times of origin. Also, the likelihood for a protein to interact with an ancestral protein depends on the age of this protein, confirming the previous observation that the younger the protein is, the less likely it is to interact with an ancestral protein.

I redid the analysis excluding yeast-two-hybrid interactions from the dataset. As can be seen in table 2, the results are qualitatively the same. There is a small increase in the likelihood of interaction with the ancestral proteins for the youngest group (highlighted in red in table 2) that is likely due to lack of data.


Caveats and possible continuations

I still have to test the statistical significance of these observations and control for possible other effects like protein size and protein expression that could explain these results.
I am interested in continuing this further as an open project. Following the suggestion of Roland Krause I will soon start a wiki page to dump the data bits accumulated for open discussion. Hopefully more people will join in and maybe together we can shape up a small communication.

[1] Qin H, Lu HH, Wu WB, Li WH. Evolution of the yeast protein interaction network. Proc Natl Acad Sci U S A. 2003 Oct 28;100(22):12820-4. Epub 2003 Oct 13
[2] Beltrao P. The likelihood that two proteins interact might depend on the proteins' age. Blog post

Friday, July 28, 2006

Binding specificity and complexity

There is a paper out in PNAS about the distribution of free energy of binding for the yeast-two-hybrid datasets. Although I still have to dig into the model they used I found the result quite interesting. They observe that the average binding energy decreases with cellular complexity.
They have some sentences in there that made my hair stand on end, like: "more evolved organisms have weaker binary protein-protein binding". What does "more evolved" mean? Also, in figure 4 of the paper they plot mu (a parameter related to the average binding energy) over divergence times without saying what species they are comparing.

This result fits well with another paper published a while ago in PLoS Comp Bio about protein family expansions and complexity. Christine Vogel and Cyrus Chothia show (among other things) which protein domain expansions best correlate with complexity. They used cell numbers as a proxy for species complexity. If you look at the top of the list (in table 2) you can find several of the peptide binding domains, known to be of low specificity, given that they do not require a folded structure to interact with.

What I would like to know is the correlation between binding affinity and binding specificity. For example SH2 domains bind much more tightly than SH3 domains although they are both not very specific binding domains. Maybe in general it could be said that average lower binding affinities correspond to lower average binding specificity.

Why would complexity correlate with binding specificity? I think one important factor is cellular size. An increase in size has allowed for exploration of spatial factors in determining cellular response. Specificity of binding in the real cell (not in binary assays) is determined also by localization at subcellular structures.

One practical reminder coming from this is that even if we have the perfect method to determine biophysical binding specificity we are still going to get poor results if we cannot predict all the other components that will determine whether the two proteins will bind or not (e.g. localization, expression).

TOPAZ and PLoS ONE

According to the PLoS blog the new PLoS ONE will be accepting submissions soon. I guess they will at the same time release the TOPAZ system that will likely be available here.

"TOPAZ will serve the rapidly growing demand for sophisticated tools and resources to read and use the scientific and medical literature, allowing scholarly publishers, societies, universities, and research communities to publish open access journals economically and efficiently."


Sunday, July 23, 2006

Opening up the scientific process

During my stay at the EMBL, for the past couple of years, it has already happened more than once that people I know have been scooped. This simply means that all the hard work that they have been doing was already done by someone else who managed to publish it a bit sooner and therefore severely limited the usefulness of their discoveries. Very few journals are interested in publishing research that merely confirms other published results.

From talking to other people, I have come to accept that scooping is a part of science. There is no other possible conclusion from this but to accept that the scientific process is very flawed. We should not be wasting resources literally racing with each other to be the first person to discover something. When you try to explain to non-scientists that it is very common to have 3 or 4 labs doing exactly the same thing, they usually have a hard time integrating this with their perception of science as the pursuit of knowledge through collaboration.
I am probably naïve given that I have only been doing this for a couple of years, but I don’t pretend to say that we do not need competition in science. We need to keep each other in check exactly because lack of competition leads to waste of resources. I would argue however that right now the scientific process is creating competition at the wrong levels, decreasing the potential productivity.

So how do we work and what do we aim to produce? We are in the business of producing manuscripts accepted in peer reviewed journals. To have competition there must be a scarce element. In our case the limited element is the attention of fellow scientists. Given that scientists’ attention is scarce we all compete for the limited amount of time that researchers have to read papers every week. So the good news is that the system tends to give credit to high quality manuscripts. This means that research projects and ongoing results should be absolutely confidential and everything should be focused on getting that Science or Nature paper.
I found a beautiful drawing of an iceberg (used here with permission from the author, David Fierstein) that I think illustrates the problem we have today by focusing the competition on the manuscripts. Only a small fraction of the research process is in view.


Wouldn’t it be great if we could find a way to make most of the scientific process public but at the same time guarantee some level of competition? What I think we could do would be to define steps in the process that we could say are independent, which can work as modules. Here I mean module in the sense of a black box with inputs and outputs that we wire together without caring too much about how the internals of the boxes work. I am thinking these days about these modules and here is a first draft of what this could look like:


The data streams would be, as the name suggests, a public view of the data being produced by a group or individual researcher. Blogs are a simple way this could be achieved today (see for example this blog). The manuscripts could be built in wikis by selection of relevant data bits from the streams that fit together to answer an interesting question. This is where I propose that the competition would come in. Only those relevant bits of data that better answer the question would be used. The authors of the manuscript would be all those that contributed data bits or in some other way contributed to the manuscript creation. In this way all the data would be public and still a healthy level of competition would be maintained.
The rest of the process could go on in public view. Versions of the manuscript deemed stable could be deposited in a pre-print server and comments and peer review would commence. Later there could still be another step of competition to get the paper formally accepted in a journal.

One advantage of this is that it is not a revolution of the scientific process. People could still work in their normal research environment closed within their research groups. This is just a model of how we could extend the system to make it mostly open and public. The technologies are all here: structured blogging for the data streams, wikis for the manuscripts and online communities to drive the research agendas.

I think it is important to view the scientific process as a group of modules also because it allows us later to think of different ways to wire the modules together. Increasing the modularity should permit us to innovate. For example we can later think of ways that the data streams are brought together to answer questions, etc.


Friday, July 21, 2006

Bio::Blogs #2 - call for submissions

(via Nodalpoint) This is just a quick reminder that we have 10 days to submit links to the second edition of Bio::Blogs. You can send your suggestions to bioblogs {at} gmail.com. Also if you wish to host future editions send in a quick email with your name and link to your blog to the same email address.

Monday, July 17, 2006

Conference on Systems Biology of Mammalian Cells

There was a Systems Biology conference here in Heidelberg last week. For those interested the recorded talks are now available on their site. There are a lot of interesting things about the behavior of network motifs and about network modeling.

Sunday, July 16, 2006

Blog changes

Notes from the Biomass is back again on a new website. I was cleaning the links on the blog to better reflect what I am actually reading and while I was at it I changed the template. It looks better in IE than in Firefox but I really have neither the time nor the ability to work on a good design.

Tuesday, July 11, 2006

Defrag my life

I am taking the week to visit my former lab in Aveiro, Portugal where I spent one year trying to understand how a codon reassignment occurred in the evolutionary past of C. albicans. This was where I first got into Perl and the wonders of comparative genomics.

It brings back a lot of memories every time I come back to one of the cities I lived in before (6 cities and counting) and I sometimes wonder if it is really necessary for scientists to live such fragmented lives.

reboot, restart, new program.

The regular programming will return soon :).

Tuesday, July 04, 2006

Re: The ninth wave

I usually read Gregory A. Petsko's comments and editorials in Genome Biology that are unfortunately only available with subscription. In the last edition of the journal he wrote a comment entitled "The ninth wave". I have lived most of my life 10 min away from the Atlantic ocean and at least to my recollection we used to talk about the 7th wave, not the ninth, as the biggest wave in a set of waves, but this is not the point :).
Petsko argues that the increase of free access to information on the web and of computer savvy investigators presents a clear danger of a flood of useless correlations hinting at potential discoveries never followed by careful experimental work:
Computational analysis of someone else's data, on the other hand, always produces results, and all too often no one but the cognoscenti can tell if these results mean anything.

This reminded me of a review I read recently from Andy Clark (via Evolgen). Andy Clark talks about the huge increase of researchers in comparative genomics:
...one of its worst disasters is that it has created a hoard of genomics investigators who think that evolutionary biology is just fun, speculative story telling. Sadly, much of the scientific publication industry seems to respond to the herd as much as it does to scientific rigor, and so we have a bit of a mess on our hands.

I have a feeling that this is the opinion of a lot of researchers. There is this generalized consensus that people working on computational biology have it easy. Sitting at the computer all day, inventing correlations with other people's data.
Maybe some people feel this way because it is relatively fast to go from idea to result using computers if you have clearly in mind what you want to test, while the experimental work certainly takes longer.
Why should I re-do the experimental work if I can answer a question that I think is interesting using available information? I should be criticized if I try to overinterpret the results, if the methods used are not appropriate or if the question is not relevant, but I should not be criticized for looking for an answer the fastest way I can.

Monday, July 03, 2006

Journal policies on preprint servers (2)

Recently I did a survey on the different journal policies regarding preprint servers. I am interested in this because I feel it is important to separate the peer review process from the time-stamping (submission) of a scientific communication. Establishing this separation allows for exploration of alternative and parallel ways of determining the value of a scientific communication. This is only possible if journals accept manuscripts previously deposited in pre-print servers.
Today I received the answer from Bioinformatics:
"The Executive Editors have advised that we will allow authors to submit manuscripts to a preprint archive."


If you also think that this model, already very established in physics and maths, is useful you can also send some emails to your journals of interest to enquire about their policies. If enough authors voice their interest there will be more journals accepting manuscripts from pre-print servers.
I think we are now lacking a biomedical preprint server. The Genome Biology journal also served as a preprint server until early this year but they discontinued this practice. Maybe arxiv could expand to include biomedical manuscripts (they already accept quantitative biology manuscripts).

Saturday, July 01, 2006

Bio::Blogs # 1

An editorial of sorts

Welcome to the first edition of Bio::Blogs, a blog carnival covering all subjects related to bioinformatics and computational biology. The main objectives of Bio::Blogs are, in my opinion, to help knit together the bioinformatics blogging community and to showcase some interesting posts on these subjects to other communities. Hopefully it will serve as incentive for other people in the area to start their own blogs and to join in the conversation.

I get to host this edition and I decided to format it more or less like a journal with three sections: 1) Conference reports; 2) Primers and reviews; 3) Blog articles. I think this reflects also my opinion on what could be a future role of these carnivals, to serve as a path for certification of scientific content parallel to the current scientific journals.

Given that there were so few submissions I added some links myself. Hopefully in the next editions we can get some more publicity and participation :). Talking about future editions, the second edition of Bio::Blogs will be hosted by Neil and we have now a full month to make something up in our blogs and submit the link to bioblogs{at}gmail{dot}com.


Conference Reports
I selected a blog post from Alf describing what was discussed in a recent conference dedicated to Data Webs. There is a lot of information about potential ways to deal with the increase of data submitted all over the web in many different formats. I remember seeing the advert for this conference and I was intrigued to see Philip Bourne, the editor-in-chief of PLoS Computational Biology, among the speakers. I see now that he is involved in publishing tools under development in PLoS.

Primers & Reviews
Stew from Flags and Lollipops sent in this link to a review on the use of bioinformatics to hunt for disease related genes. He highlights a series of tools and methods that can be used to prioritize candidate genes for experimental validation.

Neil, the next host of Bio::Blogs, spent some time with the BioPerl package called Bio::Graphics. He dedicated a blog entry to explaining how to create graphics for your results with this package. He gives examples of how to make graphic representations of sequences mapped with BLAST hits and phosphorylation sites.

Chris, a regular at Nodalpoint, nominated a post in Evolgen:
Evolgen has an interesting post about the relative importance (and interest in) cis and trans acting genetic variation in evo-devo. A lot of (computational) energy has thus far been expended in finding regulatory motifs close to genes (ie, within promoter regions), and conserved elements in non-coding sequences. Rather predictably, cis-acting variants have received the lion's share of attention, probably because they present a more tractable problem. The post deals with work from the evo-devo and comparative genomics fields, but these problems have also been attacked from within-species variation perspectives, particularly the genetics of gene expression. But that's next month's post...

Blog articles
I get to link to my last post. I present some very preliminary results on the influence of protein age on the likelihood of protein-protein interactions. Have fun pointing out all the likely flaws in reasoning and hopefully useful ways to build on it.

To wrap things up here is an announcement by Pierre of a possibly useful applet implementing a Genetic Programming Algorithm. If you ever wanted to play around with genetic programming you can have a go with his applet.


That is it for this month. It is a short Bio::Blogs but I hope you find some of these things useful. Don’t forget to submit the links for the next edition before the end of July. Neil will take up the editorial role for #2 in his blog. If you know of a nice symbol that we might use for Bio::Blogs send it in as well.

The likelihood that two proteins interact might depend on the proteins' age

Abstract
It has been previously shown[1] that S. cerevisiae proteins preferentially interact with proteins of the same estimated likely time of origin. Using a similar approach but focusing on a less broad evolutionary time span I observed that the likelihood for protein interactions depends on the proteins’ age.

Methods and Results
Protein-protein interactions for S. cerevisiae were obtained from BIND, excluding any interactions derived from protein complexes. I considered only proteins that were represented in this interactome (i.e. with one or more interactions).
In order to create groups of S. cerevisiae proteins with different average age I used the reciprocal best blast hit method to determine the most likely ortholog in eleven other yeast species (see figure 1 for species names).
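The reciprocal best hit idea can be sketched in a few lines of Python. This is only an illustration, not the script used for this analysis: it assumes the BLAST results have already been parsed into (query, subject, bit score) tuples, and the function names and the toy identifiers (YFG1, ortA, etc.) are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples,
    assumed to be parsed beforehand from BLAST output.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return {protein: ortholog} for pairs that are each other's best hit."""
    forward = best_hits(forward_hits)   # species 1 searched against species 2
    reverse = best_hits(reverse_hits)   # species 2 searched against species 1
    return {q: s for q, s in forward.items() if reverse.get(s) == q}


# Toy example with made-up identifiers and scores.
forward_hits = [
    ("YFG1", "ortA", 250.0),
    ("YFG1", "ortB", 90.0),
    ("YFG2", "ortB", 120.0),
]
reverse_hits = [
    ("ortA", "YFG1", 245.0),
    ("ortB", "YFG1", 95.0),
]
orthologs = reciprocal_best_hits(forward_hits, reverse_hits)
print(orthologs)
```

In the toy data YFG2 finds ortB as its best hit, but ortB's best hit back is YFG1, so YFG2 is left without an identifiable ortholog in that species.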

S. cerevisiae proteins with orthologs in all species were considered to be ancestral proteins and were grouped into group A. To obtain groups of proteins with decreasing average age of origin, S. cerevisiae proteins were selected according to the absence of identifiable orthologs in other species (see figure 1). It is important to note that these groups of decreasing average protein age are overlapping. Group F is contained in E, both are contained in D and so forth. I could have selected non-overlapping groups of proteins with decreasing time of origin but the lower numbers obtained might at a later stage make statistical analysis more difficult.
The phylogenetic tree in figure 1 (obtained with MEGA 3.1) is a neighbour-joining tree obtained by concatenating 10 proteins from the ancestral group A. I did it mostly to avoid copyrighted images and to have a graphical representation of the species divergence.
To determine the effect of protein age on the likelihood of interaction with ancestral proteins I counted the number of interactions between group A and the other groups of proteins (see table 1).
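One way to turn such counts into a likelihood is to divide the number of observed interactions by the number of possible protein pairs. The Python sketch below assumes exactly that definition, which may differ from the one behind table 1. It also assumes that the younger group does not share proteins with group A (any shared proteins are dropped). The function name and the toy protein identifiers are hypothetical.

```python
def interaction_likelihood(interactions, group_a, group_b=None):
    """Observed interactions divided by possible protein pairs.

    With only `group_a`, pairs within that group are considered.
    With `group_b`, pairs between the two groups are considered.
    """
    # Treat interactions as undirected and ignore self-interactions.
    edges = {frozenset(pair) for pair in interactions if pair[0] != pair[1]}
    a = set(group_a)
    if group_b is None:
        possible = len(a) * (len(a) - 1) // 2
        observed = sum(1 for edge in edges if edge <= a)
    else:
        b = set(group_b) - a  # assumption: groups are compared as disjoint sets
        possible = len(a) * len(b)
        observed = sum(
            1 for edge in edges if len(edge & a) == 1 and len(edge & b) == 1
        )
    return observed / possible if possible else 0.0


# Toy interactome with made-up identifiers.
group_a = {"a1", "a2", "a3"}
young = {"y1", "y2"}
interactions = [("a1", "a2"), ("a2", "a3"), ("a2", "a1"), ("a1", "y1")]

within = interaction_likelihood(interactions, group_a)
between = interaction_likelihood(interactions, group_a, young)
print(within, between)
```

Normalizing by the number of possible pairs matters because the groups have very different sizes, so raw interaction counts alone would not be comparable.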

From the data it is possible to observe that protein interactions within groups (within group A) are more likely than protein interactions between groups. This is in agreement with the results from Qin et al.[1]. Also the likelihood for a protein to interact with an ancestral protein depends on the age of this protein. This simple analysis suggests that the younger the protein is the less likely it is to interact with an ancestral protein.
One possible use of this observation, if it holds up under further scrutiny, would be to use the likely time of origin of the proteins as information to include in protein-protein interaction prediction algorithms.

Caveats and possible continuations
The protein-protein interactions used here also contain the high-throughput studies and therefore the interactome used should be considered with caution. I might redo this analysis with a recent set of interactions compiled from the literature[2] but this will also introduce some bias into the interactome.
I should do some statistical analysis to determine if the differences observed are at all significant. If the differences are significant I should try to correlate the likelihood of interactions with a quantitative measure like average protein identity.
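A simple way to approach the significance question would be a permutation test: shuffle the group labels among the proteins, keep the interactome fixed, and ask how often the shuffled data gives as few between-group interactions as the real data. The Python sketch below is only one possible version of such a test, not something already done for this post. It assumes the two groups are disjoint, and the function names, the number of permutations and the toy data are all hypothetical.

```python
import random


def count_between(edges, group_a, group_x):
    """Count interactions with one partner in each group."""
    return sum(
        1
        for u, v in edges
        if (u in group_a and v in group_x) or (u in group_x and v in group_a)
    )


def permutation_p_value(edges, group_a, group_x, n_permutations=1000, seed=0):
    """One-sided p-value for seeing this few between-group interactions.

    Group labels are shuffled among the proteins while the group sizes
    and the interactions are kept fixed. Assumes disjoint groups.
    """
    rng = random.Random(seed)
    group_a, group_x = set(group_a), set(group_x)
    proteins = sorted(group_a | group_x)
    observed = count_between(edges, group_a, group_x)
    as_low = 0
    for _ in range(n_permutations):
        rng.shuffle(proteins)
        perm_a = set(proteins[: len(group_a)])
        perm_x = set(proteins[len(group_a):])
        if count_between(edges, perm_a, perm_x) <= observed:
            as_low += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (as_low + 1) / (n_permutations + 1)


# Toy data: every interaction is within group A, none between groups.
group_a = {"a1", "a2", "a3", "a4"}
group_x = {"x1", "x2", "x3", "x4"}
edges = [("a1", "a2"), ("a1", "a3"), ("a1", "a4"),
         ("a2", "a3"), ("a2", "a4"), ("a3", "a4")]
p_value = permutation_p_value(edges, group_a, group_x)
print(p_value)
```

A test like this only addresses the group labels. It does not control for other effects such as protein size or expression, which would need their own controls.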

References
[1]Qin H, Lu HH, Wu WB, Li WH. Evolution of the yeast protein interaction network. Proc Natl Acad Sci U S A. 2003 Oct 28;100(22):12820-4. Epub 2003 Oct 13
[2]Reguly T, Breitkreutz A, Boucher L, et al. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol. 2006 Jun 8;5(4):11 [Epub ahead of print]