Wednesday, October 26, 2005

Google Base and Bioinformatics

Google is creating a new service called Google Base. It looks like a general database service. Currently I cannot log in yet, but from the discussion going around in the blogs it seems we will be able to define content types and populate the database with our own content. I don't know how much space will be allocated to each user, but I would guess at least the disk space of our Gmail accounts (around 2.5G currently, and growing).
Can the bioinformatics community take advantage of this?
Well, one of the most boring tasks we usually have to perform is cross-referencing databases. This usually means downloading some flat files and spending some time scripting up the joins. Of course, some of the main databases take up way more than the 2.5G, but we could imagine that having all databases under the same hosting service would help us. Google Base will probably have a nice standard API that would come in handy for accessing all sorts of different data.
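To make the cross-referencing chore concrete, here is a minimal sketch of the kind of throwaway script involved; the file names and column layouts are invented for illustration:

```python
# Minimal sketch of the usual flat-file cross-referencing chore.
# File names and column layouts are made up for illustration:
# ids.tsv maps an accession to a gene name, and interactions.tsv
# lists pairs of accessions, one tab-separated pair per line.

def load_mapping(path):
    """Build a dict from the first column to the second column."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                mapping[fields[0]] = fields[1]
    return mapping

def cross_reference(mapping_path, interactions_path):
    """Print each interaction with accessions replaced by gene names."""
    acc_to_gene = load_mapping(mapping_path)
    with open(interactions_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            a, b = fields[0], fields[1]
            print(acc_to_gene.get(a, a), acc_to_gene.get(b, b))

if __name__ == "__main__":
    cross_reference("ids.tsv", "interactions.tsv")
```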
The next step would be the ability to do some processing on the data right on their servers. Please, Google, set up some clusters with standard software and queuing systems. We have clusters here at EMBL, but Google would do a lot of researchers a favor by "selling" computer processing time for some ads :).
Protein Modules Consortium & Synthetic Biology

I have become a member of the Protein Modules Consortium, along with all participants in the FEBS course on modular protein domains that I attended recently. The aim of the consortium is the "promotion of scientific knowledge concerning the structure and function of protein modules, as well as the dissemination of scientific knowledge acquired by various means of communication".

Modular protein domains are "parts" of a protein that can be regarded as self-contained modules. In this sense, one can try to understand the function of a protein by understanding how the modular parts behave in the context of the whole protein. Another useful implication is that we should be able to build a database of well-understood modules and then create proteins with a predetermined function by copying and pasting the parts in the right way. Here are two short reviews on the subject. What would be the most efficient way of creating a database of protein parts that can be combined? The parts should all be cloned into vectors in the same way, and there should already be tested protocols to rapidly combine them. One of the future goals of the consortium, discussed in the FEBS course, is exactly to promote a set of cloning standards to this effect.
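Just to illustrate the idea, here is a rough sketch of what a record in such a parts database might look like; all the field names and values are my own invention, not an existing standard:

```python
# A sketch of what one record in a database of combinable protein parts
# could look like. All field names are invented for illustration; a real
# standard would be for the consortium to define.

from dataclasses import dataclass

@dataclass
class ProteinPart:
    name: str               # e.g. a domain family name such as "SH3"
    sequence: str           # amino-acid sequence of the part
    cloning_standard: str   # which agreed-upon vector/flanking scheme it follows
    validated: bool         # has a tested protocol confirmed the part works?

def compatible(a: ProteinPart, b: ProteinPart) -> bool:
    """Two parts can only be pasted together if they follow
    the same cloning standard."""
    return a.cloning_standard == b.cloning_standard

# Example: combining parts is only meaningful under a shared standard.
sh3 = ProteinPart("SH3", "MDETGKELVLALYDYQ", "gateway-v1", True)
ww = ProteinPart("WW", "MAHSSGWTEHKSPDGR", "gateway-v1", True)
print(compatible(sh3, ww))  # True
```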

One possible strategy would be to use the Gateway cloning system. This is an in-vitro cloning method used, for example, by Marc Vidal's lab in the C. elegans ORFeome project. It is a reliable system, especially for small protein domains, and it is very fast. Compared to traditional cloning strategies it can be a bit more expensive, but not by much if you consider the cost of the restriction and ligase enzymes. Creating an "entry" vector can be done with a PCR reaction followed by a recombination reaction (~2h), followed by the usual transformation and sequencing steps, and the entry vector can then be stored in the databank. The biggest disadvantage mentioned for this cloning strategy is its reported low efficiency in cloning big proteins, but this would not be a problem for protein domains, since the average protein domain is around 100 amino acids long.
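As a toy illustration of the size argument, here is a sketch that flags which candidate parts fall comfortably below the sizes where Gateway cloning is reported to struggle; the threshold is a made-up placeholder, not a published limit:

```python
# Toy illustration of the size argument: flag which candidate parts are
# small enough that the reported low Gateway efficiency for big proteins
# should not matter. The 300-residue threshold is a made-up placeholder,
# not a published limit.

MAX_COMFORTABLE_LENGTH = 300  # residues; placeholder value

candidates = {
    "WW domain": 40,              # typical modular domain sizes
    "SH3 domain": 60,
    "full-length receptor": 1200,
}

for name, length in candidates.items():
    verdict = "good candidate" if length <= MAX_COMFORTABLE_LENGTH else "may clone poorly"
    print(f"{name} ({length} aa): {verdict}")
```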

For reference, here is a paper where the authors compare different recombination systems, and another where the authors show a proof of principle experiment on how to use Gateway recombination to assemble functional proteins from cloned parts.

Monday, October 17, 2005

Your identity "aura"

I was thinking today of some possible future trends on our way to man-machine integration (known to some as the singularity :). More exactly, I was thinking of all the recent moves in portable devices, like the speed at which Apple is sending new iPods to the market and the Palm-Microsoft deal. The idea is simple and probably not very new: wouldn't it be nice to carry your identity around in a machine-readable format? It does not really matter how; it could be, for example, a device with a wireless connection of a certain radius that you could turn on and off whenever you wished (any recent palm/cell-phone will have this nowadays). Now imagine you walk into a bar and the bar recognizes your identity, takes your list of music preferences from your music player or from the net, and feeds them into a statistical DJ. This way the music the bar plays will be a balanced blend of the tastes of the majority of the people inside. In the same way, you could pass by any social place and check out the most used tags of the people inside to decide if it is your type of place. People broadcasting their identities would bring the same type of web 2.0 mash-up innovations to the social places around us in the real world.
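Purely as an illustration of the idea, here is a toy sketch of how the "statistical DJ" could weight tracks by aggregating the preference lists broadcast by the people inside; everything in it is hypothetical:

```python
# Toy sketch of the "statistical DJ": aggregate the preference lists
# broadcast by everyone in the bar and pick tracks in proportion to how
# many people like them. Entirely hypothetical.

import random
from collections import Counter

def build_playlist(preference_lists, n_tracks):
    """preference_lists: one list of track names per person present."""
    votes = Counter()
    for prefs in preference_lists:
        votes.update(set(prefs))  # one vote per person per track
    tracks = list(votes)
    weights = [votes[t] for t in tracks]
    return random.choices(tracks, weights=weights, k=n_tracks)

people = [
    ["Track A", "Track B"],
    ["Track B", "Track C"],
    ["Track B"],
]
print(build_playlist(people, 5))  # "Track B" should dominate
```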

Wednesday, October 12, 2005

In support of text mining

There is a commentary in Nature Biotech where the authors used text mining to look at how knowledge about molecular interactions grows over time. To do this, they used time-stamped statements about molecular interactions taken from full-text articles published in 60 journals between 1999 and 2002. They describe how knowledge mostly expands from known "old" interactions instead of "jumping" to areas of the interaction space that are totally unconnected from previous knowledge. Since this work is based on statements about interactions, I guess the authors did not take into account the data coming from high-throughput methods that is not described in the papers but is deposited in databases. In fact, in a recent effort to map the human protein-protein interaction network there was very little overlap between the known interactions and the new set of proposed interactions. What we might conclude from this is that although high-throughput methods are more error-prone than small-scale experiments, they help us jump to unexplored knowledge space.
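To make the overlap argument concrete, here is a small sketch of how one might measure the overlap between a literature-derived interaction set and a high-throughput one; the interaction data below is invented:

```python
# Sketch of measuring the overlap between literature-curated interactions
# and a new high-throughput set. Interactions are stored as
# order-independent pairs; the data below is invented for illustration.

def normalize(pairs):
    """Treat A-B and B-A as the same interaction."""
    return {frozenset(p) for p in pairs}

literature = normalize([("TP53", "MDM2"), ("EGFR", "GRB2")])
high_throughput = normalize([("MDM2", "TP53"), ("ABC1", "XYZ2"), ("FOO3", "BAR4")])

shared = literature & high_throughput
print(f"overlap: {len(shared)} of {len(high_throughput)} new interactions")
# A small overlap suggests the screen is exploring new regions of the
# interaction space rather than re-finding known interactions.
```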
The other two main conclusions of the commentary are that some facts stay restricted to "knowledge pockets" and that only a small part of the network is growing at any given time. In general, the authors make a case for the use of text mining, but they do not go into the details of how it should be implemented. They do not talk about the possible roles of databases, tagging, journals, funding agencies, etc. in this process of knowledge growth. Databases should help to solve the problem of knowledge pockets the authors mention. Tagging can eliminate the need for mining the data, and journals and funding agencies have the power to push authors to deposit the data in databases or tag their research along with the paper.
Without wanting to attract the wrath of people working on text mining, my opinion is that at least an equal amount of effort should be dedicated to making the knowledge discovered in the future easier to retrieve.

Saturday, October 08, 2005

Biology Direct

I am just propagating the announcement of a new journal. You can also read it in Propeller Twist and in Notes from the Biomass. There are tons of new journals coming up, so what is so interesting about this one? Well, they claim they will implement a novel system of peer review where the author must find three board members to review the article. The paper is rejected if the author cannot get the board members to referee the work. Another interesting idea is that the referees can write comments to be published along with the paper. They plan to cover the field of biology broadly, but they say they will start off with Genomics, Bioinformatics and Systems Biology. The editorial board is full of very well-known people from these areas, so I assume this is a journal to keep an eye on in the future.
Connotea and tags

I have finally started using Connotea from Nature Publishing Group. I'm not a big user of these kinds of "social" web services, like del.icio.us or Flickr, but I thought I would give this one a try since I do a lot of reading and would like a nice way of keeping my scientific reading organized. Here is my Connotea library.
When I first started downloading PDF files of interesting papers (some years ago), I used to put them neatly into folders organized by subject. Then, when Google Desktop Search started indexing PDFs, I switched to just putting everything in one folder and searching for a paper when I want it back. Both ways work OK, but the second ends up being faster.
So why should I use a web-based reference manager to keep track of the papers I am interested in? For one, because it takes almost no time at all. This was one of the nicest things about it: just highlight the paper's DOI with the mouse and click a bookmarklet. Put in a couple of tags to describe the paper and it's done.
One other advantage is the possibility of sharing the load of finding interesting papers with other people on the site, guilt-by-association style.

I would like to see two tools added to Connotea: one is label clusters, like you see in Flickr, and the other would be a graph of related papers or authors, like you can see when you click a news item on the CNET news site.

In general, I think the tag/label concept is presently one of the best user-driven ways of organizing knowledge. It takes the individual very little time to help out, and the outcome is a vast amount of organized information. Tagging is also probably a de facto standard by now, which means a lot of tools will be built to take advantage of it. Right now the tagging efforts are behind walls, but there is no reason not to fuse "tag space" across different domains. Instead of an RSS aggregator we could have tag readers across different services. There is already a nice "tag reader" for del.icio.us called direc.tor.
Another useful tool would be a program that automatically labels a document according to my labeling practices (or someone else's habits). The program could scan through everything I had labeled in the past and learn how to label, or at least suggest labels for, a new document. It could then also label whatever is on my computer. It would be close to indexing, but more personalized :).
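As a very naive sketch of how such a label suggester could work (a real tool would want proper text classification), one could score a new document against word counts from previously tagged documents:

```python
# Naive sketch of a personal label suggester: count which words co-occur
# with each tag in documents already labeled, then score a new document
# against those counts. Just the idea, not a serious classifier.

from collections import Counter, defaultdict

class LabelSuggester:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word frequencies

    def train(self, text, tags):
        """Record which words appeared under which tags."""
        words = text.lower().split()
        for tag in tags:
            self.word_counts[tag].update(words)

    def suggest(self, text, n=3):
        """Rank known tags by how often their words appear in the text."""
        words = text.lower().split()
        scores = {
            tag: sum(counts[w] for w in words)
            for tag, counts in self.word_counts.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:n]

suggester = LabelSuggester()
suggester.train("protein interaction network evolution", ["networks", "evolution"])
suggester.train("text mining of molecular interactions", ["textmining"])
print(suggester.suggest("evolution of interaction networks"))
```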

Further reading on the subject? Start here.

Monday, October 03, 2005

Recent reads

I am running some boring, repetitive jobs that take some time to finish (I am so glad to have a cluster to work with), and in between job runs I took some time to catch up on some paper reading. So here is some of the interesting stuff:

Number one goes to a provocative review/opinion from Evelyn Fox Keller called "Revisiting 'scale-free' networks." There is a comment about it in Faculty of 1000. The author puts power-law distributions in historical perspective, stripping away some of the exaggerated hype and the maybe overly optimistic notion that observations about scale-free networks contain some sort of "universal" truth about complex networks.

I talked before about the work of Rama Ranganathan when I went to a FEBS course on modular protein domains. I said he had talked about PDZ domains, but it was actually WW domains :). Anyway, what he talked about at the meeting was published in two papers in Nature. They are worth a look, especially as a good example of combining computational and experimental work. This work exemplifies what I consider a nice role for computational biology: guiding the experimental work. They infer the constraints necessary for a protein fold and then build artificial sequences that satisfy them, testing folding and activity experimentally.

Small is beautiful? I am interested in protein network evolution, and this small report from Naama Barkai's group caught my eye. It is a very simple piece of work: they show an example where a cis-regulatory motif was lost from several genes during the evolution of the Saccharomyces lineage. I usually like small, interesting ideas demonstrated nicely, but I dare say maybe this one is slightly too simple :).
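The kind of check behind such a report can be sketched in a few lines: scan orthologous promoter regions for the motif and see in which species it survives. The motif and the sequences below are invented:

```python
# Sketch of the kind of check behind the report: scan orthologous
# promoter regions for a cis-regulatory motif and see in which species
# it survives. The motif and sequences are invented for illustration.

import re

# Hypothetical degenerate motif written as a regular expression.
MOTIF = re.compile("AAC[GT]G")

promoters = {
    "S. cerevisiae": "TTGAACGGTTA",   # motif present
    "S. bayanus":    "TTGAACTGTTA",   # motif present (degenerate position)
    "S. castellii":  "TTGTTTTTTTA",   # motif lost
}

for species, seq in promoters.items():
    status = "present" if MOTIF.search(seq) else "lost"
    print(f"{species}: motif {status}")
```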

There is also a paper that I disliked. The paper talks about "The binding properties and evolution of homodimers in protein-protein interaction networks," but most of the conclusions look obvious or misleading. They say, for example, that a protein with self-interactions has a higher average number of neighbors than a random protein. The comparison is not fair, because in their analysis a protein with self-interactions has two or more interactions (counting the self-interaction), while a random protein has one or more. The fair comparison would be between homodimers and proteins in the network with at least two interactions.
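To make the point about the biased baseline concrete, here is a sketch on an invented toy network:

```python
# Sketch of the biased vs. fair comparison, on an invented toy network.
# Self-interactors automatically have degree >= 2 once the self-loop is
# counted, so they should be compared against proteins with at least two
# interactions, not against all proteins.

from statistics import mean

# Toy interaction list; ("A", "A") is a homodimer (self-interaction).
edges = [("A", "A"), ("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]

degree = {}
for a, b in edges:
    for node in (a, b):
        degree[node] = degree.get(node, 0) + 1

selfers = {a for a, b in edges if a == b}

biased_baseline = mean(degree.values())                      # all proteins
fair_baseline = mean(d for d in degree.values() if d >= 2)   # degree >= 2 only

print(mean(degree[n] for n in selfers), biased_baseline, fair_baseline)
# The homodimer looks much more connected against the biased baseline
# than against the fair one.
```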