Thursday, November 17, 2005

Google Base and Bioinformatics II

The Google Base service is officially open in beta (as usual). It is mostly disappointing because you cannot really do anything with it (read the previous post). You can load tons of data very rapidly, although bulk uploads take a long time to process. Maybe this will speed up in the future. The problem is that once you have your structured data in Google Base, you cannot do anything with it apart from searching it and looking at it in the browser. I uploaded a couple of protein sequences just for fun. I called the item "biological sequence" and gave it very simple attributes: sequence, id, and type. The upload failed because I did not have a title, so I added a title attribute and just copied the id field into it. Not very exciting, right?

I guess you can scrape the data off it automatically, but that is not very nice. This Perl snippet, for example, gets the object ids for the biological sequences I uploaded:


use strict;
use warnings;
use LWP::UserAgent;

# Fetch the Google Base search results page for the uploaded items.
my $url = "http://base.google.com/base/search?q=biological+sequence";
my $ua  = LWP::UserAgent->new;
my $res = $ua->get($url);
die "Request failed: ", $res->status_line, "\n" unless $res->is_success;

# Collect each object id (oid=NNN) together with the link text that follows it.
my %data;
my $html = $res->content;
while ($html =~ /oid=([0-9]+)">(\S+)</g) {
    $data{$1} = $2;
}

# Print the object ids, one per line.
print "$_\n" for keys %data;

With the object ids, you can then do the same to get the sequences.

Anyway, everybody is half expecting that one day Google will release an API to do this properly. So, coming back to scientific research, is this useful for anything? Even with a proper API this is just a database. It will make it easy for people to rapidly set up a database, and maybe Google can make a simple template webpage service to display the content of the structured database. It would be a nice add-on to Blogger, for example. You could get a tile to put in your blog with an easy way to display the content of your structured database.

For virtual online collaborative research (aka science 2.0 :)?) this is potentially useful because you get a free tool to set up a database for a given project. Apart from this I don't see potential applications, but as the name says, it is just the base for something.

4 comments:

Spitshine said...

Yeah, back to the basic flat file databases. For sequence and structured biological data, this won't be of much use as it is now.

Most hits for "bioinformatics" were in the jobs section btw.

Greg Tyrelle said...

Apparently Google doesn't want Google Base to be used for storing "Biological sequences". Last I checked, the sequences had been removed.

Using Google Base as a personal general-purpose data store is, from my perspective, a reasonable thing to do. However, I think their intended purpose is to provide a store for "products" and information that will generate advertising revenue.

This does, though, highlight the need for a bioinformatics service that allows you to write to a general-purpose data store (via XML upload) and then retrieve the data via a web API, maybe even query it too. The Semantic Web stack would be ideal for this (RDF database and SPARQL query interface).

Greg Tyrelle said...

I take that back, the sequences are still there. I don't know what happened, but I swear searching for "biological sequence" in Google Base returned no results when I tried it. Maybe they were reported as "bad items".

I also went through the process of adding a DNA sequence. This failed because attribute values are only allowed to contain fewer than 1000 characters. Again, this seems more geared towards auctions: there is a payment method, expiry date, etc.

Pedro Beltrão said...

I agree that right now it looks like it is more targeted at products people might sell, but who knows, maybe the service will evolve.
