Friday, April 30, 2010

Kaggle - a home for data mining challenges

I got a promotional email today from a new project called Kaggle. Somewhat related to Innocentive, this project aims to connect challenging problems with people who have the right set of skills to solve them. Kaggle is aiming more specifically to host prediction challenges and should appeal more to the data mining communities. For example, the site is currently hosting a challenge about HIV progression where problem solvers are given a training dataset and asked to predict improvement in a patient's viral load.

I sent a few questions to Anthony Goldbloom (who works for Kaggle) to get a better idea of what the site is about:

Could you just tell me a bit about the company?
The project was inspired by an internship I did as a journalist in London in 2008, when I wrote about the use of data by organizations. I am an econometrician by training and I was excited to see the principles we use to forecast economic growth, inflation, etc., being applied by organizations. I returned to Australia and resolved to get involved in the broader analytics community. That's how Kaggle was born.

It looks like a young startup, is this right?
The project is only two weeks old and we've been thrilled with the response - we've attracted over 6,000 unique visitors. 

We launched the Eurovision contest to get things going. In the last few days we released the HIV Progression Prediction competition. This was my introduction to bioinformatics, which seems like a fascinating area - we're hoping to attract more such competitions. Perhaps your readers have ideas or data.

Does the name mean anything?
The name doesn't mean anything. I got tired of coming up with great names and finding they were taken (and that the owner would only sell for $xx,xxx). As a young project,  our funds could be better spent elsewhere, so I built a program that iterated over different combinations of letters and printed a list of available and phonetic domain names. (I put this program on the web for others in a similar situation.) 
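
The program itself is not reproduced here. As a rough illustration of the idea only, the sketch below enumerates letter combinations that are likely to be pronounceable and keeps those that pass an availability check. Everything in it is an assumption: "phonetic" is approximated by strictly alternating consonants and vowels, the names pronounceable_names and candidate_domains are hypothetical, and the availability check is a caller-supplied function rather than a real registrar or WHOIS lookup.

```python
from itertools import product

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def pronounceable_names(length):
    """Yield every string of `length` letters that alternates
    consonant, vowel, consonant, ... (a crude stand-in for "phonetic")."""
    pools = [CONSONANTS if i % 2 == 0 else VOWELS for i in range(length)]
    for letters in product(*pools):
        yield "".join(letters)


def candidate_domains(length, is_available, tld=".com"):
    """Yield domains whose name is pronounceable and for which the
    caller-supplied `is_available(domain)` returns True.

    `is_available` is hypothetical: plug in whatever lookup you trust.
    """
    for name in pronounceable_names(length):
        domain = name + tld
        if is_available(domain):
            yield domain


if __name__ == "__main__":
    # Dummy check for demonstration: pretend only names starting with "ka" are free.
    for domain in candidate_domains(4, lambda d: d.startswith("ka")):
        print(domain)
```

A stricter alternation rule would not produce a name like "kaggle" itself, so the real program presumably used a looser notion of what sounds pronounceable.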

How do you hope to be different from what Innocentive is doing?
The project is solely focused on data competitions. This enables us to offer services - e.g. to help our clients frame their problems, anonymize their data,  etc. 

The platform is also easily extensible, so we can modify it to suit the specific needs of different data competitions. 

We will host a rating system/league table, so that statisticians can use strong performances to market themselves. The rating system also allows us to host forecasting competitions, since the competition host will know who has a track record of forecasting well (and therefore who to pay attention to).

In the medium term, we plan to also offer a tender system, so that consultants can bid for work from organizations and researchers all over the world. From the organization's perspective, the rating system means they know what they're paying for. From the consultant's perspective, they don't have to waste time touting for work and they get access to interesting clients and datasets. 

Tuesday, April 27, 2010

Science isn’t fair

<rant>

Life isn’t fair, science is part of life, therefore science isn’t fair. That would be a very short way to say what I am thinking, but this is a rant, so I will stretch it out a bit more.

We learn early on that in our line of work there is almost no correlation between the amount of work we do and the results we get. You need luck, and I am not turning mystical on you here. I mean the low-likelihood kind of luck. Even if you do everything right, being successful in science depends mostly on factors that are outside your control. A somewhat random pool of people end up being in the right place at the right time to go on with their academic work. Almost like a game of musical chairs, those with enough passion and perseverance to sustain the blows of lady luck get to play in the final rounds. Granted, I have been at this for only a few years, but I have seen my share of hard-working people getting scooped or hitting the wall with impossible projects. Try to explain scooping to non-scientists to see how ridiculous it sounds. I have also seen people (myself included) getting authorships for things I would not consider worthy of them.

So … science isn’t fair. This was exactly the sort of observation that made me start thinking about open science a few years ago. We could help to level the playing field if we were all a bit more open about what we are working on. Too many financial and personal resources are eaten away by the duplication of research agendas.

</rant>

Tuesday, April 13, 2010

Nature Communications serves its first papers

The new Nature brand journal (Nature Communications) has published its first set of papers this week. It is an interesting development in scientific publishing for many reasons. This is the first Nature brand journal that is online only and offers an (expensive) $5000 open access choice. Also, they are positioning this journal specifically as a lower-tier journal than previous Nature journals. According to the scope section:
"papers published in Nature Communications will be of high quality, without necessarily having the scientific reach of papers published in Nature and the Nature research journals."
So why is Nature dipping its toes in higher-volume open access versus its typical market of highly selective closed access papers? A bit of context might be required, and some of the discussions from 2008 about the PLoS business model are worth revisiting. A few years ago, Declan Butler, a reporter from Nature, wrote an overly negative news piece about PLoS ONE which generated a huge online discussion (see Bora's link fest). Timo Hannay's reaction to this discussion was a much more balanced point of view from Nature's side of things. Essentially, Timo Hannay was pointing out that PLoS had failed to make a profit with its more selective journals, and that this showed that a lower tier of less selective journals is required to subsidize the higher tiers. Timo also said that PLoS was creating barriers to market entry for other OA publishers because it was using philanthropic grants to sustain its business.

So with this in mind, Nature Communications could be seen as bet hedging. Open access might be here to stay due to mandates from funding agencies. If that is the case, the example from PLoS shows us that the only way to sustain highly selective journals is to also publish lower-tier, less selective journals. This way the publishing house can also directly pass papers down its chain of journals and possibly even pass along the referee reports to expedite publishing.

If most publishers try to cover the whole range of journal selectivity, how many publishers will there be a market for?

While PLoS and Nature are expanding down this perceived pyramid of journal selectivity, BMC has been trying to expand up. This week, BMC Biology and Journal of Biology announced that the two journals are merging to become the new flagship journal of BMC. I wish the best to the rebirth of BMC Biology, but expanding up the ladder of "perceived impact" is much harder than expanding down.

Through all this we have still not managed to do away with the idea of journal prestige or impact. PLoS ONE promised to provide us with ways to filter and sort papers on their individual value, but we are not there yet. Ironically, these "editorial" services might end up coming from third-party programs like Mendeley, CiteULike or Papers.