April 9, 2008

We need a Wikipedia for data

I just started blogging. I am not sure what I want to write about, but I think one theme will be "things I want but want someone else to build." This article describes one of those things.

At Google, I worked on a number of projects that required data from third-party data sources. We licensed mapping data for hundreds of countries for Google Maps, movie showtimes data for Google Movies, and stock data for Google Finance, among many others.

After leaving Google and the company of the Google BizDev team, I have come to realize how hard it is for an everyday programmer to get access to even the most basic factual data. If you want to experiment with a new driving directions algorithm, getting the data is far more difficult than coming up with the algorithm; you have to hire a lawyer and sign a contract with a company that collects that data in the country you are developing for. If you want to write an open source TiVo competitor, you need television listings data for every cable provider in the country, but your options are tenuous at best. In July, the most popular "free" listings service shut down its site, breaking most MythTV installations. The CD database (which is used to recognize CD track names when you rip CDs on your computer) has gone through a number of controversial transitions and license changes for similar reasons.

Even when data is available under a reasonable license, it often suffers from extremely serious quality or discoverability problems. The US Census Bureau publishes map data, but it only includes a small subset of the attributes required for a real mapping product. The Reuters corpus, which is a standard body of text used in data mining and information retrieval research, requires you to sign two agreements, send them to some organization via snail mail, and get the corpus via snail mail on CDs (what century is this, folks?).

I think all of these barriers to data are holding back innovation at a scale that few people realize. The most important part of an environment that encourages innovation is low barriers to entry. The moment a contract and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product. Likewise, companies that sell data have to protect their investments, so permitted uses for the data are almost always explicitly enumerated in contracts. The entire system is designed to restrict the use of data to product categories that already exist.

Imagine what amazing applications would be created if every programmer in the world had free access to all of these data sets:

Map data for all countries in a relatively uniform data format
White pages data (names and addresses) for all cities of the world
Stock data for all major exchanges for all time
Movie showtimes data for all cities in the world
Television schedule data for all cities in the world
Sports scores and stats for all sports in the world for all time
Rich metadata for all musical albums and movies from all labels for all time

The interesting thing is, almost every internet company would benefit if this data were freely available. Most internet companies have embraced open source operating systems because every company needs an operating system, and no company wants its OS to be a competitive advantage - they just want it to work. I would argue we are all in the same boat with these factual data sources. No one really wants factual data accuracy and completeness to be their competitive advantage; we all want the best data possible to build the best products possible, and discrepancies in data quality are artifacts of the extremely inefficient economy of buying and selling data we currently live in. If everyone had the same high-quality data, all of our products would be better for it.

To this end, I think we should create a Wikipedia for data: a global database for all of these important data sources to which we all contribute and that anyone can use. When a user reports an inaccurate phone number in your products, save it back to the DataWiki so everyone can benefit, and in return, you get everyone else's improvements as well. If your local movie theater doesn't have listings data in DataWiki, you can type it in yourself, and everyone in your town can benefit, and all the products you use that access movie listings will automatically update. Need better mapping data for a city? Pay to collect it, and upload it to the DataWiki. In return you get all the other cities other companies paid for (sort of like a company contributing device drivers to the Linux kernel).
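To make the contribute-and-share loop a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the DataWiki class, its contribute and lookup methods, the data set names, and the sample records are my own illustrations, not an existing service or API. It assumes the simplest possible model, where each fact keeps a full revision history (as a Wikipedia page does) and readers always see the latest revision.

```python
from typing import Dict, List, Optional, Tuple

# A fact is addressed by (data set, record key, attribute),
# e.g. ("white_pages", "Example Pizza, Springfield", "phone").
FactKey = Tuple[str, str, str]


class DataWiki:
    """Hypothetical in-memory model of a shared, editable fact store."""

    def __init__(self) -> None:
        # Each fact maps to its revision history: a list of (value, contributor).
        self._history: Dict[FactKey, List[Tuple[str, str]]] = {}

    def contribute(self, dataset: str, key: str, attribute: str,
                   value: str, contributor: str) -> int:
        """Record a new value for a fact and return its revision number."""
        revisions = self._history.setdefault((dataset, key, attribute), [])
        revisions.append((value, contributor))
        return len(revisions)

    def lookup(self, dataset: str, key: str, attribute: str) -> Optional[str]:
        """Return the most recent value for a fact, or None if nobody has added it."""
        revisions = self._history.get((dataset, key, attribute))
        return revisions[-1][0] if revisions else None

    def history(self, dataset: str, key: str, attribute: str) -> List[Tuple[str, str]]:
        """Return a copy of every revision of a fact, oldest first."""
        return list(self._history.get((dataset, key, attribute), []))


if __name__ == "__main__":
    wiki = DataWiki()
    # One company seeds a phone number from its own data set.
    wiki.contribute("white_pages", "Example Pizza, Springfield", "phone",
                    "555-0100", contributor="company_a")
    # A user of a different product reports a correction, and it is saved back.
    wiki.contribute("white_pages", "Example Pizza, Springfield", "phone",
                    "555-0199", contributor="company_b")
    # Every product reading the fact now gets the corrected number.
    print(wiki.lookup("white_pages", "Example Pizza, Springfield", "phone"))
```

A real system would also need identity, moderation, and conflict resolution, none of which this sketch attempts; it only shows that one contributor's correction becomes everyone's data.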

DataWiki seems like an extremely hard problem, and I don't think it would work unless some big companies got on board and donated their data sets to bootstrap the process. However, I think all companies would benefit almost immediately from the quality improvements that would come from openness. Some data sets are more expensive to collect than others, and those certainly seem like the hardest data sets to make freely available.

I have some concrete ideas on how this could work for some data sets, but I will save them for future posts. In the meantime, what are some of the most interesting existing projects attempting to open up these data sources? I only know of a few, and none of them has really taken off.

Update: ReadWriteWeb has a great summary of the sites people have mentioned in the comments.