Big Data Giant Joins InfoChimps to Save the World’s Structured Information

Sometimes highly accomplished people just have to join crazy little startups. It’s always exciting to see what happens when they do. Data scientist Kurt Bollacker is one of those people; he’s decided to join Austin-based bulk data marketplace startup Infochimps, one of the most interesting little companies we regularly write about here.

Bollacker’s history is intense. He helped build one of the first online search engines for academic research papers and the first prototype of the Wayback Machine at the Internet Archive, where he was the Technical Director. He was a biomedical research engineer at the Duke University Medical Center, researched long-term digital archiving as the Digital Research Director at the Long Now Foundation and was the Chief Scientist at Metaweb, the massively ambitious semantic web project that Google acquired in the summer of 2010. Those are some of the weightiest data projects in the Internet’s young history; now he’s joined InfoChimps. “The project that is Infochimps is in it for the long haul,” Bollacker told ReadWriteWeb. “We’re going to make something of lasting value. That’s something I can buy into.”

InfoChimps is a small startup that provides infrastructure for people to buy and sell large sets of data. We first wrote in depth about the company when it made the controversial move of putting 1 billion data points from months of the Twitter firehose up for sale. Twitter’s legal department quickly took the edge off what the marketplace was able to offer its customers, but the splash was made and the web suddenly knew about InfoChimps.

InfoChimps offers a wide variety of data, however. Among its most popular sets, the company says, is a complete downloadable set of Major League Baseball data covering every trade, draft, free agency and other player transaction since 1873. You can also download the raw survey data used for the Zogby International book What Arabs Think, for $999.00.

Who cares about raw data? Data scientists do, of course, but there’s ample reason for the rest of us to care as well. Our big-picture interest was well articulated by Dr. Dirk Helbing of the Swiss Federal Institute of Technology, who is leading an effort to build what’s being called the Living Earth Simulator (LES), a giant simulation of as many of the earth’s natural and social problems as can be simulated at once. His project is big data analysis taken to one of its most extreme conclusions.

“Many problems we have today – including social and economic instabilities, wars, disease spreading – are related to human behavior, but there is apparently a serious lack of understanding regarding how society and the economy work. Revealing the hidden laws and processes underlying societies constitutes the most pressing scientific grand challenge of our century.”

Revealing the hidden laws and processes underlying societies constitutes the most pressing scientific grand challenge of our century. That may or may not be overstated, but the point is: data is essential in order for us to develop the full extent of self-awareness that science can offer.

Metaweb

Metaweb, where Bollacker was Chief Scientist, was a company best known for its product Freebase, which it describes as “an entity graph of people, places and things, built by a community that loves open data.” Founded by Danny Hillis, a computer scientist whose name is usually said in hushed tones, Metaweb raised nearly $60 million to build its giant structured semantic graph.

Metaweb was acquired this summer for an undisclosed sum, and parts of the Freebase technology have turned into Google Refine, “a power tool for working with messy data.”

“At large scale there are classes of applications you can build that you can’t do with 50 items in a data set, but with 50 million or 50 billion items,” Bollacker explains. “Statistics, searches to find patterns, etc.

“I have no illusions that in 20 years, Google will still be paying to keep Freebase online as a service. I have an interest in making sure these bulk data sets stay alive. I think Infochimps has part of a model that could help that happen.

“One of the things I’ve learned is data that is loved tends to survive. I think the Freebase data is underloved. I think we can build extracts out of Freebase. They publish regular dumps. We’re going to grab sections of those dumps, make them better indexed, better labeled and better described.”

Bollacker received a Ph.D. in Computer Engineering from The University of Texas at Austin, and it was on his trips back to Austin that he met Infochimps CTO Flip Kromer, a Cornell-educated mechanical engineer, University of Texas physics education specialist and super-geek.

“The knowledge and experience is a huge known quantity,” Kromer says of Bollacker’s joining the company. “I got into this to build out the open data part of it. The best way to build the open data commons for the world is to do it within the context of a mixed open and commercial thing that makes everybody smarter. We’re building out the commercial part, that’s what we have to focus on. With Kurt on board, I have no fears that we’re ever going to lose our soul. We won’t lose sight over the central mission of making everybody smarter.”
