cdixon - Latest Comments in To make smarter systems, it’s all about the data

Re: To make smarter systems, it’s all about the data

Mark Essel — Sun, 29 Nov 2009 15:42:41 -0000

I'd like to add an orthogonal viewpoint here. Yes, there's tremendous value in building databases. But it is in leveraging existing processing techniques that great explosions of value will occur.

Take, for instance, the coupling of a dozen software services that each independently improve customer purchase actions. Combining these independent sources in a novel way can give a multiplicative value add of 1.1^12 (assuming each service contributes a 10% lift), or roughly a factor of 3 improvement. Now expand this to hundreds or even thousands of independent techniques or connections within a network and you can reveal massive improvements in quality.
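
The compounding arithmetic can be checked with a short sketch. The 10% lift per service and the function name `combined_lift` are assumptions made for illustration; real services are rarely fully independent, so treat the result as an upper bound.

```python
def combined_lift(per_service_lift: float, n_services: int) -> float:
    """Multiplicative improvement when n independent lifts compound."""
    return (1.0 + per_service_lift) ** n_services


# Twelve services, each assumed to add a 10% improvement.
dozen = combined_lift(0.10, 12)
print(round(dozen, 2))  # 3.14
```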

A brain with only a couple of nodes is pretty weak. A mind with billions of nodes and hundreds of billions of connections is capable of advanced conscious connections, creativity, and unpredictable value advancement. The potent software applications of the future will exist and thrive by utilizing the network of APIs optimally in the construction of their databases and decision architectures.

If your curiosity is piqued but you're still not convinced, check out some of Kevin Kelly's swarm and emergence concepts. I really enjoy some of his far-out predictions, and they're closer than most folks would guess (in the 10-20 year horizon).

10th popular post, and I give it a 7/10 only because you didn't tie in network effects.

Re: To make smarter systems, it’s all about the data

xamat — Tue, 27 Oct 2009 19:26:48 -0000

Quite a coincidence. I recently gave a talk with the same title. See my blog http://technocalifornia.blo... or my slides on slideshare http://www.slideshare.net/x...

Re: To make smarter systems, it’s all about the data

chris dixon — Thu, 03 Sep 2009 17:47:39 -0000

Thanks - looks very interesting - I'll check it out.

Re: To make smarter systems, it’s all about the data

terrycojones — Wed, 02 Sep 2009 17:25:03 -0000

Actually, most of http://blogs.fluidinfo.com/... is relevant to this.

Re: To make smarter systems, it’s all about the data

terrycojones — Wed, 02 Sep 2009 17:18:03 -0000

Hi Chris

I agree. You might like to check out FluidDB, which is all about using a better data representation to change how we work with information. See http://fluidinfo.com and http://blogs.fuidinfo.com/fluidDB

I've also written about this exact subject a few times. A starting link:
http://blogs.fluidinfo.com/terry/2007/03/19/why...

And I talk a little about Alex Wright at http://blogs.fluidinfo.com/terry/2008/01/04/tag...

Please feel free to get in touch, I'm terry fluidinfo com and would be happy to go into more depth / hear more from you, etc.

Re: To make smarter systems, it’s all about the data

jeremy — Tue, 01 Sep 2009 18:41:19 -0000

Actually, I was driving at something different than having humans label the data. What I was talking about was adjusting the machine learning, so that domain-specific knowledge is built into the learning algorithm.

For example, I had some work a few years ago that used HMMs to label chord information on a set of Beatles tunes. However, instead of using massive amounts of training data on the HMMs, I used zero training data. Zero. Instead, I initialized the HMM using musicologically-sensible initial conditions, and then I adjusted the standard HMM E-M algorithm so that the [B] output probability matrix did NOT get updated; it stayed stable.
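
A minimal sketch of that idea, not the actual chord-labeling system: a tiny discrete HMM whose emission matrix B is set by hand and never re-estimated, while EM (Baum-Welch) updates only the start vector and the transition matrix. The two states, three symbols, all probabilities, and the function names are made up for illustration.

```python
def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta and per-step scales."""
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        row = []
        for j in range(n):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            row.append(prior * B[j][obs[t]])
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n)
            ) / scale[t + 1]
    return alpha, beta, scale


def em_step_frozen_B(obs, pi, A, B):
    """One EM step that re-estimates pi and A but leaves B untouched."""
    n, T = len(pi), len(obs)
    alpha, beta, scale = forward_backward(obs, pi, A, B)
    new_pi = [alpha[0][i] * beta[0][i] for i in range(n)]
    new_A = []
    for i in range(n):
        # Expected number of i -> j transitions over the sequence.
        counts = [
            sum(
                alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
                for t in range(T - 1)
            )
            for j in range(n)
        ]
        total = sum(counts)
        new_A.append([c / total for c in counts])
    return new_pi, new_A, B  # B is returned unchanged on purpose


# Two hidden states (say, two chords) and three observable symbols.
pi = [0.5, 0.5]
A = [[0.5, 0.5], [0.5, 0.5]]
B = [[0.7, 0.2, 0.1],   # state 0 mostly emits symbol 0
     [0.1, 0.2, 0.7]]   # state 1 mostly emits symbol 2
obs = [0, 0, 1, 0, 2, 2, 1, 2, 2, 0, 0, 0]

for _ in range(20):
    pi, A, B = em_step_frozen_B(obs, pi, A, B)
```

The domain knowledge lives entirely in the hand-written rows of B; the unlabeled sequence itself is the only data the EM loop sees.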

All of the intelligence was in the algorithm. There was no human labeled data. And the algorithm performed best in class -- better than solutions that had been trained on lots of human-labeled data.

So to say we can't make smarter algorithms, or that breakthroughs will only come via data, simply doesn't sit right with me.

I think what people tend to mean when they say "it's all about the data" is that there is no point in coming up with better general purpose machine learning algorithms, e.g. SVMs vs. Gaussian mixture models vs. Markov random fields vs. whatever. If that's your main point, then I agree.

But we can also create more intelligent, specialized algorithms by building our own smarts into a general purpose ML algorithm, thereby making the algorithm smarter. And we can do it without the need for massive amounts of data. Again, imho.

So I just don't buy that

Re: To make smarter systems, it’s all about the data

chris dixon — Tue, 01 Sep 2009 14:03:15 -0000

Good points jeremy. I think this is also where the line between data and algorithms starts to blur. If you are doing, say, vertical music search, a lot of your "algorithm" will come from ML on music-related corpora, which might include having humans labeling the data at various points.

Re: To make smarter systems, it’s all about the data

jeremy — Tue, 01 Sep 2009 10:50:24 -0000

"I think the reason what you say works is precisely because the domain becomes small enough that you can do all sorts of things to fill in the gaps in the data."

But isn't that the point? At the end of the day, no matter what the reason, you resorted to clever algorithms, rather than large data. So the thing about data being the only good source of future AI breakthroughs just ain't true. Relying on large data isn't "wrong". It's just not the universal panacea that you make it out to be.

The way I see it, there is a small head of large-scale, breadth-loving domains (e.g. web) and/or tasks (e.g. known-item finding rather than exploratory search) in which large data is very appropriate.

At the same time, there is a long tail of medium and small-scale, depth-loving domains (e.g. content-based music search) and tasks (e.g. exploratory search) in which large data does not give you as much as an intelligently-constructed algorithm.

So what if the only reason you can construct those algorithms is because the domain is well-enough constrained? We know from power-law distributions that the volume (usage, whatever) of tasks and problems in the tail sums up to be equal in magnitude to the head.

So at the most, you can say that large data will help you make AI breakthroughs in *half* of the open problems. Intelligent algorithms will still be necessary for the other half. imho.
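
A rough way to see the head/tail balance is to sum a Zipf-like curve and find where the head reaches half of the total volume. This is only a sketch: the 1/rank weighting, the item count, and the function name `head_size_for_half` are assumptions, and the split shifts a lot with the exponent.

```python
def head_size_for_half(n_items: int, exponent: float = 1.0) -> int:
    """Smallest k such that the top-k items hold at least half the total volume."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    half = sum(weights) / 2.0
    running = 0.0
    for k, weight in enumerate(weights, start=1):
        running += weight
        if running >= half:
            return k
    return n_items


# With a million tasks and volume proportional to 1/rank, fewer than a
# thousand head items account for half the volume; the rest is tail.
k = head_size_for_half(1_000_000)
print(k, "head items;", 1_000_000 - k, "tail items")
```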

Re: To make smarter systems, it’s all about the data

jeremy — Tue, 01 Sep 2009 10:40:43 -0000

Sure, MS now has all this Yahoo data. And Google has plenty of data too. But what is that data? It's known-item, factoid retrieval data. There is no exploratory search data in there. There is no recall-oriented search there. So the only way the data can be used is to improve known-item oriented searching. But that in turn feeds back on itself: when Bing gets better at known-item searching, more people will use it for known-item searching, and then the data they collect pigeon-holes them further into that one, narrow, Google-like information seeking behavior.

So it seems to me that the only way out of the constraints imposed upon Bing search is for Bing to come up with cleverer algorithms that do something different and better, despite the gradient which the data is pointing it toward.

If one relies on the data alone, one will not solve a very large range of AI problems. Intelligent algorithms are needed to make those breakthroughs.

Re: To make smarter systems, it’s all about the data

chris dixon — Tue, 01 Sep 2009 07:30:37 -0000

Yeah, from what I hear in the rumor mill, search engines today use click data, bounce rates, etc. much more than people suspect. With such a long tail of key phrases people enter into search engines, they must have an almost unlimited appetite for more user data to get statistically meaningful tests.
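
To put a rough number on that appetite, here is a sketch of the standard sample-size estimate for comparing two click-through rates. The rates, the significance level, the power, and the function name `samples_per_arm` are assumptions for illustration, not figures from any search engine.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_base, p_new, alpha=0.05, power=0.80):
    """Approximate impressions needed per bucket to detect p_base -> p_new
    with a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_new - p_base) ** 2)


# Hypothetical click-through rates: 10% baseline vs. 11% with a tweak.
n = samples_per_arm(0.10, 0.11)
print(n)
```

Under these assumptions the answer is roughly 15,000 impressions per bucket for a single query. A tail query that is typed a few times a month will not get there quickly, which is why more query volume helps.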

Re: To make smarter systems, it’s all about the data

chris dixon — Tue, 01 Sep 2009 07:28:39 -0000

Hey roger, thanks! Glad to see you here.

Re domain specific - I agree, but I think the reason what you say works is precisely because the domain becomes small enough that you can do all sorts of things to fill in the gaps in the data. I think of my last company, SiteAdvisor, as a data company. The way we got from 80% to near 100% was all sorts of techniques, from hacks to manual processes to integrating other data sets. We couldn't have used those techniques in a horizontal setting.

Re: To make smarter systems, it’s all about the data

michels24 — Tue, 01 Sep 2009 01:21:19 -0000

Access to data was one of the things overlooked in the Msft/YHOO search deal. There was a lot of talk about revenue shares and upfront payments, but people forget that Msft now gets a larger source of data to improve its product. Without that query stream (i.e. data) they would never be able to build such intelligent tools for spelling correction, query intent, auto-complete, etc. One of the many reasons Google has knocked the ball out of the park is access to this data. Their bucket tests in a week probably provide more insights than the other guys get in a quarter.

Re: To make smarter systems, it’s all about the data

infoarbitrage — Tue, 01 Sep 2009 00:23:25 -0000

Chris, first of all, congrats on the blog. It is terrific. And based on the breadth and intelligence of the comments, this has already become a very exciting ecosystem in which to participate.

While I agree with the thrust of your post, you've taken a very horizontal view of the problem. I do not spend time on Google-scale problems, but on much more targeted, vertical solutions to the "big data" problem. By layering domain specificity onto the problem of semantic analysis, many of the pitfalls of NLP and AI become far more manageable. I'm not saying they're a panacea, and certainly not when trying to solve problems in real-time, but they can take you a lot farther than when applying them to horizontal data sets.

And yes, tagging rich data and creating additional metadata for analysis holds many of the keys to extracting true meaning from unstructured data sets. I could write on this topic for hours. Thanks for the post.

Roger

Re: To make smarter systems, it’s all about the data

Amit Seth — Mon, 31 Aug 2009 18:23:40 -0000

Well said. The big fact about 'data' is that if it is not 'whole' then it tends to be dangerous (in terms of the predictions that it produces - the predictions on the face of it could look awesome, but have a propensity to be as wacky as not having data at all).

I have seen entities make big mistakes in trying to solve major problems with machine learning (and resting on their laurels) without considering the fact that not all of the data that influences the outcomes is being sourced or even thought about.

Re: To make smarter systems, it’s all about the data

henchan — Mon, 31 Aug 2009 17:26:59 -0000

To illustrate the incredible subtlety in the interplay of system components, let me paint a metaphor from nature. It is not the only possible mapping between these two domains, but it serves my current purpose.
The algorithm is a copying mechanism; data encoding is DNA / RNA; information is the array of working combinations of encoded data; application processes are organisms; applications are species; communication (pub/sub) is natural selection; the ecosystem is the ecosystem.
Living organisms are incented to survive and replicate. Likewise the aim of a publisher is to communicate - deeply, broadly and for a long time. SEO happens to be a powerful form of communication at present. Certainly, a breakthrough in this area will need to get established in some existing niche. Long term though it need not be sustained by currently extant forms of communication.

Re: To make smarter systems, it’s all about the data

MikePLewis — Mon, 31 Aug 2009 11:21:50 -0000

Interesting point about the links. I never really thought about it before. More and more metadata will begin to appear around the web, which means that the systems that "understand" the data will be able to do new and more powerful things. Similar to how last.fm can know which person is most like me - a "musical neighbor." There was never a source/database of listened and liked tracks before, but once you have it you can do things like this. Very interesting post.
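
A toy sketch of the "musical neighbor" idea, using Jaccard overlap between sets of liked tracks. This is not how last.fm actually computes neighbors; the users, tracks, and function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Share of tracks two listeners have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def musical_neighbor(user: str, likes: dict) -> str:
    """Return the other user whose liked tracks overlap most with `user`."""
    others = (name for name in likes if name != user)
    return max(others, key=lambda name: jaccard(likes[user], likes[name]))


# Hypothetical listening data.
likes = {
    "me": {"track_a", "track_b", "track_c", "track_d"},
    "ann": {"track_a", "track_b", "track_c", "track_x"},
    "bob": {"track_a", "track_y", "track_z"},
}
print(musical_neighbor("me", likes))  # ann
```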

Re: To make smarter systems, it’s all about the data

calebelston — Mon, 31 Aug 2009 10:49:08 -0000

Hey Chris, great post. Been thinking about the business implications of focusing on algorithms vs. insight. I am experimenting with something new on my latest post; decided to record an audio companion version of it for those who prefer to listen. Would love to hear your thoughts on the format and content :)

http://calebelston.com/are-...

Re: To make smarter systems, it’s all about the data

chris dixon — Mon, 31 Aug 2009 09:03:41 -0000

"If I were to aggregate all the world's information (cost aside) and structure the data somehow," One problem is that 99% of the "information" (speaking in the broadest sense) is in people's heads, out in nature, etc. - not in digitally accessible form.

Re: To make smarter systems, it’s all about the data

chris dixon — Mon, 31 Aug 2009 09:02:29 -0000

I agree it's a bit of a slippery slope between data and algorithms. You could create an algorithm that creates a new data source from an existing one. But I bet you that if someone has a breakthrough doing it, the algorithm itself won't be as interesting as the data source they identified.

Re semantic tagging - If publishers were to ubiquitously start doing so, that would qualify as a massive new data source in my way of thinking. People have been talking about that for years but right now there is no real incentive for publishers to do it. Maybe if Google made it help your SEO or something people would start to care.

Re: To make smarter systems, it’s all about the data

henchan — Mon, 31 Aug 2009 02:41:49 -0000

Isn't there a false dichotomy in the post? Better algorithms will confer functional benefits while new data sources will increase the range of their coverage. Depth and breadth respectively. One or both approaches may be useful for different use cases. Indeed data and code can be inter-dependent. If the quantity of data increases, say, while its quality (according to some specified requirement) simultaneously deteriorates, the net gain could be negative unless the algorithm can be altered to compensate.

It is good to hark back to Google in 1998 or to the nascent WWW ten years earlier, and to think about what it would take to make another radical improvement in information management. My view is that the next generational shift will be ubiquitous semantic tagging of public data by the publisher. These tags will be interpreted using consistent, open algorithms but they will be interpreted subjectively by each subscriber, according to private data unevenly distributed across the system.
The high cost of creating tags is an empirical observation: true in respect of the Semantic Web and no doubt other systems, but not a universal law. When the requirement for objectivity is dropped, semantic tags with good-enough efficacy can be created at very low marginal cost.

Re: To make smarter systems, it’s all about the data

chris dixon — Sun, 30 Aug 2009 23:09:23 -0000

Eran - I was speaking about Google circa 1998. At the time the insight of including links and anchor text really did make their search engine vastly better. All search engines use that data today so that advantage is gone. Probably the biggest advantages in search today come from years of devising "bags of tricks" - lots of little algorithms that collectively yield a better experience.

Re: To make smarter systems, it’s all about the data

samfjacobs — Sun, 30 Aug 2009 23:00:58 -0000

A very clever and useful insight. Almost a perfect blog post to me. An archetype for the form. The Google example is great. Got me thinking.

Re: To make smarter systems, it’s all about the data

Eran Shir — Sun, 30 Aug 2009 20:50:10 -0000

I think data is like the height of NBA players. It's very hard to be a pro at 5' tall, but that doesn't mean the tallest player is the best; in fact, that's seldom the case. Same with data. You need enough of it to make things interesting, but the idea that Google, for example, is dominant because it has more data than nos. 2-10 is absurd. At some point it's not who has more, it's what they do with it.

Re: To make smarter systems, it’s all about the data

jeremy — Sun, 30 Aug 2009 19:03:15 -0000

I have my doubts, Chris. What you say is invariably true for a certain class of problems and tasks (finding popular recommendations on Amazon, finding home pages on Google). But by biasing your algorithms to large data, you might make other classes of problems even more difficult. Rather than repeat all the arguments, let me point you to a couple of places where I wrote about it a few months ago:

http://irgupf.com/2009/04/0...
http://irgupf.com/2009/04/2...
http://irgupf.com/2009/04/0...

In a nutshell, large data allows you to solve certain types of problems well, but may end up making other types of problems much more difficult, if all you have is naive Bayes on top of that data making your inferences.

Re: To make smarter systems, it’s all about the data

alvisbrigis — Sun, 30 Aug 2009 17:31:37 -0000

Right on, Chris. We're clearly witnessing an increase in the volume of data (doubling every 18 months), local data structuring (social graphs, geo graphs, genome graphs, brain graphs, body system graphs, energy graphs, real time robot vision / environment graphs, etc.) and combinatorial/macro data structuring (e.g. Twitter + Google Maps mashups), which is clearly adding to the capabilities of what we've come to label AI.

AI exists for a purpose, a specific function or task. Just like a basic lifeform needs relevant environmental information to increase its chances of success, AI functions best with access to the richest, most rapidly computable, system/task-relevant data. E.g. robots that can navigate the DARPA Grand Challenge need maps, real-time road/environment sensors, the ability to sense and determine the meaning of signs, etc. -- Circling back to the original point, the algorithm is just part of the AI - the other part is an environment of structured data. Intelligence arises from the interplay of the two and depends on the system context. So we can expect the algorithms that most effectively draw on the best data available to them for given tasks to be most successful - that means an ongoing rise of AI-ish bots tailored for / carefully tuned to new data environments, increasingly capable of performing more complex tasks. Clearly there's an expanding market for these (search being a huge part of that), as Norvig and company have realized.

When considering complementary data sources and the drive to increase intelligence in the system, it's occurred to me that we generally appear to be trending toward the super-structuring of all data (the everything graph), or total system quantification. By cross-referencing different rich data sets, we can interpolate value, push toward quantification / state closure, generating much value and "intelligence" along the way. If it becomes understood that this process is making our system smarter, then data may continue to centralize, be drawn together for certain higher uses, thus further commoditizing data structures, algorithms, and combinations thereof.

Related articles that explore these thoughts:
http://www.memebox.com/futureblogger/show/1591-...
http://memebox.com/futureblogger/show/1518-inte...
http://socialnode.blogspot.com/2009/06/simulati...