Artificial Intelligence: Driven by Data, Not Code

Ryan Tanaka
9 min read · May 5, 2016


In the ever-forward-looking world of Silicon Valley, there's been a lot of hype lately around using AI and machine learning to build the next generation of software products and features, with Google's self-driving cars taking the spotlight as the representative for this line of thought. The concept is still largely unproven, but since a working, reliable model could yield untold benefits, a lot of companies are pushing it as the "next big thing" in tech.

That's not to say the possibility of making it work isn't there, but there are a lot of challenges in building "AI" systems that often go undiscussed, and in most cases they lead to a product's lack of adoption in the long run. I've put "AI" in quotes here because what gets categorized as "artificial intelligence" in the media these days isn't actually driven by "intelligence", per se: the vast majority of AI and machine learning projects are driven by data, rather than code.

If you look under the hood of Google's self-driving algorithms, you'll see that much of their functionality relies heavily on the accuracy of Google Maps, which gives the software enough of an understanding of its environment for the car to navigate its terrain. How well the self-driving car operates is determined by the accuracy of the input data and the decision trees configured prior to execution. Apple's Siri works in a similar way, relaying every question you ask back to the company's data centers. This is important to note because algorithms aren't currently capable of organically altering their own code to adapt to circumstances, so the data-driven approach is used to fill in the gaps.
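To make that distinction concrete, here's a minimal sketch in Python of what "driven by data" looks like in practice. The map entries, rules, and names are all hypothetical, but the shape of the thing is the point: the code is a thin layer of pre-configured logic, and the actual behavior comes from whatever the data happens to say.

```python
# A toy illustration of "data-driven" decision-making: the behavior lives
# in the map data and the pre-configured rules, not in the code itself.
# Every name and value here is made up for illustration.

road_map = {
    ("Main St", "1st Ave"): {"signal": "stop_sign"},
    ("Main St", "2nd Ave"): {"signal": "traffic_light"},
}

def decide(intersection, light_state=None):
    info = road_map.get(intersection)
    if info is None:
        # No data means no decision: hand control back to a human.
        return "slow down, request human takeover"
    if info["signal"] == "stop_sign":
        return "stop, then proceed"
    if info["signal"] == "traffic_light":
        return "proceed" if light_state == "green" else "stop"
    return "proceed"

print(decide(("Main St", "1st Ave")))           # stop, then proceed
print(decide(("Main St", "2nd Ave"), "green"))  # proceed
print(decide(("Oak St", "9th Ave")))            # slow down, request human takeover
```

Change the map, and you change how the "car" behaves, without touching a line of code. That's the sense in which these systems are driven by data.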

To the point: AI projects in the near future will probably look a lot different from what people are speculating now, because they will involve using less information, not more, to make automated decisions and actions effective in real-world settings. In the age of Big Data this might sound a little counter-intuitive, but for many data-driven products, the Small Data approach will yield results just as effective as, if not better than, the Big Data one. More on that below.

Music that Composes Itself

My knowledge of AI systems mostly comes from my experience working with algorithmic composition, a direct descendant of process-based music, a movement in the classical music world that has been around since the 60s and 70s. Process music uses mathematical equations, algorithms, and "objective" sources of inspiration (data, patterns in nature) to procedurally generate music for an indeterminate length of time. You could say that John Cage's 4'33" is a form of process music as well: as the story goes, he was rolling dice to determine the notes for his latest work, happened to get a result with no notes to be played, and decided to have it "performed" anyway, in order to prove a point.

The problem is that while many prototypes and examples have been created over the decades in this line of work, none of them sound very convincingly "human". The results have been underwhelming enough that recordings in this style are hard to find, and in the vast majority of cases, what does exist is proof that nobody has really been "tricked" by the machine yet. You could say the same about its distant cousin, the chatbot, which is also in need of a lot of improvement at this point. (Facebook actually uses human operators working behind the scenes for its "chatbot" services, but that's kind of cheating, imo.)

This isn't to say that these outcomes are the result of a lack of effort, intelligence, or talent on the part of those who made them: the problem is a very, very hard one that gets progressively more difficult as you try to make the leap from "kind of working" to "acceptable". And there's a very simple reason why this happens: too much data.

I'm using a self-driving car as an example here, but you can imagine this as anything you want: dialogue trees, chord and note progressions in music, your cheeseburger delivery order, etc. Anything that's programmed to make decisions, basically.

(A series of rough sketches illustrated this in the original post; the captions tell the story well enough on their own.)

When you add one object to a data map, it creates a relationship between the two, drawing a few possible paths that you might take.

More objects = more stuff to analyze. In data, the relationship between each pair of objects can also become an "object" in itself.

Add another one, and you create a "network" of data in between them. Eventually the diagram gets so tangled that not even the person drawing it can tell what's going on anymore.

Eventually you end up with so many paths and possibilities that it overloads the computer: processing everything becomes physically impossible due to hardware and storage limitations. And the amount of data generated grows exponentially as you add more objects or try to look further ahead, and that's not even counting time-lapsed or historical data. You could literally analyze terabytes of data for something as simple as making a right turn, which is why even the most advanced robots programmed by the smartest people in the world often have trouble with even the most basic of tasks.
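If you want to see how fast this gets out of hand, a few lines of Python will do it. This is a simplification (treating every pairwise relationship as one unit of analysis, and every ordering of stops as one possible path), but the growth rates are the honest part:

```python
from math import comb, factorial

# How fast the "data map" blows up as objects are added.
for n in [2, 5, 10, 20, 50]:
    pairs = comb(n, 2)        # direct relationships between objects
    orderings = factorial(n)  # possible paths visiting every object once
    print(f"{n:>3} objects -> {pairs:>5} relationships, {orderings:.2e} orderings")
```

At 50 objects you're already past 10^64 possible orderings, which is why no amount of hardware brute-forces its way through an open-ended environment.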


So it's important to keep in mind that any AI or machine learning project built with current methodologies will run into a wall of data that you won't be able to get over no matter how hard you try. (For this reason I tend to recommend that people wait for advances in hardware before expecting too much out of the AI movement.) At some point, you have to limit the data coming into the system to avoid overwhelming it.

The realization that these limitations must be imposed on self-driving systems has led to a lot of debate about the ethical ramifications of the outcomes the technology could lead us into. When people are publishing articles like The Ethical Questions Facing Self-Driving Cars and Why Self-Driving Cars Must be Programmed to Kill, it doesn't exactly make you feel safe about the future of autonomous driving. I studied ethical models in grad school and find these discussions very interesting, but the engineering side of me says there's probably a way to avoid these questions altogether by coming up with a better solution. And the way to go about it is to do more with less.

Big Data to Small Data: Less is More

I used my experience in music composition as an example above because it's directly relevant to how a solution to this problem can be found: some algorithmically generated music can actually sound "ok" in certain circumstances. It won't inspire you to do great things or change your perspective on life, but you might get away with people not being able to tell the difference between a piece written by a computer and one written by a human being, if the materials you're working with are sufficiently limited.

David Cope was able to fool a lot of people into thinking that the output from his “J.S. Bach Algorithm” was an undiscovered work of the composer.

What makes this "trick" possible is the fact that classical music, particularly in older styles like Bach's, tends to be very strict in its treatment of material. You're limited to 12 notes per octave, 4 voices or instruments, a few acceptable forms, and a handful of chord progressions here and there. Even in writing melodies and variations, there are all sorts of regulations (e.g. the rules of counterpoint) you have to follow for the music to be "aesthetically sound". By placing limitations on allowable outcomes, a working "J.S. Bach algorithm" becomes a very real possibility.
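Here's a deliberately tiny sketch of that idea in Python. To be clear, this is not Cope's actual method (his system, Experiments in Musical Intelligence, was far more sophisticated); it only shows the underlying principle of constraining the allowable outcomes first, then generating within the constraints. The scale, the move weights, and the cadence rule are all made up for illustration.

```python
import random

C_MAJOR = ["C", "D", "E", "F", "G", "A", "B"]  # the only allowed notes

def next_index(i):
    # Counterpoint-flavored constraint: favor stepwise motion,
    # permit small leaps, and never leave the scale.
    moves = [-2, -1, -1, 1, 1, 2]
    return max(0, min(len(C_MAJOR) - 1, i + random.choice(moves)))

def generate_melody(length=8, seed=0):
    random.seed(seed)
    idx = 0  # start on the tonic
    melody = [C_MAJOR[idx]]
    for _ in range(length - 2):
        idx = next_index(idx)
        melody.append(C_MAJOR[idx])
    melody.append(C_MAJOR[0])  # rule: always cadence back to the tonic
    return melody

print(generate_melody())
```

Even something this crude produces output that sounds "plausible" within its narrow style, because the rules have already excluded most of what would give the machine away.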

This is why some people speculate that Dubai, not the US, might actually be the first place to implement self-driving cars in the near future. Dubai's roads are cleaner, larger, more even, and less prone to variation in terrain and weather, which makes them much easier for AI systems to handle. Even as hardware and software development methods continue to advance, simplifying the environment itself will always be an effective strategy, because it reduces the amount of complexity the system has to handle.

As the debate over AI and machine learning applications rages on, it might help to keep in mind that railway systems have been experimenting with and utilizing automated transit for over 50 years now. The difficulty of automating rail systems (which is much easier to do on a technical level) should give an indication of how much work the AI community has ahead of it to reach a point of acceptable reliability. Still, the ability to connect automobiles to data networks is in itself already a feat: the movement shows great promise for private circuits or public transit systems, where the AI can be tweaked to work with a limited number of variables.

A fully automated line of the Paris Métro in France, where driverless operation began in 2011.

In the end, though, all of these processes will be driven by data, at least for now. Chatbots won't be passing the Turing test for a while, at least in real-world situations where people are likely to manipulate and break the code on purpose. (Just keep asking one "why", if anything, to test the durability of the chatbot's code; most will break very easily.) Like trying to write a good piece of music, machine-learning projects are easy to start but difficult to finish, since the complexity and work required grow exponentially as time goes on. (It's easy to write an "ok" piece, hard to write a good one, and extremely difficult to write a great one.)
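To see what "breaking the code" looks like, consider a toy pattern-matching chatbot. The patterns and replies below are made up, but the failure mode is the real one:

```python
# One canned answer per pattern: all the "understanding" lives in the data.
RESPONSES = {
    "hello": "Hi there! How can I help?",
    "why": "Because of the way the system was designed.",
}
FALLBACK = "Interesting! Tell me more."

def reply(message):
    key = message.strip().lower().rstrip("?!")
    return RESPONSES.get(key, FALLBACK)

for msg in ["hello", "why?", "why?", "why?"]:
    print(msg, "->", reply(msg))
```

The second "why" gets the exact same canned answer as the first, and the illusion of understanding collapses on the spot. Real chatbots have deeper lookup tables, but the wall of data is the same.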

While the notion of "Big Data" has a nice ring to it, it's important not to equate more data with "bigger" insights or "more powerful" systems. In many cases, smaller data sets yield better results because you can tailor the analysis toward specific customers or objectives. By keeping the scope of the project small and the number of variables limited, the chances of an insight or idea growing into something bigger later on actually increase. (Better to be something to someone than everything to everyone, as the saying goes.)

While seemingly counter-intuitive, this was basically Bach's compositional approach, and it's why his music is still being analyzed and studied today. His music gives us a glimpse into how the contradictory notions of industry and nature, repetition and variation, growth and diminishment can work together in ways that are both meaningful and pleasing.

In that spirit, I think a lot of machine-learning projects today could use a similar approach toward their data: a stronger focus on limited sets of variables, geared toward very specific outcomes and goals. From there, you can let the data dictate where the product wants to go; in data science, you're hardly ever in control of the outcomes, after all.
