The origin of machine learning and machines

There are lots of machines in Machine Learning: Gradient Boosting Machines, Boltzmann Machines, Helmholtz Machines, Support Vector Machines, etc. So what is it with all these machines? More importantly, why is it called machine learning?

In this article, I will introduce the history of the following:

  • Machine
  • Learning Machines
  • Machine Learning
  • Machines in Machine Learning

Machine


Before we called them computers, people used the term “machine”. Alan Turing’s seminal 1936 paper described a universal “computing machine”, now known as the universal Turing machine (UTM). Back then, a “computer” was a person who did calculations. Until the 1970s, it was not uncommon for companies and governments to advertise job openings for “computers”.

The word “machine” also appears in the name of the ACM (Association for Computing Machinery), which was founded as the Eastern Association for Computing Machinery at a meeting at Columbia University in 1947.

Don’t forget that computers in the 50s were literally machines. The Electronic Numerical Integrator and Computer (ENIAC) was one of the first large general-purpose digital computers. By the end of its operation in 1956, ENIAC weighed more than 30 short tons, was roughly 8 ft × 3 ft × 98 ft in size, occupied 1,800 sq ft, and consumed 150 kW of electricity. Another important aspect is that adding functionality meant rewiring, restructuring, or redesigning the machine, because its program was fixed. This is very different from the modern separation of software and hardware.

Learning Machines

The first combination of “machine” and “learning” comes from Turing. In his seminal 1950 paper, “Computing Machinery and Intelligence”, which introduced the Turing test to the general public, he coined the term “Learning Machines” and used it as one of the section headings. He gives some quite interesting descriptions of a learning machine, at one point comparing its rules to the Constitution:

The idea of a learning machine may appear paradoxical to some readers. How can the rules of operation of the machine change? They should describe completely how the machine will react whatever its history might be, whatever changes it might undergo. The rules are thus quite time-invariant. This is quite true. The explanation of the paradox is that the rules which get changed in the learning process are of a rather less pretentious kind, claiming only an ephemeral validity. The reader may draw a parallel with the Constitution of the United States.

He also describes what we would now call a black box, the study of which has become one of the most important topics in machine learning:

An important feature of a learning machine is that its teacher will often be very largely ignorant of quite what is going on inside, although he may still be able to some extent to predict his pupil’s behaviour.

Machine Learning

In 1952, Arthur Samuel developed one of the first AI programs, a checkers program for the IBM 701. In his 1959 paper, “Some Studies in Machine Learning Using the Game of Checkers”, he popularized the term “Machine Learning”.

Machines in Machine Learning

So where do the machines in these names come from?

  • The Boltzmann Machine was named by David Ackley, Geoffrey Hinton, and Terrence Sejnowski in their 1985 paper “A Learning Algorithm for Boltzmann Machines”.
  • The Helmholtz Machine was named by Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel in their 1995 paper “The Helmholtz Machine”.
  • The Support Vector Machine was originally called the support-vector network. It was introduced under that name by Corinna Cortes and Vladimir Vapnik in their 1995 paper “Support-Vector Networks”. The term Support Vector Machine probably comes from the paper’s first sentence:

The support-vector network is a new learning machine for two-group classification problems.

  • The Gradient Boosting Machine was introduced by Jerome Friedman in his 1999 paper “Greedy Function Approximation: A Gradient Boosting Machine”.

They were called “machines” for several reasons, and one of them is actually patent law.

United States patent statutes do not explicitly mention software or computer programs. US courts have tried to clarify the boundary between patent-eligible and patent-ineligible subject matter for computers and software.

Section 101 of title 35, United States Code, provides:

Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

A “machine” is explicitly listed there, while an “algorithm” is not. So in case an invented algorithm ever needed to be patented, scientists preferred to call it a “machine”.

One great video I found while writing this article is Patrick Winston’s Lecture 16: Learning: Support Vector Machines.

I would like to share one snippet:

Well, I want to talk to you today about how ideas develop, actually.

Because you look at stuff like this in a book, and you think, well, Vladimir Vapnik just figured this out one Saturday afternoon when the weather was too bad to go outside.

That’s not how it happens. It happens very differently.

Later in the video, he tells the interesting history of Support Vector Machines:

So around 1992, 1993, Bell Labs was interested in hand-written character recognition and in neural nets.

Vapnik thinks that neural nets — what would be a good word to use?

I can think of the vernacular, but he thinks that they’re not very good.

So he bets a colleague a good dinner that support vector machines will eventually do better at handwriting recognition than neural nets.

And it’s a dinner bet, right?

It’s not that big of a deal.

But as Napoleon said, it’s amazing what a soldier will do for a bit of ribbon.

So that makes a colleague, who’s working on this problem of handwriting recognition, decide to try a support vector machine with a kernel, in which n equals 2, just slightly nonlinear. Works like a charm.

Was this the first time anybody tried a kernel?

Vapnik actually had the idea in his thesis but never thought it was very important.

As soon as it was shown to work in the early ’90s on the problem of handwriting recognition, Vapnik resuscitated the idea of the kernel, began to develop it, and it became an essential part of the whole approach of using support vector machines.

So the main point about this is that it was 30 years in between the concept and anybody ever hearing about it.

It was 30 years between Vapnik’s understanding of kernels and his appreciation of their importance.

And that’s the way things often go, great ideas followed by long periods of nothing happening, followed by an epiphanous moment when the original idea seemed to have great power with just a little bit of a twist.

And then, the world never looks back.

And Vapnik, who nobody ever heard of until the early ’90s, becomes famous for something that everybody knows about today who does machine learning.

Data Scientist at Affirm | Board Member of and | Previously at CMU and UIUC. My views are my own.