How do artificial neural networks learn?
There are many machine learning methods, so we must decide case by case which one suits the matter at hand. That said, everyone who works with machine learning has their own favourite method, and sometimes certain methods really come into fashion. Machine learning with artificial neural networks, often referred to as deep learning, is very popular at the moment, so we will take a closer look at it in the following articles.
The idea
Put simply, artificial neural networks are large computational structures based on a highly simplified idea of a brain. Concretely, they perform calculations that can be described and carried out with the help of linear algebra.
The supposed analogy to how a brain works was not only the source of the first ideas and models of artificial neural networks (referred to simply as neural networks below). The analogy between the flow of information through real and artificial neurons also helps illustrate how the networks work.
How does a neural network calculate?
Feedforward networks, which we refer to below, are the most widespread type. That said, other types of networks such as recurrent neural networks are seeing more and more frequent use.
In basic terms, a feedforward network calculates multiple output values from multiple input values. The calculations take place over multiple layers of units (also known as artificial neurons). The input values enter at the input layer and are ‘fed forward’ to the units in the next layer, layer by layer, until the units in the output layer deliver the outputs for the given input values. Each unit in the intermediate layers (a hidden unit) determines a single output value from all of the input values it receives. If a network has many intermediate layers, we speak of deep learning, although there is no fixed definition of how many ‘many’ means.
The calculation is identical for every unit and is easy to follow. Every connection between two units ‘carries’ a weight. A unit multiplies each of its input values by the weight of the connection on which that value arrives and adds up all of these products.
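As a minimal sketch of this weighted sum for a single unit, consider the following. The function name `weighted_sum` and the example numbers are assumptions made purely for illustration.

```python
def weighted_sum(inputs, weights):
    """Multiply each input value by the weight of its connection and add up the products."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input value")
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical example: a unit with three incoming connections.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

result = weighted_sum(inputs, weights)
print(result)  # 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 = 4.5
```

For a whole layer, the same calculation is a matrix-vector product, which is where linear algebra comes in.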
To prevent the whole network from simply being one linear calculation, an activation function is applied to the weighted sum of the input values in each unit. There is a brain analogy for this too, as a nerve cell in a brain is only activated when a certain threshold (an electrical charge) is reached.
Using this simple method, a neural network calculates a set of output values for a set of input values. In the context of supervised machine learning, this can be carried out easily and efficiently for a large number of data records.
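The following sketch shows one unit with an activation function applied to its weighted sum. The article does not name a particular activation function, so the two shown here, a hard threshold (`step`) and the smooth `sigmoid`, are assumptions chosen as common examples. All names are hypothetical.

```python
import math


def step(z, threshold=0.0):
    """Hard threshold: the unit 'fires' (1.0) only once z reaches the threshold."""
    return 1.0 if z >= threshold else 0.0


def sigmoid(z):
    """Smooth alternative to the hard threshold; maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def unit_output(inputs, weights, activation):
    """Weighted sum of the inputs, followed by the activation function."""
    z = sum(x * w for x, w in zip(inputs, weights))
    return activation(z)


# Hypothetical example: the weighted sum is 1.0*0.5 + 2.0*(-0.25) = 0.0
print(unit_output([1.0, 2.0], [0.5, -0.25], sigmoid))  # sigmoid(0.0) = 0.5
print(unit_output([1.0, 2.0], [0.5, -0.25], step))     # 0.0 reaches the threshold 0.0, so 1.0
```

A smooth function such as the sigmoid is often preferred over the hard threshold when training, because it can be differentiated everywhere.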
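Putting the pieces together, a complete forward calculation through a small network might look like the sketch below. It makes several assumptions: a sigmoid activation in every layer (including the output layer), no bias terms, and hand-picked weights. The names `layer_forward`, `feedforward` and `network` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weight_rows):
    """Compute the outputs of one layer.

    Each row of weight_rows holds the weights of the connections into one unit.
    """
    return [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in weight_rows]


def feedforward(inputs, layers):
    """Feed the input values forward through all layers, one after the other."""
    values = inputs
    for weight_rows in layers:
        values = layer_forward(values, weight_rows)
    return values


# Hypothetical network: 2 input values -> 3 hidden units -> 1 output unit.
hidden_weights = [[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]]
output_weights = [[1.0, -1.0, 0.5]]
network = [hidden_weights, output_weights]

# The same network is applied to several data records in turn.
records = [[1.0, 2.0], [0.0, 0.0], [-1.0, 3.0]]
predictions = [feedforward(record, network) for record in records]
print(predictions)
```

In practice, libraries carry out these loops as matrix operations over all records at once, which is what makes the calculation efficient.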
What does a neural network calculate?
When we use a neural network for supervised machine learning, we assume that there is a (previously unknown) functional relationship between the input data and the target variables. Otherwise, we could never expect to predict the target variables at all on the basis of the input data. This raises the question of whether a suitable neural network exists to model every functional relationship.
This question remained unanswered for a long time, even though pioneers such as Warren McCulloch and Walter Pitts worked with neural networks in the 1940s and Frank Rosenblatt built the first neurocomputer in the late 1950s. It was only in 1989 that George Cybenko proved the universal approximation theorem which states, in basic terms, that any continuous real function on a bounded, closed domain can be approximated as closely as desired by a suitable neural network.
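In symbols, the statement can be sketched roughly as follows for a network with one hidden layer. The notation is chosen here for illustration and is not taken from the article: f is a continuous function on the unit cube, σ a continuous sigmoidal activation function, N the number of hidden units, and α_j, w_j, b_j the parameters of the network. For every such f and every ε > 0, there exist N and parameters so that:

```latex
% Notation (N, \alpha_j, w_j, b_j, \sigma, \varepsilon) is an assumption made for illustration.
\[
  G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left( w_j^{\top} x + b_j \right),
  \qquad
  \left| G(x) - f(x) \right| < \varepsilon
  \quad \text{for all } x \in [0,1]^n .
\]
```

Note that the theorem only guarantees that such a network exists. It says nothing about how many units are needed or how to find the weights.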
How does a neural network train?
Training here simply means successively adjusting all weights in the neural network so that it fits the training data as well as possible. The goal is to set the weights in such a way that the network calculates the expected outputs, or at least close approximations, for the sets of input data. This sounds easier than it is and was an unsolved mathematical problem for a surprisingly long time. It was only in the 1970s that a satisfactory solution was found to the problem of weight adjustment in neural networks.
Building on the work of Stuart Dreyfus on control theory, Paul Werbos developed a method which we now refer to as backpropagation. The key idea is to minimise what is known as a cost function. One common cost function, for example, is the sum of the squared deviations between the output of every output unit and its expected value. If this sum is understood as a function of all weights, the task is to determine the weights in such a way that the function is minimised.
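To make the idea of "cost as a function of the weights" concrete, here is a minimal sketch. It assumes the simplest possible model, a single unit without an activation function, and a made-up training set. The names `predict`, `cost` and `samples` are hypothetical.

```python
def predict(weights, inputs):
    """Output of a single unit without activation function: the weighted sum."""
    return sum(w * x for w, x in zip(weights, inputs))


def cost(weights, samples):
    """Sum of the squared deviations between calculated and expected outputs.

    The training data (samples) is fixed, so the cost depends only on the weights.
    """
    return sum((predict(weights, inputs) - expected) ** 2 for inputs, expected in samples)


# Made-up training data: pairs of (input values, expected output).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]

print(cost([0.0, 0.0], samples))   # (0-2)^2 + (0+1)^2 + (0-1)^2 = 6.0
print(cost([2.0, -1.0], samples))  # these weights reproduce every expected output: 0.0
```

Different weights give different costs for the same data, and training amounts to searching for the weights with the lowest cost.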
The rest is then calculus. With the gradient descent method, at least local minima of the cost function can be calculated, which is often sufficient in practice.
But this is not always the case. Although neural networks are very often used to great success in machine learning, they do sometimes run into obstacles and problems which we will delve into in more detail in a later article.
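The sketch below illustrates gradient descent on a deliberately tiny case: a single unit without an activation function, a squared-error cost and made-up training data. It is not backpropagation itself, which is the method for calculating the same kind of gradient efficiently across several layers. The learning rate, the number of steps and all names are assumptions for illustration.

```python
def predict(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


def cost(weights, samples):
    """Sum of squared deviations over all training samples."""
    return sum((predict(weights, x) - t) ** 2 for x, t in samples)


def gradient(weights, samples):
    """Partial derivatives of the cost with respect to each weight."""
    grad = [0.0] * len(weights)
    for x, t in samples:
        error = predict(weights, x) - t
        for i, x_i in enumerate(x):
            grad[i] += 2.0 * error * x_i
    return grad


def gradient_descent(weights, samples, learning_rate=0.1, steps=200):
    """Repeatedly move the weights a small step against the gradient."""
    weights = list(weights)
    for _ in range(steps):
        grad = gradient(weights, samples)
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


# Made-up training data: pairs of (input values, expected output).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]

start = [0.0, 0.0]
trained = gradient_descent(start, samples)
print(cost(start, samples), cost(trained, samples))
print(trained)  # approaches the weights [2.0, -1.0]
```

In this toy example the cost has a single minimum, so gradient descent finds the best weights. With several layers and non-linear activation functions, the cost function generally has many local minima, and the result depends on the starting weights.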