Deep learning – between hype and disillusionment
For some time now, machine learning with deep artificial neural networks (deep learning) has been experiencing a boom. As a result of rapid technological progress, very large and deep networks can now be trained quickly and efficiently. At the same time, neural networks have proven to be ideally suited for many applications, such as text and speech recognition.
Varying quality
In practice, however, the quality of the predictions made by a neural network can vary greatly. For example, very different prediction quality may be observed across training runs with the same training data and the same network architecture. This is often down to weight initialization: before training can begin, so-called starting weights are needed. These are usually generated randomly, in the hope that over the course of the training run they will converge towards the best weights, or at least good ones. In an encouragingly high number of cases, this already works quite well – but not always. Sometimes it is very difficult to obtain stable results, and sometimes traditional methods deliver much better results.
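The effect is easy to reproduce on a toy problem. The following is a minimal sketch (the data, architecture and hyperparameters are illustrative assumptions, not taken from the article): the same tiny network, trained on the same data with the same architecture, can end up at noticeably different errors depending only on the random starting weights.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # toy XOR inputs
y = np.array([[0], [1], [1], [0]], dtype=float)                # XOR targets

def train(seed, hidden=3, lr=0.5, steps=2000):
    rng = np.random.default_rng(seed)
    # Random starting weights -- the weight initialization discussed above.
    W1 = rng.normal(0, 1, (2, hidden))
    W2 = rng.normal(0, 1, (hidden, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)                    # hidden layer
        out = 1 / (1 + np.exp(-(h @ W2)))      # output layer (sigmoid)
        delta = (out - y) * out * (1 - out)    # output error signal (constant factors folded into lr)
        dW2 = h.T @ delta                      # gradients via backpropagation
        dW1 = X.T @ ((delta @ W2.T) * (1 - h**2))
        W1 -= lr * dW1                         # gradient descent step
        W2 -= lr * dW2
    h = np.tanh(X @ W1)
    out = 1 / (1 + np.exp(-(h @ W2)))
    return float(np.mean((out - y) ** 2))      # final mean squared error

for seed in range(5):
    print(f"seed {seed}: final error {train(seed):.4f}")
```

Depending on the seed, some runs solve the task almost perfectly while others get stuck at a clearly higher error – exactly the variation described above.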
Gradient descent in good times and bad
To find out why, let’s take a closer look at how the weights are adjusted. The aim of the weight adjustments is to minimise the error function, and by default the gradient descent method is used for this. The error function here is a function of the weights. With perhaps 200 hidden layers and several tens of thousands of units, its domain is very high-dimensional – in effect, the dimension equals the number of weights. Such a function can have a great many local minima, and the gradient descent method (hopefully) leads us into one of them. However, it is not easy to tell how far our local minimum lies above the global minimum. In practice, this is often not necessary, since we don’t need the global minimum to make good predictions – ‘just’ a sufficiently low local minimum.
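In its simplest form, gradient descent just repeatedly steps against the gradient of the error function until the steps stall. The sketch below (the error function and learning rate are invented for illustration) shows this in one dimension; wherever the procedure settles is *some* local minimum, not necessarily the global one.

```python
import numpy as np

def E(w):            # a one-dimensional stand-in for the error function of the weights
    return np.sin(3 * w) + 0.1 * w**2

def dE(w):           # its derivative (the "gradient" in one dimension)
    return 3 * np.cos(3 * w) + 0.2 * w

w = 2.0              # starting weight
lr = 0.01            # learning rate (step size)
for _ in range(5000):
    w -= lr * dE(w)  # step downhill, against the gradient

print(f"settled at w = {w:.3f}, E(w) = {E(w):.3f}")
```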
A hiker in the Alps
Sometimes, however, this can pose a problem. The whole concept becomes a little easier to picture if we have only two weights, i.e. an error function whose domain is two-dimensional and whose values are real numbers (one-dimensional). We can then imagine the graph of this function as a mountainous landscape – perhaps comparable to the Alps. Let’s follow a hiker who starts walking from a random place in the Alps and is looking for water. They expect to find water at the hundred lowest points in the mountains. With the gradient descent method, the hiker would simply always walk downhill and end up in some valley (or at a saddle point, which would also be fine). Since there are many valleys in the Alps, our hiker may get lucky and find water. Or they could end up in a local minimum that lies much higher than the ones with water. The hiker’s success depends on their starting point – which corresponds to the chosen starting weights.
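The hiker can also be written down directly as code. In the sketch below the two-weight “landscape” is an invented function with several valleys; gradient descent from different random starting points ends in different valleys, some of them much higher than others.

```python
import numpy as np

def height(w):                       # the "Alps": an error function of two weights
    return np.sin(w[0]) * np.cos(w[1]) + 0.05 * (w[0]**2 + w[1]**2)

def grad(w):                         # its gradient
    return np.array([
        np.cos(w[0]) * np.cos(w[1]) + 0.1 * w[0],
        -np.sin(w[0]) * np.sin(w[1]) + 0.1 * w[1],
    ])

def hike(start, lr=0.05, steps=3000):
    w = np.array(start, dtype=float)
    for _ in range(steps):
        w -= lr * grad(w)            # always walk downhill
    return w, height(w)

rng = np.random.default_rng(0)
for _ in range(5):
    start = rng.uniform(-6, 6, size=2)   # a random place in the Alps
    w, h = hike(start)
    print(f"start {np.round(start, 1)} -> valley at {np.round(w, 2)}, altitude {h:.3f}")
```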
A view from the helicopter
Our analogy, as well as practical experience, shows that this approach has a clear weakness. That said, it is the best method we have at our disposal today. The problem for our hiker is that they move only within the plane – in a certain sense, within the dimensions of the weights – and therefore lack important information about the structure of the landscape they are moving through. If the hiker had a helicopter or a hot-air balloon, they could look for a convenient starting point from above (i.e. from a higher perspective).
Contextual knowledge
Something similar is done in practice, where we try to use our contextual knowledge to find a good starting point. Specifically, learning methods are not applied on a ‘greenfield’ basis, but within the framework of the subject-specific question at hand. Here, we try to base the choice of starting weights on everything we already know beforehand.
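One hedged way to express this in code (illustrative only, not a prescribed procedure): instead of purely random starting weights, start from weights we already trust – for example from a model previously trained on a related task – and perturb them only slightly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Weights from earlier, related work (hypothetical placeholder values for illustration).
prior_W1 = np.array([[0.8, -0.5, 0.1],
                     [-0.3, 0.9, 0.4]])
prior_W2 = np.array([[1.2], [-0.7], [0.5]])

# Start training near the known-good region instead of from pure noise.
W1 = prior_W1 + rng.normal(0, 0.01, prior_W1.shape)
W2 = prior_W2 + rng.normal(0, 0.01, prior_W2.shape)
# ... then run gradient descent from (W1, W2) as before.
```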
Or send out a few hundred hikers
Of course, there is always the option of simply repeating the training run until the required level of quality is attained. This would be akin to sending out several hundred hikers instead of just one: even if only one of them finds water, we have done our job.
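As a sketch, such random restarts look like this (reusing the train() function from the first code example above; the number of hikers and the quality threshold are assumptions): run the training several times with different random starting weights and simply keep the best result.

```python
best_seed, best_err = None, float("inf")
for seed in range(200):                  # two hundred hikers
    err = train(seed)                    # one training run per random start
    if err < best_err:
        best_seed, best_err = seed, err
    if best_err < 0.01:                  # good enough: one hiker has found water
        break

print(f"best run: seed {best_seed}, error {best_err:.4f}")
```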