Machine learning – why does it (sometimes not) work?
The article ‘AI in the insurance industry – terms and definitions’ already explains what machine learning actually means. Here we want to go a little deeper and discuss the question posed in the title. Machine learning is a branch of artificial intelligence. The term covers a wide variety of approaches, each of which comprises many individual learning methods. A first rough classification distinguishes between supervised, unsupervised and reinforcement learning. In this article, we focus on the first category, supervised machine learning, and briefly explain some of its basics.
How do we arrive at findings?
In logic and natural science, there are two fundamental approaches to reasoning. In deductive inference, specific statements are derived from general principles, rules or axioms. This is the conventional approach in mathematics, and it requires that the principles, rules and axioms exist and are known. The opposite is inductive inference. Here, relationships are learned from observations and general principles are derived from them. We encounter this approach in natural sciences such as physics. It should be noted that inductive statements can be falsified but not verified: a single contradicting observation refutes a general statement, whereas no finite number of confirming observations can prove it.
Classical programming normally takes a deductive approach. This means that input data are processed in line with a predefined plan (the algorithm) and an output is calculated. The programmer must know or draft the algorithm beforehand and then transfer it to the programming language they are using.
In contrast, supervised machine learning can be considered an attempt to automate the process of inductive inference. Unlike in classical programming, the algorithm that maps inputs to outputs is not known beforehand, and often it cannot be stated explicitly afterwards either. Instead, a learning method is used to train a model on input data. The trained model is then used to predict a target variable for new data. The target variable can be nominal, e.g. whether the customer cancels their contract or chooses a capital payment (classification problem), or numerical, e.g. the claim amount (regression problem).
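The contrast between the two approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the rule 'contracts younger than three years are cancelled', the six observations and the function names are invented, and a single learned threshold stands in for a real learning method.

```python
# Deductive: the programmer writes the rule down in advance.
def cancels_by_rule(contract_years):
    # Hypothetical hand-written rule, not derived from any real data.
    return contract_years < 3.0


# Inductive: the rule (here just one threshold) is learned from labelled examples.
def fit_threshold(examples):
    """Return the threshold t for which 'x < t means cancellation' makes the fewest errors."""
    values = sorted({x for x, _ in examples})
    # Candidates: one value below all observations, the midpoints between
    # neighbouring observations, and one value above all observations.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates.append(values[-1] + 1.0)

    def errors(t):
        return sum((x < t) != cancelled for x, cancelled in examples)

    return min(candidates, key=errors)


# Invented observations: (contract age in years, contract was cancelled)
observations = [(0.5, True), (1.0, True), (2.0, True),
                (4.0, False), (6.0, False), (9.0, False)]

threshold = fit_threshold(observations)


def cancels_by_model(contract_years):
    return contract_years < threshold


print(threshold)              # 3.0
print(cancels_by_model(1.5))  # True
```

In the first function the knowledge comes from the programmer; in the second it comes from the data, and different observations would yield a different threshold.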
Training and test
In order to predict the target variable, we need a sufficient number of data records containing structured data for various features as well as the known values of the target variable. The records are split into two groups, usually at random: training data and test data. A learning method such as an artificial neural network can now be trained using the training data. The neural network ‘learns’ the relationships between the features and the target variable, for example that, and how, cancellation behaviour depends on age, contract term and insurance product. First, the trained model is re-applied to the training data in order to assess the quality of the predictions, i.e. to check whether they match the known values of the target variable. Then the test data, which have not been used so far, come into play. They are used to check whether the trained model also makes good predictions for data it has not seen during training.
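The procedure of splitting, training and evaluating can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, a straight-line fit (ordinary least squares) replaces the neural network mentioned above to keep the example short, and the function names, the 25 % test share and the seeds are arbitrary choices.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into a training and a test part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(records):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(model, records):
    intercept, slope = model
    return sum((intercept + slope * x - y) ** 2 for x, y in records) / len(records)


# Synthetic records: one feature x and a numerical target y with random noise.
rng = random.Random(42)
records = []
for _ in range(200):
    x = rng.uniform(0.0, 50.0)
    records.append((x, 100.0 + 2.0 * x + rng.gauss(0.0, 5.0)))

train, test = train_test_split(records)
model = fit_line(train)  # the model only ever sees the training data

print("MSE on training data:", mean_squared_error(model, train))
print("MSE on test data:    ", mean_squared_error(model, test))
```

The important design point is that the test records play no part in fitting the model, so the second figure is an estimate of how the model behaves on unseen data.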
The following cases can occur when the quality is tested. Underfitting means that the predictions are poor even for the training data, so that no good predictions can be expected for the test data or for entirely new data either. Overfitting means that the predictions are excellent for the training data but much worse for the test data: the model has memorised the training data, including their random fluctuations, instead of learning the underlying relationship. Ideally, the predictions are good for both training and test data, in which case we can hope that predictions for completely new data will also be good.
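The three cases can be reproduced with a small experiment. The sketch below is an illustration only: it uses synthetic data with deliberately noisy labels and a k-nearest-neighbour classifier, chosen here because a single parameter k moves it from overfitting (k = 1) to underfitting (k equal to the number of training records). The function names, the noise level of 20 % and the values of k are assumptions.

```python
import random


def make_records(n, rng, noise=0.2):
    """Feature x in [0, 1]; the true label is x > 0.5, flipped with probability `noise`."""
    records = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        records.append((x, label))
    return records


def knn_predict(train, x, k):
    """Majority vote among the k training records closest to x."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - x))[:k]
    return 2 * sum(label for _, label in nearest) > k


def accuracy(train, records, k):
    """Share of records whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in records) / len(records)


rng = random.Random(1)
train = make_records(400, rng)
test = make_records(1000, rng)

results = {}
for k in (1, 25, len(train)):
    results[k] = (accuracy(train, train, k), accuracy(train, test, k))
    print(f"k={k:3d}  training accuracy={results[k][0]:.2f}  test accuracy={results[k][1]:.2f}")
```

With k = 1 every training record is its own nearest neighbour, so the training accuracy is perfect while the test accuracy suffers from the memorised noise. With k = 400 the model predicts the same class for every input and is poor on both data sets. A moderate k lies in between and generalises best.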
But beware: even then, predictions for new data can still be poor. In practice, this mostly happens when the new data are structurally different from the training data. If, for example, we have only used data records for classical insurance products to train and test a cancellation forecast model, we cannot necessarily expect good predictions for fund products.
To avoid such effects, the quality and suitability of the data have to be analysed thoroughly. In practice, an exploratory data analysis is carried out before the data are used for training. This normally requires sound domain knowledge, as does the evaluation of the results.
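One small building block of such an analysis is a comparison of the new data with the training data. The following sketch checks two simple symptoms of structural differences: categories that never occurred in training, and numerical values outside the range seen in training. The records, feature names and function names are invented for illustration, and a real analysis would go considerably further.

```python
def unseen_categories(train_rows, new_rows, feature):
    """Categories of `feature` that occur in the new data but never in the training data."""
    seen = {row[feature] for row in train_rows}
    return {row[feature] for row in new_rows} - seen


def share_outside_training_range(train_rows, new_rows, feature):
    """Fraction of new rows whose numerical `feature` lies outside the training min/max."""
    values = [row[feature] for row in train_rows]
    low, high = min(values), max(values)
    outside = sum(not (low <= row[feature] <= high) for row in new_rows)
    return outside / len(new_rows)


# Hypothetical records; all feature names and values are invented.
train_rows = [
    {"product": "classical", "age": 34, "term_years": 12},
    {"product": "classical", "age": 51, "term_years": 20},
    {"product": "classical", "age": 45, "term_years": 25},
]
new_rows = [
    {"product": "classical", "age": 40, "term_years": 15},
    {"product": "fund", "age": 29, "term_years": 30},
]

print(unseen_categories(train_rows, new_rows, "product"))         # {'fund'}
print(share_outside_training_range(train_rows, new_rows, "age"))  # 0.5
```

A non-empty result for the product feature is exactly the situation described above: the model would be asked about fund products although it has only ever seen classical ones.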
Applied correctly and with the right amount of expertise, machine learning offers numerous ways to solve known and even new problems, and it is already being used successfully in practice, including in the insurance industry.