A critical aspect of investing is using information available now to make inferences about what asset prices will be in the future. Investors determine which current information is useful and then use some form of a ‘model’ to process that information, be it a discretionary investor’s reasoning or an explicit quantitative algorithm.
In this piece I argue that we can improve investment predictions by improving the structure of the predictive models we use, showing that, for a non-linear system, perfect knowledge of the system variables is not enough to make good predictions if we limit ourselves to a linear combination of those variables.
There is a growing use of machine learning and “alternative data” in investment to create new insights about assets and markets, but these insights are still largely combined in a linear fashion when it comes to predicting future prices. Rather than exploring how non-linear combinations of these insights could improve prediction, we find that much effort goes into getting the ‘right’ data. As we shall see, having the correct input variables is not always enough to make good out-of-sample predictions: linear regression models make poor predictions when the underlying system is not linear, even if they have the ‘right’ data. In contrast, machine learning models (such as random forests and neural networks) can learn non-linear relationships and use them to make accurate predictions out-of-sample. This is a clear advantage of machine learning models over traditional linear models when data contains unknown non-linear effects, or when non-linear relationships are suspected.
In this piece I use a simple example to demonstrate how linear models are outperformed by two machine learning techniques (a neural network and a random forest) when the underlying relationships are non-linear, even if the linear model “knows” the correct variables. As there is a concern that these techniques result in models that are not understandable, I then demonstrate one technique to compare the relationship learnt by the two machine learning models to the actual relationship in the data.
This piece builds on the same example used in Marcos Lopez de Prado’s paper, “Beyond econometrics: A roadmap towards financial machine learning”, extending the machine learning models used to include a neural network in addition to the random forest used in that paper. The data for the example is generated from two uniformly distributed random variables: the target value is the sum of the two variables plus their product, plus noise (normally distributed about zero). This results in a simple but non-linear relationship between our target variable (y) and the two input variables (x1, x2).
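For concreteness, a minimal sketch of this data-generating process might look as follows. The sample size, uniform ranges and noise scale here are illustrative assumptions, not the exact values behind the figures:

```python
import numpy as np

# Minimal sketch of the data-generating process described above.
# Sample size, uniform ranges and noise scale are illustrative assumptions.
rng = np.random.default_rng(42)
n = 10_000
x1 = rng.uniform(-10, 10, n)
x2 = rng.uniform(-10, 10, n)
noise = rng.normal(0, 1, n)

# Target: the two variables plus their product, plus noise.
y = x1 + x2 + x1 * x2 + noise
X = np.column_stack([x1, x2])
```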
We fit three different models to two-thirds of the data and use the remaining third for out-of-sample testing. The models are (i) linear regression, (ii) a neural network and (iii) a random forest.
We find that the root mean squared error (RMSE) of the linear model, both in-sample and out-of-sample, is very poor, while the neural network and random forest perform very well, with out-of-sample RMSE similar to the level of noise in the data itself.
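Continuing the sketch above, the fitting and evaluation step could look like this in scikit-learn; the hyper-parameters are illustrative choices rather than the exact ones used to produce the reported results:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Two-thirds of the data for fitting, one-third held out for out-of-sample testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0),
    ),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    print(f"{name}: out-of-sample RMSE = {rmse:.2f}")
```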
Figure 1 plots the predicted target values versus the actual target values and shows (1) how poor the linear regression is, and (2) how well the two machine learning techniques did in this simple case.
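A plot in the spirit of Figure 1 could be produced along these lines, continuing from the fitted models above (the axes and styling are assumptions):

```python
import matplotlib.pyplot as plt

# Predicted vs. actual target values on the held-out data, one panel per model.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharex=True, sharey=True)
for ax, (name, model) in zip(axes, models.items()):
    ax.scatter(y_test, model.predict(X_test), s=4, alpha=0.4)
    lims = [y_test.min(), y_test.max()]
    ax.plot(lims, lims, color="black", linewidth=1)  # 45-degree reference line
    ax.set_title(name)
    ax.set_xlabel("actual y")
axes[0].set_ylabel("predicted y")
plt.tight_layout()
plt.show()
```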
Although in this example the random forest and neural network have demonstrated excellent out-of-sample predictive ability, machine learning techniques are often criticized as being hard to interpret and understand. They are often described as ‘black boxes’, implying that the cost of this enhanced predictive power is an inability to understand how the models work.
While complicated models are harder to understand, they are not impossible to understand, and with some effort this ‘black box’ concern can be addressed.
A method that can be used to aid understanding of how a complicated model makes its predictions is to examine the model’s ‘partial dependency’: how the predicted value changes as a selected input value changes. In this 2-dimensional example we can create a partial dependency plot (Figure 2) with the two input values on the axes and colour representing the predicted value. This allows us to observe the functional form of each model and gain a good understanding of its predictions.
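Continuing the earlier sketch, a plot in the spirit of Figure 2 can be approximated by predicting on a grid over the two inputs and mapping the prediction to colour (the grid range is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

# Predict on a grid over (x1, x2) and show the predicted value as colour.
grid = np.linspace(-10, 10, 100)
g1, g2 = np.meshgrid(grid, grid)
X_grid = np.column_stack([g1.ravel(), g2.ravel()])

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    z = model.predict(X_grid).reshape(g1.shape)
    im = ax.pcolormesh(g1, g2, z, shading="auto")
    ax.set_title(name)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```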
In this 2-dimensional case we can directly observe that the random forest and neural network have identified the non-linear relationship, while the linear regression model was unable to do so. Note the bottom-left corner of the plots, where the linear regression makes a large error compared with the machine learning techniques (and the true target values), and note the difference in scale of the linear regression predictions compared with the other models. This explains why the linear regression predictions are tightly grouped between +/-30 in Figure 1.
Naturally, for a more complicated problem partial dependency is only one of many tools that could be used to better understand a machine learning model.
At this point, some may argue that if we’d included an interaction term in our linear regression model then it would have done a good job – and they would be completely correct. However, how do we know which non-linear term to include in our regression model? In this example the fix is obvious because we know the relationship with certainty, but in the real world we normally don’t know which non-linear terms to include, so this easy fix isn’t available to us. In practice, we could use machine learning tools to identify the non-linear relationship from the data, consider whether this relationship makes sense, and then add it as a term to our linear model, as sketched below. There is nothing wrong with this as an approach (essentially a feature engineering exercise), but the important point is that model structure matters when trying to make good predictions: the right data with the wrong model won’t give good results.
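For illustration, such a feature-engineered linear model might look as follows, continuing from the earlier sketch; PolynomialFeatures with interaction_only=True simply adds the x1*x2 term as an extra input:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Linear regression with the x1*x2 interaction added as an extra feature.
lin_with_interaction = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
lin_with_interaction.fit(X_train, y_train)
rmse = mean_squared_error(y_test, lin_with_interaction.predict(X_test)) ** 0.5
print(f"linear regression + interaction: out-of-sample RMSE = {rmse:.2f}")
```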
Investors often confuse being able to explain the past with being able to predict the future. Being able to explain the past doesn’t mean you can predict the future. As an example, the model that best explains what happened in the past is likely to be highly over-fitted to that past data and therefore a poor predictor, unless the past repeats itself. Complicated linear regressions are prone to overfitting, while machine learning tools are often more robust to these issues (although not immune) and so have the potential to provide useful tools for prediction, albeit they are less good at explaining the past in a simple narrative.
Quantitative investors spend a significant amount of time searching for new information but the model structure used to combine this information into a forecast remains largely linear in nature. Machine learning methods allow investors to build quantitative models that are non-linear, incorporating complex and contextual relationships, through data-driven learning, expert-based design or a combination of both.