
Pairs trading with Support Vector Machines

In this article I will implement and test a trading strategy inspired by the paper ‘Data mining for algorithmic asset management: an ensemble learning approach’ (Montana, Parella, 2009). The strategy presented in the paper is not really a pairs trading strategy. Instead, it uses a synthetic asset (generated from several cross-sectional data streams) to determine whether the target asset (the one we trade) is overpriced or underpriced.


The general idea of the algorithm is this (a minimal sketch of the decision rule follows the list):

  • start with n+1 data streams: price of the target asset plus n other data streams that are used to determine the fair price of the asset (prices of other assets, market factors, indicators, etc.)
  • at each time step (daily, before the market close) we estimate the fair price of the target asset from the other market data using a Support Vector Regression (SVR) model
  • if the fair price is lower than the current market price, we sell the target asset (go short)
  • if the fair price is higher than the current market price, we buy the target asset (go long)
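The whole decision rule boils down to the sign of the mispricing. The function below is a hypothetical illustration, not code from the paper:

```python
# Hypothetical illustration of the decision rule described above.
def trading_signal(fair_price: float, market_price: float) -> int:
    """Return +1 to go long, -1 to go short the target asset."""
    return 1 if fair_price > market_price else -1
```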

The number of data streams used to estimate the fair price can be quite large, and many of them can be highly correlated with each other, providing redundant information to the algorithm. To solve this problem, Principal Component Analysis (PCA) is used to extract a few features that explain most of the variance and are uncorrelated with each other. The extracted features are then used as input to the SVR model.
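A minimal sketch of this step with scikit-learn (the data here is a random placeholder; in the actual strategy the matrix X holds the explanatory data streams, one column per stream):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data for illustration: 250 days of 40 explanatory streams.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 40))

# Standardize so that streams on larger scales don't dominate the PCA.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)            # the paper uses a single component
features = pca.fit_transform(X_std)  # uncorrelated feature(s) for the SVR
print(pca.explained_variance_ratio_)
```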

The Support Vector Regression model has several hyperparameters that need to be determined:

  • C — regularization parameter
  • gamma — kernel coefficient (denoted as sigma in the paper)
  • epsilon — the width of the tube within which no penalty is given in the loss function

Instead of trying to determine a fixed set of hyperparameters, a whole ensemble of models (called experts in the paper) is trained, and the Weighted Majority Voting (WMV) algorithm is used to make the final prediction. The authors use 2560 models in total, each with a different set of hyperparameters.

The WMV algorithm works as follows:

  • All models start with a weight of one
  • Each model is trained on the input data and used to make a prediction about the direction of the next market move (whether the price of the target security will go up or down on the next day)
  • The final decision is determined by comparing the total weight of models predicting that the price will go up with the total weight of models predicting that the price will go down
  • If the final decision turns out to be correct, no adjustments to the weights are made
  • If the final decision is wrong, then the weights of the models that were wrong are multiplied by the parameter beta (which is selected by the user and should be between zero and one)

This allows the algorithm to adapt to dynamic market conditions by gradually decreasing the weights of models that make many mistakes.
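A minimal sketch of the voting and weight-update logic (the function and variable names are my own, not from the paper):

```python
import numpy as np

def wmv_vote(weights, predictions):
    """Weighted majority vote over direction predictions in {-1, +1}."""
    return 1 if np.dot(weights, predictions) >= 0 else -1

def wmv_update(weights, predictions, decision, outcome, beta=0.5):
    """Multiply the weights of the wrong experts by beta, but only
    when the final (ensemble) decision itself turned out to be wrong."""
    if decision != outcome:
        weights = weights.copy()
        weights[predictions != outcome] *= beta
    return weights

# Example: 512 experts, all starting with a weight of one.
weights = np.ones(512)
preds = np.random.default_rng(1).choice([-1, 1], size=512)
decision = wmv_vote(weights, preds)
weights = wmv_update(weights, preds, decision, outcome=-1)
```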

In the paper the authors describe incremental versions of both the PCA and SVR algorithms, which allow the models to be updated at each time step without retraining from scratch. I haven’t been able to fully understand and implement the incremental algorithms yet, so I decided to test a similar strategy with the usual (non-incremental) PCA and SVR.

The target asset that I’m going to use is the VanEck Biotech ETF (BBH). The other data streams (used as input) include the prices of the ETF’s holdings, the prices of other biotech ETFs, and the price of the SPY ETF.

To backtest this strategy we need to determine several parameters, namely:

  • Parameter beta for downgrading the weights of the wrong models. (In the paper several values are tested.)
  • Number of principal components to extract and use as features. (In the paper only one component is used.)
  • Number of models to train, as well as the values of the model hyperparameters to use in the grid. (In the paper 2560 models are used, but no information about the exact values of the hyperparameters is provided. Only their ranges are given: epsilon varies between 0.00000001 and 0.1, while C and gamma vary between 0.0001 and 1000.)
  • Number of past trading days to use for model training. (In the paper the last 20 days are used.)

I will test several values of beta and different numbers of principal components to determine what works best. I will use a smaller number of models (512) and keep the length of the training window the same as in the paper (the last 20 days).

Let’s get started.


First I load the data and transform prices into cumulative returns.

Then I separate the price of the target asset from the other input data.
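A minimal sketch of these two steps (the file name is a placeholder; the actual data preparation is in the notebook):

```python
import pandas as pd

# 'prices.csv' is a placeholder name: daily close prices,
# one column per ticker, indexed by date.
prices = pd.read_csv('prices.csv', index_col=0, parse_dates=True)

# Transform prices into cumulative returns relative to the first day.
cum_returns = prices / prices.iloc[0] - 1

# Separate the target asset (BBH) from the explanatory data streams.
y = cum_returns['BBH']
X = cum_returns.drop(columns=['BBH'])
```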

Now I’d like to see how many principal components it is reasonable to use. To do this I will use a scree plot.
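A sketch of how the scree plot can be produced (using X from the previous step):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit PCA on all standardized input streams and plot the fraction of
# variance explained by each component.
pca = PCA().fit(StandardScaler().fit_transform(X))

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('PCA scree plot')
plt.show()
```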

PCA scree plot

We see that the explained variance ratio drops significantly after the 2nd or 3rd component, so I think we should extract no more than 3 principal components as our features.

First I will backtest different values of beta, holding the number of principal components fixed.

And a grid of SVR hyperparameter values, eight per parameter, which gives 8*8*8 = 512 models.
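The exact values I used are in the notebook; a plausible reconstruction, log-spaced over the ranges given in the paper, might look like this:

```python
import numpy as np
from itertools import product

# Log-spaced grids over the ranges reported in the paper; these exact
# values are an assumption made for illustration.
C_values = np.logspace(-4, 3, 8)         # 0.0001 ... 1000
gamma_values = np.logspace(-4, 3, 8)     # 0.0001 ... 1000
epsilon_values = np.logspace(-8, -1, 8)  # 1e-8 ... 0.1

param_grid = list(product(C_values, gamma_values, epsilon_values))
print(len(param_grid))  # 512
```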

I skip the first 50 days when calculating the strategy’s performance to allow the weights of the WMV algorithm to adjust. A condensed sketch of the daily backtest loop is shown below.
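The sketch reuses wmv_vote, wmv_update and param_grid from above; the structure is my own simplification (the full version is in the notebook), and note that fitting 512 SVRs per day is slow:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

window, beta = 20, 0.5
weights = np.ones(len(param_grid))
positions, strategy_returns = [], []

for t in range(window, len(y) - 1):
    # Fit the scaler and PCA on the trailing 20-day window only.
    scaler = StandardScaler().fit(X.iloc[t - window:t])
    pca = PCA(n_components=1).fit(scaler.transform(X.iloc[t - window:t]))
    feats = pca.transform(scaler.transform(X.iloc[t - window:t + 1]))

    # Each expert estimates today's fair price and votes on direction.
    preds = np.empty(len(param_grid))
    for i, (C, gamma, eps) in enumerate(param_grid):
        svr = SVR(C=C, gamma=gamma, epsilon=eps)
        svr.fit(feats[:-1], y.iloc[t - window:t])
        fair = svr.predict(feats[-1:])[0]
        preds[i] = 1 if fair > y.iloc[t] else -1

    decision = wmv_vote(weights, preds)

    # Realized direction of the next day's move, used to update weights.
    outcome = 1 if y.iloc[t + 1] > y.iloc[t] else -1
    weights = wmv_update(weights, preds, decision, outcome, beta)

    positions.append(decision)
    daily_ret = (1 + y.iloc[t + 1]) / (1 + y.iloc[t]) - 1
    strategy_returns.append(decision * daily_ret)
```

Here are the results that I get: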

We can see that all of the strategies are profitable and that the value of the parameter beta doesn’t have a big impact on the performance of the algorithm. This agrees with the results in the paper, where the algorithm also performed well for a wide range of betas. Also notice that the returns of the strategy are not correlated with the returns of the traded asset.

Now I will fix beta at 0.5 and try different numbers of principal components. The results are presented below.

We notice that using just one principal component gives the best results, which is the same number as used in the paper. The other components probably contribute nothing but noise, which is why the performance deteriorates significantly when we use two or three components.

Now I will take a closer look at the performance of the algorithm using 1 PC and beta = 0.5. Below is a plot of the cumulative returns of the algorithm and the target asset (BBH).

Cumulative returns of the strategy and target asset

As we can see in the plot above, our strategy significantly outperforms a simple buy-and-hold of the target asset. Now let’s see how well our algorithm performed compared to the market (SPY).

Cumulative returns of the strategy and the market

It seems that we can even outperform the market. Of course, we didn’t account for transaction costs. Let’s add them and see if the strategy is still profitable. I assume a two-way transaction cost of 0.2%, which should be enough to cover both broker fees and the bid-ask spread. I subtract transaction costs from the strategy returns only when the position changes from long to short or vice versa. A sketch of this adjustment is shown below, followed by a plot of the resulting returns.
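A sketch of the adjustment (using positions and strategy_returns from the backtest loop above):

```python
import numpy as np

tc = 0.002  # two-way transaction cost of 0.2%

positions = np.asarray(positions)
strategy_returns = np.asarray(strategy_returns)

# Charge the cost only on days when the position flips from long to
# short or vice versa (the initial entry is left uncharged here).
flips = np.concatenate(([False], positions[1:] != positions[:-1]))
net_returns = strategy_returns - tc * flips
```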

Cumulative returns of the strategy (with transaction costs) and target asset

Finally, let’s compare some performance metrics. A sketch of how they can be computed is shown below.
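A minimal sketch of these metrics computed from a series of daily returns (annualizing the Sharpe ratio with 252 trading days is my assumption):

```python
import numpy as np
import pandas as pd

def performance_metrics(daily_returns: pd.Series) -> dict:
    """Total return, annualized Sharpe ratio, maximum drawdown and
    maximum drawdown duration of a daily return series."""
    equity = (1 + daily_returns).cumprod()
    sharpe = np.sqrt(252) * daily_returns.mean() / daily_returns.std()
    drawdown = equity / equity.cummax() - 1
    # Longest run of consecutive days spent below a previous peak.
    underwater = drawdown < 0
    runs = underwater.astype(int).groupby((~underwater).cumsum()).cumsum()
    return {'total_return': equity.iloc[-1] - 1,
            'sharpe': sharpe,
            'max_drawdown': drawdown.min(),
            'max_dd_duration_days': int(runs.max())}
```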

Performance metrics

The Sharpe ratio of SPY is a little higher than that of our algorithm, its maximum drawdown is a little smaller, and its maximum drawdown duration is shorter. But the return of our strategy is 1.7 times larger than that of SPY, and this is before transaction costs. Adding transaction costs significantly lowers the performance of the strategy, but the results are still better than a simple buy-and-hold of the traded asset. I also think that the 0.2% I used for transaction costs is overly conservative; it may be possible to trade with lower fees, which would improve the overall performance.

Now let’s check if the daily returns of our strategy are correlated with the daily returns of SPY and BBH.
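A quick way to check this (assuming the daily return series of the strategy, SPY and BBH are aligned on the same dates):

```python
import pandas as pd

# strategy_daily, spy_daily and bbh_daily are hypothetical aligned
# daily return series (pandas Series with a common date index).
df = pd.DataFrame({'strategy': strategy_daily,
                   'SPY': spy_daily,
                   'BBH': bbh_daily})
print(df.corr())
```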

Correlations of daily returns

As we can see above, the correlation coefficients are very low.

Further improvements:

  • Add more input data streams (more biotech ETFs, different indices, etc.)
  • Add entry thresholds, so that we enter a long/short position only when the total weight of the models voting for it exceeds a certain threshold (right now we enter a position whenever the weight of the models voting for it exceeds the weight of the models voting for the opposite position, even if only by a very small amount)
  • Try using a larger number of experts (models)
  • Implement the incremental PCA and SVR algorithms

Although this strategy is not really a pairs trading strategy, it has the same advantages: its returns are not correlated with the market or with the returns of the target asset, and it can potentially make money regardless of underlying market conditions.

Jupyter notebook with source code is available here.

If you have any questions, suggestions or corrections please post them in the comments. Thanks for reading.


References

[1] Data mining for algorithmic asset management: an ensemble learning approach (Montana, Parella, 2009)

[2] Learning to Trade with Incremental Support Vector Regression Experts (Montana, Parella, 2009)
