financialnoob.me

Blog about quantitative finance

Pairs trading. Analysis of several pair selection methods and trading strategies

In this article I will describe and implement 7 pair selection methods and 2 trading strategies based on a paper ‘Investigation of Stochastic Pairs Trading Strategies Under Different Volatility Regimes’ (Baronyan et al. 2010). Pair selection methods are based on several metrics and their combinations. Most of the metrics I’ve already implemented in my other pairs trading articles, but they were mostly used one at a time, not as combinations. Trading strategies are also similar to the ones I implemented before, but with some variations on how to calculate and model the spread.

I will use constituents of Vanguard Small-Cap Value ETF (ticker: VBR) as my stock universe. Time period from 2016–07–01 to 2018–12–31 will be used as a training period (for selecting pairs and calculating parameters of the trading strategy). Time period from 2019–01–01 to 2019–12–31 will be used as a testing (trading) period.


We start with pair selection techniques. All 7 proposed methods are combinations of the following metrics and tests:

  • Minimum Distance (MDM) — sum of squared distances between cumulative returns of two stocks.
  • Augmented Dickey-Fuller test (ADF) — testing the spread for stationarity (unit root).
  • Granger causality test (G) — testing whether the price of one stock is useful in predicting the price of another stock (you can read more about it in my other article).
  • Market Factor Ratio (MFR) — using market betas of two stocks and how close they are to each other.

I believe there is a typo in the paper and the formula for calculating Market Factor Ratio is wrong. Formula from the paper is shown below.

Market Factor Ratio (from the paper)

Our goal is to find two stocks with similar market betas. To do this we calculate MFR for each pair. Then the pairs are sorted by MFR in increasing order and top pairs (pairs with smallest MFR) are selected. Let’s plug in some numbers and see what happens.

If beta_1 is equal to beta_2, the formula above gives MFR=0. Now assume that beta_1=0.1 and beta_2=1. These betas are not similar at all, yet we get MFR=-0.9, which is smaller than zero and, according to the selection rule described above, would look even better than a perfect match.

I believe the correct formula is the following:

Market Factor Ratio (correct formula)

Beta_1 and beta_2 in the formula above are market betas of stock 1 and stock 2 respectively. More information about market beta and how to calculate it can be found here.

Now we also need to decide how to calculate the market return, which is required for calculating market betas. One option is to use returns of the S&P 500 index (or the SPY ETF), but it consists of large-cap stocks, whereas we work with small-cap stocks. Our stock universe (727 stocks) comes from the VBR ETF. This ETF is well diversified and contains only small-cap stocks, so I decided to use its returns to represent market returns.
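To make the beta and MFR calculations concrete, here is a small sketch. It assumes the corrected formula takes the absolute value of the beta ratio minus one (the exact corrected formula is in the image above), and estimates the market beta as the OLS slope of stock returns on market (VBR) returns. The data below is synthetic, for illustration only.

```python
import numpy as np

def market_beta(stock_returns, market_returns):
    # OLS slope of stock returns on market returns: beta = cov / var
    cov = np.cov(stock_returns, market_returns)
    return cov[0, 1] / cov[1, 1]

def market_factor_ratio(beta1, beta2):
    # Assumed corrected formula: absolute deviation of the beta ratio from 1,
    # so identical betas give MFR = 0 and dissimilar betas a large positive value
    return abs(beta1 / beta2 - 1)

# toy example with synthetic returns
rng = np.random.default_rng(0)
mkt = rng.normal(0, 0.01, 500)               # market (VBR) returns
s1 = 0.9 * mkt + rng.normal(0, 0.005, 500)   # stock with beta ~0.9
s2 = 1.0 * mkt + rng.normal(0, 0.005, 500)   # stock with beta ~1.0
s3 = 0.1 * mkt + rng.normal(0, 0.005, 500)   # stock with beta ~0.1

b1, b2, b3 = (market_beta(s, mkt) for s in (s1, s2, s3))
print(market_factor_ratio(b1, b2))  # small: betas are similar
print(market_factor_ratio(b3, b2))  # large: betas are very different
```

With this form the pathological case above disappears: beta_1=0.1 and beta_2=1 now give MFR=0.9, which correctly ranks worse than a pair with equal betas.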

All pair selection techniques proposed in the paper are just various combinations of the metrics described above. Let’s calculate these metrics for all pairs in our dataset, so that we can combine them afterwards and select pairs for trading.

Code for loading and preparing data is shown below.
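The original notebook code is a screenshot that isn't reproduced here, so below is a minimal sketch of the preparation step. The actual price download (e.g. for all VBR constituents) is assumed to have happened already; the function name and the synthetic demo prices are mine.

```python
import numpy as np
import pandas as pd

def prepare_prices(prices, train_end='2018-12-31', test_end='2019-12-31'):
    # drop stocks with any missing prices, then split into train/test periods
    prices = prices.dropna(axis=1)
    train = prices.loc[:train_end]
    test = prices.loc[train_end:test_end].iloc[1:]
    return train, test

# synthetic prices standing in for the downloaded data
idx = pd.bdate_range('2016-07-01', '2019-12-31')
rng = np.random.default_rng(1)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, (len(idx), 3)), axis=0)),
    index=idx, columns=['AAA', 'BBB', 'CCC'])
prices.loc[idx[0], 'CCC'] = np.nan  # a stock with missing data gets dropped

train, test = prepare_prices(prices)
```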

After preparing the data we are ready to calculate the metrics.

Everything should be clear from the code above. A couple of things to note:

  • The spread is calculated as a ratio of stock prices.
  • The Granger causality test is performed in both directions, and the p-value of each test is saved in the metrics dataframe.
  • The number of lags in the Granger causality test is set to 1 (I couldn’t find any information in the paper about which lag values to use).

When metrics for each pair are calculated, we are ready to perform pair selection. Seven methods are proposed in the paper:

  • {MFR} — select pairs with minimum Market Factor Ratio.
  • {ADF+G+MFR} — select pairs with minimum Market Factor Ratio, which also pass ADF and Granger causality tests.
  • {ADF+MFR} — select pairs with minimum Market Factor Ratio, which also pass ADF test.
  • {G+MFR} — select pairs with minimum Market Factor Ratio, which also pass Granger causality test.
  • {G} — select pairs with minimum sum of p-values of two Granger causality tests.
  • {ADF+G} — select pairs with minimum sum of p-values of two Granger causality tests, which also pass ADF test.
  • {MDM} — select pairs with minimum sum of squared distances.

We use each of the methods described above to select 5 pairs, which are then used for trading. I also make sure that all selected pairs consist of distinct stocks, so the 5 pairs contain 10 different stocks.

Code for {MFR} method of pair selection is shown below. We just sort the metrics dataframe based on MFR value in ascending order and select top 5 pairs (with the smallest MFRs).
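Since the selection code is a screenshot, here is a sketch of the logic: sort ascending by the chosen column and greedily pick pairs whose stocks haven't been used yet. The metrics dataframe is assumed to have one row per pair with 's1'/'s2' ticker columns (column names are mine).

```python
import pandas as pd

def select_pairs(metrics, sort_col='mfr', n=5, condition=None):
    """Sort by sort_col ascending and greedily pick n pairs of distinct stocks."""
    df = metrics if condition is None else metrics[condition]
    df = df.sort_values(sort_col)
    used, pairs = set(), []
    for _, row in df.iterrows():
        if row['s1'] not in used and row['s2'] not in used:
            pairs.append((row['s1'], row['s2']))
            used.update([row['s1'], row['s2']])
        if len(pairs) == n:
            break
    return pairs

# toy example: ('A', 'C') is skipped because 'A' is already taken
metrics = pd.DataFrame({'s1': ['A', 'A', 'C'], 's2': ['B', 'C', 'D'],
                        'mfr': [0.01, 0.02, 0.03]})
print(select_pairs(metrics, n=2))  # [('A', 'B'), ('C', 'D')]
```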

Selected pairs {MFR}

To select pairs using {ADF+G+MFR} method we first create a condition that selected pairs pass ADF and Granger causality tests (p-values of each test are less than 0.05). Then we sort the pairs satisfying that condition and select pairs with the smallest MFR values. Code for doing it and selected pairs are shown below.
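The condition itself can be sketched like this (the column names for the p-values are mine; the toy metrics table is synthetic):

```python
import pandas as pd

# hypothetical metrics table with one row per candidate pair
metrics = pd.DataFrame({
    's1':    ['A', 'C', 'E', 'G'],
    's2':    ['B', 'D', 'F', 'H'],
    'adf_p': [0.01, 0.20, 0.03, 0.04],
    'g12_p': [0.02, 0.01, 0.04, 0.30],
    'g21_p': [0.03, 0.02, 0.01, 0.02],
    'mfr':   [0.05, 0.01, 0.02, 0.03]})

# keep only pairs that pass the ADF test and both Granger tests at the 5% level...
passed = (metrics['adf_p'] < 0.05) & (metrics['g12_p'] < 0.05) & (metrics['g21_p'] < 0.05)
# ...then take the passing pairs with the smallest MFR
selected = metrics[passed].sort_values('mfr').head(5)
print(selected[['s1', 's2']].values.tolist())  # [['E', 'F'], ['A', 'B']]
```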

Selected pairs {ADF+G+MFR}

The next two methods are similar, just with different conditions. {ADF+MFR} requires only that pairs pass the ADF test. {G+MFR} requires only that pairs pass the two-way Granger causality test. Again, pairs are then selected based on the smallest MFR values.

Selected pairs {ADF+MFR}

Selected pairs {G+MFR}

Method {G} requires us to calculate the sum of p-values of two Granger causality tests and select pairs with the smallest value of that sum. We create a new column equal to the sum of two p-values and sort the dataframe based on it.

Selected pairs {G}

{ADF+G} method is similar to the previous one, but has an additional condition that selected pairs pass the ADF test.

Selected pairs {ADF+G}

And finally the last method selects pairs based on the smallest squared distances between cumulative returns of two stocks.

Selected pairs {MDM}
Pairs selected by different methods

As we can see above, different methods select completely different pairs. There is no overlap: we get 7*5=35 unique pairs.

The next step is choosing the length of the window that we will use to calculate spread parameters (mean and standard deviation). The selection is made based on the in-sample performance of the strategy. Our training period lasts from 2016–07–01 to 2018–12–31; I will use the sub-period from 2018–01–01 to 2018–12–31 as the trading period for selecting window size W.

In the paper the authors work with weekly returns and test window lengths of 24, 36, 48, 60 and 72 weeks. We work with daily data, so which window lengths should we test? With weekly returns the number of weeks equals the number of datapoints used, so should we use the same number of datapoints, or keep the windows the same length in calendar time and use more datapoints? I think the answer is somewhere in between, so I will test window lengths of 48, 72, 96, 120 and 144 days.

Now we need to run the trading strategy in-sample with different window lengths and select the length W that provides the best total return. Should we use a different window length for each pair, or find one common W that gives the best total return overall? I decided to work with each pair individually.

The first (simpler) strategy works as follows:

  • Calculate the spread (R) as a ratio of stock prices in a given pair.
  • Calculate the mean (mu) and standard deviation (sigma) of the spread using last W datapoints (W — window length).
  • Open and close positions according to the rules shown below.

Trading rules
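Since the trading-rules figure is a screenshot, the rules can be sketched in code as follows. The entries at plus/minus two rolling standard deviations follow the text; the exit rule (close when the spread crosses back through its rolling mean) is my assumption, as the exact exit condition is only in the figure.

```python
import numpy as np
import pandas as pd

def positions_2std(ratio, W):
    """Position series for one pair: +1 = long the spread, -1 = short, 0 = flat."""
    mu = ratio.rolling(W).mean()
    sigma = ratio.rolling(W).std()
    pos = np.zeros(len(ratio))
    for t in range(W, len(ratio)):
        if pos[t - 1] == 0:
            if ratio.iloc[t] > mu.iloc[t] + 2 * sigma.iloc[t]:
                pos[t] = -1  # spread too high: short stock1, long stock2
            elif ratio.iloc[t] < mu.iloc[t] - 2 * sigma.iloc[t]:
                pos[t] = 1   # spread too low: long stock1, short stock2
        else:
            # assumed exit: hold until the spread crosses back through its mean
            crossed = (pos[t - 1] == 1 and ratio.iloc[t] >= mu.iloc[t]) or \
                      (pos[t - 1] == -1 and ratio.iloc[t] <= mu.iloc[t])
            pos[t] = 0 if crossed else pos[t - 1]
    return pd.Series(pos, index=ratio.index)

# deterministic toy example: a flat, slightly oscillating ratio with a short spike
ratio = pd.Series([1.0 + 0.01 * (-1) ** i for i in range(100)]
                  + [1.2] * 5
                  + [1.0 + 0.01 * (-1) ** i for i in range(45)])
pos = positions_2std(ratio, W=50)
# a short is entered at the spike and closed once the ratio falls back below the mean
```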

Code for testing different window sizes is shown below. Each pair is processed individually. Data from 2018–01–01 to 2018–12–31 is used for backtests. Using different window sizes we calculate profits and save them in a dataframe.

First 10 rows of the resulting dataframe are shown below.

Pair performance with different window sizes

When we trade a given pair we select window length W which has the best in-sample performance. For example for the first pair (AEO-KFRC) we would use W=96.

Now we are ready to run a backtest. Code for it is shown below.

For each method we create a positions dataframe. It contains the positions we should hold on each day of the trading period. These positions are then used to calculate returns and performance of the strategy. The code below shows how to calculate simple returns and cumulative returns from the positions dataframe. On line 6 we divide the sum of returns by 5 because our capital is divided equally between 5 pairs. We then multiply it by 2 because we can use twice the amount of capital available.
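The returns calculation can be sketched as follows. This is not the screenshot code: the one-day lagging convention and function name are mine, and the 5-pair aggregation described above is shown as a comment.

```python
import pandas as pd

def strategy_returns(positions, ret1, ret2):
    # +1 = long stock1 / short stock2, -1 = the reverse;
    # positions are lagged one day so today's signal earns tomorrow's return
    pos = positions.shift(1).fillna(0)
    return pos * (ret1 - ret2)

# toy example for one pair
pos = pd.Series([0, 1, 1])
ret1 = pd.Series([0.01, 0.02, 0.03])
ret2 = pd.Series([0.00, 0.00, 0.01])
pair_ret = strategy_returns(pos, ret1, ret2)  # ~ [0, 0, 0.02]

# with 5 pairs: capital is split equally and we can deploy twice the capital, so
# total_ret = 2 * (sum of pair returns) / 5
# cum_ret = (1 + total_ret).cumprod() - 1
```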

Next we calculate different performance metrics and save them in a dataframe.

Performance metrics (simple 2std strategy)

In the screenshot above you can see that only two pair selection techniques provide positive returns: pair selection based on the Granger causality test (G) and pair selection based on minimum distance (MDM). The total return of the strategy with the MDM pair selection method is very close to zero and would almost certainly turn negative once transaction costs are included. The performance of this simple strategy with pair selection based on the Granger causality test looks promising: it gives an annual return of 38% and a Sharpe ratio of 1.26.


Now let’s implement the second strategy from the paper. In this strategy we assume that spread follows Vasicek model shown below.

Spread model (Vasicek model)

I’ve already used Vasicek model in my previous articles (here and here), but in a little bit different form. In those articles parameters of the model were estimated using Expectation Maximization algorithm. In this paper authors propose using generalized method of moments (GMM).

I am not going to describe all the theory behind GMM here. It’s a vast topic and there are many resources covering it; some of them are mentioned in the references at the end of this article. I will just implement the method using the formulas provided in the paper.

First let’s see how GMM works on synthetic data. Code for generating time series from Vasicek model is shown below. I’m using discretized version of the model here.

Synthetic spread

The picture above shows the generated time series. Now we want to estimate the parameters of the underlying model. Luckily for us, the authors provide all the required formulas in the paper. I will just give a brief summary of what we need to do.

Vasicek model can be represented in discrete time as follows:

Spread model (discrete time)
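The equation image is not reproduced in the text; the standard Euler discretization (with a daily step, so no time-scaling fractions appear) reads:

```latex
S_{t+1} = S_t + \kappa(\theta - S_t) + \sigma\,\varepsilon_{t+1},
\qquad \varepsilon_{t+1} \sim \mathcal{N}(0, 1)
```

Here theta is the long-term mean, kappa the speed of mean reversion and sigma the (daily) volatility.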

Defining some new variables and rearranging we get:

Spread model (another form)

Define moment error functions:

Moment error functions

In the paper it is assumed that sigma represents annual volatility and since they are using weekly data they divide it by 52 (number of trading weeks in a year) to get weekly volatility. We are using daily data and I assume that sigma represents daily volatility, so there are no fractions in the equations above.

To estimate parameters of the model using GMM we need to minimize the following quadratic form (criterion function):

Criterion function

There are many different choices for the weighting matrix W, but I will use the simplest case when W is an identity matrix. In that case we have the following criterion function to minimize:

Criterion function (W=I)

Now we can implement and minimize this function in Python. Look at the code below. First we define a function that calculates the value of criterion function given model parameters and data. Then we run minimization function on synthetic data to estimate true parameters.
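Since the estimation code is a screenshot, here is a self-contained sketch using scipy. The moment conditions are a standard choice for the discretized model, not necessarily the paper's exact formulas (those are in the missing images): with eps_t = S[t+1] - S[t] - kappa*(theta - S[t]), we require E[eps] = 0, E[eps*S_t] = 0 and E[eps^2 - sigma^2] = 0.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_criterion(params, s):
    # identity-weighted GMM criterion J = g'g for the Euler-discretized Vasicek model
    kappa, theta, sigma = params
    eps = s[1:] - s[:-1] - kappa * (theta - s[:-1])
    g = np.array([eps.mean(),                     # E[eps] = 0
                  (eps * s[:-1]).mean(),          # E[eps * S_t] = 0
                  (eps ** 2 - sigma ** 2).mean()])  # E[eps^2] = sigma^2
    return g @ g

# regenerate a synthetic path so the snippet is self-contained
rng = np.random.default_rng(0)
kappa_true, theta_true, sigma_true = 0.1, 3.5, 0.1
s = np.empty(1000)
s[0] = theta_true
for t in range(len(s) - 1):
    s[t + 1] = s[t] + kappa_true * (theta_true - s[t]) + sigma_true * rng.standard_normal()

# minimize the criterion, starting from rough data-based guesses
x0 = [0.5, s.mean(), np.diff(s).std()]
res = minimize(gmm_criterion, x0, args=(s,), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-12, 'maxiter': 10000})
kappa_hat, theta_hat = res.x[0], res.x[1]
sigma_hat = abs(res.x[2])  # sigma enters the criterion only through sigma^2
print(kappa_hat, theta_hat, sigma_hat)
```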

Recall that we have the following true parameters: sigma=0.1, theta=3.5, kappa=0.1. We get the following estimates from GMM:

Estimated parameters

As you can see in the screenshot above, the estimated parameters are close to the true ones. The estimate for kappa is not very accurate, but we don’t actually need to know kappa to implement our strategy; we just need theta (the long-term mean) and sigma (the volatility).

Let’s try to run 1000 simulations of generating data (with the same parameters) and estimating model parameters using GMM.

Now we can take the mean of estimated parameters and check how close they are to true values.

Mean of parameters from 1000 sims

Again we see that the estimates of theta and sigma are close to their true values, whereas the estimate of kappa is not. We can still backtest the strategy, because we don’t need to know the value of kappa to do it.

After parameters of the model are determined, the strategy is similar to what we’ve done before. We open a long position when the spread is more than two standard deviations below the long-term mean (theta). We open a short position when the spread is more than two standard deviations above the long-term mean.

Before backtesting the strategy we need to determine the optimal window length W. Recall that it is done based on in-sample performance of the strategy. We just need to make some minor changes in the code we used before.

The only new piece of code is in lines 23–27. Instead of just calculating the mean and standard deviation of the spread, we estimate parameters of the Vasicek model using GMM technique. First 10 rows of the resulting dataframe are shown below.

Pair performance with different window sizes

Now we can perform the backtest. Again, only a small portion of the code changes; everything else is exactly as before.

Performance metrics (strategy based on Vasicek model)

Performance metrics of the strategy are shown in the screenshot above. Notice that the same pair selection techniques (G and MDM) have positive returns; for every other technique the total return is negative. We also see that this new strategy outperforms the previous one in both cases. The total return of the strategy with pair selection based on the Granger causality test increases from 38% to 53%, and its Sharpe ratio increases from 1.26 to 1.56. The total return of the strategy with pair selection based on minimum distance increases from 0.9% to 9%, and its Sharpe ratio increases from 0.15 to 0.93. So the trading strategy based on the Vasicek model performs better than the simple strategy based on the 2-SD rule.


Ideas for further improvements:

  • Experiment with different window lengths.
  • Try different methods of constructing a spread.
  • Try different indicators of market returns (e.g. SP500).
  • Try some more advanced methods of pair selection.
  • Use shorter trading period.
  • Add more lags in Granger causality tests.

Jupyter notebook with source code is available here.

If you have any questions, suggestions or corrections please post them in the comments. Thanks for reading.


References

[1] Investigation of Stochastic Pairs Trading Strategies Under Different Volatility Regimes (Baronyan et al. 2010)

[2] https://en.wikipedia.org/wiki/Vasicek_model

[3] Short Introduction to the Generalized Method of Moments

[4] https://notes.quantecon.org/submission/5b3b1856b9eab00015b89f90

[5] Granger causality test in pairs trading

[6] Pairs trading. Modeling the spread with mean-reverting Gaussian Markov chain model

[7] Pairs Trading. Modeling the spread as Gaussian state-space model with exogenous inputs
