financialnoob.me

Blog about quantitative finance

Pairs trading. Pair selection. Cointegration (Part 1)

In the two previous articles we talked about distance methods of pair selection. When we created a portfolio of two stocks, we assumed equal capital allocation to each stock in the pair. This probably limited the number of potential pairs that were selected for further analysis. Even if two stocks are influenced by the same risk factors, it does not necessarily mean those factors have the same effect on both stocks.

Consider this synthetic example. Assume we have two stocks: stock A and stock B. They are subject to the same risk factor F, which follows a random walk process, but their coefficients of dependence on F (betas) are different. More formally we have the following:

Prices of stocks A and B
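
One way to make this setup concrete is to simulate it. The sketch below assumes Gaussian noise; the betas, starting level, seed and sample size are illustrative values chosen here, not taken from the original example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the example is reproducible
n = 1000                         # number of observations (illustrative)

# Common risk factor F: a random walk (cumulative sum of white noise).
F = 50 + np.cumsum(rng.normal(0.0, 1.0, n))

# Hypothetical betas; the only requirement is that they differ.
beta_a, beta_b = 1.0, 0.5

# Each price is its beta times F plus stationary idiosyncratic noise.
price_a = beta_a * F + rng.normal(0.0, 1.0, n)
price_b = beta_b * F + rng.normal(0.0, 1.0, n)
```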

Now if we construct a pair portfolio by allocating the same amount of capital to each stock (long stock A and short stock B), we won’t be able to cancel the random walk component and therefore our portfolio won’t be stationary. You can see below that the mean of the portfolio is not stable. This is because we still have a random walk component in the portfolio price series.

Formula for equally weighted portfolio

Plot of equally weighted portfolio

But notice that it is still possible to create a stationary portfolio: we just need to change the weights of the stocks in the portfolio so that the random walk component is canceled. The formula for our new portfolio will look like this:

For each unit of capital allocated to long position in stock A, we allocate hedge_ratio units of capital to short position in stock B.
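
To see why a suitable weight removes the random walk, it helps to write the portfolio out. The notation below is an assumption made for illustration (prices linear in F with stationary noise terms, and h standing for hedge_ratio); the original formulas may be written differently.

```latex
\begin{aligned}
P_{A,t} &= \beta_A F_t + \varepsilon_{A,t}, \qquad
P_{B,t} = \beta_B F_t + \varepsilon_{B,t}, \qquad
F_t = F_{t-1} + w_t \\
P_{A,t} - h\,P_{B,t} &= (\beta_A - h\,\beta_B)\,F_t + \varepsilon_{A,t} - h\,\varepsilon_{B,t}
\end{aligned}
```

With h = 1 (equal weights) the term (β_A − β_B)·F_t survives whenever the betas differ, so the random walk stays in the portfolio. Choosing h = β_A / β_B makes the coefficient on F_t zero, and only the stationary noise terms remain.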

To calculate the hedge ratio we can use the Ordinary Least Squares (OLS) method. The code to calculate the hedge ratio along with the plot of the resulting portfolio is presented below.
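
A minimal sketch of such a calculation, assuming synthetic prices built from a shared random walk (the betas, noise levels and seed are illustrative assumptions) and plain NumPy least squares rather than any particular regression library:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
F = 100 + np.cumsum(rng.normal(0.0, 1.0, n))  # shared random-walk factor
price_a = 2.0 * F + rng.normal(0.0, 1.0, n)   # hypothetical beta_A = 2.0
price_b = 1.0 * F + rng.normal(0.0, 1.0, n)   # hypothetical beta_B = 1.0

# OLS regression with an intercept:
#   price_a ~ intercept + hedge_ratio * price_b
X = np.column_stack([np.ones(n), price_b])
coef, *_ = np.linalg.lstsq(X, price_a, rcond=None)
intercept, hedge_ratio = coef

# Long one unit of A, short hedge_ratio units of B.
spread = price_a - hedge_ratio * price_b
equal_weighted = price_a - price_b

print(f"hedge ratio: {hedge_ratio:.3f}")
print(f"std of hedged spread:         {spread.std():.2f}")
print(f"std of equal-weighted spread: {equal_weighted.std():.2f}")
```

In this setup the estimated hedge ratio should land close to beta_A / beta_B, and the hedged spread should fluctuate much less than the equal-weighted one, because the equal-weighted spread still carries the random walk.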

Plot of the spread calculated using hedge ratio

This portfolio looks a lot more stationary than the one we got using equal weights.

Basically, if we have two time series and it is possible to construct a linear combination of them that is stationary (has constant mean and variance), those time series are said to be cointegrated. (This is not a rigorous mathematical definition, but it explains the basic principles.) Let’s try to apply a cointegration test (the Cointegrated Augmented Dickey-Fuller test) to our price series to confirm that they are cointegrated.

The second number is the p-value, and it is very close to zero, so we can reject the null hypothesis that the price series are not cointegrated. So when we have two stocks whose prices are cointegrated, we can try to use the method described above to create a mean-reverting portfolio. Now let’s try to apply this method to real data.


In the previous articles about the distance method of pair selection I demonstrated that many selected pairs are cointegrated according to the CADF test, so we now know that simply choosing pairs that have a CADF p-value < 0.01 is not enough to find pairs that will continue to be cointegrated in the trading period. Therefore in this post I will combine some of the tests I used before with the cointegration approach.

In the first part I will perform the same tests as in the previous articles on distance method, but now I will allow the spread to have different weights for each stock. As a reminder, I am going to select only the pairs that are:

  • Cointegrated (CADF p-value < 0.01)
  • Hurst exponent of the spread < 0.5
  • Half-life of mean reversion of the spread more than 1 day and less than 30 days
  • Number of zero crossings of the spread > 12 per year
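
The Hurst exponent and half-life filters can be sketched as follows. These are common textbook estimators and an assumption on my part: the exact functions used in the notebook may differ. The test series is a synthetic AR(1) process with an illustrative coefficient of 0.9.

```python
import numpy as np

def half_life(spread):
    """Half-life of mean reversion from the regression
    delta_s[t] = a + lam * s[t-1]; returns -ln(2) / lam."""
    lagged = spread[:-1]
    delta = np.diff(spread)
    X = np.column_stack([np.ones(len(lagged)), lagged])
    coef, *_ = np.linalg.lstsq(X, delta, rcond=None)
    lam = coef[1]
    # lam >= 0 means no mean reversion; the result is then not meaningful.
    return -np.log(2) / lam

def hurst_exponent(spread, max_lag=100):
    """Slope of log(std of lagged differences) against log(lag)."""
    lags = np.arange(2, max_lag)
    tau = [np.std(spread[lag:] - spread[:-lag]) for lag in lags]
    slope, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return slope

# Synthetic mean-reverting spread (AR(1) with coefficient 0.9).
rng = np.random.default_rng(7)
noise = rng.normal(size=2000)
spread = np.zeros(2000)
for t in range(1, 2000):
    spread[t] = 0.9 * spread[t - 1] + noise[t]

hl = half_life(spread)
h = hurst_exponent(spread)
passes = (h < 0.5) and (1 < hl < 30)
print(f"half-life: {hl:.1f} days, Hurst: {h:.2f}, passes filters: {passes}")
```

A Hurst exponent below 0.5 indicates mean-reverting behavior, around 0.5 a random walk, and above 0.5 a trending series, which is why the filter uses 0.5 as the cutoff.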

In the second part I will try to combine several metrics and use some machine learning techniques to determine which pairs should be chosen for trading.

Note: there are minor mistakes I made in functions used to calculate some of the metrics that became evident when I started working on this article:

  • Instead of calculating the number of zero crossings, we should calculate the number of historical mean crossings. It wasn’t important before, because most spreads’ means were very close to zero, but now it is not the case. Now we can have spreads with means far from zero.
  • The 2-SD band on the plots was calculated around zero instead of the historical mean. Again, it wasn’t important before, but it is now, for the same reasons as above.
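
The difference between the two ways of counting crossings is easy to see on a small example. The helper names and the toy spread below are hypothetical and only illustrate the correction described above.

```python
import numpy as np

def crossings(spread, level):
    """Count how many times the series crosses the given level."""
    signs = np.sign(np.asarray(spread, dtype=float) - level)
    signs = signs[signs != 0]  # ignore points lying exactly on the level
    return int(np.sum(signs[1:] != signs[:-1]))

def two_sd_band(spread):
    """2-SD band centered on the historical mean rather than on zero."""
    mu, sd = np.mean(spread), np.std(spread)
    return mu - 2 * sd, mu + 2 * sd

# Toy spread that oscillates around 10.5 and never comes near zero.
spread = np.array([10.0, 12.0, 9.0, 11.0, 8.0, 13.0])

zero_count = crossings(spread, 0.0)            # 0
mean_count = crossings(spread, spread.mean())  # 5
lower, upper = two_sd_band(spread)
print(zero_count, mean_count, round(lower, 2), round(upper, 2))
```

To compare the count with a per-year threshold, it would need to be divided by the length of the formation period in years.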

I am going to use the same universe of stocks and the same time periods as I used before. You can read more about it here.


12 months formation period / 6 months trading period

After testing 263901 possible pairs we are left with 4760 pairs that satisfy all the criteria (compared to 1703 pairs found using equally weighted portfolio). So we have almost three times as many potential pairs.
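
The number of candidate pairs grows quadratically with the size of the universe. As a quick check, 263901 is exactly the number of unordered pairs that can be formed from 727 stocks; the universe size is inferred from that count here and is not stated in the text. The tickers in the loop are placeholders.

```python
from itertools import combinations
from math import comb

n_stocks = 727  # inferred from the pair count, an assumption
n_pairs = comb(n_stocks, 2)
print(n_pairs)  # 263901

# Enumerating pairs for a tiny hypothetical universe:
tickers = ["AAA", "BBB", "CCC", "DDD"]
pairs = list(combinations(tickers, 2))
print(len(pairs), pairs[:3])
```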

Since we now construct pair portfolios that are not equally weighted, it no longer makes sense to use the Euclidean distance between the prices of the two stocks. Instead I will choose pairs with the smallest distance between the price of the portfolio and its mean. Below you can see the selected pairs and the plots of their portfolios.
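
A minimal sketch of this ranking, assuming the distance is the plain Euclidean distance between the spread and its own mean with no further normalization (the notebook may scale the spreads differently). The pair names and spread values are made up for illustration.

```python
import numpy as np

def distance_from_mean(spread):
    """Euclidean distance between a spread and its historical mean."""
    spread = np.asarray(spread, dtype=float)
    return float(np.sqrt(np.sum((spread - spread.mean()) ** 2)))

# Hypothetical spreads for two pairs.
spreads = {
    ("AAA", "BBB"): [1.0, 1.2, 0.8, 1.0],
    ("CCC", "DDD"): [5.0, 7.0, 3.0, 5.0],
}

# Sort pairs by distance from the mean, smallest first.
ranked = sorted(spreads, key=lambda pair: distance_from_mean(spreads[pair]))
for pair in ranked:
    print(pair, round(distance_from_mean(spreads[pair]), 3))
```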

Pairs sorted by distance from the mean (ascending)
Metrics of top 5 pairs during trading period

As we can see, there is no real improvement compared to the equally weighted portfolios we tested before. All of the pairs diverge too far from their historical mean.

Now I’ll try selecting top 5 pairs with the highest number of zero crossings.

Pairs sorted by the number of zero crossings (descending)
Metrics of top 5 pairs during trading period

Again the same problem: all of the pairs diverge.

Let’s try selecting top 5 pairs with the highest Pearson correlation coefficient.

Pairs sorted by Pearson correlation coefficient (descending)
Metrics of top 5 pairs during trading period

Again no real improvement. Maybe we should try using a longer formation period?


36 months formation period / 6 months trading period

Here I get 2639 potential pairs, compared to 236 we had before. More than ten times as many. Let’s see if this will improve the quality of pairs we select for trading.

Pairs with the smallest Euclidean distance from the mean:

Pairs sorted by distance from the mean (ascending)
Metrics of top 5 pairs during trading period

Here we see again that most pairs diverge too far from their historical mean. An interesting thing to notice is that the last 3 pairs all contain the same stock (EQC). We probably want to avoid trading such pairs. Having a portfolio of pairs where several pairs contain the same stock makes it very sensitive to risks coming from that stock. If the price of EQC rises or drops significantly because of some news specific to that company, we will experience losses in all three pairs, because all of them will diverge.
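
One simple way to avoid this concentration is to walk down the ranked list and skip any pair that shares a stock with a pair already selected. This is only a sketch of one possible approach and is not something tested in this article; apart from EQC, the tickers are placeholders.

```python
def select_non_overlapping(ranked_pairs, k):
    """Pick up to k pairs from a ranked list so that no stock repeats."""
    selected, used = [], set()
    for a, b in ranked_pairs:
        if a in used or b in used:
            continue  # one of the stocks is already in a selected pair
        selected.append((a, b))
        used.update((a, b))
        if len(selected) == k:
            break
    return selected

# Hypothetical ranking in which three consecutive pairs contain EQC.
ranked = [
    ("AAA", "BBB"), ("CCC", "DDD"),
    ("EQC", "EEE"), ("EQC", "FFF"), ("EQC", "GGG"),
    ("HHH", "III"), ("JJJ", "KKK"),
]
top5 = select_non_overlapping(ranked, 5)
print(top5)
```

Only the highest-ranked EQC pair is kept, and the freed slots go to the next pairs in the ranking.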

Now we will select pairs with the highest number of zero crossings.

Pairs sorted by the number of zero crossings (descending)
Metrics of top 5 pairs during trading period

The results are similar to what we had in the previous test.

Finally let’s try selecting pairs with the highest Pearson correlation coefficient.

Pairs sorted by Pearson correlation coefficient (descending)
Metrics of top 5 pairs during trading period

No pairs are good here; all of them diverge too far from the historical mean.


Conclusion

The main advantage of the cointegration method that we have seen so far is that we have a lot more potential pairs. The main problem is that the techniques we use to select the ‘best’ pairs for trading don’t work as well as we need. In the next article I will try to test several machine learning techniques to select pairs based on several metrics at once.


Jupyter notebook with source code is available here.

Note: if you want to try to run this code on your laptop, you might want to use a smaller stock universe; otherwise it might take a very long time.

If you have any questions, suggestions or corrections please post them in the comments. Thanks for reading.

Yuan Di prepared a Chinese adaptation of this article, which is available here.

Update 28.10.21: I have found some mistakes in the way I calculated distances. I have updated the article and the corresponding Jupyter notebook.
