
Pairs Trading. Modeling the spread as Gaussian state-space model with exogenous inputs

This article is based on the paper ‘A new approach to modeling and estimation for pairs trading’ (Do et al. 2006). It is similar to another paper I described before, but the authors propose modeling the spread at the returns level instead of the price level, and they also extend the spread model to include exogenous inputs.

Let’s start by describing the proposed model. The spread is defined as the difference in log-returns of two stocks and is assumed to follow the Vasicek model defined below.

Vasicek model equation
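For reference (the equation image may not render everywhere), the Vasicek dynamics are the standard Ornstein-Uhlenbeck process, and its exact discretization gives the AR(1) transition used in the state-space form below. Mapping the discretization to the coefficients $A$ and $B$ used later in this post is my reading of the paper's notation:

$$dx_t = \kappa(\theta - x_t)\,dt + \sigma\,dW_t$$

$$x_{t+1} = \underbrace{\theta\left(1 - e^{-\kappa \Delta t}\right)}_{A} + \underbrace{e^{-\kappa \Delta t}}_{B}\,x_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}\!\left(0,\ \frac{\sigma^2}{2\kappa}\left(1 - e^{-2\kappa \Delta t}\right)\right)$$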

We can rewrite the above model in discrete time (state-space form).

Spread state-space model

Now we define some new variables.

Spread transition equation

In the end we have the following model.

Final state space model

This is a linear Gaussian state-space model (LGSS), and we can use the Expectation-Maximization (EM) algorithm derived in the paper to estimate the unknown parameters. Let’s try to implement it.

The first step is generating some data according to the model. The code for doing this is provided below; we simply implement the formulas of the state-space model above.

One thing that is a little confusing is how to generate the variable r that represents the market return. On page 18 of the paper it is said that the market return follows a geometric Brownian motion, but if we generate it that way, it will look more like cumulative returns, not log-returns. The spread we are trying to model is a difference of log-returns, so adding cumulative returns to it would be comparing apples and oranges. Here I assume that r follows a lognormal distribution.
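The generation step can be sketched as follows. This is a sketch rather than the notebook's exact code; I use A, B for the state equation coefficients, G and H for the noise scales, and D for the market-return loading, matching the notation used later in the post, and mu_r, sigma_r are hypothetical parameters of the lognormal market return.

```python
import numpy as np

def generate_data(A, B, D, G, H, n, mu_r=0.0, sigma_r=0.01, seed=42):
    """Simulate the spread state-space model.

    State:       x[t+1] = A + B*x[t] + G*eps[t],       eps   ~ N(0, 1)
    Observation: y[t]   = x[t] + D*r[t] + H*omega[t],  omega ~ N(0, 1)

    The market return r is drawn from a lognormal distribution,
    as assumed in the text.
    """
    rng = np.random.default_rng(seed)
    r = rng.lognormal(mean=mu_r, sigma=sigma_r, size=n)
    x = np.zeros(n)
    y = np.zeros(n)
    x[0] = A / (1 - B)  # start at the stationary mean of the AR(1) state
    for t in range(n):
        y[t] = x[t] + D * r[t] + H * rng.standard_normal()
        if t < n - 1:
            x[t + 1] = A + B * x[t] + G * rng.standard_normal()
    return x, y, r
```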

The paper contains a simulation study with given parameter values. Let’s generate some data from a model with those parameters. The code and plot are shown below.

Synthetic data

The next step is estimating the model parameters. We use the Shumway and Stoffer approach (described in detail here), which is based on the EM algorithm with the Kalman filter and smoother. You can find more information on state-space models, the Kalman filter and the EM algorithm in my previous posts:

A similar estimation algorithm was used in the paper described here: Pairs trading. Modeling the spread with mean-reverting Gaussian Markov chain model.

At each iteration of the EM algorithm we run the Kalman filter and smoother with the current estimates of the model parameters. Then we update the parameters based on the smoothed data. The authors provide formulas for the EM part of the algorithm, but first we need to figure out how to implement the Kalman filter and smoother. This is not entirely trivial because of the exogenous inputs of the model.

Recall the state-space equations of the standard Kalman filter model:

Standard model for Kalman filter

To be able to use the standard Kalman filter/smoother formulas, we need to represent our spread model in that form. Note that there is no exogenous input in the observation equation, so we must include the market return in our state variable. I will create a new two-dimensional state vector z. Then we have:

State equation
Measurement equation

You can check that if we perform the matrix multiplications in the equations above, we recover the state and observation equations of our model. We also need to design the process noise matrix Q and the measurement noise matrix R. Our measurement is one-dimensional and, according to our model, the variance of its noise is equal to H². Our state is two-dimensional, so the process noise is a two-dimensional Gaussian random variable with a 2×2 covariance matrix Q. We know that the spread x_t should have noise variance G². Let’s also assume that the spread and the market return are uncorrelated and that the market return has some small noise variance (nothing is said about it in the paper, so we assume that the observed market return is very accurate). Overall we have:

Process noise and measurement noise

Filtering is done with the following formulas:

Kalman filter equation

After filtering we perform smoothing. The difference between the two is that filtering is done online, adding one data point at a time, while smoothing is done once we have all the data. In smoothing we go backwards, already knowing the whole history of the process, which allows a more accurate estimate of the true state. The smoother equations are provided below.

Kalman smoother equations
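A minimal filter/smoother for the augmented two-dimensional state can be sketched like this. It is my own implementation of the standard Kalman filter and RTS smoother equations; the specific way the observed market return is injected into the state (through the control input, with a tiny process noise q_r) is my assumption about the setup, and the function name is hypothetical.

```python
import numpy as np

def kalman_filter_smoother(y, r, A, B, D, G_sq, H_sq, q_r=1e-8):
    """Kalman filter + RTS smoother for the augmented state z = [x, r].

    Transition:  z[t+1] = F z[t] + u[t] + w,  w ~ N(0, Q)
    Observation: y[t]   = Hm z[t] + v,        v ~ N(0, R)
    with F = [[B, 0], [0, 0]], u[t] = [A, r[t]], Hm = [1, D].
    The market return gets a tiny process noise q_r, since we treat the
    observed value as (almost) exact.
    """
    n = len(y)
    F = np.array([[B, 0.0], [0.0, 0.0]])
    Hm = np.array([[1.0, D]])
    Q = np.diag([G_sq, q_r])
    R = np.array([[H_sq]])

    # initialize at the stationary mean/variance of the spread
    z = np.array([A / (1 - B), r[0]])
    P = np.diag([G_sq / (1 - B**2), q_r])
    zf = np.zeros((n, 2)); Pf = np.zeros((n, 2, 2))  # filtered
    zp = np.zeros((n, 2)); Pp = np.zeros((n, 2, 2))  # predicted

    for t in range(n):
        if t > 0:  # predict step (control input carries A and r[t])
            z = F @ z + np.array([A, r[t]])
            P = F @ P @ F.T + Q
        zp[t], Pp[t] = z, P
        # update step
        S = Hm @ P @ Hm.T + R                 # innovation variance
        K = P @ Hm.T / S                      # Kalman gain
        z = z + (K * (y[t] - Hm @ z)).ravel()
        P = (np.eye(2) - K @ Hm) @ P
        zf[t], Pf[t] = z, P

    # RTS smoother: backward pass over the filtered estimates
    zs = zf.copy(); Ps = Pf.copy()
    for t in range(n - 2, -1, -1):
        J = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        zs[t] = zf[t] + J @ (zs[t + 1] - zp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return zs, Ps
```

The first component of `zs` is the smoothed estimate of the true spread x_t.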

The next step is estimating the parameters of the model. Luckily, all the formulas are provided in the appendix of the paper; I simply reproduce them here.

EM algorithm formulas

We will also need the following notation.

EM algorithm notation

And some auxiliary formulas.

EM algorithm auxiliary formulas

Now we have everything we need to write a function that estimates parameters of the model. Code of the function is provided below.

Everything in the code above should be clear: we just implement the provided formulas. One detail that might be a little confusing is that we have parameters G and H of the model as well as matrices G and H of the Kalman filter (the control and observation matrices). I use variables named G and H for the control and observation matrices of the Kalman filter, and G_sq and H_sq for the squared parameters of the model (G² and H²); we use the squared versions of those parameters in all the formulas anyway. To check for convergence we compute the sum of the absolute values of the changes in the parameters.
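The overall shape of the estimation routine is the usual EM skeleton with the convergence rule described above. This is structure only: the helper names `e_step` and `m_step` are hypothetical (the notebook's function is monolithic), and the actual M-step updates come from the paper's appendix.

```python
def em_estimate(y, r, theta0, e_step, m_step, tol=1e-6, max_iter=200):
    """Generic EM loop.

    e_step(y, r, theta) -> smoothed quantities (Kalman filter + smoother)
    m_step(smoothed)    -> updated parameter dict (paper's appendix formulas)
    Stops when the sum of absolute parameter changes falls below tol.
    """
    theta = dict(theta0)
    for it in range(max_iter):
        smoothed = e_step(y, r, theta)
        new_theta = m_step(smoothed)
        # convergence check: sum of absolute changes across all parameters
        delta = sum(abs(new_theta[k] - theta[k]) for k in theta)
        theta = new_theta
        if delta < tol:
            break
    return theta, it + 1
```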

Now we can try to replicate a simulation study from the paper to check that the algorithm works as expected. The model parameters and the initial guesses for the EM algorithm are given in the paper. The code for running the simulation is shown below.

Now we can plot the results.

Convergence of EM algorithm

As you can see in the plots above, the EM algorithm quickly converges to values very close to the true parameters used to generate the data.

Now we are ready to backtest.


I am going to use prices of 100 small-cap stocks from 2010 to 2013. The period from 2010-07-01 to 2012-06-30 is used as a training period to select pairs for trading, and the period from 2012-07-01 to 2012-12-31 is used for trading.

Code for loading and preparing data is shown below.

Note that we need to decide what data to use as the market return. Since the 100 stocks used are constituents of the VBR ETF, I’m going to use its returns to represent the return of the market. Strictly speaking we should use the excess market return (market return minus the risk-free rate), but for simplicity I will just use log-returns of the VBR ETF.
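Computing the market-return proxy can be sketched as follows (assuming the prices live in a pandas object; the column name 'VBR' is an assumption about how the notebook stores the ETF data):

```python
import numpy as np
import pandas as pd

def log_returns(prices: pd.Series) -> pd.Series:
    """Log-returns of a price series (the first NaN is dropped)."""
    return np.log(prices / prices.shift(1)).dropna()

# market return proxy: log-returns of the VBR ETF
# r_m = log_returns(prices['VBR'])
```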

The pair selection process will be as follows:

  • Out of all possible pairs select only pairs which are cointegrated according to the Johansen test (the null hypothesis is rejected at 95% level).
  • Try to fit our model on all of the pairs from the previous step and select only those with positive parameter A.
  • Calculate the parameters of the Vasicek model (theta, kappa, sigma) for each of the remaining pairs.
  • Select the 5 pairs with the largest kappa parameter (which indicates the speed of mean reversion).

Let’s see how to perform it step by step.

After performing this step we have a list of all pairs that are cointegrated according to the Johansen test (at 95% level).

Now we compute the spread of each pair and try to fit our model on it.

Then I select only the pairs with positive parameter A and compute the parameters of the Vasicek model.
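Mapping the estimated discrete-time parameters back to the continuous-time Vasicek parameters can be sketched as follows (the discretization used here is the standard exact AR(1) discretization of the Vasicek model; the time step dt is an assumption, e.g. 1/252 for daily data):

```python
import numpy as np

def vasicek_params(A, B, G, dt=1.0):
    """Recover (theta, kappa, sigma) from the AR(1) parameters.

    Discretization assumed: x[t+1] = A + B*x[t] + G*eps, with
      B   = exp(-kappa*dt),
      A   = theta * (1 - B),
      G^2 = sigma^2 * (1 - B^2) / (2*kappa).
    Requires 0 < B < 1, i.e. a mean-reverting spread.
    """
    kappa = -np.log(B) / dt
    theta = A / (1 - B)
    sigma = G * np.sqrt(2 * kappa / (1 - B**2))
    return theta, kappa, sigma
```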

If we now print the resulting dataframe sorted by kappa, we get the following results:

Dataframe with parameters of Vasicek model

From the dataframe above we select the five pairs with the highest kappa, making sure that all pairs consist of different stocks.

Selected pairs

Selected pairs are shown on the screenshot above. Now we are ready to run a backtest.

The strategy is the following:

  • On each trading day, estimate the parameters of the spread (according to our model) for each pair using a trailing window of length W=500.
  • Using the estimated parameters, construct the true spread time series by subtracting the market return (multiplied by D) from the observed spread.
  • Calculate the cumulative sum of the true (residual) spread using a trailing window of length P=5.
  • Open a short position if the current value of the cumulative sum is greater than theta.
  • Open a long position if the current value of the cumulative sum is smaller than theta.
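The signal logic in the steps above can be sketched for a single pair as follows. This is a simplified sketch: D and theta are taken as given, whereas the full backtest re-estimates them every day over the trailing window of W=500 observations; the function name is hypothetical.

```python
import pandas as pd

def main_strategy_positions(spread, market_ret, D, theta, P=5):
    """Daily positions for one pair: -1 short, +1 long, 0 flat."""
    residual = spread - D * market_ret  # de-marketed ("true") spread
    s = residual.rolling(P).sum()       # rolling cumulative sum
    pos = pd.Series(0.0, index=spread.index)
    pos[s > theta] = -1.0               # cumsum above theta -> short
    pos[s < theta] = 1.0                # cumsum below theta -> long
    return pos
```

The baseline strategy described next is the same function applied to the observed spread with D=0 and theta=0.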

Along with the strategy described above, I’ll also backtest a simpler (baseline) strategy as a yardstick. It is based on a statistical arbitrage strategy I implemented before; you can read more about it here: Cointegration-based multivariate statistical arbitrage.

The strategy is basically the same; we just assume that the observed spread is the true spread and that theta=0. So the algorithm is the following:

  • Calculate cumulative sum of the observed spread using rolling window of size P=5.
  • Open short position if the current value of cumulative sum is bigger than zero.
  • Open long position if the current value of the cumulative sum is smaller than zero.

The code for the backtest is provided below. I create a separate positions dataframe for each strategy.

That’s it. We can calculate returns and different performance metrics. Returns are calculated as follows.

On lines 2 and 3 we divide the result by 5 because our capital is allocated equally among 5 pairs, and we multiply it by 2 because we can use twice the amount of capital (using the proceeds from short positions to open long positions).
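The scaling just described can be sketched as below. I assume a positions DataFrame with one column per traded leg and a returns DataFrame aligned to it; lagging the positions by one day to avoid look-ahead is my assumption about the notebook's convention.

```python
def portfolio_returns(positions, returns, n_pairs=5):
    """Daily strategy returns.

    Positions are lagged by one day, capital is split equally across
    n_pairs pairs (divide by n_pairs), and short-sale proceeds fund the
    long legs (multiply by 2).
    """
    daily = (positions.shift(1) * returns).sum(axis=1)
    return daily / n_pairs * 2
```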

Plot of cumulative returns and performance metrics are shown below.

Plot of cumulative returns
Performance metrics

As we can see above, the performance of the main algorithm is slightly better than that of the baseline algorithm. At the beginning of the backtest the baseline algorithm performed even better than the main one, but then its performance deteriorated.

You might have noticed that at the pair selection stage we actually used the proposed model: we selected pairs for trading based on the value of the parameter kappa. If we had just implemented the baseline strategy, the selected pairs and the overall performance would probably be different. Let’s test it.

Recall that our stock universe contained 153 pairs that are cointegrated according to the Johansen test, and we need some way to select only five of them. I will calculate Kendall’s rank correlation coefficient (tau) for each pair and select the cointegrated pairs with the highest tau. The code for this is provided below.
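The tau-based ranking can be sketched as follows (assuming `candidates` is the list of cointegrated pairs produced by the Johansen step; the function name is hypothetical):

```python
from scipy.stats import kendalltau

def top_tau_pairs(prices, candidates, n=5):
    """Rank candidate (cointegrated) pairs by Kendall's tau of their
    price series and keep the n highest."""
    scored = []
    for s1, s2 in candidates:
        tau, _ = kendalltau(prices[s1], prices[s2])
        scored.append((tau, s1, s2))
    scored.sort(reverse=True)  # highest tau first
    return [(s1, s2) for _, s1, s2 in scored[:n]]
```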

We end up with a completely different set of pairs.

Selected pairs

Another difference between the baseline strategy and the original statistical arbitrage strategy is how the spread is constructed. Instead of assigning equal weights to each stock in a pair (so that the spread is just the difference in log-returns), we should have used weights derived from the eigenvector of the Johansen test. Here I fix this issue too. The code for backtesting is shown below.

Now let’s look at the metrics and cumulative returns of all three tested strategies.

Performance metrics
Cumulative returns

The baseline strategy performs poorly when trading pairs are selected with the basic method used here. So even though the proposed model might not seem to provide much value in actual trading (its performance is comparable to that of the baseline strategy), it can be helpful at the pair selection stage.

There are some possible modifications that might improve the performance:

  • Use more risk factors as exogenous inputs (e.g. from the Fama-French 3-factor model).
  • Try different values of parameters W and P.
  • Modify trading strategy by adding Granger causality test (read more here: Granger causality test in pairs trading).
  • Try some other pair selection techniques.

Jupyter notebook with source code is available here.

If you have any questions, suggestions or corrections please post them in the comments. Thanks for reading.


References

[1] A new approach to modeling and estimation for pairs trading (Do et al. 2006)

[2] Trading in the Presence of Cointegration (Galenko et al. 2009)

[3] Pairs Trading (Elliot et al. 2005)

[4] Pairs trading. Modeling the spread with mean-reverting Gaussian Markov chain model

[5] Cointegration-based multivariate statistical arbitrage
