Forecasting With Level VARs Despite Non-Stationarity And/Or Cointegration. Intuition.


I sometimes run into people who fret about how you can’t use VARs for non-stationary or cointegrated data. For sure, there are problems with frequentist inference and IRFs under these conditions. See Sims, Stock, and Watson (1990) and Phillips and Durlauf (1986).

But for forecasting, a VAR in levels is competitive with a VAR in differences (non-stationary but not cointegrated) or a VECM (cointegrated). There are a few papers with no strong consensus. Christoffersen and Diebold (1998) find that “nothing is lost by ignoring cointegration” with respect to out-of-sample MSE.

Intuition: Non-Stationary, Not Cointegrated

When the data are non-stationary but not cointegrated, an AR(1) in differences can trivially be re-written as an AR(2) in levels.

y_t - y_{t-1} = \alpha_0 + \beta(y_{t-1}-y_{t-2}) + \varepsilon_t

y_t = \alpha_1 + (1+\beta) y_{t-1} - \beta y_{t-2} + \varepsilon_t

The intercepts may differ between the two as a function of having different control variables, but the relationship in the slope coefficients actually bears out if you try it with real data (within reasonable OLS sampling error).
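This equivalence is easy to check by simulation. Below is a minimal sketch in Python with made-up parameter values: simulate a series whose differences follow an AR(1), fit both specifications by OLS, and verify that the levels coefficients line up with (1 + β) and -β.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
alpha, beta = 0.5, 0.3  # hypothetical true parameters

# Simulate: (y_t - y_{t-1}) = alpha + beta*(y_{t-1} - y_{t-2}) + eps_t
eps = rng.normal(size=T)
dy = np.zeros(T)
for t in range(1, T):
    dy[t] = alpha + beta * dy[t - 1] + eps[t]
y = np.cumsum(dy)  # y is I(1): its differences follow an AR(1)

def ols(X, z):
    return np.linalg.lstsq(X, z, rcond=None)[0]

# AR(1) in differences
b_diff = ols(np.column_stack([np.ones(T - 2), dy[1:-1]]), dy[2:])
beta_hat = b_diff[1]

# Unrestricted AR(2) in levels
b_lvl = ols(np.column_stack([np.ones(T - 2), y[1:-1], y[:-2]]), y[2:])
phi1_hat, phi2_hat = b_lvl[1], b_lvl[2]

# The mapping phi1 = 1 + beta, phi2 = -beta holds up to sampling error,
# and phi1 + phi2 is pinned near 1 by super-consistency.
```

The levels regression is the unrestricted version (it does not impose φ1 + φ2 = 1), which is exactly why it estimates one more parameter.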

The levels AR(2) has more parameters to estimate, so we might expect it to perform worse from a variance point of view. However, a non-stationary variable is cointegrated with its own lags, which makes OLS super-consistent; i.e., the estimates converge to the true coefficients at rate T rather than the usual √T. So, these effects come out in the wash.

Intuition: Non-Stationary, Cointegrated

Yes, cointegration implies non-stationarity, but humor my desire for symmetrical headers. Here, we still have that cointegration implies super-consistency for the VAR in levels. So whatever information is lost by ignoring the error-correction term is likely to come out in the wash. Similarly, the lagged dependent variable in the VAR in levels will ensure that the residuals are stationary, which guards against spurious relationships with lagged independent variables; i.e., spuriousness comes from unaccounted-for non-stationarity, but a lagged dependent variable is an accountant.

I will be following up with some simulation results.

Macro Random Forest Leads To Macro Gains In Macro Forecasting

Tomorrow is the 11th ECB Forecasting Conference.

I am excited to see so many top authors: Sims, Engle, Koop, Marcellino, Schorfheide, and many more.

It is fitting that so many progenitors of the innovative models of yesteryear and workhorse methods of today — VAR, ARCH, and Bayesian macroeconometrics — are here to oversee the next generation, who are forging tomorrow’s method in the fires of machine learning.

One such youngster is Philippe Goulet Coulombe, author of Taste #3: “The Macroeconomy As A Random Forest.”

In short, his method uses the usual splitting structure of trees over bootstrapped samples. However, a regularized linear equation (e.g., a ridge AR(1)) appears in each terminal node instead of the vanilla conditional sample mean. By allowing the regressors — or some super/sub-set of the regressors — to dictate the splits of the trees, we can capture a myriad of non-linearities. For example, a split on a trend variable can capture sharp structural change, and a split on the lagged dependent variable can capture regime-switching. Averaging over many such trees incorporates all of these dynamics, approximating a true underlying non-linear structure. In short, it is a clever way to use a common machine learning algorithm to capture common time series dynamics, all in one package.
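To make the linear-leaf idea concrete, here is a toy single-tree sketch in Python (not Goulet Coulombe's actual implementation, which grows a full forest with its own splitting rules): a shallow decision tree chooses splits on the lagged dependent variable, and a ridge AR(1) is fit inside each terminal node, which recovers regime-switching dynamics that one global AR(1) cannot.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Toy regime-switching AR(1): the slope depends on the sign of lagged y
T = 2000
y = np.zeros(T)
for t in range(1, T):
    phi = 0.8 if y[t - 1] > 0 else -0.3
    y[t] = phi * y[t - 1] + rng.normal(scale=0.5)

X = y[:-1].reshape(-1, 1)  # lagged y: used both to split and to regress
z = y[1:]

# Step 1: a shallow tree chooses the splits
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=100).fit(X, z)
leaf_id = tree.apply(X)

# Step 2: a ridge AR(1) is fit inside each terminal node
leaf_models = {leaf: Ridge(alpha=1.0).fit(X[leaf_id == leaf], z[leaf_id == leaf])
               for leaf in np.unique(leaf_id)}

def predict(x_lag):
    x = np.array([[x_lag]])
    return leaf_models[tree.apply(x)[0]].predict(x)[0]

# Leaves on opposite sides of zero recover different AR slopes
```

Averaging such trees over bootstrapped samples, as a random forest does, is what smooths the approximation toward the underlying non-linear structure.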

I use this method at work, often finding that it works pretty well.

Given the recent chatter about a labor shortage and upward pressure on wages, I have been interested in forecasting wage inflation via the Employment Cost Index (Wages & Salaries) year-over-year.

I conduct a pseudo-out-of-sample experiment over the last 20 quarters. For each quarter, I fit a factor-augmented VAR (OLS) and factor-augmented MRF, then forecast ECI one-quarter ahead.

OLS Mean Absolute Error = 0.12

MRF Mean Absolute Error = 0.07

Diebold-Mariano test, p = 0.003
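For reference, a minimal sketch of the Diebold-Mariano test under absolute-error loss for one-step-ahead forecasts (where the loss differential can be treated as serially uncorrelated, so no HAC correction is needed). The error vectors here are made-up illustrations, not my actual ECI forecast errors.

```python
import numpy as np
from scipy.stats import norm

def diebold_mariano(e1, e2):
    """Two-sided DM test of equal MAE for one-step-ahead forecast errors."""
    d = np.abs(e1) - np.abs(e2)           # loss differential
    dbar = d.mean()
    se = np.sqrt(d.var(ddof=1) / len(d))  # std. error of the mean differential
    stat = dbar / se
    p = 2 * (1 - norm.cdf(abs(stat)))
    return stat, p

# Made-up error vectors: "model 2" has uniformly smaller absolute errors
e1 = np.array([0.20, -0.30, 0.25, -0.15, 0.30])
e2 = np.array([0.05, -0.05, 0.10, -0.02, 0.03])
stat, p = diebold_mariano(e1, e2)  # positive stat favors model 2
```

For multi-step horizons the variance in the denominator needs a HAC estimate, since overlapping forecast errors are serially correlated.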

Data Scientists: Jack of All Trades, Master of One?

The Data Science Venn Diagram

There is a popular Venn diagram that purports that data science exists at the intersection of applied statistics, programming, and domain knowledge. Companies would love nothing more than to replace their statistician, software developer, and consultant with one person. Unfortunately, life experience says that very few people can be experts in all three distinct and difficult areas. So, one might say that a data scientist is a “jack of all trades, master of none”.

But in my experience, I find that many data scientists are people who have deep expertise in one area — they are statistics grad students or computer science grad students — who try to reach beyond their discipline’s traditional boundaries to secure the coveted “data scientist” title.

In other words, I would argue that the data scientist is a “jack of all trades, master of one”.

Take Mike, an econometrics PhD. He has a superior understanding of applied statistics, a pretty good grasp of economic theory (companies will appreciate that he took graduate micro but maybe not care about his electives), and a decent capacity to code (in so far as he needs to program his models and do his empirical work). A lot of data scientists tend to look like Mike: they have a hierarchy of skills.

Nowadays, there are “data science” programs designed so people like Mike do not have to waste their energy pivoting away from a pure econometrics skill set.

Consider Penn’s Data Science MSE. In this program, they douse you in a bit of statistics and a bit of programming. Then they send you off to do electives. I have a feeling that people will choose electives that are somewhat related to one another because it’s natural for people to want to build expertise. But let’s say you take a varied set of electives, as the Venn diagram suggests you should — is that better? I am personally doubtful. Because when the shit hits the fan, a master of none does not have the ability to diagnose problems on a deep level. The master of one will be at least as good on non-expert issues and far superior on expert issues.

Surprise, surprise — expertise still means something.

Penn’s Data Science MSE Curriculum

The Limited Usefulness of Rubin Causality For Decision Makers

I took two courses that explicitly touched on causal inference in college. Both began with the idea that the Rubin average causal effect of treatment D on outcome Y is given by:

E[Y|X,D=1] - E[Y|X,D=0]

The idea here is that many things may determine Y. D determines Y, but so do other things (X). If we can estimate the average Y given D=1 versus Y given D=0 while keeping X constant, then we are in business. Luckily, well-specified, squared-loss regressions are good statistical estimators of the conditional expectation function. So if we have a good model that is fully exogenous, all we would have to do is obtain \hat \beta from the following regression:

Y = \alpha + \beta D + \gamma X + \varepsilon
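A quick simulation illustrates why holding X constant matters. In this Python sketch (with made-up coefficients), the treatment D is correlated with X, so the naive difference in means is confounded, while the regression that controls for X recovers β.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
alpha, beta, gamma = 1.0, 2.0, 0.5  # made-up true effects

X = rng.normal(size=n)
# Treatment probability rises with X, so D and X are correlated
D = (rng.random(n) < 1 / (1 + np.exp(-X))).astype(float)
Y = alpha + beta * D + gamma * X + rng.normal(size=n)

# Naive difference in means is confounded by X
naive = Y[D == 1].mean() - Y[D == 0].mean()

# Regression that holds X constant recovers beta
Z = np.column_stack([np.ones(n), D, X])
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
beta_hat = coef[1]  # close to 2.0; the naive estimate is biased upward
```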

In economics, these models often have some endogenous D, so we use instrumental variables, sample selection corrections, matching, etc.

In any case, the Rubin approach can be very useful for solving specific problems. For example, if one can show that race or gender significantly predicts wages while holding other characteristics equal, we have good evidence of discrimination. This is a very direct and practical use for a human resources department trying to keep things fair.

But one can also argue that treating D as orthogonal to X is a sterile approach with a limited normative interpretation. If nearly everyone with D=1 has a different X than those with D=0, then the incremental effect of D=1 may not be of interest to people actually trying to use the research to improve outcomes.

Marriages preceded by a period of cohabitation tend to result in divorce at higher rates than marriages not preceded by cohabitation. This is unintuitive to some because it is reasonable to think that only couples with good cohabitation periods decide to marry; cohabitation serves as a screener for good couples. It turns out that people with less committal personalities generally opt for cohabitation and then ease into marriage later via social convention, while those who are committal tend to go straight for marriage. When researchers statistically isolate personality, they find cohabitation leads to better marriages.

So, if you are one of those couples who are heavily committal, how important is the above for you? You might think that you should cohabitate because, all else equal, it should improve your eventual marriage. But the marginal effect for you, as a committal couple, may be so small that delaying marriage is actually the sub-optimal thing to do. So what is important to you is not the effect of cohabitation or D=1, but D=1 given your personality X.

Excessive attention to the effect of D obscures the usefulness of research to people with known X. Many researchers do and should care about the heterogeneity of the effect of D. Unfortunately, acknowledging heterogeneity does not always make for the punchy (amusingly counter-intuitive ex-ante but almost obvious post-hoc) causality papers that are idolized by “Mostly Harmless Econometrics” fans.

ARIMA For Options Investing

In many intro time series classes, you come across the autoregressive integrated moving average (ARIMA) forecasting technique. But you only took economics or finance so you can infiltrate the capitalist beast and make out like a tapeworm. How does ARIMA make my bank account go arriba?

A lot of people fail when trying to use ARIMA to day-trade or swing-trade, so they instead opt for training a neural network with 1 million billion trillion parameters, increasing their electricity bill beyond whatever returns they may hope to make. But if you have patience, if you can suffer through the existential pain that is another year or two on this blighted rock, you may prefer this easier ARIMA options strategy.

  1. Pick a stock deemed fairly safe.
  2. Download monthly price data. There is a trade-off in periodicity. Daily/weekly data is noisy, and quarterly/annual data leads to issues with estimation and structural change. I think monthly is a good balance.
  3. Choose an interval that has fairly linear price action – or price action that can be made linear with log or Box-Cox transformations.
  4. Forecast 1-3 years out.
  5. Find an option that capitalizes on those forecasts while fitting your risk-reward tolerance.

Example: MSFT

> r <- auto.arima(y, seasonal = FALSE)
> r
Series: y

Coefficients:
         ar1      ma1     ma2
      0.5817  -1.8075  0.8448
s.e.  0.1891   0.1126  0.1038

sigma^2 estimated as 49.56:  log likelihood = -209.05
AIC=426.11   AICc=426.81   BIC=434.61
> forecast(r)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
May 2021 264.3469 255.3247 273.3692 250.5486 278.1453
Jun 2021 268.8622 257.4521 280.2723 251.4120 286.3125
Jul 2021 274.3771 261.4216 287.3326 254.5634 294.1908
Aug 2021 280.4733 266.2047 294.7420 258.6513 302.2954
Sep 2021 286.9078 271.3407 302.4749 263.0999 310.7156
Oct 2021 293.5389 276.5900 310.4879 267.6178 319.4601
Nov 2021 300.2845 281.8231 318.7459 272.0503 328.5187
Dec 2021 307.0966 286.9715 327.2217 276.3180 337.8753
Jan 2022 313.9475 292.0015 335.8935 280.3840 347.5110
Feb 2022 320.8209 296.8991 344.7426 284.2357 357.4060
Mar 2022 327.7073 301.6615 353.7532 287.8736 367.5410
Apr 2022 334.6014 306.2913 362.9115 291.3049 377.8979
May 2022 341.4999 310.7940 372.2058 294.5393 388.4605
Jun 2022 348.4010 315.1759 381.6261 297.5876 399.2145
Jul 2022 355.3036 319.4434 391.1639 300.4601 410.1472
Aug 2022 362.2071 323.6025 400.8117 303.1664 421.2478
Sep 2022 369.1111 327.6588 410.5634 305.7153 432.5069
Oct 2022 376.0154 331.6173 420.4134 308.1144 443.9163
Nov 2022 382.9198 335.4825 430.3571 310.3707 455.4689
Dec 2022 389.8243 339.2584 440.3903 312.4903 467.1584
Jan 2023 396.7289 342.9483 450.5096 314.4786 478.9793

Let’s be conservative and check out options for Jan 2023, assuming it hits the lower 80% interval, $343 (no, I am not assuming this is the lower 10th percentile in some posterior distribution, you Bayesian bastards).

I am liking the 330/335 Bull Call Spread option below. Assuming MSFT is at least $330 in Jan 2023 (which is probably conservative based on the model), 510% is a good return.

In the name of risk aversion, I am also happy with the 285/290 call spread. MSFT must be at least $286 in Jan 2023 (only about 10% up from today), and we get a 252% return. Beats the hell out of any savings account I know of.

Be sure to diversify.

Academia vs Industry: In Pursuit of Truth and Story

In industry, reduced-form economic models need to be simple and understandable. Assume your audience is passively familiar with regression from that one stats course in college, around the time they realized beer bongs do not go well with their Lexapro. Take your target Y and regress it on predictors X via OLS. X is a few select predictors – sometimes leading/lagged, mostly not – and you tell a story. Therefore, to forecast the future, you need to introduce exogenous estimates of future X. They can come from consensus, guts, another model (not simultaneously estimated), wherever. Anywhere but inside the model itself — unless you for some reason have an arbitrary term lagged by 8 months. By the way, cointegrated variables must all be regressed in differences without an error-correction term.

When this works, it really works well. We have a killer treasury bond model that is pretty accurate and quantifies relationships that most investors will understand in a way that confirms their priors about supply and demand or something like that. Use your regular OLS standard errors, even when blatantly inappropriate. A high R^2 makes you look like a genius; what is over-fitting?

In academia, you have the opposite problem. You are trying to approximate an unknown and unknowable data-generating process. You are interested in the relationship between Y and X, which are both random variables. You use squared-error-loss regression techniques because you want an estimate of E(Y|X) — either to make predictions or to make statements about causality within that system. You know you will never quite get the true data-generating process, so you always take predictions and causal inferences with some grain of salt – “all models are wrong, but some are useful.” And despite this Sisyphean effort – where I acknowledge I do not know and cannot know the true data-generating process – I can say with 100% certainty that your model is problematic. Your model suffers from unchecked heteroskedasticity, autoregressive conditional heteroskedasticity, spatial heteroskedasticity, serial correlation, cointegration, sample selection bias, simultaneity bias, omitted variable bias, measurement error, naive priors, and an inefficient estimator. But it’s okay — because reasonable people can disagree about something unknown and unknowable, the same way people reasonably disagree about God.

When this works, it also really works well. There are asymptotic properties that can steer you in the right direction. Those should be taken advantage of. I do not know how to truly define the concept of physical health, but I know that Michael Phelps is healthier than Artie Lange. And if enough people can come to this conclusion with their subjective feelings about health, it probably is true.

Like an enlightened centrist, I want to combine these worlds. One goal I have is to loosen the stigma around lagged variables, especially dependent variables.

I love the Wold Representation Theorem because it mathematically formalizes something very intuitive: the past can predict the future. Any stationary time series (and if not stationary, it can often be made stationary through differencing) can be written as the sum of a deterministic term and a stochastic term, which is a linear function of its past errors.

y_t = \eta_t + \sum_{k=0}^{\infty} b_k \varepsilon_{t-k}, \ \ b_0 = 1

Often, more recent observations are more useful for prediction than those in the distant past. Hence, the beauty of the AR process.

y_t = \phi_0 + \phi_1 y_{t-1} + \varepsilon_t = \frac{\phi_0}{1 - \phi_1} + \sum_{k=0}^{\infty} \phi_1^k \varepsilon_{t-k}, \ \ -1 < \phi_1 < 1

Generalize this idea to multivariate systems and you are on your way to a Nobel Prize (I love Sims, don’t hurt me).

Industry: But doesn’t this make things hard to interpret?

Somewhat. But I think this is how impulse response functions become useful. A one-unit change to y_1 j periods ago affects y_2 today by \Psi. There may be a lot to keep track of, but the interpretation is nice. Be careful about ordering if you use Cholesky decomposition.

Industry: But isn’t a lagged dependent variable just a way of saying “I could not find a real relationship with other variables”?

It’s easy to say that if you are just using an ARIMA. But in a VAR, if y_1 is a function of lagged y_2, and y_2 is a function of its own lags, then y_1 is a function of older y_2 too.
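The recursion is easy to see in a bivariate VAR(1). Substituting the lag equation into itself once shows that y_1 picks up a loading on y_2 from two periods back (the coefficient matrix here is a made-up example):

```python
import numpy as np

# Bivariate VAR(1): [y1_t, y2_t]' = A @ [y1_{t-1}, y2_{t-1}]' + errors.
# y1 loads on lagged y2; y2 loads only on its own lag (made-up numbers).
A = np.array([[0.5, 0.3],
              [0.0, 0.8]])

# Substitute the equation for the lag into itself once:
# y_t = A @ A @ y_{t-2} + (error terms)
A2 = A @ A

# A2[0, 1] = 0.5*0.3 + 0.3*0.8 = 0.39: y1 depends on y2 two periods back
```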

Academics: Did you account for cointegration? Run a VECM.

This is where I start to get sympathetic to industry. I know a lot of brilliant economists who took a while to wrap their heads around cointegration. A VAR in levels is still statistically consistent (one would rather run the VAR in levels than a VAR in differences without the error-correction term). The short-run vs. long-run interpretation is not in the vanilla VAR, but I am not sure the marginal benefits outweigh the costs. It should be easy to convince clients that the past matters for predicting the future. It is far less easy to convince them that, because your non-stationary variables can be written as a stationary linear combination, you need this special term that may or may not substantially improve predictions. You probably can convince them that relationships tend to revert to a long-run equilibrium, but they probably do not care about “by how much.”

What is the point here? Industry people are smarter than you think. The soft bigotry of low GPAs must end. You don’t have to assume they are experts – that is why they come to you. But if you can provide a better service in a way that people can understand, do it.

VARs in R

I am somewhat irked by the lack of a comprehensive R package for multivariate time series. Rob Hyndman’s forecast/fable packages are an excellent, if not exhaustive (how could they really be?), resource for univariate time series.

Here, I am collecting a list of packages that handle multivariate time series models, particularly vector autoregressions (VARs).

vars: Standard frequentist VARs and Structural VARs (SVARs). Normal VARs are estimated equation-by-equation by least squares, so a “varest” object is basically a collection of “lm” objects that pull from a common data matrix. The “VARselect” function is very useful for lag order selection via information criteria. Includes useful statistical tests for residual serial correlation, normality, and autoregressive conditional heteroskedasticity.

urca: Unit Root and Cointegration Analysis. Useful implementations of the Johansen cointegration test and estimation of vector error correction models (VECMs). The “vec2var” function converts a VECM to its equivalent VAR representation.

BVAR: Straight-forward estimation and forecasting of Bayesian VARs with customizable, Minnesota-type priors. Hierarchical estimation in the fashion of Giannone, Lenza & Primiceri (2015).

bvarsv: Computes Bayesian VARs with stochastic volatility and time-varying parameters.

tvReg: The function “tvVAR” implements a time-varying-parameters VAR using kernel smoothing.

HDEconometrics: Allows for easy estimation via LASSO through “ic.glmnet” and “HDVar”.

More to come.