Regression is a tool for making comparisons?
Knowing what control variables do is extremely important
Well, here it is, my first Substack post. Let me tell you how I decided that today is the day. I usually scroll through some #econtwitter before bed (as you should), mostly because Twitter (X) sends me notifications of econ-related posts. Yesterday, I was notified of a tweet by Peter Hull, a professor at Brown. It was a repost of his earlier tweet, saying,
“Regression is a tool for making comparisons. If you don’t know / can’t easily explain what comparisons you’re trying to make, then you don’t understand the regression you’re running”.
Well, I felt attacked (jk). However, if Peter says so, I believe it (I don’t know him personally, but I took his “Design-Based Inference” class, and he is amazing). So, I decided to spend my night with an OLS regression model.
Then I thought, why not start blogging on this topic? It is a great topic for my intended audience. So here it is.
Okay, let’s get to the point and break down the tweet. I will break it into three parts in the first section, then follow up with a section on controls and fixed effects, and a bonus section for those interested in clustering errors.
1. “Regression is a tool for making comparisons”.
Let’s dissect what he means by regression as a comparison tool. Regression is a fundamental tool in quantitative research. When we hear about regression, most of us think of it as a tool to estimate the relationship between two variables. That’s not wrong, but there is more to regression than that, especially from a credibility point of view.
Peter’s point is that regression estimates relationships by comparing specific groups of observations. Let’s take a simple hypothetical experiment as an example.
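The regression here looks something like Yᵢ = α + βDᵢ + εᵢ (where α is just an intercept and εᵢ an error term; the Greek letters are only notation).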
Here, Y is the outcome variable, and D is a dummy variable that takes 1 when an individual is treated and 0 if not. What does a regression do in this situation? It explicitly compares treated individuals (D = 1) to untreated individuals (D = 0): the coefficient on D gives the average difference in outcomes between the treated and untreated groups. Since this is an experiment, the treatment assignment is randomized, so we can say that the difference in outcomes is due to the treatment (if you are unsure why randomization buys us this, don’t worry, we will look into it in a future post).

You might ask: but what if the independent variable is not binary? How does regression compare when a variable can take several different values like 0, 1, 2, and so on? To understand the logic of comparison, let’s look at another example. Here, I want to learn about the effect of years of schooling on earnings.
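In equation form, this is something like Earningsᵢ = α + β · Schoolingᵢ + εᵢ.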
Here, years of schooling can range from 0 for someone with no schooling up to however many years someone has completed. The logic of regression as comparison is slightly more nuanced in this case. The standard interpretation of the coefficient beta is the average change in earnings associated with one additional year of schooling. This, too, is the result of a specific comparison: you are comparing, on average, how much more (or less) someone with X+1 years of schooling earns compared to someone with X years of schooling. The interpretation is not causal, though, because years of schooling is not randomly assigned as in an experiment: it is affected by variables like parents’ income that also affect future earnings.
So this is what he means by regression as a tool for comparison (at least, that’s what I think).
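Before moving on, here is a tiny simulation of the experiment example, just a sketch in Python with statsmodels and made-up numbers, that shows the comparison literally: the coefficient on D is exactly the difference in mean outcomes between the treated and untreated groups.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000

d = rng.integers(0, 2, size=n)            # randomized treatment dummy (D)
y = 2.0 + 1.5 * d + rng.normal(size=n)    # outcome (Y) with a true effect of 1.5

# Regress Y on D (with a constant)
fit = sm.OLS(y, sm.add_constant(d)).fit()

print(fit.params)                           # [intercept, coefficient on D]
print(y[d == 1].mean() - y[d == 0].mean())  # difference in group means: the same number
```

The regression is just doing the comparison of means for us (and handing us a standard error along the way).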
2. "If you don't know / can't easily explain what comparisons you're trying to make."
Personally, this is the most important part of the tweet. Understanding it will help anyone currently pursuing a PhD, or planning to pursue one, to read research papers more critically and to write better ones.
Peter here is trying to emphasize the importance of having a purpose for the regressions you run. I think this is influenced by design-based thinking, which Peter is an expert on. This line of thinking about comparisons helps you decide the type of regression you want to run and be confident about it.
So, in essence, even before running the regression, you should be clear about the question you are trying to answer and about what comparison would be suitable to answer it. For instance, suppose you want to understand the impact of a training program on wages. Ask yourself: what kind of comparison would help answer this question? Would comparing the wages of individuals who received training to those who did not be enough? Or do you want to compare individuals who are identical in every other respect and differ only in whether they received training? These questions help you think about the type of regression you want to run and, in turn, make it easy to explain.
3. “Then you don’t understand the regressions you’re running.”
This, I think, is Peter’s word of caution, and it goes together with point 2. If you do not know what comparisons you are making in a regression, you are essentially pulling results out of a black box. Regression is not just a technical procedure that you run; it is a tool for answering questions by making valid comparisons. Without a proper understanding of the comparisons your question requires, there is a real risk of misinterpreting the results.
Peter then adds to the thread.
Controls can play two roles in this story 1) They can determine what units you're comparing (e.g. "design-based" controls isolating clean treatment/IV contrasts) 2) They can determine what features of units are compared (e.g. fixed effects converting outcome levels to trends)
Roles of Controls in Regression
At a basic level, the role of controls in a regression is to help isolate the effect of the independent variable by accounting for other factors that could influence the results (this influence is called confounding). For instance, if you could control for all of the confounding factors in an OLS regression, the coefficient on the independent variable would be a causal effect. But why is that the case? For this, we must understand how adding controls changes Peter’s notion of comparison. Let’s take a simple example.
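In symbols, the specification is something like Earningsᵢ = α + β₁ · Schoolingᵢ + β₂ · Ageᵢ + εᵢ.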
I have extended the previous model of earnings on years of schooling to include Age as a control. For simplicity, let’s assume that Age is the only confounder, so controlling for it means that β₁ is a causal effect. It is a causal effect because we are now comparing, on average, how much more (or less) someone with X+1 years of schooling earns compared to someone with X years of schooling among people of the same age. In essence, we are making the same kind of comparison as before, but within age groups, for example, comparing earnings differences for an additional year of schooling among individuals who are all 18 years old.
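To see why this matters, here is a quick simulated sketch (in Python with statsmodels; the data-generating process and all the numbers are invented purely for illustration). Age pushes up both schooling and earnings, so leaving it out biases the schooling coefficient; adding it as a control recovers the true effect by comparing people of the same age.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000

# Age is the (only) confounder: it raises both schooling and earnings
age = rng.uniform(18, 60, size=n)
schooling = 6 + 0.10 * age + rng.normal(scale=2, size=n)
earnings = 10 + 0.50 * schooling + 0.30 * age + rng.normal(scale=5, size=n)

df = pd.DataFrame({"earnings": earnings, "schooling": schooling, "age": age})

# Without the control: the schooling coefficient also picks up part of the age effect
print(smf.ols("earnings ~ schooling", data=df).fit().params["schooling"])

# With the control: we compare people of the same age and get back roughly 0.50
print(smf.ols("earnings ~ schooling + age", data=df).fit().params["schooling"])
```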
Roles of Fixed Effects in Regression
Fixed effects are another kind of control that is used extensively in economics. Peter says fixed effects convert outcome levels to trends. What does that mean? We are used to hearing that fixed effects control for time-invariant characteristics. Another way to see it is that a fixed effects model adjusts for baseline differences in outcomes across units and focuses on changes over time (the trend). When we use fixed effects, the question we are asking changes: without fixed effects, our question is, “Which group has a higher level?”; with fixed effects, it becomes, “Which group is changing faster?” Let’s take an example.
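Adding school fixed effects to the earlier model gives something like Earningsᵢₛ = α + β₁ · Schoolingᵢₛ + β₂ · Ageᵢₛ + λₛ + εᵢₛ, where λₛ is a separate intercept for each school s.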
I now extend the previous model to include school fixed effects, denoted by λₛ in the equation. With this addition, the comparison we make changes again. We are still comparing, on average, how much more (or less) someone with X+1 years of schooling earns compared to someone with X years of schooling among people of the same age, but this time we are comparing only within the same school. So the estimate we get from this regression is not based on comparisons between schools; it comes from comparisons within schools. Think of fixed effects as controlling for the “personality” of each school. Similarly, using time fixed effects compares units within the same time period, controlling for overall time trends.
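Here is a minimal sketch of that idea (again Python and statsmodels, with an invented data-generating process; I drop the age control to keep it short). Each school has its own “personality” that shifts both schooling and earnings, so the pooled regression mixes between-school and within-school comparisons, while adding school dummies leaves only the within-school comparison.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_schools, per_school = 50, 100
school = np.repeat(np.arange(n_schools), per_school)

# Each school's "personality" shifts both schooling and earnings
school_effect = rng.normal(scale=5, size=n_schools)[school]
schooling = 10 + 0.5 * school_effect + rng.normal(scale=2, size=len(school))
earnings = 20 + 0.5 * schooling + school_effect + rng.normal(scale=3, size=len(school))

df = pd.DataFrame({"earnings": earnings, "schooling": schooling, "school": school})

# Pooled regression: mixes between- and within-school comparisons (biased upward here)
print(smf.ols("earnings ~ schooling", data=df).fit().params["schooling"])

# School fixed effects (a dummy per school): only within-school comparisons remain, roughly 0.5
print(smf.ols("earnings ~ schooling + C(school)", data=df).fit().params["schooling"])
```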
So, the key takeaway from this is that it is very important to understand your choices when running regressions. The choice of controls, including fixed effects, can dramatically change the following:
What question you are answering with your regression
How you interpret your coefficients
What assumptions you are making about the data-generating process
Bonus: Error Clustering in Regression
These days, error clustering is everywhere. But there is always a question: how do you decide where to cluster your errors? Understanding the comparisons you need to make to answer your question will allow you to cluster appropriately. First, let’s understand how clustering changes things. Clustering does not change the nature of the comparison we are making; it changes how we assess the precision of our estimates. In the schooling example above, if we cluster the errors at the state level, we allow for the possibility that observations within the same state are correlated. This is particularly important in panel or repeated cross-section settings, because schools within the same state might experience the same shocks or policies, which affect the outcomes.

If the estimate remains significant with clustering, we can be more confident that the effect is not an artifact of such correlated shocks and holds across regions. If the estimate becomes insignificant after clustering, it suggests that the observed average effect might not be consistent across regions. This does not necessarily mean there is no effect at all; it may be that the effect is present only in some regions, or that it is too noisy to distinguish from zero once we account for within-region correlation.
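In code, clustering is just a one-line change to how the standard errors are computed. Here is a sketch (Python and statsmodels, with invented state-level data; the groups argument is where you declare your clustering level). Because both schooling and the shocks share a common state component in this simulation, the clustered standard error comes out noticeably larger than the conventional one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_states, per_state = 40, 200
state = np.repeat(np.arange(n_states), per_state)

# Both schooling and the error term have a common within-state component
state_policy = rng.normal(scale=1.5, size=n_states)[state]
state_shock = rng.normal(scale=3.0, size=n_states)[state]
schooling = 12 + state_policy + rng.normal(scale=2, size=len(state))
earnings = 20 + 0.5 * schooling + state_shock + rng.normal(scale=3, size=len(state))

df = pd.DataFrame({"earnings": earnings, "schooling": schooling, "state": state})
model = smf.ols("earnings ~ schooling", data=df)

# Conventional standard errors treat all observations as independent
print(model.fit().bse["schooling"])

# Clustering at the state level allows observations within a state to be correlated
print(model.fit(cov_type="cluster", cov_kwds={"groups": df["state"]}).bse["schooling"])
```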
Understanding regression as a comparison tool is crucial for any aspiring econometrician. By clearly defining our comparisons, choosing appropriate controls and fixed effects, and properly clustering errors, we can ensure our analyses are meaningful and statistically sound. Remember, a thoughtful regression is a powerful regression.
Alright, that’s all for today. See you in the next post.