  Name            Company                      Industry      TotalCompMil NetSalesMil Log10TotalComp SalaryThou
1 A. Bender       COOPER COMPANIES INC         Health            3.796108     411.790       6.579339      461.3
2 A. Giannopoulos MICROS SYSTEMS INC           IT                1.937943     400.191       6.287341      690.0
3 A. Lafley       PROCTER & GAMBLE CO          Staples          19.086460   43373.000       7.280725     1600.0
4 A. Mixon III    INVACARE CORP                Health            5.056142    1247.176       6.703819      948.0
5 A. Myers        WASTE MANAGEMENT INC         Industrial        5.695333   11574.000       6.755519     1000.0
6 A. Perenchio    UNIVISION COMMUNICATIONS INC Discretionary     0.000000    1311.015             NA        0.0
Histogram and boxplot
hist(execComp$SalaryThou, breaks=20, ylim=c(0,500),
     xlab="Salary (thousands of $)",
     main="Salaries of 1501 US CEOs in 2003")
boxplot(execComp$SalaryThou, add=TRUE, horizontal=TRUE,
        at=475, boxwex=50, outcex=0.5)
Numerical summaries
Pay attention to the units and variable types.
summary(execComp)
Name Company Industry TotalCompMil
Length:1501 Length:1501 Length:1501 Min. : 0.000
Class :character Class :character Class :character 1st Qu.: 1.257
Mode :character Mode :character Mode :character Median : 2.533
Mean : 4.628
3rd Qu.: 5.477
Max. :74.750
NA's :6
NetSalesMil Log10TotalComp SalaryThou
Min. : 0.0 Min. :0.000 Min. : 0.0
1st Qu.: 471.8 1st Qu.:6.100 1st Qu.: 450.0
Median : 1237.4 Median :6.404 Median : 650.0
Mean : 5181.9 Mean :6.411 Mean : 696.8
3rd Qu.: 4044.8 3rd Qu.:6.739 3rd Qu.: 900.0
Max. :257157.0 Max. :7.874 Max. :3993.0
NA's :1 NA's :7
In the numerical summaries, you have seen:
measures of central tendency: mean, median
measures of location: median, quartiles (more generally, quantiles)
Each of these has its own meaning and quirks.
Next up is a measure of spread called the standard deviation. The name may seem mysterious but it will be clear later when we talk about transformations.
We can try to calculate the standard deviation of every column of the dataset, but we quickly run into problems.
apply(execComp, 2, sd)
Name Company Industry TotalCompMil NetSalesMil
NA NA NA NA NA
Log10TotalComp SalaryThou
NA 374.3382
apply(execComp, 2, sd, na.rm=TRUE)
Name Company Industry TotalCompMil NetSalesMil
NA NA NA 6.231619e+00 1.469691e+04
Log10TotalComp SalaryThou
5.006985e-01 3.743382e+02
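The character columns (Name, Company, Industry) have no meaningful standard deviation, and missing values produce NAs unless na.rm=TRUE is supplied. A minimal sketch of one workaround is to restrict attention to the numeric columns:
# Keep only the numeric columns, then compute the standard deviation of each
numCols <- sapply(execComp, is.numeric)
sapply(execComp[, numCols], sd, na.rm=TRUE)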
The name standard deviation has two components: standard and deviation.
The word “deviation” refers to the difference of each observation from a reference point.
For the standard deviation, the reference point is the mean.
The word “standard” refers to the ability to compare things on a common scale.
To illustrate where the word “standard” comes from, consider the data on salaries for CEOs. Recall the mean and standard deviation:
mean(execComp$SalaryThou, na.rm=TRUE)
[1] 696.7979
sd(execComp$SalaryThou, na.rm=TRUE)
[1] 374.3382
Next, let us look at some CEOs:
head(execComp[,c("Name", "SalaryThou")])
Name SalaryThou
1 A. Bender 461.3
2 A. Giannopoulos 690.0
3 A. Lafley 1600.0
4 A. Mixon III 948.0
5 A. Myers 1000.0
6 A. Perenchio 0.0
From the extract, you can see that rows 1, 2, and 6 have CEOs whose salaries are below the mean.
But how far below the mean? We can express this distance in terms of the original units: thousand dollars.
Distances are affected by the units chosen! So, one approach is to express this distance as a unitless quantity.
For example, A. Bender is \(\left| \frac{461.3-696.7979}{374.3382} \right|\approx 0.63\) standard deviations below the mean. In other words, A. Bender’s salary is about 3/5 of a standard deviation below the mean.
What we have done is to construct a \(z\)-score. This plays a role in getting a sense of magnitudes of typical datasets.
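As a quick sketch, the same z-score can be computed directly in R (A. Bender is the first row of the data); this should return roughly -0.63, matching the calculation above.
# z-score of A. Bender's salary: distance from the mean in standard-deviation units
(execComp$SalaryThou[1] - mean(execComp$SalaryThou, na.rm=TRUE)) /
  sd(execComp$SalaryThou, na.rm=TRUE)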
It turns out that for many datasets, it is very unlikely to have observations that are more than 3 standard deviations above or below the mean. This observation is related to what is called the empirical rule.
There are definitely exceptions to this rule. But the rule, along with visualizations and other summaries, can be useful to detect potential issues with data.
The empirical rule states that for “bell-shaped” histograms,
About 68% of the observations are within one standard deviation of the mean.
About 95% of the observations are within two standard deviations of the mean.
About 99.7% of the observations are within three standard deviations of the mean.
CEO salaries have a long right tail. Let us standardize them and look at the resulting histogram:
mSal <- mean(execComp$SalaryThou, na.rm=TRUE)
sdSal <- sd(execComp$SalaryThou, na.rm=TRUE)
zSal <- (execComp$SalaryThou - mSal)/sdSal
hist(zSal, ylim=c(0, 700), xlab="Standard Units",
     main="Standardized Salaries of 1501 US CEOs in 2003")
Let us get a rough sense of how well the empirical rule applies to CEO salaries.
mean(abs(zSal) >1, na.rm=TRUE)
[1] 0.2098601
mean(abs(zSal) >2, na.rm=TRUE)
[1] 0.0326449
mean(abs(zSal) >3, na.rm=TRUE)
[1] 0.01399067
There are other numerical summaries:
measures of spread: mean absolute deviation
measures of symmetry: skewness
measures of tail information: kurtosis, outliers
Almost all of these are univariate in nature, or applied one variable at a time.
Data transformations
There is a ton of whitespace in the pictures earlier.
It may be a good idea to “show more data” by considering a transformation.
But transformations may or may not affect numerical summaries; the sketch after the list below illustrates this for linear transformations.
Transformations could either be linear or nonlinear.
Linear transformations
Translation: adding or subtracting a constant amount
Scaling: multiplying or dividing by a constant amount (for example, changing millions of USD to USD, or USD to EUR)
Combination of both (for example, changing Celsius to Fahrenheit, or standardizing variables)
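To make the effect on numerical summaries concrete, here is a minimal sketch using the salary data (salUSD is a name introduced here for illustration):
# Scaling: convert salaries from thousands of USD to USD
salUSD <- execComp$SalaryThou * 1000
c(mean(salUSD, na.rm=TRUE), sd(salUSD, na.rm=TRUE))   # both are 1000 times the originals
# Combination (standardizing): zSal from earlier subtracts the mean and divides by the SD
c(mean(zSal, na.rm=TRUE), sd(zSal, na.rm=TRUE))       # 0 and 1 (up to rounding)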
Nonlinear transformations
One useful transformation is the logarithmic transformation. Logarithmic transformations are related to growth rates. We will see more of this later.
Here we use base 10. This may make sense for monetary amounts.
Dropping observations with zero net sales or zero compensation did not change the reported values much.
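For reference, here is a minimal sketch of how such a subset and regression could be constructed (the name eCsub matches the output shown further below; the exact filtering used originally is an assumption):
# Drop observations with zero net sales or zero total compensation (assumed filtering)
eCsub <- subset(execComp, NetSalesMil > 0 & TotalCompMil > 0)
fit <- lm(TotalCompMil ~ NetSalesMil, data = eCsub)
coef(fit)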
Think of these numerical summaries as averages or comparisons of averages.
Intercept: The average compensation of CEOs at firms with net sales of zero is about 3.77 million dollars.
Slope: When we compare firms whose net sales differ by 1 million dollars, the average compensation of their CEOs differs by about 0.00017 million dollars, or better yet, about 170 dollars.
Even better: When we compare firms whose net sales differ by 1 billion dollars, the average compensation of their CEOs differs by about 170,000 dollars.
Take note that we are NOT comparing the same CEO! We are comparing different CEOs.
In the current context, near zero net sales is plausible. So, the intercept would make sense.
Note that for communication purposes, it is really good practice to reduce the number of digits reported (more digits is NOT the same as more accurate!), taking the context into account.
Regression with only an intercept
What happens when you apply lm() without NetSalesMil?
You will note that the intercept matches the mean of TotalCompMil. This pattern holds in general.
You can also think of the intercept as the slope of a regressor which takes on the value 1 for all observations.
Note that as soon as you include NetSalesMil, the meaning of the intercept will change.
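To check the earlier claim that the intercept in an intercept-only regression matches the mean of TotalCompMil, here is a minimal sketch (reusing eCsub from the sketch above):
# Intercept-only regression: the estimated intercept is just the sample mean
coef(lm(TotalCompMil ~ 1, data = eCsub))
mean(eCsub$TotalCompMil)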
Regression lines, fitted values, and residuals
In the compensation-sales relationship, the regression line is \[\widehat{\mathtt{TotalCompMil}}_t=3.7670804+0.0001673*\mathtt{NetSalesMil}_t\]
The regression line where I did not include NetSalesMil is \[\widehat{\mathtt{TotalCompMil}}_t=4.626\]
Both these lines could be used to produce fitted values. Just plug in values for NetSalesMil from the data.
This plugging-in matters for the first line.
It does not matter for the second line.
It is possible to use these lines to make predictions outside of the data used for the linear regression analysis. This also produces fitted values, which are more accurately termed predicted values.
The regression line always passes through the point of averages or “the center”.
For the regression line \[\widehat{\mathtt{TotalCompMil}}_t=3.7670804+0.0001673*\mathtt{NetSalesMil}_t\] observe that
The difference between the TotalCompMil from the data and the fitted value produces the residuals: \[\mathtt{TotalCompMil}_t-\widehat{\mathtt{TotalCompMil}}_t\]
Provided that there is an intercept, the residuals are guaranteed to sum up to zero. So, the mean of the residuals is always equal to zero.
You can verify this easily for the case where our regression line is just \[\widehat{\mathtt{TotalCompMil}}_t=4.626\]
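A quick numerical check for the line with a slope, reusing the fit object from the earlier sketch:
# With an intercept, the residuals sum to zero (up to floating-point error)
sum(residuals(fit))
mean(residuals(fit))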
For the regression line \[\widehat{\mathtt{TotalCompMil}}_t=3.7670804+0.0001673*\mathtt{NetSalesMil}_t\] observe that
The black line is the regression line obtained from lm().
Compared to the red line, the black line has a smaller slope in absolute value.
It turns out that a single factor accounts for how much smaller the slope is relative to the red line.
This factor is called the correlation coefficient.
Correlation and fit
The correlation coefficient has the following properties:
It has no units and is computed using pairs of variables.
It is always between -1 and 1.
It is unaffected by linear transformations of the variables, up to the sign.
The square of the correlation coefficient is called the R-squared, which is usually used as a measure of fit.
Through some algebraic manipulations, R-squared can be shown to be
the ratio of the variance of the fitted values to the variance of the regressand
one minus the ratio of the variance of the residuals to the variance of the regressand
The key is to recognize that, in general, the variance of the regressand is equal to the sum of the variance of the fitted values and the variance of the residuals.
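As a sketch of this connection in the compensation-sales example (reusing eCsub and fit), the squared correlation should match the Multiple R-squared in the summary output below, and the variance decomposition should hold:
# R-squared equals the squared correlation in simple linear regression
cor(eCsub$TotalCompMil, eCsub$NetSalesMil)^2
# Variance decomposition: var(regressand) = var(fitted values) + var(residuals)
var(eCsub$TotalCompMil)
var(fitted(fit)) + var(residuals(fit))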
Call:
lm(formula = TotalCompMil ~ NetSalesMil, data = eCsub)
Residuals:
Min 1Q Median 3Q Max
-21.679 -2.787 -1.658 0.860 69.945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.767e+00 1.573e-01 23.95 <2e-16 ***
NetSalesMil 1.673e-04 1.017e-05 16.46 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.729 on 1489 degrees of freedom
Multiple R-squared: 0.1539, Adjusted R-squared: 0.1533
F-statistic: 270.8 on 1 and 1489 DF, p-value: < 2.2e-16
Formulas behind lm()
Let \(Y_t\) be the value of the regressand for the \(t\)th observation.
Let \(X_t^\prime=(1, X_{1t}, X_{2t}, \ldots X_{kt})\) be the corresponding value of the regressors.
lm() computes the intercept and slopes, and together we call it \(\widehat{\beta}=\left(\widehat{\beta}_0,\widehat{\beta}_1,\ldots,\widehat{\beta}_k\right)^\prime\).
The fitted values are \[\widehat{Y}_t=X_t^\prime\widehat{\beta}=\widehat{\beta}_0+\widehat{\beta}_1X_{1t}+\ldots+\widehat{\beta}_k X_{kt}\]
lm() is based on a computationally efficient way of calculating \[\widehat{\beta}=\left(\frac{1}{n}\sum_{t=1}^n X_tX_t^\prime\right)^{-1}\left(\frac{1}{n}\sum_{t=1}^n X_tY_t\right)\]
Note that it can happen that \(\widehat{\beta}\) will not exist. But usually software compensates for this.
The technical term for the reason why \(\widehat{\beta}\) will not exist is perfect multicollinearity.
Because the formula represents a general case, try working with special cases:
Set \(X_t^\prime=1\). This is linear regression with only an intercept.
Set \(X_t^\prime=(1,X_{1t})\). Many call this case simple linear regression. You have seen an example before: the compensation-sales relationship.
The general case where \(X_t^\prime=(1, X_{1t}, X_{2t}, \ldots X_{kt})\) is sometimes called multiple linear regression.
Try showing that in the case of simple linear regression: \[\widehat{\beta}_1=\frac{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right)}{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2},\ \widehat{\beta}_0=\overline{Y}-\widehat{\beta}_1\overline{X}_1 \] where \[\overline{Y}=\frac{1}{n}\sum_{t=1}^n Y_t,\ \ \ \ \overline{X}_1=\frac{1}{n}\sum_{t=1}^n X_{1t}\]
Let \[\begin{eqnarray}s_{X_1Y} &=& \frac{1}{n}\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right),\\ s_{X_1}^2&=&\frac{1}{n}\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2, \ \ s_Y^2=\frac{1}{n}\sum_{t=1}^n\left(Y_{t}-\overline{Y}\right)^2\end{eqnarray}\]
These are the covariance of \(X_1\) and \(Y\) and the variances of \(X_1\) and \(Y\), respectively.
Let \(r_{X_1Y}\) be the correlation coefficient of \(X_1\) and \(Y\), i.e. \[r_{X_1Y}=\frac{s_{X_1Y}}{\sqrt{s_{X_1}^2s_{Y}^2}}=\frac{s_{X_1Y}}{s_{X_1}s_{Y}}\]
You can show that in the simple linear regression case, the slope can be rewritten as \[\widehat{\beta}_1=\frac{s_{X_1Y}}{s_{X_1}^2}=r_{X_1Y}\frac{s_Y}{s_{X_1}}\]
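A quick sketch checking this on the compensation-sales data (reusing eCsub and fit; cov(), var(), and sd() in R use n-1 in the denominator, which cancels in these ratios):
# Slope as covariance over the variance of the regressor
with(eCsub, cov(NetSalesMil, TotalCompMil) / var(NetSalesMil))
# Slope as correlation times the ratio of standard deviations
with(eCsub, cor(NetSalesMil, TotalCompMil) * sd(TotalCompMil) / sd(NetSalesMil))
coef(fit)["NetSalesMil"]   # both should match this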
A more concise way of writing \(\widehat{\beta}\) is to use matrix notation.
It is connected to how a spreadsheet looks: \[\underset{\left(n\times1\right)}{\mathbf{Y}}=\left(\begin{array}{c}
Y_{1}\\
\vdots\\
Y_{n}
\end{array}\right),\underset{\left(n\times \left(k+1\right)\right)}{\mathbf{X}}=\left(\begin{array}{c}
X_{1}^{\prime}\\
\vdots\\
X_{n}^{\prime}
\end{array}\right)=\left(\begin{array}{cccc}
1 & {X_{11}} & \cdots & {X_{k1}}\\
\vdots & \vdots & \ddots & \vdots\\
1 & {X_{1n}} & \cdots & {X_{kn}}
\end{array}\right)\]
So, we can write \[\widehat{\beta}= \left(\mathbf{X}^\prime \mathbf{X}\right)^{-1}\mathbf{X}^\prime\mathbf{Y}\]
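Here is a minimal sketch of this formula on the compensation-sales data (reusing eCsub and fit; in practice lm() relies on a numerically more stable factorization rather than forming \(\mathbf{X}^\prime \mathbf{X}\) directly):
# Build the regressor matrix with a column of ones, then apply the formula
X <- cbind(1, eCsub$NetSalesMil)
Y <- eCsub$TotalCompMil
solve(t(X) %*% X, t(X) %*% Y)   # should match coef(fit)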
Frisch-Waugh-Lovell (FWL) Theorem
The regression slope also has another algebraic property, which is extremely useful in practice.
You can add more regressors in your linear regression analysis. But you will also have one slope for each regressor. How do we communicate the results?
Showing \[\widehat{\beta}= \left(\mathbf{X}^\prime \mathbf{X}\right)^{-1}\mathbf{X}^\prime\mathbf{Y}\] is probably not very helpful.
The key ideas are:
Slopes as comparisons of averages
The expression for the slope in simple linear regression: \[\widehat{\beta}_1=\frac{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right)}{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2} \]
Get the residuals from a linear regression of TotalCompMil on an intercept.
Get the residuals from a linear regression of NetSalesMil on an intercept.
(Residual regression) Compute the slope from a linear regression of the residuals in the first step on the residuals in the second step.
The resulting slope is exactly equal to the slope computed from a simple linear regression of TotalCompMil on NetSalesMil.
What you have seen is the simplest version of FWL.
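Here is a minimal sketch of the three steps (reusing eCsub, and assuming it has no missing values in these columns so that the residual vectors line up row by row):
# Step 1: residuals of TotalCompMil on an intercept (deviations from its mean)
rY <- residuals(lm(TotalCompMil ~ 1, data = eCsub))
# Step 2: residuals of NetSalesMil on an intercept
rX <- residuals(lm(NetSalesMil ~ 1, data = eCsub))
# Step 3: slope from the residual regression
coef(lm(rY ~ rX))["rX"]
# Compare with the slope from the simple linear regression
coef(lm(TotalCompMil ~ NetSalesMil, data = eCsub))["NetSalesMil"]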
The more general version is as follows. Focus on the last entry of \(\widehat{\beta}\), namely \(\widehat{\beta}_k\).
Get the residuals from a linear regression of \(Y\) on all regressors except \(X_k\).
Get the residuals from a linear regression of \(X_k\) on all regressors except \(X_k\).
(Residual regression) Compute the slope from a linear regression of the residuals in the first step on the residuals in the second step. The resulting slope is exactly equal to \(\widehat{\beta}_k\).
Intercept: The average log compensation of CEOs at firms with log net sales of zero is about 2.75. What about the units? Is it million dollars? Hard to communicate. But it roughly translates to a predicted compensation of about 562 dollars.
Slope: When we compare firms whose log net sales differ by 1, the average log compensation of their CEOs differs by about 0.4.
Even better: When we compare firms whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.4 percent.
Where did the relative changes (in percent) come from? Note that for the natural logarithm, \(\ln x_2 - \ln x_1 = \ln\left(\dfrac{x_2}{x_1}\right)=\ln\left[1+\left(\dfrac{x_2}{x_1}-1\right)\right] \approx \dfrac{x_2}{x_1}-1\) when the relative change is small.
What happens when you use a different base for the logarithmic transformation?
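For reference, here is a sketch of the kind of log-log regression being interpreted above. The construction of log10NetSales is an assumption (measuring net sales in dollars rather than millions only shifts the intercept, not the slope); Log10TotalComp in the data is already the base-10 log of total compensation in dollars.
# Log-log regression of compensation on net sales (reusing eCsub)
eCsub$log10NetSales <- log10(eCsub$NetSalesMil * 1e6)   # net sales in dollars (assumed)
fitLog <- lm(Log10TotalComp ~ log10NetSales, data = eCsub)
coef(fitLog)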
It is hard to make sense of the visualization, because there are 10 industries.
You can also make separate visualizations for each industry. Each industry will have fewer observations relative to the total number of observations.
But a numerical summary might be a good alternative. We first add industry dummies (and later, interaction terms) to describe differences in the compensation-sales relationship across industries.
Intercept: The average log compensation of CEOs at firms with log net sales of zero in the Discretionary industry is about 2.48, which roughly translates to 301 dollars.
Slope of log10NetSales: When we compare firms in the same industry whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.43 percent.
So you might wonder: where are the industry differences?
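A sketch of the regression with industry dummies (reusing eCsub and log10NetSales from the sketch above; Industry is the character column in the data, and factor() makes the dummy coding explicit). The coefficients on the industry dummies are where the industry differences show up: they are differences in intercepts relative to the baseline industry.
# Industry dummies shift the intercept; the slope on log10NetSales is common to all industries
fitInd <- lm(Log10TotalComp ~ log10NetSales + factor(Industry), data = eCsub)
coef(fitInd)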
So far, you have encountered many results from linear regressions.
Let us focus on a theme shared by these results. You might compare different results. Take for example:
Linear regression of TotalCompMil on a column of ones
Linear regression of TotalCompMil on a column of ones and industry dummies
In both results, you have two different intercepts with different meanings. But are they related to each other somehow?
For the linear regression of TotalCompMil on a column of ones, the intercept computed was the overall mean of TotalCompMil (regardless of industry or whatever group).
For the linear regression of TotalCompMil on a column of ones and industry dummies, the intercept computed was the mean of TotalCompMil for the Discretionary industry.
To see how they are related, we look at a simpler version of the setup of short and long regressions.
Let us use some notation.
Short regression: \(\widetilde{Y}_t = \widetilde{\beta}_0\)
Long regression: \(\widehat{Y}_t = \widehat{\beta}_0+ \widehat{\beta}_1 X_{1t}\)
How is \(\widehat{\beta}_0\) related to \(\widetilde{\beta}_0\) algebraically?
Note that \(\widetilde{Y}_t\) and \(\widehat{Y}_t\) are different fitted values.
We can obtain a linear regression of \(X_{1}\) on a column of ones, i.e., \(\widehat{X}_{1t}=\widehat{\delta}_0\).
Next, \(X_{1t}\) differs from the fitted value \(\widehat{X}_{1t}\) by a residual.
Therefore, averaging both sides of the long regression (and using the facts that the residuals average to zero, that \(\widetilde{\beta}_0=\overline{Y}\), and that \(\widehat{\delta}_0=\overline{X}_1\)), we must have \[\widetilde{\beta}_0= \widehat{\beta}_0+\widehat{\beta}_1 \widehat{\delta}_0\]
Try interpreting this algebraic relationship where you have two subgroups and \(X_{1t}\) is a dummy variable for one of the two subgroups.
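A minimal numerical check of this relationship in the compensation-sales data in levels, where \(X_{1t}\) is NetSalesMil (reusing eCsub and fit from the earlier sketches):
b_short <- coef(lm(TotalCompMil ~ 1, data = eCsub))   # short regression: tilde beta_0
b_long  <- coef(fit)                                  # long regression: hat beta_0, hat beta_1
d0      <- coef(lm(NetSalesMil ~ 1, data = eCsub))    # hat delta_0, the mean of NetSalesMil
b_long[1] + b_long[2] * d0                            # should equal b_short
b_short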
The algebraic relationships that connect the short regression with the long regression are much more general.
For example:
Short regression: \(\widetilde{Y}_t = \widetilde{\beta}_0+\widetilde{\beta}_1 X_{1t}\)
Long regression: \(\widehat{Y}_t = \widehat{\beta}_0+ \widehat{\beta}_1 X_{1t}+ \widehat{\beta}_2 X_{2t}\)
You can show that \(\widetilde{\beta}_0= \widehat{\beta}_0+\widehat{\beta}_2 \widehat{\delta}_0\) and \(\widetilde{\beta}_1= \widehat{\beta}_1+\widehat{\beta}_2 \widehat{\delta}_1\), where \(\widehat{\delta}_0\) and \(\widehat{\delta}_1\) are computed from a linear regression of \(X_{2}\) on \(X_{1}\) and a column of ones.
Return to the example at the beginning:
Linear regression of TotalCompMil on a column of ones
Linear regression of TotalCompMil on a column of ones and industry dummies
The meaning of the intercept changed when comparing the short and the long regression. Can you make sense of where the changes come from?
Why is it important to know these algebraic relationships?
To demystify linear regressions
To use them later to assess the sensitivity of results
Returning to subgroup analysis
After that detour on short and long regressions, we now revisit the compensation-sales relationship with industry differences.
We saw that adding industry dummies only changed how we communicate the results regarding the intercept.
Of course, the numerical value of the slope has changed but the way we communicated it is the same as before!
We now introduce interaction terms: these allow for group differences in the slope.
What are interaction terms? The interaction term of one regressor with another regressor is just their product.
If you interact the industry dummies with sales, then, in our context, you will get 9 interaction terms.
From here, can you figure out in what sense industry differences will be reflected in the compensation-sales relationship?
How does it differ from the regression without interaction terms?
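A sketch of the regression with interaction terms (reusing the names from the earlier sketches; in an R formula, the * operator expands to the main effects plus their interaction):
# Interacting the industry dummies with log10NetSales lets both the intercept and the slope vary by industry
fitInt <- lm(Log10TotalComp ~ log10NetSales * factor(Industry), data = eCsub)
coef(fitInt)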
Intercept: The average log compensation of CEOs at firms with log net sales of zero in the Discretionary industry is about 2.05, which translates to about 112 dollars.
Slope of log10NetSales: When we compare firms in the Discretionary industry whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.47 percent.
What about the role of the interaction terms?
For the Telecom industry: When we compare firms in the Telecom industry whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.6 percent.
You can interpret the other coefficients of the interactions in a similar way.
Compare these to the results without interaction terms.
What if you interact variables which are both continuous?
Consider an example based on a student attendance dataset from Wooldridge’s Introductory Econometrics.
Here we compute a linear regression of the standardized score on a principles of microeconomics final exam at the University of Michigan on the percentage of classes attended, prior college GPA, and their interaction.
We can obtain the data online and the data description can be found here.
This is an example of a CSV-type file, but the separator is no longer a comma; it is a tab (produced by pressing the Tab key).
This is also an example where the header row is not available, so you have to supply the column names using the data description.
You have to modify the read.csv() command to account for these features.
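A minimal sketch of such a modification (the file name and the column names are placeholders; take both from the data description):
# Tab-separated values, no header row
attendData <- read.csv("attend.raw", sep = "\t", header = FALSE)
# Then supply the column names from the data description, e.g.
# names(attendData) <- c("attend", "priGPA", "atndrte", ...)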
Coefficient of atndrte: When we compare students who have zero prior college GPA, the ones whose attendance rates are higher by 1 percentage point (note this!) would have average standardized final exam scores lower by about 0.03.
Much better: When we compare students who have zero prior college GPA, the ones whose attendance rates are higher by 30 percentage points would have average standardized final exam scores lower by about 0.96, which is roughly a difference of one standard deviation in the original final exam scores.
An even better way to communicate the preceding results is to recenter the regressors before computing the linear regression.
The recentering could be relative to the mean or relative to other interesting values.
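A sketch of the recentered regression (attendData, atndrte, and priGPA follow the earlier sketch and the variable names in the text; stndfnl, the standardized final exam score, is an assumed name taken from the data description):
# Recenter the regressors at their means before forming the interaction
attendData$atndrte_c <- attendData$atndrte - mean(attendData$atndrte)
attendData$priGPA_c  <- attendData$priGPA  - mean(attendData$priGPA)
lm(stndfnl ~ atndrte_c * priGPA_c, data = attendData)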
When we compare students who have prior college GPA equal to its mean but whose attendance rates are higher by 1 percentage point, their average standardized final exam scores would be lower by roughly 0.
Even better: When we compare students who have prior college GPA equal to its mean but whose attendance rates are higher by 2 percentage points, their average standardized final exam scores would be lower by about 0.01, which is roughly a difference of a hundredth of a standard deviation in the original final exam scores.
Try interpreting the coefficient of priGPA. You have the original results and the results with the recentered variables.
Try interpreting the intercept in the results with the recentered variables.
Try different recenterings and see what happens.
I end here by asking you to really think about what the word “linear” means in linear regression.
Operations on data
You have seen visualizations and numerical summaries of data.
Behind every visualization and numerical summary, there is a set of operations on data.
Typical data operations involve: summing up, scaling, transforming, subsetting or filtering, arranging in some order, standardizing, comparing
Everything we have done is based on the data we have on hand.
We also saw many linear regressions. All of them are about comparisons involving averages.
What we are going to do next is to think hard about summaries and linear regressions before doing any computations.
More things to think about
Smoothing the histogram to capture broad features
Time series data and other types of data
Does it make sense to apply summary statistics directly?