Describing Distributions

(when the data is on hand)

Andrew Pua

What are we doing this week?

  • Course information

  • Jump into data-based activities

    • Descriptive statistics
    • Linear regressions
  • Communicate results from a case study on executive compensation

  • Work on some of the algebra behind linear regressions

  • Fast track into R

How much do executives earn?

  • We are going to study how much money CEOs in the US earn.

  • CEO compensation is a topic that attracts attention from almost everyone.

  • Many feel that all CEOs get paid high salaries, and perhaps their compensation levels are not entirely well justified.

  • Feel free to look into the debates and rich literature about executive compensation.

Origins of the data

  • I used textbook data from Stine and Foster (2004).

    • Comes in xls format but I converted it into csv format.
    • Comes in two versions: one with sales information and the other with salary information.
  • Start to get used to the following:

    • Using non-Excel based formats
    • Merging different datasets
    • Choosing good variable names
dataset1 <- read.csv("sia1_ceo_comp.csv")
dataset2 <- read.csv("04_4m_exec_comp_2003.csv")
names(dataset1) <- c("Name", "Company", "Industry", "TotalCompMil", 
                     "NetSalesMil", "Log10TotalComp")
names(dataset2) <- c("SalaryThou", "Name", "Company", "Industry", 
                     "TotalCompMil")
execComp <- merge(dataset1, dataset2, 
                  by=c("Name","Company", "Industry", "TotalCompMil")) 
head(execComp)
             Name                      Company      Industry TotalCompMil
1       A. Bender         COOPER COMPANIES INC        Health     3.796108
2 A. Giannopoulos           MICROS SYSTEMS INC            IT     1.937943
3       A. Lafley          PROCTER & GAMBLE CO       Staples    19.086460
4    A. Mixon III                INVACARE CORP        Health     5.056142
5        A. Myers         WASTE MANAGEMENT INC    Industrial     5.695333
6    A. Perenchio UNIVISION COMMUNICATIONS INC Discretionary     0.000000
  NetSalesMil Log10TotalComp SalaryThou
1     411.790       6.579339      461.3
2     400.191       6.287341      690.0
3   43373.000       7.280725     1600.0
4    1247.176       6.703819      948.0
5   11574.000       6.755519     1000.0
6    1311.015             NA        0.0

Histogram and boxplot

par(cex=2)
hist(execComp$SalaryThou, breaks=20, ylim=c(0,500), 
     xlab="Salary (thousands of $)", 
     main="Salaries of 1501 US CEOs in 2003") 
boxplot(execComp$SalaryThou, add=TRUE, horizontal=TRUE, 
        at=475, boxwex=50, outcex=0.5)

Numerical summaries

  • Pay attention to the units and variable types.
summary(execComp)
     Name             Company            Industry          TotalCompMil   
 Length:1501        Length:1501        Length:1501        Min.   : 0.000  
 Class :character   Class :character   Class :character   1st Qu.: 1.257  
 Mode  :character   Mode  :character   Mode  :character   Median : 2.533  
                                                          Mean   : 4.628  
                                                          3rd Qu.: 5.477  
                                                          Max.   :74.750  
                                                          NA's   :6       
  NetSalesMil       Log10TotalComp    SalaryThou    
 Min.   :     0.0   Min.   :0.000   Min.   :   0.0  
 1st Qu.:   471.8   1st Qu.:6.100   1st Qu.: 450.0  
 Median :  1237.4   Median :6.404   Median : 650.0  
 Mean   :  5181.9   Mean   :6.411   Mean   : 696.8  
 3rd Qu.:  4044.8   3rd Qu.:6.739   3rd Qu.: 900.0  
 Max.   :257157.0   Max.   :7.874   Max.   :3993.0  
 NA's   :1          NA's   :7                       
  • In the numerical summaries, you have seen:

    • measures of central tendency: mean, median
    • measures of location: median, quartiles (more generally, quantiles)
  • Each of these has its own meaning and quirks.

  • Next up is a measure of spread called the standard deviation. The name may seem mysterious but it will be clear later when we talk about transformations.

  • We can calculate the standard deviation of every column of the dataset, but we quickly run into problems.
apply(execComp, 2, sd)
          Name        Company       Industry   TotalCompMil    NetSalesMil 
            NA             NA             NA             NA             NA 
Log10TotalComp     SalaryThou 
            NA       374.3382 
apply(execComp, 2, sd, na.rm=TRUE)
          Name        Company       Industry   TotalCompMil    NetSalesMil 
            NA             NA             NA   6.231619e+00   1.469691e+04 
Log10TotalComp     SalaryThou 
  5.006985e-01   3.743382e+02 
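  • One fix (a sketch) is to restrict the calculation to the numeric columns, for example:
sapply(Filter(is.numeric, execComp), sd, na.rm=TRUE)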
  • The name standard deviation has two components: standard and deviation.

    • The word “deviation” refers to the difference of each observation from a reference point.
    • For the standard deviation, the reference point is the mean.
    • The word “standard” refers to the ability to compare things on a common scale.
  • To illustrate where the word “standard” comes from, consider the data on salaries for CEOs. Recall the mean and standard deviation:
mean(execComp$SalaryThou, na.rm=TRUE) 
[1] 696.7979
sd(execComp$SalaryThou, na.rm=TRUE) 
[1] 374.3382
  • Next, let us look at some CEOs:
head(execComp[,c("Name", "SalaryThou")])
             Name SalaryThou
1       A. Bender      461.3
2 A. Giannopoulos      690.0
3       A. Lafley     1600.0
4    A. Mixon III      948.0
5        A. Myers     1000.0
6    A. Perenchio        0.0
  • From the extract, you can see that rows 1, 2, and 6 have CEOs whose salaries are below the mean.
  • But how far below the mean? We can express this distance in terms of the original units: thousand dollars.
  • Distances are affected by the units chosen! So, one approach is to express this distance as a unitless quantity.
  • For example, A. Bender is \(\left| \frac{461.3-696.7979}{374.3382} \right|\approx 0.63\) standard deviations below the mean. In other words, A. Bender’s salary is about 3/5 of a standard deviation below the mean.
  • What we have done is to construct a \(z\)-score. This plays a role in getting a sense of typical magnitudes in a dataset.
  • It turns out that for many datasets, very few observations lie more than 3 standard deviations above or below the mean. This is called the empirical rule.
  • There are definitely exceptions to this rule. But the rule, along with visualizations and other summaries, can be useful to detect potential issues with data.
  • The empirical rule states that for “bell-shaped” histograms,

    • About 68% of the observations are within one standard deviation of the mean.
    • About 95% of the observations are within two standard deviations of the mean.
    • About 99.7% of the observations are within three standard deviations of the mean.
  • CEO salaries have a long right tail, and:
mSal <- mean(execComp$SalaryThou, na.rm=TRUE)
sdSal <- sd(execComp$SalaryThou, na.rm=TRUE) 
zSal <- (execComp$SalaryThou-mSal)/sdSal
hist(zSal, ylim = c(0, 700), xlab = "Standard Units", 
     main = "Standardized Salaries of 1501 US CEOs in 2003")
  • Let us get a rough sense of how well the empirical rule applies to CEO salaries.
mean(abs(zSal) > 1, na.rm=TRUE)
[1] 0.2098601
mean(abs(zSal) > 2, na.rm=TRUE)
[1] 0.0326449
mean(abs(zSal) > 3, na.rm=TRUE)
[1] 0.01399067
  • There are other numerical summaries:

    • measures of spread: mean absolute deviation
    • measures of symmetry: skewness
    • measures of tail information: kurtosis, outliers
  • Almost all of these are univariate in nature, or applied one variable at a time.
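  • As a small sketch, the mean absolute deviation of salaries can be computed directly; note that R's built-in mad() computes a median-based quantity, so the calculation is written out:
mean(abs(execComp$SalaryThou - mean(execComp$SalaryThou, na.rm=TRUE)), na.rm=TRUE)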

Data transformations

  • There is a lot of empty space in the earlier plots.
  • It may be a good idea to “show more data” by considering a transformation.
  • But transformations may or may not have effects on numerical summaries.
  • Transformations could either be linear or nonlinear.

Linear transformations

  • Translation: adding or subtracting a constant amount

  • Scaling: multiplying or dividing by a constant amount

    • Changing millions USD to USD
    • Changing USD to EUR
  • Combination of both

    • Changing Celsius to Fahrenheit
    • Standardizing variables
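  • As a small sketch of how a linear transformation moves the summaries, rescaling total compensation from millions of dollars to dollars multiplies both the mean and the standard deviation by \(10^6\):
mean(execComp$TotalCompMil, na.rm=TRUE) * 10^6
mean(execComp$TotalCompMil * 10^6, na.rm=TRUE)
sd(execComp$TotalCompMil, na.rm=TRUE) * 10^6
sd(execComp$TotalCompMil * 10^6, na.rm=TRUE)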

Nonlinear transformations

  • One useful transformation is the logarithmic transformation. Logarithmic transformations are related to growth rates. We will see more of this later.
  • Here we use base 10. This may make sense for monetary amounts.
execComp$log10TotalComp <- log10(execComp$TotalCompMil*10^6)
execComp$log10Sal <- log10(execComp$SalaryThou*10^3)
summary(execComp$log10Sal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -Inf   5.653   5.813    -Inf   5.954   6.601 
summary(execComp$log10TotalComp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   -Inf   6.099   6.404    -Inf   6.739   7.874       6 
  • Another base to use is \(e \approx 2.718\). This is used more frequently in economics.
execComp$logTotalComp <- log(execComp$TotalCompMil*10^6)
execComp$logSal <- log(execComp$SalaryThou*10^3)
summary(execComp$logSal)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -Inf   13.02   13.38    -Inf   13.71   15.20 
summary(execComp$logTotalComp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   -Inf   14.04   14.74    -Inf   15.52   18.13       6 
  • Other types of transformations, such as squaring a variable, could be useful, but the resulting numerical summaries might not be meaningful.
par(cex=2)
hist(execComp$log10Sal[execComp$log10Sal > 0], breaks = 20, 
     ylim = c(0, 700), xlab = "Log Salary", 
     main = "Log Salaries of 1501 US CEOs in 2003") 
boxplot(execComp$log10Sal[execComp$log10Sal > 0], add = TRUE, 
        horizontal = TRUE, at = 700, boxwex = 50, outcex = 0.5)

Visualizing variables together

par(cex=2)
plot(execComp$NetSalesMil, execComp$TotalCompMil, xlim=c(0,280000), ylim=c(0,90), 
     xlab = "Net Sales in Millions", ylab = "Total Compensation in Millions")
par(cex=2)
execComp$log10NetSales <- log10(execComp$NetSalesMil*10^6)
plot(execComp$log10NetSales, execComp$log10TotalComp, xlim = c(4, 12), 
     ylim = c(0, 9), xlab = "Log Net Sales", ylab = "Log Total Compensation")
boxplot(execComp$log10TotalComp, add = TRUE, horizontal = FALSE, 
        at = 12, outcex = 0.25)
boxplot(execComp$log10NetSales, add = TRUE, horizontal = TRUE, 
        at = 9, outcex = 0.25)

Linear regression analysis

  • We can look at one possible numerical summary of the previous plots.
  • Here I report the results for the full dataset.
lm(TotalCompMil ~ NetSalesMil, data=execComp)

Call:
lm(formula = TotalCompMil ~ NetSalesMil, data = execComp)

Coefficients:
(Intercept)  NetSalesMil  
  3.7602330    0.0001674  
  • TotalCompMil is called the regressand or outcome variable and NetSalesMil is called the regressor or predictor.
  • Next, I report the results when I drop firms with no sales and whose CEO has no reported compensation.
eCsub <- subset(execComp, (TotalCompMil>0 & NetSalesMil>0))
lm(TotalCompMil ~ NetSalesMil, data=eCsub)

Call:
lm(formula = TotalCompMil ~ NetSalesMil, data = eCsub)

Coefficients:
(Intercept)  NetSalesMil  
  3.7670804    0.0001673  
  • Dropping zero net sales and zero compensation did not change the reported values too much.
  • Think of these numerical summaries as averages or comparisons of averages.

    • Intercept: The average compensation of CEOs with net sales of zero is about 3.77 million dollars.
    • Slope: When we compare firms whose net sales differ by 1 million dollars, the average compensation of their CEOs differs by about 0.00017 million dollars or, better yet, about 170 dollars.
    • Even better: When we compare firms whose net sales differ by 1 billion dollars, the average compensation of their CEOs differs by about 170000 dollars.
  • Take note that you are NOT comparing the same CEO! We are comparing different CEOs.
  • In the current context, near zero net sales is plausible. So, the intercept would make sense.
  • Note that for communication purposes, it is good practice to reduce the number of digits reported (more digits does NOT mean more accuracy!), taking the context into account.

Regression with only an intercept

  • What happens when you apply lm() without NetSalesMil?
lm(TotalCompMil ~ 1, data=eCsub)

Call:
lm(formula = TotalCompMil ~ 1, data = eCsub)

Coefficients:
(Intercept)  
      4.626  
mean(eCsub$TotalCompMil)
[1] 4.626229
  • You will note that the intercept matches the mean of TotalCompMil. This pattern holds in general.
  • You can also think of the intercept as the slope of a regressor which takes on the value 1 for all observations.
  • Note that as soon as you include NetSalesMil, the meaning of the intercept will change.
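  • As a sketch of the column-of-ones idea, regressing TotalCompMil on an explicit column of ones (with the automatic intercept suppressed) returns the same number:
ones <- rep(1, nrow(eCsub))
lm(eCsub$TotalCompMil ~ ones - 1)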

Regression lines, fitted values, and residuals

  • In the compensation-sales relationship, the regression line is \[\widehat{\mathtt{TotalCompMil}}_t=3.7670804+0.0001673*\mathtt{NetSalesMil}_t\]
  • The regression line where I did not include NetSalesMil is \[\widehat{\mathtt{TotalCompMil}}_t=4.626\]
  • Both these lines could be used to produce fitted values. Just plug in values for NetSalesMil from the data.
    • This plugging-in matters for the first line.
    • It does not matter for the second line.
  • It is possible to use these lines to make predictions outside of the data used for linear regression analysis. This also produces fitted values, which are more accurately termed predicted values.
  • The regression line always passes through the point of averages or “the center”.

  • For the regression line \[\widehat{\mathtt{TotalCompMil}}_t=3.7670804+0.0001673*\mathtt{NetSalesMil}_t\] observe that

mean(eCsub$TotalCompMil)
[1] 4.626229
mean(eCsub$NetSalesMil)
[1] 5135.554
mean(fitted(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))
[1] 4.626229
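  • As a quick check (a sketch, with b as shorthand for the estimated coefficients), plugging the mean of NetSalesMil into the regression line recovers the mean of TotalCompMil:
b <- coef(lm(TotalCompMil ~ NetSalesMil, data=eCsub))
b[[1]] + b[[2]] * mean(eCsub$NetSalesMil)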
  • The difference between the TotalCompMil from the data and the fitted value produces the residuals: \[\mathtt{TotalCompMil}_t-\widehat{\mathtt{TotalCompMil}}_t\]

  • Provided that there is an intercept, the residuals are guaranteed to sum up to zero. So, the mean of the residuals is always equal to zero.

  • You can verify easily for the case where our regression line is just \[\widehat{\mathtt{TotalCompMil}}_t=4.626\]

  • For the regression line \[\widehat{\mathtt{TotalCompMil}}_t=3.7670804+0.0001673*\mathtt{NetSalesMil}_t\] observe that
mean(eCsub$TotalCompMil - fitted(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))
[1] 1.907445e-16
mean(residuals(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))
[1] 1.911481e-16

The regression slope

  • The regression slope actually depends on another numerical summary for relationships between two variables.
  • It can be shown that the means and the standard deviations of each variable enter into the calculation.
  • But there is one more number that ties them all together.
  • This is best seen by comparing the regression line with the line most people intuitively draw as the “best-fitting” line (the red line below).
par(cex=2)
plot(eCsub$NetSalesMil, eCsub$TotalCompMil, xlim=c(0,280000), ylim=c(0,90), 
     xlab = "Net Sales in Millions", ylab = "Total Compensation in Millions")
abline(a= mean(eCsub$TotalCompMil)-sd(eCsub$TotalCompMil)/sd(eCsub$NetSalesMil)*mean(eCsub$NetSalesMil), 
       b=sd(eCsub$TotalCompMil)/sd(eCsub$NetSalesMil), col="red")
abline(a=coef(lm(TotalCompMil ~ NetSalesMil, data=eCsub))[[1]], 
       b=coef(lm(TotalCompMil ~ NetSalesMil, data=eCsub))[[2]])
  • The black line is the regression line obtained from lm().
  • Compared to the red line, the black line has a smaller slope in absolute value.
  • It turns out that there is a factor that produces the smaller slope relative to the red line.
  • This factor is called the correlation coefficient.
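  • As a sketch, the red slope is the ratio of the standard deviations, and scaling it by the correlation coefficient gives the black (regression) slope:
r <- cor(eCsub$TotalCompMil, eCsub$NetSalesMil)
sd(eCsub$TotalCompMil)/sd(eCsub$NetSalesMil)       # red line slope
r * sd(eCsub$TotalCompMil)/sd(eCsub$NetSalesMil)   # black (regression) line slope
coef(lm(TotalCompMil ~ NetSalesMil, data=eCsub))[[2]]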

Correlation and fit

  • The correlation coefficient has the following properties:

    • It has no units and is computed using pairs of variables.
    • It is always between -1 and 1.
    • It is unaffected by linear transformations of the variables, up to the sign.
  • The square of the correlation coefficient is called the R-squared, which is usually used as a measure of fit.

  • Through some algebraic manipulations, R-squared can be shown to be

    • the ratio of the variance of the fitted values to the variance of the regressand
    • one minus the ratio of the variance of the residuals to the variance of the regressand
  • The key is to recognize that, in general, the variance of the regressand is equal to the sum of the variance of the fitted values and the variance of the residuals.

var(eCsub$TotalCompMil)
[1] 38.75998
var(fitted(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))
[1] 5.96502
var(residuals(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))
[1] 32.79496
sd(residuals(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))
[1] 5.726688
sd(residuals(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))*sqrt(1490/1489)
[1] 5.728611
  • The last value is called a residual standard error in R.
  • The factor \(\sqrt{1490/1489}\) is called a degrees of freedom adjustment: sd() divides by \(n-1=1490\), while the residual standard error divides by the residual degrees of freedom \(n-2=1489\).
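  • As a sketch, the Multiple R-squared reported by summary() below can be recovered either as the squared correlation or from the variance of the residuals:
cor(eCsub$TotalCompMil, eCsub$NetSalesMil)^2
1 - var(residuals(lm(TotalCompMil ~ NetSalesMil, data=eCsub)))/var(eCsub$TotalCompMil)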
summary(lm(TotalCompMil ~ NetSalesMil, data=eCsub))

Call:
lm(formula = TotalCompMil ~ NetSalesMil, data = eCsub)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.679  -2.787  -1.658   0.860  69.945 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.767e+00  1.573e-01   23.95   <2e-16 ***
NetSalesMil 1.673e-04  1.017e-05   16.46   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.729 on 1489 degrees of freedom
Multiple R-squared:  0.1539,    Adjusted R-squared:  0.1533 
F-statistic: 270.8 on 1 and 1489 DF,  p-value: < 2.2e-16

Formulas behind lm()

  • Let \(Y_t\) be the value of the regressand for the \(t\)th observation.
  • Let \(X_t^\prime=(1, X_{1t}, X_{2t}, \ldots, X_{kt})\) be the corresponding value of the regressors.
  • lm() computes the intercept and slopes, and together we call it \(\widehat{\beta}=\left(\widehat{\beta}_0,\widehat{\beta}_1,\ldots,\widehat{\beta}_k\right)^\prime\).
  • The fitted values are \[\widehat{Y}_t=X_t^\prime\widehat{\beta}=\widehat{\beta}_0+\widehat{\beta}_1X_{1t}+\ldots+\widehat{\beta}_k X_{kt}\]
  • lm() is based on a computationally efficient way of calculating \[\widehat{\beta}=\left(\frac{1}{n}\sum_{t=1}^n X_tX_t^\prime\right)^{-1}\left(\frac{1}{n}\sum_{t=1}^n X_tY_t\right)\]
  • Note that it can happen that \(\widehat{\beta}\) will not exist. But usually software compensates for this.
  • The technical term for the reason why \(\widehat{\beta}\) will not exist is perfect multicollinearity.
  • Because the formula represents a general case, try working with special cases:

    • Set \(X_t^\prime=1\). This is linear regression with only an intercept.
    • Set \(X_t^\prime=(1,X_{1t})\). Many call this case simple linear regression. You have seen an example before: the compensation-sales relationship.
  • The general case where \(X_t^\prime=(1, X_{1t}, X_{2t}, \ldots, X_{kt})\) is sometimes called multiple linear regression.

  • Try showing that in the case of simple linear regression: \[\widehat{\beta}_1=\frac{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right)}{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2},\ \widehat{\beta}_0=\overline{Y}-\widehat{\beta}_1\overline{X}_1 \] where \[\overline{Y}=\frac{1}{n}\sum_{t=1}^n Y_t,\ \ \ \ \overline{X}_1=\frac{1}{n}\sum_{t=1}^n X_{1t}\]
  • Let \[\begin{eqnarray}s_{X_1Y} &=& \frac{1}{n}\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right),\\ s_{X_1}^2&=&\frac{1}{n}\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2, \ \ s_Y^2=\frac{1}{n}\sum_{t=1}^n\left(Y_{t}-\overline{Y}\right)^2\end{eqnarray}\]

  • These are the covariance of \(X_1\) and \(Y\), and the variances of \(X_1\) and \(Y\), respectively.

  • Let \(r_{X_1Y}\) be the correlation coefficient of \(X_1\) and \(Y\), i.e. \[r_{X_1Y}=\frac{s_{X_1Y}}{\sqrt{s_{X_1}^2s_{Y}^2}}=\frac{s_{X_1Y}}{s_{X_1}s_{Y}}\]
  • You can show that in the simple linear regression case, the slope can be rewritten as \[\widehat{\beta}_1=\frac{s_{X_1Y}}{s_{X_1}^2}=r_{X_1Y}\frac{s_Y}{s_{X_1}}\]
  • A more concise way of writing \(\widehat{\beta}\) is to use more matrix notation.

  • It is connected to how a spreadsheet looks like: \[\underset{\left(n\times1\right)}{\mathbf{Y}}=\left(\begin{array}{c} Y_{1}\\ \vdots\\ Y_{n} \end{array}\right),\underset{\left(n\times \left(k+1\right)\right)}{\mathbf{X}}=\left(\begin{array}{c} X_{1}^{\prime}\\ \vdots\\ X_{n}^{\prime} \end{array}\right)=\left(\begin{array}{cccc} 1 & {X_{11}} & \cdots & {X_{k1}}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & {X_{1n}} & \cdots & {X_{kn}} \end{array}\right)\]

  • So, we can write \[\widehat{\beta}= \left(\mathbf{X}^\prime \mathbf{X}\right)^{-1}\mathbf{X}^\prime\mathbf{Y}\]
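  • A sketch of this formula in R for the compensation-sales example; model.matrix() builds \(\mathbf{X}\), including the column of ones, while lm() itself relies on a numerically safer routine:
X <- model.matrix(TotalCompMil ~ NetSalesMil, data=eCsub)
Y <- eCsub$TotalCompMil
solve(t(X) %*% X) %*% t(X) %*% Y       # direct textbook calculation
coef(lm(TotalCompMil ~ NetSalesMil, data=eCsub))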

Frisch-Waugh-Lovell (FWL) Theorem

  • The regression slope also has another algebraic property, which is extremely useful in practice.
  • You can add more regressors to your linear regression analysis. But then you have just as many slopes. How do we communicate the results?
  • Showing \[\widehat{\beta}= \left(\mathbf{X}^\prime \mathbf{X}\right)^{-1}\mathbf{X}^\prime\mathbf{Y}\] is probably not very helpful.
  • The key ideas are:

    • Slopes as comparisons of averages
    • The expression for the slope in simple linear regression: \[\widehat{\beta}_1=\frac{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right)}{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2} \]
adj.Comp <- residuals(lm(TotalCompMil ~ 1, data=eCsub))
adj.Sales <- residuals(lm(NetSalesMil ~ 1, data=eCsub))
lm(adj.Comp ~ adj.Sales)

Call:
lm(formula = adj.Comp ~ adj.Sales)

Coefficients:
(Intercept)    adj.Sales  
 -1.292e-16    1.673e-04  
lm(adj.Comp ~ adj.Sales - 1)

Call:
lm(formula = adj.Comp ~ adj.Sales - 1)

Coefficients:
adj.Sales  
0.0001673  
  • Notice what happened:

    • Get the residuals from a linear regression of TotalCompMil on an intercept.
    • Get the residuals from a linear regression of NetSalesMil on an intercept.
    • (Residual regression) Compute the slope from a linear regression of the residuals in the first step on the residuals in the second step.
  • The resulting slope is exactly equal to the slope computed from a simple linear regression of TotalCompMil on NetSalesMil.

  • What you have seen is the simplest version of FWL.

  • The more general version is as follows. Focus on the last entry of \(\widehat{\beta}\), namely \(\widehat{\beta}_k\).

    • Get the residuals from a linear regression of \(Y\) on all regressors except \(X_k\).
    • Get the residuals from a linear regression of \(X_k\) on all regressors except \(X_k\).
    • (Residual regression) Compute the slope from a linear regression of the residuals in the first step on the residuals in the second step. The resulting slope is exactly equal to \(\widehat{\beta}_k\).
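  • A sketch of the general version in R, treating NetSalesMil as \(X_k\) and SalaryThou as the only other regressor (an arbitrary choice, just for illustration, with eCsub as defined earlier):
long <- lm(TotalCompMil ~ NetSalesMil + SalaryThou, data=eCsub)
# Steps 1 and 2: residualize the regressand and NetSalesMil on SalaryThou (and the intercept)
rY <- residuals(lm(TotalCompMil ~ SalaryThou, data=eCsub))
rX <- residuals(lm(NetSalesMil ~ SalaryThou, data=eCsub))
# Step 3: the residual regression reproduces the slope on NetSalesMil from the long regression
coef(long)[["NetSalesMil"]]
coef(lm(rY ~ rX))[["rX"]]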

Impact of logarithmic transformation

lm(log10TotalComp ~ log10NetSales, data=eCsub)

Call:
lm(formula = log10TotalComp ~ log10NetSales, data = eCsub)

Coefficients:
  (Intercept)  log10NetSales  
       2.7543         0.4002  
  • Intercept: The average log compensation of CEOs with log net sales of zero is about 2.75. What about the units? Is it million dollars? Hard to communicate. But it roughly translates to predicted compensation of about 570 dollars.
  • Slope: When we compare firms whose log net sales differ by 1, the average log compensation of their CEOs differs by about 0.4.
  • Even better: When we compare firms whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.4 percent.
  • Where did the relative changes (in percent) come from? Note that \(\log x_2 - \log x_1 = \log\left(\dfrac{x_2}{x_1}\right)=\log\left[1+\left(\dfrac{x_2}{x_1}-1\right)\right] \approx \dfrac{x_2}{x_1}-1\)
  • What happens when you use a different base for the logarithmic transformation?
eCsub$logNetSales <- log(eCsub$NetSalesMil*10^6)
lm(logTotalComp ~ logNetSales, data=eCsub)

Call:
lm(formula = logTotalComp ~ logNetSales, data = eCsub)

Coefficients:
(Intercept)  logNetSales  
     6.3421       0.4002  
  • Notice that the computed slope is the same. Changing the base rescales both log variables by the same constant (\(\log x = \ln(10)\cdot\log_{10} x\)), so the slope is unchanged; only the intercept changes.

Subgroup analysis

  • Communicating the results from the compensation-sales relationship is about comparing subgroups which are unlikely to be observable.
  • Finding firms whose net sales differ by exactly 1 million dollars (or 1 billion dollars, or by exactly 1 percent) would be extremely rare.
  • What happens if these subgroups are actually observable and are labeled correctly in the data?
  • We are going to look at industry differences in compensation.
par(cex=2)
boxplot(execComp$TotalCompMil ~ execComp$Industry, las=2, ylim=c(0,80),
        xlab="", ylab="Total Compensation", outcex=0.25)
par(cex=2)
boxplot(execComp$log10TotalComp ~ execComp$Industry, las=2, ylim=c(0,8),
        xlab="", ylab="Log Total Compensation", outcex=0.25)
  • Let us try to find the source of the missing industry label.
eCsub <- subset(execComp, TotalCompMil>0)
table(eCsub$Industry)

              Discretionary        Energy     Financial        Health 
            1           277            78           202           149 
   Industrial            IT     Materials       Staples       Telecom 
          229           278           108            69            23 
      Utility 
           80 
tapply(eCsub$TotalCompMil, as.factor(eCsub$Industry), mean)
              Discretionary        Energy     Financial        Health 
     0.428701      4.684144      5.083188      6.139536      5.365866 
   Industrial            IT     Materials       Staples       Telecom 
     3.471729      3.985904      3.207473      6.375649      9.978968 
      Utility 
     3.328927 

Linear regression with dummy variables

eCsub <- subset(execComp, (TotalCompMil>0 & Industry!=""))
lm(TotalCompMil ~ as.factor(Industry), data=eCsub)

Call:
lm(formula = TotalCompMil ~ as.factor(Industry), data = eCsub)

Coefficients:
                  (Intercept)      as.factor(Industry)Energy  
                       4.6841                         0.3990  
 as.factor(Industry)Financial      as.factor(Industry)Health  
                       1.4554                         0.6817  
as.factor(Industry)Industrial          as.factor(Industry)IT  
                      -1.2124                        -0.6982  
 as.factor(Industry)Materials     as.factor(Industry)Staples  
                      -1.4767                         1.6915  
   as.factor(Industry)Telecom     as.factor(Industry)Utility  
                       5.2948                        -1.3552  
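  • As a quick check (a sketch, with indFit as a scratch name), the intercept is the Discretionary average and adding a dummy coefficient recovers that industry's average from the earlier tapply() output:
indFit <- coef(lm(TotalCompMil ~ as.factor(Industry), data=eCsub))
indFit[["(Intercept)"]]                                          # Discretionary average
indFit[["(Intercept)"]] + indFit[["as.factor(Industry)Energy"]]  # Energy average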
  • The previous analysis essentially reproduces the industry averages.
  • The presentation is just slightly different but carries exactly the same information.
  • Putting in as.factor(Industry) converts the variable Industry, which labels the type of industry, into separate dummy variables: one for each industry except the baseline (here, Discretionary), giving 9 dummies plus the intercept.
  • Dummy variables are indicator variables which take on only two values 0 and 1.
  • What if we look at log compensation?
eCsub <- subset(execComp, (log10TotalComp>0 & Industry!=""))
lm(log10TotalComp ~ as.factor(Industry), data=eCsub)

Call:
lm(formula = log10TotalComp ~ as.factor(Industry), data = eCsub)

Coefficients:
                  (Intercept)      as.factor(Industry)Energy  
                      6.43482                        0.02629  
 as.factor(Industry)Financial      as.factor(Industry)Health  
                      0.07629                        0.03858  
as.factor(Industry)Industrial          as.factor(Industry)IT  
                     -0.08446                       -0.12435  
 as.factor(Industry)Materials     as.factor(Industry)Staples  
                     -0.09178                        0.16560  
   as.factor(Industry)Telecom     as.factor(Industry)Utility  
                      0.22936                       -0.06108  

Visualizing variables together across subgroups

  • Let us now revisit our scatterplot involving compensation and sales.

  • In the previous visualization, we did not show industry differences.

  • Two options:

    • Show one plot with industry labels.
    • Show as many plots as there are industry labels.
par(cex=2)
eCsub <- subset(execComp, (log10TotalComp>0 & log10NetSales>0 & Industry!=""))
plot(eCsub$log10TotalComp ~ eCsub$log10NetSales,
     col=as.factor(eCsub$Industry), xlim = c(4, 12), 
     ylim = c(0, 9), xlab = "Log Net Sales", ylab = "Log Total Compensation")
  • It is hard to make sense of the visualization, because there are 10 industries.
  • You can also make separate visualizations for each industry. Each industry will have fewer observations relative to the total number of observations.
  • But a numerical summary might be a good alternative. We start by adding industry dummies to the compensation-sales regression; interaction terms, which describe differences in the compensation-sales relationship across industries, come later.
lm(log10TotalComp ~ log10NetSales + as.factor(Industry), data=eCsub)

Call:
lm(formula = log10TotalComp ~ log10NetSales + as.factor(Industry), 
    data = eCsub)

Coefficients:
                  (Intercept)                  log10NetSales  
                      2.47925                        0.42690  
    as.factor(Industry)Energy   as.factor(Industry)Financial  
                      0.08264                        0.08704  
    as.factor(Industry)Health  as.factor(Industry)Industrial  
                      0.17972                       -0.05477  
        as.factor(Industry)IT   as.factor(Industry)Materials  
                      0.10874                       -0.06203  
   as.factor(Industry)Staples     as.factor(Industry)Telecom  
                     -0.01948                        0.18085  
   as.factor(Industry)Utility  
                     -0.14089  
  • How do you communicate the results?

    • Intercept: The average log compensation of CEOs with log net sales of zero in the Discretionary industry is about 2.48, which roughly translates to 301 dollars.
    • Slope of log10NetSales: When we compare firms in the same industry whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.43 percent.
  • So you might wonder: where are the industry differences?

par(cex=2)
plot(eCsub$log10TotalComp ~ eCsub$log10NetSales, xlim = c(7, 12), 
     ylim = c(3, 9), xlab = "Log Net Sales", 
     ylab = "Log Total Compensation", type="n")
points(eCsub$log10NetSales[eCsub$Industry=="Telecom"], eCsub$log10TotalComp[eCsub$Industry=="Telecom"], col="blue")
points(eCsub$log10NetSales[eCsub$Industry=="Discretionary"], eCsub$log10TotalComp[eCsub$Industry=="Discretionary"], col="black")
fit <- coef(lm(log10TotalComp ~ log10NetSales + as.factor(Industry), data=eCsub))
curve(cbind(1,x,0,0,0,0,0,0,0,0,0) %*% fit, add=TRUE)              # fitted line, baseline (Discretionary)
curve(cbind(1,x,0,0,0,0,0,0,0,1,0) %*% fit, add=TRUE, col="blue")  # fitted line, Telecom (dummy = 1)

Short and long regressions

  • So far, you have encountered many results from linear regressions.

  • Let us focus on a theme shared by these results. You might compare different results. Take for example:

    • Linear regression of TotalCompMil on a column of ones
    • Linear regression of TotalCompMil on a column of ones and industry dummies
  • In both results, you have two different intercepts with different meanings. But are they related to each other somehow?

    • For the linear regression of TotalCompMil on a column of ones, the intercept computed was the overall mean of TotalCompMil (regardless of industry or whatever group).

    • For the linear regression of TotalCompMil on a column of ones and industry dummies, the intercept computed was the mean of TotalCompMil for the Discretionary industry.

  • To see how they are related, we look at a simpler version of the setup of short and long regressions.

  • Let us use some notation.

    • Short regression: \(\widetilde{Y}_t = \widetilde{\beta}_0\)
    • Long regression: \(\widehat{Y}_t = \widehat{\beta}_0+ \widehat{\beta}_1 X_{1t}\)
  • How is \(\widehat{\beta}_0\) related to \(\widetilde{\beta}_0\) algebraically?

  • Note that \(\widetilde{Y}_t\) and \(\widehat{Y}_t\) are different fitted values.

  • We can obtain a linear regression of \(X_{1}\) on a column of ones, i.e., \(\widehat{X}_{1t}=\widehat{\delta}_0\).

  • Next, \(X_{1t}\) differs from the fitted value \(\widehat{X}_{1t}\) by a residual.

  • Therefore, \[\begin{eqnarray} \widehat{Y}_t &=& \widehat{\beta}_0+ \widehat{\beta}_1 X_{1t} \\ &=& \widehat{\beta}_0+ \widehat{\beta}_1 \left(X_{1t}-\widehat{X}_{1t}\right) + \widehat{\beta}_1 \widehat{X}_{1t} \\ &=& \widehat{\beta}_0+ \widehat{\beta}_1 \left(X_{1t}-\widehat{X}_{1t}\right)+ \widehat{\beta}_1 \widehat{\delta}_0 \end{eqnarray}\]

  • Taking averages of both sides, we have \[\begin{eqnarray} \frac{1}{n} \sum_{t=1}^n \widehat{Y}_t &=& \widehat{\beta}_0+ \widehat{\beta}_1 \frac{1}{n} \sum_{t=1}^n \left(X_{1t}-\widehat{X}_{1t}\right)+ \widehat{\beta}_1 \widehat{\delta}_0 \\ \frac{1}{n} \sum_{t=1}^n \widehat{Y}_t &=& \widehat{\beta}_0+\widehat{\beta}_1 \widehat{\delta}_0 \\ \Rightarrow \overline{Y} &=& \widehat{\beta}_0+\widehat{\beta}_1 \widehat{\delta}_0 \end{eqnarray}\]
  • Similarly, we also have

\[\frac{1}{n} \sum_{t=1}^n \widetilde{Y}_t = \widetilde{\beta}_0 \Rightarrow \overline{Y} =\widetilde{\beta}_0\]

  • Therefore, we must have \[\widetilde{\beta}_0= \widehat{\beta}_0+\widehat{\beta}_1 \widehat{\delta}_0\]

  • Try interpreting this algebraic relationship where you have two subgroups and \(X_{1t}\) is a dummy variable for one of the two subgroups.

  • The algebraic relationships that connect the short regression with the long regression are much more general.

  • For example:

    • Short regression: \(\widetilde{Y}_t = \widetilde{\beta}_0+\widetilde{\beta}_1 X_{1t}\)
    • Long regression: \(\widehat{Y}_t = \widehat{\beta}_0+ \widehat{\beta}_1 X_{1t}+ \widehat{\beta}_2 X_{2t}\)
  • You can show that \(\widetilde{\beta}_0= \widehat{\beta}_0+\widehat{\beta}_2 \widehat{\delta}_0\) and \(\widetilde{\beta}_1= \widehat{\beta}_1+\widehat{\beta}_2 \widehat{\delta}_1\), where \(\widehat{\delta}_0\) and \(\widehat{\delta}_1\) are computed from a linear regression of \(X_{2}\) on \(X_{1}\) and a column of ones.
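  • A quick numerical check (a sketch) of these relationships, taking TotalCompMil as \(Y\), NetSalesMil as \(X_{1}\), and SalaryThou as \(X_{2}\) (an arbitrary choice, just for illustration, with eCsub as currently defined):
short <- coef(lm(TotalCompMil ~ NetSalesMil, data=eCsub))
long  <- coef(lm(TotalCompMil ~ NetSalesMil + SalaryThou, data=eCsub))
delta <- coef(lm(SalaryThou ~ NetSalesMil, data=eCsub))
short[["NetSalesMil"]]                                                 # tilde beta_1
long[["NetSalesMil"]] + long[["SalaryThou"]] * delta[["NetSalesMil"]]  # hat beta_1 + hat beta_2 * hat delta_1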

  • Return to the example at the beginning:

    • Linear regression of TotalCompMil on a column of ones
    • Linear regression of TotalCompMil on a column of ones and industry dummies
  • The meaning of the intercept changed when comparing the short and the long regression. Can you make sense of where the changes come from?

  • Why is it important to know these algebraic relationships?

    • To demystify linear regressions
    • To use it later to assess sensitivity of results

Returning to subgroup analysis

  • After that detour on short and long regressions, we now revisit the compensation-sales relationship with industry differences.

  • We saw that adding industry dummies only changed how we communicate the results regarding the intercept.

    • Of course, the numerical value of the slope has changed but the way we communicated it is the same as before!
  • We now introduce interaction terms: these allow for group differences in the slope.

  • What are interaction terms? The interaction term of one regressor with another regressor is just their product.

  • If you interact the industry dummies with sales, then, in our context, you will get 9 interaction terms.

  • From here, can you figure out in what sense industry differences are reflected in the compensation-sales relationship?

  • How does it differ from the regression without interaction terms?

lm(log10TotalComp ~ log10NetSales + as.factor(Industry) 
   + log10NetSales:as.factor(Industry), data=eCsub)

Call:
lm(formula = log10TotalComp ~ log10NetSales + as.factor(Industry) + 
    log10NetSales:as.factor(Industry), data = eCsub)

Coefficients:
                                (Intercept)  
                                    2.05464  
                              log10NetSales  
                                    0.47272  
                  as.factor(Industry)Energy  
                                    0.48786  
               as.factor(Industry)Financial  
                                   -0.18931  
                  as.factor(Industry)Health  
                                    1.17585  
              as.factor(Industry)Industrial  
                                    0.45139  
                      as.factor(Industry)IT  
                                    1.27377  
               as.factor(Industry)Materials  
                                   -0.70969  
                 as.factor(Industry)Staples  
                                    0.47068  
                 as.factor(Industry)Telecom  
                                   -1.05686  
                 as.factor(Industry)Utility  
                                   -1.21224  
    log10NetSales:as.factor(Industry)Energy  
                                   -0.04370  
 log10NetSales:as.factor(Industry)Financial  
                                    0.03010  
    log10NetSales:as.factor(Industry)Health  
                                   -0.10979  
log10NetSales:as.factor(Industry)Industrial  
                                   -0.05469  
        log10NetSales:as.factor(Industry)IT  
                                   -0.13074  
 log10NetSales:as.factor(Industry)Materials  
                                    0.07078  
   log10NetSales:as.factor(Industry)Staples  
                                   -0.05258  
   log10NetSales:as.factor(Industry)Telecom  
                                    0.13062  
   log10NetSales:as.factor(Industry)Utility  
                                    0.11243  
  • How do you communicate the results?

    • Intercept: The average log compensation of CEOs with log net sales of zero in the Discretionary industry is about 2.05, which translates to about 112 dollars.
    • Slope of log10NetSales: When we compare firms in the Discretionary industry whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.47 percent.
  • What about the role of the interaction terms?

    • For the Telecom industry: When we compare firms in the Telecom industry whose net sales differ by 1 percent, the average compensation of their CEOs differs by about 0.6 percent.
  • You can interpret the other coefficients of the interactions in a similar way.

  • Compare these to the results without interaction terms.
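  • As a sketch (with cf as a scratch name), the Telecom slope quoted above is the Discretionary (baseline) slope plus the Telecom interaction coefficient:
cf <- coef(lm(log10TotalComp ~ log10NetSales + as.factor(Industry) 
              + log10NetSales:as.factor(Industry), data=eCsub))
cf[["log10NetSales"]] + cf[["log10NetSales:as.factor(Industry)Telecom"]]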

par(cex=2)
fit <- coef(lm(log10TotalComp ~ log10NetSales + as.factor(Industry)
               + log10NetSales:as.factor(Industry), data=eCsub))
plot(eCsub$log10TotalComp ~ eCsub$log10NetSales, xlim = c(7, 12), 
     ylim = c(3, 9), xlab = "Log Net Sales", 
     ylab = "Log Total Compensation", type="n")
points(eCsub$log10NetSales[eCsub$Industry=="Telecom"], eCsub$log10TotalComp[eCsub$Industry=="Telecom"], col="blue")
points(eCsub$log10NetSales[eCsub$Industry=="Discretionary"], eCsub$log10TotalComp[eCsub$Industry=="Discretionary"], col="black")
curve(cbind(1,x,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0) %*% fit, add=TRUE)              # Discretionary line
curve(cbind(1,x,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,x,0) %*% fit, add=TRUE, col="blue")  # Telecom line (dummy = 1, interaction = x)

More on interaction terms

  • What if you interact variables which are both continuous?
  • Consider an example based on a student attendance dataset from Wooldridge’s Introductory Econometrics.
  • Here we regress the standardized score on a principles of microeconomics final exam at the University of Michigan on the percentage of classes attended, prior college GPA, and their interaction.
  • We can obtain the data online and the data description can be found here.

    • This is an example of a CSV-type file, but the separator is no longer a comma; it is a tab (the Tab key on a keyboard).
    • This is also an example where the header row is not available. Thus, you have to supply it using the data description.
    • You have to modify the read.csv() command to account for these features.
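  • A sketch of those modifications (the file name here is hypothetical):
attendRaw <- read.csv("attend_raw.txt", sep = "\t", header = FALSE)   # tab separator, no header row
# names(attendRaw) <- c(...)   # supply the variable names from the data description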
  • But we can also use the corresponding Stata-formatted data.

  • Here are the results:
library(foreign)
attend <- read.dta("https://qcpages.qc.cuny.edu/~rvesselinov/statadata/attend.dta")
lm(stndfnl ~ atndrte + priGPA, data = attend)

Call:
lm(formula = stndfnl ~ atndrte + priGPA, data = attend)

Coefficients:
(Intercept)      atndrte       priGPA  
  -1.635271    -0.001156     0.680158  
lm(stndfnl ~ atndrte + priGPA + atndrte:priGPA, data = attend)

Call:
lm(formula = stndfnl ~ atndrte + priGPA + atndrte:priGPA, data = attend)

Coefficients:
   (Intercept)         atndrte          priGPA  atndrte:priGPA  
       0.92289        -0.03204        -0.46527         0.01360  
  • Coefficient of atndrte: When we compare students who have zero prior college GPA, the ones whose attendance rates are higher by 1 percentage point (note this!) would have average standardized final exam scores lower by about 0.03.

    • Much better: When we compare students who have zero prior college GPA, the ones whose attendance rates are higher by 30 percentage points would have average standardized final exam scores lower by about 0.96, which is roughly a difference of a standard deviation in the original final exam scores.
  • An even better way to communicate the preceding results is to recenter the regressors before computing the linear regression.
  • The recentering could be relative to the mean or relative to other interesting values.
attend$c.atndrte <- attend$atndrte - 75
attend$c.priGPA <- attend$priGPA - mean(attend$priGPA)
lm(stndfnl ~ c.atndrte + c.priGPA + c.atndrte:c.priGPA, data = attend)

Call:
lm(formula = stndfnl ~ c.atndrte + c.priGPA + c.atndrte:c.priGPA, 
    data = attend)

Coefficients:
       (Intercept)           c.atndrte            c.priGPA  c.atndrte:c.priGPA  
         -0.045270            0.003139            0.554757            0.013600  
  • When we compare students who have prior college GPA equal to its mean but whose attendance rates are higher by 1 percentage point, their average standardized final exam scores would be higher by only about 0.003, which is essentially zero.

  • Even better: When we compare students who have prior college GPA equal to its mean but whose attendance rates are higher by 2 percentage points, their average standardized final exam scores would be higher by about 0.006, roughly a hundredth of a standard deviation in the original final exam scores.

  • Try interpreting the coefficient of priGPA. You have the original results and the results with the recentered variables.
  • Try interpreting the intercept in the results with the recentered variables.
  • Try different recenterings and see what happens.
  • I end here by asking you to think hard about what the word “linear” means in linear regression.

Operations on data

  • You have seen visualizations and numerical summaries of data.

    • Behind every visualization and numerical summary, there is a set of operations on data.
    • Typical data operations involve: summing up, scaling, transforming, subsetting or filtering, arranging in some order, standardizing, comparing
  • Everything we have done is based on the data we have on hand.
  • We also saw many linear regressions. All of them are about comparisons involving averages.
  • What we are going to do next is to think hard about summaries and linear regressions before doing any computations.

More things to think about

  • Smoothing the histogram to capture broad features

  • Time series data and other types of data

    • Does it make sense to apply summary statistics directly?
    • How to visualize? Can we use histograms?
  • Effect of transformations on summaries

  • Subgroups formed when regressors are not discrete

  • Designing visualizations / graphics