Exercise Set 03

Technical exercise 1: a step towards understanding the regression slope as comparisons

Let \(X\) and \(Y\) be variables indexed by \(t=1,\ldots,n\). It can be shown that, in general, we have

\[ \sum_{t=1}^n \left(X_t-\overline{X}\right)\left(Y_t-\overline{Y}\right) = \frac{1}{2n}\sum_{i=1}^n\sum_{j=1}^n \left(X_j-X_i\right)\left(Y_j-Y_i\right) \tag{1}\]

What you are going to do is prove this for \(n=2\) only, so that you get a feel for how Equation 1 could come to be. Make sure to cite which of the properties you have used (labels for the properties are available).

1. Can you distinguish among the following expressions: \[\sum_{t=3}^6 X_t \ \ , \ \ \sum_{j=3}^6 X_j \ \ \ , \ \ \sum_{j=3}^6 X_t\] Which are equal to each other? Which are very different from each other?
2. Write out or flesh out the components of the expression \[\sum_{i=1}^3\sum_{j=2}^3 X_{ij}\]
3. Set \(n=2\) in Equation 1 and prove the equality.¹
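
If the double-sum notation in the exercise above feels abstract, it may help to see it computed. The sketch below uses a hypothetical 3 x 3 matrix `X` (its entries are made up purely for illustration) and evaluates the double sum twice: once with nested loops that mirror the notation, and once with a single call to `sum()` on the matching sub-matrix. It is not a substitute for writing out the terms by hand.

```r
# Hypothetical 3 x 3 matrix; X[i, j] plays the role of X_ij
X <- matrix(1:9, nrow = 3, ncol = 3)

# Nested loops: outer index i runs over 1..3, inner index j over 2..3
total <- 0
for (i in 1:3) {
  for (j in 2:3) {
    total <- total + X[i, j]
  }
}

# Same quantity without loops: rows 1..3 and columns 2..3 of X
total_vectorized <- sum(X[1:3, 2:3])

print(c(loop = total, vectorized = total_vectorized))
```

Notice that the inner index starts at 2, so the first column of `X` never enters the sum.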

Technical exercise 2: Linear regression with linearly transformed variables

Suppose a linear regression of \(Y\) on \(X_1\) (with an intercept) was computed. As always, let \(X_{1t}\) and \(Y_t\) be the \(t\)th observation of the regressor and the regressand, respectively, for \(t=1, 2,\ldots, n\).

Suppose we linearly transformed the data. Let \(W_t=aX_{1t}+b\), where \(a\) and \(b\) are constants. In addition, let \(Z_t=cY_t+d\), where \(c\) and \(d\) are constants.

1. Write down the formula or expression for the regression slope for the linear regression of \(Y\) on \(X_1\) in this context. There is no need to derive it.
2. After transforming both \(X_1\) and \(Y\) to \(W\) and \(Z\), respectively, write down the formula or expression for the regression slope for the linear regression of \(Z\) on \(W\) in this context. There is no need to derive it.
3. Focus on the formulas in items #1 and #2. Use the results of Technical Exercise 2 in Exercise Set 02 (with the appropriate modifications to the variables and constants) to determine how the regression slope in #2, obtained after transforming both \(X_1\) and \(Y\) to \(W\) and \(Z\), respectively, is related to the regression slope in #1.²
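
Once you have derived the relationship, you may want to check it numerically. The sketch below is one way to do so: the data are simulated, and the seed, sample size, and the constants `a`, `b`, `c0`, `d` are arbitrary choices that do not come from the exercise (`c0` stands in for \(c\) only to avoid reusing the name of R's `c()` function). It prints both slopes and their ratio so that you can compare them against your own result.

```r
set.seed(123)                      # arbitrary seed, for reproducibility only
n  <- 100
x1 <- rnorm(n)                     # simulated regressor (assumption)
y  <- 1 + 0.5 * x1 + rnorm(n)      # simulated regressand (assumption)

# Arbitrary constants for the linear transformations (a must be nonzero)
a <- 2; b <- 5; c0 <- -3; d <- 1
w <- a * x1 + b
z <- c0 * y + d

slope_yx <- unname(coef(lm(y ~ x1))[2])   # slope from regressing Y on X1
slope_zw <- unname(coef(lm(z ~ w))[2])    # slope from regressing Z on W

print(c(slope_yx = slope_yx, slope_zw = slope_zw, ratio = slope_zw / slope_yx))
```

Try changing the four constants and rerunning to see which of them affect the ratio.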

Technical exercise 3: Another special case

You are going to be working out the details of regression with only one regressor and without an intercept. Let \(Y_t\) and \(X_{1t}\) be the \(t\)th observations of the regressand and the regressor, respectively. Recall that lm() computes OLS estimates, which minimize a sum of squared residuals.

Since our regression line for this case is just \(\widehat{Y}_t=\widehat{\beta}_1X_{1t}\), where \(\widehat{\beta}_1\) is just some constant to be determined, you should be able to use what you learned in mathematical economics to minimize \[\sum_{t=1}^n \left(Y_t-\widehat{Y}_t\right)^2=\sum_{t=1}^n \left(Y_t-\widehat{\beta}_1X_{1t}\right)^2 \tag{2}\] with respect to \(\widehat{\beta}_1\).

1. Find the optimal value of \(\widehat{\beta}_1\).
2. What will be the average of the residuals? Do you think it is zero? Prove your finding.
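
To connect the minimization in Equation 2 with what `lm()` reports, here is a minimal sketch under stated assumptions: the data are simulated, and the search interval passed to `optimize()` is an arbitrary wide range. It minimizes the sum of squared residuals numerically and compares the result with the slope from a no-intercept `lm()` fit. It does not give the closed-form answer; that part is still yours to derive.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
n  <- 50
x1 <- rnorm(n, mean = 2)          # simulated regressor (assumption)
y  <- 1.5 * x1 + rnorm(n)         # simulated regressand (assumption)

# Sum of squared residuals from Equation 2 as a function of a candidate slope
ssr <- function(beta1) sum((y - beta1 * x1)^2)

# Numerical minimizer over a wide, arbitrary search interval
beta_numeric <- optimize(ssr, interval = c(-10, 10), tol = 1e-10)$minimum

# lm() without an intercept: "0 +" removes the constant term
fit     <- lm(y ~ 0 + x1)
beta_lm <- unname(coef(fit))

print(c(numeric = beta_numeric, lm = beta_lm))
print(mean(resid(fit)))           # average residual, to compare with your proof
```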

Authoring your second Quarto document

You will create your second Quarto document containing answers to two exercises. One again uses the data from Hamermesh and Parker (2005). The other is a simulation exercise.

Demonstrating some of the math behind `lm()`

1. Refer to Technical Exercise 3 of this exercise set. Fit a least squares regression of course evaluations on beauty, without an intercept. Compute the mean of the residuals. What do you notice and what would you conclude?
2. Fit a least squares regression of course evaluations on beauty and female, with an intercept. Apply the algorithm found in the slides and demonstrate the Frisch-Waugh-Lovell (FWL) Theorem for the regression coefficient of female. Given what you know about FWL and how to interpret coefficients of dummy variables in a simple linear regression, how would you interpret the regression coefficient of female?
3. Using the least squares regression of course evaluations on beauty and female, with an intercept in Item #2, generate the fitted values and the residuals. Verify computationally that
   1. the mean of the fitted values is equal to the mean of course evaluations,
   2. the mean of the residuals is equal to zero,
   3. the correlation coefficient (using the cor() command in R, use ?cor to learn more about the command) of fitted values and the residuals is zero,
   4. the correlation coefficient of beauty and the residuals is zero,
   5. the correlation coefficient of female and the residuals is zero.
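
The FWL demonstration and the checks above can be rehearsed on artificial data before you touch the course evaluation data. The sketch below is an assumption-laden illustration: the variables `x1`, `x2`, and `y` are simulated stand-ins (not the Hamermesh and Parker variables), and the three steps follow the usual residual-on-residual statement of FWL, which may be organized differently from the algorithm in the slides.

```r
set.seed(1)                                 # arbitrary seed
n  <- 200
x1 <- rnorm(n)                              # stand-in for a continuous regressor
x2 <- rbinom(n, 1, 0.4)                     # stand-in for a dummy regressor
y  <- 4 + 0.3 * x1 - 0.2 * x2 + rnorm(n)    # simulated outcome (assumption)

# Full regression with an intercept
full <- lm(y ~ x1 + x2)

# Step 1: residuals of y after regressing on the other regressor (and intercept)
y_res  <- resid(lm(y ~ x1))
# Step 2: residuals of x2 after regressing on the other regressor (and intercept)
x2_res <- resid(lm(x2 ~ x1))
# Step 3: regress residuals on residuals; no intercept is needed
fwl <- lm(y_res ~ 0 + x2_res)

print(c(full = unname(coef(full)["x2"]), fwl = unname(coef(fwl))))

# Checks of the kind listed above; expect values near zero up to rounding
print(mean(resid(full)))
print(cor(fitted(full), resid(full)))
```

Expect tiny nonzero numbers such as `1e-17` rather than exact zeros; that is floating-point rounding, not a failure of the result.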

The behavior of the least squares regression coefficients

In this exercise, you will be conducting your first Monte Carlo simulation. A Monte Carlo simulation allows you to generate a large number of artificial datasets obeying some design or specification. After generating these artificial datasets, we apply some methods or procedures learned in class and explore their behavior for repeated application on these datasets.

You need the concept of a for loop. Read fasteR Lesson 17 to learn more about what a for loop is all about.³
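
As a warm-up, here is a much smaller loop with the same shape as a Monte Carlo simulation: generate an artificial dataset, apply a procedure, store the result, repeat. Everything in it (the seed, the number of replications, the distribution, and the names `means` and `draws`) is a made-up illustration.

```r
set.seed(2024)                    # arbitrary seed
reps  <- 5                        # small number of replications, for display
means <- numeric(reps)            # storage, one entry per replication

for (i in 1:reps) {
  draws    <- rnorm(20, mean = 3, sd = 1)   # a fresh artificial dataset
  means[i] <- mean(draws)                   # apply a procedure, store the result
}

print(means)
```

Storage is allocated before the loop and filled one entry at a time, which is the same pattern the simulation code uses with a matrix.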

# Number of observations
n <- 50
# How many times to loop
reps <- 400
# Storage for OLS results (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
# Create a directory to store plot pictures
dir.create(paste(getwd(), "/pics/", sep=""), showWarnings=FALSE)
# Monte Carlo loop
for (i in 1:reps)
{
  X.t <- rbinom(n, 1, 0.3)  # Generate X
  eps.t <- (rnorm(n, 0, 4))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0)
  Y.t <- 3 + 2*X.t + eps.t   # Generate Y
  temp <- lm(Y.t ~ X.t)
  beta.store[i,] <- coef(temp)
  filename <- paste(getwd(), "/pics/", i, ".png", sep="")
  png(filename)
  plot(X.t, Y.t)
  abline(temp)
  graphics.off()
}

1. Include the code above in your Quarto document and let it run as part of your document.⁴
2. Can you describe what you think is being done in lines 15-16 of the code? What is the role of the loop here?
3. Display the first few entries of beta.store. Do you know the length of each column?
4. Display two histograms: one histogram for each column of beta.store.
5. Calculate the means and standard deviations for each column of beta.store. Focusing on the means, can you hazard a guess as to how they relate to line 14 of the given code?
6. Confirm whether the empirical rule is a good way to describe the simulated data in beta.store.
7. Open your working directory⁵ and look inside the pics folder. You will see a LOT of scatterplots. Observe how many there are. Describe the scatterplots and lines you see (think of it as a crude animation).
8. Repeat the simulation but this time alter the number of observations to 200. Redo Item #5 after this change. Comment on how the column means and column standard deviations have changed.
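
For the empirical rule check, one possible approach is to compute the share of values lying within 1, 2, and 3 standard deviations of the mean. The sketch below shows this on a hypothetical vector `v` of simulated normal draws, standing in for a single column of beta.store; the helper name `within_k_sd` is made up for this illustration.

```r
# Proportion of values within k standard deviations of the mean
within_k_sd <- function(v, k) {
  mean(abs(v - mean(v)) <= k * sd(v))
}

set.seed(7)       # arbitrary seed
v <- rnorm(400)   # hypothetical stand-in for one column of beta.store

props <- sapply(1:3, function(k) within_k_sd(v, k))
names(props) <- c("1 sd", "2 sd", "3 sd")
print(round(props, 3))   # compare with roughly 0.68, 0.95, 0.997
```

To use it on your own results, replace `v` with a column such as `beta.store[, 2]`.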

What you will be expected to do

You will be submitting to my email a zip file (not rar, not 7z) with filename surname_exset03.zip containing

Scanned PDF solutions to the technical exercises (do be mindful of the size of the file, keep under 15 MB if possible) with filename surname_tech03.pdf
Your qmd file with filename surname_exset03.qmd (replace surname with your actual surname).
The HTML file associated with your qmd file.

Footnotes

¹ Completely optional for this exercise set: Try \(n=4\). Pretty soon, you should be able to figure out what happens for general \(n\).
² Completely optional for this exercise set, but for additional practice: There are many things to explore such as, what happens to the intercepts, what happens if only one of the variables were transformed, what happens if you standardize. You could also explore what happens to R-squared and other quantities from the output of lm().
³ You might need to load files from previous lessons related to the pima dataset.
⁴ Alternatively, you could copy and paste the entire code onto the R console to give you a sense of what is happening, so that you can plan out how you would complete the exercises.
⁵ To find your working directory, type getwd() in the R console.