Exercise Set 06

Technical Exercises

Technical Exercise 1: practice with distributions and moments

Imagine you have a non-transparent container. Inside this container are 5 ping-pong balls that have no discernible features apart from a numerical label:

[Figure: numerical labels on balls in an urn; the five balls are labeled 1, 2, 2, 4, and 6]

While blindfolded, draw two balls from this container with replacement. What this means is that after drawing the first ball, you return the ball before drawing the second ball. Assume that someone is right by your side recording the outcome of the two draws.

Define two random variables \(X_1\) and \(X_2\) to be the recorded outcome of the first and second draws, respectively. Further define another random variable \[Y=\frac{X_1+X_2}{2}\] You can safely assume that all possible outcomes of the pair \((X_1, X_2)\) are equally likely to occur.

  1. Obtain the probability distribution of \(Y\).
  2. Using what you have found in item 1, calculate \(\mathbb{E}\left(Y\right)\) and \(\mathsf{Var}\left(Y\right)\).
  3. \(X_1\) and \(X_2\) are independent random variables. Can you explain why?
  4. Do you think \(X_1\) and \(X_2\) are identically distributed random variables? Show why or why not.
  5. Can you think of a way to change the conditions of the experiment so that you can break the independence of \(X_1\) and \(X_2\)?

Technical Exercise 2: properties of the variance and covariance

Note that a consequence of the axioms is that \[\mathbb{E}\left(\sum_{j=1}^n c_jX_j\right)=\sum_{j=1}^n c_j \mathbb{E}\left(X_j\right)\]

This property turns out to be extremely useful in evaluating expected values. In this exercise, you will be proving some properties of the variance and covariance which are really expected values at their core.

Let \(X\), \(Y\), and \(W\) be random variables. Let \(a\), \(b\), \(c\), \(d\), and \(e\) be constants. Assume that expected values under consideration exist. Prove that:

  1. \(\mathsf{Var}\left(aX+bY+c\right)=a^2\mathsf{Var}\left(X\right)+b^2\mathsf{Var}\left(Y\right)+2ab \mathsf{Cov}\left(X,Y\right)\). Start from either the definition of the variance or from the other form of the variance. (A simulation sanity check appears after this list.)
  2. Using the previous item, compute the simplified value of \(\mathsf{Var}\left(2X-3Y+4\right)\) if \(\mathbb{E}\left(X\right)=0\), \(\mathbb{E}\left(X^2\right)=3\), \(\mathbb{E}\left(Y\right)=1\), \(\mathbb{E}\left(Y^2\right)=3\), and \(\rho_{X,Y}=0.5\).
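
Before attempting the proof in item 1, it can be reassuring to check the identity numerically. Below is a minimal sketch, not a proof: the sample size, the way x and y are generated, and the values of a, b, and c are arbitrary choices for illustration.

# A numerical sanity check of the identity in item 1 (not a proof)
n <- 10^4
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)   # generate y so that it is correlated with x
a <- 2; b <- -3; c <- 4
# Left-hand side: sample variance of the linear combination
var(a * x + b * y + c)
# Right-hand side: a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y), using sample moments
a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)

The two printed values should agree (up to floating-point rounding), because the identity also holds exactly for sample variances and covariances.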

Technical Exercise 3: independence and uncorrelatedness

In the slides, I presented a simulation suggesting that if \(X\) follows a standard normal distribution and if we let \(Y=X^2\), then \(\mathsf{Cov}\left(X,Y\right)=0\). But a simulation is not a proof; it is merely suggestive of what we would see in the long run. (A sketch of such a simulation appears after the items below.) Note that the moments of the standard normal distribution are as follows: \(\mathbb{E}\left(X\right)=0\), \(\mathbb{E}\left(X^2\right)=1\), \(\mathbb{E}\left(X^3\right)=0\), \(\mathbb{E}\left(X^4\right)=3\).

  1. Use the given information to prove that \(\mathsf{Cov}\left(X,Y\right)=0\).
  2. Use the given information to prove that \(\mathsf{Cov}\left(X^2,Y\right)\neq 0\).
  3. We discussed in class that it is clear from \(Y=X^2\) that \(X\) and \(Y\) could not possibly be independent of each other. Can you use results in items #1 and #2 to reach the same conclusion? Explain.
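
If you would like to reproduce the kind of simulation mentioned above before writing the proofs, here is one possible sketch; the number of draws is an arbitrary choice, and the printed sample covariances are only approximations of the population quantities.

# A simulation in the spirit of the one shown in the slides (still not a proof)
n <- 10^6
x <- rnorm(n)   # draws from a standard normal
y <- x^2
cov(x, y)       # should be close to 0
cov(x^2, y)     # should be clearly away from 0 (close to 2, given the moments above)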

Technical Exercise 4: more practice with interaction terms

In Exercise Set 04, many of you had substantial difficulty interpreting regression coefficients, especially when interaction terms are present. Below is an analysis of the distance-from-college dataset. You can find the details about this dataset here.

rm(list=ls())
library(foreign)
college <- read.dta("https://www.princeton.edu/~mwatson/Stock-Watson_3u/Students/EE_Datasets/CollegeDistance.dta")
names(college)
 [1] "female"   "black"    "hispanic" "bytest"   "dadcoll"  "momcoll" 
 [7] "ownhome"  "urban"    "cue80"    "stwmfg80" "dist"     "tuition" 
[13] "incomehi" "ed"      
apply(college, 2, mean)
    female      black   hispanic     bytest    dadcoll    momcoll    ownhome 
 0.5453109  0.1925711  0.1498946 51.0019310  0.2020548  0.1393572  0.8192835 
     urban      cue80   stwmfg80       dist    tuition   incomehi         ed 
 0.2439410  7.6548735  9.5564989  1.7249210  0.9131396  0.2863541 13.8292940 
apply(college[, c("ed", "bytest", "cue80", "stwmfg80", "dist", "tuition")], 2, sd)
       ed    bytest     cue80  stwmfg80      dist   tuition 
1.8139688 8.8192514 2.8657700 1.3644112 2.1338357 0.2835778 
college$c.dist <- college$dist - 1
college$c.cue80 <- college$cue80 - mean(college$cue80)
college$c.tuition <- college$tuition - mean(college$tuition)
lm(ed ~ c.dist + c.cue80 + c.tuition + black + dadcoll + momcoll + dadcoll:momcoll + black:c.tuition, data = college)

Call:
lm(formula = ed ~ c.dist + c.cue80 + c.tuition + black + dadcoll + 
    momcoll + dadcoll:momcoll + black:c.tuition, data = college)

Coefficients:
    (Intercept)           c.dist          c.cue80        c.tuition  
       13.62648         -0.05305          0.01760          0.16000  
          black          dadcoll          momcoll  dadcoll:momcoll  
       -0.36653          1.11807          0.92463         -0.48436  
c.tuition:black  
       -0.31440  

Interpret all the coefficients properly.

Explorations in R and reading more on Quarto

Good news! You will not be submitting anything in Quarto for this exercise set. But it would be a good idea to learn more about the following topics: they will help you read R code (without necessarily having access to a computer) and practice linking what we do in class, what you do in the exercises, and the feedback you receive on the exercises.

Reading more about Quarto

You might find the Quarto guide on the following topics to be of interest:

Simulating IID die rolls

You have seen simulated coin tosses. Now, you are going to explore simulated die rolls.

# Outcome of one fair die roll
x <- sample(1:6, 1, replace = TRUE)
x
# Outcome of two independent, fair die rolls
x <- sample(1:6, 2, replace = TRUE)
x

Let \(X\) be the outcome of the die roll. I showed in class how to get a sense of what \(\mathbb{E}\left(X\right)=7/2\) means. Are you able to do this yourself?
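
If you are not sure where to start, here is one possible sketch; the number of rolls (10^6) is an arbitrary choice. The idea is that the average of a large number of IID fair die rolls should settle close to 7/2.

# Average of many simulated fair die rolls
x <- sample(1:6, 10^6, replace = TRUE)
mean(x)   # should be close to 3.5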

Rounding off values

You may have noticed from the slides and from the solutions that I tend to round off to a few decimal places. What justifies this practice, aside from cleaner communication of results? Reporting too many decimal places gives a false sense of accuracy: readers who may not understand how statistics work can come away believing the results are more precise than they actually are.

In the next R code, you will be getting a sense of why rounding off is justifiable in the context of data analysis.

mean(sample(1:6, 100, replace = TRUE))
mean(sample(1:6, 10000, replace = TRUE))
mean(sample(1:6, 1000000, replace = TRUE))

What do you notice is happening? Repeat these three commands multiple times, get a sense of what is happening, and try to express your answer to the question I posed at the beginning.
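
For a more systematic look, the sketch below repeats the averaging many times and summarizes how much the reported averages move from one run to the next; the number of replications and the two sample sizes are arbitrary choices. Decimal places smaller than this run-to-run movement carry no real information, which is one justification for rounding.

# How much do the sample means move around from run to run?
means.small <- replicate(50, mean(sample(1:6, 100, replace = TRUE)))
means.large <- replicate(50, mean(sample(1:6, 10^6, replace = TRUE)))
sd(means.small)   # averages based on 100 rolls move around quite a bit
sd(means.large)   # averages based on a million rolls barely move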

Simulating draws from a standard normal distribution

# 5 draws from a standard normal
x <- rnorm(5)
x
hist(x) ## pause here: look at the histogram before running the next two lines
mean(x)
var(x)

When you execute hist(x), do you see a bell shape? Without running mean(x) and var(x), do you know more or less what results you will see? Can you confirm this? Repeat the whole exercise, but this time increase the number of draws.

In Technical Exercise 3 of this exercise set, I reported the moments of the standard normal distribution. How would you modify the code provided to verify that \(\mathbb{E}\left(X^4\right)=3\)? Compare the number of draws you needed to verify \(\mathbb{E}\left(X\right)=0\) and the number of draws you needed to verify \(\mathbb{E}\left(X^4\right)=3\).
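
One possible modification is sketched below, with the number of draws (10^6) being an arbitrary choice: raise the draws to the fourth power before averaging. Since \(X^4\) is much more variable than \(X\), expect to need noticeably more draws to pin down its mean to the same accuracy.

# Checking E(X) = 0 and E(X^4) = 3 by simulation
x <- rnorm(10^6)
mean(x)     # should be close to 0
mean(x^4)   # should be close to 3, but converges more slowly than mean(x)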

Why you have to be careful with R-squared

In the solutions to Exercise Set 04, I noticed some students discussing R-squared and making comparisons based on it. I also noticed some students running a regression without an intercept and concluding that it gives “better results” because of the higher R-squared. Here I try to show you why you should not make a big deal out of R-squared.

# A simulation to show issues regarding R-squared
# Number of replications
reps <- 10^3
# Container for storing R-squareds
r2 <- numeric(reps)
adj.r2 <- numeric(reps)
# The number of observations for each simulated/artificial dataset
num.obs <- 100
# The number of columns of X excluding the column of ones
num.vars <- 5
for (i in 1: reps)
{
  y <- rbinom(num.obs, 1, 0.5)
  x <- matrix(rbinom(num.obs*num.vars, 1, 0.5), nrow = num.obs)
  dat <- data.frame(cbind(y, x)) 
  temp <- summary(lm(y ~ x, data = dat))
  r2[i] <- temp$r.squared # Record R-squared
  adj.r2[i] <- temp$adj.r.squared # Record adjusted R-squared
}
par(mfrow = c(1, 2))
hist(r2)
hist(adj.r2)

Try to execute this simulation to give you a sense of the behavior of R-squared and adjusted R-squared.

  1. What do you expect R-squared will be in this simulation? Look at how y and the x’s are generated.
  2. What do you notice from the histograms?
  3. Increase num.vars to 50. Repeat the simulation. What do you notice?
  4. Try modifying the code so that you are running a regression without an intercept (see the sketch after this list).
  5. Do you now understand the limitations of the phrase “goodness of fit”?
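
For item 4, the only change needed is in the call to lm(). The self-contained sketch below compares the two fits outside the loop; the sample size and the number of regressors are arbitrary choices.

# Same pure-noise regression with and without an intercept
n <- 100
y <- rbinom(n, 1, 0.5)
x <- matrix(rbinom(n * 10, 1, 0.5), nrow = n)
summary(lm(y ~ x))$r.squared       # R-squared with an intercept
summary(lm(y ~ x - 1))$r.squared   # R-squared without an intercept

When the intercept is dropped, R computes R-squared relative to a total sum of squares taken around zero rather than around the mean of y, which is why it tends to look much larger even though the fit is not actually better.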

Another resource I wanted to share with you is this blog entry. At this stage, you should be able to read the R code in the blog entry comfortably. There are some new commands, but I think you can handle them.

Correlation vs interaction

I have noticed statements in the solutions to the following effect: because the regressors are strongly correlated, we should include an interaction term. Here you will explore this claim more critically.

x1 <- rbinom(5, 1, 0.5)
x2 <- 1 - x1
  1. Without doing any calculations, can you guess the correlation coefficient between x1 and x2?
  2. Use R to calculate the correlation coefficient.
  3. What do you think will happen if you introduce an interaction term x1:x2 and include it as part of your regressors in the lm() command? (See the sketch after this list.)
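
Here is one possible sketch for exploring these items; using many more than 5 draws avoids the degenerate case where x1 happens to be constant, and the artificial outcome y is an arbitrary choice added only so that lm() has something to fit.

# Exploring the relationship between x1, x2 = 1 - x1, and their "interaction"
x1 <- rbinom(1000, 1, 0.5)
x2 <- 1 - x1
cor(x1, x2)               # exactly -1: x2 is a linear function of x1
table(x1 * x2)            # the product x1*x2 is identically 0 for 0/1 variables
y <- 1 + 2 * x1 + rnorm(1000)
lm(y ~ x1 + x2 + x1:x2)   # x2 and x1:x2 are redundant and come back as NA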

Simulating draws from a container

In Technical Exercise 1, you encountered a container with 5 ping-pong balls. You are going to simulate draws from this container.

# Outcome of one draw
x <- sample(c(1, 2, 2, 4, 6), 1, replace = TRUE)
x

Modify the code above so that you can check by simulation whether your calculated values for \(\mathbb{E}\left(Y\right)\) and \(\mathsf{Var}\left(Y\right)\) are roughly correct.
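
Here is one possible sketch of such a modification; the number of replications (10^5) is an arbitrary choice.

# Simulate many pairs of draws with replacement and average each pair
balls <- c(1, 2, 2, 4, 6)
y <- replicate(10^5, mean(sample(balls, 2, replace = TRUE)))
mean(y)   # compare with your calculated E(Y)
var(y)    # compare with your calculated Var(Y)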

What you will be expected to do

You will be submitting to my email a zip file (not rar, not 7z) with filename surname_exset06.zip, replacing surname with your actual surname. The zip file should contain your scanned PDF solutions to the technical exercises with filename surname_tech06.pdf. Do be mindful of the size of the file; keep it under 15 MB if possible.