(before seeing the data)
Andrew Pua
We calculated summaries and linear regressions using whatever data we had on hand.
These descriptions could be insightful on their own terms, but what exactly do we learn from them?
We have to think hard about these summaries and linear regressions before doing any computations.
I have drawn some of the technical materials from Whittle (2000), a very unconventional book on probability.
Other materials were drawn from Aronow and Miller (2019), Angrist and Pischke (2009, 2014).
Other references mentioned in the housekeeping slides were also used, but their traces are less recognizable.
What you have seen before are empirical quantities.
These empirical quantities are subject to variation.
To understand what these empirical quantities are really about, we cannot avoid abstraction.
What I mean by abstraction is that we are engaging in a thought experiment where we think of empirical quantities as having some “order” which could be studied mathematically.
We can create theoretical (idealized) counterparts of these empirical quantities by:
The starting point is that what we observe in the data has arisen from a random process, or what others call a data generating process.
This is a philosophical advance and a different way of thinking.
Random is not the same as haphazard.
Think of the experiment where you toss a coin.
There are two possible outcomes or realizations (heads or tails).
Which outcome shows up is not known in advance.
With enough observations of the act of coin tossing under similar conditions, a particular type of long-run behavior will emerge.
I will show an actual coin-tossing experiment and a simpler Monte Carlo simulation.
To toss a fair coin once: rbinom(1, 1, 0.5)
Assign \(n\) to 50: n <- 50
Toss a fair coin \(n\) times, and assign the outcomes to the object x: x <- rbinom(n, 1, 0.5)
Produce a plot of the object x: plot(x)
Calculate the relative frequency of heads, i.e., how many heads have been observed divided by how many tosses have been made: z <- cumsum(x)/(1:n)
Plot the relative frequency against the number of tosses: plot(z)
The results will be different for every computer and every run, unless you fix the random seed with set.seed().
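Putting the commands together (a sketch; the seed value 42 is arbitrary and only there so the run is reproducible):
set.seed(42) # fix the seed so the run is reproducible
n <- 50
x <- rbinom(n, 1, 0.5) # n tosses of a fair coin
z <- cumsum(x)/(1:n) # running relative frequency of heads
plot(z, type = "l", xlab = "number of tosses", ylab = "relative frequency of heads")
abline(h = 0.5, lty = "dotted") # the long-run value for a fair coin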
Consider tossing a coin, whether fair or not.
We are uncertain of the outcomes that will materialize for any specific coin toss.
Focusing on the mean, there is some form of order emerging in the long run.
BUT … there are terms and conditions, which you will learn in the course.
Set the number of repetitions: nsim <- 10^4
Repeat the experiment “tossing a fair coin 10 times” nsim times: set1 <- replicate(nsim, rbinom(10, 1, 0.5))
Calculate the relative frequency for every repetition: props1 <- colMeans(set1)
Visualize and summarize: hist(props1, freq=FALSE), table(props1)
nsim <- 10^4
set1 <- replicate(nsim, rbinom(10, 1, 0.5))
set2 <- replicate(nsim, rbinom(40, 1, 0.5))
props1 <- colMeans(set1)
props2 <- colMeans(set2)
table(props1)
props1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
3 85 441 1218 2075 2473 1988 1197 413 98 9
c(mean(props1), sd(props1))
[1] 0.4986900 0.1569642
table(props2)
props2
0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45
1 1 1 2 8 13 42 106 233 360 544 830 1018
0.475 0.5 0.525 0.55 0.575 0.6 0.625 0.65 0.675 0.7 0.725 0.75 0.775
1165 1287 1201 993 843 593 359 202 106 52 27 10 3
c(mean(props2), sd(props2))
[1] 0.50057000 0.07900031
nsim <- 10^4
set1 <- replicate(nsim, rbinom(10, 1, 0.05))
set2 <- replicate(nsim, rbinom(40, 1, 0.05))
set3 <- replicate(nsim, rbinom(160, 1, 0.05))
set4 <- replicate(nsim, rbinom(640, 1, 0.05))
props1 <- colMeans(set1)
props2 <- colMeans(set2)
props3 <- colMeans(set3)
props4 <- colMeans(set4)
c(mean(props1), sd(props1))
[1] 0.04992000 0.06931431
c(mean(props2), sd(props2))
[1] 0.0499375 0.0342261
c(mean(props3), sd(props3))
[1] 0.05005688 0.01747789
c(mean(props4), sd(props4))
[1] 0.050029062 0.008605005
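These simulated spreads line up with a known benchmark: the standard deviation of a sample proportion is \(\sqrt{p(1-p)/n}\). A quick check:
sqrt(0.05 * 0.95 / c(10, 40, 160, 640))
# roughly 0.0689, 0.0345, 0.0172, 0.0086: very close to the simulated sds above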
# Number of observations
n <- 50
# How many times to loop
reps <- 10^4
# Storage for OLS results (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
# Monte Carlo loop
for (i in 1:reps)
{
X.t <- rbinom(n, 1, 0.3) # Generate X
eps.t <- (rnorm(n, 0, 4))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0) # Heteroskedastic errors: sd 4 when X = 1, sd 1 when X = 0
Y.t <- 3 + 2*X.t + eps.t # Generate Y
temp <- lm(Y.t ~ X.t)
beta.store[i,] <- coef(temp)
}
# Eigendecomposition of the simulated covariance matrix of the coefficients
evalues <- eigen(cov(beta.store))$values
evectors <- eigen(cov(beta.store))$vectors
# Standardize (whiten) the coefficient draws: demean, then rotate and rescale so that the covariance becomes the identity
std.vector <- (beta.store - cbind(rep(mean(beta.store[,1]), reps), rep(mean(beta.store[,2]), reps))) %*% (evectors %*% diag(1/sqrt(evalues)) %*% t(evectors))
par(mfrow=c(1,2))
plot(beta.store[,1], beta.store[,2])
plot(std.vector[,1], std.vector[,2], asp = 1, xlim = c(-4, 4), ylim = c(-4, 4))
cov(beta.store)
            [,1]        [,2]
[1,]  0.02884500 -0.02731141
[2,] -0.02731141  1.15386678
cov(std.vector)
              [,1]          [,2]
[1,]  1.000000e+00 -3.747434e-17
[2,] -3.747434e-17  1.000000e+00
# Number of observations
n <- 200
# How many times to loop
reps <- 10^4
# Storage for OLS results (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
# Monte Carlo loop
for (i in 1:reps)
{
X.t <- rbinom(n, 1, 0.3) # Generate X
eps.t <- (rnorm(n, 0, 4))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0) # Heteroskedastic errors: sd 4 when X = 1, sd 1 when X = 0
Y.t <- 3 + 2*X.t + eps.t # Generate Y
temp <- lm(Y.t ~ X.t)
beta.store[i,] <- coef(temp)
}
You are seeing how the procedure behind lm(), which is really calculating regression coefficients, behaves in different artificial datasets.
lm() involves functions of averages!
Of course, these artificial datasets are generated from the same population; that is, they are governed by the same data generating process (DGP).
Pay attention to the axes and what role standardization plays.
This means that some operations we apply to the data may have long-run properties, even though some variability is ever present in data.
In particular, the mean is an important empirical quantity which displays long-run regularity.
This motivates us to use the mean in practice.
The question is: what is the mean converging to?
We need to introduce the following new concepts:
We introduced \(X_t\) as taking value equal to 1 if the \(t\)th coin toss produces heads, and 0 otherwise.
\(X_t\) maps from where to where?
Here we have, \(X_t\left(\mathrm{Heads}\right)=1\) and \(X_t\left(\mathrm{Tails}\right)=0\).
Now consider rolling a die. The sample space here is \(\Omega=\{1,2,3,4,5,6\}\).
Since we are interested in the outcome of the die roll, we can define a random variable \(X_t\) to be equal to the value from the \(t\)th die roll.
To specify this function, we can write \(X_t\left(\omega\right)=\omega\) for \(\omega\in\Omega\).
Of course, how we label this sample space should not matter, but your random variable would have to be tweaked accordingly. For instance, if the faces were instead labeled \(\Omega=\{\omega_1,\ldots,\omega_6\}\), we would write \(X_t\left(\omega_k\right)=k\).
In our coin toss example, there seems to be convergence of the mean to some quantity.
We are now going to build this quantity, which is called the expectation or expected value.
Since the mean (as we calculate it) has to satisfy some natural properties, it is hard to argue against wanting to endow the expectation with the same properties.
We are now going to define an operator called the expectation operator, denoted by \(\mathbb{E}\left(\cdot\right)\). This operator is a rule which assigns to a scalar random variable \(X\) a number \(\mathbb{E}\left(X\right)\), but this rule has to satisfy the following axioms:
Axioms are statements which we take to be true and which are compatible with our experience.
The axioms by themselves do NOT tell you how to actually calculate an expectation!
A useful consequence of the axioms is that: \[\mathbb{E}\left(\sum_{j=1}^n c_jX_j\right)=\sum_{j=1}^n c_j \mathbb{E}\left(X_j\right)\]
Do note that it is possible for \(\mathbb{E}\left(X\right)\) to actually not exist. We will encounter this later, as this arises in practice.
From the axioms, we can also derive a definition for the probability of an event \(A\): \(\mathbb{P}\left(A\right)=\mathbb{E}\left(I\left(A\right)\right)\).
\(I\left(A\right)\) is called an indicator random variable: it is equal to 1 when \(A\) occurs and 0 otherwise.
We will not spend so much time calculating probabilities. In fact, the presentation is a bit nontraditional.
Suppose \(\Omega=\{\omega_1,\ldots,\omega_K\}\), a finite sample space.
Let \(X\) be a random variable defined on this sample space.
In this context, \(X\) is a discrete random variable.
Based on the axioms, the expectation can be shown to be equal to \[\mathbb{E}\left(X\right)=\sum_{k=1}^K p_kX\left(\omega_k\right)\] with \(p_k=\mathbb{P}\left(\omega_k\right)\), where \(p_1,\ldots,p_K\) have to satisfy \(p_k\geq 0\) for every \(k\) and \(\sum_{k=1}^K p_k=1\).
For a coin toss, \[\begin{eqnarray}\mathbb{E}\left(X_t\right) &=&\mathbb{P}\left(\mathrm{Heads}\right)*X_t\left(\mathrm{Heads}\right)+\mathbb{P}\left(\mathrm{Tails}\right)*X_t\left(\mathrm{Tails}\right)\\ &=&\mathbb{P}\left(\mathrm{Heads}\right)\end{eqnarray}\]
For a die roll, \[\begin{eqnarray}\mathbb{E}\left(X\right) &=&\mathbb{P}\left(1\right)*X\left(1\right)+\mathbb{P}\left(2\right)*X\left(2\right)+\mathbb{P}\left(3\right)*X\left(3\right)\\ &&+ \mathbb{P}\left(4\right)*X\left(4\right)+\mathbb{P}\left(5\right)*X\left(5\right)+\mathbb{P}\left(6\right)*X\left(6\right)\\ &=& \mathbb{P}\left(1\right)*1+\mathbb{P}\left(2\right)*2+\mathbb{P}\left(3\right)*3\\ &&+ \mathbb{P}\left(4\right)*4+\mathbb{P}\left(5\right)*5+\mathbb{P}\left(6\right)*6\end{eqnarray}\]
Notice that you need probabilities to be able to calculate and simplify these expectations.
These probability assignments have to obey the conditions indicated earlier. These probability assignments form a discrete probability distribution for a random variable.
If the coin is fair, then the discrete probability distribution for \(X_t\) is given by \(\mathbb{P}\left(\mathrm{Heads}\right)=1/2\) and \(\mathbb{P}\left(\mathrm{Tails}\right)=1/2\).
If the die is fair, then the discrete probability distribution for \(X\) is given by \(\mathbb{P}\left(i\right)=1/6\) for all \(i=1,2,3,4,5,6\).
You can definitely dream up other discrete probability distributions for a coin or a die roll, provided that you satisfy the conditions.
This may not feel practically relevant but having infinitely many values can be a good approximation to the case of having a very large value of \(K\).
Setting of thought experiment:
The thought experiment:
What will change is that the summation found in the expectation is replaced by an integral, i.e., \[\mathbb{E}\left(X\right)=\int_{-\infty}^{\infty} X\left(\omega\right)f\left(\omega\right)\,\,d\omega\] provided that this integral exists.
Just like the discrete case, there would be constraints on \(f\) which will guarantee that \(\mathbb{E}\left(X\right)\) satisfies the axioms.
In particular, \(f\) should satisfy \(f\left(\omega\right)\geq 0\) for all \(\omega\) and \(\int_{-\infty}^{\infty}f\left(\omega\right)\,d\omega=1\).
\(f\) is called a probability density on \(\Omega\). \(X\) has a continuous probability distribution on \(\Omega\).
You are not expected to know how to explicitly calculate expectations of random variables having continuous probability distributions if you are given \(f\). Integration skills are needed for this.
Probabilities and probability densities are different from each other but are related.
Their empirical counterparts are relative frequencies and relative frequency per unit.
When we discussed histograms, we presented them in two different ways.
You can show that \(\mathsf{Var}\left(X\right)=\mathbb{E}\left(X^2\right)-\mathbb{E}\left(X\right)^2\).
Interestingly, we have \[\mathsf{Var}\left(X\right)=\min_{c}\mathbb{E}\left[\left(X-c\right)^2\right],\] with the minimum attained at \(c=\mathbb{E}\left(X\right)\).
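A quick simulation sketch of this fact (the population here, normal with mean 2 and sd 3, is an arbitrary choice for illustration):
x <- rnorm(10^5, mean = 2, sd = 3)
mse <- function(c) mean((x - c)^2) # empirical expected squared loss at prediction c
optimize(mse, interval = c(-10, 10))$minimum # close to E(X) = 2
mse(mean(x)) # close to Var(X) = 9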
The standard deviation of a random variable \(X\) is given by \(\sqrt{\mathsf{Var}\left(X\right)}\).
You have to wonder whether the standard deviation you have seen before is somewhat related to the standard deviation of a random variable.
We can also form the standardized version of a random variable \(X\). What do you think it would look like?
There are higher-order moments which are used in practice, but not as much as the first two moments.
The third and fourth moments of \(X\) are related to symmetry and tail behavior of the distribution of the random variable \(X\).
Everything we have done could (more or less) be extended to the case of multiple random variables.
Moments for random vectors are less straightforward. For our purposes, I will be explicit about these as we go along.
The covariance of two random variables \(X\) and \(Y\) is defined as: \[\begin{eqnarray}\mathsf{Cov}\left(X,Y\right) &=& \mathbb{E}\left[\left(X-\mathbb{E}\left(X\right)\right)\left(Y-\mathbb{E}\left(Y\right)\right)\right]\\ &=&\mathbb{E}\left(XY\right)-\mathbb{E}\left(Y\right)\mathbb{E}\left(X\right)\end{eqnarray}\]
The correlation coefficient of two random variables \(X\) and \(Y\) is defined as: \[\rho_{X,Y} = \frac{\mathsf{Cov}\left(X,Y\right)}{\sqrt{\mathsf{Var}\left(X\right)\mathsf{Var}\left(Y\right)}}\]
This application is based on Gelman and Nolan (2009).
Suppose you have two coin sequences (heads is 1, tails is 0):
Sequence A: 0011100011/0010000100/0010001000/1000000001/0011001010
1100001111/1100110001/0101100100/1000100000/0011111001
Sequence B: 0100010100/1100010100/1110100110/0011110100/0111010001
1000110111/1000100101/1011011100/0110010001/0010000100
As a starting point, flip a coin 5 times. Assume that all possible outcomes of those five flips are equally likely to occur. Define
\(Y\) to be the length of the longest run (run = a sequence of consecutive flips of the same type)
\(X\) to be the number of switches from heads to tails or tails to heads
What is \(\mathsf{Cov}\left(X,Y\right)\) for this special case?
What happens when you have more than 5 tosses – say 100 tosses? Simulation might be easier here!
# 10000 sequences of 5 fair coin flips, one sequence per row
x <- rbinom(50000, 1, 0.5); x <- matrix(x, nrow = 10000)
# run-length encoding of every sequence
result <- apply(x, 1, rle)
dist <- matrix(0, nrow = nrow(x), ncol = 2)
for(i in 1:nrow(x))
{
temp <- result[[i]]$lengths
# number of switches = number of runs minus 1; longest run = largest run length
dist[i,] <- c(length(temp)-1, max(temp))
}
plot(dist[,1], jitter(dist[, 2], 1), xlab="Number of switches", ylab="Length of longest run")
cov(dist)
           [,1]       [,2]
[1,]  1.0139804 -0.8863458
[2,] -0.8863458  0.9749505
cor(dist)
           [,1]       [,2]
[1,]  1.0000000 -0.8914503
[2,] -0.8914503  1.0000000
What if we have 100 flips?
x <- rbinom(1000000, 1, 0.5); x <- matrix(x, nrow = 10000)
result <- apply(x, 1, rle)
dist <- matrix(rep(0,nrow(x)*2), ncol=2)
for(i in 1:nrow(x))
{
temp <- result[[i]]$lengths
dist[i,]<-c(length(temp)-1, max(temp))
}
plot(dist[,1], jitter(dist[, 2], 1), xlab="Number of switches", ylab="Length of longest run")
cov(dist)
          [,1]      [,2]
[1,] 24.18158 -3.927190
[2,] -3.92719  3.267601
cor(dist)
           [,1]       [,2]
[1,]  1.0000000 -0.4417993
[2,] -0.4417993  1.0000000
A Bernoulli trial is an experiment resulting in “success” or “failure”, but not both.
You have to define what “success” means and ensure that “failure” is its complement.
Fix \(n\) as a positive integer and \(0<p<1\). If there are \(n\) independent Bernoulli trials with each trial having the same probability of success \(p\), then the total number of successes \(X\) has a Binomial distribution with parameters \(n\) and \(p\). In short, \(X\sim \mathsf{Bin}\left(n,p\right)\).
You have seen a simulated version of binomially distributed r.v.’s in the previous slides.
Next, you will see plots of the probability distributions for binomial r.v.’s.
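For instance, a sketch of how such a plot could be produced for \(\mathsf{Bin}\left(10,0.5\right)\) (the parameter values are illustrative):
plot(0:10, dbinom(0:10, size = 10, prob = 0.5), type = "h", xlab = "number of successes", ylab = "probability")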
By default, the sample() command is used to draw random numbers from the discrete uniform distribution. sample() has options to allow for non-fair die rolls if you want.
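A sketch of both uses (the probabilities for the loaded die are arbitrary, for illustration):
sample(1:6, size = 10, replace = TRUE) # ten fair die rolls
sample(1:6, size = 10, replace = TRUE, prob = c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1)) # a loaded die favoring 1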
The standard normal distribution is denoted by \(N\left(0,1\right)\). The word “standard” comes from the standardizing idea you have encountered before.
The probability density is given by \[\phi\left(x\right) = \frac{1}{\sqrt{2\pi}} \exp \left(-\frac{1}{2}x^2\right).\]
If \(X\sim N(0,1)\), then \(\mathbb{E}\left(X\right)=0\) and \(\mathsf{Var}\left(X\right)=1\).
The area between -1 to 1 is roughly 68%. The area between -2 to 2 is roughly 95%. The area between -3 to 3 is roughly 99.7%.
Typically, probabilities from the standard normal are obtained from a table or from software.
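In R, for example, pnorm() gives cumulative probabilities of the standard normal, so the three areas above can be checked as:
pnorm(1) - pnorm(-1) # about 0.6827
pnorm(2) - pnorm(-2) # about 0.9545
pnorm(3) - pnorm(-3) # about 0.9973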
Set \(\mu=0\) and allow \(\sigma\) to vary (0.5 in blue, 1 in black, 1.5 in red, 2 in green).
curve(dnorm(x, mean = 0, sd = 0.5), from = -6, to = 6, col = 4, ylab = "")
curve(dnorm(x, mean = 0, sd = 1), from=-6, to=6, add = TRUE, ylab = "")
curve(dnorm(x, mean = 0, sd = 1.5), from=-6, to = 6, add = TRUE, col=2, ylab = "")
curve(dnorm(x, mean = 0, sd = 2), from=-6, to = 6, add = TRUE, col=3, ylab = "")
Set \(\sigma=1\) and allow \(\mu\) to vary (-0.5 in red, 0 in black, 0.5 in green).
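The commands are analogous to the block above; a sketch:
curve(dnorm(x, mean = -0.5, sd = 1), from = -6, to = 6, col = 2, ylab = "")
curve(dnorm(x, mean = 0, sd = 1), from = -6, to = 6, add = TRUE, ylab = "")
curve(dnorm(x, mean = 0.5, sd = 1), from = -6, to = 6, add = TRUE, col = 3, ylab = "")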
We can extend the standard normal distribution to two dimensions.
We have to think about the more interesting case where two standard normal random variables are not independent of each other.
We will be introducing a parameter \(\rho\) to reflect this dependence.
Let \[Z_1 = U_1,\quad Z_2=\rho U_1+\sqrt{1-\rho^2} U_2\] where \(U_1\sim N\left(0,1\right)\), \(U_2\sim N\left(0,1\right)\) and \(U_1\) is independent of \(U_2\).
It has the following properties: \[\mathbb{E}\left(Z_1\right)=\mathbb{E}\left(Z_2\right)=0, \quad \mathsf{V}\left(Z_1\right)=\mathsf{V}\left(Z_2\right)=1, \quad \mathsf{Cov}\left(Z_1,Z_2\right)=\rho\]
\[f\left(z_1,z_2\right)=\frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2}\frac{z_1^2+z_2^2-2\rho z_1z_2}{1-\rho^2}\right).\] This density is called the standard bivariate normal distribution with parameter \(\rho\), i.e., \(\left(Z_1,Z_2\right) \sim \mathsf{SBVN}\left(\rho\right)\).
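The construction is easy to check by simulation; a sketch with \(\rho=0.6\) (the value is illustrative):
rho <- 0.6
u1 <- rnorm(10^4)
u2 <- rnorm(10^4)
z1 <- u1
z2 <- rho * u1 + sqrt(1 - rho^2) * u2
c(mean(z1), mean(z2), var(z1), var(z2), cov(z1, z2)) # close to 0, 0, 1, 1, 0.6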
\(\rho \in \{-0.9, -0.6, 0\}\)
\(\rho \in \{0.6, 0.9, 0.99\}\)
We can now extend to bivariate normal distributions with general means and variances.
Let \(X_1=\mu_1+\sigma_1 Z_1\) and \(X_2=\mu_2+\sigma_2 Z_2\).
Observe that \[\mathbb{E}\left(X_1\right)=\mu_1, \quad \mathbb{E}\left(X_2\right)=\mu_2, \quad \mathsf{V}\left(X_1\right)=\sigma^2_1, \quad \mathsf{V}\left(X_2\right)=\sigma^2_2.\]
Further observe that \[\sigma_{12}=\mathsf{Cov}\left(X_1,X_2\right)=\mathsf{Cov}\left(Z_1,Z_2\right)\sigma_1\sigma_2=\rho\sigma_1\sigma_2.\]
Suppose \(X_1,\ldots,X_n\) are IID with \(\mathbb{E}\left(X_t\right)=\mu\) and \(\mathsf{Var}\left(X_t\right)=\sigma^2\). Define the sample mean \[\overline{X} = \frac{1}{n}\sum_{t=1}^n X_t\]
Then \[\begin{eqnarray} \mathbb{E}\left(\overline{X}\right) &=& \mu\\ \mathsf{Var}\left(\overline{X}\right) &=& \frac{1}{n}\cdot \sigma^2 \end{eqnarray}\]
nsim <- 10^4
set1 <- replicate(nsim, rbinom(10, 1, 0.05))
set2 <- replicate(nsim, rbinom(40, 1, 0.05))
set3 <- replicate(nsim, rbinom(160, 1, 0.05))
set4 <- replicate(nsim, rbinom(640, 1, 0.05))
props1 <- colMeans(set1); props2 <- colMeans(set2); props3 <- colMeans(set3); props4 <- colMeans(set4)
# spreads from the simulated sampling distribution
c(sd(props1), sd(props2), sd(props3), sd(props4))
[1] 0.068803949 0.034139244 0.017137012 0.008598289
# average of standard errors computed for each replication
c(mean(apply(set1, 2, sd))/sqrt(10), mean(apply(set2, 2, sd))/sqrt(40), mean(apply(set3, 2, sd))/sqrt(160), mean(apply(set4, 2, sd))/sqrt(640))
[1] 0.043927828 0.031217359 0.016961816 0.008596058
# formula-based standard error (had we known the unknown variance)
c(sqrt(0.05*0.95/10), sqrt(0.05*0.95/40), sqrt(0.05*0.95/160), sqrt(0.05*0.95/640))
[1] 0.06892024 0.03446012 0.01723006 0.00861503
The last set of standard errors exploits knowledge about the unknown DGP and serves as a benchmark. We typically do not know this in practice.
When \(n\) is small, the standard errors we can compute in practice may not perform very well, but they get better with large \(n\).
Having IID random variables is really the same as obtaining a random sample.
The sample mean \(\overline{X}\) will be different every time you gather a random sample from the same population.
It is possible to write down all possible values of \(\overline{X}\), but we do not know which will be realized for a particular random sample.
This means that \(\overline{X}\) is also a random variable which has a probability distribution (which some people call the sampling distribution of \(\overline{X}\)).
In particular, the expectation of \(\overline{X}\) is actually the population average.
Notice that the population size does not matter when deciding how to reduce the variance of \(\overline{X}\).
Note that \[\mathsf{Var}\left(\overline{X}\right)=\mathbb{E}\left[\left(\overline{X}-\mathbb{E}\left(\overline{X}\right)\right)^2\right]=\mathbb{E}\left[\left(\overline{X}-\mu\right)^2\right]\]
Observe that \(\mathsf{Var}\left(\overline{X}\right)\) tends to zero with increasing \(n\).
\(\sqrt{\mathsf{Var}\left(\overline{X}\right)}\) is the standard deviation of the sampling distribution of \(\overline{X}\). We abbreviate this as the standard error of \(\overline{X}\).
Notice that the standard error of \(\overline{X}\) declines at the \(\sqrt{n}\) rate.
This means that cutting this standard error in half requires a four-fold increase in the number of observations.
Another convergence concept stems from the behavior of the probability of the event that \[|Z_n-Z|\geq \epsilon\] as \(n\to\infty\), for any pre-specified \(\epsilon>0\).
Let \(Z_1,Z_2,\ldots\) be a sequence of random variables. This sequence is said to converge in probability to a random variable \(Z\), denoted by \(Z_n \overset{p}{\to} Z\), if for any \(\epsilon>0\), \[\lim_{n\to\infty}\mathbb{P}\left(|Z_n-Z|\geq \epsilon\right)= 0\]
Sometimes, \(Z_n \overset{p}{\to} Z\) can be written as \(\underset{n\to\infty}{\mathrm{plim}} Z_n = Z\).
We will not prove that \(\overline{X} \overset{p}{\to} \mu\), but demonstrate it using a separate simulation.
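That simulation is separate from these slides; a minimal sketch in the same spirit, drawing from a population whose mean is 2 (an arbitrary choice):
n <- 10^4
x <- rnorm(n, mean = 2, sd = 3)
plot(cumsum(x)/(1:n), type = "l", xlab = "number of observations", ylab = "running mean")
abline(h = 2, lty = "dotted") # the population mean being converged to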
The result is one of the most celebrated theorems in probability, called the law of large numbers for sequences of IID random variables.
We also have the continuous mapping theorem: if \(g\) is a continuous function and \(Z_n \overset{p}{\to} Z\), then \(g\left(Z_n\right) \overset{p}{\to} g\left(Z\right)\).
Suppose \(\beta_1\) is the population parameter of interest.
Consider the procedure behind lm() to produce regression coefficients:
\[\widehat{\beta}_1=\frac{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)\left(Y_{t}-\overline{Y}\right)}{\displaystyle\sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2} \]
\[\begin{eqnarray} \mathbb{E}\left(\widehat{\beta}_1\right) &=& \beta_1 \\ \mathsf{Var}\left(\widehat{\beta}_1\right) &=& \frac{\sigma^2}{\displaystyle \sum_{t=1}^n \left(X_{1t}-\overline{X}_1\right)^2} \end{eqnarray}\]
lm() actually uses the formula you have just seen for \(\mathsf{Var}\left(\widehat{\beta}_1\right)\) by default. Let us examine by simulation how well these standard errors track the actual variability of the coefficients produced by lm().
# Number of observations
n <- 50
# How many times to loop
reps <- 10^4
# Storage for OLS results (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
se.store <- matrix(NA, nrow=reps, ncol=2)
# Generate X, fixed in repeated samples
X.t <- rbinom(n, 1, 0.3)
# Monte Carlo loop
for (i in 1:reps)
{
eps.t <- (rnorm(n, 0, 4))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0)
Y.t <- 3 + 2*X.t + eps.t # Generate Y
temp <- lm(Y.t ~ X.t)
beta.store[i,] <- coef(temp)
se.store[i,] <- sqrt(diag(vcov(temp)))
}
# standard deviations of the simulated coefficients (intercept, slope)
[1] 0.1772843 0.9601830
# average of the standard errors reported by lm() across replications
[1] 0.4381875 0.7303125
# formula-based standard error for the slope, using the overall error variance 5.5
[1] 0.6909635
lm() has implementable formulas for the standard errors.
# Number of observations
n <- 200
# How many times to loop
reps <- 10^4
# Storage for OLS results (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
se.store <- matrix(NA, nrow=reps, ncol=2)
# Generate X, fixed in repeated samples
X.t <- rbinom(n, 1, 0.3)
# Monte Carlo loop
for (i in 1:reps)
{
eps.t <- (rnorm(n, 0, 4))*(X.t == 1)+(rnorm(n, 0, 1))*(X.t == 0)
Y.t <- 3 + 2*X.t + eps.t # Generate Y
temp <- lm(Y.t ~ X.t)
beta.store[i,] <- coef(temp)
se.store[i,] <- sqrt(diag(vcov(temp)))
}
# standard deviations of the simulated coefficients (intercept, slope)
[1] 0.08519018 0.50678027
# average of the standard errors reported by lm() across replications
[1] 0.2054663 0.3632166
# formula-based standard error for the slope, using the overall error variance 5.5
[1] 0.3554978
# Number of observations
n <- 50
# How many times to loop
reps <- 10^4
# Storage for OLS results (2 entries per replication)
beta.store <- matrix(NA, nrow=reps, ncol=2)
se.store <- matrix(NA, nrow=reps, ncol=2)
# Generate X, fixed in repeated samples
X.t <- rbinom(n, 1, 0.3)
# Monte Carlo loop
for (i in 1:reps)
{
eps.t <- rnorm(n, 0, sqrt(5.5)) # Homoskedastic errors with the same overall variance 5.5
Y.t <- 3 + 2*X.t + eps.t # Generate Y
temp <- lm(Y.t ~ X.t)
beta.store[i,] <- coef(temp)
se.store[i,] <- sqrt(diag(vcov(temp)))
}
Suppose there is a finite population of size \(N\) and each member of this population has some characteristic of interest.
Let \(\Omega=\{\omega_1,\ldots,\omega_N\}\) be the sample space representing the identity of the members of the population.
Let \(X\left(\omega_k\right)\) be the corresponding characteristic of \(\omega_k\).
Define the population average for that characteristic to be \[\mu=\frac{1}{N}\sum_{k=1}^N X\left(\omega_k\right)\]
Surveys rely on sampling a small fraction of this population. Let \(n\) be the sample size.
Define a random variable \(\alpha_t\left(\omega\right)\), \(\omega\in\Omega\), equal to whichever of \(\omega_1,\ldots,\omega_N\) was sampled on the \(t\)th draw.
Define the sample mean to be \[\overline{X} = \frac{1}{n}\sum_{t=1}^n X\left(\alpha_t\right) \]
What do we learn from the survey? There are two cases to consider:
Depending on which units were sampled, \(\overline{X}\) will be different every time.
We may know what the possible values of \(\overline{X}\) are, but we do not know which will be realized for a particular survey.
This means that \(\overline{X}\) is also a random variable which has a probability distribution (which some people call the sampling distribution of \(\overline{X}\)).
It will have moments, provided they exist.
In particular, the expectation of \(\overline{X}\) is actually the population average.
Furthermore, the findings about the variance of \(\overline{X}\) depend on the case:
It is important not to lose sight of what \(\mathsf{Var}\left(\overline{X}\right)\) really means.
One of the four words in the sentence “I SEE THE MOUSE” will be selected at random. This means that each of these words is equally likely to be chosen.
The task is to predict the number of letters in the selected word.
Your error would be the number of letters minus the predicted number of letters.
The square of your error will be your loss.
Because you do not know for sure which word will be chosen, you have to allow for contingencies.
Therefore, losses depend on which word was chosen.
What would be your prediction rule in order to make your expected loss as small as possible?
How much would be the smallest expected loss?
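You can check your answers numerically. Since each word is equally likely, population expectations are plain averages (a sketch):
y <- c(1, 3, 3, 5) # number of letters in I, SEE, THE, MOUSE
mean(y) # the expected-loss-minimizing prediction: 3
mean((y - mean(y))^2) # the smallest expected loss: 2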
There is some random variable \(Y\) whose value you want to predict.
You make a prediction (a real number \(\beta_0\)).
You make an error \(Y-\beta_0\), but this error is a random variable having a distribution.
Next, you have some criterion to assess whether your prediction is “good”.
You need a loss function. There are many choices, which are not limited to: \[L\left(z\right)=z^{2},L\left(z\right)=\left|z\right|,L\left(z\right)=\begin{cases} \left(1-\alpha\right)\left|z\right| & \mathsf{if}\ z\leq0\\ \alpha\left|z\right| & \mathsf{if}\ z\geq0 \end{cases}\]
But losses will have a distribution too.
We can look at expected losses. This may seem arbitrary but many prediction and forecasting contexts use expected losses.
Consider \(L\left(z\right)=z^{2}\) and form an expected squared loss \(\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right]\).
This expected loss is sometimes called mean squared error (MSE).
You can show that \(\mathbb{E}\left(Y\right)\) is the unique solution to the following optimization problem: \[\min_{\beta_0}\mathbb{E}\left[\left(Y-\beta_0\right)^{2}\right]\]
\(\mathbb{E}\left(Y\right)\) is given another interpretation as the optimal predictor.
We also have a neat connection with the definition of the variance: the smallest expected squared loss is exactly \(\mathsf{Var}\left(Y\right)\).
What happens if we have additional information which we may use for prediction?
Return to our toy example.
Let us now predict the number of letters \(Y\) in the selected word, given that you know the number of E’s in the word, denote this as \(X_1\).
Clearly, knowing the value of \(X_1\) for the selected word could help.
How do we incorporate this additional information for prediction? One option is to restrict attention to linear predictors \(\beta_0+\beta_1X_1\) and choose \(\beta_0\) and \(\beta_1\) to minimize the expected squared loss \(\mathbb{E}\left[\left(Y-\beta_{0}-\beta_{1}X_1\right)^{2}\right]\).
Solving the first-order conditions, we have \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathbb{E}\left(X_1Y\right)-\mathbb{E}\left(Y\right)\mathbb{E}\left(X_1\right)}{\mathbb{E}\left(X_1^2\right)-\mathbb{E}\left(X_1\right)^2}.\]
In the end, we have \[\beta_{0}^{*} = \mathbb{E}\left(Y\right)-\beta_{1}^{*}\mathbb{E}\left(X_{1}\right),\qquad\beta_{1}^{*}=\dfrac{\mathsf{Cov}\left(X_{1},Y\right)}{\mathsf{Var}\left(X_{1}\right)}.\]
Take note of how we computed linear regression in lm(). Do you find some similarities?
In fact, the similarities are so suggestive that it can be shown that under an IID sample \(\{\left(X_{1t} , Y_t\right)\}_{t=1}^n\) from the joint distribution of \(\left(X_1,Y\right)\), we have \[\widehat{\beta}_1 \overset{p}{\to} \beta^*_1,\ \ \ \widehat{\beta}_0 \overset{p}{\to} \beta^*_0\]
Recall that the task is to predict the number of letters in the selected word. Each word is equally likely to be chosen.
Let \(Y\) be the number of letters in the selected word.
The optimal predictor under an MSE criterion for \(Y\) is \(\mathbb{E}\left(Y\right)\). Here, we have \(\mathbb{E}\left(Y\right)=3\).
If we have information on the number of E’s in the word, what will be your prediction? Let \(X_1\) denote the number of E’s in the word.
We have found a formula for the optimal linear predictor which is \(\beta_0^* +\beta_1^*X_1\), where \(\beta_0^*\) and \(\beta_1^*\) have forms we have derived earlier.
You can show that the optimal linear predictor is \(2+X_1\). Notice that this optimal linear predictor uses information about the number of E’s in the word to make predictions.
Contrast this to the case where we did not know the number of E’s in the word.
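A quick numerical check of the formulas (each word equally likely, so population moments are plain averages; a sketch):
x1 <- c(0, 2, 1, 1) # number of E's in I, SEE, THE, MOUSE
y <- c(1, 3, 3, 5) # number of letters
b1 <- (mean(x1*y) - mean(x1)*mean(y))/(mean(x1^2) - mean(x1)^2)
b0 <- mean(y) - b1*mean(x1)
c(b0, b1) # 2 and 1, so the optimal linear predictor is 2 + X1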
We can always write \[Y_t=\beta_0^*+\beta_1^* X_{1t} + \varepsilon_t\] such that \(\mathbb{E}\left(\varepsilon_t\right)=0\) and \(\mathbb{E}\left(X_{1t}\varepsilon_t\right)=0\).
\(\mathbb{E}\left(\varepsilon_t\right)=0\) and \(\mathbb{E}\left(X_{1t}\varepsilon_t\right)=0\) together imply that \(\mathsf{Cov}\left(X_{1t},\varepsilon_t\right)=0\).
Contrast this to the first version of the linear regression model you have seen earlier.
Usually, introductory textbooks will write down an equation like \[Y_t=\beta_0+\beta_1 X_{1t} + \varepsilon_t\] and then make assumptions about \(\varepsilon_t\). The task is to learn \(\beta_0\) and \(\beta_1\).
If you assume that \(\varepsilon_t\) satisfies \(\mathbb{E}\left(\varepsilon_t\right)=0\) and \(\mathbb{E}\left(X_{1t}\varepsilon_t\right)=0\), then \(\beta_0\) and \(\beta_1\) automatically coincide with the coefficients of the optimal linear predictor.
The approach presented in the slides gives you a clearer motivation as to why we could interpret coefficients produced by lm() as predicted differences.
Actually, introductory textbooks impose a stronger assumption on \(\varepsilon_t\): they would say that \(\varepsilon_t\) is mean-independent of \(X_{1t}\), i.e., \(\mathbb{E}\left(\varepsilon_t|X_{1t}\right)=0\).
This motivates us to look further into this idea of mean independence.
Return to I SEE THE MOUSE. The next task is to predict the number of letters in the selected word if you have information about the number of E’s in the word (call this \(X_1\)).
In “I SEE THE MOUSE”, knowing the event that the selected word contains at least one E reduces the sample space of words from \[\{\mathrm{I}, \mathrm{SEE}, \mathrm{THE}, \mathrm{MOUSE}\}\] to \[\{\mathrm{SEE}, \mathrm{THE}, \mathrm{MOUSE}\}\]
As a result, your prediction of the number of letters will definitely change.
We now define a concept called the expectation of a random variable conditional on an event: Let \(Y\) be a random variable and let \(A\) be an event. Then, \[\mathbb{E}\left(Y|A\right)=\frac{\mathbb{E}\left(YI\left(A\right)\right)}{\mathbb{E}\left(I\left(A\right)\right)}=\frac{\mathbb{E}\left(YI\left(A\right)\right)}{\mathbb{P}\left(A\right)}\]
Take note of the similarity of the definition to how you would compute a subgroup average with real data.
Of course, \(\mathbb{P}\left(A\right)>0\) for this \(\mathbb{E}\left(Y|A\right)\) to be well-defined.
If \(\mathbb{P}\left(A\right)>0\), then \(\mathbb{E}\left(\cdot|A\right)\) has to satisfy the axioms of the expectation operator.
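To make the definition concrete, return to I SEE THE MOUSE, with \(A\) the event that the selected word contains at least one E (a sketch; each word equally likely):
y <- c(1, 3, 3, 5) # letters in I, SEE, THE, MOUSE
a <- c(0, 1, 1, 1) # indicator I(A): word contains at least one E
mean(y*a)/mean(a) # E(Y|A) = 11/3, compared to E(Y) = 3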
If \(\Omega\) can be decomposed into disjoint events \(\left\{A_k\right\}\), then \[\mathbb{E}\left(Y\right) = \sum_k \mathbb{P}\left(A_k\right)\mathbb{E}\left(Y|A_k\right)\]
Let \(X\) be another random variable. We can also condition on the possible values of \(X\). For example, let \(x\) be some specific value that \(X\) could take. The event is then \(X=x\), which is shorthand for \[\{\omega\in\Omega: X\left(\omega\right)=x\}\]
\(\mathbb{E}\left(Y|X=x\right)\) can be directly obtained as a result. This is just a constant.
But the values of \(x\) could vary. In this sense, \(\mathbb{E}\left(Y|X=x\right)\) may also be thought of as a function of \(x\).
Another layer on top of this is that \(X\) is a random variable. Although it is possible to know the alternative values of \(X\), we do not know which will be realized.
That is why we can also think of \(\mathbb{E}\left(Y|X\right)\) as a random variable which has a probability distribution.
A technical problem we face (which affects practice) is that sometimes \(X\) is a continuous random variable. Therefore, \(\mathbb{P}\left(X=x\right)=0\).
But an alternative approach is to define \(\mathbb{E}\left(Y|X\right)\) as a scalar function of \(X\) such that \[\mathbb{E}\left[\left(Y-\mathbb{E}\left(Y|X\right)\right)H\left(X\right)\right]=0\] for any scalar-valued function \(H\left(X\right)\).
Note the similarity of the condition above to the conditions for linear prediction: \[\mathbb{E}\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X\right) = 0, \ \ \mathbb{E}\left[X\left(Y-\beta_{0}^{*}-\beta_{1}^{*}X\right)\right] = 0\]
The conditional expectation and the best linear predictor differ with respect to the class of functions \(H\left(X\right)\).
For the best linear prediction, there are only two functions \(H\left(X\right)=1\) and \(H\left(X\right)=X\) involved (at least in this context of scalar \(Y\) and scalar \(X\)).
This tells us that conditional expectations and best linear predictors are not necessarily the same!
What are the advantages of this alternative approach when thinking of conditional expectations?
If \(X\) and \(Y\) are independent and \(G\) is any function depending on \(Y\) alone, then \(\mathbb{E}\left(G\left(Y\right)|X\right)= \mathbb{E}\left(G\left(Y\right)\right)\).
For any function \(G\) depending on \(X\) alone, \(\mathbb{E}\left(G\left(X\right)\cdot Y|X\right)=G\left(X\right)\cdot \mathbb{E}\left(Y|X\right)\).
Linearity: \(\mathbb{E}\left(Y_1+Y_2|X\right)= \mathbb{E}\left(Y_1|X\right) + \mathbb{E}\left(Y_2|X\right)\).
For any r.v.’s \(X\), \(Y\), and \(Z\), \(\mathbb{E}\left[\mathbb{E}\left(Y|X,Z\right)|Z\right]= \mathbb{E}\left(Y|Z\right)\).
The law of iterated expectations is given by \[\mathbb{E}\left[G\left(X,Y\right)\right]=\mathbb{E}\left[\mathbb{E}\left[G\left(X,Y\right)|X\right]\right].\]
Take note of which distributions are being used to calculate the corresponding expectations.
A more commonly seen version is \[\mathbb{E}\left[Y\right]=\mathbb{E}\left[\mathbb{E}\left(Y|X\right)\right].\]
The law of total variance is given by \[\mathsf{Var}\left(Y\right)=\mathbb{E}\left[\mathsf{Var}\left(Y|X\right)\right]+\mathsf{Var}\left[\mathbb{E}\left(Y|X\right)\right].\]
\(\mathbb{E}\left[\mathsf{Var}\left(Y|X\right)\right]\) is called within-group variation or unexplained variation while \(\mathsf{Var}\left(\mathbb{E}\left(Y|X\right)\right)\) is called between-group variation or explained variation.
Let us revisit I SEE THE MOUSE.
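As a sketch of what the law of total variance looks like here (each word equally likely; ave() returns the subgroup mean of y within each value of x1):
y <- c(1, 3, 3, 5) # letters in I, SEE, THE, MOUSE
x1 <- c(0, 2, 1, 1) # number of E's
cef <- ave(y, x1) # E(Y|X1) evaluated at each word: 1, 3, 4, 4
between <- mean((cef - mean(y))^2) # explained variation: 1.5
within <- mean((y - cef)^2) # unexplained variation: 0.5
c(between + within, mean((y - mean(y))^2)) # both equal Var(Y) = 2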
Let us revisit the design in Exercise Set 03. Both the CEF and BLP are given by \(3+2X_1\).
(A version of an example from Aronow and Miller (2019)) Let \(Y=X_1^2+W\), where \(W\) and \(X_1\) are independent.
Changing the distribution of \(X_1\) also changes the best linear predictor. In this example, the CEF is the same regardless of the distribution of \(X_1\).
It is not worth the trouble to write down the conditions where the two could be equal.
Two popular examples:
The latter is probably the most practically relevant; it is sometimes called a saturated model.
Estimation and inference could be tricky for the CEF.
What you are going to see is a very naive form of nonparametric regression. If you want to learn more, I would suggest Henderson and Parmeter (2015).
Let us look at that particular strip.
Let us look at another strip.
data <- read.csv("Pearson.csv")
par(pch = 20)
plot(data$Father, data$Son, xlab = "Father's height in inches", ylab = "Son's height in inches")
for(i in 59:75)
{
points(i , mean(subset(data, (data$Father >= (i-0.5) & data$Father <= (i+0.5)))$Son), pch = 8, col = "hotpink2", cex = 2)
abline(v = (i-0.5), lty = "dotted")
abline(v = (i+0.5), lty = "dotted")
}
abline(coef(lm(Son ~ Father, data = data)), lty = "dashed")
mcas <- read.csv("mcas.csv")
par(pch=20)
plot(mcas$percap, mcas$totsc4, xlab = "District income (thousands of dollars)", ylab = "Test score")
for(i in 9:50)
{
points(i , mean(subset(mcas, (mcas$percap >= (i-0.5) & mcas$percap <= (i+0.5)))$totsc4), pch = 8, col = "hotpink2", cex = 2)
abline(v = (i-0.5), lty = "dotted")
abline(v = (i+0.5), lty = "dotted")
}
abline(coef(lm(totsc4 ~ percap, data = mcas)), lty="dashed")
mcas <- read.csv("mcas.csv")
par(pch=20, mfrow = c(1, 2))
plot(mcas$percap, mcas$totsc4, xlab = "District income (thousands of dollars)", ylab = "Test score")
for(i in 9:50)
{
points(i , mean(subset(mcas, (mcas$percap >= (i-0.5) & mcas$percap <= (i+0.5)))$totsc4), pch = 8, col = "hotpink2", cex = 2)
}
curve(cbind(1, x, x^2) %*% coef(lm(totsc4 ~ percap + I(percap^2), data = mcas)), lty = "dashed", add = TRUE, col = "red")
plot(mcas$percap, mcas$totsc4, xlab = "District income (thousands of dollars)", ylab = "Test score")
for(i in 9:50)
{
points(i , mean(subset(mcas, (mcas$percap >= (i-0.5) & mcas$percap <= (i+0.5)))$totsc4), pch = 8, col = "hotpink2", cex = 2)
}
curve(cbind(1, x, x^2, x^3) %*% coef(lm(totsc4 ~ percap + I(percap^2) + I(percap^3), data = mcas)), lty = "dashed", add = TRUE, col = "blue")