
Introduction

This section is focused on practicing building and analyzing likelihoods. This section assumes that you have seen the likelihoods lecture, where you are introduced to the basic idea of a likelihood, likelihood profiles, and MLEs.

The Binomial Distribution

Used to model the number of "successes" in a fixed number of trials (e.g., the number of heads when you flip a coin \(N\) times). The pmf is \[\begin{align*} P(X=x) = {N \choose x} p^x(1-p)^{N-x} \end{align*}\] such that \(\mathrm{E}[X]=Np\). Throughout this lab, you will assume that your experiment consists of flipping 20 coins, so that \(N=20\).

You will use the Binomial distribution to practice two methods of estimating parameters for a probability distribution: method of moments and maximum likelihood.

Simulating from the Binomial using R

Take 50 draws from the binomial (using rbinom) for each \(p \in \{0.1, 0.5, 0.8\}\), with \(N=20\).

## 50 draws with each p 
pp<-c(0.1, 0.5, 0.8)
N<-20
reps<-50
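
One possible way to take the draws (a minimal sketch; the object name draws is just a suggestion):

## one set of draws per value of p, stored in a list
draws<-lapply(pp, function(p) rbinom(reps, size=N, prob=p))
names(draws)<-paste0("p=", pp)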

Plot the histograms of these draws together with the density functions.

## histograms + density here
x<-seq(0, N, by=1) ## support of the binomial (0, 1, ..., 20)
par(mfrow=c(1,3), bty="n")
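
For example, a sketch that overlays the binomial pmf (via dbinom) on each histogram, assuming the list draws from the sketch above:

for(i in 1:3){
  hist(draws[[i]], breaks=seq(-0.5, N+0.5, by=1), freq=FALSE,
       main=paste("p =", pp[i]), xlab="number of heads")
  points(x, dbinom(x, size=N, prob=pp[i]), type="b", col="red")
}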

Q1: Do the histograms look like the distributions for all 3 values of \(p\)? If not, what do you think is going on?

You’ll notice that for \(p=0.1\) the histogram and the pmf don’t look quite the same: with its default breaks, the hist() function lumps the zeros and ones into a single bar, which makes the plot look off. This is a common artifact when plotting discrete data that pile up near the edge of their support; placing the histogram breaks at half-integers (e.g., breaks=seq(-0.5, N+0.5, by=1)) avoids it.

Method of Moments (MoM) Estimators

To obtain a method of moments estimator, we equate the theoretical moments (which will be a function of the parameters of the distribution) with the corresponding sample moments, and solve for the parameters in order to obtain an estimate. For the binomial distribution, there is only one parameter, \(p\).

Q2: Given the analytic expected value, above, and assuming that the sample mean is \(m\) (the mean number of observed heads across replicates), what is the MoM estimator for \(p\)?



Now calculate the MoM estimator for each of your 3 simulated data sets, giving you an estimate for each value of \(p\).

## MOM estimators for 3 simulated sets
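
A minimal sketch, assuming the draws are stored in the list draws from the earlier sketch:

## MoM: p.hat = (sample mean number of heads)/N
p.mom<-sapply(draws, mean)/N
p.mom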

Q3: How good are your estimates for \(p\)? Do you get something close to the true value?



For one of your values of \(p\), take 20 draws from the binomial with \(N=20\) and calculate the MoM estimator. Repeat this 100 times (hint: the replicate() and lapply() functions may be useful). Make a histogram of your estimates, and add a line/point to the plot indicating the true value of \(p\) that you used to simulate the data.

## MoM estimates, histogram 
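
One possible approach, shown here for \(p=0.5\) (a sketch; any of your three values of \(p\) works):

p.true<-0.5
mom.reps<-replicate(100, mean(rbinom(20, size=N, prob=p.true))/N)
hist(mom.reps, xlab="MoM estimate of p", main="")
abline(v=p.true, col="red", lwd=2)  ## true value of p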

Q4: Is the MoM successfully estimating \(p\)? Does your histogram for \(p\) look more or less normal? If so, what theorem might explain why this would be the case?

MLE for Binomial Distribution

Likelihood and Log Likelihood

Imagine that you flip a coin \(N\) times, and then repeat the experiment \(n\) times. Thus, you have data \(x = x_1, x_2, \dots, x_n\), where \(x_i\) is the number of heads observed in the \(i^{\mathrm{th}}\) experiment, and \(p\) is the probability of obtaining a head.

Q5: Write down the likelihood and log-likelihood for the data. Take the derivative of the negative log-likelihood, set this equal to zero, and find the MLE, \(\hat{p}\).



Computing the likelihood and MLE in R

Simulate some data with \(p=0.25\), \(N=10\), and 10 replicates. Calculate the negative log-likelihood of your simulated data across a range of \(p\) (from 0 to 1) and plot it; this plot is called a "likelihood profile". You may use the built-in functions in R (specifically dbinom) or write your own function. Add a line to your likelihood profile indicating the true value of \(p\), and then add lines indicating the MLE \(\hat{p}\) and the MoM estimator for \(p\).

pp<-.25
N<-10
reps<-10

## Make one set of data

## the likelihood is zero at p=0 and p=1 (so the
## NLL is infinite), so I skip those endpoint values
ps<-seq(0.01, 0.99, by=0.01)

## Likelihood


## MLE/MoM estimators 

## now plot the negative log likelihood profile
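
A minimal sketch of the whole sequence, assuming the objects defined above (the names x.dat, nll, and p.hat are just suggestions):

## one simulated data set (the seed is only for reproducibility)
set.seed(1)
x.dat<-rbinom(reps, size=N, prob=pp)

## NLL at each candidate value of p
nll<-sapply(ps, function(p) -sum(dbinom(x.dat, size=N, prob=p, log=TRUE)))

## for the binomial, the MLE and the MoM estimator coincide:
## p.hat = (total heads)/(total flips)
p.hat<-sum(x.dat)/(N*reps)

plot(ps, nll, type="l", xlab="p", ylab="negative log likelihood")
abline(v=pp, col="blue", lwd=2)    ## true p
abline(v=p.hat, col="red", lty=2)  ## MLE (= MoM here)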

Q6: How does your MLE compare to the true value? If you choose a different random seed, do you get the same answer?




Maximum Likelihood Estimation for SLR

Next we go further to:

  1. Learn how to use R to construct likelihood profiles/surfaces.
  2. Get the idea of what a likelihood in 1 or 2-D should look like.
  3. See R’s built-in capabilities for finding the extrema of functions.

The SLR Model

Recall that SLR assumes every observation in your dataset was generated by the model: \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \;\;\; \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma^2) \]


This is a model for the conditional distribution of \(Y\) given \(X\).


The pdf for the normal distribution is given by \[ f(x) = \frac{1}{\sqrt{2\sigma^2 \pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2} \right) \]

Thus under the SLR model, the conditional distribution for any single observation, \(y_i\), can be written as: \[ f(y_i|\beta_0, \beta_1, x_i) = \frac{1}{\sqrt{2\sigma^2 \pi}} \exp\left(-\frac{(y_i-(\beta_0+\beta_1 x_i))^2}{2\sigma^2} \right) \]

If we interpret this function as a function of the parameters \(\theta=\{ \beta_0, \beta_1, \sigma \}\), then it gives us the likelihood of the \(i^{\mathrm{th}}\) data point for those values of the parameters.

As we did for the binomial distribution, we can use the likelihood to estimate the parameters of the model. We also have analytic expressions for the MLEs of the SLR model (which are the same as the least-squares estimators):

\[\begin{align*} \hat{\beta}_0 & = \bar{Y}-\hat{\beta}_1 \bar{X}\\ \hat{\beta}_1 & = r_{xy}\frac{s_y}{s_x} \end{align*}\]

Simulating data from the SLR model

First you’re going to simulate some data from the SLR model as you did in previous labs/homeworks:

n<-30
beta0<-10
beta1<-3
sigma<-4
X<-rnorm(n, mean=3, sd=7)

## fill in the equation for Y
Y<-rep(NA, length=n)

dat<-data.frame(X=X, Y=Y)
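
For reference, the "## fill in the equation for Y" line above could look like the following (a sketch of the SLR model with normal errors; it replaces the Y<-rep(NA, length=n) placeholder before dat is built):

Y<-beta0 + beta1*X + rnorm(n, mean=0, sd=sigma)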

Plot your data.

## put the plot here

In lecture I showed you one way to implement the negative log likelihood (NLL) in R. We use a similar approach here. However, the function I’m providing is a bit different: I write it this way because later you will need to be able to use optim.

nll.slr<-function(par, dat, ...){
  args<-list(...)
  
  ## parameters
  b0<-par[1]
  b1<-par[2]
  
  ## data
  X<-dat$X
  Y<-dat$Y
  
  ## check if we will estimate sigma
  if(!is.na(args$sigma)){
    sigma<-args$sigma
  }else sigma<-par[3]
  
  ## calculate Yhat
  mu<-b0+b1*X
  
  ## the negative log likelihood
  return(-sum(dnorm(Y, mean=mu, sd=sigma, log=TRUE)))

}
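
As a quick sanity check (assuming dat now holds your simulated data), you can evaluate the NLL at the true parameter values:

nll.slr(par=c(beta0, beta1), dat=dat, sigma=sigma)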

Likelihood profile in R

For now, let’s assume that we know the true values of \(\beta_1\) and \(\sigma^2\) but don’t know \(\beta_0\). We will build a likelihood profile for your simulated data as follows:

N<-30 ## number of points to evaluate the nll
b0s<-seq(5, 15, length=N) ## grid of b0 values
mynll<-rep(NA, length=N) ## initialize object to
                          ## hold the nll

## calculate the value of the nll for each proposed 
## value of b0
#for(i in 1:N){
#  mynll[i]<-nll.slr(par=c(b0s[i], beta1),
#                    dat=dat, sigma=sigma)
#}

Now plot the negative log likelihood as a function of \(b_0\). Indicate with a vertical line (or something else):

  1. the TRUE value of the intercept, \(\beta_0\)
  2. the \(b_0\) corresponding to the minimum value of the nll that you’ve calculated (i.e., based on your grid)
  3. The analytic MLE, \(\hat{\beta}_0\)
## put your plot here
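
A sketch of such a plot, assuming mynll has been filled in by the loop above; because \(\beta_1\) and \(\sigma\) are held at their true values, the analytic MLE for the intercept reduces to \(\bar{Y} - \beta_1\bar{X}\):

plot(b0s, mynll, type="l", xlab="b0", ylab="negative log likelihood")
abline(v=beta0, col="blue", lwd=2)                           ## true beta0
abline(v=b0s[which.min(mynll)], col="red", lty=2)            ## grid minimum
abline(v=mean(dat$Y - beta1*dat$X), col="darkgreen", lty=3)  ## analytic MLE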

Q1: How close is your estimate of \(b_0\) to the truth? How close is the MLE to the estimate from the grid? How might you get your grid estimate closer to the MLE value?




Likelihood surface in R

We can do a grid search in 2-D as well, in order to estimate the slope and intercept concurrently. For this we have to evaluate the likelihood on a 2-D grid of values of \(b_0\) and \(b_1\) – if you consider \(N_0\) possible values of \(b_0\) and \(N_1\) values of \(b_1\), you will evaluate the NLL at \(N_0 \times N_1\) points. I recommend that \(N_0 \neq N_1\), as it makes debugging easier.

## set up the possible values of b0 and b1
N0=100
N1=101
b0s<-seq(7,12, length=N0)
b1s<-seq(1,5, length=N1)

## build a matrix to hold the values of the NLL
mynll<-matrix(NA, nrow=N0, ncol=N1)

## run nested for loops to fill in the matrix
#for(i in 1:N0){
#  for(j in 1:N1)
#    mynll[i,j]<-nll.slr(par=c(b0s[i],b1s[j]),
#                        dat=dat, sigma=sigma)
#}

Based on this grid, find the values of \(b_0\) and \(b_1\) that minimize the NLL. Save them as b0.est and b1.est.

## find the values that give the lowest NLL


b0.est<-NA
b1.est<-NA
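
One way to fill these in (a sketch using which() with arr.ind=TRUE; rows of mynll index b0, columns index b1):

ind<-which(mynll==min(mynll), arr.ind=TRUE)
b0.est<-b0s[ind[1,1]]  ## row index corresponds to b0
b1.est<-b1s[ind[1,2]]  ## column index corresponds to b1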

This is one way to plot the likelihood surface, with the true and estimated values of the parameters indicated.

#filled.contour(x = b0s,
#               y = b1s,
#               z= mynll,
#               col=heat.colors(25),
#               plot.axes = {axis(1); axis(2);
#                 points(beta0,beta1, pch=21);
#                 points(b0.est, b1.est, pch=8,
#                        cex=1.5)},
#               plot.title={
#                 title(xlab="b0", ylab="b1")
#                 }
#  )

Plot the simulated data with the estimated and true lines overlaid.

## your plot here
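
A sketch, assuming the grid-search estimates b0.est and b1.est from above:

plot(dat$X, dat$Y, xlab="X", ylab="Y")
abline(beta0, beta1, col="blue", lwd=2)   ## true line
abline(b0.est, b1.est, col="red", lty=2)  ## grid-search estimate
legend("topleft", legend=c("true", "estimated"),
       col=c("blue", "red"), lty=c(1,2), bty="n")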

Q2: How good are your estimates? Do you get closer to one of the parameters than the other? Does this match with what you see as you plot the line with your data?




Maximum likelihood using optim

In the call below, we pass the function to minimize (nll.slr, which optim matches to its fn argument) and a vector of starting values for the parameters (the par argument). After the main arguments, you can add whatever else you need to evaluate your function (e.g., sigma here).

#fit <- optim(nll.slr, par=c(2, 1), 
#             method="L-BFGS-B", ## this is a n-D 
#                                ## method
#             lower=-Inf, upper=Inf, dat=dat,
#             sigma=sigma)

#fit

I can also fit \(\sigma\) at the same time, if I want:

#fit <- optim(nll.slr, par=c(2, 1, 5),
#             method="L-BFGS-B", ## this is a n-D 
#                                ## method
#             lower=c(-Inf, -Inf, 0.1), 
#             upper=Inf, dat=dat, sigma=NA)

## looking only at the parameter estimates
#fit$par

Plot your simulated data together with the fitted line based on the parameter values returned by optim.

## Your plot here
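
For example (a sketch, assuming the optim output above is stored in fit):

plot(dat$X, dat$Y, xlab="X", ylab="Y")
abline(beta0, beta1, col="blue", lwd=2)           ## true line
abline(fit$par[1], fit$par[2], col="red", lty=2)  ## optim fit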

Compare to fitting with lm

Fit the SLR model using the lm() function and look at the estimates of the parameters.

## your fit of the model


## print out the coefficients
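
A minimal sketch (the object name fit.lm is just a suggestion):

fit.lm<-lm(Y~X, data=dat)
summary(fit.lm)$coefficients  ## estimates and their standard errors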

Q3: How do the estimates obtained via the lm function compare to the estimates obtained using optim? How do both compare to what you obtained using a grid search? Which would you trust the most, and why?

Going Further: Numerical Estimates of Confidence Intervals

The MLEs are asymptotically jointly normally distributed. Given this, if you are minimizing the negative log likelihood (NLL), then the covariance matrix of the estimates is (asymptotically) the inverse of the Hessian matrix. The Hessian matrix contains the second derivatives of the NLL (computed numerically here), which capture the curvature of the likelihood. Thus we can use the Hessian to estimate confidence intervals:

#fit <- optim(nll.slr, par=c(2, 1), 
#             method="L-BFGS-B", hessian=TRUE,
#             lower=-Inf, upper=Inf, dat=dat,
#             sigma=sigma)

#fisher_info<-solve(fit$hessian)
#est_sigma<-sqrt(diag(fisher_info))
#upper<-fit$par+1.96*est_sigma
#lower<-fit$par-1.96*est_sigma
#interval<-data.frame(value=fit$par, upper=upper, lower=lower)
#interval

The standard errors on the intercept and slope should be very similar to those we obtain from the lm function, above.

If you’re interested in proofs of this sort of thing, they are covered in some of the numerical methods courses.