Statistics: Maximum Likelihood Estimation

So far we've had two ideas for building an estimator for a statistical functional $T$: one is to plug the empirical distribution $\widehat{\nu}$ into $T$, and the other, kernel density estimation, is closely related (we just smear the probability mass out around each observed data point before substituting into $T$). In this section, we'll learn another approach which has some compelling properties and is suitable for choosing from a parametric family of densities or mass functions.

Let's revisit the example from the first section where we looked for the Gaussian distribution which best fits a given set of measurements of the heights of 50 adults. This time, we'll include a goodness score for each choice of $\mu$ and $\sigma$, so we don't have to select a best fit subjectively.

The goodness function we'll use is called the log likelihood function, which we define to be the log of the product of the density function evaluated at each of the observed data points. This function rewards density functions which have larger values at the observed data points and penalizes functions which have very small values at some of the points. This is a rigorous way of capturing the idea that a given density function is consonant with the observed data.

(Interactive figure: adjust $\mu$ and $\sigma$ to make the log likelihood of the Gaussian fit to the height data as large as possible.)

Definitions

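Consider a parametric family $\{f_\theta : \theta \in \Theta\}$ of PDFs or PMFs. For example, the parametric family might consist of all Gaussian distributions, all geometric distributions, or all discrete distributions on a particular finite set.

Given $\mathbf{x} = (x_1, \ldots, x_n)$, the likelihood $\mathcal{L}_{\mathbf{x}}$ is the function of $\theta$ defined by

$$\mathcal{L}_{\mathbf{x}}(\theta) = f_\theta(x_1) f_\theta(x_2) \cdots f_\theta(x_n).$$

The idea is that if $\mathbf{X}$ is a vector of $n$ independent observations drawn from a member of the family, then $\mathcal{L}_{\mathbf{X}}(\theta)$ is small or zero when $\theta$ is not in concert with the observed data.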

Because the likelihood is defined to be a product of many factors, its values are often extremely small, and we may encounter underflow issues. Furthermore, sums are often easier to reason about than products. For both of these reasons, we often compute the logarithm of the likelihood instead:

$$\log \mathcal{L}_{\mathbf{x}}(\theta) = \log f_\theta(x_1) + \log f_\theta(x_2) + \cdots + \log f_\theta(x_n).$$

Maximizing the likelihood is the same as maximizing the log likelihood because the natural logarithm is a monotonically increasing function.
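To see why the logarithm helps numerically, here is a minimal sketch in Python. The standard normal density and the synthetic data set (2000 copies of the value 3.0) are illustrative assumptions, not part of the lesson's data.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical data: 2000 observations, each equal to 3.0.
data = [3.0] * 2000

# Direct product of densities: every factor is below 1, so the product
# shrinks until it is rounded to exactly 0.0 in floating point.
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x)

# Sum of log densities: the same information, but numerically stable.
log_likelihood = sum(math.log(normal_pdf(x)) for x in data)

print(likelihood)
print(log_likelihood)
```

The product is reported as zero even though every factor is positive, while the sum of logarithms remains a finite number that can still be compared across parameter values.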

Example
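Suppose $f_\theta$ is the density of a uniform random variable on $[0, \theta]$. We observe four samples drawn from this distribution, the largest of which lies between 5 and 7. Find the likelihood at 5, at a very large value of $\theta$, and at 7.

Solution. The likelihood at 5 is zero, since the largest observation exceeds 5 and $f_5$ vanishes there. The likelihood at a very large value of $\theta$ is very small, since it equals $(1/\theta)^4$. The likelihood at 7 is larger: $(1/7)^4 = 1/2401$.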
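The computation in this example can be sketched in Python. The four sample values below are hypothetical stand-ins, chosen only so that the largest one lies between 5 and 7, and the function name `uniform_likelihood` is likewise an assumption made for illustration.

```python
def uniform_likelihood(theta, samples):
    """Likelihood of the Uniform[0, theta] model for the given samples."""
    if theta <= 0:
        return 0.0
    likelihood = 1.0
    for x in samples:
        # The density is 1/theta on [0, theta] and 0 elsewhere.
        likelihood *= (1.0 / theta) if 0 <= x <= theta else 0.0
    return likelihood

# Hypothetical samples (assumed for illustration only).
samples = [1.4, 2.5, 6.1, 4.9]

for theta in (5, 7, 1e6):
    print(theta, uniform_likelihood(theta, samples))
```

A single observation outside $[0, \theta]$ is enough to force the whole product to zero, which is why values of $\theta$ below the sample maximum are ruled out entirely.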

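As illustrated in this example, likelihood has the property of being zero or small at implausible values of $\theta$, and larger at more reasonable values. Thus we propose the maximum likelihood estimator

$$\widehat{\theta}_{\mathrm{MLE}} = \operatorname{argmax}_{\theta} \mathcal{L}_{\mathbf{X}}(\theta).$$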

Example
Suppose that $x \mapsto f(x; \mu, \sigma^2)$ is the normal density with mean $\mu$ and variance $\sigma^2$. Find the maximum likelihood estimator for $\mu$ and $\sigma^2$.

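Solution. The maximum likelihood estimator is the maximizer of the logarithm of the likelihood function, which works out to

$$-\frac{n}{2}\log 2\pi - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2,$$

since $\log f(X_i; \mu, \sigma^2) = -\frac{1}{2}\log 2\pi - \log\sigma - \frac{(X_i - \mu)^2}{2\sigma^2}$, for each $i$.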

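Setting the derivatives with respect to $\sigma$ and $\mu$ equal to zero, we find

$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (X_i - \mu)^2 = 0, \qquad \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = 0,$$

which implies $\mu = \overline{X}$ (from solving the second equation) as well as $\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X})^2$ (from solving the first equation). Since there's only one critical point, and since we can observe that the log likelihood goes to $-\infty$ as $\sigma \to 0$, as $\sigma \to \infty$, or as $|\mu| \to \infty$, there must be a global maximum at this critical point.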

So we may conclude that the maximum likelihood estimator agrees with the plug-in estimator for $\mu$ and $\sigma^2$.
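The conclusion of this example can be checked numerically. The sketch below uses synthetic data (50 draws from a normal distribution with mean 170 and standard deviation 10, an assumption made purely for illustration) and confirms that moving either parameter away from the sample mean or the plug-in standard deviation lowers the log likelihood.

```python
import math
import random

def gaussian_log_likelihood(mu, sigma, xs):
    """Log likelihood of N(mu, sigma^2) for observations xs."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

random.seed(0)
# Synthetic data (assumed for illustration): 50 draws from N(170, 10^2).
xs = [random.gauss(170, 10) for _ in range(50)]

# Closed-form candidates: sample mean and plug-in variance.
mu_hat = sum(xs) / len(xs)
var_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)
sigma_hat = math.sqrt(var_hat)

best = gaussian_log_likelihood(mu_hat, sigma_hat, xs)

# Nudging either parameter away from the closed-form values should
# never increase the log likelihood.
for d_mu in (-1.0, -0.1, 0.1, 1.0):
    assert gaussian_log_likelihood(mu_hat + d_mu, sigma_hat, xs) < best
for d_sigma in (-1.0, -0.1, 0.1, 1.0):
    assert gaussian_log_likelihood(mu_hat, sigma_hat + d_sigma, xs) < best

print(mu_hat, var_hat, best)
```

A handful of perturbations is of course not a proof; it is only a sanity check on the calculus.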

Exercise
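Consider a Poisson random variable $X$ with parameter $\lambda$. In other words, $\mathbb{P}(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0, 1, 2, \ldots$.

Verify that, for independent observations $X_1, \ldots, X_n$ of $X$,

$$\log \mathcal{L}(\lambda) = \log(\lambda)\sum_{i=1}^n X_i - n\lambda - \sum_{i=1}^n \log(X_i!).$$

Show that it follows that the maximum likelihood estimator $\widehat{\lambda}$ is equal to the sample mean $\overline{X}$, and explain why this makes sense intuitively.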

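Solution. When we take the derivative with respect to $\lambda$ and set it equal to zero, we get

$$\frac{1}{\lambda}\sum_{i=1}^n X_i - n = 0,$$

which gives us $\widehat{\lambda} = \frac{1}{n}\sum_{i=1}^n X_i = \overline{X}$, the sample mean. This makes sense intuitively because $\lambda$ is the mean of the Poisson distribution.

Taking a second derivative gives $-\frac{1}{\lambda^2}\sum_{i=1}^n X_i$. Since this quantity is everywhere negative (provided at least one observation is nonzero), the log likelihood is concave. Therefore, the log likelihood has a local maximum at the critical point $\overline{X}$, and that local maximum is also a global maximum.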
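As a quick numerical check of this solution, the following sketch evaluates the Poisson log likelihood on a grid of candidate parameter values. The counts are hypothetical, and `math.lgamma(x + 1)` is used to compute $\log(x!)$.

```python
import math

def poisson_log_likelihood(lam, xs):
    """Log likelihood of a Poisson(lam) model for nonnegative integer counts xs."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Hypothetical counts (assumed for illustration only).
xs = [2, 0, 3, 1, 4, 2, 2]
lam_hat = sum(xs) / len(xs)  # sample mean, the claimed MLE

# Compare against a grid of candidate values of lambda from 0.1 to 10.0.
grid = [0.1 * k for k in range(1, 101)]
best_on_grid = max(grid, key=lambda lam: poisson_log_likelihood(lam, xs))

print(lam_hat, best_on_grid)
```

On this data the grid maximizer coincides with the sample mean, as the derivative calculation predicts.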

Example
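Suppose $Y_i = \beta X_i + \epsilon_i$ for $i = 1, 2, \ldots, n$, where $\epsilon_i$ has distribution $\mathcal{N}(0, \sigma^2)$. Treat $\sigma$ as known and $\beta$ as the only unknown parameter. Suppose that $n$ observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ are made.

Show that the least squares estimator for $\beta$ is the same as the MLE for $\beta$ by making observations about your log likelihood.

Solution. The log likelihood is

$$\log \mathcal{L}(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta X_i)^2.$$

The only term that depends on $\beta$ is the second one, so maximizing the log likelihood is the same as maximizing $-\sum_{i=1}^n (Y_i - \beta X_i)^2$, which in turn is the same as minimizing $\sum_{i=1}^n (Y_i - \beta X_i)^2$.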
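The equivalence can be illustrated with a small sketch. It assumes the model $Y_i = \beta X_i + \epsilon_i$ with no intercept and a known noise level; the data values, the value of `sigma`, and the function names are hypothetical.

```python
import math

def sum_of_squares(beta, xs, ys):
    """Sum of squared residuals for the line y = beta * x."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys))

def regression_log_likelihood(beta, xs, ys, sigma):
    """Log likelihood of beta under Y_i = beta * X_i + N(0, sigma^2) noise."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum_of_squares(beta, xs, ys) / (2 * sigma ** 2))

# Hypothetical observations (assumed for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sigma = 0.5  # treated as known

# Least squares estimator for a line through the origin.
beta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The log likelihood is a constant minus SSE / (2 sigma^2), so the candidate
# with the largest log likelihood is the one with the smallest SSE.
candidates = [beta_ls + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
by_likelihood = max(candidates, key=lambda b: regression_log_likelihood(b, xs, ys, sigma))
by_squares = min(candidates, key=lambda b: sum_of_squares(b, xs, ys))

print(beta_ls, by_likelihood, by_squares)
```

Changing `sigma` rescales the log likelihood but does not change which candidate wins, which is why the noise level can be treated as known without affecting the estimate of $\beta$.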

Exercise
(a) Consider the family of distributions which are uniform on $[0, b]$, where $b \in (0, \infty)$. Explain why the MLE for the distribution maximum $b$ is the sample maximum.

(b) Show that the MLE for a Bernoulli distribution with parameter $p$ is the empirical success rate $\frac{1}{n}\sum_{i=1}^n X_i$.

Solution. (a) The likelihood associated with any value of $b$ smaller than the sample maximum is zero, since at least one of the density values is zero in that case. The likelihood is a decreasing function of $b$ as $b$ ranges from the sample maximum to $\infty$, since it's equal to $1/b^n$. Therefore, the maximal value is at the sample maximum.

(b) The derivative of the log likelihood function is

$$\frac{d}{dp}\left[s\log p + (n - s)\log(1 - p)\right] = \frac{s}{p} - \frac{n - s}{1 - p},$$

where $s$ is the number of successes. Setting the derivative equal to zero and solving for $p$, we find $p = s/n$.
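Part (b) can be checked with a few lines of Python. The outcomes below are hypothetical (7 successes in 10 trials), and `score` is simply a name chosen here for the derivative of the log likelihood.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """Log likelihood of Bernoulli(p) for a list of 0/1 outcomes."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

def score(p, xs):
    """Derivative of the log likelihood with respect to p."""
    s = sum(xs)
    n = len(xs)
    return s / p - (n - s) / (1 - p)

# Hypothetical outcomes (assumed for illustration only): 7 successes in 10 trials.
xs = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
p_hat = sum(xs) / len(xs)  # empirical success rate

# The derivative is positive below p_hat, (numerically) zero at p_hat,
# and negative above it.
print(score(0.5, xs), score(p_hat, xs), score(0.9, xs))
```

The sign change of the derivative at the empirical success rate is what identifies it as the maximizer.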

Properties of the Maximum Likelihood Estimator

The MLE enjoys several nice properties: under certain regularity conditions, we have

Consistency: $\mathbb{E}[(\widehat{\theta}_{\mathrm{MLE}} - \theta)^2] \to 0$ as the number of samples goes to $\infty$. In other words, the average squared difference between the maximum likelihood estimator and the parameter it's estimating converges to zero.
Asymptotic normality: $(\widehat{\theta}_{\mathrm{MLE}} - \theta)/\sqrt{\operatorname{Var}(\widehat{\theta}_{\mathrm{MLE}})}$ converges in distribution to $\mathcal{N}(0, 1)$ as the number of samples goes to $\infty$. This means that we can calculate good confidence intervals for the maximum likelihood estimator, assuming we can accurately approximate its mean and variance.
Asymptotic optimality: the MSE of the MLE converges to 0 approximately as fast as the MSE of any other consistent estimator. Thus the MLE is not wasteful in its use of data to produce an estimate.
Equivariance: Suppose $\widehat{\theta}$ is the MLE of $\theta$, and let $g$ be a function of the parameter. Then the MLE for $g(\theta)$ is $g(\widehat{\theta})$. This is a useful property: a transformation of the parameter of interest (say, shifting the mean of a normal distribution by a number, or taking the square of the standard deviation) is not an inconvenience, because we can simply apply the same transformation to the MLE.
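Consistency can be illustrated by simulation. The sketch below is a Monte Carlo experiment under assumed settings (Gaussian data with mean 2 and standard deviation 1, 200 repetitions per sample size); it estimates the mean squared error of the MLE of the mean, which is the sample mean, for increasing sample sizes.

```python
import random

def mse_of_mean_mle(n, reps, mu=2.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of E[(mu_hat - mu)^2] for the Gaussian mean MLE."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = sum(sample) / n  # the MLE of the mean is the sample mean
        total += (mu_hat - mu) ** 2
    return total / reps

for n in (10, 100, 1000):
    print(n, mse_of_mean_mle(n, reps=200))
```

For this estimator the exact mean squared error is $\sigma^2/n$, so the printed values should shrink roughly tenfold at each step, up to simulation noise.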

Example
Show that the plug-in variance estimator for a sequence of $n$ i.i.d. samples from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ converges to $\sigma^2$ as $n \to \infty$.

Solution. We've seen that the plug-in variance estimator is the maximum likelihood estimator for variance. Therefore, it converges to $\sigma^2$ by MLE consistency.

Exercise
Show that it is not possible to estimate the mean of a distribution in a way that converges to the true mean at a rate asymptotically faster than $1/\sqrt{n}$, where $n$ is the number of observations.

Solution. For a Gaussian family, the sample mean is the maximum likelihood estimator of the mean, and it converges to the mean at a rate proportional to the inverse square root of the number of observations. By the asymptotic optimality of the MLE, there is not another estimator which converges with an asymptotic rate faster than that.

Drawbacks of maximum likelihood estimation

The maximum likelihood estimator is not a panacea. We've already seen that the maximum likelihood estimator can be biased (the sample maximum for the family of uniform distributions on $[0, b]$, where $b \in (0, \infty)$). There are several other issues that can arise when maximizing likelihoods.

Computational difficulties. It might be difficult to work out where the maximum of the likelihood occurs, either analytically or numerically. This would be a particular concern in high dimensions (that is, if we have many parameters) and if the likelihood function is not concave.
Misspecification. The MLE may be inaccurate if the distribution of the observations is not in the specified parametric family. For example, if we assume the underlying distribution is Gaussian, when in fact its shape is not even close to that of a Gaussian, we very well might get unreasonable results.
Unbounded likelihood. If the likelihood function is not bounded, then $\widehat{\theta}_{\mathrm{MLE}}$ is not even defined:

Exercise
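Consider the family of distributions on $\mathbb{R}$ given by the set of density functions

$$\gamma \mathbf{1}_{[a, b]} + \delta \mathbf{1}_{[c, d]},$$

where $a < b < c < d$, and where $\gamma$ and $\delta$ are nonnegative real numbers such that $\gamma(b - a) + \delta(d - c) = 1$. Show that the likelihood function has no maximum for this family of functions.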


Solution. We identify the largest value in our data set and choose $c$ to be less than that value and $d$ to be more than it. We choose $a$ and $b$ so that the interval $[a, b]$ contains all of the other observations (since otherwise we would get a likelihood value of zero). Then we can send $d - c$ to zero while holding $a$, $b$, and $\gamma$ fixed. That sends $\delta$ to $\infty$, which in turn causes the likelihood to grow without bound.
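The argument in this solution can be watched in action. The sketch below assumes the density takes the value $\gamma$ on $[a, b]$ and $\delta$ on $[c, d]$, with $\delta$ determined by the requirement that the density integrates to 1. The data values and the function name `two_box_likelihood` are hypothetical.

```python
def two_box_likelihood(xs, a, b, c, d, gamma):
    """Likelihood under the density gamma on [a, b] plus delta on [c, d]."""
    # delta is determined by the requirement that the density integrates to 1.
    delta = (1.0 - gamma * (b - a)) / (d - c)
    likelihood = 1.0
    for x in xs:
        if a <= x <= b:
            likelihood *= gamma
        elif c <= x <= d:
            likelihood *= delta
        else:
            likelihood *= 0.0  # the density vanishes outside both intervals
    return likelihood

# Hypothetical data (assumed for illustration only); the largest value is 5.0.
xs = [1.2, 2.0, 2.7, 3.1, 5.0]
a, b, gamma = 1.0, 4.0, 0.2  # gamma * (b - a) = 0.6 < 1, so delta > 0

# Shrink [c, d] around the largest observation while a, b, gamma stay fixed.
for width in (1.0, 0.1, 0.01, 0.001):
    c, d = 5.0 - width / 2, 5.0 + width / 2
    print(width, two_box_likelihood(xs, a, b, c, d, gamma))
```

Each tenfold reduction in the width of $[c, d]$ multiplies the likelihood by roughly ten, so no choice of parameters attains a maximum.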

One further disadvantage of the maximum likelihood estimator is that it doesn't provide a smooth mechanism to account for prior knowledge. For example, if we flip a coin twice and see heads both times, the maximum likelihood estimate of the coin's heads probability is 100%, yet our (real-world) belief would still be that it's about 50%. Only once we saw quite a few heads in a row would we begin to use that as evidence to move the needle on our strong prior belief that coins encountered in daily life are not heavily weighted to one side or the other.

Bayesian statistics provides an alternative framework which addresses this shortcoming of maximum likelihood estimation.
