Maximum Likelihood Estimation

Chelsea Zou
7 min read · Dec 15, 2024


You’re thrown a bunch of data, and you can’t help but notice that it appears to follow a certain distribution. You aren’t sure what the specific parameters of the distribution are, but you think to yourself just how cool it would be to find out. That’s where Maximum Likelihood Estimation (MLE) comes in — a powerful statistical method that allows you to infer the parameters of a distribution. The term comes up quite often in machine learning (ML) and is an important concept for understanding exactly how an ML model learns from data. At a high level, the goal of MLE is to estimate the parameters of an assumed distribution that maximize the probability of your data. For example, if I believe my data follows a normal distribution, then I can derive estimates for the most likely values of its mean and standard deviation. Similarly, if my data comes from a Poisson distribution, then I can find the most likely estimate for lambda, the rate of occurrence.

At the core of MLE is the likelihood function. What it all really boils down to is probability density/mass functions (if you haven’t read my article on them yet, you should; it’s an important prerequisite to understanding MLE). The likelihood function is defined as the product of the individual density functions of each of your data points. MLE aims to maximize this likelihood function by inferring the optimal parameters based on the data.

The MLE for our parameter θ is the estimate that maximizes the probability of observing our datapoints x:

θ̂ = argmax_θ L(θ) = argmax_θ ∏ᵢ f(xᵢ; θ)

Suppose we have a ton of data, and the density function of each datapoint spits out a value between 0 and 1. Do you see what problem we might encounter? Multiplying a bunch of numbers < 1 drives the product toward 0 as n → infinity, and on a computer it will eventually underflow to exactly 0. So the first step is a clever trick to convert the product into a sum by taking the log of the likelihood function. Thanks to log properties, the product of n functions is now equivalent to the sum of the logs of those n functions, which prevents the likelihood function from collapsing into a vanishingly tiny number.
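Here’s a quick numerical sketch of why the log trick matters, using made-up per-point probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy example: 10,000 per-point density values, each between 0 and 1
probs = rng.uniform(0.1, 0.9, size=10_000)

naive_likelihood = np.prod(probs)       # underflows to exactly 0.0
log_likelihood = np.sum(np.log(probs))  # stays a perfectly usable number

print(naive_likelihood)  # 0.0
print(log_likelihood)    # a large negative, but finite, number
```

The raw product is useless (it underflows to 0, and its derivative is 0 too), while the log likelihood remains a well-behaved quantity we can differentiate and optimize.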

Next, we take the derivative of this function with respect to the parameters, set it to 0, and solve for the parameters. Doing this allows us to find the values that maximize the likelihood function, making the observed data most probable under the model. Additionally, we can check that the second derivative is ≤ 0 to ensure it is at least a local maximum. And seriously, that’s pretty much it to MLE…
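For a concrete example, the normal distribution has closed-form MLE solutions that fall out of exactly this derivative-setting procedure: the sample mean for μ, and the root of the mean squared deviation (dividing by n, not n − 1) for σ. A quick sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated data from a normal with known "true" parameters
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

# Closed-form MLE: set d(log L)/dmu = 0 and d(log L)/dsigma = 0, then solve
mu_hat = data.mean()                                 # MLE for the mean
sigma_hat = np.sqrt(((data - mu_hat) ** 2).mean())   # MLE for the std (note: /n, not /(n-1))

print(mu_hat, sigma_hat)  # close to the true 5.0 and 2.0
```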

Really Important Example: MLE in Neural Networks

We’ve talked a bit about the statistics. Cool, but how does this exactly tie into the basic concepts in ML we are familiar with? The answer? Loss functions. Let’s look at an example.

Say you want to train a neural network to classify if something is in class 1 or class 0. This is a binary classification problem, in which the typical loss function we would use is the binary cross entropy (BCE) loss. Where does MLE come into play? Well, it turns out that the BCE loss function is essentially a negative log likelihood function. Specifically, maximizing the log likelihood is exactly equivalent to minimizing the BCE loss. Let’s see how that works.

Here is the standard BCE loss function. Keep this in mind as we work through the following derivation:

BCE = −(1/N) ∑ᵢ [yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ)]

First we can ask ourselves, what random variable (RV) comes to mind when we think of having data that either belongs to a certain class or not? A Bernoulli RV should ring a bell, where a “success” with probability p would correspond to class 1, and a “failure” with probability 1 − p would correspond to class 0. So BCE is essentially derived from the Bernoulli distribution. Recall that the probability mass function (PMF) of a Bernoulli RV takes the form pʸ (1 − p)¹⁻ʸ for y ∈ {0, 1}. If y = 1, the PMF reduces to just p, and if y = 0, it reduces to 1 − p. Keep in mind that the goal of MLE in the case of Bernoulli RVs is to estimate the unknown parameter p, based on our data.
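As a quick sanity check (with a made-up batch of labels), the MLE for the Bernoulli parameter p works out to the sample mean, and the log likelihood really is highest there:

```python
import numpy as np

# Made-up 0/1 labels: 7 successes out of 10
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# The Bernoulli MLE: p_hat = sample mean (falls out of the derivative derivation)
p_hat = y.mean()  # 0.7

def log_lik(p):
    # Log likelihood of the data under Bernoulli(p)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(p_hat)                        # 0.7
print(log_lik(0.7) > log_lik(0.5))  # True: p_hat beats other candidates
print(log_lik(0.7) > log_lik(0.9))  # True
```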

Alright, into ML territory. We have some data X and its true labels y ∈ {0, 1}. We want to predict ŷ, which for a standard linear model is a function of our weight and bias parameters w and b, squashed through a sigmoid σ so the output lands in (0, 1) and can act as a probability: ŷ = σ(wX + b). We want to find the optimal parameters w and b to best fit our input X and predict the true y such that y − ŷ is small. For simplicity, let’s ignore separately computing w and b and focus on estimating just one parameter: ŷ (since it is a function of our w and b).

Now, back to our Bernoulli PMF. Replace our target estimation parameter p with ŷ so that our PMF becomes ŷʸ (1 - ŷ)¹⁻ʸ. As stated above, the likelihood, denoted L, is a product of the PMFs over all of our data points:

(1) L = ∏ᵢ ŷᵢ^yᵢ · (1−ŷᵢ)^(1−yᵢ)

Taking the log of the likelihood turns our products into a sum:

(2) log L = ∑ᵢ log [ŷᵢ^yᵢ · (1−ŷᵢ)^(1−yᵢ)]

Using log rules, we get:

(3) log L = ∑ᵢ [yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ)]

Does this look familiar yet? Take a look at our BCE loss from above: it’s exactly the negative of this log likelihood (averaged over the N data points)! In other words, BCE is the negative log likelihood.

Alright, what’s with the negative though? Remember that we want to find the parameter that maximizes the log likelihood function (i.e., the argmax over ŷ). In theory, recall that we would need to take the derivative of that function with respect to ŷ, set it to 0, and solve for our parameter. However, directly solving this equation is usually not possible for most ML models, which are complex and high-dimensional. So in practice, we instead use gradient descent to iteratively minimize the negative log likelihood function. TLDR: maximizing the likelihood is equivalent to minimizing the negative log likelihood, which is just computationally easier to do in real settings. Feels good when it all comes together now, doesn’t it?
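To make this concrete, here’s a minimal sketch (with made-up 1-D toy data) of a logistic regression trained by gradient descent on the negative log likelihood, which is exactly the BCE loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: class 1 tends to have larger x
n = 1000
x = rng.normal(size=n)
y = (x + 0.5 * rng.normal(size=n) > 0).astype(float)

w, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    y_hat = 1 / (1 + np.exp(-(w * x + b)))  # sigmoid keeps y_hat in (0, 1)
    # Gradients of the mean negative log likelihood (= BCE loss) wrt w and b
    grad_w = np.mean((y_hat - y) * x)
    grad_b = np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b

bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
acc = np.mean((y_hat > 0.5) == y)
print(bce, acc)  # loss well below the initial log(2) ~ 0.693; accuracy well above chance
```

Each gradient step nudges w and b toward the maximum likelihood solution, even though no closed-form answer exists here.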

KL Divergence

Another term that might come up whilst dabbling in MLE is the KL divergence. The KL divergence measures how far apart two probability distributions P and Q are. Here is the formula:

D_KL(P ‖ Q) = E_{x∼P}[log(P(x) / Q(x))]

This looks like it could relate very well to MLE. It turns out that minimizing the KL divergence is also equivalent to maximizing the log likelihood function which, as stated before, is equal to minimizing the negative log likelihood function. Here, P(x) represents the true distribution of our data, and Q(x) represents our estimated distribution according to our parameter(s) θ. Expanding the expectation gives E[log P(x)] − E[log Q(x)]. Since we do not have access to the true distribution P(x), and the first term does not depend on our parameters anyway, we can drop it when optimizing over θ, leaving us with maximizing E[log Q(x)], the expected log likelihood. That gives us the equivalence.

Note the expected value E notation above is the same as ∑ₓ P(x) log(P(x)/Q(x)).
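Here’s a tiny numeric sketch (with made-up distributions P and Q over three outcomes) showing the KL divergence and its split into cross-entropy minus entropy, which is where the link to the negative log likelihood comes from:

```python
import numpy as np

# Made-up discrete distributions over the same 3 outcomes
P = np.array([0.5, 0.3, 0.2])  # "true" distribution of the data
Q = np.array([0.4, 0.4, 0.2])  # model's estimated distribution

kl = np.sum(P * np.log(P / Q))          # D_KL(P || Q)
cross_entropy = -np.sum(P * np.log(Q))  # expected negative log likelihood under Q
entropy = -np.sum(P * np.log(P))        # fixed: does not depend on our model Q

print(kl)                                       # small positive number
print(np.isclose(kl, cross_entropy - entropy))  # True: KL = cross-entropy - entropy
```

Since the entropy term is fixed by the data, minimizing the cross-entropy (the expected negative log likelihood) over the model’s parameters is the same as minimizing the KL divergence.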

MLE Properties

Ok so this all seems pretty nice, but you might be thinking, how do we even know that it works? There are a few properties of MLE that render it optimal. There exist formal proofs for each of them, but it’s enough to just state the overall intuition for now.

First, as the sample size n of our data increases, the MLE estimate becomes more accurate and gets closer to the true parameter value. This is called the consistency property and comes from the Law of Large Numbers.

Second, MLE tends to produce estimates that have the lowest possible variance, which is good (imagine an estimator that is highly volatile, sometimes outputting really large values and at other times really small ones; we would probably be very suspicious and trust it way less). This property is called efficiency, and is shown by achieving something called the Cramér-Rao Lower Bound, which is the lowest possible variance that any unbiased estimator can achieve (unbiased meaning that the expected value of the estimate is equal to the true parameter value).

The final property is that as the sample size increases, the distribution of the MLE estimate becomes approximately Normal. This property is called asymptotic normality and comes from the Central Limit Theorem. It is useful because it allows us to construct confidence intervals and perform hypothesis testing for statistical analysis.
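Consistency is easy to see in simulation. Here’s a sketch (made-up Bernoulli data) showing the spread of the MLE shrinking as n grows:

```python
import numpy as np

rng = np.random.default_rng(7)
true_p = 0.3  # the true Bernoulli parameter we are trying to recover

# For each sample size, compute the MLE (the sample mean) over 200 repeated draws
spreads = {}
for n in [10, 1_000, 100_000]:
    estimates = [rng.binomial(1, true_p, size=n).mean() for _ in range(200)]
    spreads[n] = np.std(estimates)
    print(n, round(float(np.mean(estimates)), 4), round(float(spreads[n]), 4))
# The standard deviation of the estimates shrinks steadily as n increases,
# and their average stays centered on true_p
```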

MLE is one of those ideas that just feels pretty neat to understand. If you don’t get too lost in the statistical details, it’s a relatively simple concept. You start with data, some assumptions about its distribution, and a nice technique to uncover its parameters. While there are other ways to compute estimators such as Method of Moments (which I may or may not write about in another article), MLE is just superior tbh.


Written by Chelsea Zou

ML @ Stanford | Dabbler in science, tech, maths, philosophy, neuroscience, and AI | http://bosonphoton.github.io
