The von Mises–Fisher (vMF) distribution is often regarded as the hyperspherical analogue of the normal distribution, as it models data concentrated around a mean direction on the unit hypersphere. In Euclidean space, a normal distribution is typically parameterized by a mean $\mu$ and variance $\sigma^2$, and in one dimension its density forms the familiar bell-shaped curve. Just as the normal distribution has maximum entropy among distributions with given mean and variance, the vMF distribution has maximum entropy on the sphere given a constraint on the average direction. By an analogue of the Central Limit Theorem, the limiting distribution of a random walk or accumulated direction vector on a sphere also tends towards a vMF distribution.
Basic Concepts
Domain: the $(d-1)$-dimensional unit sphere $S^{d-1} = \{ x \in \mathbb{R}^d : \|x\| = 1 \}$
Parameters:
- Mean direction $\mu \in S^{d-1}$ (i.e., $\|\mu\| = 1$)
- Concentration parameter $\kappa \ge 0$
Probability density function (with respect to the surface measure on the sphere):
$$ f(x; \mu, \kappa) = C_d(\kappa)\, \exp\!\left( \kappa\, \mu^\top x \right) $$
where
$$ C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)} $$
and $I_\nu$ is the modified Bessel function of the first kind.
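As a quick numerical sketch of the density above (assuming NumPy and SciPy are available; the function names `log_C` and `vmf_logpdf` are ours, not from any library):

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def log_C(d, kappa):
    """Log normalizing constant log C_d(kappa) of the vMF density."""
    return (d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
        - np.log(iv(d / 2 - 1, kappa))

def vmf_logpdf(x, mu, kappa):
    """Log density of vMF(mu, kappa) at a unit vector x (w.r.t. surface measure)."""
    d = len(mu)
    return log_C(d, kappa) + kappa * np.dot(mu, x)

# Sanity check: for d = 3 the constant reduces to kappa / (4*pi*sinh(kappa)).
kappa = 2.5
print(np.exp(log_C(3, kappa)), kappa / (4 * np.pi * np.sinh(kappa)))
```

Working in log space avoids overflow of $I_\nu(\kappa)$ for large $\kappa$; for very large $\kappa$ one would switch to `scipy.special.ive` (the exponentially scaled Bessel function).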
Intuitive Understanding
- $\kappa = 0$: uniform distribution on the sphere (no directional preference)
- $\kappa > 0$: distribution concentrated around $\mu$, with larger $\kappa$ implying higher concentration
- $\kappa \to \infty$: tends to a point mass at $\mu$
Thus, the vMF distribution provides an analogy:
- In $\mathbb{R}^d$, the normal distribution controls concentration via the precision matrix $\Sigma^{-1}$
- On the sphere, the vMF distribution controls concentration around $\mu$ via $\kappa$
Properties
Exponential Family Property
The vMF distribution belongs to the exponential family.
Its form is:
$$ f(x; \theta) = \exp\!\left( \theta^\top x - A(\theta) \right), \qquad \theta = \kappa \mu $$
with natural parameter $\theta = \kappa\mu$ and log-partition function $A(\theta) = -\log C_d(\|\theta\|)$.
Hence, vMF enjoys standard exponential family properties:
- Existence of a sufficient statistic
- Simple maximum likelihood estimation
- Convenient for use in EM algorithms
Sufficient statistic: $T(x) = x$ (for a sample, $\sum_{i=1}^n x_i$)
First Moment (Mean)
$$ \mathbb{E}[x] = A_d(\kappa)\, \mu $$
where
$$ A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)} $$
The function $A_d(\kappa)$ is called the mean resultant length function.
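A minimal sketch of $A_d(\kappa)$ (assuming SciPy; the function name `A` is ours), checked against the closed form available for $d = 3$:

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    """Mean resultant length A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)."""
    return iv(d / 2, kappa) / iv(d / 2 - 1, kappa)

# For d = 3 there is a closed form: A_3(kappa) = coth(kappa) - 1/kappa.
k = 4.0
print(A(3, k), 1 / np.tanh(k) - 1 / k)
# A_d increases monotonically from 0 (kappa -> 0) towards 1 (kappa -> infinity).
print(A(10, 1.0), A(10, 5.0), A(10, 50.0))
```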
Second Moment
$$ \mathbb{E}[x x^\top] = \frac{A_d(\kappa)}{\kappa}\left( I - \mu\mu^\top \right) + \left( 1 - \frac{(d-1)\, A_d(\kappa)}{\kappa} \right) \mu\mu^\top $$
Rotation Invariance
For any orthogonal matrix $R$,
if $x \sim \mathrm{vMF}(\mu, \kappa)$,
then $Rx \sim \mathrm{vMF}(R\mu, \kappa)$.
This shows that the vMF family is closed under (equivariant with respect to) the rotation group.
Information Geometry Properties
The family of vMF distributions forms an information-geometric manifold.
Fisher information: with respect to $\kappa$ (for fixed $\mu$),
$$ \mathcal{I}(\kappa) = A_d'(\kappa) = 1 - A_d(\kappa)^2 - \frac{d-1}{\kappa}\, A_d(\kappa), $$
and with respect to $\mu$ (restricted to the tangent space) it is $\kappa\, A_d(\kappa)\, I_{d-1}$.
This allows the use of natural gradient and information-geometric optimization.
Entropy
The entropy of the vMF distribution is:
$$ H = -\log C_d(\kappa) - \kappa\, A_d(\kappa) $$
Property: larger $\kappa$ gives smaller entropy.
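A short numerical check of the entropy formula and its monotonicity (assuming SciPy; the helper names are ours). As $\kappa \to 0$ the entropy should approach the log of the sphere's surface area, $\log(4\pi)$ for $d = 3$:

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d / 2, kappa) / iv(d / 2 - 1, kappa)

def log_C(d, kappa):
    return (d / 2 - 1) * np.log(kappa) - (d / 2) * np.log(2 * np.pi) \
        - np.log(iv(d / 2 - 1, kappa))

def entropy(d, kappa):
    """H = -log C_d(kappa) - kappa * A_d(kappa)."""
    return -log_C(d, kappa) - kappa * A(d, kappa)

# Entropy decreases as kappa grows; near kappa = 0 it approaches log(4*pi).
print([entropy(3, k) for k in (0.1, 1.0, 10.0)], np.log(4 * np.pi))
```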
Spherical Harmonic Expansion
On $S^2$ (the case $d = 3$), the vMF density can be expanded as:
$$ e^{\kappa\, \mu^\top x} = \sqrt{\frac{\pi}{2\kappa}} \sum_{\ell=0}^{\infty} (2\ell+1)\, I_{\ell+1/2}(\kappa)\, P_\ell(\mu^\top x), \qquad P_\ell(\mu^\top x) = \frac{4\pi}{2\ell+1} \sum_{m=-\ell}^{\ell} Y_{\ell m}(x)\, \overline{Y_{\ell m}(\mu)}, $$
where $Y_{\ell m}$ are spherical harmonics and $P_\ell$ are Legendre polynomials.
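The Legendre form of this expansion (the modified plane-wave expansion for $d = 3$) can be verified numerically; a sketch assuming SciPy:

```python
import numpy as np
from scipy.special import iv, eval_legendre

# Partial sum of the Legendre expansion of exp(kappa * cos(gamma)) on S^2.
kappa, cos_gamma = 2.0, 0.5
total = sum(
    (2 * ell + 1) * iv(ell + 0.5, kappa) * eval_legendre(ell, cos_gamma)
    for ell in range(30)
)
total *= np.sqrt(np.pi / (2 * kappa))
print(total, np.exp(kappa * cos_gamma))  # the two should agree
```

The Bessel coefficients decay rapidly in $\ell$, so a few dozen terms already give machine-precision agreement for moderate $\kappa$.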
Spherical Cap
In the von Mises–Fisher distribution, probability mass is mainly concentrated around the mean direction $\mu$.
Hence, a spherical cap is commonly used to describe confidence regions.
Let $x \in S^{d-1}$, $\mu \in S^{d-1}$, and let $\theta$ be the angle between the vectors:
$$ \cos\theta = \mu^\top x $$
A confidence region can be written as
$$ C(\theta_0) = \left\{ x \in S^{d-1} : \mu^\top x \ge \cos\theta_0 \right\}, $$
i.e., a spherical cap with axis $\mu$ and angular radius $\theta_0$.
On the unit sphere $S^{d-1}$,
the area of a spherical cap of angular radius $\theta \le \pi/2$ is:
$$ A_{\mathrm{cap}}(\theta) = \frac{1}{2}\, A_d\, I_{\sin^2\theta}\!\left( \frac{d-1}{2}, \frac{1}{2} \right) $$
where $A_d$ is the total surface area of the sphere, given by:
$$ A_d = \frac{2\pi^{d/2}}{\Gamma(d/2)} $$
and $I_x(a, b)$ is the regularized incomplete beta function.
For $d = 3$:
Cap area: $2\pi(1 - \cos\theta)$
Total sphere area: $4\pi$
Area proportion: $\frac{1 - \cos\theta}{2}$.
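A sketch of the cap-area formula using SciPy's regularized incomplete beta function (the helper names `sphere_area` and `cap_area` are ours), checked against the elementary $d = 3$ result:

```python
import numpy as np
from scipy.special import betainc, gamma

def sphere_area(d):
    """Surface area of the unit sphere S^{d-1} in R^d."""
    return 2 * np.pi ** (d / 2) / gamma(d / 2)

def cap_area(d, theta):
    """Area of a spherical cap of angular radius theta <= pi/2 on S^{d-1}."""
    return 0.5 * sphere_area(d) * betainc((d - 1) / 2, 0.5, np.sin(theta) ** 2)

# d = 3 sanity check against the elementary formula 2*pi*(1 - cos(theta)).
theta = 0.7
print(cap_area(3, theta), 2 * np.pi * (1 - np.cos(theta)))
```

At $\theta = \pi/2$ the cap is a hemisphere, which gives a second easy sanity check.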
In high dimensions, the probability within a spherical cap for a vMF distribution is:
$$ P\!\left( \mu^\top x \ge \cos\theta_0 \right) = \int_{\{\mu^\top x \ge \cos\theta_0\}} C_d(\kappa)\, e^{\kappa \mu^\top x}\, dx $$
There is no simple closed-form solution, so approximations are commonly used.
When $\kappa$ is large, the distribution is concentrated around $\mu$. Using
$$ \cos\theta \approx 1 - \frac{\theta^2}{2}, $$
we obtain
$$ e^{\kappa \mu^\top x} = e^{\kappa \cos\theta} \approx e^{\kappa}\, e^{-\kappa \theta^2 / 2}. $$
Thus, locally, the vMF distribution approximates a Gaussian distribution:
$$ \theta \sim \mathcal{N}\!\left( 0, \tfrac{1}{\kappa} \right) \quad \text{(per tangent direction)} $$
If the confidence probability is $p$, then
$$ \theta_p \approx \frac{z_p}{\sqrt{\kappa}} $$
where $z_p$ is the corresponding quantile of the standard normal distribution.
Examples:
| Confidence | $z_p$ |
|---|---|
| 90% | 1.64 |
| 95% | 1.96 |
| 99% | 2.58 |
Thus, the 95% confidence cone angle is:
$$ \theta_{95\%} \approx \frac{1.96}{\sqrt{\kappa}} $$
In high dimensions ($d \gg 1$), we have
$$ \kappa\, \theta^2 \;\approx\; \chi^2_{d-1} \quad \text{(approximately, for large } \kappa\text{)}, $$
with
$$ \mathbb{E}\!\left[ \chi^2_{d-1} \right] = d - 1 \approx d. $$
Therefore, the confidence angle satisfies:
$$ \theta \approx \sqrt{\frac{d}{\kappa}} $$
For small angles:
$$ A_{\mathrm{cap}}(\theta) \approx c_d\, \theta^{d-1}, $$
where $c_d$ is a constant depending only on $d$.
Hence, the spherical cap area grows as $\theta^{d-1}$, which is a key reason why high-dimensional sphere volume concentrates near the equator.
In many embedding papers, the following rule of thumb is used:
$$ \theta_{\mathrm{typ}} \approx \sqrt{\frac{d}{\kappa}} $$
Interpretation: most probability mass of the vMF distribution lies in
$$ \left\{ x : \angle(x, \mu) \lesssim \sqrt{\tfrac{d}{\kappa}} \right\} $$
Intuition: $\kappa$ determines the directional noise magnitude:
$$ \sigma_\theta \approx \frac{1}{\sqrt{\kappa}} \ \text{per tangent direction}, \qquad \theta_{\mathrm{typ}} \approx \sqrt{\frac{d}{\kappa}} \ \text{overall}. $$
These formulas are frequently used in sphere embedding, CLIP embedding, contrastive learning, spherical clustering, and directional statistics.
For example, $\sqrt{d/\kappa}$ can be used to estimate: angular noise in embeddings, the angular radius of clusters, and the confidence cone of prototypes.
The probability mass of the vMF distribution is concentrated at
$$ \theta \approx \sqrt{\frac{d}{\kappa}} $$
The corresponding confidence region is a spherical cap with axis $\mu$ and angular radius roughly $\sqrt{d/\kappa}$.
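This concentration claim can be checked empirically by drawing vMF samples. Below is a sketch of the standard Ulrich/Wood rejection sampler (assuming NumPy; `sample_vmf` is our name, not a library function), followed by a count of how much mass falls within the $\sqrt{d/\kappa}$ cone:

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    """Draw n samples from vMF(mu, kappa) via the Ulrich/Wood rejection scheme."""
    d = mu.shape[0]
    b = (-2 * kappa + np.sqrt(4 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (d - 1) * np.log(1 - x0 ** 2)
    out = np.empty((n, d))
    for i in range(n):
        while True:
            z = rng.beta((d - 1) / 2, (d - 1) / 2)
            w = (1 - (1 + b) * z) / (1 - (1 - b) * z)  # cosine of the polar angle
            if kappa * w + (d - 1) * np.log(1 - x0 * w) - c >= np.log(rng.uniform()):
                break
        v = rng.normal(size=d - 1)
        v /= np.linalg.norm(v)
        x = np.concatenate(([w], np.sqrt(1 - w ** 2) * v))  # sample around e1
        u = np.eye(d)[0] - mu  # Householder reflection mapping e1 to mu
        if np.linalg.norm(u) > 1e-12:
            u /= np.linalg.norm(u)
            x = x - 2 * u * (u @ x)
        out[i] = x
    return out

rng = np.random.default_rng(0)
d, kappa = 8, 100.0
mu = np.zeros(d); mu[0] = 1.0
X = sample_vmf(mu, kappa, 2000, rng)
angles = np.arccos(np.clip(X @ mu, -1, 1))
print(np.mean(angles <= np.sqrt(d / kappa)))  # fraction within the typical cone
```

Since $\kappa\theta^2$ is roughly $\chi^2_{d-1}$ here, the printed fraction sits around the $\chi^2_{d-1}$ mass below $d$, i.e. somewhat above one half rather than near one; $\sqrt{d/\kappa}$ marks the *typical* angular scale.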
Parameter Estimation
Estimating $\kappa$ is the most challenging part, as it requires solving an equation involving modified Bessel functions.
Below are several practical approximation algorithms/formulas for estimating the vMF parameter, listed in approximate order of common usage.
Given samples $x_1, \dots, x_n$, each $x_i \in S^{d-1}$.
First compute:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{R} = \|\bar{x}\| $$
Mean direction:
$$ \hat{\mu} = \frac{\bar{x}}{\bar{R}} $$
The exact equation for $\hat{\kappa}$ is:
$$ A_d(\hat{\kappa}) = \bar{R} $$
where
$$ A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)} $$
This involves the modified Bessel function of the first kind, so approximations or numerical solutions are necessary.
Banerjee Approximation (most commonly used)
Source: Arindam Banerjee et al., 2005: "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions"
$$ \hat{\kappa} \approx \frac{\bar{R}\,(d - \bar{R}^2)}{1 - \bar{R}^2} $$
Applicable in high dimensions.
Advantages: extremely simple, $O(1)$ computation, works well in high dimensions. Disadvantages: larger bias in low dimensions.
This is the most common initialization formula for spherical k-means / vMF mixtures.
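The Banerjee estimate is a one-liner; a sketch (the function name `kappa_banerjee` is ours) checked against the exact relation $A_3(\kappa) = \coth\kappa - 1/\kappa$:

```python
import numpy as np

def kappa_banerjee(d, R_bar):
    """Banerjee et al. closed-form approximation for kappa from R_bar = ||mean||."""
    return R_bar * (d - R_bar ** 2) / (1 - R_bar ** 2)

# Feed it the exact R_bar for d = 3, kappa = 10 and see how close it lands.
kappa_true = 10.0
R_bar = 1 / np.tanh(kappa_true) - 1 / kappa_true  # A_3(kappa_true)
print(kappa_banerjee(3, R_bar))  # close to 10, with a few percent bias
```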
Sra Approximation + Newton Refinement (recommended)
Source: Suvrit Sra, 2012: "A Short Note on Parameter Approximation for von Mises-Fisher Distributions"
Initial value:
$$ \kappa_0 = \frac{\bar{R}\,(d - \bar{R}^2)}{1 - \bar{R}^2} $$
Then apply Newton iteration:
$$ \kappa_{t+1} = \kappa_t - \frac{A_d(\kappa_t) - \bar{R}}{1 - A_d(\kappa_t)^2 - \frac{d-1}{\kappa_t} A_d(\kappa_t)} $$
Advantages: highly accurate, only 2–3 iterations needed.
Many libraries use this method.
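A sketch of the Banerjee-initialized Newton refinement (assuming SciPy; `kappa_sra` is our name), using the identity $A_d'(\kappa) = 1 - A_d^2 - \frac{d-1}{\kappa}A_d$ for the derivative:

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d / 2, kappa) / iv(d / 2 - 1, kappa)

def kappa_sra(d, R_bar, iters=5):
    """Banerjee initialization followed by Newton steps on A_d(kappa) = R_bar."""
    k = R_bar * (d - R_bar ** 2) / (1 - R_bar ** 2)
    for _ in range(iters):
        Ak = A(d, k)
        # Newton step using A_d'(kappa) = 1 - A_d^2 - (d-1)/kappa * A_d
        k = k - (Ak - R_bar) / (1 - Ak ** 2 - (d - 1) / k * Ak)
    return k

# Feed in the exact R_bar for d = 16, kappa = 50; Newton should recover it.
d, kappa_true = 16, 50.0
print(kappa_sra(d, A(d, kappa_true)))
```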
Minka Approximation
Source: Thomas Minka
Characteristics: slightly more accurate than Banerjee, still closed-form.
Large-$\kappa$ Asymptotic Approximation
When $\kappa$ is very large:
$$ A_d(\kappa) \approx 1 - \frac{d-1}{2\kappa} $$
Inverting gives:
$$ \hat{\kappa} \approx \frac{d-1}{2\,(1 - \bar{R})} $$
Applicable when embeddings are highly concentrated, e.g., in contrastive learning.
Small-$\kappa$ Approximation
When $\kappa$ is very small:
$$ A_d(\kappa) \approx \frac{\kappa}{d} $$
Thus,
$$ \hat{\kappa} \approx d\, \bar{R} $$
Applicable when data are nearly uniform.
| Estimator | Complexity | Accuracy | Usage |
|---|---|---|---|
| Banerjee | Very low | Medium | Initialization |
| Sra + Newton | Medium | High | Recommended |
| Minka | Low | Medium-high | Fast estimation |
| Large-$\kappa$ | Very low | High (large $\kappa$) | High concentration |
| Small-$\kappa$ | Very low | High (small $\kappa$) | Near uniformity |
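The regime-specific estimators in the table can be compared directly by feeding each one the exact $\bar{R} = A_d(\kappa)$ for a known $\kappa$; a sketch assuming SciPy (all function names are ours):

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d / 2, kappa) / iv(d / 2 - 1, kappa)

def banerjee(d, R):
    return R * (d - R ** 2) / (1 - R ** 2)

def large_kappa(d, R):
    return (d - 1) / (2 * (1 - R))

def small_kappa(d, R):
    return d * R

# Each approximation should be accurate in its own regime.
for d, kappa in [(16, 200.0), (16, 0.05)]:
    R = A(d, kappa)
    print(d, kappa, banerjee(d, R), large_kappa(d, R), small_kappa(d, R))
```

As expected, the large-$\kappa$ formula is tight only for concentrated data and the small-$\kappa$ formula only near uniformity, while Banerjee is a serviceable all-round initializer.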