April 9, 2026
Introduction to the vMF Distribution on the Hypersphere

The von Mises–Fisher (vMF) distribution is often regarded as the hyperspherical analogue of the normal distribution, as it models data concentrated around a mean direction on the unit hypersphere. In Euclidean space, a normal distribution is typically parameterized by a mean $\mu$ and variance $\sigma^2$, and in one dimension its density forms the familiar bell-shaped curve. Just as the normal distribution is the maximum-entropy distribution given mean and variance constraints, the vMF distribution is the maximum-entropy distribution on the sphere given a constraint on the mean direction. By a central-limit-type argument for directional data, the limiting distribution of a random walk or accumulated direction vector on a sphere also tends toward the vMF distribution.

Basic Concepts

Domain: the $(d-1)$-dimensional unit sphere

$$\mathbb{S}^{d-1} = \{ \mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\| = 1 \}$$

Parameters:

  • Mean direction $\boldsymbol{\mu} \in \mathbb{S}^{d-1}$
  • Concentration parameter $\kappa \ge 0$

Probability density function (with respect to the surface measure on the sphere):

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \kappa) = C_d(\kappa) \exp(\kappa \boldsymbol{\mu}^\top \mathbf{x})$$

where

$$C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)}$$

and $I_\nu$ is the modified Bessel function of the first kind.

Intuitive Understanding

  • $\kappa = 0$: uniform distribution on the sphere (no directional preference)
  • $\kappa > 0$: distribution concentrated around $\boldsymbol{\mu}$, with larger $\kappa$ implying higher concentration
  • $\kappa \to \infty$: tends to a point mass at $\boldsymbol{\mu}$

Thus, the vMF distribution provides an analogy:

  • In $\mathbb{R}^d$, the normal distribution controls concentration via the precision matrix (inverse covariance)
  • On the sphere, the vMF distribution controls concentration around $\boldsymbol{\mu}$ via $\kappa$
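
These limiting behaviors are easy to check numerically. The sketch below (assuming NumPy and SciPy; the helper names `log_C` and `vmf_logpdf` are illustrative) verifies that the $d=3$ density integrates to 1 over the sphere and that the density at the mean direction grows with $\kappa$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import iv

def log_C(d, kappa):
    # log normalizing constant C_d(kappa) (density w.r.t. the surface measure)
    return (d/2 - 1)*np.log(kappa) - (d/2)*np.log(2*np.pi) - np.log(iv(d/2 - 1, kappa))

def vmf_logpdf(x, mu, kappa):
    d = len(mu)
    return log_C(d, kappa) + kappa*np.dot(mu, x)

# d = 3: integrate the density over the sphere via the polar angle t
d, kappa = 3, 5.0
total, _ = quad(lambda t: np.exp(log_C(d, kappa) + kappa*np.cos(t)) * 2*np.pi*np.sin(t), 0, np.pi)
print(round(total, 6))  # → 1.0

# density at the mean direction grows with kappa
mu = np.array([0.0, 0.0, 1.0])
print(vmf_logpdf(mu, mu, 1.0) < vmf_logpdf(mu, mu, 10.0))  # → True
```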

Properties

Exponential Family Property

The vMF distribution belongs to the exponential family.

Its form is:

$$f(x \mid \theta) = h(x)\exp(\theta^\top x - A(\theta))$$

with

$$\theta = \kappa \mu$$

Hence, vMF enjoys standard exponential family properties:

  • Existence of a sufficient statistic
  • Simple maximum likelihood estimation
  • Convenient for use in EM algorithms

Sufficient statistic:

$$T(x) = x$$

First Moment (Mean)

$$E[x] = A_d(\kappa)\,\mu$$

where

$$A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$$

The function $A_d(\kappa)$ is called the mean resultant length function.
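
A minimal sketch of $A_d(\kappa)$ (assuming SciPy; the helper name `A` is illustrative): it is increasing in $\kappa$, starts near $\kappa/d$, and approaches 1 as $\kappa \to \infty$.

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    # mean resultant length A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

d = 8
vals = [A(d, k) for k in (0.1, 1.0, 10.0, 100.0)]
print(vals)  # strictly increasing, all in (0, 1)
print(abs(A(d, 0.1) - 0.1/d) < 1e-3)  # small kappa: A_d(kappa) ≈ kappa/d  → True
```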

Second Moment

$$E[xx^\top] = \frac{A_d(\kappa)}{\kappa}\, I + \left(1 - \frac{d\,A_d(\kappa)}{\kappa}\right)\mu\mu^\top$$

Rotation Invariance

For any orthogonal matrix $R$,

if

$$x \sim \text{vMF}(\mu,\kappa)$$

then

$$Rx \sim \text{vMF}(R\mu,\kappa)$$

This shows that the vMF family is closed under the rotation group: rotating the data rotates the mean direction while leaving $\kappa$ unchanged.

Information Geometry Properties

The family of vMF distributions forms an information-geometric manifold.

Fisher information:

$$I(\kappa) = -\frac{\partial^2}{\partial \kappa^2} \log C_d(\kappa)$$

This allows the use of natural gradient and information-geometric optimization.

Entropy

The entropy of the vMF distribution (with respect to the surface measure) is:

$$H = -\log C_d(\kappa) - \kappa A_d(\kappa)$$

Property: larger κ\kappa gives smaller entropy.
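
Since $\log f(x) = \log C_d(\kappa) + \kappa\,\mu^\top x$ and $E[\mu^\top x] = A_d(\kappa)$, the entropy is $H = -E[\log f] = -\log C_d(\kappa) - \kappa A_d(\kappa)$. A quick numerical check (assuming SciPy; helper names illustrative): for $d=3$, $H$ approaches $\log(4\pi)$, the entropy of the uniform distribution, as $\kappa \to 0$, and decreases as $\kappa$ grows.

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

def log_C(d, kappa):
    return (d/2 - 1)*np.log(kappa) - (d/2)*np.log(2*np.pi) - np.log(iv(d/2 - 1, kappa))

def entropy(d, kappa):
    # H = -E[log f] = -log C_d(kappa) - kappa * A_d(kappa)
    return -log_C(d, kappa) - kappa*A(d, kappa)

H = [entropy(3, k) for k in (0.01, 0.1, 1.0, 10.0)]
print(H[0], np.log(4*np.pi))  # near-uniform entropy ≈ log(surface area of S^2)
print(all(a > b for a, b in zip(H, H[1:])))  # → True: entropy decreases in kappa
```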

Spherical Harmonic Expansion

The vMF distribution can be expanded as:

$$f(x) = \sum_l a_l Y_l(x)$$

where the $Y_l$ are spherical harmonics (since the density depends only on $\mu^\top x$, only zonal harmonics aligned with $\mu$ contribute).

Spherical Cap

In the von Mises–Fisher distribution, probability mass is mainly concentrated around the mean direction $\mu$.
Hence, a spherical cap is commonly used to describe confidence regions.

Let $x \sim \text{vMF}(\mu,\kappa)$ with $x, \mu \in \mathbb{S}^{d-1}$, and let $\theta$ be the angle between the vectors:

$$\cos\theta = \mu^\top x$$

A confidence region can be written as

$$\theta \le \theta_c$$

i.e., a spherical cap with axis $\mu$.

On the unit sphere $\mathbb{S}^{d-1}$, the area of a spherical cap of angular radius $\theta$ is:

$$A(\theta) = \frac{1}{2} A_{d-1}\, I_{\sin^2\theta}\!\left(\frac{d-1}{2},\frac{1}{2}\right) \qquad (0 \le \theta \le \pi/2)$$

where $A_{d-1}$ is the total surface area of the sphere, given by:

$$A_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)}$$

and $I_x(a,b)$ is the regularized incomplete beta function.

For $d=3$:

Cap area:

$$A(\theta) = 2\pi (1-\cos\theta)$$

Total sphere area: $4\pi$

Area proportion: $P(\theta) = \frac{1-\cos\theta}{2}$.
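
As a consistency check (assuming SciPy; helper names `sphere_area` and `cap_area` are illustrative), the general beta-function cap formula reduces to $2\pi(1-\cos\theta)$ for $d = 3$ when $\theta \le \pi/2$:

```python
import numpy as np
from scipy.special import betainc, gamma

def sphere_area(d):
    # total surface area A_{d-1} of the unit sphere S^{d-1}
    return 2*np.pi**(d/2) / gamma(d/2)

def cap_area(d, theta):
    # general cap-area formula, valid for 0 <= theta <= pi/2
    return 0.5 * sphere_area(d) * betainc((d - 1)/2, 0.5, np.sin(theta)**2)

theta = 0.7
print(cap_area(3, theta))           # general formula
print(2*np.pi*(1 - np.cos(theta)))  # d = 3 closed form: same value
```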

In high dimensions, the probability within a spherical cap for a vMF distribution is:

$$P(\theta \le \theta_c) = \frac{\int_0^{\theta_c} e^{\kappa\cos\theta} (\sin\theta)^{d-2}\, d\theta}{\int_0^{\pi} e^{\kappa\cos\theta} (\sin\theta)^{d-2}\, d\theta}$$

There is no simple closed-form solution, so approximations are commonly used.
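
The ratio of one-dimensional integrals is also easy to evaluate numerically, which is often all that is needed in practice. A sketch assuming SciPy (the helper name `cap_prob` is illustrative; the integrand is scaled by $e^{-\kappa}$ for numerical stability, which cancels in the ratio):

```python
import numpy as np
from scipy.integrate import quad

def cap_prob(d, kappa, theta_c):
    # P(theta <= theta_c) for x ~ vMF(mu, kappa), via the 1-D angular density
    f = lambda t: np.exp(kappa*(np.cos(t) - 1)) * np.sin(t)**(d - 2)
    num, _ = quad(f, 0, theta_c)
    den, _ = quad(f, 0, np.pi)
    return num / den

print(cap_prob(3, 0.0, np.pi/2))   # ≈ 0.5 (uniform case: a hemisphere)
print(cap_prob(3, 50.0, np.pi/2))  # ≈ 1.0 (concentrated: all mass in the cap)
```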

When $\kappa$ is large, the distribution is concentrated around $\mu$. Using

$$\cos\theta \approx 1 - \frac{\theta^2}{2}$$

we obtain

$$e^{\kappa\cos\theta} \approx e^{\kappa} e^{-\kappa\theta^2/2}$$

Thus, locally (neglecting the $(\sin\theta)^{d-2}$ volume factor), the vMF distribution approximates a Gaussian in the angle:

$$\theta \sim \mathcal{N}\!\left(0,\frac{1}{\kappa}\right)$$

If the confidence probability is $p$, then

$$\theta_c \approx \sqrt{\frac{2}{\kappa}}\, z_p$$

where $z_p$ is the corresponding quantile of the standard normal distribution.

Examples:

| Confidence | $z_p$ |
| --- | --- |
| 90% | 1.64 |
| 95% | 1.96 |
| 99% | 2.58 |

Thus, the 95% confidence cone angle is:

$$\theta_{95} \approx 1.96\sqrt{\frac{2}{\kappa}}$$

In high dimensions $d$, we have approximately

$$\mu^\top x \sim \mathcal{N}\!\left(A_d(\kappa), \frac{1-A_d(\kappa)^2}{d}\right)$$

with

$$A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$$

Therefore, the confidence angle satisfies:

$$\cos\theta_c = A_d(\kappa) - z_p \sqrt{\frac{1-A_d(\kappa)^2}{d}}$$

For small angles:

$$A(\theta) \approx C_d\, \theta^{d-1}$$

where

$$C_d = \frac{2\pi^{(d-1)/2}}{(d-1)\,\Gamma((d-1)/2)}$$

(the factor $1/(d-1)$ comes from integrating the $(d-2)$-dimensional cross-sections, whose area scales as $\theta^{d-2}$).

Hence, the spherical cap area grows as $\theta^{d-1}$, which is a key reason why high-dimensional sphere volume concentrates near the equator.

In many embedding papers, the following is used:

$$\theta_{\text{typical}} \approx \sqrt{\frac{d-1}{\kappa}}$$

Interpretation: most probability mass of the vMF distribution lies in

$$\theta \lesssim \sqrt{\frac{d-1}{\kappa}}$$

Intuition: $\kappa$ determines the directional noise magnitude:

$$\theta \sim O(\sqrt{1/\kappa})$$

These formulas are frequently used in sphere embedding, CLIP embedding, contrastive learning, spherical clustering, and directional statistics.

For example, $\kappa$ can be used to estimate the angular noise in embeddings, the angular radius of clusters, and the confidence cone of prototypes.

The probability mass of the vMF distribution is concentrated at

$$\theta \sim O(\sqrt{1/\kappa})$$

The corresponding confidence region is a spherical cap with axis $\mu$ and angular radius roughly $\sqrt{1/\kappa}$.
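
These angular scales can be checked empirically by drawing samples. Below is a minimal sketch of the standard rejection sampler of Wood (1994) (assuming NumPy; the helper name `sample_vmf` is illustrative): for $d=8$, $\kappa=50$, the mean angle to $\mu$ comes out close to $\sqrt{(d-1)/\kappa}$.

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    # Wood (1994) rejection sampler for vMF on S^{d-1}
    d = len(mu)
    b = (-2*kappa + np.sqrt(4*kappa**2 + (d - 1)**2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa*x0 + (d - 1)*np.log(1 - x0**2)
    ws = []
    while len(ws) < n:
        z = rng.beta((d - 1)/2, (d - 1)/2)
        w = (1 - (1 + b)*z) / (1 - (1 - b)*z)   # candidate cos(theta)
        if kappa*w + (d - 1)*np.log(1 - x0*w) - c >= np.log(rng.uniform()):
            ws.append(w)
    w = np.array(ws)
    # uniform tangent directions orthogonal to mu
    v = rng.standard_normal((n, d))
    v -= np.outer(v @ mu, mu)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return w[:, None]*mu + np.sqrt(np.clip(1 - w**2, 0, None))[:, None]*v

rng = np.random.default_rng(0)
d, kappa = 8, 50.0
mu = np.eye(d)[0]
x = sample_vmf(mu, kappa, 20000, rng)
theta = np.arccos(np.clip(x @ mu, -1.0, 1.0))
print(theta.mean(), np.sqrt((d - 1)/kappa))  # similar magnitudes
```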

Parameter Estimation

Estimating $\kappa$ is the most challenging part, as it requires solving an equation involving modified Bessel functions.

Below are several practical approximation algorithms/formulas for estimating the vMF $\kappa$ parameter, listed in approximate order of common usage.

Given samples $x_1,\dots,x_n$, each $x_i \in \mathbb{S}^{d-1}$.

First compute:

$$\bar{R} = \frac{\left\|\sum_{i=1}^n x_i\right\|}{n}$$

Mean direction:

$$\mu = \frac{\sum_i x_i}{\left\|\sum_i x_i\right\|}$$

The exact equation for $\kappa$ is:

$$A_d(\kappa) = \bar{R}$$

where

$$A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$$

This involves the modified Bessel function of the first kind, so approximations or numerical solutions are necessary.

Banerjee Approximation (most commonly used)

Source: Banerjee et al., 2005, Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Applicable for dimensions $d \ge 3$:

$$\kappa \approx \frac{\bar{R}(d-\bar{R}^2)}{1-\bar{R}^2}$$

Advantages: extremely simple, $O(1)$ computation, works well in high dimensions. Disadvantages: larger bias in low dimensions.

This is the most common initialization formula for spherical k-means / vMF mixtures.

Sra Newton Iteration

Source: Suvrit Sra, 2012, A short note on parameter approximation for von Mises–Fisher distributions

Initial value:

$$\kappa_0 = \frac{\bar{R}(d-\bar{R}^2)}{1-\bar{R}^2}$$

Then apply Newton iteration:

$$\kappa_{t+1} = \kappa_t - \frac{A_d(\kappa_t)-\bar{R}}{1-A_d(\kappa_t)^2-\frac{d-1}{\kappa_t}A_d(\kappa_t)}$$

Advantages: highly accurate, only 2–3 iterations needed.

Many libraries use this method.
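
A compact sketch combining the Banerjee initialization with the Newton refinement (assuming SciPy; the helper name `estimate_kappa` is illustrative). As a sanity check, it recovers $\kappa$ from its exact population mean resultant length $\bar{R} = A_d(\kappa)$:

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

def estimate_kappa(d, R_bar, iters=3):
    # Banerjee et al. (2005) closed-form initial value
    k = R_bar*(d - R_bar**2) / (1 - R_bar**2)
    # Sra (2012) Newton iterations on A_d(kappa) = R_bar;
    # the denominator is A_d'(kappa) = 1 - A_d^2 - (d-1)/kappa * A_d
    for _ in range(iters):
        Ak = A(d, k)
        k -= (Ak - R_bar) / (1 - Ak**2 - (d - 1)/k*Ak)
    return k

d, kappa_true = 10, 30.0
print(estimate_kappa(d, A(d, kappa_true)))  # ≈ 30.0
```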

Minka Approximation

Source: Thomas Minka

Formula:

$$\kappa \approx \frac{\bar{R}(d-\bar{R}^2)}{1-\bar{R}^2} + \frac{\bar{R}}{2(d-1)}$$

Characteristics: slightly more accurate than Banerjee, still closed-form.

Large-$\kappa$ Asymptotic Approximation

When $\kappa$ is very large:

$$A_d(\kappa) \approx 1 - \frac{d-1}{2\kappa}$$

Inverting gives:

$$\kappa \approx \frac{d-1}{2(1-\bar{R})}$$

Applicable when embeddings are highly concentrated, e.g., in contrastive learning.

Small-$\kappa$ Approximation

When $\kappa$ is very small:

$$A_d(\kappa) \approx \frac{\kappa}{d}$$

Thus,

$$\kappa \approx d\,\bar{R}$$

Applicable when data are nearly uniform.
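
Both asymptotic inversions are easy to check against the exact $A_d(\kappa)$ (a sketch assuming SciPy):

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

d = 10

# large-kappa regime: invert A_d(kappa) ≈ 1 - (d-1)/(2*kappa)
kappa = 200.0
print((d - 1) / (2*(1 - A(d, kappa))))  # close to 200

# small-kappa regime: invert A_d(kappa) ≈ kappa/d
kappa = 0.1
print(d * A(d, kappa))  # close to 0.1
```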

| Estimator | Complexity | Accuracy | Usage |
| --- | --- | --- | --- |
| Banerjee | Very low | Medium | Initialization |
| Sra + Newton | Medium | High | Recommended |
| Minka | Low | Medium-High | Fast estimation |
| Large $\kappa$ | Very low | High (large $\kappa$) | High concentration |
| Small $\kappa$ | Very low | High (small $\kappa$) | Near uniformity |