April 9, 2026
Introduction to the vMF Distribution on the Hypersphere

The von Mises–Fisher (vMF) distribution is often regarded as the hyperspherical analogue of the normal distribution, as it models data concentrated around a mean direction on the unit hypersphere. In Euclidean space, a normal distribution is typically parameterized by a mean $\mu$ and variance $\sigma^2$, and in one dimension its density forms the familiar bell-shaped curve. Just as the normal distribution is the maximum-entropy distribution given mean and variance constraints, the vMF distribution is the maximum-entropy distribution on the sphere given a constraint on the mean direction. By a central-limit-type argument for directional data, the limiting distribution of a random walk or accumulated direction vector on a sphere also tends toward the vMF distribution.

Basic Concepts

Domain: the $(d-1)$-dimensional unit sphere

$$\mathbb{S}^{d-1} = \{ \mathbf{x} \in \mathbb{R}^d : \|\mathbf{x}\| = 1 \}$$

Parameters:

  • Mean direction $\boldsymbol{\mu} \in \mathbb{S}^{d-1}$
  • Concentration parameter $\kappa \ge 0$

Probability density function (with respect to the surface measure on the sphere):

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \kappa) = C_d(\kappa) \exp(\kappa \boldsymbol{\mu}^\top \mathbf{x})$$

where

$$C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)}$$

and $I_\nu$ is the modified Bessel function of the first kind.

Intuitive Understanding

  • $\kappa = 0$: uniform distribution on the sphere (no directional preference)
  • $\kappa > 0$: distribution concentrated around $\boldsymbol{\mu}$, with larger $\kappa$ implying higher concentration
  • $\kappa \to \infty$: tends to a point mass at $\boldsymbol{\mu}$

Thus, the vMF distribution provides an analogy:

  • In $\mathbb{R}^d$, the normal distribution controls concentration via the precision matrix (inverse covariance)
  • On the sphere, the vMF distribution controls concentration around $\boldsymbol{\mu}$ via $\kappa$
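
These limiting behaviors are easy to check numerically. The sketch below (assuming NumPy and SciPy; the helper names `log_C` and `vmf_logpdf` are illustrative) verifies that the $d=3$ density integrates to 1 over the sphere and that the density at the mean direction grows with $\kappa$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import iv

def log_C(d, kappa):
    # log normalizing constant C_d(kappa) (density w.r.t. the surface measure)
    return (d/2 - 1)*np.log(kappa) - (d/2)*np.log(2*np.pi) - np.log(iv(d/2 - 1, kappa))

def vmf_logpdf(x, mu, kappa):
    d = len(mu)
    return log_C(d, kappa) + kappa*np.dot(mu, x)

# d = 3: integrate the density over the sphere via the polar angle t
d, kappa = 3, 5.0
total, _ = quad(lambda t: np.exp(log_C(d, kappa) + kappa*np.cos(t)) * 2*np.pi*np.sin(t), 0, np.pi)
print(round(total, 6))  # → 1.0

# density at the mean direction grows with kappa
mu = np.array([0.0, 0.0, 1.0])
print(vmf_logpdf(mu, mu, 1.0) < vmf_logpdf(mu, mu, 10.0))  # → True
```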

Properties

Exponential Family Property

The vMF distribution belongs to the exponential family.

Its form is:

$$f(x \mid \theta) = h(x)\exp(\theta^\top x - A(\theta))$$

with

$$\theta = \kappa \mu$$

Hence, vMF enjoys standard exponential family properties:

  • Existence of a sufficient statistic
  • Simple maximum likelihood estimation
  • Convenient for use in EM algorithms

Sufficient statistic:

$$T(x) = x$$

First Moment (Mean)

$$E[x] = A_d(\kappa)\,\mu$$

where

$$A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$$

The function $A_d(\kappa)$ is called the mean resultant length function.
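
A minimal sketch of $A_d(\kappa)$ (assuming SciPy; the helper name `A` is illustrative): it is increasing in $\kappa$, starts near $\kappa/d$, and approaches 1 as $\kappa \to \infty$.

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    # mean resultant length A_d(kappa) = I_{d/2}(kappa) / I_{d/2-1}(kappa)
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

d = 8
vals = [A(d, k) for k in (0.1, 1.0, 10.0, 100.0)]
print(vals)  # strictly increasing, all in (0, 1)
print(abs(A(d, 0.1) - 0.1/d) < 1e-3)  # small kappa: A_d(kappa) ≈ kappa/d  → True
```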

Second Moment

$$E[xx^\top] = \frac{A_d(\kappa)}{\kappa}\, I + \left(1 - \frac{d\,A_d(\kappa)}{\kappa}\right)\mu\mu^\top$$

Rotation Invariance

For any orthogonal matrix $R$,

if

$$x \sim \text{vMF}(\mu,\kappa)$$

then

$$Rx \sim \text{vMF}(R\mu,\kappa)$$

This shows that the vMF family is closed under the rotation group: rotating the data rotates the mean direction while leaving $\kappa$ unchanged.

Information Geometry Properties

The family of vMF distributions forms an information-geometric manifold.

Fisher information:

$$I(\kappa) = -\frac{\partial^2}{\partial \kappa^2} \log C_d(\kappa)$$

This allows the use of natural gradient and information-geometric optimization.

Entropy

The entropy of the vMF distribution (with respect to the surface measure) is:

$$H = -\log C_d(\kappa) - \kappa A_d(\kappa)$$

Property: larger κ\kappa gives smaller entropy.
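
Since $\log f(x) = \log C_d(\kappa) + \kappa\,\mu^\top x$ and $E[\mu^\top x] = A_d(\kappa)$, the entropy is $H = -E[\log f] = -\log C_d(\kappa) - \kappa A_d(\kappa)$. A quick numerical check (assuming SciPy; helper names illustrative): for $d=3$, $H$ approaches $\log(4\pi)$, the entropy of the uniform distribution, as $\kappa \to 0$, and decreases as $\kappa$ grows.

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

def log_C(d, kappa):
    return (d/2 - 1)*np.log(kappa) - (d/2)*np.log(2*np.pi) - np.log(iv(d/2 - 1, kappa))

def entropy(d, kappa):
    # H = -E[log f] = -log C_d(kappa) - kappa * A_d(kappa)
    return -log_C(d, kappa) - kappa*A(d, kappa)

H = [entropy(3, k) for k in (0.01, 0.1, 1.0, 10.0)]
print(H[0], np.log(4*np.pi))  # near-uniform entropy ≈ log(surface area of S^2)
print(all(a > b for a, b in zip(H, H[1:])))  # → True: entropy decreases in kappa
```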

Spherical Harmonic Expansion

The vMF distribution can be expanded as:

$$f(x) = \sum_l a_l Y_l(x)$$

where the $Y_l$ are spherical harmonics (since the density depends only on $\mu^\top x$, only zonal harmonics aligned with $\mu$ contribute).

Spherical Cap

In the von Mises–Fisher distribution, probability mass is mainly concentrated around the mean direction $\mu$.
Hence, a spherical cap is commonly used to describe confidence regions.

Let $x \sim \text{vMF}(\mu,\kappa)$ with $x, \mu \in \mathbb{S}^{d-1}$, and let $\theta$ be the angle between the vectors:

$$\cos\theta = \mu^\top x$$

A confidence region can be written as

$$\theta \le \theta_c$$

i.e., a spherical cap with axis $\mu$.

On the unit sphere $\mathbb{S}^{d-1}$, the area of a spherical cap of angular radius $\theta$ is:

$$A(\theta) = \frac{1}{2} A_{d-1}\, I_{\sin^2\theta}\!\left(\frac{d-1}{2},\frac{1}{2}\right) \qquad (0 \le \theta \le \pi/2)$$

where $A_{d-1}$ is the total surface area of the sphere, given by:

$$A_{d-1} = \frac{2\pi^{d/2}}{\Gamma(d/2)}$$

and $I_x(a,b)$ is the regularized incomplete beta function.

For $d=3$:

Cap area:

$$A(\theta) = 2\pi (1-\cos\theta)$$

Total sphere area: $4\pi$

Area proportion: $P(\theta) = \frac{1-\cos\theta}{2}$.
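
As a consistency check (assuming SciPy; helper names `sphere_area` and `cap_area` are illustrative), the general beta-function cap formula reduces to $2\pi(1-\cos\theta)$ for $d = 3$ when $\theta \le \pi/2$:

```python
import numpy as np
from scipy.special import betainc, gamma

def sphere_area(d):
    # total surface area A_{d-1} of the unit sphere S^{d-1}
    return 2*np.pi**(d/2) / gamma(d/2)

def cap_area(d, theta):
    # general cap-area formula, valid for 0 <= theta <= pi/2
    return 0.5 * sphere_area(d) * betainc((d - 1)/2, 0.5, np.sin(theta)**2)

theta = 0.7
print(cap_area(3, theta))           # general formula
print(2*np.pi*(1 - np.cos(theta)))  # d = 3 closed form: same value
```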

In high dimensions, the probability within a spherical cap for a vMF distribution is:

$$P(\theta \le \theta_c) = \frac{\int_0^{\theta_c} e^{\kappa\cos\theta} (\sin\theta)^{d-2}\, d\theta}{\int_0^{\pi} e^{\kappa\cos\theta} (\sin\theta)^{d-2}\, d\theta}$$

There is no simple closed-form solution, so approximations are commonly used.
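
The ratio of one-dimensional integrals is also easy to evaluate numerically, which is often all that is needed in practice. A sketch assuming SciPy (the helper name `cap_prob` is illustrative; the integrand is scaled by $e^{-\kappa}$ for numerical stability, which cancels in the ratio):

```python
import numpy as np
from scipy.integrate import quad

def cap_prob(d, kappa, theta_c):
    # P(theta <= theta_c) for x ~ vMF(mu, kappa), via the 1-D angular density
    f = lambda t: np.exp(kappa*(np.cos(t) - 1)) * np.sin(t)**(d - 2)
    num, _ = quad(f, 0, theta_c)
    den, _ = quad(f, 0, np.pi)
    return num / den

print(cap_prob(3, 0.0, np.pi/2))   # ≈ 0.5 (uniform case: a hemisphere)
print(cap_prob(3, 50.0, np.pi/2))  # ≈ 1.0 (concentrated: all mass in the cap)
```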

When $\kappa$ is large, the distribution is concentrated around $\mu$. Using

$$\cos\theta \approx 1 - \frac{\theta^2}{2}$$

we obtain

$$e^{\kappa\cos\theta} \approx e^{\kappa} e^{-\kappa\theta^2/2}$$

Thus, locally (neglecting the $(\sin\theta)^{d-2}$ volume factor), the vMF distribution approximates a Gaussian in the angle:

$$\theta \sim \mathcal{N}\!\left(0,\frac{1}{\kappa}\right)$$

If the confidence probability is $p$, then

$$\theta_c \approx \sqrt{\frac{2}{\kappa}}\, z_p$$

where $z_p$ is the corresponding quantile of the standard normal distribution.

Examples:

| Confidence | $z_p$ |
| --- | --- |
| 90% | 1.64 |
| 95% | 1.96 |
| 99% | 2.58 |

Thus, the 95% confidence cone angle is:

$$\theta_{95} \approx 1.96\sqrt{\frac{2}{\kappa}}$$

In high dimensions $d$, we have approximately

$$\mu^\top x \sim \mathcal{N}\!\left(A_d(\kappa), \frac{1-A_d(\kappa)^2}{d}\right)$$

with

$$A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$$

Therefore, the confidence angle satisfies:

$$\cos\theta_c = A_d(\kappa) - z_p \sqrt{\frac{1-A_d(\kappa)^2}{d}}$$

For small angles:

$$A(\theta) \approx C_d\, \theta^{d-1}$$

where

$$C_d = \frac{2\pi^{(d-1)/2}}{(d-1)\,\Gamma((d-1)/2)}$$

(the factor $1/(d-1)$ comes from integrating the $(d-2)$-dimensional cross-sections, whose area scales as $\theta^{d-2}$).

Hence, the spherical cap area grows as $\theta^{d-1}$, which is a key reason why high-dimensional sphere volume concentrates near the equator.

In many embedding papers, the following is used:

$$\theta_{\text{typical}} \approx \sqrt{\frac{d-1}{\kappa}}$$

Interpretation: most probability mass of the vMF distribution lies in

$$\theta \lesssim \sqrt{\frac{d-1}{\kappa}}$$

Intuition: $\kappa$ determines the directional noise magnitude:

$$\theta \sim O(\sqrt{1/\kappa})$$

These formulas are frequently used in sphere embedding, CLIP embedding, contrastive learning, spherical clustering, and directional statistics.

For example, $\kappa$ can be used to estimate the angular noise in embeddings, the angular radius of clusters, and the confidence cone of prototypes.

The probability mass of the vMF distribution is concentrated at

$$\theta \sim O(\sqrt{1/\kappa})$$

The corresponding confidence region is a spherical cap with axis $\mu$ and angular radius roughly $\sqrt{1/\kappa}$.
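
These angular scales can be checked empirically by drawing samples. Below is a minimal sketch of the standard rejection sampler of Wood (1994) (assuming NumPy; the helper name `sample_vmf` is illustrative): for $d=8$, $\kappa=50$, the mean angle to $\mu$ comes out close to $\sqrt{(d-1)/\kappa}$.

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng):
    # Wood (1994) rejection sampler for vMF on S^{d-1}
    d = len(mu)
    b = (-2*kappa + np.sqrt(4*kappa**2 + (d - 1)**2)) / (d - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa*x0 + (d - 1)*np.log(1 - x0**2)
    ws = []
    while len(ws) < n:
        z = rng.beta((d - 1)/2, (d - 1)/2)
        w = (1 - (1 + b)*z) / (1 - (1 - b)*z)   # candidate cos(theta)
        if kappa*w + (d - 1)*np.log(1 - x0*w) - c >= np.log(rng.uniform()):
            ws.append(w)
    w = np.array(ws)
    # uniform tangent directions orthogonal to mu
    v = rng.standard_normal((n, d))
    v -= np.outer(v @ mu, mu)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return w[:, None]*mu + np.sqrt(np.clip(1 - w**2, 0, None))[:, None]*v

rng = np.random.default_rng(0)
d, kappa = 8, 50.0
mu = np.eye(d)[0]
x = sample_vmf(mu, kappa, 20000, rng)
theta = np.arccos(np.clip(x @ mu, -1.0, 1.0))
print(theta.mean(), np.sqrt((d - 1)/kappa))  # similar magnitudes
```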

Parameter Estimation

Estimating $\kappa$ is the most challenging part, as it requires solving an equation involving modified Bessel functions.

Below are several practical approximation algorithms/formulas for estimating the vMF $\kappa$ parameter, listed in approximate order of common usage.

Given samples $x_1,\dots,x_n$, each $x_i \in \mathbb{S}^{d-1}$.

First compute:

$$\bar{R} = \frac{\left\|\sum_{i=1}^n x_i\right\|}{n}$$

Mean direction:

$$\mu = \frac{\sum_i x_i}{\left\|\sum_i x_i\right\|}$$

The exact equation for $\kappa$ is:

$$A_d(\kappa) = \bar{R}$$

where

$$A_d(\kappa) = \frac{I_{d/2}(\kappa)}{I_{d/2-1}(\kappa)}$$

This involves the modified Bessel function of the first kind, so approximations or numerical solutions are necessary.

Banerjee Approximation (most commonly used)

Source: Banerjee et al., 2005, Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Applicable for dimensions $d \ge 3$:

$$\kappa \approx \frac{\bar{R}(d-\bar{R}^2)}{1-\bar{R}^2}$$

Advantages: extremely simple, $O(1)$ computation, works well in high dimensions. Disadvantages: larger bias in low dimensions.

This is the most common initialization formula for spherical k-means / vMF mixtures.

Sra Newton Iteration

Source: Suvrit Sra, 2012, A short note on parameter approximation for von Mises–Fisher distributions

Initial value:

$$\kappa_0 = \frac{\bar{R}(d-\bar{R}^2)}{1-\bar{R}^2}$$

Then apply Newton iteration:

$$\kappa_{t+1} = \kappa_t - \frac{A_d(\kappa_t)-\bar{R}}{1-A_d(\kappa_t)^2-\frac{d-1}{\kappa_t}A_d(\kappa_t)}$$

Advantages: highly accurate, only 2–3 iterations needed.

Many libraries use this method.
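
A compact sketch combining the Banerjee initialization with the Newton refinement (assuming SciPy; the helper name `estimate_kappa` is illustrative). As a sanity check, it recovers $\kappa$ from its exact population mean resultant length $\bar{R} = A_d(\kappa)$:

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

def estimate_kappa(d, R_bar, iters=3):
    # Banerjee et al. (2005) closed-form initial value
    k = R_bar*(d - R_bar**2) / (1 - R_bar**2)
    # Sra (2012) Newton iterations on A_d(kappa) = R_bar;
    # the denominator is A_d'(kappa) = 1 - A_d^2 - (d-1)/kappa * A_d
    for _ in range(iters):
        Ak = A(d, k)
        k -= (Ak - R_bar) / (1 - Ak**2 - (d - 1)/k*Ak)
    return k

d, kappa_true = 10, 30.0
print(estimate_kappa(d, A(d, kappa_true)))  # ≈ 30.0
```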

Minka Approximation

Source: Thomas Minka

Formula:

$$\kappa \approx \frac{\bar{R}(d-\bar{R}^2)}{1-\bar{R}^2} + \frac{\bar{R}}{2(d-1)}$$

Characteristics: slightly more accurate than Banerjee, still closed-form.

Large-$\kappa$ Asymptotic Approximation

When $\kappa$ is very large:

$$A_d(\kappa) \approx 1 - \frac{d-1}{2\kappa}$$

Inverting gives:

$$\kappa \approx \frac{d-1}{2(1-\bar{R})}$$

Applicable when embeddings are highly concentrated, e.g., in contrastive learning.

Small-$\kappa$ Approximation

When $\kappa$ is very small:

$$A_d(\kappa) \approx \frac{\kappa}{d}$$

Thus,

$$\kappa \approx d\,\bar{R}$$

Applicable when data are nearly uniform.
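
Both asymptotic inversions are easy to check against the exact $A_d(\kappa)$ (a sketch assuming SciPy):

```python
import numpy as np
from scipy.special import iv

def A(d, kappa):
    return iv(d/2, kappa) / iv(d/2 - 1, kappa)

d = 10

# large-kappa regime: invert A_d(kappa) ≈ 1 - (d-1)/(2*kappa)
kappa = 200.0
print((d - 1) / (2*(1 - A(d, kappa))))  # close to 200

# small-kappa regime: invert A_d(kappa) ≈ kappa/d
kappa = 0.1
print(d * A(d, kappa))  # close to 0.1
```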

| Estimator | Complexity | Accuracy | Usage |
| --- | --- | --- | --- |
| Banerjee | Very low | Medium | Initialization |
| Sra + Newton | Medium | High | Recommended |
| Minka | Low | Medium-High | Fast estimation |
| Large $\kappa$ | Very low | High (large $\kappa$) | High concentration |
| Small $\kappa$ | Very low | High (small $\kappa$) | Near uniformity |