
For two mixed continuous/discrete distributions with densities \(p\) and \(q\), and denoting \(x = (x_c,x_d)\), the Kullback-Leibler divergence \(D_{KL}(p||q)\) is given by $$D_{KL}(p||q) = \sum_{x_d} \int p(x_c,x_d) \log\left(\frac{p(x_c,x_d)}{q(x_c,x_d)}\right)dx_c.$$ Conditioning on the discrete variables \(x_d\), this can be rewritten as $$D_{KL}(p||q) = \sum_{x_d} p(x_d) D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big) + D_{KL}\big(p_{x_d}||q_{x_d}\big),$$ where \(p_{x_d}\) and \(q_{x_d}\) denote the marginal distributions of \(x_d\) under \(p\) and \(q\). The conditional terms \(D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big)\) are approximated via nearest-neighbour or kernel-based density estimates on the datasets X and Y, stratified by the discrete variables, and the marginal term \(D_{KL}\big(p_{x_d}||q_{x_d}\big)\) is approximated using relative frequencies.
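As a quick illustration of this decomposition (a standalone sketch, not using the package; kl_gauss, p_d and q_d are illustrative names), both terms can be evaluated in closed form when the conditionals are Gaussian and the discrete marginals are Bernoulli:

# p: x_d ~ Bernoulli(0.5), x_c | x_d ~ N(x_d, 1)
# q: y_d ~ Bernoulli(0.7), y_c | y_d ~ N(0, 2^2)
kl_gauss <- function(mu1, sd1, mu2, sd2) {  # closed-form Gaussian KL
  log(sd2/sd1) + (sd1^2 + (mu1 - mu2)^2)/(2*sd2^2) - 0.5
}
p_d <- c(0.5, 0.5)  # p(x_d = 0), p(x_d = 1)
q_d <- c(0.3, 0.7)  # q(y_d = 0), q(y_d = 1)
kl_cont <- p_d[1]*kl_gauss(0, 1, 0, 2) + p_d[2]*kl_gauss(1, 1, 0, 2)
kl_disc <- sum(p_d * log(p_d/q_d))
kl_cont + kl_disc   # exact mixed KL divergence, approx. 0.47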

Usage

kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)

Arguments

X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e., d = 1), representing n samples from the true distribution \(P\) and m samples from the approximate distribution \(Q\), both in d dimensions. Y can be omitted if q is specified (see below).

q

The density function of the approximate distribution \(Q\). Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. For mixed continuous/discrete distributions, however, q must be given in decomposed form, \(q(y_c,y_d)=q_{c|d}(y_c|y_d)q_d(y_d)\), as a named list with field cond for the conditional density \(q_{c|d}(y_c|y_d)\) (a function of two arguments y_c and y_d) and field disc for the discrete marginal density \(q_d(y_d)\) (a function of one argument y_d); see the one-sample example below. If such a decomposition is not available, it may be preferable to simulate a large sample from \(Q\) and use the two-sample syntax instead.

estimator.continuous, estimator.discrete

KL divergence estimators for the continuous and discrete parts of the data, respectively. Both are functions taking two arguments, X and Y or X and q, depending on whether a two-sample or one-sample problem is considered. Defaults are kld_est_nn and kld_est_discrete, respectively; any function with the same signature can be supplied instead (see the last example below).

vartype

A character vector of length d, with vartype[i] = "c" meaning the i-th variable is continuous, and vartype[i] = "d" meaning it is discrete. If unspecified, vartype defaults to "c" for numeric columns and "d" for character or factor columns. This default works in most cases, but fails when the levels of a discrete variable are encoded as numbers (e.g., 0 for females and 1 for males) or for count data; in such cases, vartype must be given explicitly, as shown below.
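As a sketch (data values are illustrative), a two-sample call in which the discrete variable is numerically coded therefore needs vartype:

set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0,4),rep(1,6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep(0,5),rep(1,5)))
kld_est(X, Y, vartype = c("c","d"))  # both columns are numeric, so the
                                     # default would treat discr as continuous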

Value

A scalar, the estimated Kullback-Leibler divergence \(\hat D_{KL}(P||Q)\).

Examples

# 2D example, two samples
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a',4),rep('b',6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a',5),rep('b',5)))
kld_est(X, Y)
#> [1] 0.5099841

# 2D example, one sample
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0,4),rep(1,6)))
q <- list(cond = function(xc,xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
kld_est(X, q = q, vartype = c("c","d"))
#> [1] 0.8126271
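
Since the estimators are pluggable, any function with the two-argument signature described under estimator.continuous can replace the defaults. A minimal sketch (my_est is an illustrative name, not part of the package):

# Custom continuous estimator: a trivial wrapper around the default
my_est <- function(X, Y) kld_est_nn(X, Y)
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a',4),rep('b',6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a',5),rep('b',5)))
kld_est(X, Y, estimator.continuous = my_est)
# should match the first example above, since my_est just wraps the default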