Kullback-Leibler divergence estimator for discrete, continuous or mixed data.
Source: R/kld-estimation-interfaces.R
kld_est.Rd

For two mixed continuous/discrete distributions with densities \(p\) and
\(q\), and denoting \(x = (x_c, x_d)\), the Kullback-Leibler
divergence \(D_{KL}(p||q)\) is given as
$$D_{KL}(p||q) = \sum_{x_d} \int p(x_c,x_d) \log\left(\frac{p(x_c,x_d)}{q(x_c,x_d)}\right)dx_c.$$
Conditioning on the discrete variables \(x_d\), this can be rewritten as
$$D_{KL}(p||q) = \sum_{x_d} p(x_d) D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big) +
D_{KL}\big(p_{x_d}||q_{x_d}\big).$$
Here, the terms
$$D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big)$$
are approximated via nearest neighbour- or kernel-based density estimates on
the datasets X and Y stratified by the discrete variables, and
$$D_{KL}\big(p_{x_d}||q_{x_d}\big)$$
is approximated using relative frequencies.
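A minimal sketch of this stratified computation in the two-sample case, assuming one continuous column cont and one discrete column discr (the helper name and column names are hypothetical; kld_est() handles the general case internally):

library(kldest)

# Hypothetical helper illustrating the decomposition: stratify by the
# discrete variable, estimate the continuous KL divergence within each
# stratum, weight by relative frequencies, and add the discrete part.
kld_mixed_sketch <- function(X, Y) {
  levels_d <- as.character(unique(X$discr))
  # relative frequencies p(x_d) in the sample from P
  p_d <- table(factor(X$discr, levels = levels_d)) / nrow(X)
  # weighted sum of within-stratum continuous KL divergence estimates
  kl_cont <- sum(vapply(levels_d, function(l) {
    p_d[[l]] * kld_est_nn(X$cont[X$discr == l], Y$cont[Y$discr == l])
  }, numeric(1)))
  # KL divergence between the discrete marginals, via relative frequencies
  kl_disc <- kld_est_discrete(X$discr, Y$discr)
  kl_cont + kl_disc
}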
Usage
kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)

Arguments
- X, Y
n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution \(P\) and m samples from the approximate distribution \(Q\) in d dimensions. Y can be left blank if q is specified (see below).
- q
The density function of the approximate distribution \(Q\). Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, \(q(y_c,y_d) = q_{c|d}(y_c|y_d)q_d(y_d)\), specified as a named list with field cond for the conditional density \(q_{c|d}(y_c|y_d)\) (a function that expects two arguments y_c and y_d) and field disc for the discrete marginal density \(q_d(y_d)\) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from \(Q\) and use the two-sample syntax.
- estimator.continuous, estimator.discrete
KL divergence estimators for continuous and discrete data, respectively. Both are functions with two arguments, X and Y or X and q, depending on whether a two-sample or one-sample problem is considered. Defaults are kld_est_nn and kld_est_discrete, respectively.
- vartype
A length-d character vector, with vartype[i] = "c" meaning the i-th variable is continuous, and vartype[i] = "d" meaning it is discrete. If unspecified, vartype is "c" for numeric columns and "d" for character or factor columns. This default mostly works, except when levels of discrete variables are encoded using numbers (e.g., 0 for females and 1 for males) or for count data (see the sketch below).
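For instance, with a 0/1-coded discrete column, the default detection would classify both columns as continuous, so vartype must be passed explicitly. A minimal two-sample sketch (the data are made up for illustration):

set.seed(0)
X <- data.frame(cont = rnorm(20), discr = rbinom(20, size = 1, prob = 0.5))
Y <- data.frame(cont = rnorm(20), discr = rbinom(20, size = 1, prob = 0.5))
# without vartype, the numeric 0/1 column discr would be treated as continuous
kld_est(X, Y, vartype = c("c", "d"))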
Examples
# 2D example, two samples
set.seed(0)
X <- data.frame(cont = rnorm(10),
                discr = c(rep('a',4), rep('b',6)))
Y <- data.frame(cont = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a',5), rep('b',5)))
kld_est(X, Y)
#> [1] 0.5099841
# 2D example, one sample
set.seed(0)
X <- data.frame(cont = rnorm(10),
                discr = c(rep(0,4), rep(1,6)))
q <- list(cond = function(xc, xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
kld_est(X, q = q, vartype = c("c","d"))
#> [1] 0.8126271
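# As noted under q above: if the distribution is all-continuous, q can be
# passed directly as a density function. A minimal univariate sketch (no
# output shown; the estimate should be close to 0 here, since the sample
# and the density match):
set.seed(0)
X <- rnorm(100)
kld_est(X, q = dnorm)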