Kullback-Leibler divergence estimator for discrete, continuous or mixed data.
Source: R/kld-estimation-interfaces.R, kld_est.Rd
For two mixed continuous/discrete distributions with densities \(p\) and
\(q\), and denoting \(x = (x_c, x_d)\), the Kullback-Leibler divergence
\(D_{KL}(p||q)\) is given as
$$D_{KL}(p||q) = \sum_{x_d} \int p(x_c,x_d) \log\left(\frac{p(x_c,x_d)}{q(x_c,x_d)}\right)dx_c.$$
Conditioning on the discrete variables \(x_d\), this can be rewritten as
$$D_{KL}(p||q) = \sum_{x_d} p(x_d) D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big) + D_{KL}\big(p_{x_d}||q_{x_d}\big).$$
Here, the terms
$$D_{KL}\big(p(\cdot|x_d)||q(\cdot|x_d)\big)$$
are approximated via nearest neighbour- or kernel-based density estimates on
the datasets X and Y stratified by the discrete variables, and
$$D_{KL}\big(p_{x_d}||q_{x_d}\big)$$
is approximated using relative frequencies.
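For intuition, this decomposition can be carried out by hand with the package's own estimators: stratify both samples by the discrete variable, estimate a continuous KL divergence within each stratum, weight by the relative frequencies under \(P\), and add a discrete KL term. The following is a minimal illustrative sketch, not the internal implementation; it assumes both samples contain every discrete level.

# Manual mixed-data decomposition (illustrative sketch)
set.seed(1)
X <- data.frame(cont  = rnorm(100),
                discr = sample(c("a","b"), 100, replace = TRUE))
Y <- data.frame(cont  = rnorm(100, sd = 2),
                discr = sample(c("a","b"), 100, replace = TRUE))

# Continuous part: sum over levels x_d of p(x_d) * D_KL(p(.|x_d) || q(.|x_d))
lev     <- unique(X$discr)
p_d     <- table(factor(X$discr, levels = lev)) / nrow(X)
kl_cont <- sum(p_d * sapply(lev, function(l)
  kld_est_nn(X$cont[X$discr == l], Y$cont[Y$discr == l])))

# Discrete part: D_KL(p_{x_d} || q_{x_d}) from relative frequencies
kl_disc <- kld_est_discrete(X$discr, Y$discr)

kl_cont + kl_disc   # compare with kld_est(X, Y)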
Usage
kld_est(
  X,
  Y = NULL,
  q = NULL,
  estimator.continuous = kld_est_nn,
  estimator.discrete = kld_est_discrete,
  vartype = NULL
)
Arguments
- X, Y
  n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution \(P\) and m samples from the approximate distribution \(Q\) in d dimensions. Y can be left blank if q is specified (see below).
- q
  The density function of the approximate distribution \(Q\). Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be directly specified as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, \(q(y_c,y_d) = q_{c|d}(y_c|y_d)q_d(y_d)\), specified as a named list with field cond for the conditional density \(q_{c|d}(y_c|y_d)\) (a function that expects two arguments y_c and y_d) and disc for the discrete marginal density \(q_d(y_d)\) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from \(Q\) and use the two-sample syntax.
- estimator.continuous, estimator.discrete
  KL divergence estimators for continuous and discrete data, respectively. Both are functions with two arguments, X and Y or X and q, depending on whether a two-sample or one-sample problem is considered. Defaults are kld_est_nn and kld_est_discrete, respectively.
- vartype
  A length-d character vector, with vartype[i] = "c" meaning the i-th variable is continuous, and vartype[i] = "d" meaning it is discrete. If unspecified, vartype is "c" for numeric columns and "d" for character or factor columns. This default will mostly work, except if levels of discrete variables are encoded using numbers (e.g., 0 for females and 1 for males) or for count data; see the sketch after this list.
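Since a 0/1-coded discrete column is numeric, the default would classify it as continuous; passing vartype explicitly avoids this. A brief sketch of the intended call (the data here are made up for illustration):

# 'discr' is numeric, so the default vartype would be c("c","c")
set.seed(0)
X <- data.frame(cont = rnorm(10), discr = rbinom(10, size = 1, prob = 0.5))
Y <- data.frame(cont = rnorm(10), discr = rbinom(10, size = 1, prob = 0.5))

# Explicit vartype ensures 'discr' is treated as discrete
kld_est(X, Y, vartype = c("c", "d"))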
Examples
# 2D example, two samples
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep('a', 4), rep('b', 6)))
Y <- data.frame(cont  = c(rnorm(5), rnorm(5, sd = 2)),
                discr = c(rep('a', 5), rep('b', 5)))
kld_est(X, Y)
#> [1] 0.5099841

# 2D example, one sample
set.seed(0)
X <- data.frame(cont  = rnorm(10),
                discr = c(rep(0, 4), rep(1, 6)))
q <- list(cond = function(xc, xd) dnorm(xc, mean = xd, sd = 1),
          disc = function(xd) dbinom(xd, size = 1, prob = 0.5))
kld_est(X, q = q, vartype = c("c", "d"))
#> [1] 0.8126271
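For all-continuous data, q can be passed directly as a density function, as described under Arguments. A short sketch (the printed estimate depends on the sample and is not shown here):

# 1D example, one sample: q given directly as a density function
set.seed(0)
X <- rnorm(100)
q <- function(x) dnorm(x, mean = 1)
kld_est(X, q = q)   # true value: D_KL(N(0,1) || N(1,1)) = 1/2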