Uncertainty of KL divergence estimate using Politis/Romano's subsampling bootstrap.
Source: R/kld-uncertainty.R
This function computes a confidence interval for KL divergence based on the subsampling bootstrap introduced by Politis and Romano. See Details for theoretical properties of this method.
Usage
kld_ci_subsampling(
X,
Y = NULL,
q = NULL,
estimator = kld_est_nn,
B = 500L,
alpha = 0.05,
subsample.size = function(x) x^(2/3),
convergence.rate = sqrt,
method = c("quantile", "se"),
include.boot = FALSE,
n.cores = 1L,
...
)
Arguments
- X, Y
n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution \(P\) and m samples from the approximate distribution \(Q\) in d dimensions. Y can be left blank if q is specified (see below).
- q
The density function of the approximate distribution \(Q\). Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be specified directly as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, \(q(y_c,y_d)=q_{c|d}(y_c|y_d)q_d(y_d)\), specified as a named list with field cond for the conditional density \(q_{c|d}(y_c|y_d)\) (a function that expects two arguments y_c and y_d) and field disc for the discrete marginal density \(q_d(y_d)\) (a function that expects one argument y_d); see the sketch after this list. If such a decomposition is not available, it may be preferable to instead simulate a large sample from \(Q\) and use the two-sample syntax.
- estimator
The Kullback-Leibler divergence estimation method; a function expecting two inputs (X, and either Y or q, depending on the arguments provided). Defaults to kld_est_nn.
- B
Number of bootstrap replicates (default: 500); the larger, the more accurate, but also the more computationally expensive.
- alpha
Error level, defaults to 0.05.
- subsample.size
A function specifying the size of the subsamples; defaults to \(f(x) = x^{2/3}\).
- convergence.rate
A function computing the convergence rate of the estimator as a function of sample sizes. Defaults to \(f(x) = x^{1/2}\). If convergence.rate is NULL, it is estimated empirically from the sample(s) using kldest::convergence_rate().
- method
Either "quantile" (the default), also known as the reverse percentile method, or "se" for a normal approximation of the KL divergence estimator using the standard error of the subsamples.
- include.boot
Boolean; TRUE means KL divergence estimates on subsamples are included in the returned list. Defaults to FALSE.
- n.cores
Number of cores to use in parallel computing (defaults to 1, which means that no parallel computing is used). To use this option, the parallel package must be installed and the OS must be of UNIX type (i.e., not Windows). Otherwise, n.cores will be reset to 1, with a message.
- ...
Arguments passed on to estimator, i.e. via the call estimator(X, Y = Y, ...) or estimator(X, q = q, ...).
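To illustrate the decomposed form of q described above, here is a minimal, hypothetical example for a mixed distribution in which the discrete part is Bernoulli(0.5) and the continuous part is N(y_d, 1) given y_d; the name q_mixed and the distributional choices are illustrative assumptions, not part of the package.

# Hypothetical decomposed density for a mixed continuous/discrete Q:
# y_d ~ Bernoulli(0.5), y_c | y_d ~ N(y_d, 1)
q_mixed <- list(
  cond = function(y_c, y_d) dnorm(y_c, mean = y_d, sd = 1), # q_{c|d}(y_c | y_d)
  disc = function(y_d) dbinom(y_d, size = 1, prob = 0.5)    # q_d(y_d)
)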
Value
A list with the following fields:
"est"
(the estimated KL divergence),"ci"
(a length2
vector containing the lower and upper limits of the estimated confidence interval)."boot"
(a lengthB
numeric vector with KL divergence estimates on the bootstrap subsamples), only included ifinclude.boot = TRUE
,
Details
In general terms, letting \(b_n\) be the subsample size for a sample of size \(n\), and \(\tau_n\) the convergence rate of the estimator, a confidence interval calculated by subsampling has asymptotic coverage \(1 - \alpha\) as long as \(b_n/n\rightarrow 0\), \(b_n\rightarrow\infty\) and \(\frac{\tau_{b_n}}{\tau_n}\rightarrow 0\).
In many cases, the convergence rate of the nearest-neighbour based KL divergence estimator is \(\tau_n = \sqrt{n}\) and the condition on the subsample size reduces to \(b_n/n\rightarrow 0\) and \(b_n\rightarrow\infty\). By default, \(b_n = n^{2/3}\). In a two-sample problem, \(n\) and \(b_n\) are replaced by effective sample sizes \(n_\text{eff} = \min(n,m)\) and \(b_{n,\text{eff}} = \min(b_n,b_m)\).
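To make the "quantile" (reverse percentile) construction concrete, the following is a minimal sketch assuming the default rate \(\tau_n = \sqrt{n}\) and subsample size \(b_n = n^{2/3}\); est and boot are placeholders standing in for the full-sample estimate and the B subsample estimates, and this is a simplified illustration, not the package's internal code.

# Reverse percentile CI from subsample estimates (simplified sketch)
n     <- 100
b     <- floor(n^(2/3))                    # subsample size b_n
est   <- 0.5                               # placeholder: full-sample KL estimate
boot  <- rnorm(500, mean = est, sd = 0.1)  # placeholder: subsample KL estimates
alpha <- 0.05
# tau_{b_n} * (boot - est) approximates the law of tau_n * (est - true KL):
scaled <- sqrt(b) * (boot - est)
ci <- est - unname(quantile(scaled, c(1 - alpha/2, alpha/2))) / sqrt(n)
ci  # lower and upper confidence limits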
Reference:
Politis, D. N. and Romano, J. P. (1994). "Large sample confidence regions based on subsamples under minimal assumptions." The Annals of Statistics, 22(4).
Examples
# 1D Gaussian (one- and two-sample problems)
set.seed(0)
X <- rnorm(100)
Y <- rnorm(100, mean = 1, sd = 2)
q <- function(x) dnorm(x, mean = 1, sd = 2)
kld_gaussian(mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 2^2)
#> [1] 0.4431472
kld_est_nn(X, Y = Y)
#> [1] 0.2169136
kld_est_nn(X, q = q)
#> [1] 0.6374628
kld_ci_subsampling(X, Y)$ci
#> 2.5% 97.5%
#> -0.2799538 0.5913424
kld_ci_subsampling(X, q = q)$ci
#> 2.5% 97.5%
#> 0.2870222 0.9201281
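
# A further usage sketch, continuing the example above (output omitted,
# as it varies with B and the random seed): the "se" method and retrieval
# of the subsample estimates via include.boot.
res <- kld_ci_subsampling(X, Y, method = "se", include.boot = TRUE)
res$est         # point estimate of the KL divergence
res$ci          # normal-approximation CI based on the subsample standard error
hist(res$boot)  # distribution of KL estimates across the B subsamples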