This function computes a confidence interval for Kullback-Leibler (KL) divergence based on the subsampling bootstrap introduced by Politis and Romano (1994). See Details for the theoretical properties of this method.

Usage

kld_ci_subsampling(
  X,
  Y = NULL,
  q = NULL,
  estimator = kld_est_nn,
  B = 500L,
  alpha = 0.05,
  subsample.size = function(x) x^(2/3),
  convergence.rate = sqrt,
  method = c("quantile", "se"),
  include.boot = FALSE,
  n.cores = 1L,
  ...
)

Arguments

X, Y

n-by-d and m-by-d data frames or matrices (multivariate samples), or numeric/character vectors (univariate samples, i.e. d = 1), representing n samples from the true distribution \(P\) and m samples from the approximate distribution \(Q\) in d dimensions. Y can be left blank if q is specified (see below).

q

The density function of the approximate distribution \(Q\). Either Y or q must be specified. If the distributions are all continuous or all discrete, q can be directly specified as the probability density/mass function. However, for mixed continuous/discrete distributions, q must be given in decomposed form, \(q(y_c,y_d)=q_{c|d}(y_c|y_d)q_d(y_d)\), specified as a named list with field cond for the conditional density \(q_{c|d}(y_c|y_d)\) (a function that expects two arguments y_c and y_d) and disc for the discrete marginal density \(q_d(y_d)\) (a function that expects one argument y_d). If such a decomposition is not available, it may be preferable to instead simulate a large sample from \(Q\) and use the two-sample syntax.
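
For illustration, the decomposed form might look as follows (a minimal sketch with a hypothetical Q whose discrete part is Bernoulli(0.5) and whose continuous part is conditionally Gaussian):

# Hypothetical mixed Q: y_d ~ Bernoulli(0.5), y_c | y_d ~ N(y_d, 1)
q_mixed <- list(
  cond = function(y_c, y_d) dnorm(y_c, mean = y_d, sd = 1),
  disc = function(y_d) dbinom(y_d, size = 1, prob = 0.5)
)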

estimator

The Kullback-Leibler divergence estimation method; a function expecting two inputs (X and Y or q, depending on arguments provided). Defaults to kld_est_nn.

B

Number of bootstrap replicates (default: 500). Larger values yield more accurate intervals at a higher computational cost.

alpha

Error level; the resulting confidence interval has nominal coverage \(1-\alpha\). Defaults to 0.05.

subsample.size

A function specifying the size of the subsamples as a function of the original sample size; defaults to \(f(x) = x^{2/3}\).
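
For instance, a slower-growing subsample size can be supplied directly (a sketch; the exponent 1/2 is an arbitrary illustrative choice, and X, Y are two samples as in Usage):

# Use b_n = n^(1/2) instead of the default n^(2/3)
kld_ci_subsampling(X, Y = Y, subsample.size = function(x) x^(1/2))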

convergence.rate

A function computing the convergence rate of the estimator as a function of sample sizes. Defaults to \(f(x) = x^{1/2}\). If convergence.rate is NULL, it is estimated empirically from the sample(s) using kldest::convergence_rate().
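
For example, to estimate the rate empirically instead of assuming \(\sqrt{n}\) (a sketch based on the NULL behaviour described above):

# Estimate the convergence rate from the data via kldest::convergence_rate()
kld_ci_subsampling(X, Y = Y, convergence.rate = NULL)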

method

Either "quantile" (the default), also known as the reverse percentile method, or "se" for a normal approximation of the KL divergence estimator using the standard error of the subsamples.

include.boot

Boolean; if TRUE, the KL divergence estimates on the subsamples are included in the returned list (field "boot"). Defaults to FALSE.

n.cores

Number of cores to use in parallel computing (defaults to 1, which means that no parallel computing is used). To use this option, the parallel package must be installed and the OS must be of UNIX type (i.e., not Windows). Otherwise, n.cores will be reset to 1, with a message.
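
On a UNIX-like system with the parallel package installed, parallel computation is requested as follows (a sketch):

# Distribute the B subsample estimates across 4 cores (no effect on Windows)
kld_ci_subsampling(X, Y = Y, n.cores = 4L)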

...

Arguments passed on to estimator, i.e. via the call estimator(X, Y = Y, ...) or estimator(X, q = q, ...).
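
For example, tuning parameters of the default estimator can be forwarded this way (a sketch assuming the k argument of kld_est_nn; see its help page):

# Forward k (number of nearest neighbours) to kld_est_nn via ...
kld_ci_subsampling(X, Y = Y, estimator = kld_est_nn, k = 5)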

Value

A list with the following fields:

  • "est" (the estimated KL divergence),

  • "ci" (a length 2 vector containing the lower and upper limits of the estimated confidence interval).

  • "boot" (a length B numeric vector with KL divergence estimates on the bootstrap subsamples), only included if include.boot = TRUE,

Details

In general terms, letting \(b_n\) be the subsample size for a sample of size \(n\), and \(\tau_n\) the convergence rate of the estimator, a confidence interval calculated by subsampling has asymptotic coverage \(1 - \alpha\) as long as \(b_n/n\rightarrow 0\), \(b_n\rightarrow\infty\) and \(\frac{\tau_{b_n}}{\tau_n}\rightarrow 0\).

In many cases, the convergence rate of the nearest-neighbour based KL divergence estimator is \(\tau_n = \sqrt{n}\) and the condition on the subsample size reduces to \(b_n/n\rightarrow 0\) and \(b_n\rightarrow\infty\). By default, \(b_n = n^{2/3}\). In a two-sample problem, \(n\) and \(b_n\) are replaced by effective sample sizes \(n_\text{eff} = \min(n,m)\) and \(b_{n,\text{eff}} = \min(b_n,b_m)\).
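
As a numerical illustration of the default (a sketch; any rounding of the subsample size to an integer is left implicit here):

# Default subsample size for two samples of size n = m = 100
n_eff <- min(100, 100)   # effective sample size, 100
b_eff <- n_eff^(2/3)     # approximately 21.5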

Reference:

Politis and Romano, "Large sample confidence regions based on subsamples under minimal assumptions", The Annals of Statistics, Vol. 22, No. 4 (1994).

Examples

# 1D Gaussian (one- and two-sample problems)
set.seed(0)
X <- rnorm(100)
Y <- rnorm(100, mean = 1, sd = 2)
q <- function(x) dnorm(x, mean = 1, sd = 2)
kld_gaussian(mu1 = 0, sigma1 = 1, mu2 = 1, sigma2 = 2^2)
#> [1] 0.4431472
kld_est_nn(X, Y = Y)
#> [1] 0.2169136
kld_est_nn(X, q = q)
#> [1] 0.6374628
kld_ci_subsampling(X, Y)$ci
#>       2.5%      97.5% 
#> -0.2799538  0.5913424 
kld_ci_subsampling(X, q = q)$ci
#>      2.5%     97.5% 
#> 0.2870222 0.9201281