Stats
reliably.stats.bootstrap.bootstrap_ci(estimator, n, *, point, n_boot=2000, level=0.95, method='bca', seed=0)
Compute a bootstrap confidence interval for any scalar estimator.
The estimator is called once per bootstrap replicate with a resample index array; use vectorized implementations where possible.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimator` | `Callable[[NDArray], float]` | Function mapping a resample index array to a scalar estimate. | required |
| `n` | `int` | Dataset size. | required |
| `point` | `float` | Point estimate (from the full dataset, not resampled). | required |
| `n_boot` | `int` | Number of resamples. | `2000` |
| `level` | `float` | Nominal coverage. | `0.95` |
| `method` | `str` | | `'bca'` |
| `seed` | `int \| Generator` | RNG seed. | `0` |
Returns:
| Type | Description |
|---|---|
| `CI` | Confidence interval object. |
Examples:
>>> import numpy as np
>>> from reliably.stats.bootstrap import bootstrap_ci
>>> data = np.random.default_rng(0).normal(0, 1, 200)
>>> ci = bootstrap_ci(lambda idx: data[idx].mean(), len(data),
... point=data.mean(), seed=0)
>>> ci.low < ci.high
True
Source code in src/reliably/stats/bootstrap.py
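The index-based estimator convention used by bootstrap_ci can be illustrated with a minimal NumPy sketch. This is not the library's implementation: it computes a plain percentile interval rather than BCa, and the name `percentile_ci_sketch` is hypothetical. It only shows how an estimator that accepts a resample index array is driven once per replicate.

```python
import numpy as np


def percentile_ci_sketch(estimator, n, *, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval (illustrative only, not reliably's BCa code)."""
    # default_rng accepts either an int seed or an existing Generator.
    rng = np.random.default_rng(seed)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n indices with replacement and hand them to the estimator.
        idx = rng.integers(0, n, size=n)
        reps[b] = estimator(idx)
    alpha = 1.0 - level
    low, high = np.quantile(reps, [alpha / 2, 1.0 - alpha / 2])
    return float(low), float(high)


data = np.random.default_rng(0).normal(0, 1, 200)
low, high = percentile_ci_sketch(lambda idx: data[idx].mean(), len(data))
```

Because the estimator receives indices rather than data, the same closure pattern works for metrics that need several aligned arrays (for example, scores and labels indexed with the same `idx`).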
reliably.stats.delong.delong_test(scores_a, scores_b, labels)
Compare two correlated AUROCs on the same test set (DeLong 1988).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `scores_a` | `NDArray[float64]` | Scores from model A. | required |
| `scores_b` | `NDArray[float64]` | Scores from model B. | required |
| `labels` | `NDArray[int64]` | Shared binary labels. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `delta` | `float` | |
| `p_value` | `float` | Two-sided p-value. |
| `se` | `float` | Standard error of the difference. |
Examples:
>>> import numpy as np
>>> from reliably.stats.delong import delong_test
>>> rng = np.random.default_rng(1)
>>> y = rng.integers(0, 2, 200)
>>> sa = rng.uniform(0, 1, 200)
>>> sb = rng.uniform(0, 1, 200)
>>> delta, p, se = delong_test(sa, sb, y)
>>> 0.0 <= p <= 1.0
True
Source code in src/reliably/stats/delong.py
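For readers unfamiliar with the DeLong procedure, the following is a compact sketch of the textbook computation, written under stated assumptions rather than taken from the library. The name `delong_sketch` is hypothetical, the difference is assumed to be AUROC(A) minus AUROC(B), and the pairwise comparison uses O(m·n) memory, so it suits small test sets only.

```python
import math

import numpy as np


def delong_sketch(scores_a, scores_b, labels):
    """Textbook DeLong comparison of two correlated AUROCs (illustrative only)."""
    is_pos = np.asarray(labels).astype(bool)
    scores = np.vstack([scores_a, scores_b]).astype(float)  # shape (2, N)
    pos, neg = scores[:, is_pos], scores[:, ~is_pos]        # (2, m) and (2, n)
    m, n = pos.shape[1], neg.shape[1]

    # psi = 1 if the positive outranks the negative, 0.5 on ties, 0 otherwise.
    diff = pos[:, :, None] - neg[:, None, :]                # shape (2, m, n)
    psi = (diff > 0) + 0.5 * (diff == 0)

    v10 = psi.mean(axis=2)   # structural components per positive, shape (2, m)
    v01 = psi.mean(axis=1)   # structural components per negative, shape (2, n)
    auc = v10.mean(axis=1)   # AUROC of each model

    # 2x2 covariance of the two AUROC estimates.
    cov = np.cov(v10) / m + np.cov(v01) / n
    contrast = np.array([1.0, -1.0])
    var = float(contrast @ cov @ contrast)
    se = math.sqrt(max(var, 0.0))

    delta = float(auc[0] - auc[1])
    # Two-sided normal p-value; defined as 1.0 when the variance is zero.
    p = math.erfc(abs(delta) / (se * math.sqrt(2.0))) if se > 0 else 1.0
    return delta, p, se
```

The covariance term is what makes the test "correlated": both models are scored on the same examples, so their AUROC estimates share noise, and ignoring that would overstate the standard error of the difference.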
reliably.stats.tests.paired_bootstrap_test(estimator_a, estimator_b, n, *, point_a, point_b, n_boot=2000, level=0.95, seed=0)
Paired bootstrap test for any pair of scalar estimators.
Both estimators are applied to the same resample indices, so the comparison is paired. Unlike the DeLong test, which compares AUROCs only, this works for any scalar metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimator_a` | `Callable[[NDArray], float]` | | required |
| `estimator_b` | `Callable[[NDArray], float]` | | required |
| `n` | `int` | Dataset size. | required |
| `point_a` | `float` | Full-data point estimate for model A. | required |
| `point_b` | `float` | Full-data point estimate for model B. | required |
| `n_boot` | `int` | Number of resamples. | `2000` |
| `level` | `float` | Nominal CI coverage. | `0.95` |
| `seed` | `int \| Generator` | RNG seed. | `0` |
Returns:
| Name | Type | Description |
|---|---|---|
| `delta` | `float` | |
| `ci` | `CI` | Bootstrap CI on the difference (percentile by default). |
| `p_value` | `float` | Two-sided p-value via bootstrap hypothesis convention. |
Examples:
>>> import numpy as np
>>> from reliably.stats.tests import paired_bootstrap_test
>>> rng = np.random.default_rng(0)
>>> x = rng.normal(0, 1, 200)
>>> delta, ci, p = paired_bootstrap_test(
... lambda idx: x[idx].mean(), lambda idx: (-x)[idx].mean(),
... len(x), point_a=x.mean(), point_b=(-x).mean(), seed=0
... )
>>> p < 0.01
True
Source code in src/reliably/stats/tests.py
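The pairing idea can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the library's code: `paired_bootstrap_sketch` is a hypothetical name, the interval is a plain percentile interval, and the p-value uses one common convention (twice the smaller tail mass of the resampled differences around zero), which may differ from the convention reliably applies.

```python
import numpy as np


def paired_bootstrap_sketch(estimator_a, estimator_b, n, *,
                            n_boot=2000, level=0.95, seed=0):
    """Paired bootstrap on a metric difference (illustrative only)."""
    rng = np.random.default_rng(seed)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        # One index draw per replicate, shared by both estimators (the pairing).
        idx = rng.integers(0, n, size=n)
        deltas[b] = estimator_a(idx) - estimator_b(idx)

    alpha = 1.0 - level
    low, high = np.quantile(deltas, [alpha / 2, 1.0 - alpha / 2])

    # Assumed convention: two-sided p = 2 * min(P(delta <= 0), P(delta >= 0)).
    tail = min(np.mean(deltas <= 0), np.mean(deltas >= 0))
    p_value = min(1.0, 2.0 * float(tail))
    return float(low), float(high), p_value


# Hypothetical per-example correctness for two models on the same 500 items.
rng = np.random.default_rng(0)
correct_a = rng.random(500) < 0.9
correct_b = rng.random(500) < 0.6
low, high, p = paired_bootstrap_sketch(
    lambda idx: correct_a[idx].mean(),
    lambda idx: correct_b[idx].mean(),
    len(correct_a),
)
```

Sharing `idx` between the two estimators means example-level difficulty cancels out of the difference, which typically gives a tighter interval than bootstrapping each model independently.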
reliably.stats.tests.apply_correction(p_values, correction, *, level=0.05)
Apply a multiple-comparison correction by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `p_values` | `list[float]` | Raw p-values. | required |
| `correction` | `str \| None` | | required |
| `level` | `float` | Error rate. | `0.05` |
Returns:
| Type | Description |
|---|---|
| `list[bool]` | Significance flags. |
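As background on what a by-name correction typically does, here is a minimal sketch of three standard procedures. The function name `correction_sketch` and the accepted strings `"bonferroni"`, `"holm"`, and `"bh"` are assumptions made for illustration; the names that apply_correction actually accepts are not listed above.

```python
import numpy as np


def correction_sketch(p_values, correction, *, level=0.05):
    """Standard multiple-comparison corrections (illustrative only)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    if correction is None:
        # No correction: test each p-value against the raw level.
        return (p <= level).tolist()
    if correction == "bonferroni":
        # Single-step control of the family-wise error rate.
        return (p <= level / m).tolist()

    order = np.argsort(p)
    ranked = p[order]
    if correction == "holm":
        # Step-down: k-th smallest p vs level / (m - k); stop at first failure.
        passed = ranked <= level / (m - np.arange(m))
        n_reject = m if passed.all() else int(np.argmin(passed))
    elif correction == "bh":
        # Benjamini-Hochberg step-up: largest k with p_(k) <= k * level / m.
        passed = ranked <= level * np.arange(1, m + 1) / m
        n_reject = int(np.nonzero(passed)[0].max()) + 1 if passed.any() else 0
    else:
        raise ValueError(f"unknown correction: {correction!r}")

    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject.tolist()
```

Bonferroni and Holm control the family-wise error rate, while Benjamini-Hochberg controls the false discovery rate, so the meaning of `level` depends on which procedure is selected.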
Examples: