reliably
Statistically rigorous model reliability evaluation — every metric carries a confidence interval, every comparison carries a significance test.
Five-line quick start
import numpy as np
import reliably as rb
# y_true: ground-truth labels; y_prob: predicted probabilities for the same samples
report = rb.evaluate(y_true, y_prob)
print(report.summary())
# Report(task=binary, n=1000)
# ECE=0.0312 [0.0211, 0.0421]
# smECE=0.0289 [0.0185, 0.0398]
# Brier=0.1842 [0.1714, 0.1971]
# NLL=0.5103 [0.4887, 0.5319]
# AUROC=0.7841 [0.7512, 0.8170]
# Render to HTML
report.to_html("report.html")
# Plot reliability diagram
ax = report.reliability_diagram(y_true, y_prob)
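To make the bracketed intervals in the summary concrete, here is a minimal NumPy-only sketch of a percentile bootstrap interval around a binned ECE. It is not the `reliably` implementation: the helper names `ece` and `bootstrap_ci`, the 10 equal-width bins, the 1000 resamples, and the percentile method are all assumptions chosen for illustration, and the data is synthetic.

```python
import numpy as np


def ece(y_true, y_prob, n_bins=10):
    """Equal-width binned expected calibration error for binary probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Gap between mean predicted probability and observed frequency,
            # weighted by the fraction of samples that fall in the bin.
            total += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return total


def bootstrap_ci(metric, y_true, y_prob, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (label, prediction) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        stats[i] = metric(y_true[idx], y_prob[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return metric(y_true, y_prob), lo, hi


# Synthetic, calibrated-by-construction data (illustration only).
rng = np.random.default_rng(42)
y_prob = rng.uniform(0.0, 1.0, size=1000)
y_true = (rng.uniform(0.0, 1.0, size=1000) < y_prob).astype(int)

point, lo, hi = bootstrap_ci(ece, y_true, y_prob)
print(f"ECE={point:.4f} [{lo:.4f}, {hi:.4f}]")
```

Binned ECE is biased upward on resampled data, so a plain percentile interval may not bracket the point estimate for a well-calibrated model. A production implementation may apply a bias correction or a different interval type.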
Why reliably?
| Feature | reliably | netcal | Uncertainty Toolbox |
|---|---|---|---|
| Bootstrap CIs on every metric | ✅ | ❌ | Partial |
| DeLong significance test | ✅ | ❌ | ❌ |
| Paired bootstrap comparison | ✅ | ❌ | ❌ |
| Representation quality (MIG, DCI…) | ✅ | ❌ | ❌ |
| Framework-agnostic (numpy/torch/jax) | ✅ | ✅ | ✅ |
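For readers unfamiliar with the "paired bootstrap comparison" row, the idea can be sketched in plain NumPy: both models are scored on the same resampled rows, so the interval describes the difference between them and not each model separately. This is an illustrative sketch, not the `reliably` API. The names `brier` and `paired_bootstrap_diff`, the resample count, and the synthetic models are assumptions.

```python
import numpy as np


def brier(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary labels."""
    return float(np.mean((y_prob - y_true) ** 2))


def paired_bootstrap_diff(metric, y_true, prob_a, prob_b,
                          n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap metric(A) - metric(B), reusing the same resampled rows for both models."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # one index draw shared by both models
        diffs[i] = metric(y_true[idx], prob_a[idx]) - metric(y_true[idx], prob_b[idx])
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    observed = metric(y_true, prob_a) - metric(y_true, prob_b)
    return observed, lo, hi


# Synthetic example: model A is calibrated by construction,
# model B is the same prediction pushed toward 0 and 1 (overconfident).
rng = np.random.default_rng(7)
p = rng.uniform(0.05, 0.95, size=2000)
y = (rng.uniform(size=2000) < p).astype(int)
prob_a = p
prob_b = np.clip(0.5 + 1.8 * (p - 0.5), 0.01, 0.99)

observed, lo, hi = paired_bootstrap_diff(brier, y, prob_a, prob_b)
print(f"Brier(A) - Brier(B) = {observed:.4f} [{lo:.4f}, {hi:.4f}]")
```

An interval that excludes zero suggests the gap between the two models is unlikely to be resampling noise. Sharing the index draw is what makes the comparison paired: per-sample difficulty cancels out of the difference, which typically gives a tighter interval than bootstrapping each model independently.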
Installation
# Core only (numpy + scipy)
pip install reliably
# With visualization
pip install "reliably[viz]"
# With HTML reports
pip install "reliably[report]"
# Everything
pip install "reliably[all]"
License
Apache 2.0