The RL Reliability Metrics library provides a set of metrics for measuring the reliability of reinforcement learning (RL) algorithms. The library also provides statistical tools for computing confidence intervals and for comparing algorithms on these metrics.
As input, the library accepts either a set of RL training curves or a set of rollouts from an already trained RL algorithm. It computes reliability metrics along several dimensions (it can also analyze non-reliability metrics such as median performance) and outputs plots of these metrics for each algorithm, either aggregated across tasks or on a per-task basis. It also provides statistical tests for comparing algorithms on these metrics, along with bootstrapped confidence intervals for the metric values.
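To make the workflow above concrete, here is a minimal, standard-library-only sketch of the general idea: take several training curves, compute one dispersion-style metric across runs, and attach a bootstrapped confidence interval. It does not use this library's API. The function names (`iqr`, `dispersion_across_runs`, `bootstrap_ci`), the choice of inter-quartile range averaged over timesteps as the metric, and the percentile bootstrap are all assumptions made for illustration.

```python
import random
import statistics


def iqr(values):
    """Inter-quartile range (Q3 - Q1) of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def dispersion_across_runs(curves):
    """Hypothetical metric: IQR across runs at each timestep, averaged over time.

    `curves` is a list of runs; each run is a list of per-timestep returns,
    and all runs are assumed to have the same length.
    """
    n_steps = len(curves[0])
    per_step = [iqr([run[t] for run in curves]) for t in range(n_steps)]
    return statistics.mean(per_step)


def bootstrap_ci(curves, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole runs with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        metric([rng.choice(curves) for _ in curves]) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data (made up for this sketch): 10 runs of 20 timesteps each.
data_rng = random.Random(1)
toy_curves = [
    [0.1 * t + data_rng.gauss(0, 1) for t in range(20)] for _ in range(10)
]

point_estimate = dispersion_across_runs(toy_curves)
ci_lower, ci_upper = bootstrap_ci(toy_curves, dispersion_across_runs)
print(f"dispersion across runs: {point_estimate:.3f} "
      f"(95% CI: {ci_lower:.3f} to {ci_upper:.3f})")
```

Resampling whole runs, rather than individual timesteps, keeps each curve's temporal structure intact, which is one reasonable choice when runs are the independent unit; the library itself may make different choices.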