SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

Abstract

Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-inefficient due to bootstrapping. While recent work has explored contrastive and supervised formulations to improve stability, we present a probabilistic alternative, Survival Value Learning (SVL), that reframes GCRL as a survival learning problem by modeling the time-to-goal from each state as a probability distribution. This structured distributional Monte Carlo perspective yields a closed-form identity that expresses the goal-conditioned value function as a discounted sum of survival probabilities, so the value can be estimated with a hazard model trained by maximum likelihood on both event and right-censored trajectories. We introduce three practical value estimators: a finite-horizon truncation and two binned infinite-horizon approximations that capture long-horizon objectives. Experiments on offline GCRL benchmarks show that SVL combined with hierarchical actors matches or surpasses strong hierarchical TD and Monte Carlo baselines, excelling on complex, long-horizon tasks.

Time-to-goal distributions in GCRL

In the general GCRL setting, the time required to reach a goal is a random variable. In a navigation task, for example, an agent may reach the target from the start via multiple distinct paths, resulting in a multi-modal distribution of arrival times. Traditional TD learning approaches estimate the goal-conditioned value function by propagating reward signals backward through all transitions from the goal to the state. Instead, we propose directly exploiting the arrival-time realizations: our survival value learning (SVL) approach models this distribution and recovers the value function \( V^\pi(s, g) \) as a discounted sum of survival probabilities. By learning a hazard model via maximum likelihood on both event and right-censored trajectories, SVL bypasses the bootstrapping instabilities of TD methods while inheriting the classical guarantees of MLE.
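To make the identity concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes discrete time and a reward of -1 per step until the goal is reached, in which case \( V^\pi(s, g) = -\sum_{t \ge 0} \gamma^t S(t \mid s, g) \) with \( S(t) = P(T > t) \). The reward convention and sign, the function names `value_from_hazard` and `hazard_nll`, and the array layout of the hazards are all illustrative assumptions.

```python
import numpy as np


def value_from_hazard(hazard, gamma):
    """Finite-horizon value estimate from discrete-time hazards.

    Assumed layout: hazard[k - 1] = P(T = k | T >= k) for k = 1..H.
    Assumed reward: -1 per step until the goal is reached.
    """
    hazard = np.asarray(hazard, dtype=float)
    # S(k) = P(T > k) = prod_{j <= k} (1 - h_j), for k = 1..H
    surv = np.cumprod(1.0 - hazard)
    # Shift so the array holds S(0), ..., S(H - 1), with S(0) = 1
    surv = np.concatenate(([1.0], surv[:-1]))
    discounts = gamma ** np.arange(len(hazard))
    # Truncated discounted sum of survival probabilities
    return -np.sum(discounts * surv)


def hazard_nll(hazard, time, event, eps=1e-8):
    """Negative log-likelihood of one observed time-to-goal.

    time:  number of steps observed (1-based, at most len(hazard)).
    event: True if the goal was reached at `time`,
           False if the trajectory was right-censored at `time`.
    """
    hazard = np.asarray(hazard, dtype=float)
    log_stay = np.log(np.clip(1.0 - hazard, eps, 1.0))
    if event:
        # Survive the first time - 1 steps, then reach the goal at `time`
        log_hit = np.log(np.clip(hazard[time - 1], eps, 1.0))
        return -(log_stay[: time - 1].sum() + log_hit)
    # Censored: we only know the goal was not reached within `time` steps
    return -log_stay[:time].sum()
```

For a goal that is always reached in exactly three steps, `value_from_hazard([0, 0, 1, 1], gamma=0.5)` evaluates to -(1 + 0.5 + 0.25) = -1.75. In practice the hazards would come from a network conditioned on the state and goal; here they are plain arrays to keep the sketch self-contained.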

Hierarchical Survival Value Learning (HSVL)

[Figure: Comparative results of SVL across OGBench tasks]

We instantiate SVL within a hierarchical offline GCRL algorithm, HSVL, that extracts a hierarchical policy from the learned survival value function via advantage-weighted regression. Across the challenging OGBench benchmark (state-based and visual domains), HSVL matches or exceeds prior methods on standard tasks and delivers substantial gains on complex, long-horizon problems: it reaches 81% success on humanoidmaze-giant-navigate, where the best hierarchical baseline (HIQL) achieves only 12% and non-hierarchical methods reach only 0–3%. On pixel-based benchmarks, HSVL achieves 61% on visual-antmaze-giant, surpassing the best hierarchical (HIQL, 6%) and contrastive (CRL, 47%) baselines, and is the only method to extract meaningful signal on visual-humanoidmaze-medium.
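As a rough illustration of the policy-extraction step, the sketch below shows generic advantage-weighted regression weights in NumPy. It is not HSVL's actual objective: the names `awr_weights` and `awr_loss`, the inverse temperature `beta`, the clipping threshold `max_weight`, and the way the advantage is computed (for instance, a difference of learned values between consecutive states) are all assumptions made for the example.

```python
import numpy as np


def awr_weights(advantage, beta=3.0, max_weight=100.0):
    """Exponentiated advantages, clipped for numerical stability.

    beta and max_weight are hypothetical hyperparameters.
    """
    advantage = np.asarray(advantage, dtype=float)
    return np.minimum(np.exp(beta * advantage), max_weight)


def awr_loss(log_prob, advantage, beta=3.0):
    """Weighted negative log-likelihood of dataset actions.

    log_prob:  log pi(a | s, g) for each sample in the batch.
    advantage: advantage estimate for the same samples.
    """
    weights = awr_weights(advantage, beta)
    return -np.mean(weights * np.asarray(log_prob, dtype=float))
```

Samples with higher advantage get larger weights, so the policy imitates dataset actions that the value function considers better. With zero advantage every weight is 1 and the loss reduces to plain behavior cloning.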

These results provide evidence that survival-based MLE mitigates the compounding bootstrap errors of TD over long horizons, offering a stable, probabilistic alternative to standard goal-conditioned value learning.

BibTeX

@article{tiofack2026svl,
  title={SVL: Goal-Conditioned Reinforcement Learning as Survival Learning},
  author={Tiofack, Franki Nguimatsia and Schramm, Fabian and Hellard, Th{\'e}otime Le and Carpentier, Justin},
  journal={arXiv preprint arXiv:2604.17551},
  year={2026}
}
