Guided Flow Policy:
Learning from High-Value Actions in Offline RL

*Equal contribution. ¹Willow, Inria - ENS-PSL. ²ISIR.

Abstract

Offline reinforcement learning often relies on behavior regularization that constrains policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization term. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state- and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks.

Overview of GFP

In offline RL, the agent learns exclusively from a static dataset generated by an unknown behavior policy. For a given state s, the action distribution of this behavior policy is illustrated on the left. GFP consists of three main components: (i) in yellow, VaBC, a multi-step flow policy trained via weighted BC using the guidance term; (ii) in green, a one-step actor distilled from the flow policy; and (iii) in gray, a critic that evaluates actions. VaBC regularizes the actor toward the high-value actions of the dataset; in turn, the actor shapes the flow and maximizes the critic, following the actor-critic approach. Each drawing depicts a policy's action distribution in the current state s, except for the gray ones, which show the critic's values for actions in state s. We refer to the paper for further details.
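The weighted-BC objective for the flow policy can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes a rectified-flow (linear interpolation) target velocity and uses normalized exponentiated advantages as a stand-in for the guidance term; the actual guidance weighting used by GFP may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(a0, a1, t):
    """Linear (rectified-flow) interpolation: noisy action a_t and target velocity."""
    a_t = (1.0 - t) * a0 + t * a1
    v_target = a1 - a0
    return a_t, v_target

def weighted_bc_flow_loss(v_pred, v_target, weights):
    """Per-sample squared error, reweighted so high-value actions dominate the BC signal."""
    per_sample = ((v_pred - v_target) ** 2).sum(axis=-1)
    return float((weights * per_sample).mean())

# Toy batch: dataset actions a1, noise samples a0, interpolation times t.
batch, act_dim = 4, 2
a0 = rng.standard_normal((batch, act_dim))
a1 = rng.standard_normal((batch, act_dim))
t = rng.uniform(size=(batch, 1))

a_t, v_target = flow_matching_targets(a0, a1, t)

# Hypothetical critic advantages; exponentiated advantages are one common
# choice of value weighting (the paper's exact guidance term is not shown here).
adv = rng.standard_normal(batch)
weights = np.exp(adv)
weights /= weights.mean()  # normalize so the loss scale matches plain BC

# A perfect velocity prediction yields zero loss regardless of the weights.
assert weighted_bc_flow_loss(v_target, v_target, weights) == 0.0

v_pred = v_target + 0.1  # a slightly wrong prediction from a stand-in network
loss = weighted_bc_flow_loss(v_pred, v_target, weights)
print(round(loss, 4))  # → 0.02
```

With plain BC all weights would equal 1; the guidance term skews the gradient toward transitions the critic rates highly, which is the sense in which VaBC "focuses on cloning high-value actions."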

Experiments across 144 tasks

Performance profiles comparing GFP and prior works
(a) Performance profiles for 50 tasks comparing GFP against a wide range of prior works, showing the fraction of tasks where each algorithm achieves a score above a threshold τ. (b) Performance profiles on 105 tasks, including more challenging ones and carefully reevaluated prior methods. (c) Performance profiles restricted to 30 noisy and explore tasks.

Along with the code, we release CSV files containing all our benchmarking results. These cover the 144 tasks GFP was evaluated on, all hyperparameters used, our careful reevaluation of ReBRAC on OGBench, and the first evaluation of GFP and FQL on Minari.

BibTeX

@inproceedings{tiofack2026guided,
    title     = {Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning},
    author    = {Franki {Nguimatsia Tiofack} and Theotime {Le Hellard} and Fabian Schramm and Nicolas Perrin-Gilbert and Justin Carpentier},
    booktitle = {The Fourteenth International Conference on Learning Representations},
    year      = {2026},
}