Flow as Flow: Modeling Robot Velocity Fields as Probability Velocity Fields

Anonymous Author(s)


Under Review

Summary video (coming soon)


Overview of Flow as Flow. We train on diverse cross-embodiment videos of multiple robot embodiments and humans. At deployment, the model predicts a robot flow (robot velocity field) that represents task-relevant motion, conditioned on an initial image and a goal image. The robot then manipulates objects conditioned on the generated flow to reach the object poses specified by the goal image.

Abstract

Cross-embodiment data have become central to training robotic foundation models. To leverage such heterogeneous data, we focus on flow-based object manipulation, where robot flows (robot velocity fields) serve as embodiment-agnostic motion representations. Prior work often formulates robot flows by differencing predicted keypoints across frames, which requires a strong visibility assumption and thus yields rough approximations of their underlying velocity fields. To address this limitation, we propose Flow as Flow, a framework that models robot flows as probability flows based on a flow matching formulation. By naturally modeling such velocity fields within this formulation, our method achieves efficient and high-quality robot flow generation. Across four benchmarks (Fractal, Bridge V2, DROID-100, and Fanuc Manipulation), our method outperforms representative baseline methods on ADE, FDE, and LTDR, while achieving approximately 33× faster generation than the strongest baseline. Furthermore, through real-world experiments evaluating 9 methods with 260 trials per method across 13 manipulation tasks, we show that our method achieves a higher average success rate than the baseline methods.

Highlights

🚀 ~33× faster generation than the strongest baseline, Track2Act (44 ms vs. 1,430 ms per sample).

📊 Achieves the best ADE scores across four standard benchmarks (Fractal, Bridge V2, DROID-100, Fanuc Manipulation), including zero-shot settings.

🤖 Achieves the highest average success rate across 13 real-world mobile manipulation tasks, outperforming the strongest baseline (Track2Act) by 10 points.

🔌 Architecture-agnostic: Flow as Flow can be directly integrated into other flow-based methods without any structural changes.

Real-World Experiments

We evaluate our method on 13 mobile manipulation tasks using the Human Support Robot (HSR).

Proposed Framework

Core Novelty

We propose Flow as Flow, a framework that models physical robot velocity fields as probability velocity fields, achieving efficient and high-quality generation of robot flows.

Flow as Flow

The core novelty of our framework is modeling physical robot velocity fields as probability velocity fields in the generation space of flow matching. We initialize \(N\) (e.g., \(10\times 10\)) points uniformly on the image and obtain their future positions by integrating the velocity fields predicted by a flow generation model \(\boldsymbol{v}_\theta\).
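As a minimal sketch of this step (not the authors' implementation), the Python snippet below places a 10×10 grid of query points on an image and advances them by numerically integrating a velocity function. The cell-centre placement, the forward-Euler solver, the 320×240 image size, and the names `init_grid`, `integrate`, and `toy_velocity` are illustrative assumptions; in the framework itself the velocity would come from \(\boldsymbol{v}_\theta\).

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def init_grid(width: float, height: float, rows: int = 10, cols: int = 10) -> List[Point]:
    """Place rows * cols query points at the cell centres of a uniform grid."""
    return [((c + 0.5) * width / cols, (r + 0.5) * height / rows)
            for r in range(rows) for c in range(cols)]


def integrate(points: List[Point],
              velocity: Callable[[List[Point], float], List[Point]],
              horizon: float,
              n_steps: int = 20) -> List[Point]:
    """Forward-Euler integration of dx/dt = velocity(x, t) from t = 0 to t = horizon."""
    dt = horizon / n_steps
    pts = list(points)
    for i in range(n_steps):
        vel = velocity(pts, i * dt)
        pts = [(x + dt * vx, y + dt * vy) for (x, y), (vx, vy) in zip(pts, vel)]
    return pts


def toy_velocity(pts: List[Point], t: float) -> List[Point]:
    """Hypothetical stand-in for the learned field: every point moves right at 1 px per unit time."""
    return [(1.0, 0.0) for _ in pts]


grid = init_grid(320, 240)                            # N = 100 points
moved = integrate(grid, toy_velocity, horizon=5.0)    # each point shifts 5 px to the right
```

Swapping `toy_velocity` for a learned model leaves the integration loop unchanged, which is the sense in which future positions are obtained by integrating the predicted field.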

We construct target velocity fields in a stabilizing feedback form: \[ \boldsymbol{v}(\boldsymbol{\Xi}_h, \boldsymbol{X}, h) = \dot{\boldsymbol{\Xi}}_h - k(\boldsymbol{X} - \boldsymbol{\Xi}_h), \] where \(\boldsymbol{X} \sim \mathcal{N}(\boldsymbol{\Xi}_h,\, \sigma_0^2 e^{-2kh}\boldsymbol{I})\). The stabilization term enhances robustness to out-of-distribution samples.
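One way to read the noise schedule: if a point follows \(\dot{\boldsymbol{X}} = \boldsymbol{v}\), the error \(\boldsymbol{X} - \boldsymbol{\Xi}_h\) obeys \(\dot{\boldsymbol{e}} = -k\boldsymbol{e}\) and shrinks as \(e^{-kh}\), so an initial spread of \(\sigma_0\) becomes \(\sigma_0 e^{-kh}\). The Python sketch below checks this numerically on a toy trajectory. The trajectory, the values of \(k\) and \(\sigma_0\), the treatment of \(h\) as continuous, and all function names are illustrative assumptions.

```python
import math
import random


def target_velocity(xi, xi_dot, x, k):
    """v = xi_dot - k * (x - xi): follow the trajectory and pull x back toward it."""
    return [d - k * (xc - c) for c, d, xc in zip(xi, xi_dot, x)]


def sample_noisy_point(xi, h, k, sigma0, rng):
    """Draw X ~ N(xi, sigma0^2 * exp(-2 * k * h) * I)."""
    std = sigma0 * math.exp(-k * h)
    return [c + rng.gauss(0.0, std) for c in xi]


# Hypothetical ground-truth trajectory: a point moving right at 3 px per unit of h.
def xi_at(h):
    return [3.0 * h, 0.0]


def xi_dot_at(h):
    return [3.0, 0.0]


k, sigma0 = 2.0, 4.0
x = [xi_at(0.0)[0] + sigma0, 0.0]        # start one sigma0 away from the trajectory
n_steps, horizon = 10_000, 1.0
dt = horizon / n_steps
for i in range(n_steps):                 # forward-Euler integration of dX/dh = v
    h = i * dt
    v = target_velocity(xi_at(h), xi_dot_at(h), x, k)
    x = [xc + dt * vc for xc, vc in zip(x, v)]

final_error = x[0] - xi_at(horizon)[0]
expected_error = sigma0 * math.exp(-k * horizon)   # matches the std of X at h = horizon

# The sampling noise is correspondingly tiny at large h.
late_sample = sample_noisy_point(xi_at(10.0), 10.0, k, sigma0, random.Random(0))
```

The feedback term is what makes a perturbed point converge back onto the trajectory rather than drift parallel to it.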

We train \(\boldsymbol{v}_\theta\) with the conditional flow matching (CFM) loss: \[ \mathcal{L}_{\text{CFM}} = \mathbb{E}_{\boldsymbol{\Xi}_h, h, \boldsymbol{X}} \left[\left\lVert \boldsymbol{v}_\theta\!\left(\boldsymbol{X}, h \mid \mathcal{I}, \mathcal{G}, \boldsymbol{\Xi}_{0:h-1}\right) - \boldsymbol{v}\!\left(\boldsymbol{\Xi}_h, \boldsymbol{X}, h\right) \right\rVert^2\right]. \] At inference, coordinates at step \(h\) are obtained by integrating \(\boldsymbol{v}_\theta\) autoregressively, enabling fast generation with only a single ODE solve per step: \[ \boldsymbol{X}_h = \boldsymbol{X}_0 + \int_0^h \boldsymbol{v}_\theta\!\left(\boldsymbol{X}_\tau, \tau \mid \mathcal{I}, \mathcal{G}, \mathcal{X}_{<\tau}\right) d\tau. \]
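The loss and the rollout can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' training code: the expectation is replaced by a fixed list of samples, the models ignore the conditioning on \(\mathcal{I}\), \(\mathcal{G}\), and the history, each step is solved with forward Euler, and `cfm_loss`, `rollout`, `oracle_model`, and `zero_model` are hypothetical names.

```python
def target_velocity(xi, xi_dot, x, k):
    """Target field v = xi_dot - k * (x - xi)."""
    return [d - k * (xc - c) for c, d, xc in zip(xi, xi_dot, x)]


def cfm_loss(model, samples, k):
    """Monte Carlo estimate of E || v_theta(X, h) - v(Xi_h, X, h) ||^2."""
    total = 0.0
    for xi, xi_dot, x, h in samples:
        pred = model(x, h)
        target = target_velocity(xi, xi_dot, x, k)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(samples)


def rollout(model, x0, num_steps, substeps=10):
    """Advance one point from step 0 to num_steps, with one Euler solve per step."""
    dt = 1.0 / substeps
    x, history = list(x0), [list(x0)]
    for step in range(num_steps):
        for j in range(substeps):
            tau = step + j * dt
            v = model(x, tau)            # a real model would also take I, G, and history
            x = [xc + dt * vc for xc, vc in zip(x, v)]
        history.append(list(x))          # generated coordinates feed later steps
    return history


K = 2.0


def oracle_model(x, h):
    """Knows the toy trajectory Xi_h = (3h, 0) and returns its exact target velocity."""
    return target_velocity([3.0 * h, 0.0], [3.0, 0.0], x, K)


def zero_model(x, h):
    """Untrained stand-in that always predicts zero velocity."""
    return [0.0, 0.0]


# Each sample is (Xi_h, dXi_h/dh, X, h), with X offset from the trajectory.
samples = [([3.0 * h, 0.0], [3.0, 0.0], [3.0 * h + 1.0, -0.5], h) for h in (0.0, 1.0, 2.0)]
loss_oracle = cfm_loss(oracle_model, samples, K)    # perfect fit
loss_zero = cfm_loss(zero_model, samples, K)        # nonzero regression error
track = rollout(oracle_model, [0.0, 0.0], num_steps=4)
```

Because the loss is a plain regression onto a closed-form target, training needs no simulation; the ODE is only solved at inference time.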

Model Architecture

Model architecture of Flow as Flow.

Our framework consists of two main modules: the Flow Generation Module and the Action Generation Module.

Flow Generation Module

Implements \(\boldsymbol{v}_\theta\) based on the Diffusion Transformer (DiT). Each DiT block is modulated via adaLN-Zero using a shared conditioning vector constructed from \(\mathcal{I}\), \(\mathcal{G}\), and sampling step \(h\). The initial and goal images are encoded via ResNet-18, and the conditioning vector is obtained by summing embeddings.
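To illustrate the conditioning path, the following Python sketch sums three embeddings into one conditioning vector and uses it to modulate a single sublayer in the adaLN-Zero style: a zero-initialized linear map produces shift, scale, and gate values, so the block starts as the identity. It is a simplified sketch under stated assumptions. The embedding values, the dimension, the `double` sublayer, and all names are hypothetical, and the activation before the modulation layer and the separate attention and MLP branches of a full DiT block are omitted.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, c):
    """Dense layer: weight is a list of rows, one per output unit."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weight, bias)]


def adaln_zero_block(x, cond, weight, bias, sublayer):
    """x + gate * sublayer(LN(x) * (1 + scale) + shift), with (shift, scale, gate) from cond."""
    d = len(x)
    params = linear(weight, bias, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    modulated = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    update = sublayer(modulated)
    return [xi + g * u for xi, g, u in zip(x, gate, update)]


def double(v):
    """Hypothetical sublayer standing in for attention or an MLP."""
    return [2.0 * t for t in v]


d = 4
emb_initial = [0.1, -0.2, 0.3, 0.0]   # stand-in for the encoded initial image
emb_goal = [0.0, 0.5, -0.1, 0.2]      # stand-in for the encoded goal image
emb_step = [0.3, 0.3, 0.3, 0.3]       # stand-in for the embedding of step h
cond = [a + b + c for a, b, c in zip(emb_initial, emb_goal, emb_step)]

x = [1.0, 2.0, 3.0, 4.0]
zero_weight = [[0.0] * d for _ in range(3 * d)]
zero_bias = [0.0] * (3 * d)
out_init = adaln_zero_block(x, cond, zero_weight, zero_bias, double)   # identity at init

open_bias = [0.0] * (2 * d) + [1.0] * d                                # gate forced to 1
out_open = adaln_zero_block(x, cond, zero_weight, open_bias, double)   # sublayer now contributes
```

Starting every block as the identity is the usual motivation for the zero initialization: the conditioning signal is introduced gradually as the modulation weights are learned.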

Action Generation Module

Predicts end-effector poses conditioned on the generated robot flows. Implemented based on Diffusion Policy with a DiT backbone, trained with a flow matching objective.

Results

Flow Generation

Quantitative Results

We compared our method against baseline methods on the test sets of Fractal, Bridge V2, DROID-100, and Fanuc Manipulation. Fractal and Bridge V2 served as in-domain settings; DROID-100 and Fanuc Manipulation served as zero-shot settings.

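In-domain: Fractal, Bridge V2. Zero-shot: DROID-100, Fanuc Manip.

| Method | Flow as Flow | Fractal ADE ↓ | Fractal FDE ↓ | Fractal LTDR ↑ [%] | Bridge V2 ADE ↓ | Bridge V2 FDE ↓ | Bridge V2 LTDR ↑ [%] | DROID-100 ADE ↓ | DROID-100 FDE ↓ | DROID-100 LTDR ↑ [%] | Fanuc Manip. ADE ↓ | Fanuc Manip. FDE ↓ | Fanuc Manip. LTDR ↑ [%] | Inf. speed ↓ [ms] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Language-conditioned | | | | | | | | | | | | | | |
| FLIP | | 66.17 | 87.52 | 35.69 | 50.73 | 68.43 | 47.72 | 43.10 | 49.10 | 54.87 | 28.31 | 50.83 | 72.17 | 35 |
| FLIP | ✓ | 38.77 | 57.41 | 58.11 | 48.34 | 66.31 | 49.26 | 38.54 | 44.48 | 56.25 | 26.79 | 47.85 | 71.62 | 17 |
| Im2Flow2Act | | 37.14 | 47.74 | 60.61 | 51.48 | 70.93 | 47.97 | 44.14 | 54.41 | 51.48 | 38.15 | 64.18 | 59.25 | 5,580 |
| Im2Flow2Act | ✓ | 33.21 | 46.83 | 64.25 | 42.96 | 60.93 | 54.00 | 38.87 | 45.25 | 56.07 | 26.51 | 48.48 | 71.75 | 230 |
| GigaWorld-0-Video | — | 74.00 | 95.23 | 32.46 | 53.18 | 69.58 | 46.44 | 42.75 | 47.96 | 53.91 | 37.60 | 58.35 | 61.22 | 26,976 |
| Goal-conditioned | | | | | | | | | | | | | | |
| Track2Act | | 64.32 | 86.62 | 42.00 | 47.29 | 64.13 | 51.61 | 40.73 | 47.43 | 54.29 | 27.37 | 47.17 | 70.99 | 1,430 |
| Ours | ✓ | 21.23 | 27.31 | 76.79 | 27.11 | 34.66 | 69.96 | 35.89 | 40.58 | 58.81 | 22.46 | 42.19 | 74.54 | 44 |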

Quantitative comparison on robot flow generation benchmarks. Bold indicates the best result and underline indicates the second best. Green rows indicate methods using Flow as Flow (ours = dark green, variants = light green).
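The page does not spell out how ADE and FDE are computed. Assuming they follow the usual trajectory-prediction definitions (mean Euclidean error over all points and time steps, and mean Euclidean error at the final time step), a minimal Python sketch is shown below. The function name `ade_fde`, the track layout, and the toy numbers are illustrative assumptions; LTDR is left out because its definition is not given here.

```python
import math


def ade_fde(pred, gt):
    """Return (ADE, FDE) for tracks laid out as [num_points][num_steps] of (x, y) pairs."""
    dists = [[math.dist(p, g) for p, g in zip(pred_track, gt_track)]
             for pred_track, gt_track in zip(pred, gt)]
    ade = sum(sum(row) for row in dists) / sum(len(row) for row in dists)
    fde = sum(row[-1] for row in dists) / len(dists)
    return ade, fde


# One toy track with three time steps; per-step errors are 0, 3, and 4.
gt = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]
pred = [[(0.0, 0.0), (1.0, 3.0), (2.0, 4.0)]]
ade, fde = ade_fde(pred, gt)
```

Under these definitions FDE only looks at where each point ends up, while ADE also penalizes deviations along the way.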

Qualitative Results

Comparison of robot flow generation between our method and Track2Act on samples from Fractal (i), Bridge V2 (ii), DROID-100 (iii), and Fanuc Manipulation (iv). The left two columns show \(\mathcal{I}\) and \(\mathcal{G}\); the remaining columns show the ground truth, Track2Act, and our method's flows overlaid on \(\mathcal{I}\).

Qualitative results of robot flow generation. Our method generated flows targeting the correct object with the appropriate motion direction in both in-domain and zero-shot settings.


Real-World Robot Experiments

We validated our method in downstream mobile manipulation tasks using the Human Support Robot (HSR), an 11-DoF mobile manipulator. We evaluated on 13 manipulation tasks, each with ~30 teleoperated demonstrations for training and 20 trials for evaluation (260 trials per method).


Real-world experimental setup with the Human Support Robot.

Representative examples of mobile manipulation tasks. The 13 tasks are bin picking, bussing table, push bin into shelf, push chair, open drawer, close drawer, put fruit on plate, close box, water plant, take towel, close laptop, stack block, and stack cup.


Quantitative Results

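Success rate [%] per task:

| Method | Flow-based | Flow as Flow | Push bin into shelf | Push chair | Close drawer | Close box | Take towel | Bin picking | Put fruit on plate | Bussing table | Close laptop | Water plant | Open drawer | Stack cup | Stack block | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Language-conditioned | | | | | | | | | | | | | | | | |
| DP-Lang | | | 75 | 75 | 65 | 65 | 35 | 35 | 35 | 35 | 35 | 5 | 20 | 5 | 10 | 38 |
| FLIP | ✓ | | 45 | 55 | 50 | 60 | 50 | 25 | 15 | 15 | 25 | 5 | 10 | 10 | 5 | 28 |
| FLIP | ✓ | ✓ | 55 | 55 | 70 | 65 | 55 | 45 | 60 | 45 | 40 | 25 | 25 | 10 | 10 | 43 |
| Im2Flow2Act | ✓ | | 70 | 70 | 65 | 65 | 60 | 45 | 50 | 40 | 45 | 20 | 20 | 15 | 20 | 45 |
| Im2Flow2Act | ✓ | ✓ | 85 | 80 | 85 | 75 | 70 | 60 | 60 | 55 | 55 | 25 | 25 | 20 | 15 | 55 |
| Goal-conditioned | | | | | | | | | | | | | | | | |
| DP-Goal | | | 40 | 30 | 45 | 50 | 25 | 30 | 5 | 15 | 30 | 5 | 5 | 0 | 5 | 22 |
| Track2Act | ✓ | | 85 | 80 | 75 | 60 | 65 | 55 | 55 | 55 | 35 | 15 | 15 | 20 | 10 | 48 |
| Ours | ✓ | ✓ | 90 | 90 | 80 | 75 | 75 | 70 | 65 | 65 | 50 | 30 | 30 | 20 | 15 | 58 |
| Oracle | ✓ | — | 95 | 90 | 90 | 85 | 80 | 80 | 75 | 70 | 55 | 40 | 35 | 25 | 20 | 65 |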

Quantitative results of real-world experiments (260 trials per method, 20 per task). Bold = best, underline = second best. Our method achieved 58% average success rate, outperforming Track2Act (48%) by 10 points.
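The averages in the caption can be reproduced from the per-task rates. The short Python check below converts the rates in the Ours and Track2Act rows back into success counts (20 trials per task) and recomputes the averages; the variable and function names are illustrative.

```python
TRIALS_PER_TASK = 20

# Per-task success rates [%] read from the Ours and Track2Act rows of the table.
ours = [90, 90, 80, 75, 75, 70, 65, 65, 50, 30, 30, 20, 15]
track2act = [85, 80, 75, 60, 65, 55, 55, 55, 35, 15, 15, 20, 10]


def summarize(rates):
    """Return (successes, trials, average success rate in %) over all tasks."""
    successes = sum(r * TRIALS_PER_TASK // 100 for r in rates)
    trials = TRIALS_PER_TASK * len(rates)
    return successes, trials, 100.0 * successes / trials


ours_successes, ours_trials, ours_avg = summarize(ours)
t2a_successes, t2a_trials, t2a_avg = summarize(track2act)
gap = ours_avg - t2a_avg   # difference in percentage points
```

This gives 151 of 260 successful trials for our method and 125 of 260 for Track2Act, which round to the reported 58% and 48%.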


BibTeX

To be announced.