MP1 takes the historical observation point cloud and the robot's state as inputs. These are processed by a visual encoder and a state encoder, respectively, and the resulting embeddings serve as conditional inputs to the UNet-integrated MeanFlow. The model then computes a regression loss (𝓁cfg) between the predicted mean velocity, generated from the initial noise, and the target velocity given by the MeanFlow Identity. This 𝓁cfg is combined with a Dispersive Loss (𝓁disp) imposed on the UNet’s hidden states to jointly optimize the network parameters.
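As a concrete illustration of this training pipeline, the sketch below shows one possible way to assemble the two losses: encoded point-cloud and state features condition a velocity network, the MeanFlow-Identity target for the mean velocity is obtained with a forward-mode JVP, and a batch-wise Dispersive Loss repels the network's hidden states. All module definitions, dimensions, the InfoNCE-style form of the repulsion term, and the weight lambda_disp are illustrative assumptions rather than MP1's actual implementation, which is built around a conditional UNet.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# All names, dimensions, and hyper-parameters below are illustrative assumptions;
# they stand in for MP1's actual encoders and UNet-integrated MeanFlow network.

class PointCloudEncoder(nn.Module):
    """Toy per-point MLP + max-pool in place of the paper's visual encoder."""
    def __init__(self, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, pc):                      # pc: (B, N, 3)
        return self.mlp(pc).max(dim=1).values   # global max-pool -> (B, out_dim)

class StateEncoder(nn.Module):
    """Small MLP for the robot state."""
    def __init__(self, state_dim=9, out_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

    def forward(self, s):                       # s: (B, state_dim)
        return self.mlp(s)

class MeanVelocityNet(nn.Module):
    """Predicts the mean velocity u(z_t, r, t | cond); an MLP stand-in for the UNet."""
    def __init__(self, act_dim=4, horizon=8, cond_dim=96, hidden=256):
        super().__init__()
        in_dim = act_dim * horizon + 2 + cond_dim          # noisy actions + (r, t) + condition
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.SiLU(),
                                  nn.Linear(hidden, hidden), nn.SiLU())
        self.head = nn.Linear(hidden, act_dim * horizon)

    def forward(self, z, r, t, cond):
        h = self.body(torch.cat([z.flatten(1), r, t, cond], dim=-1))
        return self.head(h).view_as(z), h                  # h feeds the dispersive loss

def dispersive_loss(h, tau=0.5):
    """Batch-wise repulsion of hidden states (an InfoNCE-style variant without positives)."""
    d = (h.unsqueeze(0) - h.unsqueeze(1)).pow(2).sum(-1)   # (B, B) pairwise squared distances
    return torch.logsumexp(-d.flatten() / tau, dim=0) - math.log(d.numel())

def mp1_training_step(pc, state, actions, enc_pc, enc_state, vel_net, lambda_disp=0.25):
    """One training step: MeanFlow regression loss plus Dispersive Loss on hidden states."""
    cond = torch.cat([enc_pc(pc), enc_state(state)], dim=-1)

    B = actions.shape[0]
    noise = torch.randn_like(actions)
    t = torch.rand(B, 1)
    r = torch.rand(B, 1) * t                               # sample r <= t
    z_t = (1 - t.view(B, 1, 1)) * actions + t.view(B, 1, 1) * noise   # linear path
    v_t = noise - actions                                  # instantaneous velocity of that path

    # MeanFlow-Identity target: u_tgt = v - (t - r) * du/dt, where du/dt is the total
    # derivative along the path, computed with a forward-mode JVP and then detached.
    u_of = lambda z, r_, t_: vel_net(z, r_, t_, cond)[0]
    _, dudt = torch.func.jvp(u_of, (z_t, r, t),
                             (v_t, torch.zeros_like(r), torch.ones_like(t)))
    u_tgt = (v_t - (t - r).view(B, 1, 1) * dudt).detach()

    u_pred, hidden = vel_net(z_t, r, t, cond)
    return F.mse_loss(u_pred, u_tgt) + lambda_disp * dispersive_loss(hidden)

# Toy usage: batch of 16, 512 points per cloud, 8-step action horizon, 4-D actions.
enc_pc, enc_state, vel_net = PointCloudEncoder(), StateEncoder(), MeanVelocityNet()
loss = mp1_training_step(torch.randn(16, 512, 3), torch.randn(16, 9),
                         torch.randn(16, 8, 4), enc_pc, enc_state, vel_net)
loss.backward()
```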
Hammer – Grasp a hammer and strike a pig.
Drawer Close – Close a drawer.
Heat Water – Position a kettle in a suitable location.
Stack Block – Stack a block.
Spoon – Put the spoon in the bowl.
Simulation results for Hammer, Drawer Close, and Pick Place tasks.
Simulation results for Assembly, Coffee Pull, and Stick Push tasks.
Robot learning is becoming a prevailing approach to robot manipulation. However, the generative models used in this field face a fundamental trade-off between the slow, iterative sampling of diffusion models and the architectural constraints of faster flow-based methods, which often rely on explicit consistency losses. To address these limitations, we introduce MP1, which pairs 3D point-cloud inputs with the MeanFlow paradigm to generate action trajectories in one network function evaluation (1-NFE). By directly learning the interval-averaged velocity via the MeanFlow Identity, our policy avoids any additional consistency constraints. This formulation eliminates numerical ODE-solver errors during inference, yielding more precise trajectories. MP1 further incorporates classifier-free guidance (CFG) for improved trajectory controllability while retaining 1-NFE inference and introducing no structural constraints. Because subtle scene-context variations are critical for robot learning, especially in few-shot settings, we introduce a lightweight Dispersive Loss that repels state embeddings during training, boosting generalization without slowing inference. We validate our method on the Adroit and Meta-World benchmarks, as well as in real-world scenarios. Experimental results show that MP1 achieves superior average task success rates, outperforming DP3 by 10.2% and FlowPolicy by 7.3%. Its average inference time is only 6.8 ms, 19× faster than DP3 and nearly 2× faster than FlowPolicy. Our code is available at https://anonymous.4open.science/r/xxxx.
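For context, the equations below restate the MeanFlow quantities the abstract refers to: the interval-averaged velocity, the MeanFlow Identity used as the regression target, and the single-evaluation sampling step that makes 1-NFE inference possible. The notation and the convention that data lies at t = 0 and noise at t = 1 follow the original MeanFlow formulation; MP1's exact parameterization may differ.

```latex
\begin{align*}
&\text{Average velocity over } [r,t] \text{ along the flow path } z_t: \\
&\qquad u(z_t, r, t) = \frac{1}{t-r}\int_{r}^{t} v(z_\tau, \tau)\,\mathrm{d}\tau, \\[4pt]
&\text{MeanFlow Identity (differentiate } (t-r)\,u = \textstyle\int_r^t v\,\mathrm{d}\tau \text{ with respect to } t): \\
&\qquad u(z_t, r, t) = v(z_t, t) - (t-r)\,\frac{\mathrm{d}}{\mathrm{d}t}u(z_t, r, t),
\qquad \frac{\mathrm{d}}{\mathrm{d}t}u = v\,\partial_{z}u + \partial_{t}u, \\[4pt]
&\text{1-NFE sampling (noise at } t=1\text{):}\qquad
\hat{x} = z_1 - u_\theta(z_1, 0, 1), \qquad z_1 \sim \mathcal{N}(0, I).
\end{align*}
```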
MP1 is compared with state-of-the-art (SOTA) diffusion-based and flow-based methods in terms of inference time and success rate on the Adroit and Meta-World tasks. With inference time on the x-axis and success rate on the y-axis, MP1 achieves SOTA performance in both inference speed and success rate.
Success rate curves of different methods on multiple Meta-World tasks. We compare the performance of MP1, FlowPolicy, and DP3 on four tasks. The x-axis represents training steps, and the y-axis shows the success rate. Shaded areas represent the standard deviation across different random seeds. The proposed method achieves higher success rates with smaller variance.
We test the effect of different numbers of demonstrations (0, 2, 5, 10, 20) on task performance. As the number of demonstrations increases, the success rate across the various tasks improves significantly. Notably, tasks such as Assembly and Lever Pull reach near-optimal performance with as few as 5 to 10 demonstrations. Our method, MP1, consistently outperforms FlowPolicy, especially when demonstrations are scarce. These results indicate that increasing the number of demonstrations effectively enhances model performance for most tasks, and that the proposed method is particularly well suited to few-shot learning scenarios.