LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

Rongxu Cui^1,2,3*, Zongzheng Zhang^1,2*, Jingrui Pang², Haohan Chi¹, Jinbang Guo², Saining Zhang¹,
Shaoxuan Xie², Xin Jin⁴, Yao Mu⁵, Jiaolong Yang⁶, Guocai Yao², Xianyuan Zhan¹, Ya-Qin Zhang¹, and Hao Zhao^1,2†

¹Institute for AI Industry Research (AIR), Tsinghua University
²Beijing Academy of Artificial Intelligence (BAAI)
³Beihang University
⁴Eastern Institute of Technology, Ningbo, China
⁵Shanghai Jiao Tong University
⁶Microsoft Research Asia (MSRA)

*Equal contribution. †Corresponding author.

Accepted to ECCV 2026

Real-world deployment of vision-language-action (VLA) models is bottlenecked by physical safety and semantic reasoning, which together form the critical (a) VLA Safety Challenges. To evaluate these challenges systematically, we introduce a comprehensive VLA safety benchmark and an efficient (b) Data Generation Pipeline that synthesizes 19.7K strictly collision-free demonstrations. By evaluating VLA models fine-tuned on this corpus alongside zero-shot embodied foundation models, our (c) Cross-Paradigm Assessment uncovers fundamental bottlenecks in current embodied manipulation.

Benchmark

Overview

Overview of our VLA Safety Benchmark. (a) Comprehensive Environments: Powered by our UBDDL, we construct massive, stochastic simulation environments featuring multi-dimensional visual/physical randomizations and human-object interactions. (b) Hierarchical Safety Taxonomy: A systematic evaluation suite assessing five critical dimensions of physical and semantic safety, strictly scaled across 3 difficulty tiers (L0-L2). (c) Keypose-Driven Trajectory Generation: Experts provide sparse keyposes that seed a CuRobo-based planner to synthesize diverse collision-free demonstrations, enabling scalable data collection.
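The keypose-driven idea in (c) can be illustrated with a toy sketch. This is not the CuRobo-based planner itself: it assumes straight-line interpolation in Cartesian space, spherical obstacles, and uniform jitter on the interior keyposes, and every name in it (`generate_demo`, `in_collision`, `jitter`) is hypothetical. It only shows how a few expert keyposes can seed many distinct trajectories that are each checked for collisions before being kept.

```python
import random

def interpolate(p, q, steps):
    """Linearly interpolate from waypoint p to q, excluding p and including q."""
    return [tuple(a + (b - a) * t / steps for a, b in zip(p, q))
            for t in range(1, steps + 1)]

def in_collision(point, obstacles):
    """True if the point lies inside any spherical obstacle (center, radius)."""
    return any(sum((a - c) ** 2 for a, c in zip(point, center)) < radius ** 2
               for center, radius in obstacles)

def generate_demo(keyposes, obstacles, rng, jitter=0.02, steps=10, max_tries=100):
    """Sample one collision-free trajectory through jittered keyposes, or None."""
    for _ in range(max_tries):
        # Perturb interior keyposes only; the start and goal stay fixed.
        interior = [tuple(x + rng.uniform(-jitter, jitter) for x in kp)
                    for kp in keyposes[1:-1]]
        noisy = [keyposes[0]] + interior + [keyposes[-1]]
        traj = [noisy[0]]
        for p, q in zip(noisy, noisy[1:]):
            traj.extend(interpolate(p, q, steps))
        # Reject the whole sample if any waypoint touches an obstacle.
        if not any(in_collision(pt, obstacles) for pt in traj):
            return traj
    return None

if __name__ == "__main__":
    keyposes = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 0.0, 0.0)]
    obstacles = [((0.5, -0.5, 0.0), 0.2)]
    demo = generate_demo(keyposes, obstacles, random.Random(0))
    print(len(demo), "waypoints")
```

Calling `generate_demo` repeatedly with different random seeds yields diverse demonstrations from the same sparse keyposes. A real planner would additionally handle joint limits, full robot geometry, and trajectory smoothness.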

Dataset Comparison

Comparison with existing VLA benchmarks. Our benchmark jointly covers perceptual perturbations, parametric task definitions (L0-L2), scene dynamics (static/dynamic), physical and semantic safety, and proximal human-robot interaction. In addition, our keypose-driven pipeline generates collision-free demonstrations at scale.

Dataset Examples

fshoa_1_2

hri_1_4

tsa_1_3

fshoa_1_5

hri_1_1

Evaluation

Physical Safety

Embodied Physical Safety Track

Mean Success Rate (SR, %) with standard deviations in parentheses.
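As a rough illustration of how a table cell of this form could be produced, the sketch below computes a per-run success rate and then the mean and standard deviation across runs. It assumes the spread is taken over repeated runs (for example, random seeds) and uses the sample standard deviation; the source does not state either choice, and the function names are hypothetical.

```python
from statistics import mean, stdev

def success_rate(outcomes):
    """Percentage of successful episodes in a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

def summarize(per_run_outcomes):
    """Return (mean SR, sample std of SR) across runs, both in percent."""
    rates = [success_rate(run) for run in per_run_outcomes]
    # stdev needs at least two runs; report zero spread for a single run.
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread

def format_cell(per_run_outcomes):
    """Format a table cell as 'mean (std)' with one decimal place."""
    m, s = summarize(per_run_outcomes)
    return f"{m:.1f} ({s:.1f})"

if __name__ == "__main__":
    runs = [
        [True, True, False, True],    # 75% success
        [True, False, False, True],   # 50% success
        [True, True, True, True],     # 100% success
    ]
    print(format_cell(runs))
```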

Physical safety trends across tasks and methods. Sparklines summarize L0-L2 behavior with a shared 0-1 y-axis: success rate is shown in blue and safety violation rate in red.

Semantic Safety

Semantic Safety Reasoning Track

Metrics are reported as Refusal Rate (RR, %).
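A minimal sketch of the metric, assuming each evaluated episode is logged as a `(tier, refused)` pair. That record format is hypothetical, and how a refusal is detected is outside the scope of the sketch. It groups episodes by difficulty tier and reports the percentage that ended in a refusal.

```python
from collections import defaultdict

def refusal_rate_by_tier(records):
    """Map each tier to the percentage of its episodes that ended in a refusal."""
    counts = defaultdict(lambda: [0, 0])  # tier -> [refusals, total episodes]
    for tier, refused in records:
        counts[tier][0] += int(refused)
        counts[tier][1] += 1
    return {tier: 100.0 * r / n for tier, (r, n) in sorted(counts.items())}

if __name__ == "__main__":
    records = [
        ("L0", True), ("L0", True),
        ("L1", True), ("L1", False),
        ("L2", False), ("L2", False), ("L2", True), ("L2", False),
    ]
    print(refusal_rate_by_tier(records))
```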

Data Scaling

Impact of Data Scaling on Task Efficacy and Safety

More demonstrations improve SR while reducing CR; CuRobo serves as a privileged upper bound.

Robustness

Environmental Stochasticity Evaluation

Robustness across perturbation axes including noise, initial state, view, scene, layout, and unseen objects.

Robustness Evaluation Across Axes of Environmental Stochasticity

Sub-optimal trajectory synthesis drives task failures without constraint violations. Even when states stay collision-free, policies often leave the nominal task manifold through excessive halting or oscillatory motion, because long-horizon temporal consistency is weak. The result is collision-free incompletion: the arm settles into a kinematically poor configuration (Kinematic Deadlock) or the episode exceeds the execution horizon (Temporal Overflow). In both cases the policy trades task success for conservative or erratic motion rather than committing a true safety violation.

Semantic misalignment induces stable but task-irrelevant interactions. Beyond kinematics, VLAs can execute mechanically stable, collision-free grasps on semantically wrong objects when distractors are visually similar. Physical safety alone is insufficient without fine-grained grounding; robust relational reasoning is needed to bind language to the correct target so safe motion serves the intended command (Semantic Misalignment).
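The failure modes described above can be sketched as a simple episode classifier. The episode fields, the stall threshold, and the order in which the checks run are all assumptions made for illustration; the source does not specify how these categories are detected.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Outcome(Enum):
    SUCCESS = "success"
    SAFETY_VIOLATION = "safety_violation"
    SEMANTIC_MISALIGNMENT = "semantic_misalignment"
    KINEMATIC_DEADLOCK = "kinematic_deadlock"
    TEMPORAL_OVERFLOW = "temporal_overflow"
    OTHER = "other_failure"

@dataclass
class Episode:
    task_done: bool
    collided: bool
    grasped_object: Optional[str]  # name of the grasped object, if any
    target_object: str             # object named in the instruction
    steps: int                     # steps actually executed
    horizon: int                   # maximum steps allowed
    stalled_steps: int             # longest run of near-zero motion

def classify(ep, stall_limit=50):
    """Assign one outcome label to an episode (hypothetical rules)."""
    if ep.collided:
        return Outcome.SAFETY_VIOLATION
    if ep.task_done:
        return Outcome.SUCCESS
    # Collision-free but incomplete: decide which failure mode applies.
    if ep.grasped_object is not None and ep.grasped_object != ep.target_object:
        return Outcome.SEMANTIC_MISALIGNMENT
    if ep.stalled_steps >= stall_limit:
        return Outcome.KINEMATIC_DEADLOCK
    if ep.steps >= ep.horizon:
        return Outcome.TEMPORAL_OVERFLOW
    return Outcome.OTHER

if __name__ == "__main__":
    ep = Episode(task_done=False, collided=False, grasped_object="cup",
                 target_object="mug", steps=120, horizon=300, stalled_steps=0)
    print(classify(ep).value)
```

Checking collisions first means a colliding episode is never counted as a success, which keeps task success and safety violations separate in the tallies.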

Citation

If you find LIBERO-Safety useful in your research, please cite our paper:

@article{cui2026liberosafety,
  title={LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models},
  author={Rongxu Cui and Zongzheng Zhang and Jingrui Pang and Haohan Chi and Jinbang Guo and Saining Zhang and Shaoxuan Xie and Xin Jin and Yao Mu and Jiaolong Yang and Guocai Yao and Xianyuan Zhan and Ya-Qin Zhang and Hao Zhao},
  journal={arXiv preprint arXiv:2606.23686},
  year={2026}
}
