LIBERO-Safety

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

LIBERO-Safety overview figure

Real-world deployment of Vision-Language-Action (VLA) models is bottlenecked by physical safety and semantic reasoning, which together constitute the (a) VLA Safety Challenges. To evaluate these challenges systematically, we introduce a comprehensive VLA safety benchmark and develop an efficient (b) Data Generation Pipeline that synthesizes 19.7K strictly collision-free demonstrations. By evaluating VLA models fine-tuned on this corpus alongside zero-shot embodied foundation models, our (c) Cross-Paradigm Assessment uncovers fundamental bottlenecks in current embodied manipulation.

Benchmark

Overview

Overview of our VLA Safety Benchmark. (a) Comprehensive Environments: Powered by our UBDDL, we construct large-scale stochastic simulation environments featuring multi-dimensional visual and physical randomizations and human-object interactions. (b) Hierarchical Safety Taxonomy: A systematic evaluation suite assessing five critical dimensions of physical and semantic safety, graded across three difficulty tiers (L0-L2). (c) Keypose-Driven Trajectory Generation: Experts provide sparse keyposes that seed a CuRobo-based planner to synthesize diverse collision-free demonstrations, enabling scalable data collection.

Dataset Comparison

Comparison with existing VLA benchmarks. Our benchmark jointly covers perceptual perturbations, parametric task definitions (L0-L2), scene dynamics (static/dynamic), physical and semantic safety, and proximal human-robot interaction. Furthermore, our keypose-driven pipeline generates collision-free demonstrations at scale.

Dataset Examples

episode_000001

episode_000002

episode_000003

episode_000004

episode_000005

Evaluation

Evaluation results on the Embodied Physical Safety Track. Metrics are reported as mean Success Rate (SR, %), with standard deviations in parentheses.
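As a minimal sketch of how a "mean (standard deviation)" SR entry could be computed, the snippet below aggregates per-run success rates. It assumes that each run is a list of boolean episode outcomes and that the spread is the sample standard deviation across runs; the source does not state either detail, and the function names are hypothetical.

```python
from statistics import mean, stdev


def success_rate(outcomes):
    """Percentage of successful episodes in one evaluation run."""
    return 100.0 * sum(outcomes) / len(outcomes)


def format_sr(runs):
    """Format SR as 'mean (std)' over several runs (assumed: sample std)."""
    rates = [success_rate(run) for run in runs]
    # A single run has no spread to report, so fall back to 0.0.
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return f"{mean(rates):.1f} ({spread:.1f})"


# Two illustrative runs of four episodes each (made-up outcomes).
runs = [[True, True, False, False], [True, True, True, False]]
print(format_sr(runs))
```

Reporting the spread across runs rather than across episodes keeps the number comparable between tasks with different episode counts, though the benchmark may aggregate differently.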

Evaluation results on the Semantic Safety Reasoning Track. Metrics are reported as Refusal Rate (RR, %).
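For illustration, a Refusal Rate can be computed as the share of prompts on which the model declines to act. The sketch below assumes each response has already been labeled "refuse" or "execute"; these labels and the function name are hypothetical, and how the benchmark actually detects a refusal is not specified here.

```python
def refusal_rate(decisions):
    """Percentage of decisions labeled 'refuse' (labels are assumed)."""
    refusals = sum(1 for d in decisions if d == "refuse")
    return 100.0 * refusals / len(decisions)


# Illustrative labels only, not benchmark data.
decisions = ["refuse", "execute", "refuse", "refuse"]
print(f"RR = {refusal_rate(decisions):.1f}%")
```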

Impact of Data Scaling on Task Efficacy and Safety. Scaling demonstrations per task from 50 (π0.5*) to 500 (π0.5) simultaneously improves SR and reduces CR. CuRobo serves as a privileged upper bound.

Robustness Evaluation Across Axes of Environmental Stochasticity.

Task Suite Demos

L0

Task 1

Task 2

Task 3

Task 4

Task 5

L1

Task 1

Task 2

Task 3

Task 4

Task 5

L2

Task 1

Task 2

Task 3

Task 4

Task 5

Failure Cases

Kinematic Deadlocks

Semantic Misalignment

Temporal Overflows

Sub-optimal trajectory synthesis drives task failures without constraint violations. Even when states stay collision-free, policies often leave the nominal task manifold through excessive halting or oscillatory motion, because long-horizon temporal consistency is weak. This yields collision-free incompletion: the policy either settles into kinematically poor configurations (Kinematic Deadlocks) or exceeds the execution horizon (Temporal Overflows), trading task success for conservative or erratic motion rather than committing true safety violations.

Semantic misalignment induces stable but task-irrelevant interactions. Beyond kinematics, VLAs can execute mechanically stable, collision-free grasps on semantically wrong objects when distractors are visually similar. Physical safety alone is insufficient without fine-grained grounding; robust relational reasoning is needed to bind language to the correct target so safe motion serves the intended command (Semantic Misalignment).
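The three failure modes above can be told apart from a handful of per-episode signals. The following is a rough sketch, not the benchmark's actual logging or labeling code: the `Episode` fields, the `stalled` flag, and the order of the checks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    COLLISION = "collision"
    SEMANTIC_MISALIGNMENT = "semantic_misalignment"
    KINEMATIC_DEADLOCK = "kinematic_deadlock"
    TEMPORAL_OVERFLOW = "temporal_overflow"
    OTHER = "other"


@dataclass
class Episode:
    # All fields are hypothetical log entries.
    success: bool
    collided: bool
    grasped_object: Optional[str]
    target_object: str
    steps: int
    horizon: int
    stalled: bool  # assumed flag: arm stopped making progress


def classify(ep: Episode) -> Outcome:
    if ep.success:
        return Outcome.SUCCESS
    if ep.collided:
        return Outcome.COLLISION
    # A stable grasp on the wrong object is a grounding failure.
    if ep.grasped_object is not None and ep.grasped_object != ep.target_object:
        return Outcome.SEMANTIC_MISALIGNMENT
    # Collision-free incompletion: stuck, or out of time.
    if ep.stalled:
        return Outcome.KINEMATIC_DEADLOCK
    if ep.steps >= ep.horizon:
        return Outcome.TEMPORAL_OVERFLOW
    return Outcome.OTHER


wrong_grasp = Episode(False, False, "red_mug", "blue_mug", 120, 300, False)
print(classify(wrong_grasp).value)
```

Checking collisions first keeps the last three categories strictly collision-free, matching the distinction drawn in the text between incompletion and true safety violations.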