SceneSmith:
Agentic Generation of Simulation-Ready Indoor Scenes

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

SceneSmith is a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. VLM agents collaborate across successive stages to construct richly furnished scenes—from single rooms to entire houses. The framework tightly integrates text-to-3D asset generation, articulated object retrieval, and physical property estimation to produce scenes directly usable in physics simulators for robotics research.

Scene Construction

SceneSmith constructs scenes through five successive stages, each implemented as an interaction among three VLM agents: a designer that proposes object placements, a critic that evaluates realism and coherence, and an orchestrator that mediates between the two.

1. Layout: Architectural floorplan
2. Furniture: Furniture placement
3. Wall-mounted: Shelves, mirrors, art
4. Ceiling-mounted: Lights and fans
5. Manipulands: Small interactive objects
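The stage structure above can be sketched as a simple designer/critic/orchestrator loop. This is a minimal illustrative sketch, not the paper's implementation: the agent functions are stubs standing in for VLM calls, and the `Placement`, `SceneStage`, and threshold names are assumptions introduced here for illustration.

```python
# Hypothetical sketch of one SceneSmith stage as a designer/critic/orchestrator
# loop. All names and the acceptance protocol are illustrative assumptions;
# the real system queries VLMs at each step.
from dataclasses import dataclass, field


@dataclass
class Placement:
    name: str
    position: tuple  # (x, y) in room coordinates


@dataclass
class SceneStage:
    placements: list = field(default_factory=list)


def designer(stage, prompt):
    """Propose a new object placement (stub: would call a VLM)."""
    return Placement(name=f"object_{len(stage.placements)}", position=(0.0, 0.0))


def critic(stage, proposal):
    """Score the proposal for realism and coherence (stub: would call a VLM)."""
    return 1.0  # this stub accepts everything


def orchestrator(prompt, max_rounds=5, threshold=0.5):
    """Mediate designer/critic rounds until the stage is complete."""
    stage = SceneStage()
    for _ in range(max_rounds):
        proposal = designer(stage, prompt)
        if critic(stage, proposal) >= threshold:
            stage.placements.append(proposal)
    return stage


stage = orchestrator("a cozy reading nook")
print(len(stage.placements))  # 5 accepted placements with these stubs
```

In the real system, each of the five stages would run this kind of loop with its own stage-specific prompts, and the critic's feedback would be fed back to the designer rather than simply gating acceptance.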

Robot Teleoperation in Generated Scenes

SceneSmith scenes are simulation-ready and can be directly used for robotics tasks. Here we demonstrate this by teleoperating a Rainbow RBY1 robot in generated scenes using the Drake simulator.

Open-Vocabulary Text-to-Scene Generation

SceneSmith supports open-vocabulary text input for generating arbitrary indoor scenes, from single rooms to multi-room house-scale environments. The object vocabulary is also unbounded, as SceneSmith leverages text-to-3D generation to create any object on demand.

Simulation-Ready Scenes

All furniture and manipulands in SceneSmith-generated scenes are fully movable in simulation. The video above demonstrates this by applying earthquake-style shaking to entire scenes, showing that every placed object responds to physical forces.

Room-Scale Scenes

Select a scene to explore. Click objects to view them in isolation.

Scenes shown here are slightly lower quality than the originals due to compression, mesh decimation, and texture downsampling for faster web loading.

Select a scene to see its prompt
  • Rotate: Left-click + drag
  • Pan: Right-click + drag
  • Zoom: Scroll wheel
  • Select: Click on object
  • F: Frame selected object
  • H: Hide selected object
  • R: Reset camera
  • C: Clear selection
  • S: Show all hidden


House-Scale Scenes

Multi-room environments with complex layouts. Select a scene to explore.

House scenes are larger and may take longer to load.


Application: Automatic Robot Policy Evaluation

Robot manipulation evaluation pipeline diagram

Robot manipulation evaluation pipeline. Given a manipulation task (e.g., “Pick a fruit from the fruit bowl and place it on a plate”), an LLM generates diverse scene prompts specifying scene constraints implied by the task. SceneSmith generates scenes from each prompt. A robot policy attempts the task in simulation, and an evaluation agent verifies success using simulator state queries and visual observations. This enables scalable policy evaluation without manual environment or success predicate design.

The videos above show rollouts of a model-based pick-and-place policy executing across SceneSmith-generated scenes. For each task, an LLM produces diverse scene prompts and SceneSmith generates 25 unique environments, yielding 100 evaluation scenes spanning four manipulation tasks. A vision-language model parses each task into structured goal components, and the policy plans collision-free motions via RRT-Connect. These rollouts serve as inputs to our automatic evaluation system: an agentic evaluator that verifies task success through simulator state queries and visual observations—without hand-crafted success predicates. The evaluator achieves 99.7% agreement with human labels, confirming that SceneSmith-generated scenes enable scalable, fully automatic policy evaluation.
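The evaluation pipeline described above can be sketched as a short loop. This is an illustrative stub, assuming hypothetical interfaces: `propose_prompts`, `generate_scene`, `run_policy`, and `judge` stand in for the LLM, SceneSmith, the policy rollout, and the agentic evaluator respectively, and are not the actual APIs.

```python
# Illustrative sketch of the automatic policy-evaluation pipeline.
# Every function here is a stub standing in for an LLM/VLM or simulator
# interface; the names and return types are assumptions for illustration.
def propose_prompts(task, n=25):
    """LLM stand-in: produce n diverse scene prompts implied by the task."""
    return [f"{task} -- scene variant {i}" for i in range(n)]


def generate_scene(prompt):
    """SceneSmith stand-in: return a simulation-ready scene for the prompt."""
    return {"prompt": prompt}


def run_policy(scene, task):
    """Policy stand-in: roll out the policy, return (sim_state, frames)."""
    return {"grasped": True}, ["frame_000.png"]


def judge(task, sim_state, frames):
    """Evaluator stand-in: verify success from simulator state queries
    and visual observations, without hand-crafted success predicates."""
    return bool(sim_state.get("grasped"))


def evaluate(task, n_scenes=25):
    """Run the full pipeline and return the policy's success rate."""
    results = []
    for prompt in propose_prompts(task, n_scenes):
        scene = generate_scene(prompt)
        state, frames = run_policy(scene, task)
        results.append(judge(task, state, frames))
    return sum(results) / len(results)


rate = evaluate("Pick a fruit from the fruit bowl and place it on a plate")
print(rate)  # 1.0 with these always-succeeding stubs
```

The key design point is that the success check lives in the evaluator agent rather than in per-task predicates, so adding a new task only requires a new task string, not new evaluation code.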

System Overview

SceneSmith teaser showing a generated community center scene

Fully automated text-to-scene generation. This entire community center was generated by SceneSmith without any human intervention, from a single 151-word text prompt. Beyond explicitly specified elements, SceneSmith places additional objects from inferred contextual information, such as ping pong paddles and balls placed near a ping pong table. Objects are generated on-demand, are fully separable (non-composite), and include estimated physical properties, enabling direct interaction within a simulation. The resulting scenes are immediately usable in arbitrary physics simulators (robots added for demonstration).

Authors

Nicholas Pfaff1, Thomas Cohn1, Sergey Zakharov2, Rick Cory2, Russ Tedrake1
1Massachusetts Institute of Technology, 2Toyota Research Institute

Citation

@article{scenesmith2026,
  title={SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes},
  author={Pfaff, Nicholas and Cohn, Thomas and Zakharov, Sergey and Cory, Rick and Tedrake, Russ},
  journal={arXiv preprint},
  year={2026}
}