VisGym
Diverse, Customizable, Scalable Environments for Multimodal Agents

VisGym consists of 17 diverse, long-horizon environments designed to systematically evaluate, diagnose, and train VLMs on visually interactive tasks. The agent must select each action conditioned on both its past actions and observation history.

VisGym Teaser

Abstract

Modern Vision–Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons.

We introduce VisGym, a comprehensive gymnasium of 17 diverse environments designed to systematically evaluate and train VLMs on visually interactive tasks. The suite spans multiple domains—including symbolic puzzles, real-image understanding, navigation, and manipulation—and provides flexible controls over difficulty, input representation, planning horizon, and feedback mechanisms. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning approaches.
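As a rough sketch of how an episode unfolds (the module and function names below, such as visgym.make and vlm_agent, are illustrative placeholders rather than the released API), each environment follows a gym-style loop in which the agent conditions every action on the accumulated observation and action history:

```python
# Illustrative sketch only: "visgym" and "vlm_agent" are hypothetical names,
# shown to convey the interaction pattern and the configurable controls.
env = visgym.make(
    "Maze2D",            # one of the 17 environments
    difficulty="easy",   # difficulty control
    obs_mode="image",    # input representation (image or text/ASCII)
    max_steps=20,        # planning horizon
    text_feedback=True,  # feedback mechanism
)

obs = env.reset()
history = []  # past (observation, action) pairs the agent conditions on

done = False
while not done:
    action = vlm_agent.act(obs, history)             # VLM proposes the next action
    next_obs, reward, done, info = env.step(action)  # gym-style transition
    history.append((obs, action))
    obs = next_obs
```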

Our comprehensive evaluations reveal that all frontier models struggle significantly in interactive settings, achieving low success rates in both easy (26.8%) and hard (12.6%) configurations. Through detailed experiments, we identify several critical limitations: models fail to effectively leverage long context, performing worse with unbounded history than with truncated windows. Additionally, we find that several text-based symbolic tasks become substantially harder once rendered visually, highlighting the unique challenges of visual reasoning.

However, we demonstrate that explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings yield consistent improvements through supervised finetuning, revealing concrete failure modes and actionable pathways for advancing multi-step visual decision-making. All code, data, and models will be publicly released.

Frontier VLMs Fail on VisGym

Even the best-performing frontier model, GPT-5, achieves a success rate of only 26.43% on VisGym (Easy) and 12.57% on VisGym (Hard), indicating that VisGym poses a significant challenge for existing models.

Average Performance

Control Study

Teaching Agents to Explore Before Exploitation

Some trajectories are far more informative than others—especially those that reveal hidden state or disambiguate perception. We test whether inducing such information-revealing behaviors during supervised finetuning improves VLM decision-making.

Across tasks with unknown dynamics and partial observability, structured demonstrations that explicitly probe action–perception correspondence consistently outperform solve-only trajectories, improving both success rates and final accuracy.

Moreover, finetuning on longer but unstructured demonstrations degrades performance, confirming that the gains come from the informative structure of the demonstrations, not from their length or quantity.

Matchstick Rotation (Unknown Dynamics)

Comparison Dimension | Baseline Demonstrations (Stochastic) | Information-Revealing (Structured)
Strategy | Three stochastic moves toward the target | Two unit-scale steps to probe action–perception correspondence
Success Rate | 32.9% | 70.0% (≈2.1×)

Mental Rotation 3D (Partial Observability)

Comparison Dimension | Baseline Demonstrations (Solve-Only) | Information-Revealing (Explore-Then-Solve)
Strategy | Directly attempt to solve the task | Explicit exploratory actions to reveal hidden state
State Coverage | Partial, implicit | Explicitly disambiguates latent variables
Success Rate | 28.6% | 62.4% (≈2.2×)

Supervised finetuning is most effective when demonstrations teach agents how to reveal state, not just what action to take.
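As a minimal sketch of how such explore-then-solve demonstrations can be assembled (the helpers probe_actions, solver, and to_chat_example are hypothetical, shown only to convey the structure), the information-revealing probe steps are simply prepended to the scripted solver's trajectory before it is serialized into SFT examples:

```python
# Hypothetical sketch: prepend information-revealing probe actions to a
# scripted solver's trajectory before converting it into an SFT example.
def build_demonstration(env, task):
    traj = []
    obs = env.reset(task)

    # Exploration phase: small probe actions (e.g., two unit-scale steps)
    # that reveal action-perception correspondence or hidden state.
    for action in probe_actions(task):
        next_obs, _, _, _ = env.step(action)
        traj.append((obs, action))
        obs = next_obs

    # Exploitation phase: the solver finishes the task from the
    # now-disambiguated state.
    for action in solver(env, obs):
        next_obs, _, done, _ = env.step(action)
        traj.append((obs, action))
        obs = next_obs
        if done:
            break

    # Serialize the (observation, action) pairs into a multi-turn chat example.
    return to_chat_example(traj)
```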

SFT Ablations: Vision vs LLM Gains

Better Eyes Or Better Brain

We decouple the architecture to ask a simple question: does performance come from better eyes (the vision encoder) or a better brain (the LLM)?

Our analysis shows that, for most interactive tasks, temporal reasoning is the dominant factor.

While visual perception is necessary, the ability to integrate history and plan over time is what truly differentiates model performance.

"Temporal reasoning and history integration remain the primary bottlenecks for current VLMs."
Qwen Easy vs Hard Scatter and Bar

Stronger Base Model Generalizes Better

Supervised finetuning is known to generalize poorly to task variants. We revisit this question for modern VLMs by finetuning Qwen2.5-VL-7B and Qwen3-VL-8B on the same training data and optimization setup, then evaluating on harder task variants.

While both models perform similarly on the easy variants seen during training, Qwen3-VL generalizes substantially better to harder settings, nearly doubling the success rate on average. This shows that newer VLMs expand the generalization limits of supervised finetuning in multi-step visual decision-making.

Diagnosing Frontier Models with VisGym

Providing Final Goal at Beginning

Providing the final solution image upfront reframes these tasks from reasoning about goals to aligning observations with a known target, shifting difficulty toward visual perception and tool execution.

We evaluate this effect on five tasks—Patch Reassembly, Jigsaw, Colorization, Zoom-In Puzzle, and Matchstick Equation—where constructing the goal state is non-trivial. We augment instructions with the ground-truth final observation o_gt.
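Concretely, the augmentation amounts to attaching the goal image to the initial instruction. A hedged sketch in a generic multimodal chat-message layout (the exact prompts and field names used in our experiments may differ):

```python
# Sketch: build the initial user message, optionally including the
# ground-truth final observation o_gt ("With Final Obs." setting).
def build_initial_message(instruction, o_init_b64, o_gt_b64=None):
    content = [
        {"type": "text", "text": instruction},
        {"type": "text", "text": "Current observation:"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{o_init_b64}"}},
    ]
    if o_gt_b64 is not None:
        content += [
            {"type": "text", "text": "Goal observation:"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{o_gt_b64}"}},
        ]
    return {"role": "user", "content": content}
```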

Across tasks, performance improves substantially, indicating that imagining or constructing the target state is a key bottleneck. However, accuracy remains far from perfect, revealing additional limitations beyond reasoning—most notably fine-grained visual perception and action execution.

Unexpectedly, GPT-5 and Gemini 2.5 Pro underperform on Zoom-In Puzzle and Matchstick Equation when the goal image is provided, often terminating early despite visible mismatches. Follow-up tests attribute this failure to visual misjudgment rather than reasoning errors: when asked whether initial and goal images were identical, Gemini 2.5 Pro produced false positives 80% and 57% of the time on these tasks, compared to 18%, 2%, and 0% on Colorization, Jigsaw, and Patch Reassembly.

These results show that perception errors can negate—or even reverse—the expected benefits of explicit goal supervision.

Providing Final Goal at Beginning

Effect of providing the final goal observation. No Final Obs. and With Final Obs. indicate whether the goal image is available at episode start (mean ± s.e.).

Turns to Keep in Conversation History

While longer interaction histories provide useful environmental signals, they also introduce redundant and stale information that can hurt performance.

Across Maze2D, Sliding Block, MuJoCo Fetch Reach, and Matchstick Rotation, models perform best with a limited recent history, but degrade when given the full unbounded context.

This shows that visual context helps multi-step decision-making only up to a point. Importantly, the effect is task- and model-dependent, including cases of reverse scaling where longer history consistently reduces performance.
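A minimal sketch of the truncation used in this study, assuming the task instruction sits in the first message and each environment turn contributes one action message and one observation message (the exact message layout may differ):

```python
# Sketch: keep the task instruction plus only the most recent k turns.
# k = float("inf") corresponds to the full, unbounded history.
def truncate_history(messages, k):
    instruction, *turns = messages
    if k == float("inf"):
        return [instruction] + turns
    # One turn = one assistant action message + one user observation message.
    return [instruction] + turns[-2 * int(k):]
```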

Turns to Keep in Conversation History

Effect of truncating conversational context. Settings 1, 2, 4, and ∞ retain increasing amounts of recent history, from the current turn only to the full history (mean ± s.e.).

Removal of Text-based Feedback

Humans can infer action consequences directly from visual changes, but current VLMs cannot reliably do so.

Across four tasks—Maze 3D, Maze 2D, Sliding Block, and Matchstick Equation—removing textual feedback and relying only on visual state transitions leads to consistent performance drops.

This shows that VLMs struggle to judge action validity from visual changes alone and depend heavily on text-based feedback during visually interactive decision-making.
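In practice, the ablation only controls whether the environment's textual message accompanies the rendered frame in each per-step user message; a hedged sketch (the message layout is illustrative):

```python
# Sketch: compose the per-step observation message with or without the
# environment's textual feedback (e.g., "Invalid move: the path is blocked.").
def step_message(frame_b64, feedback_text, include_text_feedback=True):
    content = [{"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{frame_b64}"}}]
    if include_text_feedback and feedback_text:
        content.append({"type": "text", "text": feedback_text})
    return {"role": "user", "content": content}
```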

Removal of Text-based Feedback

Effect of text-based feedback. Results with and without environment feedback (mean ± s.e.).

Representing Observation in Text

We compare visual and ASCII-only versions of four symbolic tasks. Replacing the visual observation with ASCII text significantly improves GPT-5's performance, often by 3–4×, indicating that its primary limitation lies in visual grounding rather than long-horizon reasoning.

Gemini 2.5 Pro shows mixed effects across tasks, suggesting limitations in both perception and planning. Open-weight models perform poorly in both settings, indicating broader weaknesses in long-horizon decision-making.

Notably, Matchstick Equation reverses this trend: all models perform better with visual input than with ASCII, likely due to distorted ASCII glyphs that hinder model understanding.
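To make the two observation modes concrete, the same symbolic state can be serialized either as ASCII text or rendered as an image. The simplified sketch below uses a toy grid; the actual VisGym renderers are task-specific.

```python
import numpy as np
from PIL import Image

# Toy grid: '#' wall, '.' free cell, 'A' agent, 'G' goal.
GRID = [
    ["#", "#", "#", "#"],
    ["#", "A", ".", "#"],
    ["#", ".", "G", "#"],
    ["#", "#", "#", "#"],
]

def to_ascii(grid):
    # Text observation: one character per cell, rows joined by newlines.
    return "\n".join("".join(row) for row in grid)

def to_image(grid, cell=32):
    # Visual observation: paint each cell as a colored square.
    colors = {"#": (40, 40, 40), ".": (230, 230, 230),
              "A": (200, 60, 60), "G": (60, 160, 60)}
    h, w = len(grid), len(grid[0])
    canvas = np.zeros((h * cell, w * cell, 3), dtype=np.uint8)
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            canvas[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = colors[ch]
    return Image.fromarray(canvas)
```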

Visual Input
ASCII Input
Effect of ASCII visualization

Effect of ASCII visualization. Image and Text denote visual and ASCII observation modalities (mean ± s.e.).

Qualitative Failure Modes (How They Fail)

StringSight Action Looping
Model: GPT-5 | Step: 1 / 7
> ('swap', ((0, 0), (0, 1))) _
# Swap pieces at positions (0,0) and (0,1)
Action Looping

The "Infinite Looper"

Models often get stuck in a repetitive cycle, acting like a broken record. In this StringSight task, the model repeatedly issues the same swap command between the same two pieces without making any progress towards solving the puzzle, as evidenced by identical actions over multiple consecutive steps.

"Why do models fail with long context? They get stuck in Action Loops, repeating the same mistake despite new observations."

StringSight Early Termination
Model: Gemma3 27B | Step: 1 / 14
> ('move', [1, 0, 2, 0]) _
# Move pieces
Early Termination

The "Rage Quitter"

Models sometimes hallucinate that a task is unsolvable. Here, after a long sequence of failed moves, the model explicitly gives up and submits its current answer as final, stating 'I give up,' even though the task is still solvable.

"Models often hallucinate that a task is unsolvable and trigger Early Termination."

StringSight State Mismanagement
Model: Claude Sonnet 4 | Step: 1 / 20
> ('move', 2) # To move left _
State Mismanagement

The "Stubborn Explorer"

Without explicit state tracking, models act like they have amnesia. The model revisits previously blocked directions and fails to remember which moves have already been proven impossible, leading to repeated failed actions instead of seeking new routes.

"Without explicit state tracking, models suffer from State Mismanagement, ignoring critical environmental feedback."

StringSight Visual Misjudgment
Model: Gemini 2.5 Pro | Step: 1 / 13
> ('observe', 'scene') _
# Observe the current scene
Visual Misjudgment

"Optimistic Rotator"

Perception errors lead to false confidence. The model frequently misjudges the effect of its rotations, such as applying large yaw or roll corrections that leave the object further from the target, indicating it is not fully utilizing the visual feedback.

"Confirming that Visual Perception remains a distinct bottleneck from reasoning."