The Evolution of Just Walk Out technology

A Large Video Model with Agentic Reasoning for Physical World AI

When we first introduced Just Walk Out technology, the goal was simple yet ambitious: eliminate the friction of checkout lines. In 2024, we shared how a multimodal foundation model made this possible, by grounding perception and reasoning through the fusion of camera-based visual signals with complementary weight-sensor inputs. Today, we are moving beyond this paradigm toward systems that learn unified representations of the physical environment from continuous visual inputs, without relying on auxiliary sensing modalities. This shift enables agentic reasoning over space and time through learned visual skills, while supporting more robust perception and decision-making in complex real-world settings.

Our latest technology centers on a Large Video Model (LVM) that reasons agentically over space and time using learned visual skills. Unlike the prevalent paradigm of Multimodal Large Language Models (MLLMs), which begin with text and treat vision as a secondary modality to be aligned as “foreign” tokens, we flip the stack: vision comes first. We first build a visual foundation model that compresses continuous multi-view video streams into rich latent tokens, enabling the system to internalize the physical world. On top of this visual foundation, we then progressively layer language and agentic reasoning capabilities. This allows our model to learn the 3D space, how products change position, how shoppers interact with products, and how the state of the environment evolves as a direct result of these interactions, and ultimately to use these insights to execute specific tasks such as generating a precise shopping receipt. The approach mirrors how humans learn: we first make sense of the world by observing it, then build problem-solving skills on top of that understanding.

We view the retail floor as an ideal testbed for Physical World Modeling. It is a “lived” world where shoppers actively reshape the surroundings, creating a high-bandwidth setting that requires AI to see, remember, and reason across multiple dimensions: spatial reasoning (3D layouts and geometry), object-centric reasoning (fine-grained identification and counting of objects), temporal reasoning (human actions and interactions with objects across extended time horizons), and world state modeling (the evolving state of the stores).

In this post, we share how our latest model, using only cameras, achieves a new level of visual intelligence, matching and in some cases exceeding the prior system, which requires both cameras and shelf-mounted weight sensors. Our R&D investments in this direction establish a scalable approach that enables Just Walk Out technology to achieve near-human-level accuracy across diverse store formats. More importantly, we are not only building a specialized system for retail; we are developing a foundation for AI that can understand, reason, and act across physical environments.

The Challenge of Video Reasoning and Just Walk Out’s Large Video Model Solution

To deeply understand the causal dynamics of the physical world and activities within it, our AI requires an architecture capable of reasoning simultaneously across all cameras, across 3D space, and over extended time horizons. A straightforward approach to this problem would be training an end-to-end model that maps input videos directly to text receipts. However, a fundamental limitation of pure end-to-end learning is information loss.

Our input data, consisting of long, multi-camera video streams, exists in an extraordinarily high-dimensional space. Conversely, the final output receipt contains only a few sparse text tokens. Forcing a neural network to act as a direct compressor between these two vastly disparate spaces inevitably discards details that may be critical for reasoning. This occurs because there are too few bits of text supervision to guide the model towards discovering the correct latent space. As a result, the model may learn shortcuts that explain the receipt while failing to capture the underlying structure and dynamics of the physical world.
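To get a feel for the size of this gap, here is a back-of-the-envelope calculation in Python. Every number in it (camera count, resolution, frame rate, trip duration, receipt length, vocabulary size) is a hypothetical value chosen for illustration, not a figure from our system.

```python
import math

# All values below are illustrative assumptions, not measurements.
NUM_CAMERAS = 8
WIDTH, HEIGHT, CHANNELS = 1280, 720, 3
BITS_PER_CHANNEL = 8
FPS = 15
DURATION_S = 10 * 60      # a ten-minute shopping trip

RECEIPT_TOKENS = 20       # a short receipt
VOCAB_SIZE = 50_000       # hypothetical text vocabulary


def raw_video_bits() -> int:
    """Uncompressed bits across all cameras for the whole trip."""
    bits_per_frame = WIDTH * HEIGHT * CHANNELS * BITS_PER_CHANNEL
    return NUM_CAMERAS * FPS * DURATION_S * bits_per_frame


def receipt_bits_upper_bound() -> float:
    """Upper bound on supervision bits: log2(vocabulary size) per token."""
    return RECEIPT_TOKENS * math.log2(VOCAB_SIZE)


ratio = raw_video_bits() / receipt_bits_upper_bound()
print(f"input bits:   {raw_video_bits():.3e}")
print(f"receipt bits: {receipt_bits_upper_bound():.1f}")
print(f"ratio:        {ratio:.3e}")
```

Under these assumptions the input carries on the order of a trillion raw bits while the receipt carries a few hundred, a ratio in the billions. Raw pixel counts overstate the true information content, since video is highly redundant, but the gap remains very large even after aggressive compression.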

To overcome this limitation, our latest technology employs a Large Video Model (LVM) with agentic reasoning and visual skills. At a high level, it resembles an agentic large language model but operates in a fundamentally more challenging domain: reasoning over continuous, high-dimensional video streams rather than text. Instead of compressing raw videos directly into a receipt, the Large Video Model acts as an active reasoning agent that explores the visual latent space and follows explicit, verifiable reasoning trajectories. Equipped with a suite of visual skills, the model dynamically decides when and how to invoke additional capabilities to gather evidence, resolve ambiguity, and handle complex scenarios. The final receipt is generated as the outcome of this grounded reasoning process rather than a direct prediction from raw observations.

Architecture of Just Walk Out technology’s Large Video Model

The following diagram shows the architecture of our Large Video Model (LVM).

Figure 1: The architecture of Just Walk Out technology’s Large Video Model with agentic reasoning and skills.

Starting from the input layer, the model ingests multi-camera video streams, the store map, and product catalog images. These inputs are processed by a visual foundation model pre-trained on our large-scale retail data with a diverse range of auxiliary tasks, spanning temporal reasoning, spatial reasoning, and cross-view alignment.

Importantly, these tasks are not optimized for a single downstream objective such as receipt prediction. Instead, they are designed to cultivate general-purpose visual intelligence that can robustly operate in complex physical environments. We found this diversity of supervision to be critical: models trained narrowly on receipt prediction or other isolated tasks often fail to generalize to long-tail real-world scenarios.
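One common way to combine several auxiliary objectives during pre-training is a weighted average of per-task losses. The sketch below illustrates that idea only; the post does not describe the actual objective, and the task keys, weights, loss values, and the `multitask_loss` helper are all hypothetical.

```python
from typing import Mapping

# Hypothetical task keys and weights; only the task families come from the text.
DEFAULT_WEIGHTS: Mapping[str, float] = {
    "temporal_reasoning": 1.0,
    "spatial_reasoning": 1.0,
    "cross_view_alignment": 1.0,
}


def multitask_loss(task_losses: Mapping[str, float],
                   weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-task losses for one training step."""
    missing = set(weights) - set(task_losses)
    if missing:
        raise ValueError(f"missing losses for tasks: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[t] * task_losses[t] for t in weights) / total_weight


# Made-up loss values for a single step.
step_losses = {
    "temporal_reasoning": 0.9,
    "spatial_reasoning": 0.6,
    "cross_view_alignment": 0.3,
}
print(multitask_loss(step_losses))
```

Because no single task dominates the average, the shared visual representation is pushed to serve all of them, which is the intuition behind training on diverse supervision rather than on receipt prediction alone.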

The resulting visual tokens are then passed into the language model, which acts as the system’s reasoning engine. Our architecture leverages a dual-thinking strategy that mirrors human System 1 and System 2 cognition. For regular shopping events, the model operates in a fast-thinking mode (System 1), directly predicting timestamps, spatial locations, product identities, and quantities. For more complex scenarios involving severe occlusions, repeated pick-and-return behaviors, or multiple small and visually similar items, the system transitions into a slow-thinking mode (System 2) that invokes agentic reasoning to iteratively gather evidence, reconcile inconsistencies, and refine its predictions.
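The control flow of this dual-thinking strategy can be sketched in a few lines of Python. This is a minimal illustration, not our implementation: the post does not say how the transition to slow thinking is triggered, so the confidence threshold is an assumption, and `EventPrediction`, `predict_event`, `build_receipt`, and the two stub models are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EventPrediction:
    timestamp_s: float   # when the interaction happened
    location: str        # hypothetical fixture identifier
    product_id: str
    quantity: int        # positive = pick, negative = return
    confidence: float    # hypothetical score in [0, 1]


Clip = Dict[str, Any]


def predict_event(clip: Clip,
                  fast_model: Callable[[Clip], EventPrediction],
                  slow_model: Callable[[Clip, EventPrediction], EventPrediction],
                  threshold: float = 0.9) -> Tuple[EventPrediction, str]:
    """Keep the fast prediction unless its confidence is below the threshold."""
    draft = fast_model(clip)
    if draft.confidence >= threshold:
        return draft, "system1"
    return slow_model(clip, draft), "system2"


def build_receipt(events: List[EventPrediction]) -> Dict[str, int]:
    """Net out picks and returns; keep only products the shopper left with."""
    totals: Counter = Counter()
    for event in events:
        totals[event.product_id] += event.quantity
    return {pid: qty for pid, qty in totals.items() if qty > 0}


# Stub models standing in for the real fast and slow paths.
def fast_stub(clip: Clip) -> EventPrediction:
    confidence = 0.5 if clip["occluded"] else 0.97
    return EventPrediction(clip["t"], clip["shelf"], clip["guess"],
                           clip["qty"], confidence)


def slow_stub(clip: Clip, draft: EventPrediction) -> EventPrediction:
    # Pretend that gathering extra evidence settles the product identity.
    return replace(draft, product_id=clip["truth"], confidence=0.95)


clips = [
    {"t": 12.0, "shelf": "A3", "guess": "cola-330ml", "truth": "cola-330ml",
     "qty": 2, "occluded": False},
    {"t": 40.0, "shelf": "B1", "guess": "chips-salt", "truth": "chips-vinegar",
     "qty": 1, "occluded": True},
    {"t": 75.0, "shelf": "A3", "guess": "cola-330ml", "truth": "cola-330ml",
     "qty": -1, "occluded": False},
]
results = [predict_event(c, fast_stub, slow_stub) for c in clips]
modes = [mode for _, mode in results]
receipt = build_receipt([event for event, _ in results])
print(modes)
print(receipt)
```

In this toy run only the occluded clip takes the slow path, and the returned cola is netted out of the receipt, so repeated pick-and-return behavior does not inflate the final quantities.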

Figure 2: A demo video to showcase the new agentic visual reasoning capabilities.

A key novelty of our approach is that this reasoning is grounded in learned visual agentic skills rather than handcrafted pipelines or external tools. The model itself acquires reusable capabilities such as search, zooming, cropping, segmentation, tracking, temporal retrieval, and counting through pre-training. During inference, the model autonomously decides when to invoke these skills, how to compose them, and where to apply them across space, time, and camera views. This transforms the model from a passive predictor into an active visual agent capable of dynamically gathering missing evidence, resolving ambiguity, and refining its own understanding of the scene.
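The decide, invoke, and update loop described above can be illustrated with a toy Python example. In the real system the skills are learned inside the model rather than implemented as external functions, so the code below only mirrors the control flow. The scene format, the two skills shown, the rule-based `choose_skill` policy, and the majority vote are all hypothetical stand-ins.

```python
from collections import Counter
from typing import Any, Callable, Dict, Optional

# Toy scene: item count seen in the shopper's hand per camera; None = occluded.
Scene = Dict[str, Optional[int]]
State = Dict[str, Any]


def search(scene: Scene, state: State) -> None:
    """Find camera views in which the hand is visible."""
    state["visible_views"] = [cam for cam, n in scene.items() if n is not None]


def count(scene: Scene, state: State) -> None:
    """Read the item count from each visible view."""
    state["counts"] = [scene[cam] for cam in state["visible_views"]]


SKILLS: Dict[str, Callable[[Scene, State], None]] = {
    "search": search,
    "count": count,
}


def choose_skill(state: State) -> Optional[str]:
    """Rule-based stand-in for the model's learned choice of the next skill."""
    if "visible_views" not in state:
        return "search"
    if state["visible_views"] and "counts" not in state:
        return "count"
    return None  # enough evidence gathered, or none available


def run_agent(scene: Scene, max_steps: int = 5) -> State:
    """Invoke skills until the policy stops, then answer by majority vote."""
    state: State = {"trace": []}
    for _ in range(max_steps):
        skill = choose_skill(state)
        if skill is None:
            break
        SKILLS[skill](scene, state)
        state["trace"].append(skill)
    counts = state.get("counts", [])
    state["answer"] = Counter(counts).most_common(1)[0][0] if counts else None
    return state


result = run_agent({"cam_1": None, "cam_2": 2, "cam_3": 2, "cam_4": 3})
print(result["trace"], result["answer"])
```

Here the agent first searches for unoccluded views, then counts in each of them, and reconciles the disagreement between cameras. The recorded trace plays the role of an explicit, inspectable reasoning trajectory.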

By integrating these learned visual skills directly into the reasoning loop, the system can connect fragmented observations across views and time and actively recover missing or occluded evidence, enabling accurate receipt generation in complex real-world environments.

Future Look

In this post, we shared our latest evolution of Just Walk Out technology. By building a Large Video Model with reasoning and agentic capabilities, we have achieved a new level of accuracy and scalability that matches, and in some cases surpasses, the previous multimodal system, which relied on both cameras and weight sensors. However, we have only begun to explore what is possible for physical-world AI. We are leveraging retail environments as a testbed for developing foundational intelligence, unlocking a future where AI can understand and reason about the physical world.
