What Is V-JEPA and Why Does It Matter?

V-JEPA (Video Joint Embedding Predictive Architecture) is a model published by Meta AI that learns representations of the world from video—without labels, without pixel prediction, and without the text scaffolding that large language models depend on.

Where a classical LLM ingests tokens and learns statistical co-occurrence patterns across billions of sentences, V-JEPA ingests video frames and learns to predict what will happen next in abstract representation space—not in pixel space. It doesn't try to reconstruct what the next frame looks like. It tries to predict the meaning of the next frame.

This is a fundamentally different learning signal. And it produces a fundamentally different kind of model.

The Coffee Kettle Problem

Consider what a child learns when they watch someone pour coffee. They don't store a pixel-perfect memory of the kettle's surface texture or the steam's precise trajectory. They extract a causal rule: tilt container, liquid falls out. They understand object permanence, gravity, and causality—all from observation, without a single labeled training example.

Classical LLMs cannot do this. They can describe coffee pouring in text with remarkable fluency—because they have read millions of descriptions of coffee pouring. But they have no internal model of the physics. Ask an LLM what happens if you tilt a sealed container upside down over a cup, and it will often give the correct answer by pattern matching on text. Ask a robot running an LLM to actually do it, and the gap between linguistic knowledge and physical reasoning becomes obvious.

V-JEPA is designed to close that gap. It watches video and builds an internal world model that encodes:

  1. Object permanence — objects continue to exist when occluded
  2. Physical causality — forces produce predictable effects on objects
  3. Temporal structure — events unfold in sequences with internal logic
  4. Abstract state prediction — what matters about the future, not what it looks like

Self-Supervised Learning Without Labels

The training procedure is elegant. V-JEPA masks portions of a video sequence, encodes the visible regions with a context encoder, and trains a predictor to output the representations of the masked regions—as produced by a separate target encoder—entirely in embedding space. There is no reconstruction loss on pixels, no human labels, no reward signal.

This is Self-Supervised Learning (SSL) applied to video at scale. The model learns because the prediction task is genuinely hard: to predict what a masked region means, you must understand what is happening in the scene, not just what it looks like.
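The training objective above can be sketched in a few lines. This is a toy illustration with made-up dimensions and plain linear maps standing in for the real vision-transformer encoders and predictor; the mask here is a fixed alternating pattern, whereas the actual model uses random spatiotemporal block masks. What it preserves is the key idea: the loss compares predicted and target embeddings, never pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a video clip: 16 patches, each a 32-dim feature vector.
# (Hypothetical sizes; real V-JEPA works on spatiotemporal video patches.)
num_patches, patch_dim, embed_dim = 16, 32, 8

# Hypothetical linear "encoders" in place of the real ViT encoders.
W_context = rng.normal(size=(patch_dim, embed_dim))
W_target = W_context.copy()            # target encoder: a frozen copy, no gradient
W_predictor = rng.normal(size=(embed_dim, embed_dim))

video_patches = rng.normal(size=(num_patches, patch_dim))

# Fixed alternating mask for the sketch; True = masked, to be predicted.
mask = np.arange(num_patches) % 2 == 0

# Context encoder sees only the visible patches.
context_embed = video_patches[~mask] @ W_context

# Predictor maps a pooled context summary to a guess for each
# masked patch's embedding.
context_summary = context_embed.mean(axis=0)
predicted = np.tile(context_summary @ W_predictor, (mask.sum(), 1))

# Target encoder embeds the masked patches.
target = video_patches[mask] @ W_target

# JEPA-style loss: distance in embedding space, not pixel space.
loss = np.mean((predicted - target) ** 2)
```

In the real model, the predictor and context encoder are trained by gradient descent while the target encoder is updated more slowly (e.g. as a moving average), which prevents the trivial solution where everything collapses to a constant embedding.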

The result is a model that generalizes across scenes and tasks in ways that pixel-prediction models do not. When tested on physical reasoning benchmarks—tasks that require understanding how objects will move, fall, or interact—V-JEPA significantly outperforms models trained to predict raw pixels.

Zero-Shot Planning and Robotics

The downstream application that makes V-JEPA practically significant is zero-shot planning: the ability to plan a sequence of actions in a new environment without task-specific training.

A robot equipped with a V-JEPA world model can:

  1. Observe its environment via camera
  2. Generate predicted future states for candidate action sequences
  3. Score those futures against a goal representation
  4. Execute the action sequence most likely to achieve the goal

This planning loop does not require the robot to have been trained on this specific task or this specific environment. The world model generalizes. The robot reasons from first principles about physics and causality, the same way a human would approach an unfamiliar kitchen.
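The four-step loop above is, in essence, a random-shooting planner over a learned latent dynamics model. The sketch below is a minimal illustration under stated assumptions: `world_model`, `score`, and the encoded states are stubs (plain matrices and vectors), not the actual V-JEPA interfaces, and the planner simply samples candidate action sequences, rolls each out in latent space, and keeps the one whose final state lands closest to the goal embedding.

```python
import numpy as np

rng = np.random.default_rng(1)

embed_dim, action_dim = 8, 2
horizon, num_candidates = 5, 64

# Hypothetical learned latent dynamics, stubbed as fixed matrices.
A = rng.normal(size=(embed_dim, embed_dim)) * 0.1
B = rng.normal(size=(action_dim, embed_dim)) * 0.1

def world_model(z, a):
    # Predict the next latent state from the current state and an action.
    return z @ A + a @ B

current_state = rng.normal(size=embed_dim)  # stands in for an encoded camera frame
goal_state = rng.normal(size=embed_dim)     # stands in for an encoded goal image

def score(z):
    # Closer to the goal embedding is better.
    return -np.linalg.norm(z - goal_state)

# Random-shooting planner: sample candidate action sequences, roll each
# one out through the world model, keep the best-scoring plan.
best_score, best_plan = -np.inf, None
for _ in range(num_candidates):
    plan = rng.normal(size=(horizon, action_dim))
    z = current_state
    for a in plan:
        z = world_model(z, a)
    s = score(z)
    if s > best_score:
        best_score, best_plan = s, plan

# best_plan is the action sequence the robot would execute first.
```

Practical systems typically refine this with iterative methods such as the cross-entropy method and replan after every executed action, but the structure is the same: predict futures in representation space, score them against a goal, act.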

Classical LLMs, by contrast, require either fine-tuning on task-specific data or extensive prompt engineering to approximate this kind of reasoning—and even then, they lack grounded physical understanding.

V-JEPA and the Path to AMI

Yann LeCun, Meta's Chief AI Scientist and the intellectual force behind JEPA architectures, has argued for years that the path to Advanced Machine Intelligence (AMI) runs through world models, not through scaling language models. His position: no amount of text data produces genuine understanding of the physical world, because the physical world is not primarily textual.

V-JEPA is the most concrete implementation of that argument to date. It is not a finished AMI system—it is a proof of concept that self-supervised video learning can produce models with meaningful physical reasoning capabilities.

The implications for AI in business are worth tracking. Systems that understand physical causality and can plan in novel environments will be qualitatively more capable than today's language models for applications in manufacturing, logistics, robotics, and any domain where AI must interact with the physical world rather than just describe it.

We are watching the early stages of a transition from AI that knows language to AI that understands the world.