SAM 3 Explained: How Meta’s New Model Solves “Concept Segmentation” (and How to Run It)

Introduction

There is a recurring debate right now: “Is computer vision finally a solved problem?” It is a tempting thought. We have models that can describe images with poetic detail and generators that can dream up video from scratch. Yet, until recently, if you wanted to ask a computer to “mask every single screw in this engine disassembly video,” you were out of luck. You had to click on them. One by one.

That changes with Segment Anything Model 3, or SAM 3.

Meta has just dropped the third iteration of their foundational vision model, and it represents a pivot from purely visual understanding to semantic understanding. While SAM 1 and SAM 2 were masterful at segmenting objects you explicitly pointed to, SAM 3 introduces a capability called Promptable Concept Segmentation. It does not just see pixels; it understands categories.

We are going to tear down the architecture, look at the benchmarks that matter, and show you how to run SAM 3 locally.

1. What is SAM 3? The Leap from “Visual” to “Concept” Segmentation

Editorial workspace visual comparing SAM 3 concept segmentation with manual point-and-click tools

To understand why SAM 3 is a big deal, you have to look at the limitation of its predecessors. SAM 2 is an incredible piece of engineering, but it is fundamentally a reactive tool. You give it a click, a box, or a mask, and it gives you the object. It is a translator of user intent into pixel masks.

SAM 3 changes the game by accepting concepts as inputs.

This is Promptable Concept Segmentation (PCS). Instead of clicking on a specific car, you can prompt the model with the text “car” or “red vehicle.” The model then hunts down every instance of that concept across the image or video frame. It handles the logic of “finding” and “segmenting” simultaneously.

The unified architecture creates a strange new workflow where you can combine inputs. You can provide a text prompt (“cat”) and a visual exemplar (a bounding box around one specific cat), and the model generalizes from that combination to find all cats that look similar. It is a shift from “segment this” to “segment everything like this.”

2. Under the Hood: The “Presence Token” and Deep Encoder Architecture

Technical diagram of SAM 3 Presence Token and shared encoder architecture for reliable concept detection

The engineering team at Meta Superintelligence Labs didn’t just glue a language model to SAM 2. They rebuilt the pipeline to solve a specific headache in open-vocabulary segmentation: false positives.

When you ask a standard model to find “a unicorn” in a picture of a kitchen, standard detectors often hallucinate. They try to force the most “unicorn-like” blob (perhaps a blender) into a bounding box because they are optimized to localize, not to reject.

SAM 3 solves this with a Presence Head.

The architecture uses a shared Perception Encoder (PE) backbone that processes both vision and text. But here is the clever part: they decoupled the recognition (“is the object even here?”) from the localization (“where is the object?”).

A global Presence Token is responsible solely for predicting the probability that the concept exists in the frame. The localization queries then do the heavy lifting of drawing masks. The final score is a product of these two. If the Presence Token says “zero chance,” the localization queries get suppressed. This simple architectural tweak drastically reduces the hallucination rate for objects that aren’t there.
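
Conceptually, the gating works like a product of two probabilities. The sketch below is a minimal illustration of that scoring rule, not Meta's implementation: a single per-frame presence probability scales every localization query's score, so a confident "not here" suppresses all candidate masks.

import torch

def gate_scores(presence_logit: torch.Tensor, query_logits: torch.Tensor) -> torch.Tensor:
    """Final score = P(concept is present) * P(this query localizes it).
    Illustrative only; mirrors the product described above, not the actual SAM 3 code."""
    presence_prob = torch.sigmoid(presence_logit)   # one scalar per frame
    query_probs = torch.sigmoid(query_logits)       # one score per candidate mask
    return presence_prob * query_probs

# A confident "not present" frame: every candidate gets suppressed.
scores = gate_scores(torch.tensor(-6.0), torch.tensor([2.0, 0.5, -1.0]))
print(scores)  # all values near zero despite one strong localization logit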

3. SAM 3 vs SAM 2: Key Differences and Improvements

If you are a developer deciding whether to upgrade, you need to know if the compute cost is worth it. SAM 3 is essentially a superset of SAM 2, meaning it retains all the visual prompting capabilities (clicks/boxes) while adding the semantic layer.

Here is the breakdown of the generational leap:

  • Input Modality: SAM 2 relies on geometric prompts (points, boxes). SAM 3 accepts geometric prompts, text phrases, and image exemplars.
  • Output Scope: SAM 2 is designed to segment a single target object per prompt. SAM 3 is designed to exhaustively segment all instances of a concept.
  • Video Persistence: SAM 3 features a memory-based tracker that shares the backbone with the detector. It improves re-identification when objects disappear behind occlusions and re-emerge later—a notorious pain point in tracking.
  • Disambiguation: SAM 3 employs a “hard mode-switch.” It can toggle between “concept mode” (find all cars) and “instance mode” (refine this specific car mask), preventing the logic errors that occur when models try to do both at once.

4. Breaking Down the Benchmarks: Is it Really SOTA?

Meta created a new benchmark specifically for this release called SA-Co (Segment Anything with Concepts). It is a beast of a dataset, containing over 207,000 unique concepts—roughly 50 times more than existing benchmarks.

The performance numbers suggest SAM 3 isn’t just a minor update.

In zero-shot evaluations on the LVIS dataset, SAM 3 achieved a mask AP (Average Precision) of 48.8. For context, the previous state-of-the-art was sitting at 38.5. That is a massive jump in the world of computer vision, where we usually fight for 1% gains.

SAM 3 benchmarks show particular strength in video consistency. By decoupling detection from tracking, the model sustains near real-time performance for roughly five concurrent objects.

Here is a look at the comparative performance data:

SAM 3 Performance Benchmarks

Comparative performance benchmarks for SAM 3 against Human, OWLv2, and Gemini models across image and video tasks.
Benchmark Category | Task Description | Metric | Model | Score
Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | Human | 72.8%
Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | SAM 3 | 53.9%
Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | OWLv2 | 24.6%
Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | Gemini 2.5 Pro | 13.0%
Concept Segmentation (Video) | Text → Masklets (on SA-Co SA-V) | pHOTA | Human | 70.5%
Concept Segmentation (Video) | Text → Masklets (on SA-Co SA-V) | pHOTA | SAM 3 | 58.0%
Concept Segmentation (Video) | Text → Masklets (on SA-Co SA-V) | pHOTA | LLMDet + SAM 3 Tracker | 30.1%
Visual Segmentation (Video) | Mask → Masklet (on SA-V test) | J&F | SAM 3 | 84.4%
Visual Segmentation (Video) | Mask → Masklet (on SA-V test) | J&F | SeC | 81.7%
Visual Segmentation (Video) | Mask → Masklet (on SA-V test) | J&F | SAM 2.1 L | 78.4%
Counting (Images) | Counting objects (on CountBench) | Accuracy | SAM 3 | 93.8%
Counting (Images) | Counting objects (on CountBench) | Accuracy | Gemini 2.5 Pro | 92.4%
Counting (Images) | Counting objects (on CountBench) | Accuracy | Qwen-VL-72B | 86.7%

The data shows SAM 3 doubling the accuracy of existing systems like OWLv2 on the new concept segmentation tasks. It approaches human performance in video tracking, which is arguably the hardest task in the suite.

5. Real-World Use Cases: Medical, Marine, and Creative

SAM 3 segmenting medical scans, marine life, and VFX scenes across three professional screens

Synthesizing Reddit discussions and early experiments from the community reveals where this model actually shines versus where it is just hype.

5.1 Medical Imaging:

Medical image segmentation AI is the “holy grail” application for tools like this. SAM 3 shows promise here because of its promptable nature. A radiologist could theoretically prompt “tumor” or “lesion.” Early tests show it is excellent for pre-labeling—generating a rough pass that a human expert refines. It does not replace the expert, as it can struggle with highly specific biological structures (like distinct intracranial arteries) without fine-tuning, but it speeds up the workflow massively.

5.2 Scientific Research:

Marine ecology researchers are already looking at SAM 3 for analyzing underwater survey footage. The “Presence Token” is valuable here. If you are scanning hours of empty ocean floor for a specific starfish, you want a model that confidently says “nothing here” rather than hallucinating rocks as starfish.

5.3 Creative and VFX:

For video editors, the “concept segmentation” is a rotoscoping dream. You can type “person” and get a mask for every actor in the scene to apply color grading or effects. It removes the need to manually initialize trackers on every single person in a crowd shot.

6. Hardware Requirements: Can You Run SAM 3 Locally?

The big question for developers: Can I run SAM 3 locally? The answer is yes, but video will hurt.

6.1 The Weights:

The model itself is surprisingly efficient in terms of parameter count (~850M parameters). This fits comfortably in the VRAM of high-end consumer cards like the RTX 3090 or 4090.
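
As a quick sanity check on why that fits on a 24GB card: the weights alone in fp16/bf16 come to roughly 1.6 GB, with the rest of the footprint coming from activations at 1024×1024 resolution and, for video, the memory bank. A rough back-of-the-envelope estimate:

# Rough VRAM estimate for the weights alone (fp16/bf16 = 2 bytes per parameter).
# Activations, the 1024x1024 input, and video memory push the real footprint higher.
params = 850e6
weight_gb = params * 2 / 1024**3
print(f"~{weight_gb:.1f} GB of VRAM just for the weights")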

6.2 Inference Speed:

For single images, SAM 3 is snappy. On an H200 GPU, it clocks in at about 30ms per image. You will see slower but usable speeds on consumer hardware.

6.3 The Video Bottleneck:

Video is where the math gets heavy. SAM 3 tracks every object with a masklet (a spatio-temporal mask). The inference cost scales linearly with the number of objects. Tracking one person is fast. Tracking 50 people in a crowd will tank your frame rate. If you need real-time performance on video with multiple objects, you are looking at data-center grade hardware or significant optimization work (like quantization or distillation).
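
To build an intuition for that scaling, here is a back-of-the-envelope cost model. The base and per-object millisecond costs below are placeholder assumptions, not measured figures; the only point is that frame time grows linearly with the number of masklets you track.

def estimate_frame_ms(n_objects: int, base_ms: float = 25.0, per_object_ms: float = 4.0) -> float:
    """Illustrative linear cost model: shared backbone/detection cost plus a
    per-masklet tracking cost. The constants are assumptions, not benchmarks."""
    return base_ms + per_object_ms * n_objects

for n in (1, 5, 50):
    ms = estimate_frame_ms(n)
    print(f"{n:>3} objects -> ~{ms:.0f} ms/frame (~{1000 / ms:.1f} FPS)")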

7. How to Use SAM 3: A Quick Start Guide

The code is out, and the barriers to entry are low. You have three primary ways to get this running.

7.1 Option 1: Direct Python Usage

This is for the engineers building pipelines. You will need to request access to the checkpoints via Hugging Face first.

import torch
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# 1. Load the model (Requires CUDA)
model = build_sam3_image_model()
processor = Sam3Processor(model)

# 2. Load your image
image = Image.open("my_dataset_image.jpg")
inference_state = processor.set_image(image)

# 3. The Magic: Prompt with a concept
# SAM 3 will find ALL instances of this concept
output = processor.set_text_prompt(
    state=inference_state,
    prompt="red sports car"
)

# 4. Extract results
masks = output["masks"]
boxes = output["boxes"]
print(f"Found {len(masks)} instances.")

7.2 Option 2: Visualizing with Notebooks

The GitHub repository includes excellent Jupyter notebooks (sam3_image_predictor_example.ipynb). These are the best way to inspect the “Presence Score” and see how the model discriminates between positive and negative text prompts.

7.3 Option 3: Auto-Labeling Pipelines

The killer app for SAM 3 isn’t necessarily running it in production—it is using it to label data. You can use SAM 3 to auto-label a massive dataset of specific objects (e.g., “hard hats”) and then train a smaller, faster model (like YOLOv10 or RT-DETR) on those labels for edge deployment. A short Python script can automate this auto-labeling process effectively.
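
Here is a minimal sketch of such a pipeline, reusing the processor object from the quick-start above and writing YOLO-format label files. It assumes the boxes come back as [x1, y1, x2, y2] pixel coordinates and that the single text concept maps to class 0; verify the repo's actual output format before relying on it.

from pathlib import Path
from PIL import Image

CONCEPT = "hard hat"     # the concept to auto-label
CLASS_ID = 0
image_dir = Path("raw_images")
label_dir = Path("labels")
label_dir.mkdir(exist_ok=True)

for path in sorted(image_dir.glob("*.jpg")):
    image = Image.open(path)
    w, h = image.size
    state = processor.set_image(image)                        # processor from the quick-start
    output = processor.set_text_prompt(state=state, prompt=CONCEPT)

    lines = []
    for box in output["boxes"]:
        x1, y1, x2, y2 = [float(v) for v in box]              # assumed xyxy pixel coords
        # Convert to YOLO format: class cx cy bw bh, normalized to [0, 1]
        cx, cy = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"{CLASS_ID} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")

    (label_dir / f"{path.stem}.txt").write_text("\n".join(lines))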

8. Limitations and “Hallucinations”

We need to be honest about what this model cannot do. SAM 3 is not an AGI. It is a pattern matcher with a good vocabulary.

8.1 Spatial Reasoning:

It struggles with queries that require logic. If you ask for “the man behind the car,” SAM 3 often fails to understand the prepositional relationship and might just segment both the man and the car. It segments nouns well; it segments relationships poorly.

8.2 Ambiguity:

The English language is messy. If you prompt “bat,” SAM 3 relies on visual context to decide between the animal and the baseball equipment. If the image is ambiguous, the model’s guess is a coin toss. The “Ambiguity Head” in the architecture tries to mitigate this by predicting multiple valid masks, but user guidance is often still required.

8.3 Generalization:

While it is “open vocabulary,” it is not omniscient. It generalizes poorly to niche domains (like thermal imagery or specific industrial parts) without fine-tuning. The “concept” understanding breaks down when the visual features deviate too far from the training distribution.

9. Conclusion: The “GPT Moment” for Computer Vision?

Segment Anything Model 3 feels like a foundational shift. We are moving away from the era where computer vision models were just “eye” simulators that needed manual pointing. We are entering an era where they have a brain behind the eyes.

The decoupling of recognition and localization via the Presence Token is a technical insight that will likely be copied across the industry. It solves the fundamental problem of “don’t find things that aren’t there.”

For developers and researchers, the value of SAM 3 lies in its versatility. Whether you are building medical image segmentation AI pipelines or just trying to automate video editing, this model raises the baseline of what is possible out of the box.

The weights are available. The code is on GitHub. It is time to see what you can build with it.

SAM 3 Model Card Overview

Technical specifications, hardware requirements, and architecture details for the SAM 3 model.
Feature | Specification | Notes
Parameters | ~850M | ~450M vision encoder, ~300M text encoder, ~100M heads
License | SAM License | Mostly research use; check the repo for specific restrictions.
Training Compute | 172k A100 hours | Also utilized 86k H200 hours. Massive scale.
Input Resolution | 1024×1024 | Standard square crop for processing.
Key Architecture | DETR-based detector | Uses a shared “Perception Encoder” backbone.
Video Strategy | Masklet tracking | Cost scales linearly with object count.
Primary Dataset | SA-Co | 207k unique concepts, millions of images.


Glossary

Promptable Concept Segmentation (PCS): A task where the model identifies all instances of a concept (e.g., “car”) across an image or video based on a text or visual input, rather than just a single object.
Presence Token: A specific architectural component in SAM 3 that predicts the probability of a concept existing in the frame before attempting to segment it, significantly reducing false positives (hallucinations).
Masklet: A spatio-temporal mask that tracks an object across multiple video frames, maintaining its identity over time rather than treating each frame as a static image.
Exemplar Prompt: A visual input method where the user provides a reference image (e.g., a crop of a specific bird) and the model finds all other objects that look semantically similar.
SA-Co (Segment Anything with Concepts): The massive new benchmark dataset created by Meta, containing over 207,000 unique concepts, used to train and evaluate SAM 3’s semantic understanding.
cgF1 (Classification-Gated F1): A performance metric that combines the quality of the mask (F1 score) with the accuracy of the model’s “presence” prediction, ensuring the model isn’t rewarded for guessing on empty images.
pHOTA (Presence-Aware Higher Order Tracking Accuracy): A video tracking metric that evaluates how well the model tracks objects over time while strictly penalizing it for detecting objects that do not exist in the frame.
Zero-Shot: The ability of the model to identify and segment objects or concepts it was not explicitly trained on, using its generalized understanding of language and visuals.
Inference: The process of using the trained model to process live data (an image or video) and generate a result (a mask), as opposed to “training” where the model learns.
Quantization: A technique to reduce the size and memory usage of a model (e.g., running at 8-bit instead of 16-bit) to make it run faster on consumer hardware, often with a minor trade-off in accuracy.
DETR (Detection Transformer): The underlying architecture used in SAM 3’s detector, which uses a transformer to predict object sets directly, streamlining the pipeline compared to older CNN-based methods.
IoU (Intersection over Union): A standard metric that measures the overlap between the predicted mask and the “ground truth” (perfect) mask; a higher IoU means a more accurate segmentation.
Occlusion: A scenario in video tracking where an object is temporarily hidden (e.g., a person walking behind a tree); SAM 3’s memory tracker is designed to handle this by “remembering” the object until it reappears.
VRAM (Video Random Access Memory): The high-speed memory on your graphics card; SAM 3 requires significant VRAM (16GB+) to store the model weights and process images efficiently locally.

Frequently Asked Questions

What is the difference between SAM 3 and SAM 2?

SAM 3 adds “Promptable Concept Segmentation” (finding all cats, not just one specific cat) and improved video memory, whereas SAM 2 was strictly limited to segmenting individual objects you manually clicked on.

Can I run SAM 3 locally, and what are the hardware requirements?

Yes, but it is demanding. Real-time video processing requires enterprise H100 GPUs, but you can run image segmentation locally on consumer cards like the RTX 4090 with 24GB VRAM for decent performance.

What is “Promptable Concept Segmentation” (PCS)?

This is the model’s new ability to take a generic text prompt (e.g., “red wheels”) or visual example and autonomously find every matching instance in a video or image, rather than just one target.

Is SAM 3 free for commercial use?

Generally yes, under the “SAM License,” which allows both research and commercial applications. However, users must verify the specific license file for restrictions on “out-of-scope” use cases like surveillance.

How does SAM 3 perform on medical images compared to specialized models?

It serves as a powerful generalist baseline (48.8 AP on LVIS), but experts note it still trails behind fine-tuned specialist models (like MedSAM) for detecting highly specific vascular or cell structures.