Introduction
There is a recurring debate right now: “Is computer vision finally a solved problem?” It is a tempting thought. We have models that can describe images with poetic detail and generators that can dream up video from scratch. Yet, until recently, if you wanted to ask a computer to “mask every single screw in this engine disassembly video,” you were out of luck. You had to click on them. One by one.
That changes with Segment Anything Model 3, or SAM 3.
Meta has just dropped the third iteration of their foundational vision model, and it represents a pivot from purely visual understanding to semantic understanding. While SAM 1 and SAM 2 were masterful at segmenting objects you explicitly pointed to, SAM 3 introduces a capability called Promptable Concept Segmentation. It does not just see pixels; it understands categories.
We are going to tear down the architecture, look at the benchmarks that matter, and show you how to run SAM 3 locally.
1. What is SAM 3? The Leap from “Visual” to “Concept” Segmentation

To understand why SAM 3 is a big deal, you have to look at the limitation of its predecessors. SAM 2 is an incredible piece of engineering, but it is fundamentally a reactive tool. You give it a click, a box, or a mask, and it gives you the object. It is a translator of user intent into pixel masks.
SAM 3 changes the game by accepting concepts as inputs.
This is Promptable Concept Segmentation (PCS). Instead of clicking on a specific car, you can prompt the model with the text “car” or “red vehicle.” The model then hunts down every instance of that concept across the image or video frame. It handles the logic of “finding” and “segmenting” simultaneously.
The unified architecture creates a strange new workflow where you can combine inputs. You can provide a text prompt (“cat”) and a visual exemplar (a bounding box around one specific cat), and the model generalizes from that combination to find all cats that look similar. It is a shift from “segment this” to “segment everything like this.”
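To make that combined workflow concrete, here is a minimal sketch built on the image API shown in the quick-start section later in this article. The text-prompt call comes from that section; the exemplar-box call is an assumed method name and may differ in the released code.

```python
# Sketch only: set_text_prompt mirrors the quick-start below; the exemplar
# call is an assumed method name and may differ in the actual release.
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

model = build_sam3_image_model()
processor = Sam3Processor(model)

state = processor.set_image(Image.open("street_scene.jpg"))

# Text concept: find every instance of the category.
output = processor.set_text_prompt(state=state, prompt="cat")

# Visual exemplar (assumed API): a box around one known cat, in pixel
# (x1, y1, x2, y2) coordinates, which the model generalizes from.
# output = processor.add_exemplar_box(state=state, box=(120, 80, 260, 210))

print(f"Found {len(output['masks'])} matching instances.")
```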
2. Under the Hood: The “Presence Token” and the Shared Perception Encoder

The engineering team at Meta Superintelligence Labs didn’t just glue a language model to SAM 2. They rebuilt the pipeline to solve a specific headache in open-vocabulary segmentation: false positives.
When you ask a standard model to find “a unicorn” in a picture of a kitchen, standard detectors often hallucinate. They try to force the most “unicorn-like” blob (perhaps a blender) into a bounding box because they are optimized to localize, not to reject.
SAM 3 solves this with a Presence Head.
The architecture uses a shared Perception Encoder (PE) backbone that processes both vision and text. But here is the clever part: they decoupled the recognition (“is the object even here?”) from the localization (“where is the object?”).
A global Presence Token is responsible solely for predicting the probability that the concept exists in the frame. The localization queries then do the heavy lifting of drawing masks. The final score is a product of these two. If the Presence Token says “zero chance,” the localization queries get suppressed. This simple architectural tweak drastically reduces the hallucination rate for objects that aren’t there.
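To make the scoring scheme concrete, here is a toy sketch (not the actual SAM 3 code) of how a global presence probability can gate per-query detection scores:

```python
import torch

def gated_detection_scores(presence_logit: torch.Tensor,
                           query_logits: torch.Tensor) -> torch.Tensor:
    """Toy illustration of presence-gated scoring.

    presence_logit: scalar logit from the global presence token
                    ("is the concept in this frame at all?").
    query_logits:   per-query logits from the localization queries
                    ("how well does each proposed mask match?").
    The final score per proposal is the product of the two probabilities,
    so a near-zero presence probability suppresses every query at once.
    """
    p_present = torch.sigmoid(presence_logit)   # recognition
    p_localized = torch.sigmoid(query_logits)   # localization
    return p_present * p_localized              # final per-query score

# Example: the concept is almost certainly absent, so even a confident
# localization query ends up with a tiny final score.
scores = gated_detection_scores(torch.tensor(-4.0), torch.tensor([3.0, 0.5, -2.0]))
print(scores)  # all values pulled toward zero by the presence gate
```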
3. SAM 3 vs SAM 2: Key Differences and Improvements
If you are a developer deciding whether to upgrade, you need to know if the compute cost is worth it. SAM 3 is essentially a superset of SAM 2, meaning it retains all the visual prompting capabilities (clicks/boxes) while adding the semantic layer.
Here is the breakdown of the generational leap:
- Input Modality: SAM 2 relies on geometric prompts (points, boxes). SAM 3 accepts geometric prompts, text phrases, and image exemplars.
- Output Scope: SAM 2 is designed to segment a single target object per prompt. SAM 3 is designed to exhaustively segment all instances of a concept.
- Video Persistence: SAM 3 features a memory-based tracker that shares the backbone with the detector. It improves re-identification when objects disappear behind occlusions and re-emerge later—a notorious pain point in tracking.
- Disambiguation: SAM 3 employs a “hard mode-switch.” It can toggle between “concept mode” (find all cars) and “instance mode” (refine this specific car mask), preventing the logic errors that occur when models try to do both at once.
4. Breaking Down the Benchmarks: Is it Really SOTA?
Meta created a new benchmark specifically for this release called SA-Co (Segment Anything with Concepts). It is a beast of a dataset, containing over 207,000 unique concepts—roughly 50 times more than existing benchmarks.
The performance numbers suggest SAM 3 isn’t just a minor update.
In zero-shot evaluations on the LVIS dataset, SAM 3 achieved a mask AP (Average Precision) of 48.8. For context, the previous state-of-the-art was sitting at 38.5. That is a massive jump in the world of computer vision, where we usually fight for 1% gains.
SAM 3 benchmarks show particular strength in video consistency. By decoupling the detection and tracking, the model sustains near real-time performance for about 5 concurrent objects.
Here is a look at the comparative performance data:
SAM 3 Performance Benchmarks
| Benchmark Category | Task Description | Metric | Model | Score |
|---|---|---|---|---|
| Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | Human | 72.8% |
| Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | SAM 3 | 53.9% |
| Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | OWLv2 | 24.6% |
| Concept Segmentation (Images) | Text → Masks (on SA-Co Gold) | cgF1 | Gemini 2.5 Pro | 13.0% |
| Concept Segmentation (Video) | Text → Masklets (on SA-Co SA-V) | pHOTA | Human | 70.5% |
| Concept Segmentation (Video) | Text → Masklets (on SA-Co SA-V) | pHOTA | SAM 3 | 58.0% |
| Concept Segmentation (Video) | Text → Masklets (on SA-Co SA-V) | pHOTA | LLMDet + SAM3 Tracker | 30.1% |
| Visual Segmentation (Video) | Mask → Masklet (on SA-V test) | J&F | SAM 3 | 84.4% |
| Visual Segmentation (Video) | Mask → Masklet (on SA-V test) | J&F | SeC | 81.7% |
| Visual Segmentation (Video) | Mask → Masklet (on SA-V test) | J&F | SAM 2.1 L | 78.4% |
| Counting (Images) | Counting objects (on CountBench) | Accuracy | SAM 3 | 93.8% |
| Counting (Images) | Counting objects (on CountBench) | Accuracy | Gemini 2.5 Pro | 92.4% |
| Counting (Images) | Counting objects (on CountBench) | Accuracy | Qwen-VL-72B | 86.7% |
The data shows SAM 3 doubling the accuracy of existing systems like OWLv2 on the new concept segmentation tasks. It approaches human performance in video tracking, which is arguably the hardest task in the suite.
5. Real-World Use Cases: Medical, Marine, and Creative

Synthesizing Reddit discussions and early community experiments reveals where this model actually shines and where it is just hype.
5.1 Medical Imaging:
Medical image segmentation AI is the “holy grail” application for tools like this. SAM 3 shows promise here because of its promptable nature. A radiologist could theoretically prompt “tumor” or “lesion.” Early tests show it is excellent for pre-labeling—generating a rough pass that a human expert refines. It does not replace the expert, as it can struggle with highly specific biological structures (like distinct intracranial arteries) without fine-tuning, but it speeds up the workflow massively.
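As an illustration, here is a minimal pre-labeling sketch built on the quick-start API from later in this article. The prompt, folder names, and mask tensor layout are assumptions; adapt them to your own export pipeline.

```python
# Minimal pre-labeling sketch (assumptions: PNG exports of scans, masks that
# convert to binary arrays, and the quick-start API shown later in this post).
from pathlib import Path

import numpy as np
from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

model = build_sam3_image_model()
processor = Sam3Processor(model)

review_dir = Path("review_queue")
review_dir.mkdir(exist_ok=True)

for path in sorted(Path("scans_png").glob("*.png")):
    state = processor.set_image(Image.open(path).convert("RGB"))
    output = processor.set_text_prompt(state=state, prompt="lesion")

    # Save each rough mask as a PNG for an expert to accept or refine.
    for i, mask in enumerate(output["masks"]):
        mask = mask.cpu().numpy() if hasattr(mask, "cpu") else np.asarray(mask)
        binary = (mask.squeeze() > 0).astype(np.uint8) * 255
        Image.fromarray(binary).save(review_dir / f"{path.stem}_mask{i}.png")
```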
5.2 Scientific Research:
Marine ecology researchers are already looking at SAM 3 for analyzing underwater survey footage. The “Presence Token” is valuable here. If you are scanning hours of empty ocean floor for a specific starfish, you want a model that confidently says “nothing here” rather than hallucinating rocks as starfish.
5.3 Creative and VFX:
For video editors, the “concept segmentation” is a rotoscoping dream. You can type “person” and get a mask for every actor in the scene to apply color grading or effects. It removes the need to manually initialize trackers on every single person in a crowd shot.
6. Hardware Requirements: Can You Run SAM 3 Locally?
The big question for developers: Can I run SAM 3 locally? The answer is yes, but video will hurt.
6.1 The Weights:
The model itself is surprisingly efficient in terms of parameter count (~850M parameters). This fits comfortably in the VRAM of high-end consumer cards like the RTX 3090 or 4090.
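A quick back-of-envelope check makes that claim concrete, assuming half-precision weights (2 bytes per parameter); activations add overhead on top, so treat this as a floor rather than a ceiling.

```python
# Rough VRAM estimate for the ~850M-parameter figure quoted above.
params = 850e6
bytes_per_param = 2            # bf16 / fp16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"Weights alone: ~{weights_gb:.1f} GB")   # ~1.6 GB
# Even with activations at 1024x1024 input, this sits comfortably inside
# the 24 GB of an RTX 3090 / 4090.
```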
6.2 Inference Speed:
For single images, SAM 3 is snappy. On an H200 GPU, it clocks in at about 30ms per image. You will see slower but usable speeds on consumer hardware.
6.3 The Video Bottleneck:
Video is where the math gets heavy. SAM 3 tracks every object with a masklet (a spatio-temporal mask). The inference cost scales linearly with the number of objects. Tracking one person is fast. Tracking 50 people in a crowd will tank your frame rate. If you need real-time performance on video with multiple objects, you are looking at data-center grade hardware or significant optimization work (like quantization or distillation).
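To see why object count matters so much, here is an illustrative (not measured) cost model. The base latency roughly echoes the ~30 ms single-image figure above; the per-object cost is a placeholder constant.

```python
# Illustrative cost model for masklet tracking, where per-frame latency
# grows roughly linearly with the number of tracked objects. The two
# constants are placeholders, not measured SAM 3 numbers.
def estimated_frame_ms(num_objects: int,
                       base_ms: float = 30.0,
                       per_object_ms: float = 4.0) -> float:
    """Rough per-frame latency: shared backbone cost + per-masklet cost."""
    return base_ms + per_object_ms * num_objects

for n in (1, 5, 50):
    ms = estimated_frame_ms(n)
    print(f"{n:>2} objects: ~{ms:.0f} ms/frame (~{1000 / ms:.1f} FPS)")
```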
7. How to Use SAM 3: A Quick Start Guide
The code is out, and the barriers to entry are low. You have three primary ways to get this running.
7.1 Option 1: Direct Python Usage
This is for the engineers building pipelines. You will need to request access to the checkpoints via Hugging Face first.
```python
import torch
from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# 1. Load the model (requires CUDA)
model = build_sam3_image_model()
processor = Sam3Processor(model)

# 2. Load your image
image = Image.open("my_dataset_image.jpg")
inference_state = processor.set_image(image)

# 3. The magic: prompt with a concept
# SAM 3 will find ALL instances of this concept
output = processor.set_text_prompt(
    state=inference_state,
    prompt="red sports car"
)

# 4. Extract results
masks = output["masks"]
boxes = output["boxes"]
print(f"Found {len(masks)} instances.")
```
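If you want to sanity-check the results, a small follow-up like the one below overlays the returned masks on the image. It continues from the snippet above and assumes each mask converts to a boolean H×W array; adjust the conversion if your checkpoint returns a different layout.

```python
# Optional follow-up: overlay the returned masks in red to eyeball the output.
import numpy as np

image_np = np.array(image.convert("RGB"))
overlay = image_np.copy()

for mask in masks:
    mask_np = mask.cpu().numpy() if hasattr(mask, "cpu") else np.asarray(mask)
    mask_bool = mask_np.squeeze() > 0
    overlay[mask_bool] = (0.5 * overlay[mask_bool]
                          + 0.5 * np.array([255, 0, 0])).astype(np.uint8)

Image.fromarray(overlay).save("red_sports_cars_overlay.png")
```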
7.2 Option 2: Visualizing with Notebooks
The GitHub repository includes excellent Jupyter notebooks (sam3_image_predictor_example.ipynb). These are the best way to inspect the “Presence Score” and see how the model discriminates between positive and negative text prompts.
7.3 Option 3: Auto-Labeling Pipelines
The killer app for SAM 3 is not necessarily running it in production; it is using it to label data. You can use SAM 3 to auto-label a massive dataset of specific objects (e.g., “hard hats”) and then train a smaller, faster model (like YOLOv10 or RT-DETR) on those labels for edge deployment. A short Python script can automate the whole auto-labeling pipeline, as sketched below.
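Here is a hedged sketch of such a pipeline: it runs SAM 3 over a folder of images and writes YOLO-format label files. It assumes output["boxes"] are pixel-space (x1, y1, x2, y2) coordinates and uses class index 0 for “hard hat”; verify both against your checkpoint before training on the labels.

```python
# Auto-labeling sketch: SAM 3 boxes -> YOLO-format labels for a folder of images.
from pathlib import Path

from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

model = build_sam3_image_model()
processor = Sam3Processor(model)

images_dir, labels_dir = Path("images"), Path("labels")
labels_dir.mkdir(exist_ok=True)

for path in sorted(images_dir.glob("*.jpg")):
    image = Image.open(path)
    w, h = image.size
    state = processor.set_image(image)
    output = processor.set_text_prompt(state=state, prompt="hard hat")

    lines = []
    for box in output["boxes"]:
        x1, y1, x2, y2 = (float(v) for v in box)  # assumed pixel xyxy format
        # YOLO format: class x_center y_center width height (all normalized)
        cx, cy = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"0 {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")

    (labels_dir / f"{path.stem}.txt").write_text("\n".join(lines))
```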
8. Limitations and “Hallucinations”
We need to be honest about what this model cannot do. SAM 3 is not an AGI. It is a pattern matcher with a good vocabulary.
8.1 Spatial Reasoning:
It struggles with queries that require logic. If you ask for “the man behind the car,” SAM 3 often fails to understand the prepositional relationship and might just segment both the man and the car. It segments nouns well; it segments relationships poorly.
8.2 Ambiguity:
The English language is messy. If you prompt “bat,” SAM 3 relies on visual context to decide between the animal and the baseball equipment. If the image is ambiguous, the model’s guess is a coin toss. The “Ambiguity Head” in the architecture tries to mitigate this by predicting multiple valid masks, but user guidance is often still required.
8.3 Generalization:
While it is “open vocabulary,” it is not omniscient. It generalizes poorly to niche domains (like thermal imagery or specific industrial parts) without fine-tuning. The “concept” understanding breaks down when the visual features deviate too far from the training distribution.
9. Conclusion: The “GPT Moment” for Computer Vision?
Segment Anything Model 3 feels like a foundational shift. We are moving away from the era where computer vision models were just “eye” simulators that needed manual pointing. We are entering an era where they have a brain behind the eyes.
The decoupling of recognition and localization via the Presence Token is a technical insight that will likely be copied across the industry. It solves the fundamental problem of “don’t find things that aren’t there.”
For developers and researchers, the value of SAM 3 lies in its versatility. Whether you are building medical image segmentation AI pipelines or just trying to automate video editing, this model raises the baseline of what is possible out of the box.
The weights are available. The code is on GitHub. It is time to see what you can build with it.
SAM 3 Model Card Overview
| Feature | Specification | Notes |
|---|---|---|
| Parameters | ~850M | ~450M Vision Encoder, ~300M Text Encoder, ~100M Heads |
| License | SAM License | Permits research and commercial use; check the repo for out-of-scope restrictions. |
| Training Compute | 172k A100 Hours | Also utilized 86k H200 hours. Massive scale. |
| Input Resolution | 1024×1024 | Standard square crop for processing. |
| Key Architecture | DETR-based Detector | Uses a shared “Perception Encoder” backbone. |
| Video Strategy | Masklet Tracking | Linearly scales cost with object count. |
| Primary Dataset | SA-Co | 207k unique concepts, millions of images. |
Frequently Asked Questions
What is the difference between SAM 3 and SAM 2?
SAM 3 adds “Promptable Concept Segmentation” (finding all cats, not just one specific cat) and improved video memory, whereas SAM 2 was strictly limited to segmenting individual objects you manually clicked on.
Can I run SAM 3 locally, and what are the hardware requirements?
Yes, but it is demanding. Image segmentation runs locally on consumer cards like the RTX 4090 (24 GB VRAM) at decent speeds, while real-time video processing with many objects pushes you toward enterprise GPUs like the H100.
What is “Promptable Concept Segmentation” (PCS)?
This is the model’s new ability to take a generic text prompt (e.g., “red wheels”) or visual example and autonomously find every matching instance in a video or image, rather than just one target.
Is SAM 3 free for commercial use?
Generally yes, under the “SAM License,” which allows both research and commercial applications. However, users must verify the specific license file for restrictions on “out-of-scope” use cases like surveillance.
How does SAM 3 perform on medical images compared to specialized models?
It serves as a powerful generalist baseline (48.8 AP on LVIS), but experts note it still trails behind fine-tuned specialist models (like MedSAM) for detecting highly specific vascular or cell structures.
