Meta recently unveiled SAM2, the second generation of its "Segment Anything" model, at SIGGRAPH. Compared with the previous version, SAM2 extends the task from image segmentation to video segmentation: it can process videos of arbitrary length in real time and segment and track objects it has never encountered before.
Crucially, the model's code, weights, and dataset have all been open-sourced. In the spirit of the Llama series' open releases, the model is available under the Apache 2.0 license, and the evaluation code is shared under the BSD-3 license.
Meta stated that the open-sourced dataset includes 51,000 real-world videos and 600,000 spatio-temporal masks (masklets), far exceeding the scale of previous datasets of this kind. An online demo is also available for anyone to try.
SAM2 builds upon SAM by adding a memory module. Its key upgrades include:
- Real-time segmentation of videos of any length
- Zero-shot generalization
- Improved segmentation and tracking accuracy
- Handling occlusions
The interactive segmentation process mainly consists of two steps: selection and refinement. In the first frame, users select the target object by clicking. SAM2 then automatically propagates the segmentation to subsequent frames, forming a spatio-temporal mask. If SAM2 loses the target object in certain frames, users can correct it by providing additional prompts in a new frame.
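As an illustration, here is a minimal sketch of that click-then-propagate-then-correct loop in Python. The function names (`build_sam2_video_predictor`, `init_state`, `add_new_points`, `propagate_in_video`), the checkpoint and config file names, and the argument layout are loosely modeled on the released sam2 package, but they should be read as assumptions rather than the exact API.

```python
import torch
from sam2.build_sam import build_sam2_video_predictor  # assumed import path

# Assumed config/checkpoint names; substitute the files shipped with the release.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(video_path="clip.mp4")

    # Step 1 (selection): a positive click on the target object in frame 0.
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=[[210, 350]], labels=[1],   # label 1 = positive click
    )

    # SAM2 propagates the prediction through the video, producing the masklet.
    masklets = {
        frame_idx: masks
        for frame_idx, obj_ids, masks in predictor.propagate_in_video(state)
    }

    # Step 2 (refinement): if tracking drifts on a later frame, add a
    # corrective click there and propagate again.
    predictor.add_new_points(
        state, frame_idx=120, obj_id=1,
        points=[[400, 180]], labels=[1],
    )
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        masklets[frame_idx] = masks
```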
SAM2's core idea is to treat an image as a single-frame video, which allows SAM to be extended directly to the video domain while supporting both image and video inputs. The only difference when processing video is that the model must rely on memory to recall previously processed information in order to segment the object accurately at the current time step.
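Conceptually, the image case is then just the video case with one frame and an empty memory. The schematic below (not Meta's code; `segment_frame`, `segment_video`, and `segment_image` are hypothetical names) illustrates that unification.

```python
from typing import Any, Callable, Sequence

def segment_video(frames: Sequence[Any],
                  prompts: dict,
                  segment_frame: Callable) -> list:
    """Hypothetical per-frame loop: each prediction conditions on past memories."""
    memories: list = []        # empty at the start of any video
    masks = []
    for t, frame in enumerate(frames):
        mask, memory = segment_frame(frame, prompts.get(t), memories)
        memories.append(memory)   # remember this frame for later ones
        masks.append(mask)
    return masks

def segment_image(image, prompt, segment_frame: Callable):
    # An image is simply a one-frame video: same code path, no memory to recall.
    return segment_video([image], {0: prompt}, segment_frame)[0]
```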
To address the challenges of video segmentation, Meta focused on three main areas:
- Designing a promptable visual segmentation task
- Developing a new model based on SAM
- Building the SA-V dataset
The team designed a promptable visual segmentation task that generalizes image segmentation to video. SAM2 is trained to accept prompts in any frame of a video to define the spatio-temporal mask (masklet) to be predicted. Given a prompt, it makes an instant mask prediction on the current frame and then propagates it temporally to produce masks for the target object across all frames.
By introducing streaming memory, the model can process video in real time and segment and track target objects more accurately. The memory component consists of a memory encoder, a memory bank, and a memory attention module. This design lets the model handle videos of arbitrary length, which matters both for annotation collection on the SA-V dataset and for potential applications in fields such as robotics.
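A rough sketch of how these three components might be wired together in a streaming loop follows; the class, the bounded FIFO bank, and the module interfaces are assumptions for illustration, not Meta's implementation.

```python
from collections import deque
import torch
import torch.nn as nn

class StreamingMemory(nn.Module):
    """Illustrative wiring of memory encoder, memory bank, and memory attention."""

    def __init__(self, image_encoder, memory_encoder, memory_attention,
                 mask_decoder, bank_size: int = 8):
        super().__init__()
        self.image_encoder = image_encoder        # per-frame feature extractor
        self.memory_encoder = memory_encoder      # fuses frame features + predicted mask
        self.memory_attention = memory_attention  # cross-attends features to the bank
        self.mask_decoder = mask_decoder          # predicts the mask (optionally prompted)
        self.bank = deque(maxlen=bank_size)       # bounded FIFO of past memories

    @torch.no_grad()
    def step(self, frame: torch.Tensor, prompt=None) -> torch.Tensor:
        feats = self.image_encoder(frame)
        if self.bank:
            # Condition current-frame features on previously stored memories.
            feats = self.memory_attention(feats, list(self.bank))
        mask = self.mask_decoder(feats, prompt)
        # Store a compact memory of this frame's prediction for future frames.
        self.bank.append(self.memory_encoder(feats, mask))
        return mask
```

A bounded bank is one simple way to keep per-frame cost constant regardless of video length, which is consistent with the real-time, any-length claim above.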
When a prompt is ambiguous about which object to segment, SAM2 outputs multiple valid candidate masks. In addition, to handle occlusion in video, SAM2 includes an extra "occlusion head" that predicts whether the target object is visible in the current frame.
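As a hedged sketch of what these extra outputs might look like at the decoder, the `FrameOutput` fields and the score-based selection below are illustrative names, not the paper's exact interface.

```python
from dataclasses import dataclass
import torch

@dataclass
class FrameOutput:
    masks: torch.Tensor         # (K, H, W): K candidate masks for an ambiguous prompt
    iou_scores: torch.Tensor    # (K,): predicted quality of each candidate
    object_score: torch.Tensor  # scalar from the occlusion head: is the object visible?

def select_mask(out: FrameOutput, visibility_threshold: float = 0.0):
    """Pick the highest-scoring candidate, or report the object as occluded."""
    if out.object_score.item() < visibility_threshold:
        return None                      # object not present in this frame
    best = torch.argmax(out.iou_scores)  # index of the most confident candidate
    return out.masks[best]
```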
The SA-V dataset contains 4.5 times more videos and 53 times more annotations than the largest existing similar dataset. To collect such a large amount of data, the research team built a data engine that iteratively improves both the dataset and the model.
Compared with state-of-the-art semi-supervised video object segmentation methods, SAM2 performs well across a range of metrics. The research team nevertheless acknowledges some limitations, such as losing track of objects in crowded scenes or under significant changes in camera viewpoint. The real-time interactive design allows users to correct such cases manually with additional prompts.
The model is not only open-sourced for free use but is also hosted on platforms such as Amazon SageMaker.