MarkTechPost - PE-AV Release
web · marktechpost.com/2025/12/22/meta-ai-open-sourced-percepti...
A news article covering Meta's open-source release of PE-AV, a multimodal audiovisual encoder; tangentially relevant to AI safety discussions around open-sourcing powerful foundation model components and multimodal capabilities.
Metadata
Importance: 28/100 · news article · news
Summary
Meta AI has open-sourced PE-AV (Perception Encoder Audiovisual), a multimodal encoder that jointly processes audio and visual information, powering their SAM-Audio system and enabling large-scale audiovisual retrieval. The model represents an extension of Meta's Perception Encoder family into the audio-visual domain, designed for robust cross-modal understanding. This release contributes to the open-source multimodal AI ecosystem with implications for how foundation models handle combined sensory inputs.
Key Points
- Meta open-sourced PE-AV, an audiovisual encoder that jointly encodes audio and visual modalities for multimodal AI tasks.
- PE-AV powers SAM-Audio (Segment Anything Model extended to audio) and supports large-scale multimodal retrieval applications.
- The model is part of Meta's broader Perception Encoder family, extending strong vision encoders into the audio-visual domain.
- Open-sourcing enables researchers to build on top of PE-AV for tasks like audio-visual segmentation, retrieval, and cross-modal reasoning.
- This release reflects Meta's continued strategy of open-sourcing foundational AI components to accelerate ecosystem-wide research.
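As context for the retrieval use case mentioned above: once audio-video clips and text captions share one embedding space, retrieval reduces to nearest-neighbor search by cosine similarity. A minimal sketch follows; the embeddings and the `cosine_topk` helper are illustrative stand-ins, not the PE-AV API.

```python
import numpy as np

def cosine_topk(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k index rows most similar to the query vector."""
    q = query / np.linalg.norm(query)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = idx @ q                      # cosine similarity per row
    return np.argsort(-scores)[:k]

# Toy 4-dim "joint" embeddings: rows stand in for cached clip embeddings.
clips = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
text_query = np.array([1.0, 0.05, 0.0, 0.0])   # embedding of a text caption
print(cosine_topk(text_query, clips, k=2))      # → [0 2]
```

In a real system the index would live in an approximate-nearest-neighbor store rather than a dense matrix, but the ranking principle is the same.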
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Meta AI (FAIR) | Organization | 51.0 |
Cached Content Preview
HTTP 200 · Fetched Mar 15, 2026 · 18 KB
Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval - MarkTechPost
Meta researchers have introduced Perception Encoder Audiovisual (PE-AV), a new family of encoders for joint audio and video understanding. The model learns aligned audio, video, and text representations in a single embedding space through large-scale contrastive training on roughly 100M audio-video pairs with text captions.
From Perception Encoder to PE-AV
Perception Encoder (PE) is the core vision stack in Meta's Perception Models project: a family of encoders for images, video, and audio that reaches state-of-the-art results on many vision and audio benchmarks using a unified contrastive pretraining recipe. PE-Core surpasses SigLIP2 on image tasks and InternVideo2 on video tasks; PE-Lang powers the Perception Language Model for multimodal reasoning; PE-Spatial is tuned for dense prediction tasks such as detection and depth estimation.
PE-AV builds on this backbone and extends it to full audio-video-text alignment. In the Perception Models repository, the audiovisual branch is listed as the one that embeds audio, video, audio-video, and text into a single joint embedding space for cross-modal understanding.
https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/
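The "single joint embedding space" described above is typically trained with a CLIP-style symmetric contrastive objective: matched (audio-video, text) pairs are pushed together and mismatched pairs apart. The sketch below shows that objective in NumPy under stated assumptions; it is a generic InfoNCE illustration, not Meta's published training code, and the temperature value is arbitrary.

```python
import numpy as np

def clip_style_loss(av_emb: np.ndarray, txt_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired (audio-video, text) embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pushes each row and column to place its probability mass there.
    """
    av = av_emb / np.linalg.norm(av_emb, axis=1, keepdims=True)
    tx = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = av @ tx.T / temperature          # (B, B) similarity matrix

    def xent_diag(m: np.ndarray) -> float:
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        logprob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(logprob)))

    # Average the audio-video→text and text→audio-video directions.
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero; with random embeddings it stays near log(batch size), which is the usual sanity check for a contrastive setup.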
Architecture: Separate Towers and Fusion
The PE-AV architecture is composed of a frame encoder, a video encoder, an audio encoder, an audio-video fusion encoder, and a text encoder.
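The tower layout listed above can be sketched as a composition of per-modality encoders whose outputs land in one shared width, with the fusion encoder consuming the concatenated audio and video features. All class names, dimensions, and the random-projection stand-in below are hypothetical; the real towers are transformer encoders.

```python
import numpy as np

class Tower:
    """Stand-in for one encoder tower: a fixed random projection to out_dim."""
    def __init__(self, in_dim: int, out_dim: int, seed: int):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w

D = 16                                   # shared embedding width (assumed)
video_enc  = Tower(32, D, seed=0)        # frame/video path
audio_enc  = Tower(24, D, seed=1)        # audio path
fusion_enc = Tower(2 * D, D, seed=2)     # audio-video fusion over concat features
text_enc   = Tower(48, D, seed=3)        # text path

video, audio, text = np.ones(32), np.ones(24), np.ones(48)
v, a = video_enc(video), audio_enc(audio)
av = fusion_enc(np.concatenate([v, a]))  # fused audio-video embedding
t = text_enc(text)
assert av.shape == t.shape == (D,)       # every modality lands in one space
```

The point of the sketch is the wiring, not the math: separate towers per modality, a fusion stage over audio plus video, and a text tower projecting into the same space so cross-modal similarity is well defined.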
The video path uses the existing PE frame encoder on RGB frames, then applies a
... (truncated, 18 KB total)
Resource ID: 0fee6c54b7a7246e | Stable ID: NDMxODU2NT