Longterm Wiki

MarkTechPost - PE-AV Release

web

A news article covering Meta's open-source release of PE-AV, a multimodal audiovisual encoder; tangentially relevant to AI safety discussions around open-sourcing powerful foundation model components and multimodal capabilities.

Metadata

Importance: 28/100 · news article · news

Summary

Meta AI has open-sourced PE-AV (Perception Encoder Audiovisual), a multimodal encoder that jointly processes audio and visual information, powering their SAM-Audio system and enabling large-scale audiovisual retrieval. The model represents an extension of Meta's Perception Encoder family into the audio-visual domain, designed for robust cross-modal understanding. This release contributes to the open-source multimodal AI ecosystem with implications for how foundation models handle combined sensory inputs.

Key Points

  • Meta open-sourced PE-AV, an audiovisual encoder that jointly encodes audio and visual modalities for multimodal AI tasks.
  • PE-AV powers SAM-Audio (Segment Anything Model extended to audio) and supports large-scale multimodal retrieval applications.
  • The model is part of Meta's broader Perception Encoder family, extending strong vision encoders into the audio-visual domain.
  • Open-sourcing enables researchers to build on top of PE-AV for tasks like audio-visual segmentation, retrieval, and cross-modal reasoning.
  • This release reflects Meta's continued strategy of open-sourcing foundational AI components to accelerate ecosystem-wide research.

Cited by 1 page

Page | Type | Quality
Meta AI (FAIR) | Organization | 51.0

Cached Content Preview

HTTP 200 · Fetched Mar 15, 2026 · 18 KB
Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval - MarkTechPost
 
Meta researchers have introduced Perception Encoder Audiovisual (PE-AV), a new family of encoders for joint audio and video understanding. The model learns aligned audio, video, and text representations in a single embedding space using large-scale contrastive training on about 100M audio-video pairs with text captions.
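The contrastive training described above pulls matching audio-video and text pairs together in the shared space while pushing mismatched pairs apart. A minimal sketch of the symmetric InfoNCE-style objective such training typically uses; this is illustrative only, not Meta's actual training code, and the batch size, embedding dimension, and temperature are arbitrary choices:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (video, text) embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; the loss
    pushes each video toward its own caption and away from the others,
    and symmetrically for text -> video.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature  # (batch, batch) cosine similarities
    labels = np.arange(len(logits))
    # Cross-entropy in the video -> text direction.
    log_sm_v = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_v2t = -log_sm_v[labels, labels].mean()
    # Cross-entropy in the text -> video direction.
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2v = -log_sm_t[labels, labels].mean()
    return (loss_v2t + loss_t2v) / 2

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 64))
captions = video + 0.05 * rng.normal(size=(8, 64))  # nearly-aligned toy pairs
print(contrastive_loss(video, captions))
```

Well-aligned pairs yield a loss near zero, while unrelated embeddings yield a loss near log(batch size); at PE-AV's reported scale of roughly 100M pairs, the same objective is simply applied over many large batches.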

From Perception Encoder to PE-AV

Perception Encoder (PE) is the core vision stack in Meta's Perception Models project. It is a family of encoders for images, video, and audio that reaches state-of-the-art results on many vision and audio benchmarks using a unified contrastive pretraining recipe. PE core surpasses SigLIP2 on image tasks and InternVideo2 on video tasks. PE lang powers the Perception Language Model for multimodal reasoning. PE spatial is tuned for dense prediction tasks such as detection and depth estimation.

PE-AV builds on this backbone and extends it to full audio-video-text alignment. In the Perception Models repository, PE audiovisual is listed as the branch that embeds audio, video, audio-video, and text into a single joint embedding space for cross-modal understanding.
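Once audio, video, and text share one embedding space, cross-modal retrieval reduces to nearest-neighbor search by cosine similarity. A minimal sketch, assuming precomputed embeddings rather than the actual PE-AV API:

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=3):
    """Rank gallery items (e.g. audio-video clips) by cosine similarity to a
    query (e.g. a text embedding), assuming both live in one joint embedding
    space. Returns the indices of the top_k closest gallery items."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                  # cosine similarity of each item to the query
    return np.argsort(-scores)[:top_k]

# Toy gallery of 4 orthogonal clip embeddings; the query points mostly at item 2.
gallery = np.eye(4)
query = np.array([0.1, 0.0, 0.9, 0.0])
print(retrieve(query, gallery, top_k=2))  # prints [2 0]
```

At the "large-scale multimodal retrieval" sizes mentioned in the article, the same dot-product ranking would be served by an approximate nearest-neighbor index rather than a full matrix product, but the geometry is identical.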

 
 https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/ 
 

Architecture: Separate Towers and Fusion

The PE-AV architecture is composed of a frame encoder, a video encoder, an audio encoder, an audio-video fusion encoder, and a text encoder.
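The tower structure listed above can be sketched as follows. This is a toy illustration of how separate encoders feed a fusion stage; the projection dimensions, pooling choices, and function names here are hypothetical and not taken from Meta's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_projection(in_dim, out_dim):
    # Stand-in for a learned encoder tower: a fixed random linear projection.
    w = rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)
    return lambda x: x @ w

EMB = 32  # hypothetical shared embedding width
# Illustrative towers mirroring the components named in the article.
frame_encoder = make_projection(768, EMB)       # per-frame RGB features -> frame tokens
video_encoder = make_projection(EMB, EMB)       # aggregates frame tokens over time
audio_encoder = make_projection(128, EMB)       # spectrogram features -> audio tokens
fusion_encoder = make_projection(2 * EMB, EMB)  # joint audio-video embedding
text_encoder = make_projection(300, EMB)        # caption features -> text embedding

def encode_audiovisual(frames, audio):
    """Fuse a clip's video and audio paths into one joint embedding."""
    v = video_encoder(frame_encoder(frames).mean(axis=0))  # pool over frames
    a = audio_encoder(audio).mean(axis=0)                  # pool over audio tokens
    return fusion_encoder(np.concatenate([v, a]))

clip_emb = encode_audiovisual(rng.normal(size=(16, 768)), rng.normal(size=(50, 128)))
print(clip_emb.shape)  # prints (32,)
```

The key design point the sketch shows is that each modality keeps its own tower, and the fusion encoder produces the combined audio-video embedding that shares a space with the text encoder's output.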

 

 
 The video path uses the existing PE frame encoder on RGB frames, then applies a 

... (truncated, 18 KB total)
Resource ID: 0fee6c54b7a7246e | Stable ID: NDMxODU2NT