Podcast Episode

Stability AI Releases Stable Audio 3 With 6-Minute Track Generation

May 20, 2026

Stability AI has unveiled Stable Audio 3, a family of latent diffusion models capable of generating professional-grade instrumental music and sound effects up to 6 minutes and 20 seconds long. The release more than doubles the previous 3-minute ceiling and introduces open weights for three of its four model variants, alongside a novel semantic-acoustic autoencoder. The launch positions Stability AI as a leader in instrumental music generation and sound design, distinct from rivals focused on vocal song generation.

A Leap Forward in AI Audio Generation

Stability AI has released Stable Audio 3, a new family of latent diffusion models that can generate professional-grade instrumental music and sound effects up to 6 minutes and 20 seconds in length. This more than doubles the 3-minute ceiling of its predecessor, Stable Audio 2.5, and marks a significant step forward in long-form AI audio generation. The accompanying research paper was published on 18 May, with open weights and full training and inference pipelines released for three of the four models in the family.

A Four-Model Family

The release spans four model variants of varying sizes and capabilities. Two small models, one optimised for music and one for sound effects, each contain 459 million parameters and can generate up to two minutes of audio. A 1.4-billion-parameter medium model and a 2.7-billion-parameter large model both support the full 6-minute-20-second generation length. The large model can produce a maximum-length track in just 1.8 seconds on an H200 GPU. Three of the four models (the small-music, small-sfx, and medium variants) are released with open weights and licensed training data and are designed to run on consumer-grade hardware, including Apple's MacBook Pro M4. The large model's weights remain proprietary.
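To put the speed figure in perspective, the short Python sketch below works out how much faster than real time the large model would be running, using only the two numbers quoted above (a 6-minute-20-second track generated in 1.8 seconds). The function and constant names are illustrative and do not come from Stability AI's code.

```python
# Rough arithmetic from the figures quoted in the article.
# Names here are illustrative, not part of any Stability AI API.

def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    return audio_seconds / generation_seconds

MAX_TRACK_SECONDS = 6 * 60 + 20      # 6 min 20 s = 380 s
H200_GENERATION_SECONDS = 1.8        # reported time for the large model on an H200

rtf = real_time_factor(MAX_TRACK_SECONDS, H200_GENERATION_SECONDS)
print(f"Roughly {rtf:.0f}x faster than real time")
```

On those figures the ratio comes out at a little over 200 times real time. The number applies only to the large model on an H200; the article gives no timing for the open-weight models on consumer hardware.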

Technical Advances

Stable Audio 3 introduces several technical innovations. The models are built atop a novel semantic-acoustic autoencoder that compresses audio at a 4,096x downsampling ratio while preserving both audio fidelity and semantic structure. The system supports variable-length generation, a departure from earlier diffusion models that required generating full-length outputs regardless of the desired clip length. It also enables inpainting for targeted audio editing and continuation of existing recordings. The training pipeline uses a three-stage approach: flow matching pre-training, distillation warmup, and adversarial post-training via the company's Adversarial Relativistic-Contrastive method, allowing high-quality outputs in just a few inference steps.
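The 4,096x downsampling ratio determines how short the latent sequence is that the diffusion model actually has to generate. The sketch below estimates that length for a maximum-length track. The 44.1 kHz sample rate is an assumption made for illustration, as the article does not state the model's sample rate, and the function name is hypothetical.

```python
import math

# ASSUMPTION: 44.1 kHz audio. The article does not state the sample rate.
ASSUMED_SAMPLE_RATE_HZ = 44_100
DOWNSAMPLING_RATIO = 4_096           # from the article

def latent_frames(duration_seconds: float,
                  sample_rate_hz: int = ASSUMED_SAMPLE_RATE_HZ,
                  ratio: int = DOWNSAMPLING_RATIO) -> int:
    """Number of latent frames needed to cover a clip of the given duration."""
    samples = duration_seconds * sample_rate_hz
    return math.ceil(samples / ratio)

full_track = latent_frames(6 * 60 + 20)   # 6 min 20 s
small_model = latent_frames(2 * 60)       # two-minute limit of the small models
latent_rate_hz = ASSUMED_SAMPLE_RATE_HZ / DOWNSAMPLING_RATIO

print(full_track, small_model, round(latent_rate_hz, 2))
```

Under that assumed sample rate, a full 6-minute-20-second track corresponds to only a few thousand latent frames, at roughly 11 frames per second of audio. This is a plausible reason why long tracks and variable-length generation remain tractable, though the article itself does not draw that connection.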

Industry Context

The release comes as competition in AI music generation intensifies. Suno, valued at $2.45 billion, and Udio, which settled a licensing dispute with Universal Music Group in late 2025, have emerged as leading competitors in song generation with vocals. Stability AI appears to be carving out its own niche, positioning Stable Audio 3 as the go-to option for instrumental music and sound design, with a strong emphasis on open-weight access and legal clarity through licensed training data. This open approach contrasts sharply with the more closed, vocals-focused offerings of its rivals and signals a maturing market with distinct specialisations.

Published May 20, 2026 at 8:01pm
