After generative AI for images and code, what comes next? For Stability AI, the answer is text-to-audio generation.
Stability AI today announced the first public release of its Stable Audio technology, which lets anyone generate short audio clips from simple text prompts. Stability AI is the company behind the Stable Diffusion text-to-image generation model (venturebeat.com).
Stable Diffusion was updated in July with the new SDXL base model, which improves image composition. The company followed that announcement in August with StableCode, expanding its scope beyond images to include code.
Although Stable Audio is a new product, it builds on many of the same core AI techniques that let Stable Diffusion produce images. Specifically, Stable Audio uses a diffusion model to generate new audio clips, one that is trained on audio rather than images.
Ed Newton-Rex, vice president of audio at Stability AI, told VentureBeat, “Stability AI is best known for its work in images, but now we’re launching our first product for music and audio generation, which is called Stable Audio.” “The idea is quite straightforward: you just write down the music or audio you want to hear, and our system will produce it for you.”
How Stable Audio creates new music rather than MIDI files
Newton-Rex is no stranger to computer-generated music: he founded his own startup, Jukedeck, in 2011 and sold it to TikTok in 2019 (venturebeat.com/ai/stability-ai-debuts-stable-audio-bringing-text-to-audio-generation-to-the-masses/).
The technology behind Stable Audio, however, does not come from Jukedeck. It originated in Harmonai, Stability AI's internal research lab for music generation, which was created by Zach Evans.
Evans told VentureBeat, “It’s a lot of taking the same ideas from the image generation space and applying them to the domain of audio.” “I founded Harmonai, a research lab that is a part of Stability AI and essentially a means to have this generative audio research conducted as an open community effort.”
Using technology to generate base audio tracks is not new. People have previously had access to what Evans called “symbolic generation” techniques. He explained that symbolic generation typically works with MIDI (Musical Instrument Digital Interface) files, which can represent something like a drum roll. Stable Audio's generative AI differs in that it lets users create new music that goes beyond the repetitive notes common in MIDI and symbolic generation.
For higher-quality results, Stable Audio works directly with raw audio samples. The model was trained on more than 800,000 pieces of licensed music from the audio library AudioSparx.
Evans stated, “With that much data, it’s very complete metadata.” “Having audio data that is both high quality and has good corresponding metadata is one of the really hard things to do when you’re doing these text based models.”
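The difference between symbolic generation and working with raw audio samples can be illustrated with a small sketch. The example below is not Stable Audio's code; the `NoteEvent` class, the helper functions, and the 44.1 kHz sample rate are all hypothetical choices made for illustration. It shows that a MIDI-style note is only a handful of numbers, while the audio for that same note is a long list of waveform samples, which is the kind of data a raw-audio model learns from.

```python
import math
from dataclasses import dataclass

# Assumed sample rate for illustration; the article does not state one.
SAMPLE_RATE = 44_100


@dataclass
class NoteEvent:
    """A symbolic, MIDI-style description of a single note (hypothetical)."""
    pitch: int         # MIDI note number; 69 is the A above middle C
    velocity: int      # loudness on the MIDI scale 0-127
    duration_s: float  # note length in seconds


def midi_to_hz(pitch: int) -> float:
    """Convert a MIDI note number to a frequency in hertz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


def render_sine(note: NoteEvent, sample_rate: int = SAMPLE_RATE) -> list[float]:
    """Render the note as raw audio samples using a plain sine wave."""
    n_samples = round(note.duration_s * sample_rate)
    freq = midi_to_hz(note.pitch)
    amp = note.velocity / 127
    return [
        amp * math.sin(2 * math.pi * freq * i / sample_rate)
        for i in range(n_samples)
    ]


note = NoteEvent(pitch=69, velocity=100, duration_s=0.5)
samples = render_sine(note)
print(f"symbolic: {note}")
print(f"raw audio: {len(samples)} samples")
```

The symbolic event only says which note to play; the timbre comes from whatever instrument renders it. A model that works on raw samples has to produce every value in the waveform itself, which is harder but not limited to a fixed set of notes.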
Don't expect Stable Audio to create a new Beatles song
Generating images in the style of a particular artist is a common use of image generation models. With Stable Audio, however, users won't be able to ask the model to generate new music that sounds like, for example, a classic Beatles song.
Newton-Rex remarked, “We haven’t trained on the Beatles.” “With audio sample generation, it’s not always what people want to pursue for musicians.”
In Newton-Rex's experience, most musicians starting a new audio piece want to be more creative rather than ask for something in the style of The Beatles or any other specific musical group.
Finding the right prompts for text-to-audio generation
As a diffusion model, Stable Audio has about 1.2 billion parameters, Evans said, which is roughly on par with the original release of Stable Diffusion for image generation.
The text model used to turn prompts into audio was built and trained by Stability AI. Evans explained that the text model uses a technique known as Contrastive Language Audio Pretraining (CLAP). Alongside the Stable Audio launch, Stability AI is also releasing a prompt guide to help users write text prompts that produce the kinds of audio they want to generate.
Stable Audio will be available in a free version and as a $12/month Pro plan. The free version allows 20 generations per month of tracks up to 20 seconds long; the Pro version raises those limits to 500 generations and 90-second tracks. Newton-Rex stated, “We want to give everyone the opportunity to use this and experiment with it.”
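To put the two tiers side by side, the limits quoted above can be turned into a maximum amount of generated audio per month. The sketch below is simple arithmetic on the figures in this article; the `Tier` class and its field names are hypothetical and do not correspond to any Stability AI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Plan limits as reported in the article (class itself is hypothetical)."""
    name: str
    price_usd_per_month: int
    generations_per_month: int
    max_track_seconds: int

    def max_audio_seconds(self) -> int:
        """Upper bound on audio per month if every track uses the full length."""
        return self.generations_per_month * self.max_track_seconds


FREE = Tier("Free", 0, 20, 20)
PRO = Tier("Pro", 12, 500, 90)

for tier in (FREE, PRO):
    total = tier.max_audio_seconds()
    print(f"{tier.name}: up to {total} s ({total / 60:.1f} min) of audio per month")
```

Under these figures the free version tops out at 400 seconds of audio per month, while the Pro plan allows up to 45,000 seconds, or 12.5 hours.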