Scaling Generalist Speech Models

One model, natural prompts, infinite possibilities

Ending The Curse of Specialization

Towards Generalist Speech Models

Current speech AI requires a specialized model for every task. Want to clone a voice? One model. Make it sing? A different model. Dub it into another language? A third model. We're building a single generalist model that handles all speech tasks through natural instructions and in-context learning.

Multi-task by Design

One model trained simultaneously on voice cloning, generation, editing, dubbing, and audio understanding. Not separate specialized models stitched together.

Instruction Following

Describe what you want, the way you'd direct a sound engineer: "Make this voice sound older and speak slower." "Speak in the same accent as the user." "Sing a song in my voice."
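
To make the idea concrete, here is a rough sketch of what that kind of request could look like in code. It is purely illustrative: the endpoint URL, field names, and response format are hypothetical assumptions, not a published Kalpa Labs API.

```python
import requests

# Hypothetical endpoint and fields -- for illustration only.
API_URL = "https://api.example.com/v1/speech"

response = requests.post(
    API_URL,
    data={
        # Plain-language direction, as you'd give a sound engineer.
        "instruction": "Make this voice sound older and speak slower.",
        "text": "Welcome back! Let's pick up where we left off.",
    },
    # Reference recording the instruction applies to.
    files={"reference_audio": open("my_voice.wav", "rb")},
    timeout=60,
)
response.raise_for_status()

# Assume the service returns the synthesized audio as raw bytes.
with open("styled_output.wav", "wb") as f:
    f.write(response.content)
```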

In-Context Learning

Contextually aware voice agents that adjust tone based on the conversation's audio history, not just its text transcript. Instantly clone your voice by including a recording in the input prompt.
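
As a hedged sketch of what "context" could mean here: earlier audio turns and a reference recording travel with the new request, so the model conditions on how the conversation has sounded so far rather than on a flat transcript. The message schema and endpoint below are assumptions made for illustration.

```python
import base64
import requests

def audio_b64(path: str) -> str:
    """Read a local recording and base64-encode it for the request payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Hypothetical request schema -- illustration only, not a published API.
payload = {
    "messages": [
        # Earlier turns carry audio, so the model hears the user's tone,
        # accent, and pacing instead of reading a text transcript.
        {"role": "user", "audio": audio_b64("turn_1_user.wav")},
        {"role": "assistant", "audio": audio_b64("turn_1_reply.wav")},
        # Reference recording for instant voice cloning, supplied in-context.
        {"role": "user",
         "audio": audio_b64("my_voice_sample.wav"),
         "text": "Reply in my voice, and match the calmer tone I just used."},
    ]
}

resp = requests.post("https://api.example.com/v1/converse",
                     json=payload, timeout=60)
resp.raise_for_status()
with open("reply_in_my_voice.wav", "wb") as f:
    f.write(resp.content)
```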

Complex Capabilities

Ask it to clone your voice, speak in a Texas accent, then sing a melody. All with conversational prompts.
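
One way such a chain of requests could look in practice, again only as an illustrative sketch (the session endpoint, parameters, and response handling are hypothetical):

```python
import requests

BASE = "https://api.example.com/v1"  # hypothetical base URL

# Start a session anchored on a short recording of your voice.
session = requests.post(
    f"{BASE}/sessions",
    files={"reference_audio": open("my_voice.wav", "rb")},
    timeout=60,
).json()

# Each step is an ordinary conversational prompt against the same session.
steps = [
    "Clone my voice from the recording I just gave you.",
    "Now say 'Howdy, welcome to the rodeo' in a Texas accent.",
    "Finally, sing that line to a slow country melody.",
]

for i, prompt in enumerate(steps, start=1):
    resp = requests.post(
        f"{BASE}/sessions/{session['id']}/instruct",
        data={"instruction": prompt},
        timeout=120,
    )
    resp.raise_for_status()
    with open(f"step_{i}.wav", "wb") as f:
        f.write(resp.content)
```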

About Us

Building the universal speech model

Kalpa Labs is scaling generalist speech models to the same limits as LLMs: one model for every audio task, instructed the way you'd direct a sound engineer. Founded by Prashant Shishodia (ex-Google) and Gautam Jha (ex-QRT, Squarepoint).

Prashant Shishodia
CEO, Co-Founder

Gautam Jha
CTO, Co-Founder

Start building with one model for everything
