
# Speech Steps

QuickFlo includes multi-provider speech steps for transcription (STT) and voice synthesis (TTS). Choose the best provider for your use case — all four share the same step interface, so switching providers doesn’t require rewiring your workflow.

| Provider | STT | TTS | Diarization | Streaming |
|---|---|---|---|---|
| OpenAI | Whisper | TTS-1 / TTS-1-HD | No | No |
| ElevenLabs | Scribe v1 | Multilingual v2 + others | Yes | Yes |
| Google Cloud | Chirp 2, Telephony | Neural / Studio voices | Yes | Yes |
| AWS | Transcribe | Polly (Neural) | Yes | Yes |

Each provider requires its own connection — an API key for OpenAI/ElevenLabs, a GCP service account for Google Cloud, or AWS credentials for AWS.

## Text to Speech

Generate audio from text using any of the four providers.

*Text-to-speech step editor showing OpenAI provider with voice and format selection*
| Field | Description |
|---|---|
| Provider | OpenAI, ElevenLabs, Google Cloud, or AWS |
| Connection | API key or cloud credentials for the selected provider |
| Voice | Voice to use; the dropdown populates with available voices once you select a connection |
| Model | Provider-specific model (e.g., TTS-1, TTS-1-HD for OpenAI) |
| Audio Format | Output format (MP3, WAV, OGG, PCM, etc.; varies by provider) |
| Text | The text to convert; supports template syntax |
| Output Mode | Save to managed storage (file) or return as base64 |
| Filename | Custom filename; supports templates |
| Speed | Speech speed from 0.25x to 4x (default: 1x) |
Example output:

```json
{
  "audio": {
    "url": "gs://your-org/audio/greeting_abc123.mp3",
    "filename": "greeting_abc123.mp3",
    "format": "mp3",
    "size": 48200
  },
  "provider": "openai",
  "voice": "alloy",
  "textLength": 142
}
```
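With Output Mode set to base64, the audio comes back inline as a base64 string rather than as a stored file. A minimal sketch of decoding that payload back into raw bytes (variable names are illustrative, not part of the step's schema):

```python
import base64

def decode_audio(b64_audio: str) -> bytes:
    """Decode a base64 audio payload (TTS step in base64 output mode)
    back into raw bytes, e.g. for writing to a local file."""
    return base64.b64decode(b64_audio)

# Round trip with placeholder bytes standing in for real MP3 data.
fake_mp3 = b"\xff\xfb\x90\x00" + bytes(16)
encoded = base64.b64encode(fake_mp3).decode("ascii")
assert decode_audio(encoded) == fake_mp3
```

File mode is usually the better default for long outputs, since base64 inflates the payload by roughly a third.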

OpenAI — TTS-1-HD produces higher quality at higher cost. Supports MP3, Opus, AAC, FLAC, WAV, PCM.

ElevenLabs — Large multilingual voice library. Multilingual v2 recommended for quality, Turbo v2.5 for low latency. Supports telephony formats (u-law 8kHz).

Google Cloud — Standard, Neural2, and Studio voice tiers. Voice selection determines quality tier automatically. Requires a GCP service account with Cloud Text-to-Speech API enabled.

AWS Polly — Neural engine voices. Supports MP3, OGG Vorbis, PCM, and JSON speech marks (timing metadata). Requires AWS credentials with Polly permissions.
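On Polly's speech marks: Polly returns them as newline-delimited JSON, one object per mark, rather than a single JSON document. A minimal parser, with sample lines that are illustrative rather than captured output:

```python
import json

def parse_speech_marks(payload: str) -> list[dict]:
    """Parse Polly speech-mark output: newline-delimited JSON,
    one mark (sentence/word/viseme/ssml) per line."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]

# Illustrative sample for the phrase "Hello world".
sample = (
    '{"time":0,"type":"sentence","start":0,"end":11,"value":"Hello world"}\n'
    '{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}\n'
    '{"time":374,"type":"word","start":6,"end":11,"value":"world"}\n'
)
marks = parse_speech_marks(sample)
# `time` is milliseconds from the start of the audio stream;
# `start`/`end` are character offsets into the input text.
```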

## Speech to Text

Transcribe audio to text with optional timestamps, speaker diarization, and subtitle generation.

*Speech-to-text step editor showing AWS provider with output detail and diarization options*
| Field | Description |
|---|---|
| Provider | OpenAI, ElevenLabs, Google Cloud, or AWS |
| Connection | API key or cloud credentials |
| Audio Source | Stored file (GCS, S3, managed) or base64 data |
| Output Detail | `text` (transcript only), `segments` (with timestamps), or `full` (includes VTT/SRT subtitles) |
| Speaker Diarization | Identify and label different speakers (not available with OpenAI) |
| Language | Language code (e.g., en-US); leave empty for auto-detection |

With `text` output detail:

```json
{
  "text": "Hello, this is a test recording.",
  "provider": "openai",
  "durationMs": 3200
}
```

With `segments` or `full` output detail:

```json
{
  "text": "Hello, this is a test recording.",
  "segments": [
    { "text": "Hello, this is a test recording.", "start": 0.0, "end": 3.2, "confidence": 0.97 }
  ],
  "vtt": "WEBVTT\n\n00:00:00.000 --> 00:00:03.200\nHello, this is a test recording.",
  "srt": "1\n00:00:00,000 --> 00:00:03,200\nHello, this is a test recording.",
  "provider": "openai",
  "durationMs": 3200
}
```
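The `srt` and `vtt` fields are rendered from the same data as `segments`. As a sketch of that relationship, here is a minimal segments-to-SRT converter, assuming `start`/`end` are in seconds as in the example above (this is an illustration, not the step's actual implementation):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render transcript segments as an SRT subtitle string:
    numbered cues separated by blank lines."""
    blocks = [
        f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}"
        for i, seg in enumerate(segments, start=1)
    ]
    return "\n\n".join(blocks)

segments = [{"text": "Hello, this is a test recording.", "start": 0.0, "end": 3.2}]
print(segments_to_srt(segments))
# 1
# 00:00:00,000 --> 00:00:03,200
# Hello, this is a test recording.
```

Note SRT uses a comma as the millisecond separator where WebVTT uses a period.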

OpenAI (Whisper) — 25 MB file limit. No diarization or streaming. Best for simple transcription of shorter files. Supported languages

ElevenLabs (Scribe v1) — Supports diarization and streaming for large files. Good accuracy with speaker labeling. Supported languages

Google Cloud (Chirp 2) — Multiple models including a telephony-optimized model for call recordings. 10 MB synchronous limit, streams automatically for larger files. Supported languages

AWS (Transcribe) — Always uses streaming (no batch API). Supports PCM, WAV, FLAC, and OGG (Opus) — does not support MP3. Use an Audio Convert step first if your source is MP3. Supported languages
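Because Transcribe rejects MP3, it can help to verify the container before wiring a file into the step. A best-effort magic-byte sniff covering only the formats mentioned above (a sketch; real pipelines might rely on the file's declared content type instead):

```python
def sniff_audio_format(data: bytes) -> str:
    """Best-effort container detection by magic bytes for the formats
    AWS Transcribe streaming accepts, plus MP3 (which it rejects)."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"  # may carry an Opus stream
    if data[:3] == b"ID3" or data[:2] in (b"\xff\xfb", b"\xff\xf3", b"\xff\xf2"):
        return "mp3"  # ID3 tag or MPEG audio frame sync
    return "unknown"

header = b"RIFF" + bytes(4) + b"WAVE"
assert sniff_audio_format(header) == "wav"
```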