The format of the encoded audio data.
Specifies the codec and container format of the voice input. Two values are currently supported: "wav" for uncompressed, high-quality audio and "mp3" for compressed audio with smaller file sizes. Downstream stages rely on this value to decode and process the audio correctly.
Which format to choose depends on quality requirements, file size constraints, and compatibility with the speech recognition service in use.
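As a minimal sketch of how a consumer might act on this value, the snippet below validates an untrusted string against the two documented formats and maps it to a commonly used MIME type. The names `AudioFormat`, `isAudioFormat`, and `mimeTypeOf` are hypothetical and are not part of the documented API.

```typescript
// Hypothetical union mirroring the two format values documented above.
type AudioFormat = "wav" | "mp3";

// Commonly used MIME types for each format (an assumption of this sketch,
// not something the documented type prescribes).
const AUDIO_MIME_TYPES: Record<AudioFormat, string> = {
  wav: "audio/wav",
  mp3: "audio/mpeg",
};

// Type guard: narrows an arbitrary string to a supported format.
function isAudioFormat(value: string): value is AudioFormat {
  return value === "wav" || value === "mp3";
}

// Look up the MIME type for a supported format.
function mimeTypeOf(format: AudioFormat): string {
  return AUDIO_MIME_TYPES[format];
}
```

Keeping the format as a closed union means that adding a third format later forces every `Record<AudioFormat, ...>` table to be updated before the code compiles.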
Unique identifier for this user message content item.
A UUID assigned by the message intake pipeline so downstream phases can reference one precise content item without relying on array position, filename, file metadata, or modality-specific payload shape.
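The idea of referencing an item by identifier instead of array position can be sketched as follows. The `ContentItem` shape and the `indexById` helper are hypothetical, introduced only for illustration; the actual pipeline may resolve identifiers differently.

```typescript
// Hypothetical minimal shape: only the identifier and the discriminator.
interface ContentItem {
  id: string;
  type: string;
}

// Build an id -> item lookup so later phases do not depend on array order.
function indexById<T extends ContentItem>(items: readonly T[]): Map<string, T> {
  const index = new Map<string, T>();
  for (const item of items) {
    if (index.has(item.id)) {
      throw new Error(`Duplicate content id: ${item.id}`);
    }
    index.set(item.id, item);
  }
  return index;
}
```

With such an index, reordering or filtering the content array does not change which item a given identifier resolves to.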
Semantic metadata attached after audio intake.
Byte length, duration, and speaker count may be adapter-owned facts. Codec, transcription state, source id, and governance state remain orchestrator-owned.
Type discriminator for identifying the specific content modality.
Discriminates between content types such as "text", "audio", "image", and "file". Checking this field narrows a content item to one concrete type, so each modality is processed according to its own characteristics and requirements.
The multimodal content processing pipeline uses this field to route each item to the appropriate handler while maintaining type safety throughout the conversation flow.
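A minimal sketch of this kind of narrowing is shown below. The four interfaces and their fields other than `type` are assumptions made for illustration and do not reproduce the documented type definitions.

```typescript
// Hypothetical content shapes; only the `type` values come from the text above.
interface TextContent {
  type: "text";
  text: string;
}
interface AudioContent {
  type: "audio";
  format: "wav" | "mp3";
  url: string;
}
interface ImageContent {
  type: "image";
  url: string;
}
interface FileContent {
  type: "file";
  url: string;
}

type UserContent = TextContent | AudioContent | ImageContent | FileContent;

// Route each item by its discriminator; inside every case the compiler
// knows the concrete type, so modality-specific fields are safe to read.
function describeContent(content: UserContent): string {
  switch (content.type) {
    case "text":
      return `text (${content.text.length} chars)`;
    case "audio":
      return `audio (${content.format})`;
    case "image":
      return "image";
    case "file":
      return "file";
    default: {
      // Exhaustiveness check: fails to compile if a new type is unhandled.
      const unreachable: never = content;
      return unreachable;
    }
  }
}
```

The `never` assignment in the default branch is a common design choice: adding a fifth content type to the union becomes a compile error here until a handler is written for it.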
URL or data URL from which the audio bytes can be read.
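Because the value may be either a regular URL or a data URL, a reader usually has to tell the two apart before fetching or decoding. The sketch below shows one way to do that; `AudioSource` and `classifyAudioSource` are hypothetical names, and the parsing covers only the common `data:<mediatype>;base64,<payload>` form, not every variant allowed by the data URL syntax.

```typescript
// Hypothetical result type for this sketch.
type AudioSource =
  | { kind: "data"; mimeType: string | null; isBase64: boolean; payload: string }
  | { kind: "remote"; url: string };

// Classify a source string as an inline data URL or a remote URL.
function classifyAudioSource(url: string): AudioSource {
  // URL schemes are case-insensitive, so match "data:" without regard to case.
  if (!/^data:/i.test(url)) {
    return { kind: "remote", url };
  }
  const comma = url.indexOf(",");
  if (comma === -1) {
    throw new Error("Malformed data URL: missing comma separator");
  }
  // Header is everything between "data:" and the first comma.
  const header = url.slice("data:".length, comma);
  const first = header.split(";")[0] ?? "";
  return {
    kind: "data",
    // Treat the first header segment as a media type only if it looks like one.
    mimeType: first.includes("/") ? first : null,
    isBase64: header.endsWith(";base64"),
    payload: url.slice(comma + 1),
  };
}
```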
Content type representing audio input from users in the conversation.
Lets users communicate requirements, ask questions, and provide specifications through spoken input. Voice input makes the vibe coding experience more natural and efficient, especially when describing workflows, business processes, or detailed specifications.
The audio content is converted by speech-to-text processing, so the AI assistant can understand and respond to voice-based requirements in the same way as text input. This multimodal approach makes the development conversation more accessible and user-friendly.
Author
Samchon