mistral_common.tokens.tokenizers.audio
AudioConfig(sampling_rate, frame_rate, encoding_config, chunk_length_s=None, transcription_format=TranscriptionFormat.INSTRUCT, transcription_delay_ms=None, streaming_look_ahead_ms=None, streaming_look_back_ms=None, streaming_n_left_pad_tokens=None, voice_num_audio_tokens=None)
dataclass
Configuration for audio processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| sampling_rate | int | Sampling rate of the audio. |
| frame_rate | float | Number of frames per second accepted by the tokenizer model. |
| encoding_config | AudioSpectrogramConfig | Configuration for audio spectrogram. |
| chunk_length_s | float \| None | If set, audio is padded to a multiple of chunk_length_s seconds (optional). |
| voice_num_audio_tokens | dict[str, int] \| None | Mapping from speaker voice name to number of audio tokens for that speaker's reference audio (optional, only for TTS). |
audio_length_per_tok
property
Calculate the length of audio per token.
chunk_frames
property
Calculate the number of frames per chunk.
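The two properties above can be read as plain arithmetic on the config fields. The sketch below is a stand-alone illustration, not the library's code: `SketchAudioConfig` is a hypothetical stand-in for `AudioConfig`, `hop_length` is flattened onto it instead of living in `encoding_config`, and both formulas (samples per chunk as `chunk_length_s * sampling_rate`, spectrogram frames per token as samples per token divided by `hop_length`) are assumptions about what the properties compute. The numeric values are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchAudioConfig:
    """Hypothetical stand-in for AudioConfig; not the library class."""

    sampling_rate: int  # samples per second
    frame_rate: float  # frames per second accepted by the tokenizer model
    hop_length: int  # spectrogram hop, in samples
    chunk_length_s: Optional[float] = None

    @property
    def chunk_frames(self) -> int:
        # Assumed: number of samples in one chunk.
        if self.chunk_length_s is None:
            raise ValueError("chunk_length_s is not set")
        return int(self.chunk_length_s * self.sampling_rate)

    @property
    def audio_length_per_tok(self) -> int:
        # Assumed: spectrogram frames covered by one token.
        samples_per_token = self.sampling_rate // self.frame_rate
        return int(samples_per_token / self.hop_length)


cfg = SketchAudioConfig(
    sampling_rate=16000, frame_rate=12.5, hop_length=160, chunk_length_s=30.0
)
print(cfg.chunk_frames)          # 480000
print(cfg.audio_length_per_tok)  # 8
```

Under these assumptions, a 30-second chunk at 16 kHz holds 480000 samples, and each token spans 8 spectrogram frames.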
AudioEncoder(audio_config, special_ids)
Encodes audio chunks into a format suitable for further processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| audio_config | | Configuration for audio processing. |
| encoding_config | | Configuration for audio spectrogram. |
| special_ids | | Special tokens for audio encoding. |
Source code in src/mistral_common/tokens/tokenizers/audio.py
audio_to_text_token
property
Get the audio_to_text token.
audio_token
property
Get the audio token.
begin_audio_token
property
Get the begin audio token.
streaming_pad
property
Get the streaming pad token.
text_to_audio_token
property
Get the text_to_audio token.
__call__(content)
Call the encoder on an audio chunk or URL chunk.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | AudioChunk \| AudioURLChunk | Audio or URL chunk to encode. | required |
Returns:
| Type | Description |
|---|---|
| AudioEncoding | Encoded audio data and tokens. |
encode_audio(audio, transcription_delay_ms=None)
Encode audio, optionally with a transcription delay.
encode_audio_for_speech_request(audio, voice)
Encode audio or voice preset into an AudioEncoding for speech synthesis.
Either audio (reference audio for voice cloning) or voice (preset name)
must be provided. When audio is given it takes precedence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio | Audio \| None | Reference audio waveform, or None to use a voice preset. | required |
| voice | str \| None | Preset voice name (e.g. 'Neutral Male', 'Neutral Female'), or None when using reference audio. | required |
Returns:
| Type | Description |
|---|---|
| AudioEncoding | AudioEncoding containing the token sequence and optional audio data. |
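The precedence rule described above (reference audio wins over a preset, and at least one must be given) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `choose_speech_source` is a hypothetical helper, the returned labels are invented for the example, and the token counts in the mapping are made-up values standing in for `voice_num_audio_tokens`.

```python
from typing import Dict, List, Optional, Tuple


def choose_speech_source(
    audio: Optional[List[float]],
    voice: Optional[str],
    voice_num_audio_tokens: Dict[str, int],
) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: decide which input drives the encoding."""
    if audio is not None:
        # Reference audio takes precedence even if a preset name is also given.
        return ("reference_audio", None)
    if voice is not None:
        if voice not in voice_num_audio_tokens:
            raise ValueError(f"Unknown voice preset: {voice!r}")
        # Assumed: a preset maps to a fixed number of audio tokens.
        return ("preset", voice_num_audio_tokens[voice])
    raise ValueError("Either audio or voice must be provided")


# Illustrative mapping; the counts are not real values.
presets = {"Neutral Male": 120, "Neutral Female": 115}

print(choose_speech_source(None, "Neutral Male", presets))
print(choose_speech_source([0.0, 0.1], "Neutral Male", presets))
```

The first call resolves to the preset, the second to the reference audio because `audio` is not None.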
encode_streaming_tokens(transcription_delay_ms=None)
Encode the streaming tokens given a transcription delay.
get_padding_audio(transcription_delay_ms=None)
Gets left and right padding for realtime audio models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| transcription_delay_ms | optional | Delay in milliseconds for transcription. | None |
Returns:
| Type | Description |
|---|---|
| tuple[Audio, Audio] | Tuple of left and right padding for realtime audio models. |
next_multiple_of_chunk_frames(audio_array_len, sampling_rate)
Calculate the next multiple of chunk frames.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio_array_len | int | Length of the audio array. | required |
| sampling_rate | int | Sampling rate of the audio. | required |
Returns:
| Type | Description |
|---|---|
| int | The next multiple of chunk frames. |
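Rounding a length up to the next multiple of the chunk size is a ceiling division followed by a multiplication. The sketch below shows that arithmetic in isolation. `next_multiple` is a hypothetical free function, not the method itself; it takes the chunk size directly and omits the `sampling_rate` argument of the real method.

```python
def next_multiple(audio_array_len: int, chunk_frames: int) -> int:
    """Round audio_array_len up to the nearest multiple of chunk_frames."""
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    # -(-a // b) is integer ceiling division, which avoids float rounding.
    return -(-audio_array_len // chunk_frames) * chunk_frames


print(next_multiple(480_001, 480_000))  # 960000
print(next_multiple(480_000, 480_000))  # 480000
print(next_multiple(1, 480_000))        # 480000
```

A length that is already an exact multiple is returned unchanged.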
pad(audio_array, sampling_rate, transcription_delay_ms=None, **kwargs)
Pad the audio array to the desired length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio_array | ndarray | Audio data as a numpy array. | required |
| sampling_rate | int | Sampling rate of the audio. | required |
| transcription_delay_ms | optional | Delay in milliseconds for transcription. | None |
Returns:
| Type | Description |
|---|---|
| ndarray | Padded audio array. |
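As a rough picture of chunk padding, the sketch below right-pads a signal with zeros until its length is a multiple of the chunk size. It is an assumption-laden illustration, not the method's implementation: `pad_to_chunk_multiple` is a hypothetical name, zero-filling on the right is assumed, the transcription delay is ignored, and a plain list stands in for the ndarray so the example runs without NumPy.

```python
from typing import List


def pad_to_chunk_multiple(samples: List[float], chunk_frames: int) -> List[float]:
    """Right-pad samples with zeros up to a multiple of chunk_frames (assumed behaviour)."""
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    target_len = -(-len(samples) // chunk_frames) * chunk_frames  # ceiling division
    return samples + [0.0] * (target_len - len(samples))


padded = pad_to_chunk_multiple([0.1, -0.2, 0.3, 0.4, 0.5], chunk_frames=4)
print(len(padded))  # 8
print(padded[5:])   # [0.0, 0.0, 0.0]
```

With a NumPy array, the same idea can be written as `np.pad(audio_array, (0, target_len - len(audio_array)))`.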
AudioEncoding(tokens, audio)
dataclass
AudioSpectrogramConfig(num_mel_bins, hop_length, window_size)
dataclass
Configuration for generating an audio spectrogram.
Attributes:
| Name | Type | Description |
|---|---|---|
| num_mel_bins | int | Number of mel bins, typically 80 or 128. |
| hop_length | int | Length of the overlapping windows for the STFT used to obtain the Mel Frequency coefficients, typically 160. |
| window_size | int | Window size of the Fourier transform, typically 400. |
SpecialAudioIDs(audio, begin_audio, streaming_pad, text_to_audio, audio_to_text)
dataclass
Special text tokens corresponding to audio token sequence.
Attributes:
| Name | Type | Description |
|---|---|---|
| audio | int \| None | Token representing audio. |
| begin_audio | int \| None | Token representing the beginning of audio. |
| streaming_pad | int \| None | Token representing streaming pad of audio. Only relevant for streaming models. |
| text_to_audio | int \| None | Token representing intent to convert text to audio. |
| audio_to_text | int \| None | Token representing intent to convert audio to text. |
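To make the role of these IDs concrete, the sketch below lays out one possible placeholder sequence: a single begin-audio marker followed by one audio token per group of frames. This layout is an assumption for illustration, not confirmed library behaviour. `placeholder_tokens` is a hypothetical helper, and the IDs 24 and 25 are made-up values.

```python
import math
from typing import List


def placeholder_tokens(
    num_frames: int, frames_per_token: int, begin_audio_id: int, audio_id: int
) -> List[int]:
    """Hypothetical layout: one begin-audio marker, then one audio token per frame group."""
    if frames_per_token <= 0:
        raise ValueError("frames_per_token must be positive")
    num_audio_tokens = math.ceil(num_frames / frames_per_token)
    return [begin_audio_id] + [audio_id] * num_audio_tokens


# Made-up IDs; real values come from the tokenizer's vocabulary.
tokens = placeholder_tokens(num_frames=20, frames_per_token=8, begin_audio_id=25, audio_id=24)
print(tokens)  # [25, 24, 24, 24]
```

Here 20 frames at 8 frames per token round up to 3 audio tokens after the begin marker.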