mistral_common.tokens.tokenizers.audio

AudioConfig(sampling_rate, frame_rate, encoding_config, chunk_length_s=None, transcription_format=TranscriptionFormat.INSTRUCT, transcription_delay_ms=None, streaming_look_ahead_ms=None, streaming_look_back_ms=None, streaming_n_left_pad_tokens=None, voice_num_audio_tokens=None) dataclass

Configuration for audio processing.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `sampling_rate` | `int` | Sampling rate of the audio. |
| `frame_rate` | `float` | Number of frames per second accepted by the tokenizer model. |
| `encoding_config` | `AudioSpectrogramConfig` | Configuration for the audio spectrogram. |
| `chunk_length_s` | `float \| None` | If set, pad audio to a multiple of `chunk_length_s` seconds (optional). |
| `voice_num_audio_tokens` | `dict[str, int] \| None` | Mapping from speaker voice name to the number of audio tokens for that speaker's reference audio (optional, only for TTS). |

audio_length_per_tok property

Calculate the length of audio per token.

chunk_frames property

Calculate the number of frames per chunk.
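
As a construction sketch, a 16 kHz spectrogram front end might be configured as follows. The concrete numbers (frame rate, mel bins, chunk length) are illustrative assumptions, not library defaults:

    from mistral_common.tokens.tokenizers.audio import (
        AudioConfig,
        AudioSpectrogramConfig,
    )

    # Illustrative values only: 16 kHz audio, 10 ms hop, 25 ms window.
    spectrogram_config = AudioSpectrogramConfig(
        num_mel_bins=128,
        hop_length=160,
        window_size=400,
    )
    audio_config = AudioConfig(
        sampling_rate=16000,
        frame_rate=12.5,  # frames per second accepted by the model (assumed)
        encoding_config=spectrogram_config,
        chunk_length_s=30.0,  # pad audio to multiples of 30 s (assumed)
    )
    print(audio_config.chunk_frames, audio_config.audio_length_per_tok)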

AudioEncoder(audio_config, special_ids)

Encodes audio chunks into a format suitable for further processing.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `audio_config` | `AudioConfig` | Configuration for audio processing. |
| `encoding_config` | `AudioSpectrogramConfig` | Configuration for the audio spectrogram. |
| `special_ids` | `SpecialAudioIDs` | Special tokens for audio encoding. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def __init__(self, audio_config: AudioConfig, special_ids: SpecialAudioIDs) -> None:
    self.audio_config = audio_config
    self.encoding_config = audio_config.encoding_config
    self.special_ids = special_ids
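
A minimal instantiation sketch. The token IDs passed to SpecialAudioIDs are placeholders; real values come from the tokenizer's vocabulary:

    from mistral_common.tokens.tokenizers.audio import (
        AudioConfig,
        AudioEncoder,
        AudioSpectrogramConfig,
        SpecialAudioIDs,
    )

    audio_config = AudioConfig(
        sampling_rate=16000,
        frame_rate=12.5,  # assumed value
        encoding_config=AudioSpectrogramConfig(
            num_mel_bins=128, hop_length=160, window_size=400
        ),
    )
    special_ids = SpecialAudioIDs(
        audio=24,  # placeholder ID; look it up in the tokenizer vocabulary
        begin_audio=25,  # placeholder ID
        streaming_pad=None,
        text_to_audio=None,
        audio_to_text=None,
    )
    encoder = AudioEncoder(audio_config, special_ids)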

audio_to_text_token property

Get the audio_to_text token.

audio_token property

Get the audio token.

begin_audio_token property

Get the begin audio token.

streaming_pad property

Get the streaming pad token.

text_to_audio_token property

Get the text_to_audio token.

__call__(content)

Call the encoder on an audio chunk or URL chunk.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `content` | `AudioChunk \| AudioURLChunk` | Audio or URL chunk to encode. | required |

Returns:

| Type | Description |
| --- | --- |
| `AudioEncoding` | Encoded audio data and tokens. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def __call__(self, content: AudioChunk | AudioURLChunk) -> AudioEncoding:
    r"""Call the encoder on an audio chunk or URL chunk.

    Args:
        content: Audio or URL chunk to encode.

    Returns:
        Encoded audio data and tokens.
    """
    if isinstance(content, AudioURLChunk):
        return self._encode_audio_url_chunk(content)
    elif isinstance(content, AudioChunk):
        return self._encode_audio_chunk(content)
    else:
        raise ValueError(f"Unsupported content type: {type(content)}")
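
A dispatch sketch. The chunk type is assumed to come from mistral_common.protocol.instruct.messages, and the field name shown is an assumption, not confirmed by this page:

    from mistral_common.protocol.instruct.messages import AudioURLChunk  # assumed path

    # `encoder` as constructed in the AudioEncoder sketch above.
    chunk = AudioURLChunk(audio_url="https://example.com/sample.wav")  # assumed field
    encoding = encoder(chunk)
    print(len(encoding.tokens))  # text tokens standing in for the audio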

encode_audio(audio, transcription_delay_ms=None)

Encode audio, optionally with a transcription delay.

Source code in src/mistral_common/tokens/tokenizers/audio.py
def encode_audio(self, audio: Audio, transcription_delay_ms: float | None = None) -> AudioEncoding:
    r"""Encode an audio optionally with transcription delay."""
    audio.resample(self.audio_config.sampling_rate)
    audio.audio_array = self.pad(audio.audio_array, self.audio_config.sampling_rate, transcription_delay_ms)

    if self.audio_config.transcription_format == TranscriptionFormat.STREAMING:
        tokens = self.encode_streaming_tokens(transcription_delay_ms)
    else:
        tokens = self._encode_audio_tokens(audio.audio_array.shape[0])

    return AudioEncoding(
        tokens=tokens,
        audio=audio,
    )
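
A usage sketch with a one-second sine wave. The Audio import path is an assumption; the constructor mirrors its use in get_padding_audio below:

    import numpy as np

    from mistral_common.audio import Audio  # assumed import path

    t = np.arange(16000, dtype=np.float32) / 16000
    audio = Audio(
        audio_array=np.sin(2 * np.pi * 440 * t).astype(np.float32),  # 440 Hz tone
        sampling_rate=16000,
        format="wav",
    )
    encoding = encoder.encode_audio(audio)  # `encoder` from the sketch above
    print(len(encoding.tokens))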

encode_audio_for_speech_request(audio, voice)

Encode audio or voice preset into an AudioEncoding for speech synthesis.

Either audio (reference audio for voice cloning) or voice (preset name) must be provided. When audio is given it takes precedence.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `audio` | `Audio \| None` | Reference audio waveform, or `None` to use a voice preset. | required |
| `voice` | `str \| None` | Preset voice name (e.g. `'Neutral Male'`, `'Neutral Female'`), or `None` when using reference audio. | required |

Returns:

| Type | Description |
| --- | --- |
| `AudioEncoding` | `AudioEncoding` containing the token sequence and optional audio data. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def encode_audio_for_speech_request(self, audio: Audio | None, voice: str | None) -> AudioEncoding:
    r"""Encode audio or voice preset into an AudioEncoding for speech synthesis.

    Either ``audio`` (reference audio for voice cloning) or ``voice`` (preset name)
    must be provided. When ``audio`` is given it takes precedence.

    Args:
        audio: Reference audio waveform, or None to use a voice preset.
        voice: Preset voice name (e.g. 'Neutral Male', 'Neutral Female'), or None when using ref audio.

    Returns:
        AudioEncoding containing the token sequence and optional audio data.
    """
    assert audio is not None or voice is not None, (
        f"Either audio or voice must be defined to encode audio, got {audio=} and {voice=}"
    )

    if audio is not None:
        audio.resample(self.audio_config.sampling_rate)
        num_audio_tokens = self._get_num_audio_token_for_speech_request(len(audio.audio_array))
    else:
        assert self.audio_config.voice_num_audio_tokens is not None, (
            "voice_num_audio_tokens must be set in audio config to use voice-based speech requests"
        )
        assert voice is not None and voice in self.audio_config.voice_num_audio_tokens, (
            f"Unknown voice {voice!r}, expected one of {list(self.audio_config.voice_num_audio_tokens)}"
        )
        num_audio_tokens = self.audio_config.voice_num_audio_tokens[voice]
    tokens = self._encode_audio_tokens_for_speech_request(num_audio_tokens)

    return AudioEncoding(
        tokens=tokens,
        audio=audio,
    )
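
Two call sketches, continuing with the encoder from above (its config must set voice_num_audio_tokens for the preset-voice path). The voice name is one of the examples from the docstring, and the reference audio is a stand-in:

    import numpy as np

    from mistral_common.audio import Audio  # assumed import path

    # Preset voice: requires `voice_num_audio_tokens` in the audio config.
    enc_voice = encoder.encode_audio_for_speech_request(audio=None, voice="Neutral Male")

    # Voice cloning: reference audio takes precedence over `voice`.
    reference_audio = Audio(
        audio_array=np.zeros(16000, dtype=np.float32),  # 1 s of silence as a stand-in
        sampling_rate=16000,
        format="wav",
    )
    enc_clone = encoder.encode_audio_for_speech_request(audio=reference_audio, voice=None)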

encode_streaming_tokens(transcription_delay_ms=None)

Encode the streaming tokens given a transcription delay.

Source code in src/mistral_common/tokens/tokenizers/audio.py
def encode_streaming_tokens(self, transcription_delay_ms: float | None = None) -> list[int]:
    r"""Encode the streaming tokens given a transcription delay."""
    assert isinstance(self.audio_config.encoding_config, AudioSpectrogramConfig), (
        f"Audio encoder must be spectrogram encoder, got {self.audio_config.encoding_config=}"
    )
    assert self.audio_config.transcription_delay_ms is not None

    # streaming pad tokens consist of silence we pad on left + delay tokens
    stream_pad_prefix_len = self.audio_config.n_left_pad_tokens + self.audio_config.get_num_delay_tokens(
        transcription_delay_ms
    )
    tokens = [self.streaming_pad] * stream_pad_prefix_len

    return tokens
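
As a worked reading of the code above: with a hypothetical n_left_pad_tokens of 4 and a transcription delay worth 6 tokens, the method returns 4 + 6 = 10 copies of the streaming pad token.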

get_padding_audio(transcription_delay_ms=None)

Get the left and right padding audio for realtime audio models.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `transcription_delay_ms` | `float \| None` | Delay in milliseconds for transcription (optional). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `tuple[Audio, Audio]` | Tuple of left and right padding audio for realtime audio models. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def get_padding_audio(self, transcription_delay_ms: float | None = None) -> tuple[Audio, Audio]:
    r"""Gets left and right padding for realtime audio models.

    Args:
        transcription_delay_ms (optional): Delay in milliseconds for transcription.

    Returns:
        Tuple of left and right padding for realtime audio models.
    """

    left_pad, right_pad = self._get_streaming_pad(0, transcription_delay_ms)
    left_pad_audio = Audio(
        audio_array=np.zeros(left_pad, dtype=np.float32),
        sampling_rate=self.audio_config.sampling_rate,
        format="wav",
    )
    right_pad_audio = Audio(
        audio_array=np.zeros(right_pad, dtype=np.float32),
        sampling_rate=self.audio_config.sampling_rate,
        format="wav",
    )
    return left_pad_audio, right_pad_audio
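
A call sketch, presuming a streaming-configured encoder (the streaming fields of AudioConfig must be set); the delay value is illustrative:

    left, right = encoder.get_padding_audio(transcription_delay_ms=480.0)
    # Both pads are zero-filled silence at the configured sampling rate.
    print(left.audio_array.shape, right.audio_array.shape)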

next_multiple_of_chunk_frames(audio_array_len, sampling_rate)

Calculate the next multiple of chunk frames.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `audio_array_len` | `int` | Length of the audio array. | required |
| `sampling_rate` | `int` | Sampling rate of the audio. | required |

Returns:

| Type | Description |
| --- | --- |
| `int` | The next multiple of chunk frames. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def next_multiple_of_chunk_frames(self, audio_array_len: int, sampling_rate: int) -> int:
    r"""Calculate the next multiple of chunk frames.

    Args:
        audio_array_len: Length of the audio array.
        sampling_rate: Sampling rate of the audio.

    Returns:
        The next multiple of chunk frames.
    """
    assert sampling_rate == self.audio_config.sampling_rate, (
        f"Expected {sampling_rate=} to be {self.audio_config.sampling_rate=}"
    )
    assert self.audio_config.chunk_length_s is not None, (
        f"Can't call next_multiple_of_chunk_frames if {self.audio_config.chunk_length_s=}."
    )

    return math.ceil(audio_array_len / self.audio_config.chunk_frames) * self.audio_config.chunk_frames
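
As a worked example of the arithmetic above, assuming a config with chunk_length_s set such that chunk_frames is 480000 samples (30 s at 16 kHz):

    # ceil(500_000 / 480_000) * 480_000 == 2 * 480_000 == 960_000
    encoder.next_multiple_of_chunk_frames(500_000, sampling_rate=16000)  # -> 960_000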

pad(audio_array, sampling_rate, transcription_delay_ms=None, **kwargs)

Pad the audio array to the desired length.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `audio_array` | `ndarray` | Audio data as a numpy array. | required |
| `sampling_rate` | `int` | Sampling rate of the audio. | required |
| `transcription_delay_ms` | `float \| None` | Delay in milliseconds for transcription (optional). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Padded audio array. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def pad(
    self,
    audio_array: np.ndarray,
    sampling_rate: int,
    transcription_delay_ms: float | None = None,
    **kwargs: Any,
) -> np.ndarray:
    r"""Pad the audio array to the desired length.

    Args:
        audio_array: Audio data as a numpy array.
        sampling_rate: Sampling rate of the audio.
        transcription_delay_ms (optional): Delay in milliseconds for transcription.

    Returns:
        Padded audio array.
    """
    # TODO(Patrick) - remove **kwargs as it's just there to swallow deprecated
    # keyword args from voxtral_realtime in vLLM. It was
    # relevant for the release. Remove in mistral_common version 1.11
    if self.audio_config.chunk_length_s:
        next_multiple_of_chunk_frames = self.next_multiple_of_chunk_frames(audio_array.shape[-1], sampling_rate)
        audio_array = np.pad(audio_array, (0, next_multiple_of_chunk_frames - audio_array.shape[-1]))
    elif self.audio_config.is_streaming:
        left_pad, right_pad = self._get_streaming_pad(audio_array.shape[-1], transcription_delay_ms)
        # we pad both left & right as this leads to better performance
        audio_array = np.pad(audio_array, (left_pad, right_pad))
    elif (
        isinstance(self.encoding_config, AudioSpectrogramConfig)
        and audio_array.shape[-1] < self.encoding_config.window_size
    ):
        # minimum length for audios is at least one spectrogram frame
        audio_array = np.pad(audio_array, (0, self.encoding_config.window_size - audio_array.shape[-1]))

    return audio_array
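
A sketch of the minimum-length branch: with no chunking or streaming configured and a window_size of 400 (the typical value noted below), a short input is right-padded to one full spectrogram frame:

    import numpy as np

    short = np.zeros(100, dtype=np.float32)
    padded = encoder.pad(short, sampling_rate=16000)  # `encoder` from the sketch above
    assert padded.shape[-1] == 400  # padded up to `window_size`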

AudioEncoding(tokens, audio) dataclass

Encapsulates the tokens and audio data for an audio chunk.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `tokens` | `list[int]` | Text tokens corresponding to this audio chunk. |
| `audio` | `Audio \| None` | Original audio waveform data, or `None` when using a preset voice (no reference audio to forward to the model). |

AudioSpectrogramConfig(num_mel_bins, hop_length, window_size) dataclass

Configuration for generating an audio spectrogram.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `num_mel_bins` | `int` | Number of mel bins, typically 80 or 128. |
| `hop_length` | `int` | Number of samples between successive STFT windows used to obtain the mel-frequency coefficients, typically 160. |
| `window_size` | `int` | Window size of the Fourier transform, typically 400. |
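
With these typical values at 16 kHz, each frame advances hop_length / sampling_rate = 160 / 16000 = 10 ms (100 frames per second), and each analysis window spans 400 / 16000 = 25 ms:

    from mistral_common.tokens.tokenizers.audio import AudioSpectrogramConfig

    config = AudioSpectrogramConfig(num_mel_bins=128, hop_length=160, window_size=400)
    frames_per_second = 16000 / config.hop_length  # 100.0
    window_ms = 1000 * config.window_size / 16000  # 25.0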

SpecialAudioIDs(audio, begin_audio, streaming_pad, text_to_audio, audio_to_text) dataclass

Special text tokens corresponding to the audio token sequence.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `audio` | `int \| None` | Token representing audio. |
| `begin_audio` | `int \| None` | Token representing the beginning of audio. |
| `streaming_pad` | `int \| None` | Token representing the streaming pad of audio. Only relevant for streaming models. |
| `text_to_audio` | `int \| None` | Token representing the intent to convert text to audio. |
| `audio_to_text` | `int \| None` | Token representing the intent to convert audio to text. |

TranscriptionFormat

Bases: str, Enum

Transcription format.

Should be set by the tokenizer for correct encoding.

Attributes:

- INSTRUCT: The instruct format.
- STREAMING: The streaming format.