
mistral_common.tokens.tokenizers.audio

AudioConfig(sampling_rate, frame_rate, encoding_config, chunk_length_s=None) dataclass

Configuration for audio processing.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `sampling_rate` | `int` | Sampling rate of the audio. |
| `frame_rate` | `float` | Number of frames per second accepted by the tokenizer model. |
| `encoding_config` | `AudioSpectrogramConfig` | Configuration for the audio spectrogram. |
| `chunk_length_s` | `Optional[float]` | If set, pad the audio to a multiple of `chunk_length_s` seconds. |

audio_length_per_tok property

Calculate the length of audio per token.

chunk_frames property

Calculate the number of frames per chunk.
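The two properties above reduce to simple arithmetic on the config values. A minimal sketch, assuming `chunk_frames` is the number of samples in one chunk and `audio_length_per_tok` is the number of samples consumed per token (illustrative stand-ins, not the library's actual implementation):

```python
def chunk_frames(sampling_rate: int, chunk_length_s: float) -> int:
    # Number of audio samples in one chunk of chunk_length_s seconds.
    return int(chunk_length_s * sampling_rate)

def audio_length_per_tok(sampling_rate: int, frame_rate: float) -> float:
    # Audio samples consumed per tokenizer frame (token).
    return sampling_rate / frame_rate

# Example: 16 kHz audio, 12.5 tokens per second, 30 s chunks.
print(chunk_frames(16_000, 30.0))          # 480000
print(audio_length_per_tok(16_000, 12.5))  # 1280.0
```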

AudioEncoder(audio_config, special_ids)

Encodes audio chunks into a format suitable for further processing.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `audio_config` | `AudioConfig` | Configuration for audio processing. |
| `encoding_config` | `AudioSpectrogramConfig` | Configuration for the audio spectrogram. |
| `special_ids` | `SpecialAudioIDs` | Special tokens for audio encoding. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def __init__(self, audio_config: AudioConfig, special_ids: SpecialAudioIDs) -> None:
    self.audio_config = audio_config
    self.encoding_config = audio_config.encoding_config
    self.special_ids = special_ids

audio_token property

Get the audio token.

begin_audio_token property

Get the begin audio token.

__call__(content)

Call the encoder on an audio chunk or URL chunk.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `content` | `Union[AudioChunk, AudioURLChunk]` | Audio or URL chunk to encode. | required |

Returns:

| Type | Description |
| --- | --- |
| `AudioEncoding` | Encoded audio data and tokens. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def __call__(self, content: Union[AudioChunk, AudioURLChunk]) -> AudioEncoding:
    r"""Call the encoder on an audio chunk or URL chunk.

    Args:
        content: Audio or URL chunk to encode.

    Returns:
        Encoded audio data and tokens.
    """
    if isinstance(content, AudioURLChunk):
        return self._encode_audio_url_chunk(content)
    elif isinstance(content, AudioChunk):
        return self._encode_audio_chunk(content)
    else:
        raise ValueError(f"Unsupported content type: {type(content)}")
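
The `isinstance` dispatch above sends URL chunks and raw audio chunks down different paths and rejects anything else. A self-contained sketch of the same shape, with stand-in types rather than the library's classes:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class AudioChunk:     # stand-in for the library's AudioChunk
    samples: list

@dataclass
class AudioURLChunk:  # stand-in for the library's AudioURLChunk
    url: str

def encode(content: Union[AudioChunk, AudioURLChunk]) -> str:
    # Mirror the encoder's dispatch: each supported type takes its own
    # path; unsupported types raise immediately.
    if isinstance(content, AudioURLChunk):
        return f"encode-from-url:{content.url}"
    elif isinstance(content, AudioChunk):
        return f"encode-from-samples:{len(content.samples)}"
    raise ValueError(f"Unsupported content type: {type(content)}")

print(encode(AudioChunk([0.0, 0.1])))  # encode-from-samples:2
```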

next_multiple_of_chunk_frames(audio_array_len, sampling_rate)

Calculate the next multiple of chunk frames.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `audio_array_len` | `int` | Length of the audio array. | required |
| `sampling_rate` | `int` | Sampling rate of the audio. | required |

Returns:

| Type | Description |
| --- | --- |
| `int` | The next multiple of chunk frames. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def next_multiple_of_chunk_frames(self, audio_array_len: int, sampling_rate: int) -> int:
    r"""Calculate the next multiple of chunk frames.

    Args:
        audio_array_len: Length of the audio array.
        sampling_rate: Sampling rate of the audio.

    Returns:
        The next multiple of chunk frames.
    """
    assert sampling_rate == self.audio_config.sampling_rate, (
        f"Expected {sampling_rate=} to be {self.audio_config.sampling_rate=}"
    )
    assert self.audio_config.chunk_length_s is not None, (
        f"Can't call next_multiple_of_chunk_frames if {self.audio_config.chunk_length_s=}."
    )

    return math.ceil(audio_array_len / self.audio_config.chunk_frames) * self.audio_config.chunk_frames
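
The rounding step is the standard ceiling-division idiom for rounding a length up to the nearest multiple; a minimal stand-alone sketch:

```python
import math

def next_multiple(length: int, chunk: int) -> int:
    # Round length up to the nearest multiple of chunk, mirroring
    # math.ceil(audio_array_len / chunk_frames) * chunk_frames above.
    return math.ceil(length / chunk) * chunk

print(next_multiple(480_001, 480_000))  # 960000
print(next_multiple(480_000, 480_000))  # 480000 (already a multiple)
```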

pad(audio_array, sampling_rate)

Pad the audio array to the desired length.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `audio_array` | `ndarray` | Audio data as a numpy array. | required |
| `sampling_rate` | `int` | Sampling rate of the audio. | required |

Returns:

| Type | Description |
| --- | --- |
| `ndarray` | Padded audio array. |

Source code in src/mistral_common/tokens/tokenizers/audio.py
def pad(self, audio_array: np.ndarray, sampling_rate: int) -> np.ndarray:
    r"""Pad the audio array to the desired length.

    Args:
        audio_array: Audio data as a numpy array.
        sampling_rate: Sampling rate of the audio.

    Returns:
        Padded audio array.
    """
    if self.audio_config.chunk_length_s:
        next_multiple_of_chunk_frames = self.next_multiple_of_chunk_frames(audio_array.shape[-1], sampling_rate)
        audio_array = np.pad(audio_array, (0, next_multiple_of_chunk_frames - audio_array.shape[-1]))
    elif audio_array.shape[-1] < self.encoding_config.window_size:
        # minimum length for audios is at least one spectrogram frame
        audio_array = np.pad(audio_array, (0, self.encoding_config.window_size - audio_array.shape[-1]))

    return audio_array
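
Both branches of `pad` rely on `np.pad` right-padding the last axis with zeros up to a target length. A minimal sketch of that padding step in isolation (the helper name is illustrative):

```python
import numpy as np

def pad_to_length(audio: np.ndarray, target_len: int) -> np.ndarray:
    # Right-pad the last axis with zeros up to target_len, as np.pad
    # is used in the method above.
    return np.pad(audio, (0, target_len - audio.shape[-1]))

arr = np.ones(5)
padded = pad_to_length(arr, 8)
print(padded)  # [1. 1. 1. 1. 1. 0. 0. 0.]
```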

AudioEncoding(tokens, audio) dataclass

Encapsulates the tokens and audio data for an audio chunk.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `tokens` | `List[int]` | Text tokens corresponding to this audio chunk. |
| `audio` | `Audio` | Original audio waveform data. |

AudioSpectrogramConfig(num_mel_bins, hop_length, window_size) dataclass

Configuration for generating an audio spectrogram.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `num_mel_bins` | `int` | Number of mel bins, typically 80 or 128. |
| `hop_length` | `int` | Hop length, in samples, between successive STFT windows used to compute the mel-frequency coefficients, typically 160. |
| `window_size` | `int` | Window size of the Fourier transform, typically 400. |
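
Given these parameters, the number of spectrogram frames produced for an audio of a given length follows from the hop length. A sketch under the common STFT framing convention, without padding (the library's exact framing may differ):

```python
def num_spectrogram_frames(num_samples: int, window_size: int, hop_length: int) -> int:
    # Standard STFT framing without padding: one frame per full window,
    # advancing hop_length samples each step.
    if num_samples < window_size:
        return 0
    return 1 + (num_samples - window_size) // hop_length

# Typical values from the config above: window_size=400, hop_length=160.
print(num_spectrogram_frames(16_000, 400, 160))  # 1 s at 16 kHz -> 98 frames
```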

SpecialAudioIDs(audio, begin_audio) dataclass

Special text tokens corresponding to audio token sequence.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `audio` | `int` | Token representing audio. |
| `begin_audio` | `int` | Token representing the beginning of audio. |