mistral_common.tokens.tokenizers.mistral
MistralTokenizer(instruct_tokenizer, validator, request_normalizer)
Bases: Generic[UserMessageType, AssistantMessageType, ToolMessageType, SystemMessageType, TokenizedType]
Mistral tokenizer.
This class is a wrapper around an InstructTokenizer, a MistralRequestValidator and an InstructRequestNormalizer.
It provides a convenient interface to tokenize, validate and normalize Mistral requests.
Attributes:
Name | Type | Description |
---|---|---|
instruct_tokenizer | InstructTokenizer[InstructRequest, FIMRequest, TokenizedType, AssistantMessageType] | The instruct tokenizer to use. See InstructTokenizer. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruct_tokenizer | InstructTokenizer[InstructRequest, FIMRequest, TokenizedType, AssistantMessageType] | The instruct tokenizer to use. | required |
validator | MistralRequestValidator[UserMessageType, AssistantMessageType, ToolMessageType, SystemMessageType] | The request validator to use. | required |
request_normalizer | InstructRequestNormalizer[UserMessageType, AssistantMessageType, ToolMessageType, SystemMessageType, InstructRequestType] | The request normalizer to use. | required |
Source code in src/mistral_common/tokens/tokenizers/mistral.py
__reduce__()
Provides a recipe for pickling (serializing) this object, which is necessary for use with multiprocessing.
Returns:
Type | Description |
---|---|
Tuple[Callable, Tuple[Any, ...]] | A tuple of the factory function and the arguments to reconstruct the object from its source file. |
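The pickling recipe described above can be illustrated with a small stand-in class. This is a minimal sketch of the general `__reduce__` pattern using only the standard library; `FileBackedTokenizer` and `load_tokenizer` are hypothetical names, not part of mistral_common, and the real method may differ in its details.

```python
import pickle
from typing import Any, Callable, Dict, Tuple


class FileBackedTokenizer:
    """Hypothetical stand-in for an object that is rebuilt from a source file."""

    def __init__(self, path: str, vocab: Dict[str, int]) -> None:
        self.path = path
        self.vocab = vocab  # imagine state that is large or awkward to serialize

    def __reduce__(self) -> Tuple[Callable[..., Any], Tuple[Any, ...]]:
        # Pickle stores only the factory and its arguments, not the loaded state.
        return (load_tokenizer, (self.path,))


def load_tokenizer(path: str) -> FileBackedTokenizer:
    # A real loader would read `path`; this sketch fakes a tiny vocabulary.
    return FileBackedTokenizer(path, {"hello": 0, "world": 1})


original = load_tokenizer("tokenizer.model")
# Unpickling calls load_tokenizer("tokenizer.model") to rebuild the object.
restored = pickle.loads(pickle.dumps(original))
```

Because only the factory and the path cross the process boundary, a worker started by `multiprocessing` reloads the object from its source file instead of receiving a serialized copy of its state.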
decode(tokens, special_token_policy=None)
Decodes a list of tokens into a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokens | List[int] | The tokens to decode. | required |
special_token_policy | Optional[SpecialTokenPolicy] | The policy to use for special tokens. Passing | None |
Returns:
Type | Description |
---|---|
str | The decoded string. |
encode_chat_completion(request, max_model_input_len=None)
Encodes a chat completion request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | ChatCompletionRequest[UATS] | The chat completion request to encode. | required |
max_model_input_len | Optional[int] | The maximum length of the input to the model. If | None |
Returns:
Type | Description |
---|---|
TokenizedType | The encoded chat completion request. |
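A chat request can be encoded and then decoded back to text roughly as follows. This is a hedged sketch: it assumes that `UserMessage` and `ChatCompletionRequest` live in `mistral_common.protocol.instruct` and that the returned object exposes a `tokens` attribute; the helper name `encode_and_decode` is hypothetical.

```python
from typing import List, Tuple


def encode_and_decode(prompt: str) -> Tuple[List[int], str]:
    """Encode a single-turn chat request, then decode the resulting tokens."""
    # Local imports, so defining this helper does not require mistral_common.
    # The two protocol module paths below are assumptions about the package layout.
    from mistral_common.protocol.instruct.messages import UserMessage
    from mistral_common.protocol.instruct.request import ChatCompletionRequest
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    tokenizer = MistralTokenizer.v3()
    request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])

    # Validation and normalization run before tokenization.
    tokenized = tokenizer.encode_chat_completion(request)

    # `tokens` is assumed to hold the list of token ids.
    text = tokenizer.decode(tokenized.tokens)
    return tokenized.tokens, text
```

Calling `encode_and_decode("Hello")` would return the token ids together with the decoded string; how special tokens appear in that string depends on `special_token_policy`.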
encode_fim(request)
Encodes a fill-in-the-middle (FIM) request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | FIMRequest | The fill-in-the-middle request to encode. | required |
Returns:
Type | Description |
---|---|
TokenizedType | The encoded fill-in-the-middle request. |
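A fill-in-the-middle request carries the text before and after the gap to be completed. The sketch below assumes that `FIMRequest` is importable from `mistral_common.tokens.instruct.request` and accepts `prompt` and `suffix` fields; both are assumptions, and `encode_fim_example` is a hypothetical helper.

```python
from typing import List


def encode_fim_example(prefix: str, suffix: str) -> List[int]:
    """Encode code surrounding a gap as a fill-in-the-middle request."""
    # Local imports, so defining this helper does not require mistral_common.
    # The FIMRequest import path and field names are assumptions.
    from mistral_common.tokens.instruct.request import FIMRequest
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    tokenizer = MistralTokenizer.v3()
    request = FIMRequest(prompt=prefix, suffix=suffix)
    return tokenizer.encode_fim(request).tokens
```

For example, `encode_fim_example("def add(a, b):\n    return ", "\n")` would produce the token ids a model needs to fill in the function body.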
encode_transcription(request)
Encodes a transcription request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | TranscriptionRequest | The transcription request to encode. | required |
Returns:
Type | Description |
---|---|
TokenizedType | The encoded transcription request. |
from_file(tokenizer_filename, mode=ValidationMode.test)
classmethod
Loads a tokenizer from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer_filename | Union[str, Path] | The path to the tokenizer file. | required |
mode | ValidationMode | The validation mode to use. | test |
Returns:
Type | Description |
---|---|
MistralTokenizer | The loaded tokenizer. |
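Loading from a local file might look like the sketch below. It checks that the path exists before touching the library, and assumes `ValidationMode` is importable from `mistral_common.protocol.instruct.validator`; `load_local_tokenizer` is a hypothetical helper.

```python
from pathlib import Path
from typing import Union


def load_local_tokenizer(path: Union[str, Path]):
    """Load a MistralTokenizer from a tokenizer file on disk."""
    tokenizer_path = Path(path)
    # Fail early with a clear error instead of a library-specific one.
    if not tokenizer_path.is_file():
        raise FileNotFoundError(f"No tokenizer file at {tokenizer_path}")

    # Local imports, so defining this helper does not require mistral_common.
    # The ValidationMode import path is an assumption about the package layout.
    from mistral_common.protocol.instruct.validator import ValidationMode
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    # ValidationMode.test is the documented default; it is passed explicitly here.
    return MistralTokenizer.from_file(tokenizer_path, mode=ValidationMode.test)
```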
from_hf_hub(repo_id, token=None, revision=None, force_download=False, local_files_only=False, mode=ValidationMode.test)
staticmethod
Download the Mistral tokenizer for a given Hugging Face repository ID.
See here for a list of our OSS models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
repo_id | str | The Hugging Face repo ID. | required |
token | Optional[Union[bool, str]] | The Hugging Face token to use to download the tokenizer. | None |
revision | Optional[str] | The revision of the model to use. If | None |
mode | ValidationMode | The validation mode to use. | test |
force_download | bool | Whether to force the download of the tokenizer. If | False |
local_files_only | bool | Whether to only use local files. If | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer for the given model. |
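The parameters above can be combined as in the following sketch, which wraps `from_hf_hub` in a hypothetical helper named `download_tokenizer`. The repository ID is left to the caller, and the call needs network access unless `offline=True` and the files are already available locally.

```python
from typing import Optional


def download_tokenizer(repo_id: str, revision: Optional[str] = None, offline: bool = False):
    """Fetch the tokenizer for a Hugging Face repository, optionally from local files only."""
    # Local import, so defining this helper does not require mistral_common.
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    return MistralTokenizer.from_hf_hub(
        repo_id,
        revision=revision,         # None is the documented default
        local_files_only=offline,  # True restricts the lookup to local files
    )
```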
from_model(model, strict=False)
classmethod
Get the Mistral tokenizer for a given model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | The model name. | required |
strict | bool | Whether to use strict model name matching. If | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer for the given model. |
v1()
classmethod
Get the Mistral tokenizer v1.
v2()
classmethod
Get the Mistral tokenizer v2.
v3(is_tekken=False, is_mm=False)
classmethod
Get the Mistral tokenizer v3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
is_tekken | bool | Whether the tokenizer is a tekken tokenizer. See Tekkenizer. | False |
is_mm | bool | Whether to load the image tokenizer. | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer v3. |
v7(is_mm=False)
classmethod
Get the Mistral tokenizer v7.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
is_mm | bool | Whether to load the image tokenizer. | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer v7. |
load_audio_encoder(audio_config, tokenizer)
Load an audio encoder from a config and a tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
audio_config | AudioConfig | The audio config. | required |
tokenizer | Tekkenizer | The tokenizer. | required |
Returns:
Type | Description |
---|---|
AudioEncoder | The audio encoder. |
load_image_encoder(image_config, tokenizer)
Load an image encoder from a config and a tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
image_config | ImageConfig | The image config. | required |
tokenizer | Union[Tekkenizer, SentencePieceTokenizer] | The tokenizer. | required |
Returns:
Type | Description |
---|---|
ImageEncoder | The image encoder. |