mistral_common.tokens.tokenizers.mistral
MistralTokenizer(instruct_tokenizer, validator, request_normalizer)
Bases: Generic[UserMessageType, AssistantMessageType, ToolMessageType, SystemMessageType, TokenizedType]
Mistral tokenizer.
This class is a wrapper around an InstructTokenizer, a MistralRequestValidator and an InstructRequestNormalizer.
It provides a convenient interface to tokenize, validate and normalize Mistral requests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
instruct_tokenizer | InstructTokenizer[InstructRequest, FIMRequest, TokenizedType, AssistantMessageType] | The instruct tokenizer to use. | required |
validator | MistralRequestValidator[UserMessageType, AssistantMessageType, ToolMessageType, SystemMessageType] | The request validator to use. | required |
request_normalizer | InstructRequestNormalizer[UserMessageType, AssistantMessageType, ToolMessageType, SystemMessageType, InstructRequestType] | The request normalizer to use. | required |
Source code in src/mistral_common/tokens/tokenizers/mistral.py
decode(tokens)
encode_chat_completion(request, max_model_input_len=None)
Encodes a chat completion request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | ChatCompletionRequest[UATS] | The chat completion request to encode. | required |
max_model_input_len | Optional[int] | The maximum length of the input to the model. If | None |
Returns:
Type | Description |
---|---|
TokenizedType | The encoded chat completion request. |
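As a rough illustration of the call described above, the sketch below builds a single-turn request and encodes it. It assumes that `ChatCompletionRequest` and `UserMessage` are importable from `mistral_common.protocol.instruct.request` and `mistral_common.protocol.instruct.messages`, and that the returned object exposes `tokens` and `text` attributes. `build_request` and `encode_prompt` are hypothetical helper names, not part of the library.

```python
from typing import Any, List, Optional, Tuple


def build_request(prompt: str) -> Any:
    """Build a single-turn chat request.

    The imports are deferred so this module loads even when
    mistral_common is not installed (assumed import paths).
    """
    from mistral_common.protocol.instruct.messages import UserMessage
    from mistral_common.protocol.instruct.request import ChatCompletionRequest

    return ChatCompletionRequest(messages=[UserMessage(content=prompt)])


def encode_prompt(
    tokenizer: Any,
    request: Any,
    max_model_input_len: Optional[int] = None,
) -> Tuple[List[int], Optional[str]]:
    """Encode a request and return the token ids plus the debug text."""
    tokenized = tokenizer.encode_chat_completion(
        request, max_model_input_len=max_model_input_len
    )
    return list(tokenized.tokens), tokenized.text
```

With a real tokenizer, for example one returned by `MistralTokenizer.v3()`, the call would look like `tokens, text = encode_prompt(tokenizer, build_request("Hello"))`.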
encode_fim(request)
Encodes a fill in the middle request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | FIMRequest | The fill in the middle request to encode. | required |
Returns:
Type | Description |
---|---|
TokenizedType | The encoded fill in the middle request. |
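A fill in the middle request pairs the text before a gap with the text after it. The sketch below shows one way this might be encoded, assuming `FIMRequest` is importable from `mistral_common.protocol.fim.request` and accepts `prompt` and `suffix` fields. `build_fim_request` and `encode_fim_tokens` are hypothetical helpers, and some tokenizer versions may not support fill in the middle at all.

```python
from typing import Any, List


def build_fim_request(prefix: str, suffix: str) -> Any:
    """Build a fill-in-the-middle request (assumed import path and fields)."""
    from mistral_common.protocol.fim.request import FIMRequest

    # `prompt` is the code before the gap, `suffix` the code after it.
    return FIMRequest(prompt=prefix, suffix=suffix)


def encode_fim_tokens(tokenizer: Any, request: Any) -> List[int]:
    """Encode a FIM request and return only the token ids."""
    return list(tokenizer.encode_fim(request).tokens)
```

For instance, `encode_fim_tokens(tokenizer, build_fim_request("def add(a, b):\n    return ", "\n"))` would ask the model to fill in the function body.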
from_file(tokenizer_filename, mode=ValidationMode.test)
classmethod
Loads a tokenizer from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer_filename | str | The path to the tokenizer file. | required |
mode | ValidationMode | The validation mode to use. | test |
Returns:
Type | Description |
---|---|
MistralTokenizer | The loaded tokenizer. |
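A minimal sketch of loading a tokenizer from disk follows. It checks that the path exists before handing it to `from_file`, so a typo in the path fails with a clear message. `load_tokenizer` is a hypothetical wrapper, and the default validation mode is left unchanged.

```python
from pathlib import Path
from typing import Any


def load_tokenizer(path: str) -> Any:
    """Load a MistralTokenizer from a local tokenizer file."""
    tokenizer_path = Path(path)
    if not tokenizer_path.is_file():
        raise FileNotFoundError(f"Tokenizer file not found: {tokenizer_path}")

    # Deferred import: only needed once the path has been validated.
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    # `from_file` takes the path as a string.
    return MistralTokenizer.from_file(str(tokenizer_path))
```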
from_hf_hub(model_id, **kwargs)
staticmethod
Get the Mistral tokenizer for a given Hugging Face model ID.
See here for a list of our OSS models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_id | str | The Hugging Face model ID. | required |
kwargs | Any | Additional keyword arguments to pass to | {} |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer for the given model. |
from_model(model, strict=False)
classmethod
Get the Mistral tokenizer for a given model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | str | The model name. | required |
strict | bool | Whether to use strict model name matching. If | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer for the given model. |
v1()
classmethod
v2()
classmethod
Get the Mistral tokenizer v2.
v3(is_tekken=False, is_mm=False)
classmethod
Get the Mistral tokenizer v3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
is_tekken | bool | Whether the tokenizer is a tekken tokenizer. See Tekkenizer. | False |
is_mm | bool | Whether to load the multimodal tokenizer. | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer v3. |
v7(is_mm=False)
classmethod
Get the Mistral tokenizer v7.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
is_mm | bool | Whether to load the multimodal tokenizer. | False |
Returns:
Type | Description |
---|---|
MistralTokenizer | The Mistral tokenizer v7. |
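The versioned constructors `v1()`, `v2()`, `v3()` and `v7()` differ only in name and flags, so a caller that receives the version as a string can dispatch to them by name. The helper below is a hypothetical sketch of that idea, not part of the library: `load_versioned_tokenizer` and `KNOWN_VERSIONS` are illustrative names, and any keyword flags are passed through unchanged.

```python
from typing import Any

# Constructor names documented on this page.
KNOWN_VERSIONS = ("v1", "v2", "v3", "v7")


def load_versioned_tokenizer(version: str, **flags: Any) -> Any:
    """Call the matching MistralTokenizer constructor.

    Example: load_versioned_tokenizer("v3", is_tekken=True)
    is equivalent to MistralTokenizer.v3(is_tekken=True).
    """
    if version not in KNOWN_VERSIONS:
        raise ValueError(
            f"Unknown tokenizer version {version!r}; expected one of {KNOWN_VERSIONS}"
        )

    # Deferred import: only needed once the version name has been validated.
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

    return getattr(MistralTokenizer, version)(**flags)
```

Validating the name first keeps a misspelled version from surfacing as an `AttributeError` on the class.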
load_mm_encoder(mm_config, tokenizer)
Load a multi-modal encoder from a config and a tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mm_config | MultimodalConfig | The multi-modal config. | required |
tokenizer | Union[Tekkenizer, SentencePieceTokenizer] | The tokenizer. | required |
Returns:
Type | Description |
---|---|
MultiModalEncoder | The multi-modal encoder. |