mistral_common.tokens.tokenizers.base
InstructTokenizer(tokenizer, image_encoder, audio_encoder)
Bases: Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Base class for instruct tokenizers.
Attributes:

Name | Type | Description |
---|---|---|
`tokenizer` | `Tokenizer` | The tokenizer to use. |
`image_encoder` | `Optional[ImageEncoder]` | The image encoder to use, if any. |

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`tokenizer` | `Tokenizer` | The tokenizer to use. | required |
`image_encoder` | `Optional[ImageEncoder]` | The image encoder to use, if any. | required |
`audio_encoder` | `Optional[AudioEncoder]` | The audio encoder to use, if any. | required |
Source code in src/mistral_common/tokens/tokenizers/base.py
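In practice an `InstructTokenizer` is rarely constructed by hand. The sketch below is an assumption about typical usage: it relies on the `MistralTokenizer` wrapper and the tokenizer files bundled with the package.

```python
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# MistralTokenizer wraps an InstructTokenizer together with request validation;
# .v3() loads the v3 tokenizer files shipped with mistral_common.
mistral_tokenizer = MistralTokenizer.v3()

instruct_tokenizer = mistral_tokenizer.instruct_tokenizer
print(type(instruct_tokenizer).__name__)
print(instruct_tokenizer.tokenizer.n_words)  # the underlying Tokenizer (documented below)
```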
decode(tokens, special_token_policy=None)
abstractmethod
Convert token ids to a string.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`tokens` | `List[int]` | The token ids to decode. | required |
`special_token_policy` | `Optional[SpecialTokenPolicy]` | The policy to use for special tokens. | `None` |

Returns:

Type | Description |
---|---|
`str` | The decoded string. |
Source code in src/mistral_common/tokens/tokenizers/base.py
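A minimal sketch of the decode round trip, assuming the `MistralTokenizer` wrapper is used to load a concrete implementation:

```python
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

instruct_tokenizer = MistralTokenizer.v3().instruct_tokenizer

# Encode a plain string with the underlying tokenizer, then decode it back,
# dropping any special tokens from the output.
tokens = instruct_tokenizer.tokenizer.encode("Hello world", bos=True, eos=False)
print(instruct_tokenizer.decode(tokens, special_token_policy=SpecialTokenPolicy.IGNORE))
```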
encode_fim(request)
abstractmethod
Encode a FIM (fill-in-the-middle) request into a Tokenized object.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`request` | `FIMRequestType` | The FIM request to encode. | required |

Returns:

Type | Description |
---|---|
`TokenizedType` | The tokenized FIM request. |
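A sketch of a FIM encoding through the `MistralTokenizer` wrapper, assuming the loaded tokenizer's vocabulary contains the FIM control tokens (prefix/middle/suffix); older tokenizer files may not support this.

```python
from mistral_common.protocol.fim.request import FIMRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3()

# Ask the model to fill in the body between the prompt and the suffix.
request = FIMRequest(prompt="def add(a, b):\n", suffix="\nprint(add(1, 2))\n")
tokenized = tokenizer.encode_fim(request)
print(tokenized.tokens[:10])
print(tokenized.text)
```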
encode_instruct(request)
abstractmethod
Encode an instruct request into a Tokenized object.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`request` | `InstructRequestType` | The instruct request to encode. | required |

Returns:

Type | Description |
---|---|
`TokenizedType` | The tokenized instruct request. |
Source code in src/mistral_common/tokens/tokenizers/base.py
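Building an `InstructRequest` by hand is uncommon; the sketch below assumes the usual route through `MistralTokenizer.encode_chat_completion`, which validates a `ChatCompletionRequest`, builds the instruct request, and forwards it to `encode_instruct`.

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3()

request = ChatCompletionRequest(messages=[UserMessage(content="What is 2 + 2?")])
tokenized = tokenizer.encode_chat_completion(request)  # calls encode_instruct internally

print(len(tokenized.tokens))
print(tokenized.text)
```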
encode_transcription(request)
abstractmethod
Encodes an audio transcription request into a tokenized format.
This method processes a transcription request containing audio data, encodes the user message, and returns the tokenized output.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`request` | `TranscriptionRequest` | The transcription request object containing the audio data to be encoded. | required |

Returns:

Type | Description |
---|---|
`Tokenized` | The tokenized representation of the audio data, including the processed audio and tokens. |
Source code in src/mistral_common/tokens/tokenizers/base.py
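A heavily hedged sketch of a transcription encoding: the audio helpers (`Audio`, `RawAudio`) and the `TranscriptionRequest` fields shown here are assumptions based on recent releases with audio support, and every path and model name is a placeholder; check the installed version for the exact constructors.

```python
from mistral_common.audio import Audio
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Placeholders: use a tokenizer file with audio support and any local audio file.
tokenizer = MistralTokenizer.from_file("path/to/tekken.json")
audio = RawAudio.from_audio(Audio.from_file("path/to/sample.wav", strict=False))

request = TranscriptionRequest(model="my-transcription-model", audio=audio, language="en")
tokenized = tokenizer.encode_transcription(request)
print(len(tokenized.tokens))
```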
encode_user_content(content, is_last, system_prompt=None, force_img_first=False)
abstractmethod
Encode user content.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
`content` | `Union[str, List[ContentChunk]]` | The user content to encode. | required |
`is_last` | `bool` | Whether the content is the last one. | required |
`system_prompt` | `Optional[str]` | The system prompt. | `None` |
`force_img_first` | `bool` | Whether to force the image to be first. | `False` |

Returns:

Type | Description |
---|---|
`Tuple[List[int], List[ndarray], List[Audio]]` | The encoded tokens, images, and audio. |
Source code in src/mistral_common/tokens/tokenizers/base.py
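A sketch of encoding a text-only user turn, assuming a v3 tokenizer loaded through `MistralTokenizer`; for plain text the image and audio lists come back empty.

```python
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

instruct_tokenizer = MistralTokenizer.v3().instruct_tokenizer

tokens, images, audios = instruct_tokenizer.encode_user_content(
    content="Summarize this document in one sentence.",
    is_last=True,
    system_prompt="You are a concise assistant.",
)
print(len(tokens), len(images), len(audios))
```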
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
abstractmethod
Encode a user message.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`message` | `UserMessage` | The user message to encode. | required |
`available_tools` | `Optional[List[Tool]]` | The available tools. | required |
`is_last` | `bool` | Whether the message is the last one. | required |
`is_first` | `bool` | Whether the message is the first one. | required |
`system_prompt` | `Optional[str]` | The system prompt. | `None` |
`force_img_first` | `bool` | Whether to force the image to be first. | `False` |

Returns:

Type | Description |
---|---|
`Tuple[List[int], List[ndarray], List[Audio]]` | The encoded tokens, images, and audio. |
Source code in src/mistral_common/tokens/tokenizers/base.py
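A sketch of encoding a single user message that is both the first and the last turn of a conversation, with no tools declared (again assuming a v3 tokenizer loaded through `MistralTokenizer`).

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

instruct_tokenizer = MistralTokenizer.v3().instruct_tokenizer

tokens, images, audios = instruct_tokenizer.encode_user_message(
    UserMessage(content="Hello!"),
    available_tools=None,
    is_last=True,
    is_first=True,
)
print(len(tokens), len(images), len(audios))
```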
SpecialTokenPolicy
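The decode methods above take a `SpecialTokenPolicy` that controls what happens to special tokens; the sketch below assumes the enum exposes `IGNORE`, `KEEP`, and `RAISE` members, as in recent releases.

```python
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.v3().instruct_tokenizer.tokenizer

# Encode with BOS/EOS markers, then decode with and without special tokens.
tokens = tokenizer.encode("Hello", bos=True, eos=True)
print(tokenizer.decode(tokens, special_token_policy=SpecialTokenPolicy.IGNORE))  # plain text only
print(tokenizer.decode(tokens, special_token_policy=SpecialTokenPolicy.KEEP))    # BOS/EOS kept in the output
```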
SpecialTokens
[DEPRECATED] Enum of special tokens used in the tokenizer.
Attributes:

Name | Description |
---|---|
`unk` | The unknown token. |
`bos` | The beginning of string token. |
`eos` | The end of string token. |
`begin_inst` | The beginning of instruction token. |
`end_inst` | The end of instruction token. |
`begin_tools` | The beginning of tools token. |
`end_tools` | The end of tools token. |
`begin_tool_results` | The beginning of tool results token. |
`end_tool_results` | The end of tool results token. |
`tool_calls` | The tool calls token. |
`img` | The image token. |
`pad` | The pad token. |
`img_break` | The image break token. |
`img_end` | The image end token. |
`prefix` | The prefix token for FIM. |
`middle` | The middle token for FIM. |
`suffix` | The suffix token for FIM. |
`begin_system` | The beginning of system prompt token. |
`end_system` | The end of system prompt token. |
`begin_tool_content` | The beginning of tool content token. |
Examples:
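A small sketch of inspecting the deprecated enum's members; the exact string values depend on the tokenizer version, so they are only printed here.

```python
from mistral_common.tokens.tokenizers.base import SpecialTokens

# Iterate over every special token and show its name and raw string value.
for token in SpecialTokens:
    print(token.name, "->", token.value)
```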
Tokenized(**data)
Bases: MistralBase
A tokenized `InstructRequest`.

Attributes:

Name | Type | Description |
---|---|---|
`tokens` | `List[int]` | The token ids. |
`text` | `Optional[str]` | The text representation of the tokens. |
`prefix_ids` | `Optional[List[int]]` | The prefix ids for FIM. |
`images` | `List[ndarray]` | The loaded images associated with the tokens. |
Examples:
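A sketch of the fields on a `Tokenized` object returned by an encoding call, assuming the request is encoded through the `MistralTokenizer` wrapper.

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenized = MistralTokenizer.v3().encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Hi!")])
)

print(tokenized.tokens[:8])   # token ids
print(tokenized.text)         # debug text representation, if set
print(tokenized.prefix_ids)   # only set for FIM requests
print(tokenized.images)       # loaded images; empty for text-only requests
```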
Tokenizer
Bases: ABC
bos_id
abstractmethod
property
The id of the beginning-of-string (BOS) token.
eos_id
abstractmethod
property
The id of the end-of-string (EOS) token.
file_path
abstractmethod
property
The file path of the tokenizer.
n_words
abstractmethod
property
The vocabulary size of the tokenizer.
pad_id
abstractmethod
property
The id of the padding token.
unk_id
abstractmethod
property
The id of the unknown token.
version
abstractmethod
property
The version of the tokenizer.
decode(tokens, special_token_policy=None)
abstractmethod
Decode the token ids to a string.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`tokens` | `List[int]` | The token ids to decode. | required |
`special_token_policy` | `Optional[SpecialTokenPolicy]` | The policy to use for special tokens. | `None` |

Returns:

Type | Description |
---|---|
`str` | The decoded string. |
Source code in src/mistral_common/tokens/tokenizers/base.py
encode(s, bos, eos)
abstractmethod
get_control_token(s)
abstractmethod
id_to_piece(token_id)
abstractmethod
to_string(tokens)
abstractmethod
[DEPRECATED] Converts a list of token ids into a string, keeping special tokens.
Use `decode` with `special_token_policy=SpecialTokenPolicy.KEEP` instead.
This is a convenience method for debugging.
Source code in src/mistral_common/tokens/tokenizers/base.py
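A sketch of the low-level `Tokenizer` interface that sits underneath the `InstructTokenizer`, assuming the v3 tokenizer bundled with the package.

```python
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tok = MistralTokenizer.v3().instruct_tokenizer.tokenizer

print(tok.version, tok.n_words)         # tokenizer version and vocabulary size
print(tok.bos_id, tok.eos_id)           # ids of the BOS / EOS tokens
print(tok.encode("Hello", bos=True, eos=False))
print(tok.get_control_token("[INST]"))  # id of a named control token
print(tok.id_to_piece(tok.bos_id))      # map a token id back to its string piece
```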
TokenizerVersion
Enum of tokenizer versions.
Allows distinguishing between different versions of the tokenizer and maintaining backward compatibility.

Attributes:

Name | Description |
---|---|
`v1` | The first version of the tokenizer. |
`v2` | The second version of the tokenizer, which includes the special control tokens [INST] and [/INST]. |
`v3` | The third version of the tokenizer, which includes improved function calling. |
`v7` | The seventh version of the tokenizer, which includes an improved system prompt and function calling. |
`v11` | The eleventh version of the tokenizer, which includes improved function calling. |
`v13` | The thirteenth version of the tokenizer, which removes call id tokenization and improves prompt caching. |
Examples:
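A sketch of working with the version enum, assuming `Tokenizer.version` returns a `TokenizerVersion` member.

```python
from mistral_common.tokens.tokenizers.base import TokenizerVersion
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# The enum members mirror the version strings listed above.
print([v.name for v in TokenizerVersion])

# Check which version a loaded tokenizer uses.
tok = MistralTokenizer.v3().instruct_tokenizer.tokenizer
print(tok.version == TokenizerVersion.v3)
```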