mistral_common.tokens.tokenizers.instruct
InstructTokenizerBase(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizer, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Base instruct tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | Tokenizer | The tokenizer to use. | required |
image_encoder | Optional[ImageEncoder] | The image encoder to use if any. | None |
audio_encoder | Optional[AudioEncoder] | The audio encoder to use. | None |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
decode(tokens, special_token_policy=None)
Decode tokens to a string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokens | List[int] | The tokens to decode. | required |
special_token_policy | Optional[SpecialTokenPolicy] | The policy to use for special tokens. Passing | None |
Returns:
Type | Description |
---|---|
str | The decoded string. |
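To make the role of a special-token policy more concrete, here is a minimal, library-independent sketch that decodes a toy vocabulary. The `Policy` enum, its three members, and the `vocab` and `special` arguments are assumptions made for this illustration; they are not the library's actual types or behavior.

```python
from enum import Enum
from typing import Dict, List, Set


class Policy(Enum):
    """Hypothetical stand-in for a special-token policy."""

    IGNORE = "ignore"  # drop special tokens from the output
    KEEP = "keep"      # render special tokens as text
    RAISE = "raise"    # refuse to decode if a special token is present


def decode(tokens: List[int], vocab: Dict[int, str], special: Set[int], policy: Policy) -> str:
    """Decode token ids with a toy vocabulary, honouring the given policy."""
    pieces: List[str] = []
    for tok in tokens:
        if tok in special:
            if policy is Policy.RAISE:
                raise ValueError(f"special token {tok} is not allowed")
            if policy is Policy.IGNORE:
                continue
        pieces.append(vocab[tok])
    return "".join(pieces)


# Toy data: ids 0 and 3 play the role of special tokens.
VOCAB = {0: "<s>", 1: "Hello", 2: " world", 3: "</s>"}
SPECIAL = {0, 3}
ids = [0, 1, 2, 3]

print(decode(ids, VOCAB, SPECIAL, Policy.IGNORE))  # Hello world
print(decode(ids, VOCAB, SPECIAL, Policy.KEEP))    # <s>Hello world</s>
```

With the hypothetical `Policy.RAISE`, the same call raises `ValueError` instead of returning a string.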
encode_assistant_message(message, is_before_last_user_message, continue_message)
abstractmethod
Encode an assistant message.
Raises:
Type | Description |
---|---|
NotImplementedError | The assistant message is not implemented for the base tokenizer. |
encode_instruct(request)
Encode an instruct request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | InstructRequest[AssistantMessageType, Tool] | The request to encode. | required |
Returns:
Type | Description |
---|---|
Tokenized | The encoded tokens. |
encode_tool_message(message, is_before_last_user_message)
abstractmethod
Encode a tool message.
Raises:
Type | Description |
---|---|
NotImplementedError | The tool message is not implemented for the base tokenizer. |
find_first_last_user(request)
staticmethod
Find the first and last user message in the request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | InstructRequest | The request to search for user messages. | required |
Returns:
Type | Description |
---|---|
Tuple[int, int] | The index of the first and last user message. |
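The search can be pictured with a small standalone sketch that works on a plain list of role strings rather than an `InstructRequest`. The role strings and the choice to return `-1` when no user message is present are assumptions of this sketch, not documented behavior.

```python
from typing import List, Tuple


def find_first_last_user(roles: List[str]) -> Tuple[int, int]:
    """Return the indices of the first and last "user" entries.

    Returning (-1, -1) when there is no user entry is an assumption
    made for this sketch.
    """
    first, last = -1, -1
    for i, role in enumerate(roles):
        if role == "user":
            if first == -1:
                first = i  # remember only the earliest user turn
            last = i       # keep overwriting so the latest one wins
    return first, last


conversation = ["system", "user", "assistant", "user", "assistant"]
print(find_first_last_user(conversation))  # (1, 3)
```

When the conversation holds a single user message, both indices are equal.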
InstructTokenizerV1(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizerBase, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Instruct tokenizer V1.
This tokenizer has basic support for messages. It does not support tools or image inputs.
encode_assistant_message(message, is_before_last_user_message, continue_message)
Encode an assistant message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | AssistantMessageType | The message to encode. | required |
is_before_last_user_message | bool | Not used. | required |
continue_message | bool | Whether to continue the message generation. Only use this if the assistant message is the last message. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
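One plausible reading of `continue_message` is that it leaves the assistant turn open so the model can keep writing it. The sketch below illustrates that reading with strings in place of token ids. The `EOS` placeholder and the rule "append an end marker unless continuing" are assumptions of this sketch; the source only states that the flag continues generation.

```python
from typing import List

EOS = "</s>"  # placeholder end-of-sequence marker (assumed for this sketch)


def encode_assistant(pieces: List[str], continue_message: bool) -> List[str]:
    """Close the assistant turn with EOS unless it should be continued."""
    out = list(pieces)  # copy so the caller's list is not mutated
    if not continue_message:
        out.append(EOS)
    return out


print(encode_assistant(["The", " answer"], continue_message=False))  # ['The', ' answer', '</s>']
print(encode_assistant(["The", " answer"], continue_message=True))   # ['The', ' answer']
```

Under this reading, continuing only makes sense for the final message, which matches the guidance in the parameter description.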
encode_fim(request)
Encode a FIM request.
Raises:
Type | Description |
---|---|
TokenizerException | The FIM request is not implemented for this version. |
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Raises:
Type | Description |
---|---|
TokenizerException | The tool message is not implemented for this version. |
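Because V1 rejects tool messages while later versions accept them, calling code may want to guard the call. The sketch below shows that pattern with stand-in classes. `UnsupportedFeature`, `SketchV1`, `SketchV2` and `try_encode_tool` are hypothetical names created for this illustration and do not come from the library.

```python
from typing import List, Optional


class UnsupportedFeature(Exception):
    """Local stand-in for the exception a tokenizer raises on unsupported input."""


class SketchV1:
    def encode_tool_message(self, message: str, is_before_last_user_message: bool) -> List[str]:
        # Mirrors the documented V1 behavior: tool messages are rejected.
        raise UnsupportedFeature("tool messages are not implemented for this version")


class SketchV2(SketchV1):
    def encode_tool_message(self, message: str, is_before_last_user_message: bool) -> List[str]:
        # A later version overrides the method and returns some encoding.
        return ["[tool]", message]


def try_encode_tool(tokenizer: SketchV1, message: str) -> Optional[List[str]]:
    """Return the encoding, or None when the tokenizer version lacks tool support."""
    try:
        return tokenizer.encode_tool_message(message, False)
    except UnsupportedFeature:
        return None


print(try_encode_tool(SketchV1(), "42"))  # None
print(try_encode_tool(SketchV2(), "42"))  # ['[tool]', '42']
```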
encode_user_content(content, is_last, system_prompt=None, force_img_first=False)
Encode a user content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | Union[str, List[ContentChunk]] | The content to encode. | required |
is_last | bool | Whether the message is the last one. | required |
system_prompt | Optional[str] | The system prompt. | None |
force_img_first | bool | Not used. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray], List[Audio]] | The encoded tokens and empty list. |
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | UserMessage | The message to encode. | required |
available_tools | Optional[List[Tool]] | Not used. | required |
is_last | bool | Not used. | required |
is_first | bool | Whether the message is the first one. | required |
system_prompt | Optional[str] | The system prompt. | None |
force_img_first | bool | Not used. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray], List[Audio]] | The encoded tokens and empty list. |
InstructTokenizerV11(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizerV7
Instruct tokenizer V11.
The difference with the V7 tokenizer is that it encodes tool calls differently. Tool call results are encoded as:
- [begin tool call] call_name_tokens [call id] call_id_tokens [args] content tokens
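The layout above can be sketched with plain strings standing in for tokens. The bracketed markers are copied from the description and used here as placeholders; the real special tokens, and the function name `layout_v11`, are assumptions of this sketch.

```python
from typing import List


def layout_v11(call_name: List[str], call_id: List[str], content: List[str]) -> List[str]:
    """Assemble the pieces in the order given by the V11 description."""
    return ["[begin tool call]", *call_name, "[call id]", *call_id, "[args]", *content]


pieces = layout_v11(["get_weather"], ["call_0"], ['{"city": "Paris"}'])
print(pieces)
# ['[begin tool call]', 'get_weather', '[call id]', 'call_0', '[args]', '{"city": "Paris"}']
```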
InstructTokenizerV13(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizerV11
Instruct tokenizer V13.
The difference with the V11 tokenizer is that it encodes tool calls differently:
- available tools are tokenized at the first user message.
- call id is no longer tokenized for tool calls or results.
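Both V13 changes can be illustrated with a small string-based sketch. The `[available tools]` and `[begin tool call]` / `[args]` markers are placeholders, and placing the tool block immediately before the first user turn is an assumption of this sketch; the description only says the tools are tokenized at the first user message.

```python
from typing import List


def place_tools_v13(roles: List[str], tools: List[str]) -> List[str]:
    """Outline a conversation, attaching the tool list to the first user turn."""
    outline: List[str] = []
    tools_emitted = False
    for role in roles:
        if role == "user" and tools and not tools_emitted:
            outline.append("[available tools] " + ", ".join(tools))
            tools_emitted = True  # later user turns get no tool block
        outline.append(role)
    return outline


def tool_call_v13(call_name: List[str], args: List[str]) -> List[str]:
    """Tool call pieces without any call id segment."""
    return ["[begin tool call]", *call_name, "[args]", *args]


print(place_tools_v13(["system", "user", "assistant", "user"], ["get_weather"]))
# ['system', '[available tools] get_weather', 'user', 'assistant', 'user']
print(tool_call_v13(["get_weather"], ['{"city": "Paris"}']))
# ['[begin tool call]', 'get_weather', '[args]', '{"city": "Paris"}']
```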
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | ToolMessage | The message to encode. | required |
is_before_last_user_message | bool | Not used. | required |
Returns: The encoded tokens.
InstructTokenizerV2(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizerV1, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Instruct tokenizer V2.
This tokenizer adds support for images, tools and FIM requests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | Tokenizer | The tokenizer to use. | required |
image_encoder | Optional[ImageEncoder] | The image encoder to use. | None |
audio_encoder | Optional[AudioEncoder] | The audio encoder to use. | None |
encode_assistant_message(message, is_before_last_user_message, continue_message)
Encode an assistant message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | AssistantMessageType | The message to encode. | required |
is_before_last_user_message | bool | Whether the message is before the last user message. If the message has tool calls and this is true, the message is not encoded. | required |
continue_message | bool | Whether to continue the message generation. Only use this if the assistant message is the last message. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
encode_fim(request)
Encode a FIM request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | FIMRequest | The request to encode. | required |
Returns:
Type | Description |
---|---|
Tokenized | The encoded tokens. |
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | ToolMessage | The message to encode. | required |
is_before_last_user_message | bool | Whether the message is before the last user message. If true, the message is not encoded. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
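The effect of `is_before_last_user_message` on tool messages can be sketched without the library. Here a conversation is a list of `(role, text)` tuples, which is a simplification chosen for this example; only tool messages positioned after the last user message survive.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text), a simplified stand-in for message objects


def tool_messages_to_encode(messages: List[Message]) -> List[str]:
    """Keep only the tool messages that come after the last user message."""
    last_user = max(
        (i for i, (role, _) in enumerate(messages) if role == "user"),
        default=-1,
    )
    kept: List[str] = []
    for i, (role, text) in enumerate(messages):
        is_before_last_user_message = i < last_user
        if role == "tool" and not is_before_last_user_message:
            kept.append(text)
    return kept


history = [("user", "q1"), ("tool", "r1"), ("user", "q2"), ("tool", "r2")]
print(tool_messages_to_encode(history))  # ['r2']
```

The earlier tool result `r1` is dropped because it precedes the last user message.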
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | UserMessage | The message to encode. | required |
available_tools | Optional[List[Tool]] | The list of available tools if any. | required |
is_last | bool | Whether the message is the last one. | required |
is_first | bool | Not used. | required |
system_prompt | Optional[str] | The system prompt. | None |
force_img_first | bool | Whether to force the image to be first. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray], List[Audio]] | The encoded tokens and the list of images. |
InstructTokenizerV3(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizerV2, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Instruct tokenizer V3.
The only difference with V2 tokenizer is that it encodes the tool messages differently.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | Tokenizer | The tokenizer to use. | required |
image_encoder | Optional[ImageEncoder] | The image encoder to use. | None |
audio_encoder | Optional[AudioEncoder] | The audio encoder to use. | None |
encode_assistant_message(message, is_before_last_user_message, continue_message)
Encode an assistant message.
Note
Same as V2 but always encodes the tool history.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | AssistantMessageType | The message to encode. | required |
is_before_last_user_message | bool | Not used. | required |
continue_message | bool | Whether to continue the message generation. Only use this if the assistant message is the last message. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Note
Same as V2 but tools are not wrapped in a list and the history is also tokenized.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | ToolMessage | The message to encode. | required |
is_before_last_user_message | bool | Whether the message is before the last user message. If true, the message is not encoded. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
encode_user_content(content, is_last, system_prompt=None, force_img_first=False)
Encode a user content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | Union[str, List[ContentChunk]] | The content to encode. | required |
is_last | bool | Whether the message is the last one. | required |
system_prompt | Optional[str] | The system prompt. | None |
force_img_first | bool | Whether to force the image to be first. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray], List[Audio]] | The encoded tokens and the images. |
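As a rough picture of what forcing images first could mean, the sketch below reorders content chunks so that image chunks precede everything else while keeping relative order. The `(kind, payload)` tuple representation and the stable-partition rule are assumptions of this sketch, not the documented algorithm.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (kind, payload), a simplified stand-in for content chunks


def order_chunks(chunks: List[Chunk], force_img_first: bool) -> List[Chunk]:
    """Move image chunks to the front when requested, preserving relative order."""
    if not force_img_first:
        return list(chunks)
    images = [c for c in chunks if c[0] == "image"]
    others = [c for c in chunks if c[0] != "image"]
    return images + others


content = [("text", "Describe"), ("image", "img_a"), ("text", "and"), ("image", "img_b")]
print(order_chunks(content, force_img_first=True))
# [('image', 'img_a'), ('image', 'img_b'), ('text', 'Describe'), ('text', 'and')]
```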
InstructTokenizerV7(tokenizer, image_encoder=None, audio_encoder=None)
Bases: InstructTokenizerV3
Instruct tokenizer V7.
The difference with the V3 tokenizer is that it encodes the system prompts differently:
- in V7 the system prompts are treated as separate SystemMessages
- they are no longer prepended to the last user message
- they are printed between special tokens
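The contrast between the two system prompt treatments can be sketched with plain strings. The `[SYSTEM_PROMPT]` and `[/SYSTEM_PROMPT]` markers, the blank-line separator, and the `role: text` rendering are placeholders assumed for this illustration; they are not taken from the description above.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text), a simplified stand-in for message objects


def render_v3_style(messages: List[Message]) -> List[str]:
    """Fold system prompts into the last user message (pre-V7 behavior)."""
    system = "\n\n".join(text for role, text in messages if role == "system")
    last_user = max(
        (i for i, (role, _) in enumerate(messages) if role == "user"),
        default=-1,
    )
    out: List[str] = []
    for i, (role, text) in enumerate(messages):
        if role == "system":
            continue  # system text is not rendered on its own
        if i == last_user and system:
            text = system + "\n\n" + text
        out.append(f"{role}: {text}")
    return out


def render_v7_style(messages: List[Message]) -> List[str]:
    """Render each system prompt in place, wrapped in marker tokens."""
    out: List[str] = []
    for role, text in messages:
        if role == "system":
            out.append(f"[SYSTEM_PROMPT]{text}[/SYSTEM_PROMPT]")
        else:
            out.append(f"{role}: {text}")
    return out


chat = [("system", "Be brief."), ("user", "Hi"), ("assistant", "Hello"), ("user", "Bye")]
print(render_v3_style(chat))  # ['user: Hi', 'assistant: Hello', 'user: Be brief.\n\nBye']
print(render_v7_style(chat))
# ['[SYSTEM_PROMPT]Be brief.[/SYSTEM_PROMPT]', 'user: Hi', 'assistant: Hello', 'user: Bye']
```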
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | Tokenizer | The tokenizer to use. | required |
image_encoder | Optional[ImageEncoder] | The image encoder to use. | None |
audio_encoder | Optional[AudioEncoder] | The audio encoder to use. | None |
encode_assistant_message(message, is_before_last_user_message, continue_message)
Encode an assistant message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | AssistantMessageType | The message to encode. | required |
is_before_last_user_message | bool | Not used. | required |
continue_message | bool | Whether to continue the message generation. Only use this if the assistant message is the last message. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
encode_system_message(message)
Encode a system message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | SystemMessage | The message to encode. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Note
Same as V3 but tools are not wrapped in a list and the history is also tokenized.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | ToolMessage | The message to encode. | required |
is_before_last_user_message | bool | Not used. | required |
Returns:
Type | Description |
---|---|
List[int] | The encoded tokens. |
encode_transcription(request)
Encodes an audio transcription request into a tokenized format.
This method processes a transcription request containing audio data, encodes the user message, and returns the tokenized output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | TranscriptionRequest | The transcription request object containing the audio data to be encoded. | required |
Returns:
Type | Description |
---|---|
Tokenized | The tokenized representation of the audio data, including processed audio and tokens. |
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | UserMessage | The message to encode. | required |
available_tools | Optional[List[Tool]] | The list of available tools if any. | required |
is_last | bool | Whether the message is the last one. | required |
is_first | bool | Whether the message is the first one. | required |
system_prompt | Optional[str] | Not used. | None |
force_img_first | bool | Whether to force the image to be first. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray], List[Audio]] | The encoded tokens and the list of images. |