mistral_common.tokens.tokenizers.instruct
InstructTokenizerBase(tokenizer, mm_encoder=None)
Bases: InstructTokenizer
, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Base instruct tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer
|
Tokenizer
|
The tokenizer to use. |
required |
mm_encoder
|
Optional[MultiModalEncoder]
|
The multi-modal encoder to use if any. |
None
|
Source code in src/mistral_common/tokens/tokenizers/instruct.py
decode(tokens)
encode_assistant_message(message, is_before_last_user_message)
abstractmethod
Encode an assistant message.
Raises:
Type | Description |
---|---|
NotImplementedError
|
The assistant message is not implemented for the base tokenizer. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_instruct(request)
Encode an instruct request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request
|
InstructRequest[AssistantMessageType, Tool]
|
The request to encode. |
required |
Returns:
Type | Description |
---|---|
Tokenized
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_tool_message(message, is_before_last_user_message)
abstractmethod
Encode a tool message.
Raises:
Type | Description |
---|---|
NotImplementedError
|
The tool message is not implemented for the base tokenizer. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
find_first_last_user(request)
staticmethod
Find the first and last user message in the request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request
|
InstructRequest
|
The request to search for user messages. |
required |
Returns:
Type | Description |
---|---|
Tuple[int, int]
|
The index of the first and last user message. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
function_call_prefix(tool_choice)
Return the function call prefix tokens.
Raises:
Type | Description |
---|---|
NotImplementedError
|
The function call prefix is not implemented for the base tokenizer. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
InstructTokenizerV1(tokenizer, mm_encoder=None)
Bases: InstructTokenizerBase
, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Instruct tokenizer V1.
This tokenizer has basic for messages. It does not support tools or multi-modal inputs.
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_assistant_message(message, is_before_last_user_message)
Encode an assistant message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
AssistantMessageType
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Not used. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_fim(request)
Encode a FIM request.
Raises:
Type | Description |
---|---|
TokenizerException
|
The FIM request is not implemented for this version. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Raises:
Type | Description |
---|---|
TokenizerException
|
The tool message is not implemented for this version. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_user_content(content, is_last, system_prompt=None, force_img_first=False)
Encode a user content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content
|
Union[str, List[ContentChunk]]
|
The content to encode. |
required |
is_last
|
bool
|
Whether the message is the last one. |
required |
system_prompt
|
Optional[str]
|
The system prompt. |
None
|
force_img_first
|
bool
|
Not used. |
False
|
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]]
|
The encoded tokens and empty list. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
UserMessage
|
The message to encode. |
required |
available_tools
|
Optional[List[Tool]]
|
Not used. |
required |
is_last
|
bool
|
Not used. |
required |
is_first
|
bool
|
Whether the message is the first one. |
required |
system_prompt
|
Optional[str]
|
The system prompt. |
None
|
force_img_first
|
bool
|
Not used. |
False
|
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]]
|
The encoded tokens and empty list. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
InstructTokenizerV2(tokenizer, mm_encoder=None)
Bases: InstructTokenizerV1
, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Instruct tokenizer V2.
This tokenizer adds supports to images, tools and FIM requests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer
|
Tokenizer
|
The tokenizer to use. |
required |
mm_encoder
|
Optional[MultiModalEncoder]
|
The multi-modal encoder to use. |
None
|
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_assistant_message(message, is_before_last_user_message)
Encode an assistant message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
AssistantMessageType
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Whether the message is before the last user message. If has tools and true, the message is not encoded. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_fim(request)
Encode a FIM request.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request
|
FIMRequest
|
The request to encode. |
required |
Returns:
Type | Description |
---|---|
Tokenized
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
ToolMessage
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Whether the message is before the last user message. If true, the message is not encoded. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
UserMessage
|
The message to encode. |
required |
available_tools
|
Optional[List[Tool]]
|
The list of available tools if any. |
required |
is_last
|
bool
|
Whether the message is the last one. |
required |
is_first
|
bool
|
Not used. |
required |
system_prompt
|
Optional[str]
|
The system prompt. |
None
|
force_img_first
|
bool
|
Whether to force the image to be first. |
False
|
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]]
|
The encoded tokens and the list of images. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
InstructTokenizerV3(tokenizer, mm_encoder=None)
Bases: InstructTokenizerV2
, Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Instruct tokenizer V3.
The only difference with V2 tokenizer is that it encodes the tool messages differently.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer
|
Tokenizer
|
The tokenizer to use. |
required |
mm_encoder
|
Optional[MultiModalEncoder]
|
The multi-modal encoder to use. |
None
|
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_assistant_message(message, is_before_last_user_message)
Encode an assistant message.
Note
Same as V2 but always encode the tool history.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
AssistantMessageType
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Not used. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Note
Same as V2 but tools are not wrapped in a list and the history is also tokenized.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
ToolMessage
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Whether the message is before the last user message. If true, the message is not encoded. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_user_content(content, is_last, system_prompt=None, force_img_first=False)
Encode a user content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content
|
Union[str, List[ContentChunk]]
|
The content to encode. |
required |
is_last
|
bool
|
Whether the message is the last one. |
required |
system_prompt
|
Optional[str]
|
The system prompt. |
None
|
force_img_first
|
bool
|
Whether to force the image to be first. |
False
|
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]]
|
The encoded tokens and the images. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
InstructTokenizerV7(tokenizer, mm_encoder=None)
Bases: InstructTokenizerV3
Instruct tokenizer V7.
The difference with V3 tokenizer is that it encodes the system prompts differently: - in V7 the system prompts are treated as separate SystemMessages. - they are no longer prepended to the last user message. - they are printed between special tokens.
Tool call results are encoded as :
- [begin tool call] call_id_tokens [tool_content] content tokens [end tool call]
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer
|
Tokenizer
|
The tokenizer to use. |
required |
mm_encoder
|
Optional[MultiModalEncoder]
|
The multi-modal encoder to use. |
None
|
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_assistant_message(message, is_before_last_user_message)
Encode an assistant message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
AssistantMessageType
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Not used. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_system_message(message)
Encode a system message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
SystemMessage
|
The message to encode. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_tool_message(message, is_before_last_user_message)
Encode a tool message.
Note
Same as V3 but tools are not wrapped in a list and history is also tokenized
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
ToolMessage
|
The message to encode. |
required |
is_before_last_user_message
|
bool
|
Not used. |
required |
Returns:
Type | Description |
---|---|
List[int]
|
The encoded tokens. |
Source code in src/mistral_common/tokens/tokenizers/instruct.py
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message
|
UserMessage
|
The message to encode. |
required |
available_tools
|
Optional[List[Tool]]
|
The list of available tools if any. |
required |
is_last
|
bool
|
Whether the message is the last one. |
required |
is_first
|
bool
|
Whether the message is the first one. |
required |
system_prompt
|
Optional[str]
|
Not used. |
None
|
force_img_first
|
bool
|
Whether to force the image to be first. |
False
|
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]]
|
The encoded tokens and the list of images. |