mistral_common.tokens.tokenizers.base
ImageEncoding(tokens, image)
dataclass
InstructTokenizer(tokenizer, mm_encoder)
Bases: Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]
Base class for instruct tokenizers.
Attributes:
Name | Type | Description |
---|---|---|
tokenizer | Tokenizer | The tokenizer to use. |
mm_encoder | Optional[MultiModalEncoder] | The multi-modal encoder to use if any. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | Tokenizer | The tokenizer to use. | required |
mm_encoder | Optional[MultiModalEncoder] | The multi-modal encoder to use if any. | required |
Source code in src/mistral_common/tokens/tokenizers/base.py
decode(tokens)
abstractmethod
encode_fim(request)
abstractmethod
Encode a fill-in-the-middle (FIM) request into a Tokenized object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | FIMRequestType | The FIM request to encode. | required |
Returns:
Type | Description |
---|---|
TokenizedType | The tokenized FIM request. |
encode_instruct(request)
abstractmethod
Encode an instruct request into a Tokenized object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
request | InstructRequestType | The instruct request to encode. | required |
Returns:
Type | Description |
---|---|
TokenizedType | The tokenized instruct request. |
encode_user_content(content, is_last, system_prompt=None, force_img_first=False)
abstractmethod
Encode a user content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | Union[str, List[ContentChunk]] | The user content to encode. | required |
is_last | bool | Whether the content is the last one. | required |
system_prompt | Optional[str] | The system prompt. | None |
force_img_first | bool | Whether to force the image to be first. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]] | The encoded tokens and images. |
encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False)
abstractmethod
Encode a user message.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | UserMessage | The user message to encode. | required |
available_tools | Optional[List[Tool]] | The available tools. | required |
is_last | bool | Whether the message is the last one. | required |
is_first | bool | Whether the message is the first one. | required |
system_prompt | Optional[str] | The system prompt. | None |
force_img_first | bool | Whether to force the image to be first. | False |
Returns:
Type | Description |
---|---|
Tuple[List[int], List[ndarray]] | The encoded tokens and images. |
MultiModalEncoder
Bases: Protocol
Protocol for multi-modal encoders.
Currently, only image encoders are supported.
image_token
property
The image token id.
__call__(content)
Encode the given content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | Union[ImageChunk, ImageURLChunk] | The content to be encoded. | required |
Returns:
Type | Description |
---|---|
ImageEncoding | The encoded image content. |
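Because MultiModalEncoder is a Protocol, any object with an `image_token` property and a compatible `__call__` satisfies it structurally. The following is a minimal sketch of that idea, not the library's implementation: `ImageEncoding` and `MultiModalEncoder` are redefined locally as stand-ins, `ToyImageEncoder` is hypothetical, and the input is simplified to nested lists of floats instead of the documented `ImageChunk` / `ImageURLChunk` types.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass
class ImageEncoding:
    """Local stand-in mirroring the documented ImageEncoding(tokens, image)."""

    tokens: List[int]
    image: List[List[float]]  # simplified: nested lists instead of an array type


@runtime_checkable
class MultiModalEncoder(Protocol):
    """Local stand-in for the documented protocol (simplified input type)."""

    @property
    def image_token(self) -> int: ...

    def __call__(self, content: List[List[float]]) -> ImageEncoding: ...


class ToyImageEncoder:
    """Hypothetical encoder that emits one image token per pixel row."""

    def __init__(self, image_token_id: int) -> None:
        self._image_token_id = image_token_id

    @property
    def image_token(self) -> int:
        return self._image_token_id

    def __call__(self, content: List[List[float]]) -> ImageEncoding:
        # One placeholder token per row; a real encoder would follow the model's patch layout.
        return ImageEncoding(tokens=[self.image_token] * len(content), image=content)


# The token id 10 is made up for illustration.
encoder = ToyImageEncoder(image_token_id=10)
encoding = encoder([[0.0, 0.5], [1.0, 0.25]])
```

Note that `ToyImageEncoder` never inherits from the protocol class; matching the member names is enough for it to pass an `isinstance` check against the runtime-checkable stand-in.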
SpecialImageIDs(img, img_break, img_end)
dataclass
Special image token IDs.
Attributes:
Name | Type | Description |
---|---|---|
img | int | The image token id. |
img_break | int | The image break token id. |
img_end | int | The image end token id. |
from_tokenizer(tokenizer)
staticmethod
Create a SpecialImageIDs from a Tokenizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenizer | Tokenizer | The tokenizer to use. | required |
Returns:
Type | Description |
---|---|
SpecialImageIDs | The special image token IDs. |
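One plausible way to build the three ids is to look up each image control token through the tokenizer's `get_control_token` method. The sketch below assumes exactly that and is not taken from the library source: `ToyTokenizer` is hypothetical, `SpecialImageIDs` is a local stand-in, and the token strings `[IMG]`, `[IMG_BREAK]`, `[IMG_END]` and their ids are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict


class ToyTokenizer:
    """Hypothetical tokenizer exposing only get_control_token."""

    def __init__(self, control_tokens: Dict[str, int]) -> None:
        self._control_tokens = control_tokens

    def get_control_token(self, s: str) -> int:
        return self._control_tokens[s]


@dataclass
class SpecialImageIDs:
    """Local stand-in mirroring the documented dataclass fields."""

    img: int
    img_break: int
    img_end: int

    @staticmethod
    def from_tokenizer(tokenizer: ToyTokenizer) -> "SpecialImageIDs":
        # Assumption: each id is resolved from a control token string.
        return SpecialImageIDs(
            img=tokenizer.get_control_token("[IMG]"),
            img_break=tokenizer.get_control_token("[IMG_BREAK]"),
            img_end=tokenizer.get_control_token("[IMG_END]"),
        )


# The ids below are made up; a real vocabulary defines its own.
toy = ToyTokenizer({"[IMG]": 10, "[IMG_BREAK]": 12, "[IMG_END]": 13})
ids = SpecialImageIDs.from_tokenizer(toy)
```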
SpecialTokens
[DEPRECATED] Enum of special tokens used in the tokenizer.
Attributes:
Name | Type | Description |
---|---|---|
unk | | The unknown token. |
bos | | The beginning of string token. |
eos | | The end of string token. |
begin_inst | | The beginning of instruction token. |
end_inst | | The end of instruction token. |
begin_tools | | The beginning of tools token. |
end_tools | | The end of tools token. |
begin_tool_results | | The beginning of tool results token. |
end_tool_results | | The end of tool results token. |
tool_calls | | The tool calls token. |
img | | The image token. |
pad | | The pad token. |
img_break | | The image break token. |
img_end | | The image end token. |
prefix | | The prefix token for FIM. |
middle | | The middle token for FIM. |
suffix | | The suffix token for FIM. |
begin_system | | The beginning of system prompt token. |
end_system | | The end of system prompt token. |
begin_tool_content | | The beginning of tool content token. |
Tokenized(**data)
Bases: MistralBase
A tokenized InstructRequest.
Attributes:
Name | Type | Description |
---|---|---|
tokens | List[int] | The token ids. |
text | Optional[str] | The text representation of the tokens. |
prefix_ids | Optional[List[int]] | The prefix ids for FIM. |
images | List[ndarray] | The loaded images associated with the tokens. |
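To show how the four attributes fit together, here is a minimal sketch that uses a plain dataclass as a local stand-in for Tokenized (the real class derives from MistralBase). The default values, the helper `fits_context`, and the example token ids and text are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Tokenized:
    """Local stand-in with the documented attribute names; defaults are assumed."""

    tokens: List[int]
    text: Optional[str] = None
    prefix_ids: Optional[List[int]] = None
    images: List[Any] = field(default_factory=list)


def fits_context(tokenized: Tokenized, max_tokens: int) -> bool:
    """Hypothetical helper: check the token count against a context budget."""
    return len(tokenized.tokens) <= max_tokens


# Made-up ids and text, only to show the shape of the object.
tokenized = Tokenized(tokens=[1, 3, 1010, 4], text="<s>[INST] hi [/INST]")
within_budget = fits_context(tokenized, max_tokens=8)
```

A text-only request leaves `images` empty and `prefix_ids` unset in this sketch, so downstream code can rely on `tokens` alone.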
Tokenizer
Bases: ABC
bos_id
abstractmethod
property
ID of the beginning-of-string (BOS) token.
eos_id
abstractmethod
property
ID of the end-of-string (EOS) token.
n_words
abstractmethod
property
Vocabulary size of the tokenizer.
pad_id
abstractmethod
property
ID of the padding token.
unk_id
abstractmethod
property
ID of the unknown token.
version
abstractmethod
property
Get the version of the tokenizer.
decode(t)
abstractmethod
encode(s, bos, eos)
abstractmethod
get_control_token(s)
abstractmethod
id_to_piece(token_id)
abstractmethod
to_string(tokens)
abstractmethod
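The abstract members listed above can be illustrated with a toy whitespace tokenizer. This is a sketch under stated assumptions rather than a real implementation: `WhitespaceTokenizer` is hypothetical, it does not subclass the library's Tokenizer, the `version` property is omitted, and the choice that `decode` drops control tokens while `to_string` keeps them is an assumption of this example.

```python
from typing import Dict, List


class WhitespaceTokenizer:
    """Hypothetical toy tokenizer exposing the documented member names."""

    def __init__(self, words: List[str]) -> None:
        control = ["<unk>", "<s>", "</s>", "<pad>"]
        self._pieces: List[str] = control + words
        self._ids: Dict[str, int] = {p: i for i, p in enumerate(self._pieces)}
        self._n_control = len(control)

    @property
    def n_words(self) -> int:
        return len(self._pieces)

    @property
    def unk_id(self) -> int:
        return self._ids["<unk>"]

    @property
    def bos_id(self) -> int:
        return self._ids["<s>"]

    @property
    def eos_id(self) -> int:
        return self._ids["</s>"]

    @property
    def pad_id(self) -> int:
        return self._ids["<pad>"]

    def get_control_token(self, s: str) -> int:
        token_id = self._ids.get(s)
        if token_id is None or token_id >= self._n_control:
            raise ValueError(f"{s!r} is not a control token")
        return token_id

    def id_to_piece(self, token_id: int) -> str:
        return self._pieces[token_id]

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        # Unknown words map to unk_id; bos/eos are added only on request.
        ids = [self._ids.get(word, self.unk_id) for word in s.split()]
        if bos:
            ids = [self.bos_id] + ids
        if eos:
            ids = ids + [self.eos_id]
        return ids

    def decode(self, t: List[int]) -> str:
        # Assumption of this sketch: control tokens are dropped.
        return " ".join(self._pieces[i] for i in t if i >= self._n_control)

    def to_string(self, tokens: List[int]) -> str:
        # Assumption of this sketch: control tokens are kept, which helps debugging.
        return " ".join(self._pieces[i] for i in tokens)


tok = WhitespaceTokenizer(["hello", "world"])
ids = tok.encode("hello brave world", bos=True, eos=True)
plain = tok.decode(ids)
debug = tok.to_string(ids)
```

Here `ids` is `[1, 4, 0, 5, 2]`: the out-of-vocabulary word "brave" maps to the unknown id, `plain` is `"hello world"`, and `debug` is `"<s> hello <unk> world </s>"`.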
TokenizerVersion
Enum of tokenizer versions.
Distinguishes between different versions of the tokenizer so that backward compatibility can be maintained.
Attributes:
Name | Type | Description |
---|---|---|
v1 | | The first version of the tokenizer. |
v2 | | The second version of the tokenizer, which includes the special control tokens [INST] and [/INST]. |
v3 | | The third version of the tokenizer, which includes improved function calling. |
v7 | | The seventh version of the tokenizer, which includes improved system prompt and function calling. |
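Code that must stay backward compatible typically branches on the tokenizer version. The sketch below shows one way to do that with a local stand-in enum; the string values, `version_number`, and `has_improved_function_calling` are assumptions introduced for this example and are not part of the documented API.

```python
from enum import Enum


class TokenizerVersion(str, Enum):
    """Local stand-in listing the documented versions; string values are assumed."""

    v1 = "v1"
    v2 = "v2"
    v3 = "v3"
    v7 = "v7"


def version_number(version: TokenizerVersion) -> int:
    """Hypothetical helper: numeric part of the version, used for ordering."""
    return int(version.value[1:])


def has_improved_function_calling(version: TokenizerVersion) -> bool:
    # Per the table above, improved function calling is listed from v3 onward.
    return version_number(version) >= 3


old = TokenizerVersion("v2")
new = TokenizerVersion("v7")
flags = (has_improved_function_calling(old), has_improved_function_calling(new))
```

Comparing the numeric part instead of the raw strings avoids lexicographic surprises if a two-digit version is ever added, since `"v11" < "v2"` holds for plain string comparison.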