mistral_common.tokens.tokenizers.base

ImageEncoding(tokens, image) dataclass

A tokenized image.

Attributes:

- tokens (List[int]): The token ids.
- image (ndarray): The image as a numpy array.

Examples:

>>> import numpy as np
>>> image_encoding = ImageEncoding(tokens=[1, 2, 3], image=np.array([[0., 0.5, 1.]]))

InstructTokenizer(tokenizer, mm_encoder)

Bases: Generic[InstructRequestType, FIMRequestType, TokenizedType, AssistantMessageType]

Base class for instruct tokenizers.

Attributes:

- tokenizer (Tokenizer): The tokenizer to use.
- mm_encoder (Optional[MultiModalEncoder]): The multi-modal encoder to use, if any.

Parameters:

- tokenizer (Tokenizer): The tokenizer to use. Required.
- mm_encoder (Optional[MultiModalEncoder]): The multi-modal encoder to use, if any. Required.
Source code in src/mistral_common/tokens/tokenizers/base.py
def __init__(self, tokenizer: Tokenizer, mm_encoder: Optional[MultiModalEncoder]) -> None:
    r"""Initialize the instruct tokenizer.

    Args:
        tokenizer: The tokenizer to use.
        mm_encoder: The multi-modal encoder to use if any.
    """

decode(tokens) abstractmethod

Convert token ids to a string.

Parameters:

- tokens (List[int]): The token ids to decode. Required.

Returns:

- str: The decoded string.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def decode(self, tokens: List[int]) -> str:
    r"""Convert token ids to string

    Args:
        tokens: The token ids to decode.

    Returns:
        The decoded string.
    """

encode_fim(request) abstractmethod

Convert a FIM request to a Tokenized object.

Parameters:

- request (FIMRequestType): The FIM request to encode. Required.

Returns:

- TokenizedType: The tokenized FIM request.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def encode_fim(self, request: FIMRequestType) -> TokenizedType:
    r"""FIM request to Tokenized object

    Args:
        request: The FIM request to encode.

    Returns:
        The tokenized FIM request.
    """

encode_instruct(request) abstractmethod

Convert an instruct request to a Tokenized object.

Parameters:

- request (InstructRequestType): The instruct request to encode. Required.

Returns:

- TokenizedType: The tokenized instruct request.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def encode_instruct(self, request: InstructRequestType) -> TokenizedType:
    r"""Instruct request to Tokenized object

    Args:
        request: The instruct request to encode.

    Returns:
        The tokenized instruct request.
    """

encode_user_content(content, is_last, system_prompt=None, force_img_first=False) abstractmethod

Encode user content.

Parameters:

- content (Union[str, List[ContentChunk]]): The user content to encode. Required.
- is_last (bool): Whether the content is the last one. Required.
- system_prompt (Optional[str]): The system prompt. Default: None.
- force_img_first (bool): Whether to force the image to be first. Default: False.

Returns:

- Tuple[List[int], List[ndarray]]: The encoded tokens and images.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def encode_user_content(
    self,
    content: Union[str, List[ContentChunk]],
    is_last: bool,
    system_prompt: Optional[str] = None,
    force_img_first: bool = False,
) -> Tuple[List[int], List[np.ndarray]]:
    r"""Encode a user content.

    Args:
        content: The user content to encode.
        is_last: Whether the content is the last one.
        system_prompt: The system prompt.
        force_img_first: Whether to force the image to be first.

    Returns:
        The encoded tokens and images.
    """
    ...
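The real implementations live in the versioned tokenizer subclasses. The sketch below is a simplified, hypothetical illustration of the contract only: the chunk classes and the text encoder are stand-ins, not mistral_common's types, and the system-prompt handling is an assumption for this sketch.

```python
# Hypothetical sketch of a concrete encode_user_content: interleave text
# and image chunks into one token stream. All classes here are simplified
# stand-ins, NOT the actual mistral_common types.
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class TextChunk:
    text: str


@dataclass
class ImageChunk:
    tokens: List[int]  # token ids produced by a multi-modal encoder


def encode_user_content(
    content: Union[str, List[Union[TextChunk, ImageChunk]]],
    is_last: bool,
    system_prompt: Optional[str] = None,
    force_img_first: bool = False,
) -> Tuple[List[int], List[List[int]]]:
    """Encode user content into token ids, collecting image token lists."""

    def _encode_text(s: str) -> List[int]:
        # Stand-in text encoder: one byte-ish id per character.
        return [ord(c) % 256 for c in s]

    if isinstance(content, str):
        # Assumption for this sketch: the system prompt is prepended to
        # the last user message.
        text = f"{system_prompt}\n\n{content}" if (is_last and system_prompt) else content
        return _encode_text(text), []

    chunks = list(content)
    if force_img_first:
        # Move image chunks ahead of text chunks, preserving relative order.
        chunks = [c for c in chunks if isinstance(c, ImageChunk)] + [
            c for c in chunks if isinstance(c, TextChunk)
        ]

    tokens: List[int] = []
    images: List[List[int]] = []
    for chunk in chunks:
        if isinstance(chunk, ImageChunk):
            tokens.extend(chunk.tokens)
            images.append(chunk.tokens)
        else:
            tokens.extend(_encode_text(chunk.text))
    return tokens, images
```

With `force_img_first=True`, the image token ids land at the front of the stream while the collected image list is returned alongside, mirroring the `Tuple[List[int], List[ndarray]]` shape documented above.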

encode_user_message(message, available_tools, is_last, is_first, system_prompt=None, force_img_first=False) abstractmethod

Encode a user message.

Parameters:

- message (UserMessage): The user message to encode. Required.
- available_tools (Optional[List[Tool]]): The available tools. Required.
- is_last (bool): Whether the message is the last one. Required.
- is_first (bool): Whether the message is the first one. Required.
- system_prompt (Optional[str]): The system prompt. Default: None.
- force_img_first (bool): Whether to force the image to be first. Default: False.

Returns:

- Tuple[List[int], List[ndarray]]: The encoded tokens and images.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def encode_user_message(
    self,
    message: UserMessage,
    available_tools: Optional[List[Tool]],
    is_last: bool,
    is_first: bool,
    system_prompt: Optional[str] = None,
    force_img_first: bool = False,
) -> Tuple[List[int], List[np.ndarray]]:
    r"""Encode a user message.

    Args:
        message: The user message to encode.
        available_tools: The available tools.
        is_last: Whether the message is the last one.
        is_first: Whether the message is the first one.
        system_prompt: The system prompt.
        force_img_first: Whether to force the image to be first.

    Returns:
        The encoded tokens and images.
    """
    ...

MultiModalEncoder

Bases: Protocol

Protocol for multi-modal encoders.

Currently, only image encoders are supported.

image_token property

The image token id.

__call__(content)

Encode the given content.

Parameters:

- content (Union[ImageChunk, ImageURLChunk]): The content to be encoded. Required.

Returns:

- ImageEncoding: The encoded image content.

Source code in src/mistral_common/tokens/tokenizers/base.py
def __call__(self, content: Union[ImageChunk, ImageURLChunk]) -> ImageEncoding:
    """Encode the given content.

    Args:
        content: The content to be encoded.

    Returns:
        The encoded image content.
    """
    ...
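Because `MultiModalEncoder` is a `Protocol`, any class exposing a matching `image_token` property and `__call__` satisfies it structurally, with no inheritance required. A minimal sketch, using stand-in chunk/encoding classes rather than the library's own:

```python
# Illustrative sketch: satisfying a MultiModalEncoder-style Protocol
# structurally. The classes below are hypothetical stand-ins, not
# mistral_common's implementations.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class FakeImageChunk:
    width: int
    height: int


@dataclass
class FakeImageEncoding:
    tokens: List[int]


class Encoder(Protocol):
    @property
    def image_token(self) -> int: ...

    def __call__(self, content: FakeImageChunk) -> FakeImageEncoding: ...


class DummyImageEncoder:
    """Emits one image token per 16x16 patch of the input image."""

    @property
    def image_token(self) -> int:
        return 10

    def __call__(self, content: FakeImageChunk) -> FakeImageEncoding:
        n_patches = (content.width // 16) * (content.height // 16)
        return FakeImageEncoding(tokens=[self.image_token] * n_patches)


# Structural typing: DummyImageEncoder never subclasses Encoder.
encoder: Encoder = DummyImageEncoder()
```

Static type checkers accept the last assignment because the shapes match; that is the point of using `Protocol` for the encoder interface.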

SpecialImageIDs(img, img_break, img_end) dataclass

Special image token ids.

Attributes:

- img (int): The image token id.
- img_break (int): The image break token id.
- img_end (int): The image end token id.

Examples:

>>> special_image_ids = SpecialImageIDs(img=1, img_break=2, img_end=3)

from_tokenizer(tokenizer) staticmethod

Create a SpecialImageIDs from a Tokenizer.

Parameters:

- tokenizer (Tokenizer): The tokenizer to use. Required.

Returns:

- SpecialImageIDs: The special image token ids.

Source code in src/mistral_common/tokens/tokenizers/base.py
@staticmethod
def from_tokenizer(tokenizer: "Tokenizer") -> "SpecialImageIDs":
    r"""Create a `SpecialImageIDs` from a `Tokenizer`.

    Args:
        tokenizer: The tokenizer to use.

    Returns:
        The special image tokens ids.
    """
    return SpecialImageIDs(
        img=tokenizer.get_control_token(SpecialTokens.img.value),
        img_break=tokenizer.get_control_token(SpecialTokens.img_break.value),
        img_end=tokenizer.get_control_token(SpecialTokens.img_end.value),
    )
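The pattern above is just three `get_control_token` lookups. A self-contained sketch with a stub tokenizer shows the flow; the dataclass, the stub, and the bracketed token strings are illustrative assumptions, not the library's actual values.

```python
# Illustrative sketch of the from_tokenizer pattern, with a stub
# tokenizer. ImageIDs and StubTokenizer stand in for the real
# SpecialImageIDs and Tokenizer; the token strings are hypothetical.
from dataclasses import dataclass


@dataclass
class ImageIDs:
    img: int
    img_break: int
    img_end: int

    @staticmethod
    def from_tokenizer(tokenizer: "StubTokenizer") -> "ImageIDs":
        # Resolve each special-token string to its control-token id.
        return ImageIDs(
            img=tokenizer.get_control_token("[IMG]"),
            img_break=tokenizer.get_control_token("[IMG_BREAK]"),
            img_end=tokenizer.get_control_token("[IMG_END]"),
        )


class StubTokenizer:
    _control = {"[IMG]": 10, "[IMG_BREAK]": 11, "[IMG_END]": 12}

    def get_control_token(self, s: str) -> int:
        return self._control[s]
```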

SpecialTokens

Bases: str, Enum

[DEPRECATED] Enum of special tokens used in the tokenizer.

Attributes:

- unk: The unknown token.
- bos: The beginning of string token.
- eos: The end of string token.
- begin_inst: The beginning of instruction token.
- end_inst: The end of instruction token.
- begin_tools: The beginning of tools token.
- end_tools: The end of tools token.
- begin_tool_results: The beginning of tool results token.
- end_tool_results: The end of tool results token.
- tool_calls: The tool calls token.
- img: The image token.
- pad: The pad token.
- img_break: The image break token.
- img_end: The image end token.
- prefix: The prefix token for FIM.
- middle: The middle token for FIM.
- suffix: The suffix token for FIM.
- begin_system: The beginning of system prompt token.
- end_system: The end of system prompt token.
- begin_tool_content: The beginning of tool content token.

Examples:

>>> unk = SpecialTokens.unk

Tokenized(**data)

Bases: MistralBase

A tokenized InstructRequest.

Attributes:

- tokens (List[int]): The token ids.
- text (Optional[str]): The text representation of the tokens.
- prefix_ids (Optional[List[int]]): The prefix ids for FIM.
- images (List[ndarray]): The loaded images associated with the tokens.

Examples:

>>> tokenized = Tokenized(tokens=[1, 2, 3], text="Hello world", prefix_ids=[1], images=[])
Source code in .venv/lib/python3.13/site-packages/pydantic/main.py
def __init__(self, /, **data: Any) -> None:
    """Create a new model by parsing and validating input data from keyword arguments.

    Raises [`ValidationError`][pydantic_core.ValidationError] if the input data cannot be
    validated to form a valid model.

    `self` is explicitly positional-only to allow `self` as a field name.
    """
    # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    __tracebackhide__ = True
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
    if self is not validated_self:
        warnings.warn(
            'A custom validator is returning a value other than `self`.\n'
            "Returning anything other than `self` from a top level model validator isn't supported when validating via `__init__`.\n"
            'See the `model_validator` docs (https://docs.pydantic.dev/latest/concepts/validators/#model-validators) for more details.',
            stacklevel=2,
        )

Tokenizer

Bases: ABC

bos_id abstractmethod property

The id of the beginning-of-string (BOS) token.

eos_id abstractmethod property

The id of the end-of-string (EOS) token.

n_words abstractmethod property

The vocabulary size of the tokenizer.

pad_id abstractmethod property

The id of the padding token.

unk_id abstractmethod property

The id of the unknown (UNK) token.

version abstractmethod property

Get the version of the tokenizer.

decode(t) abstractmethod

Convert the token ids to a string.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def decode(self, t: List[int]) -> str:
    r"""Convert the token ids to a string."""

encode(s, bos, eos) abstractmethod

Convert a string to a list of token ids.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
    """Convert a string to a list of token ids."""

get_control_token(s) abstractmethod

Get the id of a control token.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def get_control_token(self, s: str) -> int:
    r"""Get the id of a control token."""

id_to_piece(token_id) abstractmethod

Convert a token id to the token str.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def id_to_piece(self, token_id: int) -> str:
    r"""Convert a token id to the token str."""

to_string(tokens) abstractmethod

Convert the token ids to a string for debugging purposes.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def to_string(self, tokens: List[int]) -> str:
    r"""Convert the token ids to a string for debugging purposes."""

vocab() abstractmethod

All tokens in the vocabulary as strings.

Source code in src/mistral_common/tokens/tokenizers/base.py
@abstractmethod
def vocab(self) -> List[str]:
    r"""All tokens in the vocabulary as strings."""
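To make the abstract interface concrete, here is a minimal, hypothetical subclass of a `Tokenizer`-style ABC: a whitespace tokenizer over a fixed vocabulary. It is a sketch of the shape of the interface only, not a real tokenizer (only a subset of the abstract members is shown).

```python
# Minimal illustrative Tokenizer-style subclass: whitespace splitting
# over a fixed vocabulary. A sketch of the ABC's shape, not a real
# tokenizer implementation.
from abc import ABC, abstractmethod
from typing import List


class BaseTokenizer(ABC):
    @abstractmethod
    def encode(self, s: str, bos: bool, eos: bool) -> List[int]: ...

    @abstractmethod
    def decode(self, t: List[int]) -> str: ...

    @abstractmethod
    def vocab(self) -> List[str]: ...


class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self, words: List[str]) -> None:
        # ids 0-2 are reserved for the special tokens.
        self._pieces = ["<unk>", "<s>", "</s>"] + words
        self._ids = {w: i for i, w in enumerate(self._pieces)}

    @property
    def bos_id(self) -> int:
        return 1

    @property
    def eos_id(self) -> int:
        return 2

    @property
    def n_words(self) -> int:
        return len(self._pieces)

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        # Unknown words map to <unk> (id 0).
        ids = [self._ids.get(w, 0) for w in s.split()]
        if bos:
            ids = [self.bos_id] + ids
        if eos:
            ids = ids + [self.eos_id]
        return ids

    def decode(self, t: List[int]) -> str:
        # Skip special-token ids when reconstructing text.
        return " ".join(self._pieces[i] for i in t if i > 2)

    def vocab(self) -> List[str]:
        return list(self._pieces)
```

Round-tripping `encode` and `decode` on in-vocabulary text recovers the input, which is the basic contract the abstract pair above implies.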

TokenizerVersion

Bases: str, Enum

Enum of tokenizer versions.

Allows distinguishing between different tokenizer versions and maintaining backward compatibility.

Attributes:

- v1: The first version of the tokenizer.
- v2: The second version of the tokenizer, which adds the special control tokens [INST], [/INST].
- v3: The third version of the tokenizer, which improves function calling.
- v7: The seventh version of the tokenizer, which improves the system prompt and function calling.

Examples:

>>> version = TokenizerVersion.v1
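Because the enum mixes in `str`, members compare equal to their string values and can be constructed from serialized version strings, which is what makes version-based dispatch convenient. A self-contained mimic (the real `TokenizerVersion` lives in mistral_common, and the dispatch rule below is an assumption for this sketch):

```python
# Illustrative mimic of a str-valued version Enum plus version-based
# dispatch. Not the real TokenizerVersion; the [INST] rule is an
# assumption for this sketch based on the v2 description above.
from enum import Enum


class Version(str, Enum):
    v1 = "v1"
    v2 = "v2"
    v3 = "v3"
    v7 = "v7"


def uses_inst_control_tokens(version: Version) -> bool:
    # From v2 onward, [INST]/[/INST] exist as special control tokens.
    return version != Version.v1
```

The `str` mixin means `Version("v3")` parses a stored config value directly and `Version.v2 == "v2"` holds, so serialized tokenizer metadata round-trips without extra conversion code.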