mistral_common.tokens.tokenizers.tekken

ModelData

Bases: TypedDict

The data of the tekken tokenizer model.

Attributes:

vocab (List[TokenInfo]): The vocabulary of the tokenizer.

config (TekkenConfig): The configuration of the tokenizer.

version (int): The version of the tokenizer.

type (str): The type of the tokenizer.

multimodal (MultimodalConfig): The multimodal configuration of the tokenizer.
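As a rough illustration of this layout, the sketch below builds a ModelData-shaped dict by hand. All values are made up; the vocab entries follow the TokenInfo fields and the config follows the TekkenConfig fields documented further down this page.

# Illustrative only: a ModelData-shaped dict with hypothetical values.
model_data = {
    "vocab": [
        # list of TokenInfo entries (see TokenInfo below); "YQ==" is base64 for b"a"
        {"rank": 0, "token_bytes": "YQ==", "token_str": "a"},
    ],
    "config": {
        # TekkenConfig fields (see TekkenConfig below); values are hypothetical
        "pattern": r"\S+|\s+",
        "num_vocab_tokens": 1,
        "default_vocab_size": 1001,
        "default_num_special_tokens": 1000,
        "version": "v7",
    },
    "version": 1,
    "type": "Tekken",    # hypothetical type string
    "multimodal": None,  # or a MultimodalConfig for multimodal tokenizers
}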

SpecialTokenInfo

Bases: TypedDict

Special token information in the JSON file.

Attributes:

rank (int): The rank of the token.

token_str (str): The token in string format.

is_control (bool): Whether the token is a control token.
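For example, a single entry might look like the following sketch; the token string is hypothetical.

# Illustrative SpecialTokenInfo entry; the token string is made up.
special_token = {"rank": 0, "token_str": "<unk>", "is_control": True}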

SpecialTokenPolicy

Bases: Enum

What to do with special tokens when encoding/decoding.

Attributes:

IGNORE: Ignore special tokens.

KEEP: Keep special tokens.

RAISE: Raise an error if special tokens are found.
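A minimal sketch of how the policy affects decoding, assuming a tekken.json tokenizer file is available locally (the path is illustrative); the special_token_policy property of Tekkenizer is writable, as documented below.

from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy, Tekkenizer

tok = Tekkenizer.from_file("tekken.json")  # illustrative path
ids = tok.encode("Hello", bos=True, eos=True)

tok.special_token_policy = SpecialTokenPolicy.IGNORE  # default: drop special tokens
print(tok.decode(ids))                                # "Hello"

tok.special_token_policy = SpecialTokenPolicy.KEEP    # keep BOS/EOS strings in the output
print(tok.decode(ids))

tok.special_token_policy = SpecialTokenPolicy.RAISE   # error on special tokens
# tok.decode(ids)  # would raise, since ids contain the BOS and EOS tokens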

TekkenConfig

Bases: TypedDict

Tekken configuration in the JSON file.

Attributes:

pattern (str): The pattern of the tokenizer.

num_vocab_tokens (int): The number of vocabulary tokens.

default_vocab_size (int): The default vocabulary size.

default_num_special_tokens (int): The default number of special tokens.

version (str): The version of the tokenizer.

Tekkenizer(vocab, special_tokens, pattern, vocab_size, num_special_tokens, version, *, name='tekkenizer', _path=None, mm_config=None)

Bases: Tokenizer

Tekken tokenizer.

This tokenizer is based on the tiktoken library, which speeds up tokenization across multiple languages.

Parameters:

vocab (List[TokenInfo]): The vocabulary of the tokenizer. Required.

special_tokens (List[SpecialTokenInfo]): The special tokens of the tokenizer. Required.

pattern (str): The pattern of the tokenizer. Required.

vocab_size (int): The vocabulary size of the tokenizer. Required.

num_special_tokens (int): The number of special tokens of the tokenizer. Required.

version (TokenizerVersion): The version of the tokenizer. Required.

name (str): The name of the tokenizer. Defaults to 'tekkenizer'.

mm_config (Optional[MultimodalConfig]): The multimodal configuration of the tokenizer. Defaults to None.
Source code in src/mistral_common/tokens/tokenizers/tekken.py
def __init__(
    self,
    vocab: List[TokenInfo],
    special_tokens: List[SpecialTokenInfo],
    pattern: str,
    vocab_size: int,
    num_special_tokens: int,
    version: TokenizerVersion,
    *,
    name: str = "tekkenizer",
    _path: Optional[str] = None,
    mm_config: Optional[MultimodalConfig] = None,
):
    r"""Initialize the tekken tokenizer.

    Args:
        vocab: The vocabulary of the tokenizer.
        special_tokens: The special tokens of the tokenizer.
        pattern: The pattern of the tokenizer.
        vocab_size: The vocabulary size of the tokenizer.
        num_special_tokens: The number of special tokens of the tokenizer.
        version: The version of the tokenizer.
        name: The name of the tokenizer.
        mm_config: The multimodal configuration of the tokenizer.
    """
    assert vocab_size <= len(vocab) + num_special_tokens, (
        vocab_size,
        len(vocab),
        num_special_tokens,
    )
    self._vocab_size = vocab_size

    # The number of special tokens defined in the tokenizer json
    num_defined_special_tokens = len(set([t["token_str"] for t in special_tokens]))

    assert len(special_tokens) == num_defined_special_tokens, f"Special tokens must be unique: {special_tokens}"
    assert len(special_tokens) <= num_special_tokens

    special_filler = [
        SpecialTokenInfo(rank=i, token_str=self.SPECIAL_TOKEN_TEMPLATE.format(id=i), is_control=True)
        for i in range(len(special_tokens), num_special_tokens)
    ]
    if special_filler:
        logger.info(
            f"Adding special tokens {special_filler[0]['token_str']}, ..., {special_filler[-1]['token_str']}"
        )
    special_tokens = special_tokens + special_filler

    assert len(set([t["token_str"] for t in special_tokens])) == len(special_tokens) == num_special_tokens, (
        special_tokens
    )
    inner_vocab_size = vocab_size - num_special_tokens

    # reload vocab
    self._tekken_token2id_nospecial = _reload_mergeable_ranks(vocab, max_vocab=inner_vocab_size)
    assert set(range(inner_vocab_size)) == set(self._tekken_token2id_nospecial.values()), (
        inner_vocab_size,
        self._tekken_token2id_nospecial,
    )
    self._model = tiktoken.Encoding(
        name=name,
        pat_str=pattern,
        mergeable_ranks=self._tekken_token2id_nospecial,
        special_tokens={},  # special tokens are handled manually
    )

    self._version = version
    self._mm_config = mm_config
    self._all_special_tokens = special_tokens
    self._special_tokens_reverse_vocab = {t["token_str"]: t["rank"] for t in special_tokens}
    self._vocab = [self.id_to_piece(i) for i in range(vocab_size)]
    self._special_token_policy = SpecialTokenPolicy.IGNORE

bos_id cached property

The beginning of sentence token id.

eos_id cached property

The end of sentence token id.

multimodal property writable

The multimodal configuration of the tokenizer.

n_words property

Vocabulary size of the tokenizer.

num_special_tokens property

The number of special tokens of the tokenizer.

pad_id cached property

The padding token id.

special_token_policy property writable

The policy for handling special tokens.

unk_id cached property

The unknown token id.

version property

The version of the tokenizer.
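A short sketch that loads a tokenizer and inspects the properties above; the file name is illustrative.

from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tok = Tekkenizer.from_file("tekken.json")  # illustrative path

print(tok.n_words)             # vocabulary size, including special tokens
print(tok.num_special_tokens)  # number of ids reserved for special tokens
print(tok.bos_id, tok.eos_id)  # beginning / end of sentence token ids
print(tok.pad_id, tok.unk_id)  # padding and unknown token ids
print(tok.version)             # TokenizerVersion of the loaded file
print(tok.multimodal)          # MultimodalConfig, or None if not multimodal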

decode(tokens)

Decode a list of token ids into a string.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def decode(self, tokens: List[int]) -> str:
    r"""Decode a list of token ids into a string."""
    return "".join(self._decode_all(tokens, special_token_policy=self._special_token_policy))

encode(s, bos, eos)

Encode a string into a list of token ids.

Parameters:

s (str): The string to encode. Required.

bos (bool): Whether to add the beginning of sentence token. Required.

eos (bool): Whether to add the end of sentence token. Required.

Returns:

List[int]: The list of token ids.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
    r"""Encode a string into a list of token ids.

    Args:
        s: The string to encode.
        bos: Whether to add the beginning of sentence token.
        eos: Whether to add the end of sentence token.

    Returns:
        The list of token ids.
    """
    tokens: List[int] = self._model.encode(s)
    tokens = [t + self.num_special_tokens for t in tokens]
    if bos:
        tokens = [self.bos_id, *tokens]
    if eos:
        tokens = [*tokens, self.eos_id]
    return tokens
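A round-trip sketch, assuming an illustrative tekken.json file. Note that encode() shifts the raw tiktoken ids up by num_special_tokens, so ids below that value are reserved for special tokens.

from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tok = Tekkenizer.from_file("tekken.json")  # illustrative path

ids = tok.encode("Hello world", bos=True, eos=True)
# Every id is either a special token (here only BOS/EOS) or a shifted regular token.
assert all(i >= tok.num_special_tokens or i in (tok.bos_id, tok.eos_id) for i in ids)

print(tok.decode(ids))  # "Hello world" (special tokens dropped under the default policy)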

from_file(path) classmethod

Load the tekken tokenizer from a file.

Parameters:

path (Union[str, Path]): The path to the tokenizer file. Required.

Returns:

Tekkenizer: The tekken tokenizer.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
@classmethod
def from_file(cls: Type["Tekkenizer"], path: Union[str, Path]) -> "Tekkenizer":
    r"""Load the tekken tokenizer from a file.

    Args:
        path: The path to the tokenizer file.

    Returns:
        The tekken tokenizer.
    """
    if isinstance(path, str):
        path = Path(path)
    assert path.exists(), path
    with open(path, "r") as f:
        untyped = json.load(f)

    _version_str = untyped["config"].get("version")
    if _version_str not in TokenizerVersion.__members__:
        raise ValueError(
            f"Unknown version: {_version_str} in {path}. "
            f"Make sure to use a valid version string: {list(TokenizerVersion.__members__)}"
        )

    assert _version_str is not None
    version = TokenizerVersion(_version_str)

    special_tokens_dicts: Optional[List[SpecialTokenInfo]] = untyped.get("special_tokens", None)
    if special_tokens_dicts is None:
        err_msg = (
            f"Special tokens not found in {path} and default to {Tekkenizer.DEPRECATED_SPECIAL_TOKENS}. "
            "This behavior will be deprecated going forward. "
            "Please update your tokenizer file and include all special tokens you need."
        )
        # Tokenizer > v7 should find special tokens in the tokenizer file
        if version > TokenizerVersion("v7"):
            raise ValueError(err_msg)
        else:
            warnings.warn(
                err_msg,
                FutureWarning,
            )
            special_tokens = list(Tekkenizer.DEPRECATED_SPECIAL_TOKENS)
    else:
        special_tokens = [token for token in special_tokens_dicts]

    untyped["special_tokens"] = special_tokens

    if mm := untyped.get("multimodal", None):
        untyped["multimodal"] = MultimodalConfig(**mm)

    model_data: ModelData = untyped

    return cls(
        vocab=model_data["vocab"],
        special_tokens=special_tokens,
        pattern=model_data["config"]["pattern"],
        vocab_size=model_data["config"]["default_vocab_size"],
        num_special_tokens=model_data["config"]["default_num_special_tokens"],
        version=version,
        name=path.name.replace(".json", ""),
        mm_config=model_data.get("multimodal"),
    )

get_control_token(s)

Get the token id of a control token.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def get_control_token(self, s: str) -> int:
    r"""Get the token id of a control token."""
    if s in self._special_tokens_reverse_vocab:
        return self._special_tokens_reverse_vocab[s]
    else:
        raise ValueError(f"Unknown control token {s}")
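A sketch of looking up a control token by its string form; the token string below is hypothetical, since the available control tokens depend on the tokenizer file, and unknown strings raise ValueError.

from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tok = Tekkenizer.from_file("tekken.json")  # illustrative path

try:
    token_id = tok.get_control_token("[INST]")  # hypothetical control-token string
except ValueError:
    token_id = None  # the token is not defined in this tokenizer file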

id_to_byte_piece(token_id)

Convert a token id to its byte representation.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def id_to_byte_piece(self, token_id: int) -> bytes:
    r"""Convert a token id to its byte representation."""
    if token_id < self.num_special_tokens:
        if self._special_token_policy == SpecialTokenPolicy.KEEP:
            return self._all_special_tokens[token_id]["token_str"].encode("utf-8")
        elif self._special_token_policy == SpecialTokenPolicy.RAISE:
            raise ValueError(f"{token_id} is a special token")

    return self._model.decode_single_token_bytes(token_id - self.num_special_tokens)

id_to_piece(token_id)

Convert a token id to its string representation.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def id_to_piece(self, token_id: int) -> str:
    r"""Convert a token id to its string representation."""
    return self._decode_all([token_id], special_token_policy=SpecialTokenPolicy.KEEP)[0]

is_byte(token_id)

Check if a token id is a byte token.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def is_byte(self, token_id: int) -> bool:
    r"""Check if a token id is a byte token."""
    return 0 <= token_id - self.num_special_tokens < 256
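Taken together with id_to_byte_piece, this means the 256 byte-level tokens sit directly after the special-token id range. A sketch, with an illustrative file name:

from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tok = Tekkenizer.from_file("tekken.json")  # illustrative path

first_byte_id = tok.num_special_tokens
assert tok.is_byte(first_byte_id)            # first id after the special tokens
assert not tok.is_byte(first_byte_id - 1)    # last special-token id
assert not tok.is_byte(first_byte_id + 256)  # first merged (non-byte) token

print(tok.id_to_byte_piece(first_byte_id))   # raw bytes of that byte token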

to_string(tokens)

Decode a list of token ids into a string keeping special tokens for debugging purposes.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def to_string(self, tokens: List[int]) -> str:
    r"""Decode a list of token ids into a string keeping special tokens for debugging purposes."""
    return "".join(self._decode_all(tokens, special_token_policy=SpecialTokenPolicy.KEEP))

vocab()

All tokens in the vocabulary as strings.

Note

This will collapse all tokens for which we have a decoding error into the <?> string. This is bad and results in things like len(set(vocab)) != len(vocab).

Returns:

List[str]: The vocabulary of the tokenizer.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def vocab(self) -> List[str]:
    r"""All tokens in the vocabulary as strings.

    Note:
       This will collapse all tokens for which we have a decoding error into
       the <?> string. This is bad and results in things like len(set(vocab)) != len(vocab)).

    Returns:
        The vocabulary of the tokenizer.
    """
    # when returning self._vocab this will collapse
    # all tokens for which we have a decoding error into
    # the <?> string. This is bad and results in things
    # like len(set(vocab)) != len(vocab))
    # be careful when using self._vocab
    return self._vocab
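A sketch of the caveat above, assuming an illustrative tokenizer file: the list returned by vocab() can contain duplicate entries where decoding failed.

from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tok = Tekkenizer.from_file("tekken.json")  # illustrative path

pieces = tok.vocab()
print(len(pieces))       # the full vocabulary size
print(len(set(pieces)))  # may be smaller, because undecodable tokens collapse to <?>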

TokenInfo

Bases: TypedDict

Token information in the JSON file.

Attributes:

rank (int): The rank of the token.

token_bytes (str): The token in bytes, base64 encoded.

token_str (Optional[str]): The token in string format.
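An illustrative TokenInfo entry: token_bytes is the base64 encoding of the token's raw bytes, and token_str may be None when those bytes are not valid text on their own.

import base64

# Illustrative only: a hand-built TokenInfo entry.
token_info = {
    "rank": 42,
    "token_bytes": base64.b64encode(b"hello").decode("ascii"),  # "aGVsbG8="
    "token_str": "hello",
}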

is_tekken(path)

Check if the given path is a tekken tokenizer file.

Source code in src/mistral_common/tokens/tokenizers/tekken.py
def is_tekken(path: Union[str, Path]) -> bool:
    r"""Check if the given path is a tekken tokenizer file."""
    if isinstance(path, str):
        path = Path(path)
    return path.is_file() and "tekken" in path.name and path.suffix == ".json"
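A sketch that uses this check as a guard before loading; the file name is illustrative but follows the convention the function checks for (a .json file with "tekken" in its name).

from pathlib import Path

from mistral_common.tokens.tokenizers.tekken import Tekkenizer, is_tekken

path = Path("tekken_240718.json")  # illustrative file name
if is_tekken(path):
    tok = Tekkenizer.from_file(path)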