# mistral_common.tokens.tokenizers.tekken
## ModelData

Bases: `TypedDict`

The data of the tekken tokenizer model.

Attributes:

| Name | Type | Description |
|---|---|---|
| `vocab` | `List[TokenInfo]` | The vocabulary of the tokenizer. |
| `config` | `TekkenConfig` | The configuration of the tokenizer. |
| `version` | `int` | The version of the tokenizer. |
| `type` | `str` | The type of the tokenizer. |
| `multimodal` | `MultimodalConfig` | The multimodal configuration of the tokenizer. |
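As a rough orientation, the raw JSON of a tekken tokenizer file is expected to mirror these fields. The sketch below only peeks at the top-level keys; `tekken.json` is a hypothetical local path, and the exact key set depends on the tokenizer version.

```python
import json

# Minimal sketch: inspect the top-level keys of a tekken tokenizer file and
# compare them against the ModelData fields documented above.
with open("tekken.json", "r", encoding="utf-8") as f:  # hypothetical path
    raw = json.load(f)

print(sorted(raw.keys()))  # expected to include "config", "version", "vocab", ...
```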
## SpecialTokenInfo
## SpecialTokenPolicy

Bases: `Enum`

What to do with special tokens when encoding/decoding.

Attributes:

| Name | Description |
|---|---|
| `IGNORE` | Ignore special tokens. |
| `KEEP` | Keep special tokens. |
| `RAISE` | Raise an error if special tokens are found. |
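A minimal sketch of switching the policy at runtime via the writable `special_token_policy` property of `Tekkenizer` (documented below). `tekken.json` is a hypothetical local path, and the default policy of a freshly loaded tokenizer is not asserted here.

```python
from mistral_common.tokens.tokenizers.tekken import SpecialTokenPolicy, Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path
ids = tokenizer.encode("Hello", bos=True, eos=True)

tokenizer.special_token_policy = SpecialTokenPolicy.IGNORE
print(tokenizer.decode(ids))  # BOS/EOS dropped from the output

tokenizer.special_token_policy = SpecialTokenPolicy.KEEP
print(tokenizer.decode(ids))  # BOS/EOS kept in the output
```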
## TekkenConfig

Bases: `TypedDict`

Tekken configuration in the JSON file.

Attributes:

| Name | Type | Description |
|---|---|---|
| `pattern` | `str` | The pattern of the tokenizer. |
| `num_vocab_tokens` | `int` | The number of vocabulary tokens. |
| `default_vocab_size` | `int` | The default vocabulary size. |
| `default_num_special_tokens` | `int` | The default number of special tokens. |
| `version` | `str` | The version of the tokenizer. |
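A hedged sketch of what a `TekkenConfig` mapping might look like; every value below is an illustrative placeholder, not taken from a released tokenizer file.

```python
from mistral_common.tokens.tokenizers.tekken import TekkenConfig

config: TekkenConfig = {
    "pattern": r"\p{L}+|\p{N}+|\s+|.",  # placeholder split pattern, not the real one
    "num_vocab_tokens": 150_000,
    "default_vocab_size": 131_072,
    "default_num_special_tokens": 1_000,
    "version": "v3",
}

# Ordinary vocabulary entries plus the reserved special-token slots make up the
# effective vocabulary size (an assumption about how the fields relate).
print(config["default_vocab_size"] - config["default_num_special_tokens"])
```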
## Tekkenizer(vocab, special_tokens, pattern, vocab_size, num_special_tokens, version, *, name='tekkenizer', _path=None, mm_config=None)

Bases: `Tokenizer`

Tekken tokenizer.

This tokenizer is based on the tiktoken library. It speeds up tokenization across multiple languages.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `vocab` | `List[TokenInfo]` | The vocabulary of the tokenizer. | *required* |
| `special_tokens` | `List[SpecialTokenInfo]` | The special tokens of the tokenizer. | *required* |
| `pattern` | `str` | The pattern of the tokenizer. | *required* |
| `vocab_size` | `int` | The vocabulary size of the tokenizer. | *required* |
| `num_special_tokens` | `int` | The number of special tokens of the tokenizer. | *required* |
| `version` | `TokenizerVersion` | The version of the tokenizer. | *required* |
| `name` | `str` | The name of the tokenizer. | `'tekkenizer'` |
| `mm_config` | `Optional[MultimodalConfig]` | The multimodal configuration of the tokenizer. | `None` |
### bos_id `cached property`

The beginning of sentence token id.

### eos_id `cached property`

The end of sentence token id.

### multimodal `writable property`

The multimodal configuration of the tokenizer.

### n_words `property`

Vocabulary size of the tokenizer.

### num_special_tokens `property`

The number of special tokens of the tokenizer.

### pad_id `cached property`

The padding token id.

### special_token_policy `writable property`

The policy for handling special tokens.

### unk_id `cached property`

The unknown token id.

### version `property`

The version of the tokenizer.
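A small sketch that loads a tokenizer and reads a few of these properties; `tekken.json` is a hypothetical local path.

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path

print(tokenizer.bos_id, tokenizer.eos_id, tokenizer.pad_id, tokenizer.unk_id)
print(tokenizer.n_words)             # total vocabulary size
print(tokenizer.num_special_tokens)  # reserved special-token slots
print(tokenizer.version)             # TokenizerVersion of the loaded file
```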
### decode(tokens)

Decode a list of token ids into a string.
### encode(s, bos, eos)

Encode a string into a list of token ids.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `s` | `str` | The string to encode. | *required* |
| `bos` | `bool` | Whether to add the beginning of sentence token. | *required* |
| `eos` | `bool` | Whether to add the end of sentence token. | *required* |

Returns:

| Type | Description |
|---|---|
| `List[int]` | The list of token ids. |
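A minimal round-trip sketch, assuming a local tekken.json file (hypothetical path):

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path

ids = tokenizer.encode("Hello, world!", bos=True, eos=True)
print(ids[0] == tokenizer.bos_id)   # True: BOS prepended
print(ids[-1] == tokenizer.eos_id)  # True: EOS appended

print(tokenizer.decode(ids))
```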
### from_file(path) `classmethod`

Load the tekken tokenizer from a file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Union[str, Path]` | The path to the tokenizer file. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tekkenizer` | The tekken tokenizer. |
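For example (hypothetical location; `from_file` accepts either a `str` or a `pathlib.Path`):

```python
from pathlib import Path

from mistral_common.tokens.tokenizers.tekken import Tekkenizer

path = Path("~/models/tekken.json").expanduser()  # hypothetical location
tokenizer = Tekkenizer.from_file(path)
print(type(tokenizer).__name__)  # Tekkenizer
```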
### get_control_token(s)

Get the token id of a control token.
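A sketch of looking up a control token by its string form. The `"[INST]"` literal is an assumption about which special tokens the loaded file defines; the call may raise if the token is unknown.

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path

inst_id = tokenizer.get_control_token("[INST]")  # assumes this special token exists in the file
print(inst_id)
```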
### id_to_byte_piece(token_id)

Convert a token id to its byte representation.
### id_to_piece(token_id)

Convert a token id to its string representation.

### is_byte(token_id)

Check whether a token id corresponds to a byte token.
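A short inspection sketch (hypothetical path; the printed pieces depend on the loaded vocabulary):

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path

for token_id in tokenizer.encode("héllo", bos=False, eos=False):
    print(
        token_id,
        tokenizer.is_byte(token_id),            # True for byte tokens
        tokenizer.id_to_piece(token_id),        # string form
        tokenizer.id_to_byte_piece(token_id),   # raw byte form
    )
```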
### to_string(tokens)

Decode a list of token ids into a string, keeping special tokens, for debugging purposes.
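For instance, comparing `decode` and `to_string` on the same ids (hypothetical path; the exact output depends on the tokenizer file and the current special token policy):

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path

ids = tokenizer.encode("Hi", bos=True, eos=True)
print(tokenizer.decode(ids))     # regular decoding
print(tokenizer.to_string(ids))  # debugging view that keeps special tokens visible
```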
### vocab()

All tokens in the vocabulary as strings.

Note: this will collapse all tokens for which we have a decoding error into the `<?>` string. This is bad and results in things like `len(set(vocab)) != len(vocab)`.

Returns:

| Type | Description |
|---|---|
| `List[str]` | The vocabulary of the tokenizer. |
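A quick check of the caveat above (hypothetical path):

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

tokenizer = Tekkenizer.from_file("tekken.json")  # hypothetical path

vocab = tokenizer.vocab()
print(len(vocab))                     # number of string entries returned
print(len(set(vocab)) == len(vocab))  # may be False because of the <?> collapsing noted above
```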
## TokenInfo
## is_tekken(path)

Check if the given path is a tekken tokenizer file.
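A usage sketch (hypothetical path):

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer, is_tekken

path = "tekken.json"  # hypothetical path
if is_tekken(path):
    tokenizer = Tekkenizer.from_file(path)
else:
    print(f"{path} does not look like a tekken tokenizer file")
```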