mistral_common.tokens.tokenizers.tekken
ModelData
Bases: TypedDict
The data of the tekken tokenizer model.
Attributes:

Name | Type | Description |
---|---|---|
vocab | List[TokenInfo] | The vocabulary of the tokenizer. |
config | TekkenConfig | The configuration of the tokenizer. |
version | int | The version of the tokenizer. |
type | str | The type of the tokenizer. |
image | ImageConfig | The image configuration of the tokenizer. |
SpecialTokenInfo
TekkenConfig
Bases: TypedDict
Tekken configuration in the JSON file.
Attributes:

Name | Type | Description |
---|---|---|
pattern | str | The pattern of the tokenizer. |
num_vocab_tokens | int | The number of vocabulary tokens. |
default_vocab_size | int | The default vocabulary size. |
default_num_special_tokens | int | The default number of special tokens. |
version | str | The version of the tokenizer. |
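For orientation, here is a minimal sketch of the shape described by ModelData and TekkenConfig above. Every concrete value is a placeholder rather than data from a real model file, and the TokenInfo entries inside vocab are elided because their fields are documented separately.

```python
# Hypothetical contents of a tekken model file, following ModelData/TekkenConfig above.
# Every concrete value below is a placeholder, not taken from a real model.
model_data = {
    "vocab": [],                 # List[TokenInfo] entries (fields documented separately)
    "config": {
        "pattern": r"\S+|\s+",                # placeholder regex split pattern
        "num_vocab_tokens": 100_000,          # placeholder
        "default_vocab_size": 131_072,        # placeholder
        "default_num_special_tokens": 1_000,  # placeholder
        "version": "v3",                      # placeholder version string
    },
    "version": 1,                # placeholder model-data version (int)
    "type": "Tekken",            # placeholder type string
    "image": None,               # optional ImageConfig section
}
```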
Tekkenizer(vocab, special_tokens, pattern, vocab_size, num_special_tokens, version, *, name='tekkenizer', _path=None, image_config=None, audio_config=None)
Bases: Tokenizer
Tekken tokenizer.
This tokenizer is based on the tiktoken library. It speeds up tokenization across multiple languages.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
vocab | List[TokenInfo] | The vocabulary of the tokenizer. | required |
special_tokens | List[SpecialTokenInfo] | The special tokens of the tokenizer. | required |
pattern | str | The pattern of the tokenizer. | required |
vocab_size | int | The vocabulary size of the tokenizer. | required |
num_special_tokens | int | The number of special tokens of the tokenizer. | required |
version | TokenizerVersion | The version of the tokenizer. | required |
name | str | The name of the tokenizer. | 'tekkenizer' |
image_config | Optional[ImageConfig] | The image configuration of the tokenizer. | None |
Source code in src/mistral_common/tokens/tokenizers/tekken.py
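Instances are normally obtained by loading a serialized model rather than by calling the constructor directly. A minimal loading sketch, using the from_file classmethod documented below and a hypothetical tekken.json path:

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer

# "tekken.json" is a placeholder path to a serialized tekken model file.
tokenizer = Tekkenizer.from_file("tekken.json")
print(tokenizer.n_words)  # vocabulary size of the loaded tokenizer
```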
audio property writable

The audio configuration of the tokenizer.

Returns:

Type | Description |
---|---|
Optional[AudioConfig] | The audio configuration object if it exists, otherwise None. |
bos_id cached property

The beginning of sentence token id.

eos_id cached property

The end of sentence token id.

file_path property

The path to the tokenizer file.

image property writable

The image configuration of the tokenizer.

n_words property

Vocabulary size of the tokenizer.

num_special_tokens property

The number of special tokens of the tokenizer.

pad_id cached property

The padding token id.

special_token_policy property writable

The policy for handling special tokens.

unk_id cached property

The unknown token id.

version property

The version of the tokenizer.
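A short sketch of reading the properties above, assuming tokenizer is the instance loaded in the earlier sketch:

```python
# Assumes `tokenizer` was loaded as in the earlier sketch.
print(tokenizer.bos_id, tokenizer.eos_id)  # beginning/end of sentence token ids
print(tokenizer.pad_id, tokenizer.unk_id)  # padding and unknown token ids
print(tokenizer.num_special_tokens)        # number of special tokens
print(tokenizer.version)                   # version of the tokenizer
```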
decode(tokens, special_token_policy=None)

Decode a list of token ids into a string.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
tokens | List[int] | The list of token ids to decode. | required |
special_token_policy | Optional[SpecialTokenPolicy] | The policy for handling special tokens. If None, the tokenizer's special_token_policy attribute is used. | None |

Returns:

Type | Description |
---|---|
str | The decoded string. |
Source code in src/mistral_common/tokens/tokenizers/tekken.py
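A round-trip sketch, assuming the tokenizer instance from the earlier examples; leaving special_token_policy as None falls back to the tokenizer's own special_token_policy attribute:

```python
# Assumes `tokenizer` was loaded as in the earlier sketch.
ids = tokenizer.encode("Hello, world!", bos=True, eos=True)
text = tokenizer.decode(ids)  # special_token_policy=None -> tokenizer.special_token_policy
print(text)
```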
encode(s, bos, eos)

Encode a string into a list of token ids.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
s | str | The string to encode. | required |
bos | bool | Whether to add the beginning of sentence token. | required |
eos | bool | Whether to add the end of sentence token. | required |

Returns:

Type | Description |
---|---|
List[int] | The list of token ids. |
Source code in src/mistral_common/tokens/tokenizers/tekken.py
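A sketch of the bos/eos flags, assuming the same tokenizer instance; each flag should only control whether the corresponding special token id is added around the encoded text:

```python
# Assumes `tokenizer` was loaded as in the earlier sketch.
with_markers = tokenizer.encode("Hello", bos=True, eos=True)
plain = tokenizer.encode("Hello", bos=False, eos=False)
assert with_markers[0] == tokenizer.bos_id
assert with_markers[-1] == tokenizer.eos_id
assert with_markers[1:-1] == plain
```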
from_file(path) classmethod

Load the tekken tokenizer from a file.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
path | Union[str, Path] | The path to the tokenizer file. | required |

Returns:

Type | Description |
---|---|
Tekkenizer | The tekken tokenizer. |
Source code in src/mistral_common/tokens/tokenizers/tekken.py
get_control_token(s)
Get the token id of a control token.
Source code in src/mistral_common/tokens/tokenizers/tekken.py
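A hedged sketch, assuming the string "<s>" is registered as a control (special) token in the loaded model, which is typical but not guaranteed; an unknown string would fail:

```python
# Assumes `tokenizer` was loaded earlier and that "<s>" is one of its special tokens.
bos_token_id = tokenizer.get_control_token("<s>")
print(bos_token_id)  # expected to match tokenizer.bos_id when "<s>" is the BOS token
```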
id_to_byte_piece(token_id, special_token_policy=None)

Convert a token id to its byte representation.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
token_id | int | The token id to convert. | required |
special_token_policy | Optional[SpecialTokenPolicy] | The policy for handling special tokens. If None, the tokenizer's special_token_policy attribute is used. | None |

Returns:

Type | Description |
---|---|
bytes | The byte representation of the token. |
Source code in src/mistral_common/tokens/tokenizers/tekken.py
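A sketch that inspects the byte pieces of an encoded string, assuming the same tokenizer instance; for ordinary text, concatenating the byte pieces should reconstruct the original UTF-8 bytes:

```python
# Assumes `tokenizer` was loaded as in the earlier sketch.
ids = tokenizer.encode("héllo", bos=False, eos=False)
pieces = [tokenizer.id_to_byte_piece(i) for i in ids]
print(pieces)  # list of bytes objects, one per token id
print(b"".join(pieces).decode("utf-8"))  # should print "héllo"
```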
id_to_piece(token_id)

Convert a token id to its string representation.

is_byte(token_id)

Check whether a token id corresponds to a byte token.
to_string(tokens)

[DEPRECATED] Converts a list of token ids into a string, keeping special tokens.

Use decode with special_token_policy=SpecialTokenPolicy.KEEP instead. This is a convenience method for debugging.
Source code in src/mistral_common/tokens/tokenizers/tekken.py
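The documented replacement as a sketch; SpecialTokenPolicy is assumed to be importable from mistral_common.tokens.tokenizers.base (the import path is not shown on this page):

```python
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy  # assumed import path

# Equivalent of the deprecated to_string(tokens): keep special tokens in the decoded text.
ids = tokenizer.encode("Hello", bos=True, eos=True)
debug_text = tokenizer.decode(ids, special_token_policy=SpecialTokenPolicy.KEEP)
print(debug_text)  # includes the BOS/EOS special tokens as text
```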
vocab()

All tokens in the vocabulary as strings.

Note: This will collapse all tokens for which we have a decoding error into the <?> string. This is bad and results in things like len(set(vocab)) != len(vocab).

Returns:

Type | Description |
---|---|
List[str] | The vocabulary of the tokenizer. |
Source code in src/mistral_common/tokens/tokenizers/tekken.py
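A sketch illustrating the note above, assuming the same tokenizer instance:

```python
# Assumes `tokenizer` was loaded as in the earlier sketch.
vocab = tokenizer.vocab()
print(len(vocab))       # total number of entries returned
print(len(set(vocab)))  # may be smaller: undecodable tokens collapse to "<?>"
```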
TokenInfo
is_tekken(path)
Check if the given path is a tekken tokenizer file.
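A sketch that guards loading on this check, with a placeholder path:

```python
from mistral_common.tokens.tokenizers.tekken import Tekkenizer, is_tekken

path = "tekken.json"  # placeholder path
if is_tekken(path):
    tokenizer = Tekkenizer.from_file(path)
else:
    raise ValueError(f"{path} does not look like a tekken tokenizer file")
```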