mistral_common.tokens.tokenizers.sentencepiece
SentencePieceTokenizer(model_path, tokenizer_version=None)
Bases: Tokenizer
SentencePiece tokenizer.
Parameters:

Name | Type | Description | Default
---|---|---|---
`model_path` | `Union[str, Path]` | The path to the tokenizer model. | *required*
`tokenizer_version` | `Optional[TokenizerVersion]` | The version of the tokenizer. If not provided, it will be inferred from the model path. | `None`
Source code in src/mistral_common/tokens/tokenizers/sentencepiece.py
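For illustration, a minimal construction sketch. The local model path is hypothetical, and the `TokenizerVersion` import location is an assumption; adapt both to your setup.

```python
from pathlib import Path

from mistral_common.tokens.tokenizers.base import TokenizerVersion
from mistral_common.tokens.tokenizers.sentencepiece import SentencePieceTokenizer

# Hypothetical local path to a downloaded SentencePiece model file.
model_path = Path("./tokenizer.model.v3")

# tokenizer_version omitted: it is inferred from the model path.
tokenizer = SentencePieceTokenizer(model_path)

# Or pin the version explicitly instead of relying on inference.
tokenizer = SentencePieceTokenizer(model_path, tokenizer_version=TokenizerVersion.v3)

# The properties below expose the model's basic metadata.
print(tokenizer.n_words, tokenizer.bos_id, tokenizer.eos_id, tokenizer.version)
```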
`bos_id` (cached property): The beginning-of-sentence token id.

`eos_id` (cached property): The end-of-sentence token id.

`file_path` (property): The path to the tokenizer model.

`n_words` (property): Vocabulary size of the tokenizer.

`pad_id` (property): The padding token id.

`unk_id` (property): The unknown token id.

`version` (property): The version of the tokenizer.
decode(tokens, special_token_policy=None)

Decode the given list of token ids into a string.

Note
Using `special_token_policy=SpecialTokenPolicy.KEEP` will keep the special tokens and render the normal tokens as their SentencePiece pieces.
Parameters:

Name | Type | Description | Default
---|---|---|---
`tokens` | `List[int]` | The list of token ids. | *required*
`special_token_policy` | `Optional[SpecialTokenPolicy]` | The policy to use for special tokens. | `None`
Returns:

Type | Description
---|---
`str` | The decoded string.
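A decoding sketch, assuming a tokenizer built from a hypothetical local model file; the `SpecialTokenPolicy` import location is an assumption:

```python
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy
from mistral_common.tokens.tokenizers.sentencepiece import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer("./tokenizer.model.v3")  # hypothetical path

token_ids = tokenizer.encode("Hello world", bos=True, eos=True)

# Default policy: decode back to plain text.
print(tokenizer.decode(token_ids))

# KEEP: special tokens survive, normal tokens appear as SentencePiece pieces.
print(tokenizer.decode(token_ids, special_token_policy=SpecialTokenPolicy.KEEP))
```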
encode(s, bos, eos)

Encode the given string into a list of token ids.

Parameters:

Name | Type | Description | Default
---|---|---|---
`s` | `str` | The string to encode. | *required*
`bos` | `bool` | Whether to add the beginning-of-sentence token. | *required*
`eos` | `bool` | Whether to add the end-of-sentence token. | *required*
Returns:

Type | Description
---|---
`List[int]` | The list of token ids.
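A sketch of how the `bos` and `eos` flags affect the result (hypothetical model path, as above):

```python
from mistral_common.tokens.tokenizers.sentencepiece import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer("./tokenizer.model.v3")  # hypothetical path

ids_plain = tokenizer.encode("Hello world", bos=False, eos=False)
ids_full = tokenizer.encode("Hello world", bos=True, eos=True)

# With bos=True the list starts with tokenizer.bos_id; with eos=True it ends
# with tokenizer.eos_id, so the full encoding is two ids longer.
assert ids_full[0] == tokenizer.bos_id
assert ids_full[-1] == tokenizer.eos_id
assert len(ids_full) == len(ids_plain) + 2
```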
get_control_token(s)

Get the token id of the given control (special) token string.

id_to_piece(token_id)

Convert the given token id to its SentencePiece piece.
to_string(tokens)
[DEPRECATED] Converts a list of token ids into a string, keeping special tokens.

Use decode with `special_token_policy=SpecialTokenPolicy.KEEP` instead.

This is a convenience method for debugging.
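A migration sketch for the deprecation above (hypothetical model path; `SpecialTokenPolicy` import location assumed):

```python
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy
from mistral_common.tokens.tokenizers.sentencepiece import SentencePieceTokenizer

tokenizer = SentencePieceTokenizer("./tokenizer.model.v3")  # hypothetical path
token_ids = tokenizer.encode("Hello", bos=True, eos=False)

# Deprecated:
# debug_str = tokenizer.to_string(token_ids)

# Preferred equivalent:
debug_str = tokenizer.decode(token_ids, special_token_policy=SpecialTokenPolicy.KEEP)
print(debug_str)
```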
get_image_config(tokenizer_filename)
Get the image config from the tokenizer filename.
get_spm_version(tokenizer_filename, raise_deprecated=False)
Get the version of the tokenizer from the filename.
is_sentencepiece(path)
Check if the given path is a SentencePiece model.
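A rough standalone sketch of the kind of filename check this performs. The exact suffix set lives in the library; `looks_like_sentencepiece` below is a hypothetical re-implementation for illustration only.

```python
from pathlib import Path


def looks_like_sentencepiece(path: str) -> bool:
    # Assumption for illustration: a SentencePiece model is recognised by its
    # filename, e.g. "tokenizer.model" or a versioned variant like
    # "tokenizer.model.v3" (the real helper also checks the file exists).
    name = Path(path).name
    return name.endswith(".model") or ".model.v" in name


print(looks_like_sentencepiece("tokenizer.model"))     # True
print(looks_like_sentencepiece("tokenizer.model.v3"))  # True
print(looks_like_sentencepiece("tekken.json"))         # False
```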