# Mistral-common Experimental API
## Context

The `experimental` module provides access to a FastAPI server designed to handle tokenization and detokenization operations through a REST API. This server features:
- Tokenization: Converting chat completion requests into token sequences
- Detokenization: Converting token sequences back into a human-readable form, either an AssistantMessage object or a raw string.
- Chat Completion: Generating an AssistantMessage from a ChatCompletionRequest using an engine backend.
This API serves as a bridge between different providers and their tokenization needs, regardless of the programming language each provider uses.
## Features
- Tokenizer Version Support: Handles all our tokenizer versions automatically
- Tool Call Parsing: Automatically extracts tool calls from token sequences
- Think Chunk Support: Handles think chunks for v13+ tokenizers
- OpenAI Compatibility: Accepts OpenAI-formatted requests for easier integration
## Error Handling
The API provides detailed error messages for:
- Invalid token sequences
- Empty token lists
- Missing tokenizer configuration
- Validation errors in requests
- Tool call parsing failures
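For instance, posting an empty token list to the detokenize route should return an error response. A minimal sketch of inspecting it (assuming the server keeps FastAPI's default error format with a `detail` field; the exact status code may differ):

```python
import requests

# Sending an empty token list should trigger a validation error.
response = requests.post("http://localhost:8000/v1/detokenize/", json=[])

if not response.ok:
    # FastAPI error responses typically carry a "detail" field
    # describing what went wrong.
    print(response.status_code, response.json().get("detail"))
```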
## Usage
### Installation
To use the experimental API server, you need to install mistral-common with the appropriate dependencies:
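For example (this assumes the server dependencies ship under a `server` extra; check the project README for the exact extra name):

```bash
pip install "mistral-common[server]"
```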
### Launching the Server
You can launch the server using the following CLI command:
```bash
mistral_common serve mistralai/Magistral-Small-2507 [validation_mode] \
    --host 127.0.0.1 --port 8000 \
    --engine-url http://127.0.0.1:8080 --engine-backend llama_cpp \
    --timeout 60
```
#### Command Line Options

- `tokenizer_path`: Path to the tokenizer. Can be a Hugging Face model ID or a local path.
- `validation_mode`: Validation mode to use, one of `"test"`, `"finetuning"`, or `"serving"` (optional, defaults to `"test"`).
- `--host`: API host (default: `127.0.0.1`).
- `--port`: API port (default: `0`, which auto-selects an available port).
- `--engine-url`: URL of the engine (default: `http://127.0.0.1:8080`).
- `--engine-backend`: Engine backend to use (default: `llama_cpp`).
- `--timeout`: Timeout for the engine (default: `60`).
## Available Routes
### Documentation

- Route: `/` or `/docs`
- Method: GET
- Description: The Swagger documentation

The Swagger UI provides interactive documentation for all available endpoints.
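As a quick smoke test, you can also fetch the OpenAPI schema, which FastAPI serves at `/openapi.json` by default (assuming the server does not override that path):

```python
import requests

# Fetch the auto-generated OpenAPI schema to confirm the server is up.
schema = requests.get("http://localhost:8000/openapi.json").json()
print(sorted(schema["paths"]))  # lists all available routes
```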
### Tokenization

#### Tokenize Request

- Route: `/v1/tokenize/request`
- Method: POST
- Description: Tokenizes a chat completion request
- Request Body: Chat completion request in either:
    - Mistral-common format (ChatCompletionRequest)
    - OpenAI-compatible format (OpenAIChatCompletionRequest). Streaming is not supported.
- Response: List of token IDs
Example requests:
```python
import requests
from fastapi.encoders import jsonable_encoder
from mistral_common.protocol.instruct.messages import SystemMessage, TextChunk, ThinkChunk, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Simple request
response = requests.post("http://localhost:8000/v1/tokenize/request", json={
    "messages": [
        {"role": "user", "content": "Hello, how are you?"}
    ]
})
print(response.json())
# [1, 3, 22177, 1044, 2606, 1584, 1636, 1063, 4]

# Complex request with think chunks
system_message = SystemMessage(
    content=[
        TextChunk(
            text=(
                "First draft your thinking process (inner monologue) until you arrive at a response. Format your "
                "response using Markdown, and use LaTeX for any mathematical equations. Write both your thoughts and "
                "the response in the same language as the input.\n\nYour thinking process must follow the template "
                "below:"
            )
        ),
        ThinkChunk(
            thinking=(
                "Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and "
                "as long as you want until you are confident to generate the response. Use the same language as the "
                "input."
            ),
            closed=True,
        ),
        TextChunk(text="Here, provide a self-contained response."),
    ],
)

response = requests.post(
    "http://localhost:8000/v1/tokenize/request",
    json=jsonable_encoder(
        ChatCompletionRequest(messages=[
            system_message,
            UserMessage(content="How many 'r' are there in strawberry ?"),
        ])
    ),
)
print(response.json())
# [1, 17, 10107, 19564, 2143, ..., 33681, 3082, 4]
```
### Detokenization

#### Detokenize to Assistant Message

- Route: `/v1/detokenize/`
- Method: POST
- Description: Converts tokens to an assistant message with tool call parsing
- Request Body: List of token IDs
- Response: AssistantMessage object with parsed content and tool calls
Example requests:
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/detokenize/",
    json=[1, 3, 22177, 1044, 2606, 1584, 1636, 1063, 4],
)
print(response.json())
# {'role': 'assistant', 'content': [{'type': 'text', 'text': 'Hello, how are you?'}], 'tool_calls': None, 'prefix': True}

# With think chunks
response = requests.post(
    "http://localhost:8000/v1/detokenize/",
    json=[12598, 1639, 3648, 2314, 1494, 1046, 34, 6958, 1584, 1032, 1051, 1576, 1114, 1039, 1294, 51567, 33681, 3082, 35, 19587, 1051, 19587, 2],
)
print(response.json())
# {'role': 'assistant', 'content': [{'type': 'text', 'text': 'Let me think about it.'}, {'type': 'thinking', 'thinking': "There are 3 'r' in strawberry ?", 'closed': True}, {'type': 'text', 'text': '$$3$$'}], 'tool_calls': None, 'prefix': False}

# With tool calls
response = requests.post(
    "http://localhost:8000/v1/detokenize/",
    json=[9, 12296, 1095, 99571, 32, 38985, 1039, 3494, 4550, 1576, 1314, 3416, 33681, 2096, 1576, 27965, 4550, 1576, 1114, 1039, 27024, 2],
)
print(response.json())
# {'role': 'assistant', 'content': None, 'tool_calls': [{'id': 'null', 'type': 'function', 'function': {'name': 'count_letters', 'arguments': "{'word': 'strawberry', 'letter': 'r'}"}}], 'prefix': False}
```
#### Detokenize to String

- Route: `/v1/detokenize/string`
- Method: POST
- Description: Converts tokens to a raw string
- Request Body:
    - `tokens`: List of token IDs
    - `special_token_policy`: Policy for handling special tokens. See SpecialTokenPolicy.
- Response: Detokenized string
Example request:
```python
import requests

response = requests.post("http://localhost:8000/v1/detokenize/string", json={
    "tokens": [1, 3, 22177, 1044, 2606, 1584, 1636, 1063, 4],
    "special_token_policy": 1,
})
print(response.json())
# <s>[INST]Hello, how are you?[/INST]
```
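The integer maps onto the SpecialTokenPolicy enum from mistral-common; a short sketch of how the values are commonly interpreted (worth verifying against the mistral-common source for your version):

```python
from mistral_common.tokens.tokenizers.base import SpecialTokenPolicy

# IGNORE (0) drops special tokens, KEEP (1) renders them as text
# (as in the example above, where <s> and [INST] appear in the output),
# and RAISE (2) errors if special tokens are present.
print(SpecialTokenPolicy.KEEP.value)  # 1
```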
### Chat Completions

- Route: `/v1/chat/completions`
- Method: POST
- Description: Generates text from a chat completion request. This endpoint forwards the request to the engine server.
- Request Body: Chat completion request in either:
    - Mistral-common format (ChatCompletionRequest)
    - OpenAI-compatible format (OpenAIChatCompletionRequest). Streaming is not supported.
- Response: Chat completion response in OpenAI-compatible format
Example request:
```python
import requests

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "messages": [
        {"role": "user", "content": "Hello, how are you?"}
    ]
})
print(response.json())
# {'role': 'assistant', 'content': 'Hello! I am doing well, thank you for asking.'}
```
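Because the route is OpenAI-compatible, the official `openai` Python client can also target it. A sketch, assuming no authentication is enforced (the `api_key` value is a placeholder) and keeping in mind that streaming is not supported:

```python
from openai import OpenAI

# Point the client at the mistral-common server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

completion = client.chat.completions.create(
    model="mistralai/Magistral-Small-2507",  # assumption: the engine may ignore this field
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(completion.choices[0].message.content)
```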
## Issues and feedback
If you encounter any issues or have feedback, please open an issue or discussion on the GitHub repository.