GET /v1/models
List all available models, both loaded and registered. The response matches the OpenAI models list format.
```bash
curl http://localhost:8000/v1/models
```
Response
```json
{
  "object": "list",
  "data": [
    {
      "id": "my-llama-7b",
      "object": "model",
      "created": 1708900000,
      "owned_by": "npu-stack"
    }
  ]
}
```
POST /v1/chat/completions
Generate chat completions. Supports streaming via Server-Sent Events (SSE). Models are auto-loaded on first request.
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | ✅ | Model name (as registered) |
| `messages` | array | ✅ | Array of `{role, content}` objects |
| `temperature` | float | | Sampling temperature (default: 0.7) |
| `max_tokens` | int | | Max tokens to generate (default: 256) |
| `stream` | bool | | Enable SSE streaming (default: false) |
| `top_p` | float | | Nucleus sampling (default: 1.0) |
| `stop` | array | | Stop sequences |
Example Request
```json
{
  "model": "my-llama-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an NPU?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256
}
```
Response
```json
{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1708900000,
  "model": "my-llama-7b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "An NPU (Neural Processing Unit) is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
```
Streaming Response
When `stream: true` is set, the response is returned as a series of SSE events:

```
data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" NPU"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}
data: [DONE]
```
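A client consumes the stream by splitting the body on newlines and decoding each `data:` line. A minimal sketch in Python (the field names are taken from the sample events above; the helper names are illustrative):

```python
import json

def parse_sse_chunk(line: str):
    """Decode one SSE line; return the event dict, or None for
    the [DONE] sentinel and for non-data lines (keep-alives, comments)."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    return json.loads(payload)

def collect_content(lines):
    """Accumulate the assistant text from a sequence of SSE lines."""
    parts = []
    for line in lines:
        event = parse_sse_chunk(line)
        if event is None:
            continue
        delta = event["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```

Applied to the three sample events above, `collect_content` yields `"An NPU"`.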
POST /v1/completions
Legacy text completion endpoint. Same parameters as chat completions, but uses `prompt` instead of `messages`.
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | ✅ | Model name |
| `prompt` | string | ✅ | Text prompt to complete |
| `max_tokens` | int | | Max tokens (default: 256) |
| `stream` | bool | | Enable streaming |
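As a sketch, the request body can be assembled like this (defaults taken from the table above; the helper name is illustrative, not part of the API):

```python
def completion_body(model: str, prompt: str, max_tokens: int = 256,
                    stream: bool = False) -> dict:
    """Build the JSON payload for POST /v1/completions."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,
    }
```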
POST /v1/embeddings
Generate text embeddings for semantic search, RAG, and similarity matching.
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | ✅ | Embedding model name |
| `input` | string \| array | ✅ | Text or array of texts to embed |
Response
```json
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0023, -0.0091, 0.0152, ...]
  }],
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
```
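For semantic search and similarity matching, the returned vectors are typically compared with cosine similarity. A minimal, dependency-free sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Rank candidate texts for a query by embedding all of them, then sorting candidates by `cosine_similarity(query_vec, candidate_vec)` descending.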
POST /v1/models/load
Pre-load a model into memory. Useful for warming up before serving.
```bash
curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'
```
POST /v1/models/unload
Unload a model from memory to free resources.
```bash
curl -X POST http://localhost:8000/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'
```
GET /v1/models/status
Get status of all loaded models including uptime and type.
POST /api/finetune/start
Start a fine-tuning job with LoRA/QLoRA. Requires the `peft` and `datasets` packages.
| Field | Type | Default | Description |
|---|---|---|---|
| `model_id` | int | — | Base model ID from registry |
| `dataset` | string | — | Dataset name or path |
| `epochs` | int | 3 | Training epochs |
| `batch_size` | int | 4 | Batch size per device |
| `learning_rate` | float | 2e-4 | Learning rate |
| `use_lora` | bool | true | Enable LoRA adapters |
| `lora_r` | int | 16 | LoRA rank |
| `lora_alpha` | int | 32 | LoRA alpha |
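A sketch of assembling the start payload with the documented defaults (the helper name is illustrative; only `model_id` and `dataset` have no default and must be supplied):

```python
def finetune_request(model_id: int, dataset: str, **overrides) -> dict:
    """Build the /api/finetune/start payload, filling in the table's defaults."""
    body = {
        "model_id": model_id,
        "dataset": dataset,
        "epochs": 3,
        "batch_size": 4,
        "learning_rate": 2e-4,
        "use_lora": True,
        "lora_r": 16,
        "lora_alpha": 32,
    }
    body.update(overrides)  # e.g. epochs=5, lora_r=8
    return body
```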
GET /api/finetune/jobs
List all fine-tuning jobs with their status (initializing, running, completed, failed).
GET /api/finetune/status/{job_id}
Get detailed status, metrics (loss, learning rate per step), and log output.
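Since jobs run asynchronously, clients typically poll this endpoint until a terminal state is reached. A hedged sketch (the terminal states come from the job list above; the status fetcher is injected so the loop stays testable and transport-agnostic):

```python
import time

def wait_for_job(fetch_status, poll_interval: float = 5.0,
                 timeout: float = 3600.0) -> dict:
    """Poll a fine-tuning job until it completes or fails.

    fetch_status: zero-argument callable returning the parsed response of
    GET /api/finetune/status/{job_id} as a dict with a "status" key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("fine-tuning job did not reach a terminal state")
```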
Python Example
```python
# Using the official OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"
)

# Non-streaming
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Embeddings
embeddings = client.embeddings.create(
    model="my-embedding-model",
    input="Hello world"
)
print(len(embeddings.data[0].embedding))
```
JavaScript Example
```javascript
// Using fetch (works in browser and Node.js)
const response = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);

// Streaming: read the SSE body with a stream reader
// (EventSource only supports GET, so fetch is used for POST bodies)
const sse = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true
  })
});
const reader = sse.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(decoder.decode(value));
}
```
cURL Example
```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# With streaming
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models
curl http://localhost:8000/v1/models

# Load a model
curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'
```
LangChain Example
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any",
    model="my-model",
    temperature=0.7
)

response = llm.invoke("What is an NPU?")
print(response.content)
```