GET /v1/models
List all available models, both loaded and registered. The response matches the OpenAI models list format.
```bash
curl http://localhost:8000/v1/models
```
Response
```json
{
  "object": "list",
  "data": [
    {
      "id": "my-llama-7b",
      "object": "model",
      "created": 1708900000,
      "owned_by": "npu-stack"
    }
  ]
}
```
POST /v1/chat/completions
Generate chat completions. Supports streaming via Server-Sent Events (SSE). Models are auto-loaded on first request.
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | ✅ | Model name (as registered) |
| `messages` | array | ✅ | Array of `{role, content}` objects |
| `temperature` | float | | Sampling temperature (default: 0.7) |
| `max_tokens` | int | | Max tokens to generate (default: 256) |
| `stream` | bool | | Enable SSE streaming (default: false) |
| `top_p` | float | | Nucleus sampling (default: 1.0) |
| `stop` | array | | Stop sequences |
Example Request
```json
{
  "model": "my-llama-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an NPU?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256
}
```
Response
```json
{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1708900000,
  "model": "my-llama-7b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "An NPU (Neural Processing Unit) is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}
```
Streaming Response
When `stream: true` is set, the response is returned as a stream of SSE events:

```text
data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" NPU"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```
POST /v1/completions
Legacy text completion endpoint. Same parameters as chat completions, but it uses `prompt` instead of `messages`.

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | ✅ | Model name |
| `prompt` | string | ✅ | Text prompt to complete |
| `max_tokens` | int | | Max tokens (default: 256) |
| `stream` | bool | | Enable streaming |
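For example, a minimal completion request (the prompt text is illustrative):

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llama-7b",
    "prompt": "An NPU is",
    "max_tokens": 64
  }'
```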
POST /v1/embeddings
Generate text embeddings for semantic search, RAG, and similarity matching.
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | ✅ | Embedding model name |
| `input` | string \| array | ✅ | Text or array of texts to embed |
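Example request, where `my-embedding-model` stands in for whatever embedding model you have registered:

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "my-embedding-model", "input": "Hello world"}'
```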
Response
```json
{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0023, -0.0091, 0.0152, ...]
  }],
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
```
POST /v1/models/load
Pre-load a model into memory. Useful for warming up before serving.
```bash
curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'
```
POST /v1/models/unload
Unload a model from memory to free resources.
```bash
curl -X POST http://localhost:8000/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'
```
GET /v1/models/status
Get the status of all loaded models, including uptime and type.
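No request body is needed:

```bash
curl http://localhost:8000/v1/models/status
```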
GET /api/finetune/status
Get Unsloth & HuggingFace Hub ecosystem status.
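A plain GET, no parameters:

```bash
curl http://localhost:8000/api/finetune/status
```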
POST /api/finetune/train
Start a QLoRA fine-tuning job with Unsloth. Exports to GGUF, SafeTensors, or LoRA.
| Field | Description |
|---|---|
| `model_name` | HuggingFace model name or path |
| `dataset_source` | Dataset name or path |
| `use_4bit` | Use 4-bit quantization (default: true) |
| `lora_r` | LoRA rank (default: 16) |
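A sketch of a training request; the model and dataset values are illustrative placeholders, and the exact body shape is an assumption based on the fields above:

```bash
curl -X POST http://localhost:8000/api/finetune/train \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "org/some-base-model",
    "dataset_source": "org/some-dataset",
    "use_4bit": true,
    "lora_r": 16
  }'
```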
POST /api/finetune/publish/*
Push a model directory, GGUF file, or dataset to the HuggingFace Hub with auto-generated model cards.
GET /api/gguf/pipeline/status
Report which GGUF tools are available (e.g., `llama-cli`, `llama-quantize`), along with the 21 supported quantization types and 38+ supported architectures.
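Query it with a simple GET:

```bash
curl http://localhost:8000/api/gguf/pipeline/status
```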
POST /api/gguf/quantize
Quantize a GGUF model using llama.cpp. IQ variants are supported via an importance matrix (imatrix).
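A hypothetical request; the field names `input_path` and `quant_type` are assumptions rather than confirmed parameter names (`Q4_K_M` is a standard llama.cpp quantization type):

```bash
curl -X POST http://localhost:8000/api/gguf/quantize \
  -H "Content-Type: application/json" \
  -d '{"input_path": "/models/my-model-f16.gguf", "quant_type": "Q4_K_M"}'
```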
POST /api/gguf/convert/hf-to-gguf
Convert HuggingFace/SafeTensors models to GGUF format.
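A hedged sketch; `model_path` and `outtype` are assumed field names, not documented ones:

```bash
curl -X POST http://localhost:8000/api/gguf/convert/hf-to-gguf \
  -H "Content-Type: application/json" \
  -d '{"model_path": "/models/my-hf-model", "outtype": "f16"}'
```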
GET /api/convert/cross/paths
Retrieve all supported cross-format conversion paths (e.g., PyTorch to ONNX, PyTorch to OpenVINO, HuggingFace to GGUF). Returns a list of available conversions based on installed backend tools.
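For example:

```bash
curl http://localhost:8000/api/convert/cross/paths
```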
GET /api/assets/catalog
Retrieve the registry of pre-bundled curated models and datasets.
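Queried the same way:

```bash
curl http://localhost:8000/api/assets/catalog
```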
POST /api/litert/convert/pytorch-to-tflite
Convert PyTorch models to TFLite format using the Google LiteRT ecosystem.
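A sketch only; the `model_path` and `output_path` fields are assumptions about the request body, not documented parameters:

```bash
curl -X POST http://localhost:8000/api/litert/convert/pytorch-to-tflite \
  -H "Content-Type: application/json" \
  -d '{"model_path": "/models/my-model.pt", "output_path": "/models/my-model.tflite"}'
```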
Python Example
```python
# Using the official OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"
)

# Non-streaming
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Embeddings
embeddings = client.embeddings.create(
    model="my-embedding-model",
    input="Hello world"
)
print(len(embeddings.data[0].embedding))
```
JavaScript Example
```javascript
// Using fetch (works in browser and Node.js)
const response = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);

// Streaming: read the SSE body with fetch and a stream reader
// (EventSource can't send POST requests, so fetch is used instead)
const sse = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true
  })
});
const reader = sse.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(decoder.decode(value));
}
```
cURL Example
```bash
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# With streaming
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models
curl http://localhost:8000/v1/models

# Load a model
curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'
```
LangChain Example
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any",
    model="my-model",
    temperature=0.7
)

response = llm.invoke("What is an NPU?")
print(response.content)
```