GET /v1/models

List all available models — both loaded and registered. Response matches the OpenAI models list format.

curl http://localhost:8000/v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "my-llama-7b",
      "object": "model",
      "created": 1708900000,
      "owned_by": "npu-stack"
    }
  ]
}

POST /v1/chat/completions

Generate chat completions. Supports streaming via Server-Sent Events (SSE). Models are auto-loaded on first request.

Request Body

Field        Type    Required  Description
model        string  Yes       Model name (as registered)
messages     array   Yes       Array of {role, content} objects
temperature  float   No        Sampling temperature (default: 0.7)
max_tokens   int     No        Max tokens to generate (default: 256)
stream       bool    No        Enable SSE streaming (default: false)
top_p        float   No        Nucleus sampling (default: 1.0)
stop         array   No        Stop sequences

Example Request

{
  "model": "my-llama-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an NPU?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256
}

Response

{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1708900000,
  "model": "my-llama-7b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "An NPU (Neural Processing Unit) is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}

Streaming Response

When stream is true, the response is delivered as a series of SSE data events, terminated by a final data: [DONE] sentinel:

data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" NPU"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
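The event stream above can be reassembled client-side with a few lines of Python. This is a minimal sketch of the parsing logic only; it assumes each line arrives whole and carries the OpenAI-style chunk shape shown above.

```python
import json

def extract_delta(line: str):
    """Parse one SSE line; return the text delta, or None for
    non-data lines, empty deltas, and the [DONE] sentinel."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Reassemble the streamed text from the sample events above
events = [
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" NPU"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
text = "".join(d for d in (extract_delta(e) for e in events) if d)
print(text)  # An NPU
```

In a real client the same function would be applied to each line read from the HTTP response body.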

POST /v1/completions

Legacy text completion endpoint. Same parameters as chat completions but uses prompt instead of messages.

Field       Type    Required  Description
model       string  Yes       Model name
prompt      string  Yes       Text prompt to complete
max_tokens  int     No        Max tokens (default: 256)
stream      bool    No        Enable streaming (default: false)
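A request to this endpoint can be assembled from the fields above. The sketch below only builds the URL, headers, and JSON body; the model name and prompt are placeholders, and the base URL assumes the default local server address used elsewhere in these docs.

```python
import json

BASE_URL = "http://localhost:8000"  # adjust if the server runs elsewhere

def completion_request(model, prompt, max_tokens=256, stream=False):
    """Build the URL, headers, and JSON body for a legacy completion call."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,
    })
    headers = {"Content-Type": "application/json"}
    return f"{BASE_URL}/v1/completions", headers, body

url, headers, body = completion_request("my-llama-7b", "An NPU is", max_tokens=64)
```

POST the body with any HTTP client, e.g. requests.post(url, headers=headers, data=body).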

POST /v1/embeddings

Generate text embeddings for semantic search, RAG, and similarity matching.

Field  Type            Required  Description
model  string          Yes       Embedding model name
input  string | array  Yes       Text or array of texts to embed

Response

{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0023, -0.0091, 0.0152, ...]
  }],
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
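For the similarity-matching use case mentioned above, embedding vectors are typically compared by cosine similarity. A minimal sketch, with short toy vectors standing in for real embedding output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real embedding output
doc_vec = [0.0023, -0.0091, 0.0152]
query_vec = [0.0021, -0.0080, 0.0160]
score = cosine_similarity(doc_vec, query_vec)
```

Scores close to 1.0 indicate semantically similar texts; for ranking many documents against one query, compute the score per document and sort descending.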

POST /v1/models/load

Pre-load a model into memory. Useful for warming up before serving.

curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'

POST /v1/models/unload

Unload a model from memory to free resources.

curl -X POST http://localhost:8000/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'

GET /v1/models/status

Get status of all loaded models including uptime and type.

POST /api/finetune/start

Start a fine-tuning job with LoRA/QLoRA. Requires the peft and datasets packages to be installed.

Field          Type    Default  Description
model_id       int     -        Base model ID from registry (required)
dataset        string  -        Dataset name or path (required)
epochs         int     3        Training epochs
batch_size     int     4        Batch size per device
learning_rate  float   2e-4     Learning rate
use_lora       bool    true     Enable LoRA adapters
lora_r         int     16       LoRA rank
lora_alpha     int     32       LoRA alpha
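A job submission body can be built directly from the fields above. In this sketch, model_id 1 and the dataset path are placeholders for values from your own registry; POST the resulting body to /api/finetune/start with Content-Type: application/json.

```python
import json

# Hypothetical job spec; model_id and dataset are placeholders.
job = {
    "model_id": 1,
    "dataset": "data/my-instructions.jsonl",
    "epochs": 3,
    "batch_size": 4,
    "learning_rate": 2e-4,
    "use_lora": True,
    "lora_r": 16,
    "lora_alpha": 32,
}
body = json.dumps(job)
```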

GET /api/finetune/jobs

List all fine-tuning jobs with their status (initializing, running, completed, failed).

GET /api/finetune/status/{job_id}

Get detailed status, metrics (loss, learning rate per step), and log output.
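Since jobs run asynchronously, a client typically polls the status endpoint until the job reaches a terminal state. The sketch below assumes the status payload contains a top-level "status" field taking the values listed above; that field name is an assumption about the response shape.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"
TERMINAL_STATES = {"completed", "failed"}

def is_terminal(status):
    """True once a job can no longer make progress."""
    return status in TERMINAL_STATES

def wait_for_job(job_id, interval=10.0):
    """Poll /api/finetune/status/{job_id} until the job finishes,
    then return the final status payload."""
    while True:
        url = f"{BASE_URL}/api/finetune/status/{job_id}"
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        if is_terminal(payload.get("status")):
            return payload
        time.sleep(interval)
```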

Python Example

# Using the official OpenAI SDK against the local server
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # the local server does not validate the key
)

# Non-streaming
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Embeddings
embeddings = client.embeddings.create(
    model="my-embedding-model",
    input="Hello world"
)
print(len(embeddings.data[0].embedding))

JavaScript Example

// Using fetch (works in browser and Node.js)
const response = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);

// Streaming by reading the response body (EventSource can't POST)
const sse = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true
  })
});
const reader = sse.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(decoder.decode(value, { stream: true }));
}

cURL Example

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# With streaming
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models
curl http://localhost:8000/v1/models

# Load a model
curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'

LangChain Example

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any",
    model="my-model",
    temperature=0.7
)

response = llm.invoke("What is an NPU?")
print(response.content)