GET /v1/models

List all available models — both loaded and registered. Response matches the OpenAI models list format.

curl http://localhost:8000/v1/models

Response

{
  "object": "list",
  "data": [
    {
      "id": "my-llama-7b",
      "object": "model",
      "created": 1708900000,
      "owned_by": "npu-stack"
    }
  ]
}

POST /v1/chat/completions

Generate chat completions. Supports streaming via Server-Sent Events (SSE). Models are auto-loaded on first request.

Request Body

Field        Type    Required  Description
model        string  Yes       Model name (as registered)
messages     array   Yes       Array of {role, content} objects
temperature  float   No        Sampling temperature (default: 0.7)
max_tokens   int     No        Max tokens to generate (default: 256)
stream       bool    No        Enable SSE streaming (default: false)
top_p        float   No        Nucleus sampling (default: 1.0)
stop         array   No        Stop sequences

Example Request

{
  "model": "my-llama-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is an NPU?"}
  ],
  "temperature": 0.7,
  "max_tokens": 256
}

Response

{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1708900000,
  "model": "my-llama-7b",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "An NPU (Neural Processing Unit) is..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 42,
    "total_tokens": 67
  }
}

Streaming Response

When stream is true, the response is delivered as a series of SSE data events, terminated by a final data: [DONE] sentinel:

data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" NPU"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
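The event stream above can be reassembled client-side with a few lines of Python. This is a minimal sketch of the parsing logic only; it assumes each line arrives whole and carries the OpenAI-style chunk shape shown above.

```python
import json

def extract_delta(line: str):
    """Parse one SSE line; return the text delta, or None for
    non-data lines, empty deltas, and the [DONE] sentinel."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Reassemble the streamed text from the sample events above
events = [
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":"An"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":" NPU"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-abc","choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
text = "".join(d for d in (extract_delta(e) for e in events) if d)
print(text)  # An NPU
```

In a real client the same function would be applied to each line read from the HTTP response body.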

POST /v1/completions

Legacy text completion endpoint. Same parameters as chat completions but uses prompt instead of messages.

Field       Type    Required  Description
model       string  Yes       Model name
prompt      string  Yes       Text prompt to complete
max_tokens  int     No        Max tokens (default: 256)
stream      bool    No        Enable streaming (default: false)
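A request to this endpoint can be assembled from the fields above. The sketch below only builds the URL, headers, and JSON body; the model name and prompt are placeholders, and the base URL assumes the default local server address used elsewhere in these docs.

```python
import json

BASE_URL = "http://localhost:8000"  # adjust if the server runs elsewhere

def completion_request(model, prompt, max_tokens=256, stream=False):
    """Build the URL, headers, and JSON body for a legacy completion call."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": stream,
    })
    headers = {"Content-Type": "application/json"}
    return f"{BASE_URL}/v1/completions", headers, body

url, headers, body = completion_request("my-llama-7b", "An NPU is", max_tokens=64)
```

POST the body with any HTTP client, e.g. requests.post(url, headers=headers, data=body).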

POST /v1/embeddings

Generate text embeddings for semantic search, RAG, and similarity matching.

Field  Type            Required  Description
model  string          Yes       Embedding model name
input  string | array  Yes       Text or array of texts to embed

Response

{
  "object": "list",
  "data": [{
    "object": "embedding",
    "index": 0,
    "embedding": [0.0023, -0.0091, 0.0152, ...]
  }],
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
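For the similarity-matching use case mentioned above, embedding vectors are typically compared by cosine similarity. A minimal sketch, with short toy vectors standing in for real embedding output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 3-dimensional vectors standing in for real embedding output
doc_vec = [0.0023, -0.0091, 0.0152]
query_vec = [0.0021, -0.0080, 0.0160]
score = cosine_similarity(doc_vec, query_vec)
```

Scores close to 1.0 indicate semantically similar texts; for ranking many documents against one query, compute the score per document and sort descending.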

POST /v1/models/load

Pre-load a model into memory. Useful for warming up before serving.

curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'

POST /v1/models/unload

Unload a model from memory to free resources.

curl -X POST http://localhost:8000/v1/models/unload \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'

GET /v1/models/status

Get status of all loaded models including uptime and type.

POST /api/finetune/start

Start a fine-tuning job with LoRA/QLoRA. Requires the peft and datasets packages to be installed.

Field          Type    Default  Description
model_id       int     -        Base model ID from registry (required)
dataset        string  -        Dataset name or path (required)
epochs         int     3        Training epochs
batch_size     int     4        Batch size per device
learning_rate  float   2e-4     Learning rate
use_lora       bool    true     Enable LoRA adapters
lora_r         int     16       LoRA rank
lora_alpha     int     32       LoRA alpha
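A job submission body can be built directly from the fields above. In this sketch, model_id 1 and the dataset path are placeholders for values from your own registry; POST the resulting body to /api/finetune/start with Content-Type: application/json.

```python
import json

# Hypothetical job spec; model_id and dataset are placeholders.
job = {
    "model_id": 1,
    "dataset": "data/my-instructions.jsonl",
    "epochs": 3,
    "batch_size": 4,
    "learning_rate": 2e-4,
    "use_lora": True,
    "lora_r": 16,
    "lora_alpha": 32,
}
body = json.dumps(job)
```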

GET /api/finetune/jobs

List all fine-tuning jobs with their status (initializing, running, completed, failed).

GET /api/finetune/status/{job_id}

Get detailed status, metrics (loss, learning rate per step), and log output.
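Since jobs run asynchronously, a client typically polls the status endpoint until the job reaches a terminal state. The sketch below assumes the status payload contains a top-level "status" field taking the values listed above; that field name is an assumption about the response shape.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"
TERMINAL_STATES = {"completed", "failed"}

def is_terminal(status):
    """True once a job can no longer make progress."""
    return status in TERMINAL_STATES

def wait_for_job(job_id, interval=10.0):
    """Poll /api/finetune/status/{job_id} until the job finishes,
    then return the final status payload."""
    while True:
        url = f"{BASE_URL}/api/finetune/status/{job_id}"
        with urllib.request.urlopen(url) as resp:
            payload = json.load(resp)
        if is_terminal(payload.get("status")):
            return payload
        time.sleep(interval)
```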

Python Example

# Using the official OpenAI SDK against the local server
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # the local server does not validate the key
)

# Non-streaming
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Embeddings
embeddings = client.embeddings.create(
    model="my-embedding-model",
    input="Hello world"
)
print(len(embeddings.data[0].embedding))

JavaScript Example

// Using fetch (works in browser and Node.js)
const response = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }]
  })
});
const data = await response.json();
console.log(data.choices[0].message.content);

// Streaming by reading the response body (EventSource can't POST)
const sse = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "my-model",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true
  })
});
const reader = sse.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  console.log(decoder.decode(value, { stream: true }));
}

cURL Example

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# With streaming
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models
curl http://localhost:8000/v1/models

# Load a model
curl -X POST http://localhost:8000/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model"}'

LangChain Example

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any",
    model="my-model",
    temperature=0.7
)

response = llm.invoke("What is an NPU?")
print(response.content)