Installation

Windows (Recommended)

The setup script handles everything automatically: Python, the virtual environment, and all dependencies.

git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
setup.bat

This will install Python if it is missing, create a virtual environment, and install all backend and frontend dependencies.

Manual Setup

# Run from the project root
pip install -r backend/requirements.txt
cd frontend && npm install && cd ..
cp .env.example .env

Quick Start

After installation, launch both backend and frontend:

run-all.bat

This starts the backend API server (http://localhost:8000 by default) and the frontend UI.

Configuration

Edit .env in the project root to customize settings:

Variable            Default               Description
HOST                0.0.0.0               Server bind address
PORT                8000                  Server port
HUGGINGFACE_TOKEN   (empty)               HuggingFace API token for private models
NPU_STACK_API_KEY   (empty)               Optional API key for /v1 endpoints (empty = no auth)
MODEL_STORAGE       backend/data/models   Where models are stored
LOG_LEVEL           info                  Logging level
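
A minimal .env using the defaults above (the token value is a placeholder):

# Values shown are the defaults; the token is a placeholder
HOST=0.0.0.0
PORT=8000
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxxxxx
NPU_STACK_API_KEY=
MODEL_STORAGE=backend/data/models
LOG_LEVEL=info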

Model Serving

NPU-STACK includes an OpenAI-compatible API server, acting as a local alternative to LM Studio. Any tool or SDK that works with the OpenAI API works with NPU-STACK.

Supported Endpoints

Method   Endpoint               Description
GET      /v1/models             List all available models
POST     /v1/chat/completions   Chat completion (streaming + non-streaming)
POST     /v1/completions        Text completion (legacy)
POST     /v1/embeddings         Generate text embeddings
POST     /v1/models/load        Load a model into memory
POST     /v1/models/unload      Unload a model from memory
GET      /v1/models/status      Status of loaded models

Usage

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")
response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Models are auto-loaded on first request. You can also pre-load models via the /v1/models/load endpoint or the Serving page in the UI.
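
For example, a pre-load call with the requests library; the request body shape here is an assumption, so check the server's API reference for the exact schema:

import requests

# Ask the server to load a model before the first completion request.
# The {"model": ...} body is an assumed schema, and "my-model" is a placeholder.
resp = requests.post(
    "http://localhost:8000/v1/models/load",
    json={"model": "my-model"},
)
print(resp.status_code, resp.json())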

Compatibility

Tested and compatible with: OpenAI Python/JS SDK, LangChain, LlamaIndex, Open WebUI, Chatbot UI, Vercel AI SDK, cURL, Postman.
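
For instance, a minimal sketch pointing LangChain's ChatOpenAI wrapper at the local server (the model name is a placeholder):

from langchain_openai import ChatOpenAI

# Any api_key value works unless NPU_STACK_API_KEY is set in .env
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any",
    model="my-model",
)
print(llm.invoke("Hello!").content)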

Playground

The Playground provides an interactive UI for testing models without writing any code.

HuggingFace Hub

Browse, search, and download models directly from HuggingFace.

Set HUGGINGFACE_TOKEN in .env to access gated/private models.
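
Downloads normally happen through the UI, but as a sketch, the same snapshot can be fetched by hand with the huggingface_hub package (the repo id and target path below are placeholders):

from huggingface_hub import snapshot_download

# Pull a model snapshot into the default MODEL_STORAGE directory.
# "my-org/my-model" is a placeholder repo id.
snapshot_download(
    repo_id="my-org/my-model",
    local_dir="backend/data/models/my-model",
)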

Training

Train models from scratch with built-in architectures.

Fine-Tuning

Parameter-efficient fine-tuning using LoRA and QLoRA.
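
For orientation, this is what a LoRA setup looks like with the Hugging Face peft library; it is a generic sketch, not NPU-STACK's own fine-tuning API, and the model name and target modules are placeholders:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model; target_modules vary by architecture.
base = AutoModelForCausalLM.from_pretrained("my-model")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable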

Conversion & Quantization

Convert models between formats and apply quantization.
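
As a point of reference, dynamic INT8 quantization with ONNX Runtime looks like the sketch below; NPU-STACK's built-in pipeline may use different tooling, and the file names are placeholders:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize weights to INT8; activations stay in floating point.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,
)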

Benchmark

Benchmark inference performance across hardware.
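
A crude throughput probe against the local /v1 API, assuming the server fills in the OpenAI-style usage field on responses:

import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Write a haiku about NPUs."}],
)
elapsed = time.perf_counter() - start

# Assumes the server populates usage.completion_tokens, as OpenAI does.
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s ({tokens / elapsed:.1f} tok/s)")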

Dataset Manager

Manage training and evaluation datasets.

Hardware Support

NPU-STACK auto-detects and supports all major AI accelerators:

Hardware                 Backend
NVIDIA CUDA GPUs         PyTorch CUDA, ONNX Runtime CUDA
AMD ROCm GPUs            PyTorch HIP, ONNX Runtime ROCm
Intel NPU (Core Ultra)   OpenVINO NPU plugin
Google Coral Edge TPU    TFLite Delegate
DirectML (Windows)       ONNX Runtime DML Provider
CPU (x86/ARM)            ONNX Runtime CPU, OpenVINO CPU
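
To check which of these backends the installed ONNX Runtime build actually exposes on your machine:

import onnxruntime as ort

# Lists execution providers available in this onnxruntime build,
# e.g. CUDAExecutionProvider, DmlExecutionProvider, CPUExecutionProvider.
print(ort.get_available_providers())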

Docker

Run with Docker Compose for a containerized deployment:

docker compose up --build

This builds the images and brings up the backend and frontend as separate services.

Contributing

We welcome contributions! Here's the workflow:

git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
git checkout dev
git checkout -b my-feature
# make your changes
git push origin my-feature
# then open a Pull Request on GitHub targeting dev

All PRs should target the dev branch, which always carries the latest development code.