## Installation

### Windows (Recommended)
The setup script handles everything automatically — Python, virtual environment, and all dependencies.
```
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
setup.bat
```
This will:
- Download portable Python 3.11 if no compatible Python is found
- Create an isolated `.venv` virtual environment
- Install all backend + frontend dependencies
- Generate a `.env` configuration file
- Create launcher scripts: `run-backend.bat`, `run-frontend.bat`, `run-all.bat`
### Manual Setup
```
pip install -r backend/requirements.txt
cd frontend && npm install
cp .env.example .env
```
## Quick Start
After installation, launch both backend and frontend:
```
run-all.bat
```
This starts:
- Backend — FastAPI on `http://localhost:8000`
- Frontend — Vite dev server on `http://localhost:5173`
- OpenAI API — Available at `http://localhost:8000/v1`
- Swagger Docs — Interactive docs at `http://localhost:8000/api/docs`
## Configuration

Edit `.env` in the project root to customize settings:
| Variable | Default | Description |
|---|---|---|
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `HUGGINGFACE_TOKEN` | — | HuggingFace API token for private models |
| `NPU_STACK_API_KEY` | — | Optional API key for `/v1` endpoints (empty = no auth) |
| `MODEL_STORAGE` | `backend/data/models` | Where models are stored |
| `LOG_LEVEL` | `info` | Logging level |
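For example, a minimal `.env` for a locally bound server with API-key auth enabled might look like this (all values are illustrative placeholders, not project defaults beyond those in the table):

```
HOST=127.0.0.1
PORT=8000
HUGGINGFACE_TOKEN=
NPU_STACK_API_KEY=change-me
MODEL_STORAGE=backend/data/models
LOG_LEVEL=info
```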
## Model Serving
NPU-STACK includes an OpenAI-compatible API server, acting as a local alternative to LM Studio. Any tool or SDK that works with the OpenAI API works with NPU-STACK.
### Supported Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/v1/models` | List all available models |
| `POST` | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| `POST` | `/v1/completions` | Text completion (legacy) |
| `POST` | `/v1/embeddings` | Generate text embeddings |
| `POST` | `/v1/models/load` | Load a model into memory |
| `POST` | `/v1/models/unload` | Unload a model from memory |
| `GET` | `/v1/models/status` | Status of loaded models |
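The load/unload endpoints can be driven from any HTTP client. Below is a minimal stdlib sketch; note that the `{"model": ...}` request-body shape is an assumption, not confirmed here — check the Swagger docs at `/api/docs` for the actual schema.

```python
import json
import urllib.request

def load_model_request(model_id: str, base_url: str = "http://localhost:8000"):
    """Build a POST request for /v1/models/load.

    NOTE: the {"model": ...} body shape is assumed; verify it
    against the Swagger docs at /api/docs.
    """
    body = json.dumps({"model": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/models/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def load_model(model_id: str):
    """Send the request (requires a running backend)."""
    with urllib.request.urlopen(load_model_request(model_id)) as resp:
        return json.loads(resp.read())
```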
### Usage
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Models are auto-loaded on first request. You can also pre-load models via the `/v1/models/load` endpoint or the Serving page in the UI.
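For streaming, passing `stream=True` to the same `client.chat.completions.create` call yields incremental chunks through the SDK. If you are working without the SDK, the endpoint emits OpenAI-style server-sent events; a minimal parser for the text deltas might look like this (the chunk schema is assumed to follow OpenAI's `chat.completion.chunk` format):

```python
import json

def extract_deltas(sse_lines):
    """Pull assistant text deltas out of OpenAI-style SSE lines.

    Assumes each data line carries a chat.completion.chunk object
    and that the terminal sentinel is the literal "[DONE]".
    """
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta
```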
### Compatibility
Tested and compatible with: OpenAI Python/JS SDK, LangChain, LlamaIndex, Open WebUI, Chatbot UI, Vercel AI SDK, cURL, Postman.
## Playground
The Playground provides an interactive UI for testing models without code:
- Image Classification — Upload an image, get top-K predictions with confidence scores
- Object Detection — YOLO/SSD-style bounding box detection
- Text Generation — Prompt a language model and see streamed output
- Image Generation — Text-to-image with Stable Diffusion (ONNX)
## HuggingFace Hub
Browse, search, and download models directly from HuggingFace:
- Search by query with task filtering (text-generation, image-classification, etc.)
- View model details — downloads, likes, tags, file list
- One-click download — auto-detects model format (ONNX, PyTorch, safetensors)
- Downloaded models are automatically registered in the local model registry
Set `HUGGINGFACE_TOKEN` in `.env` to access gated/private models.
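The format auto-detection mentioned above can be approximated from a repo's file list. The heuristic below is purely illustrative (the precedence order is an assumption, not NPU-STACK's actual detection logic):

```python
def detect_format(filenames):
    """Guess a model's format from its file extensions.

    Mirrors the formats the Hub page mentions (ONNX, PyTorch,
    safetensors); the precedence order here is an assumption.
    """
    exts = {name.rsplit(".", 1)[-1].lower() for name in filenames if "." in name}
    if "onnx" in exts:
        return "onnx"
    if "safetensors" in exts:
        return "safetensors"
    if exts & {"pt", "pth", "bin"}:
        return "pytorch"
    return "unknown"
```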
## Training
Train models from scratch with built-in architectures:
- Select architecture (ResNet, MobileNet, custom) and dataset
- Configure hyperparameters: epochs, batch size, learning rate, optimizer
- Real-time metrics via WebSocket — loss, accuracy, learning rate curves
- Trained models are saved to the model registry
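The WebSocket metrics stream can be consumed by any client. Assuming each message is a JSON object carrying the fields listed above (epoch, step, loss, accuracy, learning rate; the exact payload schema is not documented here), a consumer might decode it like:

```python
import json

def parse_metrics(message: str):
    """Decode one training-metrics WebSocket message.

    The field names (epoch/step/loss/accuracy/lr) are assumed;
    check the actual payload emitted by the backend.
    """
    data = json.loads(message)
    return {
        "epoch": data.get("epoch"),
        "step": data.get("step"),
        "loss": data.get("loss"),
        "accuracy": data.get("accuracy"),
        "lr": data.get("lr"),
    }
```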
## Fine-Tuning
Parameter-efficient fine-tuning using LoRA and QLoRA:
- Select a base model from the registry + a dataset from the Dataset Manager
- Configure LoRA parameters: rank (r), alpha, dropout, target modules
- Background training with real-time step/epoch/loss tracking
- Fine-tuned adapters are saved alongside the base model
- Requires: `pip install peft datasets`
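The LoRA parameters above map directly onto `peft`'s `LoraConfig`. A sketch follows; the target modules shown are typical for LLaMA-style models and are an assumption, not a project default:

```python
# Plain-dict view of the knobs exposed in the Fine-Tuning UI.
lora_params = {
    "r": 8,               # rank
    "lora_alpha": 16,     # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumed; model-dependent
}

def build_lora_config(params):
    """Construct a peft LoraConfig (requires: pip install peft)."""
    from peft import LoraConfig
    return LoraConfig(task_type="CAUSAL_LM", **params)
```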
## Conversion & Quantization
Convert models between formats and apply quantization:
- PyTorch → ONNX — Export with configurable input shapes
- ONNX → OpenVINO IR — Optimized for Intel hardware
- INT8 / INT4 Quantization — Via NNCF for smaller, faster models
- Converted models are registered with the original model as their parent
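A PyTorch → ONNX export along the lines described above might look like this. The shape-string helper is illustrative (not project code); `torch.onnx.export` itself is PyTorch's standard export API:

```python
def parse_shape(spec: str):
    """Turn a shape string such as "1x3x224x224" into a tuple of ints."""
    return tuple(int(d) for d in spec.lower().split("x"))

def export_to_onnx(model, shape_spec: str, out_path: str):
    """Export a PyTorch model to ONNX with a configurable input shape.

    Requires a PyTorch install; the opset version is a common choice,
    not a project default.
    """
    import torch
    dummy = torch.randn(*parse_shape(shape_spec))
    torch.onnx.export(model, dummy, out_path, opset_version=17)
```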
## Benchmark
Benchmark inference performance across hardware:
- Measure latency (p50, p95, p99) and throughput (inferences/sec)
- Compare CPU vs GPU vs NPU side-by-side
- Compare quantization levels (FP32 vs INT8 vs INT4)
- System info: detailed hardware detection with capabilities
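The latency percentiles can be computed with the standard library alone. A minimal sketch of the measurement loop (the `infer` callable stands in for whatever model invocation you are timing):

```python
import statistics
import time

def benchmark(infer, runs: int = 100):
    """Time `infer()` repeatedly and report p50/p95/p99 latency (ms)
    plus throughput (inferences/sec)."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) yields the 1st..99th percentile cut points
    q = statistics.quantiles(latencies, n=100)
    return {
        "p50_ms": q[49],
        "p95_ms": q[94],
        "p99_ms": q[98],
        "throughput_ips": 1000.0 * runs / sum(latencies),
    }
```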
## Dataset Manager
Manage training and evaluation datasets:
- Upload — Drag and drop files; zip archives are auto-extracted
- Scan — Point to a local `datasets/` folder and auto-detect contents
- Auto-detect — Image datasets, CSVs, JSON/JSONL, Parquet, plain text
- Delete — Remove local datasets from the UI
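The auto-detect step can be thought of as mapping file extensions to dataset kinds. An illustrative version covering the types listed above (not the project's actual detector):

```python
def detect_dataset_kind(filenames):
    """Classify a dataset folder by its dominant file type."""
    kinds = {
        "jpg": "image", "jpeg": "image", "png": "image",
        "csv": "csv",
        "json": "json", "jsonl": "json",
        "parquet": "parquet",
        "txt": "text",
    }
    counts = {}
    for name in filenames:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        kind = kinds.get(ext)
        if kind:
            counts[kind] = counts.get(kind, 0) + 1
    return max(counts, key=counts.get) if counts else "unknown"
```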
## Hardware Support
NPU-STACK auto-detects and supports all major AI accelerators:
| Hardware | Backend | Status |
|---|---|---|
| NVIDIA CUDA GPUs | PyTorch CUDA, ONNX Runtime CUDA | ✅ |
| AMD ROCm GPUs | PyTorch HIP, ONNX Runtime ROCm | ✅ |
| Intel NPU (Core Ultra) | OpenVINO NPU plugin | ✅ |
| Google Coral Edge TPU | TFLite Delegate | ✅ |
| DirectML (Windows) | ONNX Runtime DML Provider | ✅ |
| CPU (x86/ARM) | ONNX Runtime CPU, OpenVINO CPU | ✅ |
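At runtime, detection for the ONNX Runtime backends reduces to inspecting the available execution providers. The provider names below are real ONNX Runtime identifiers; the mapping to the table's rows is illustrative:

```python
# Map ONNX Runtime execution-provider names to the hardware rows above.
PROVIDER_LABELS = {
    "CUDAExecutionProvider": "NVIDIA CUDA GPU",
    "ROCMExecutionProvider": "AMD ROCm GPU",
    "OpenVINOExecutionProvider": "Intel CPU/GPU/NPU via OpenVINO",
    "DmlExecutionProvider": "DirectML (Windows)",
    "CPUExecutionProvider": "CPU (x86/ARM)",
}

def available_hardware():
    """Return human-readable labels for this machine's ORT providers.

    Requires an onnxruntime install.
    """
    import onnxruntime as ort
    return [PROVIDER_LABELS.get(p, p) for p in ort.get_available_providers()]
```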
## Docker
Run with Docker Compose for a containerized deployment:
```
docker compose up --build
```
Services:
- `backend` — FastAPI + all ML dependencies on port 8000
- `frontend` — Nginx-served React build on port 3000
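The repository ships its own `docker-compose.yml`; the fragment below only illustrates the two services listed above (build contexts and internal port mappings are assumptions):

```yaml
services:
  backend:
    build: ./backend        # path assumed
    ports:
      - "8000:8000"
  frontend:
    build: ./frontend       # path assumed
    ports:
      - "3000:80"           # Nginx serves the React build on 3000
```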
## Contributing
We welcome contributions! Here's the workflow:
```
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
git checkout dev
# make your changes
git push origin dev
# then open a Pull Request on GitHub
```
All PRs should target the `dev` branch, which carries the latest development code.