## Installation

### Windows (Recommended)
The setup script handles everything automatically — Python, virtual environment, and all dependencies.
```
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
setup.bat
```
This will:
- Download portable Python 3.11 if no compatible Python is found
- Create an isolated `.venv` virtual environment
- Install all backend + frontend dependencies
- Generate a `.env` configuration file
- Create launcher scripts: `run-backend.bat`, `run-frontend.bat`, `run-all.bat`
### Manual Setup
```
pip install -r backend/requirements.txt
cd frontend && npm install
cp .env.example .env
```
## Quick Start
After installation, launch both backend and frontend:
```
run-all.bat
```
This starts:
- Backend — FastAPI on `http://localhost:8000`
- Frontend — Vite dev server on `http://localhost:5173`
- OpenAI API — Available at `http://localhost:8000/v1`
- Swagger Docs — Interactive docs at `http://localhost:8000/api/docs`
## Configuration

Edit `.env` in the project root to customize settings:
| Variable | Default | Description |
|---|---|---|
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `HUGGINGFACE_TOKEN` | — | HuggingFace API token for private models |
| `NPU_STACK_API_KEY` | — | Optional API key for `/v1` endpoints (empty = no auth) |
| `MODEL_STORAGE` | `backend/data/models` | Where models are stored |
| `LOG_LEVEL` | `info` | Logging level |
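For example, a minimal `.env` for a locally bound server with API-key auth enabled might look like this (all values are illustrative placeholders, not project defaults beyond those in the table):

```
HOST=127.0.0.1
PORT=8000
HUGGINGFACE_TOKEN=
NPU_STACK_API_KEY=change-me
MODEL_STORAGE=backend/data/models
LOG_LEVEL=info
```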
## Model Serving
NPU-STACK includes an OpenAI-compatible API server, acting as a local alternative to LM Studio. Any tool or SDK that works with the OpenAI API works with NPU-STACK.
### Supported Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/v1/models` | List all available models |
| `POST` | `/v1/chat/completions` | Chat completion (streaming + non-streaming) |
| `POST` | `/v1/completions` | Text completion (legacy) |
| `POST` | `/v1/embeddings` | Generate text embeddings |
| `POST` | `/v1/models/load` | Load a model into memory |
| `POST` | `/v1/models/unload` | Unload a model from memory |
| `GET` | `/v1/models/status` | Status of loaded models |
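The load/unload endpoints can be driven from any HTTP client. Below is a minimal stdlib sketch; note that the `{"model": ...}` request-body shape is an assumption, not confirmed here — check the Swagger docs at `/api/docs` for the actual schema.

```python
import json
import urllib.request

def load_model_request(model_id: str, base_url: str = "http://localhost:8000"):
    """Build a POST request for /v1/models/load.

    NOTE: the {"model": ...} body shape is assumed; verify it
    against the Swagger docs at /api/docs.
    """
    body = json.dumps({"model": model_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/models/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def load_model(model_id: str):
    """Send the request (requires a running backend)."""
    with urllib.request.urlopen(load_model_request(model_id)) as resp:
        return json.loads(resp.read())
```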
### Usage
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Models are auto-loaded on first request. You can also pre-load models via the `/v1/models/load` endpoint or the Serving page in the UI.
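For streaming, passing `stream=True` to the same `client.chat.completions.create` call yields incremental chunks through the SDK. If you are working without the SDK, the endpoint emits OpenAI-style server-sent events; a minimal parser for the text deltas might look like this (the chunk schema is assumed to follow OpenAI's `chat.completion.chunk` format):

```python
import json

def extract_deltas(sse_lines):
    """Pull assistant text deltas out of OpenAI-style SSE lines.

    Assumes each data line carries a chat.completion.chunk object
    and that the terminal sentinel is the literal "[DONE]".
    """
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta
```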
### Compatibility
Tested and compatible with: OpenAI Python/JS SDK, LangChain, LlamaIndex, Open WebUI, Chatbot UI, Vercel AI SDK, cURL, Postman.
## Playground
The Playground provides an interactive UI for testing models without code:
- Image Classification — Upload an image, get top-K predictions with confidence scores
- Object Detection — YOLO/SSD-style bounding box detection
- Text Generation — Prompt a language model and see streamed output
- Image Generation — Text-to-image with Stable Diffusion (ONNX)
## HuggingFace Hub
Browse, search, and download models directly from HuggingFace:
- Search by query with task filtering (text-generation, image-classification, etc.)
- View model details — downloads, likes, tags, file list
- One-click download — auto-detects model format (ONNX, PyTorch, safetensors)
- Downloaded models are automatically registered in the local model registry
Set `HUGGINGFACE_TOKEN` in `.env` to access gated/private models.
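The format auto-detection mentioned above can be approximated from a repo's file list. The heuristic below is purely illustrative (the precedence order is an assumption, not NPU-STACK's actual detection logic):

```python
def detect_format(filenames):
    """Guess a model's format from its file extensions.

    Mirrors the formats the Hub page mentions (ONNX, PyTorch,
    safetensors); the precedence order here is an assumption.
    """
    exts = {name.rsplit(".", 1)[-1].lower() for name in filenames if "." in name}
    if "onnx" in exts:
        return "onnx"
    if "safetensors" in exts:
        return "safetensors"
    if exts & {"pt", "pth", "bin"}:
        return "pytorch"
    return "unknown"
```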
## Training
Train models from scratch with built-in architectures:
- Select architecture (ResNet, MobileNet, custom) and dataset
- Configure hyperparameters: epochs, batch size, learning rate, optimizer
- Real-time metrics via WebSocket — loss, accuracy, learning rate curves
- Trained models are saved to the model registry
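The WebSocket metrics stream can be consumed by any client. Assuming each message is a JSON object carrying the fields listed above (epoch, step, loss, accuracy, learning rate; the exact payload schema is not documented here), a consumer might decode it like:

```python
import json

def parse_metrics(message: str):
    """Decode one training-metrics WebSocket message.

    The field names (epoch/step/loss/accuracy/lr) are assumed;
    check the actual payload emitted by the backend.
    """
    data = json.loads(message)
    return {
        "epoch": data.get("epoch"),
        "step": data.get("step"),
        "loss": data.get("loss"),
        "accuracy": data.get("accuracy"),
        "lr": data.get("lr"),
    }
```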
## Fine-Tuning
Parameter-efficient fine-tuning using LoRA and QLoRA:
- Select a base model from the registry + a dataset from the Dataset Manager
- Configure LoRA parameters: rank (r), alpha, dropout, target modules
- Background training with real-time step/epoch/loss tracking
- Fine-tuned adapters are saved alongside the base model
- Requires: `pip install peft datasets`
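The LoRA parameters above map directly onto `peft`'s `LoraConfig`. A sketch follows; the target modules shown are typical for LLaMA-style models and are an assumption, not a project default:

```python
# Plain-dict view of the knobs exposed in the Fine-Tuning UI.
lora_params = {
    "r": 8,               # rank
    "lora_alpha": 16,     # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "v_proj"],  # assumed; model-dependent
}

def build_lora_config(params):
    """Construct a peft LoraConfig (requires: pip install peft)."""
    from peft import LoraConfig
    return LoraConfig(task_type="CAUSAL_LM", **params)
```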
## Conversion & Quantization
Convert models between formats and apply quantization:
- PyTorch → ONNX — Export with configurable input shapes
- ONNX → OpenVINO IR — Optimized for Intel hardware
- INT8 / INT4 Quantization — Via NNCF for smaller, faster models
- Converted models are registered with the original model as their parent
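A PyTorch → ONNX export along the lines described above might look like this. The shape-string helper is illustrative (not project code); `torch.onnx.export` itself is PyTorch's standard export API:

```python
def parse_shape(spec: str):
    """Turn a shape string such as "1x3x224x224" into a tuple of ints."""
    return tuple(int(d) for d in spec.lower().split("x"))

def export_to_onnx(model, shape_spec: str, out_path: str):
    """Export a PyTorch model to ONNX with a configurable input shape.

    Requires a PyTorch install; the opset version is a common choice,
    not a project default.
    """
    import torch
    dummy = torch.randn(*parse_shape(shape_spec))
    torch.onnx.export(model, dummy, out_path, opset_version=17)
```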
## Benchmark
Benchmark inference performance across hardware:
- Measure latency (p50, p95, p99) and throughput (inferences/sec)
- Compare CPU vs GPU vs NPU side-by-side
- Compare quantization levels (FP32 vs INT8 vs INT4)
- System info: detailed hardware detection with capabilities
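The latency percentiles can be computed with the standard library alone. A minimal sketch of the measurement loop (the `infer` callable stands in for whatever model invocation you are timing):

```python
import statistics
import time

def benchmark(infer, runs: int = 100):
    """Time `infer()` repeatedly and report p50/p95/p99 latency (ms)
    plus throughput (inferences/sec)."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) yields the 1st..99th percentile cut points
    q = statistics.quantiles(latencies, n=100)
    return {
        "p50_ms": q[49],
        "p95_ms": q[94],
        "p99_ms": q[98],
        "throughput_ips": 1000.0 * runs / sum(latencies),
    }
```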
## Dataset Manager
Manage training and evaluation datasets:
- Upload — Drag and drop files; zip archives are auto-extracted
- Scan — Point to a local `datasets/` folder and auto-detect contents
- Auto-detect — Image datasets, CSVs, JSON/JSONL, Parquet, plain text
- Delete — Remove local datasets from the UI
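The auto-detect step can be thought of as mapping file extensions to dataset kinds. An illustrative version covering the types listed above (not the project's actual detector):

```python
def detect_dataset_kind(filenames):
    """Classify a dataset folder by its dominant file type."""
    kinds = {
        "jpg": "image", "jpeg": "image", "png": "image",
        "csv": "csv",
        "json": "json", "jsonl": "json",
        "parquet": "parquet",
        "txt": "text",
    }
    counts = {}
    for name in filenames:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        kind = kinds.get(ext)
        if kind:
            counts[kind] = counts.get(kind, 0) + 1
    return max(counts, key=counts.get) if counts else "unknown"
```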
## Hardware Support
NPU-STACK auto-detects and supports all major AI accelerators:
| Hardware | Backend | Status |
|---|---|---|
| NVIDIA CUDA GPUs | PyTorch CUDA, ONNX Runtime CUDA | ✅ |
| AMD ROCm GPUs | PyTorch HIP, ONNX Runtime ROCm | ✅ |
| Intel NPU (Core Ultra) | OpenVINO NPU plugin | ✅ |
| Google Coral Edge TPU | TFLite Delegate | ✅ |
| DirectML (Windows) | ONNX Runtime DML Provider | ✅ |
| CPU (x86/ARM) | ONNX Runtime CPU, OpenVINO CPU | ✅ |
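At runtime, detection for the ONNX Runtime backends reduces to inspecting the available execution providers. The provider names below are real ONNX Runtime identifiers; the mapping to the table's rows is illustrative:

```python
# Map ONNX Runtime execution-provider names to the hardware rows above.
PROVIDER_LABELS = {
    "CUDAExecutionProvider": "NVIDIA CUDA GPU",
    "ROCMExecutionProvider": "AMD ROCm GPU",
    "OpenVINOExecutionProvider": "Intel CPU/GPU/NPU via OpenVINO",
    "DmlExecutionProvider": "DirectML (Windows)",
    "CPUExecutionProvider": "CPU (x86/ARM)",
}

def available_hardware():
    """Return human-readable labels for this machine's ORT providers.

    Requires an onnxruntime install.
    """
    import onnxruntime as ort
    return [PROVIDER_LABELS.get(p, p) for p in ort.get_available_providers()]
```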
## Docker
Run with Docker Compose for a containerized deployment:
```
docker compose up --build
```
Services:
- `backend` — FastAPI + all ML dependencies on port 8000
- `frontend` — Nginx-served React build on port 3000
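The repository ships its own `docker-compose.yml`; the fragment below only illustrates the two services listed above (build contexts and internal port mappings are assumptions):

```yaml
services:
  backend:
    build: ./backend        # path assumed
    ports:
      - "8000:8000"
  frontend:
    build: ./frontend       # path assumed
    ports:
      - "3000:80"           # Nginx serves the React build on 3000
```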
## Contributing
We welcome contributions! Here's the workflow:
```
git clone https://github.com/chainchopper/NPU-STACK.git
cd NPU-STACK
git checkout dev
# make your changes
git push origin dev
# then open a Pull Request on GitHub
```
All PRs should target the `dev` branch, which carries the latest development code.