From model discovery to production deployment — all in one place
OpenAI-compatible API. Serve any model via /v1/chat/completions. A drop-in replacement for the OpenAI API that works with LangChain, Open WebUI, and more.
Test models interactively — text generation, image classification, object detection, and image synthesis, all in the browser.
Interactive 5-step onboarding wizard on first launch, plus per-tab contextual guides on Conversion Studio, GGUF Studio, and Fine-Tuning: step-through tips, a dismiss state persisted in localStorage, and a floating re-open button.
Search, browse, and download models from HuggingFace with one click, directly into your model registry.
Convert PyTorch → ONNX, GGUF, or OpenVINO. Dedicated GGUF Studio with 5 tabs (Inspect, Quantize, HuggingFace to GGUF, LoRA Merge, Split). Apply INT8/INT4 quantization via NNCF, plus 21+ GGUF quantization formats.
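The idea behind the INT8 path can be pictured with a toy affine quantization sketch. This is illustrative only — it is neither NPU-STACK's nor NNCF's actual implementation — but it shows how a float range is mapped onto 8-bit integers with a scale and zero-point:

```python
# Toy affine INT8 quantization: map [xmin, xmax] onto [qmin, qmax].
# Illustrative sketch only, not NPU-STACK's or NNCF's real code path.

def quant_params(xmin: float, xmax: float, qmin: int = -128, qmax: int = 127):
    """Derive the scale and zero-point for an affine quantizer."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the INT8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

scale, zp = quant_params(-1.0, 2.0)
for x in (-1.0, 0.0, 0.5, 2.0):
    q = quantize(x, scale, zp)
    # Round-trip error is bounded by one quantization step
    assert abs(dequantize(q, scale, zp) - x) <= scale
```

INT4 works the same way with a 16-value integer range, which is why lower bit widths trade accuracy for size.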
Ultra-fast QLoRA fine-tuning powered by Unsloth. Support for custom datasets, real-time metrics, and direct HuggingFace Hub publishing.
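A QLoRA run boils down to a handful of hyperparameters. The dict below is a hypothetical example of the kind of settings involved — field names and values are illustrative, not NPU-STACK's actual configuration schema:

```python
# Hypothetical QLoRA run settings. Field names and values are
# illustrative only, not NPU-STACK's real configuration schema.
qlora_run = {
    "base_model": "my-base-model",   # placeholder model id
    "dataset": "my-dataset.jsonl",   # placeholder custom dataset
    "lora_rank": 16,                 # low-rank adapter dimension
    "lora_alpha": 32,                # adapter scaling factor
    "load_in_4bit": True,            # QLoRA: 4-bit quantized base weights
    "learning_rate": 2e-4,
    "max_steps": 500,
    "push_to_hub": False,            # True would publish to HuggingFace Hub
}
```

The 4-bit base weights are what make QLoRA memory-efficient: only the small LoRA adapters are trained in higher precision.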
Run latency and throughput benchmarks across CPU, GPU, and NPU. Compare quantization levels side-by-side.
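The latency measurement idea behind such a benchmark can be sketched in a few lines — here a callable `infer()` stands in for one model forward pass, and the exact metrics NPU-STACK reports may differ:

```python
# Minimal latency/throughput benchmark sketch. `infer` stands in for
# one model forward pass; the real benchmark tab may measure more.
import statistics
import time

def benchmark(infer, warmup: int = 3, iters: int = 20) -> dict:
    for _ in range(warmup):   # warm caches before timing
        infer()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "throughput_rps": 1000.0 / statistics.mean(samples),
    }

stats = benchmark(lambda: sum(range(10_000)))  # dummy CPU workload
```

Warmup iterations matter on accelerators, where the first calls often pay one-time graph-compilation or cache-fill costs.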
Upload, organize, and auto-detect dataset types. Supports images, CSVs, JSON, Parquet, and zip archives.
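A first-pass auto-detect can key off file extensions, as in this toy sketch (NPU-STACK's actual heuristics are not documented here and may inspect file contents too):

```python
# Toy extension-based dataset type detection. Illustrative only;
# the real auto-detect may also inspect file contents.
from pathlib import Path

TYPE_BY_EXT = {
    ".csv": "tabular", ".json": "tabular", ".parquet": "tabular",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".zip": "archive",
}

def detect_dataset_type(filename: str) -> str:
    return TYPE_BY_EXT.get(Path(filename).suffix.lower(), "unknown")
```

Archives would then be unpacked and re-detected file by file.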
Instantly deploy 20+ pre-optimized assets (ResNet, YOLO, Gemma) curated for RKNN, MediaPipe, LiteRT, and ONNX Runtime.
Use NPU-STACK as a drop-in replacement for OpenAI. Works with any SDK, framework, or tool.
GET /v1/models - List all available & loaded models
POST /v1/chat/completions - Chat completion with streaming SSE
POST /v1/completions - Legacy text completion endpoint
POST /v1/embeddings - Generate text embeddings
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # Not required for local serving, but the SDK expects a value
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    # The final streamed chunk may carry no content, so guard against None
    print(chunk.choices[0].delta.content or "", end="")
Deploy on any accelerator — auto-detected and ready to go
✅ GPU: Full CUDA GPU support with multi-GPU enumeration.
✅ GPU: RDNA & CDNA architectures via ROCm/HIP.
✅ NPU: Intel Core Ultra AI accelerators via OpenVINO.
✅ TPU: Edge TPU support via TFLite delegates.
✅ DML: Windows GPU fallback via ONNX Runtime.
✅ CPU: Optimized CPU inference. Always available.
Clone, setup, run. It's that easy.
git clone https://github.com/chainchopper/NPU-STACK.git && cd NPU-STACK
setup.bat (Windows) or ./setup.sh (Linux/macOS)
Downloads portable Python, creates venv, installs all dependencies, generates .env
run-all.bat (Windows) or ./run-all.sh (Linux/macOS)
Backend (FastAPI :8000) + Frontend (Vite :5173) + OpenAI API (/v1)
NPU-STACK is free and open source. Help us build the future of edge AI.
Fork → checkout dev → make changes → submit PR