Open Source · Edge AI Toolkit · OpenAI-Compatible API

Deploy AI Models
On Any Hardware

Train, fine-tune, convert, quantize, serve, and benchmark models for NPU, TPU, GPU, and CPU. One toolkit — every accelerator. OpenAI-compatible API included.

13+ Accelerators
10+ AI Frameworks
100% Open Source

Everything You Need

From model discovery to production deployment — all in one place

🖥️

Model Serving

OpenAI-compatible API. Serve any model via /v1/chat/completions. Drop-in replacement for OpenAI, works with LangChain, Open WebUI, and more.

🧪

Playground

Test models interactively — text generation, image classification, object detection, and image synthesis, all in the browser.

🪄

Contextual Wizards

An interactive 5-step onboarding wizard on first launch, plus per-tab contextual guides in Conversion Studio, GGUF Studio, and Fine-Tuning — with step-through tips, a localStorage-persisted dismiss state, and a floating button to re-open them.

🤗

HuggingFace Hub

Search, browse, and download models from HuggingFace with one click, directly into your model registry.
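Under the hood, Hub search is just an HTTP query against the public HuggingFace REST API; a minimal sketch of building such a query with the standard library (the endpoint and parameters reflect the public huggingface.co API, not necessarily what NPU-STACK calls internally):

```python
from urllib.parse import urlencode

def build_hub_search_url(query: str, limit: int = 10) -> str:
    """Build a HuggingFace Hub model-search URL (public REST API)."""
    params = urlencode({"search": query, "limit": limit, "sort": "downloads"})
    return f"https://huggingface.co/api/models?{params}"

print(build_hub_search_url("gemma", limit=5))
```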

🔄

Convert & Quantize

Convert PyTorch → ONNX, GGUF, or OpenVINO. A dedicated GGUF Studio with 5 tabs (Inspect, Quantize, HuggingFace to GGUF, LoRA Merge, Split). Apply INT8/INT4 quantization via NNCF and 21+ GGUF quantization formats.
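As a rule of thumb, a quantized GGUF file weighs roughly parameters × bits-per-weight ÷ 8; a back-of-the-envelope estimator (the bits-per-weight figures below are approximate community numbers for these formats, not exact specs):

```python
# Approximate effective bits per weight for a few common GGUF quant formats
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.8}

def estimate_gguf_size_gb(n_params: float, quant: str) -> float:
    """Estimate a quantized model's file size in GB (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9

# A 7B-parameter model at Q4_K_M lands near 4 GB
print(f"{estimate_gguf_size_gb(7e9, 'Q4_K_M'):.1f} GB")
```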

🏋️

Train & Fine-Tune

Ultra-fast QLoRA fine-tuning powered by Unsloth. Supports custom datasets, real-time metrics, and direct publishing to the HuggingFace Hub.
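QLoRA fine-tuning is light because only small low-rank adapters are trained: per the LoRA paper, a rank-r adapter on a d×k weight matrix adds r·(d + k) trainable parameters while the original weights stay frozen. A quick illustration (the dimensions here are made up for the example):

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters a rank-r LoRA adapter adds to a d x k weight."""
    return r * (d + k)

d = k = 4096                        # illustrative projection size
full = d * k                        # ~16.8M frozen weights
adapter = lora_params(d, k, r=16)   # 131,072 trainable weights
print(f"trainable fraction: {adapter / full:.2%}")
```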

📊

Benchmark

Run latency and throughput benchmarks across CPU, GPU, and NPU. Compare quantization levels side-by-side.
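Latency and throughput benchmarking boils down to timing repeated calls after a warm-up; a minimal stdlib harness in the same spirit (the workload here is a stand-in, not NPU-STACK's actual inference runner):

```python
import time
import statistics

def benchmark(fn, warmup: int = 3, runs: int = 20) -> dict:
    """Time fn() and report mean/p95 latency plus derived throughput."""
    for _ in range(warmup):              # warm caches / JIT / device queues
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    mean = statistics.mean(samples)
    p95 = sorted(samples)[int(0.95 * len(samples)) - 1]
    return {"mean_s": mean, "p95_s": p95, "throughput_rps": 1.0 / mean}

stats = benchmark(lambda: sum(range(10_000)))  # stand-in workload
print(stats)
```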

📁

Dataset Manager

Upload, organize, and auto-detect dataset types. Supports images, CSVs, JSON, Parquet, and zip archives.
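Auto-detection of dataset types can start from nothing more than mapping file extensions; a toy sketch (the categories mirror the formats listed above; the real detector may also inspect file contents):

```python
from pathlib import Path

EXTENSION_TYPES = {
    ".csv": "tabular", ".json": "tabular", ".parquet": "tabular",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".zip": "archive",
}

def detect_dataset_type(filename: str) -> str:
    """Guess a dataset type from the file extension."""
    return EXTENSION_TYPES.get(Path(filename).suffix.lower(), "unknown")

print(detect_dataset_type("train.parquet"))  # tabular
print(detect_dataset_type("photos.zip"))     # archive
```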

📦

Model Catalog

Instantly deploy 20+ pre-optimized assets (ResNet, YOLO, Gemma) curated for RKNN, MediaPipe, LiteRT, and ONNX Runtime.

OpenAI-Compatible Model Serving

Use NPU-STACK as a drop-in replacement for OpenAI. Works with any SDK, framework, or tool.

GET
/v1/models

List all available & loaded models

POST
/v1/chat/completions

Chat completion with streaming SSE

POST
/v1/completions

Legacy text completion endpoint

POST
/v1/embeddings

Generate text embeddings

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # Not required for local
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    # The final streamed chunk's delta may carry no content, so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")
OpenAI SDK
LangChain
Open WebUI
LlamaIndex
Chatbot UI
Vercel AI SDK
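Because the server speaks the OpenAI wire format, no SDK is strictly required; a sketch that builds the same chat request with only the standard library (the URL and port match the quick-start defaults; actually sending it needs a running server, so that part is left out):

```python
import json
from urllib.request import Request

def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8000/v1") -> Request:
    """Construct an OpenAI-style /chat/completions POST request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("my-model", "Hello!")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
```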

Universal Hardware Support

Deploy on any accelerator — auto-detected and ready to go

NVIDIA CUDA

Full CUDA GPU support with multi-GPU enumeration.

✅ GPU

AMD ROCm

RDNA & CDNA architectures via ROCm/HIP.

✅ GPU

Intel NPU

Intel Core Ultra AI accelerators via OpenVINO.

✅ NPU

Google Coral

Edge TPU support via TFLite delegates.

✅ TPU

DirectML

Windows GPU fallback via ONNX Runtime.

✅ DML

CPU / OpenVINO

Optimized CPU inference. Always available.

✅ CPU
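Accelerator auto-detection can begin with something as simple as checking which vendor tools are on PATH; a simplified stdlib sketch (the probe commands are common vendor utilities, and this is an illustration, not NPU-STACK's actual detection logic):

```python
import shutil

# Map accelerator labels to a CLI tool whose presence hints at the hardware
PROBES = {"cuda": "nvidia-smi", "rocm": "rocm-smi", "coral": "edgetpu_compiler"}

def detect_accelerators() -> list:
    """Return plausible backends; CPU is always included as the fallback."""
    found = [name for name, tool in PROBES.items() if shutil.which(tool)]
    return found + ["cpu"]

print(detect_accelerators())
```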

Get Started in 3 Steps

Clone, setup, run. It's that easy.

1

Clone the repo

git clone https://github.com/chainchopper/NPU-STACK.git && cd NPU-STACK
2

Run setup

setup.bat (Windows)
./setup.sh (Linux/macOS)

Downloads a portable Python, creates a venv, installs all dependencies, and generates .env

3

Launch

run-all.bat (Windows)
./run-all.sh (Linux/macOS)

Backend (FastAPI :8000) + Frontend (Vite :5173) + OpenAI API (/v1)

Contribute & Support

NPU-STACK is free and open source. Help us build the future of edge AI.

Fork → check out the dev branch → make your changes → submit a PR