From model discovery to production deployment — all in one place
OpenAI-compatible API. Serve any model via /v1/chat/completions. A drop-in replacement for the OpenAI API that works with LangChain, Open WebUI, and more.
Test models interactively — text generation, image classification, object detection, and image synthesis, all in the browser.
Interactive 5-step onboarding wizard on first launch, plus per-tab contextual guides on Conversion Studio, GGUF Studio, and Fine-Tuning: step-through tips, a dismiss state persisted in localStorage, and a floating re-open button.
Search, browse, and download models from HuggingFace with one click, directly into your model registry.
Convert PyTorch → ONNX, GGUF, or OpenVINO. Dedicated GGUF Studio with 5 tabs (Inspect, Quantize, HuggingFace to GGUF, LoRA Merge, Split). Apply INT8/INT4 quantization via NNCF, plus 21+ GGUF quantization formats.
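The idea behind the INT8 path can be pictured with a toy affine quantization sketch. This is illustrative only — it is neither NPU-STACK's nor NNCF's actual implementation — but it shows how a float range is mapped onto 8-bit integers with a scale and zero-point:

```python
# Toy affine INT8 quantization: map [xmin, xmax] onto [qmin, qmax].
# Illustrative sketch only, not NPU-STACK's or NNCF's real code path.

def quant_params(xmin: float, xmax: float, qmin: int = -128, qmax: int = 127):
    """Derive the scale and zero-point for an affine quantizer."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp into the INT8 range

def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale

scale, zp = quant_params(-1.0, 2.0)
for x in (-1.0, 0.0, 0.5, 2.0):
    q = quantize(x, scale, zp)
    # Round-trip error is bounded by one quantization step
    assert abs(dequantize(q, scale, zp) - x) <= scale
```

INT4 works the same way with a 16-value integer range, which is why lower bit widths trade accuracy for size.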
Ultra-fast QLoRA fine-tuning powered by Unsloth. Support for custom datasets, real-time metrics, and direct HuggingFace Hub publishing.
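A QLoRA run boils down to a handful of hyperparameters. The dict below is a hypothetical example of the kind of settings involved — field names and values are illustrative, not NPU-STACK's actual configuration schema:

```python
# Hypothetical QLoRA run settings. Field names and values are
# illustrative only, not NPU-STACK's real configuration schema.
qlora_run = {
    "base_model": "my-base-model",   # placeholder model id
    "dataset": "my-dataset.jsonl",   # placeholder custom dataset
    "lora_rank": 16,                 # low-rank adapter dimension
    "lora_alpha": 32,                # adapter scaling factor
    "load_in_4bit": True,            # QLoRA: 4-bit quantized base weights
    "learning_rate": 2e-4,
    "max_steps": 500,
    "push_to_hub": False,            # True would publish to HuggingFace Hub
}
```

The 4-bit base weights are what make QLoRA memory-efficient: only the small LoRA adapters are trained in higher precision.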
Run latency and throughput benchmarks across CPU, GPU, and NPU. Compare quantization levels side-by-side.
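The latency measurement idea behind such a benchmark can be sketched in a few lines — here a callable `infer()` stands in for one model forward pass, and the exact metrics NPU-STACK reports may differ:

```python
# Minimal latency/throughput benchmark sketch. `infer` stands in for
# one model forward pass; the real benchmark tab may measure more.
import statistics
import time

def benchmark(infer, warmup: int = 3, iters: int = 20) -> dict:
    for _ in range(warmup):   # warm caches before timing
        infer()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        infer()
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
        "throughput_rps": 1000.0 / statistics.mean(samples),
    }

stats = benchmark(lambda: sum(range(10_000)))  # dummy CPU workload
```

Warmup iterations matter on accelerators, where the first calls often pay one-time graph-compilation or cache-fill costs.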
Upload, organize, and auto-detect dataset types. Supports images, CSVs, JSON, Parquet, and zip archives.
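A first-pass auto-detect can key off file extensions, as in this toy sketch (NPU-STACK's actual heuristics are not documented here and may inspect file contents too):

```python
# Toy extension-based dataset type detection. Illustrative only;
# the real auto-detect may also inspect file contents.
from pathlib import Path

TYPE_BY_EXT = {
    ".csv": "tabular", ".json": "tabular", ".parquet": "tabular",
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".zip": "archive",
}

def detect_dataset_type(filename: str) -> str:
    return TYPE_BY_EXT.get(Path(filename).suffix.lower(), "unknown")
```

Archives would then be unpacked and re-detected file by file.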
Instantly deploy 20+ pre-optimized assets (ResNet, YOLO, Gemma) curated for RKNN, MediaPipe, LiteRT, and ONNX Runtime.
Use NPU-STACK as a drop-in replacement for OpenAI. Works with any SDK, framework, or tool.
GET /v1/models - List all available & loaded models
POST /v1/chat/completions - Chat completion with streaming SSE
POST /v1/completions - Legacy text completion endpoint
POST /v1/embeddings - Generate text embeddings
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # Not required for local serving, but the SDK expects a value
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    # The final streamed chunk may carry no content, so guard against None
    print(chunk.choices[0].delta.content or "", end="")
Deploy on any accelerator — auto-detected and ready to go
✅ GPU: Full CUDA GPU support with multi-GPU enumeration.
✅ GPU: RDNA & CDNA architectures via ROCm/HIP.
✅ NPU: Intel Core Ultra AI accelerators via OpenVINO.
✅ TPU: Edge TPU support via TFLite delegates.
✅ DML: Windows GPU fallback via ONNX Runtime.
✅ CPU: Optimized CPU inference. Always available.
Clone, setup, run. It's that easy.
git clone https://github.com/chainchopper/NPU-STACK.git && cd NPU-STACK
setup.bat (Windows) or ./setup.sh (Linux/macOS)
Downloads portable Python, creates venv, installs all dependencies, generates .env
run-all.bat (Windows) or ./run-all.sh (Linux/macOS)
Backend (FastAPI :8000) + Frontend (Vite :5173) + OpenAI API (/v1)
NPU-STACK is free and open source. Help us build the future of edge AI.
Fork → checkout dev → make changes → submit PR