Open Source · Edge AI Toolkit · OpenAI-Compatible API

Deploy AI Models
On Any Hardware

Train, fine-tune, convert, quantize, serve, and benchmark models for NPU, TPU, GPU, and CPU. One toolkit — every accelerator. OpenAI-compatible API included.

8+ Accelerators
5 AI Frameworks
100% Open Source

Everything You Need

From model discovery to production deployment — all in one place

🖥️

Model Serving

OpenAI-compatible API. Serve any model via /v1/chat/completions. Drop-in replacement for OpenAI, works with LangChain, Open WebUI, and more.

🧪

Playground

Test models interactively — image classification, object detection, text generation, and image synthesis, all in the browser.

🤗

HuggingFace Hub

Search, browse, and download models from HuggingFace with one click, straight into your model registry.

🔄

Convert & Quantize

Convert PyTorch → ONNX → OpenVINO IR. Apply INT8/INT4 quantization with NNCF for hardware deployment.

🏋️

Train & Fine-Tune

Full training runs with live WebSocket metrics, plus LoRA/QLoRA fine-tuning. Custom datasets, hyperparameters, real-time loss tracking.

📊

Benchmark

Run latency and throughput benchmarks across CPU, GPU, and NPU. Compare quantization levels side-by-side.
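At its core, a latency/throughput benchmark is a timed loop with warmup. This stdlib-only sketch uses a dummy `infer` function standing in for a real model call (names are illustrative, not NPU-STACK's internals):

```python
import statistics
import time

def infer(batch):
    # Placeholder workload standing in for one inference call.
    return sum(x * x for x in batch)

def benchmark(fn, batch, warmup=5, iters=50):
    for _ in range(warmup):  # warm caches before timing
        fn(batch)
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(batch)
        latencies.append((time.perf_counter() - start) * 1000.0)  # ms
    mean_ms = statistics.mean(latencies)
    return {
        "mean_ms": mean_ms,
        "p95_ms": sorted(latencies)[int(0.95 * iters) - 1],
        "throughput_ips": 1000.0 / mean_ms,  # inferences per second
    }

stats = benchmark(infer, list(range(256)))
```

Running the same loop against CPU, GPU, and NPU backends, and against each quantization level, gives the side-by-side numbers the card describes.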

📁

Dataset Manager

Upload and organize datasets with automatic type detection. Supports images, CSVs, JSON, Parquet, and zip archives.
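Auto-detection of dataset types is typically extension-based. A minimal sketch (the mapping and function name are assumptions for illustration, not NPU-STACK's actual detector):

```python
from pathlib import Path

# Illustrative extension -> dataset-type mapping.
TYPE_BY_EXT = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".csv": "tabular", ".json": "json",
    ".parquet": "parquet", ".zip": "archive",
}

def detect_dataset_type(filename: str) -> str:
    """Guess a dataset type from the file extension."""
    return TYPE_BY_EXT.get(Path(filename).suffix.lower(), "unknown")
```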

📱

Cross-Platform

Web app, Electron desktop (coming soon), Android & iOS (planned). One backend, every device.

OpenAI-Compatible Model Serving

Use NPU-STACK as a drop-in replacement for OpenAI. Works with any SDK, framework, or tool.

GET
/v1/models

List all available & loaded models

POST
/v1/chat/completions

Chat completion with streaming SSE

POST
/v1/completions

Legacy text completion endpoint

POST
/v1/embeddings

Generate text embeddings

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="any"  # Not required for local
)

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")
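The same endpoints work without any SDK. A raw /v1/embeddings request, for example, can be assembled with only the Python standard library ("my-model" is a placeholder; the actual send is commented out so the sketch runs without a live server):

```python
import json
import urllib.request

# "my-model" is a placeholder; use any model loaded in your registry.
payload = {"model": "my-model", "input": ["Hello, world!"]}

req = urllib.request.Request(
    "http://localhost:8000/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer any",  # key is not checked locally
    },
)

# Sending it returns an OpenAI-shaped response:
# with urllib.request.urlopen(req) as resp:
#     embedding = json.load(resp)["data"][0]["embedding"]
```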
OpenAI SDK
LangChain
Open WebUI
LlamaIndex
Chatbot UI
Vercel AI SDK

Universal Hardware Support

Deploy on any accelerator — auto-detected and ready to go

NVIDIA CUDA

Full CUDA GPU support with multi-GPU enumeration.

✅ GPU

AMD ROCm

RDNA & CDNA architectures via ROCm/HIP.

✅ GPU

Intel NPU

Intel Core Ultra AI accelerators via OpenVINO.

✅ NPU

Google Coral

Edge TPU support via TFLite delegates.

✅ TPU

DirectML

Windows GPU fallback via ONNX Runtime.

✅ DML

CPU / OpenVINO

Optimized CPU inference. Always available.

✅ CPU

Get Started in 3 Steps

Clone, setup, run. It's that easy.

1

Clone the repo

git clone https://github.com/chainchopper/NPU-STACK.git && cd NPU-STACK
2

Run setup

setup.bat

Downloads a portable Python, creates a venv, installs all dependencies, and generates .env

3

Launch

run-all.bat

Backend (FastAPI :8000) + Frontend (Vite :5173) + OpenAI API (/v1)
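Once run-all.bat is up, a quick sanity check is to list models from the local API (the actual call is commented out so the sketch runs without a live server):

```python
import json
import urllib.request

url = "http://localhost:8000/v1/models"
req = urllib.request.Request(url)  # GET by default

# With the backend running on :8000, this prints the loaded model IDs:
# with urllib.request.urlopen(req) as resp:
#     print([m["id"] for m in json.load(resp)["data"]])
```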

Contribute & Support

NPU-STACK is free and open source. Help us build the future of edge AI.

Fork → checkout dev → make changes → submit PR