Use this route to orient hardware decisions before you turn them into capex.
Most teams do not need a rack on day one. They need a fast read on privacy, daily demand and the bottleneck that will actually hurt first.
Boundary
Reference lanes
API-first to private node
Quick read
budget, bottleneck, privacy
VRAM bands
NPU / 16-24GB / 48-80GB
Shared serving
Only when demand is stable
Start from privacy, workload shape and daily usage.
Orientation tier
Stay API-first or use a very light local lane while the traffic shape is still uncertain.
Workstation tier
Move to a GPU or high-memory desk only when local inference is already a daily habit.
Serving tier
Shared private nodes only make sense after demand, privacy and support needs are already proven.
- Teams overbuy GPU before they understand whether the important layer is retrieval, chat or multimodal.
- NPU laptops get sold as universal answers when they are only good for the lightest local lanes.
- Ops cost gets ignored until the box turns into an internal API that nobody owns.
Hardware route snapshot
| Profile | Best use | Budget band | Local fit | Watch-out |
|---|---|---|---|---|
| API-first | Frontier reasoning and bursty demand | Low capex | Best before buying local | Variable spend and provider lock-in |
| NPU laptop | Mobile privacy and lightweight local help | Low-mid | Tiny local models | Bandwidth and thermal limits |
| CPU + RAM retrieval node | Embeddings and rerank | Low-mid | Retrieval-heavy stacks | Poor generation throughput |
| 16-24GB workstation | Daily local prototyping | Mid | 7B-14B class | VRAM ceiling and desk noise |
| 48-80GB private node | Shared internal API | High | Serious internal serving | Ops overhead |
Inference hardware guide
The practical decision layer for API-first, quiet desks, retrieval boxes and private serving nodes.
Agent stack board
Use it when hardware demand is being justified by agents, browser workers or orchestration.
Workflow recipes
Move into repeatable operating flows once the serving posture is already narrowed.