Hardware guide

Choose the hardware lane that matches the workload before you buy a local problem.

Inference Hardware Guide is the decision layer for local and edge inference. It answers when a laptop NPU is enough, when a single GPU box makes sense, when embeddings do not need an expensive GPU and when API-first is still the saner move.

Guide boundary

This page uses a 2026-03 local snapshot. It is a hardware decision guide, not a live SKU leaderboard. Start here when the question is whether local inference deserves to exist at all and only then move into the exact box class.

Decision lanes: 6 (API-first to shared private serving)

Quiet local lanes: 2 (desk-friendly or mobile options)

Retrieval-first lane: 1 (embeddings before giant GPU spend)

Shared private API lane: 1 (only when usage is already stable)

How to read the guide

  1. Scope: Decide whether local inference should exist at all

    The first question is not GPU brand. It is whether privacy, latency or daily volume justify a local lane.

  2. Fit: Match the box to the dominant workload

    LLM chat, multimodal analysis and embeddings do not need the same hardware shape.

  3. Timing: Buy only after the usage pattern is stable enough

    Premature private serving creates a local ops problem faster than it creates product value.
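
The three steps above can be read as a small gate in front of any purchase. The sketch below is only an illustration under stated assumptions: the function name, the flag names and the one-lane-per-workload mapping are hypothetical simplifications, with lane names borrowed from the profiles in this guide.

```python
# Assumed simplification: one default lane per dominant workload.
# The lane names come from this guide; the mapping itself is illustrative.
LANE_BY_WORKLOAD = {
    "embeddings": "CPU + RAM retrieval node",
    "llm_chat": "16-24GB single-GPU workstation",
    "multimodal": "Unified-memory workstation",
}

API_FIRST = "API-first without a local box"


def pick_lane(workload: str, privacy: bool, latency: bool,
              daily_volume: bool, usage_stable: bool) -> str:
    """Apply the Scope, Timing and Fit checks in a fixed order."""
    # Scope: without privacy, latency or daily volume there is no local case.
    if not (privacy or latency or daily_volume):
        return API_FIRST
    # Timing: an unstable usage pattern means buying is premature.
    if not usage_stable:
        return API_FIRST
    # Fit: match the box to the dominant workload.
    return LANE_BY_WORKLOAD[workload]


print(pick_lane("embeddings", privacy=True, latency=False,
                daily_volume=True, usage_stable=True))
```

Timing is checked before Fit here so that an unstable usage pattern falls back to API-first regardless of workload; a real decision would weigh more than four booleans.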

GPU threshold

A local GPU earns its keep only after demand becomes daily

If the team still experiments with prompts and sees bursty traffic, API-first usually beats premature capex.

Retrieval reality

Embeddings and rerank often need RAM discipline before they need a huge GPU

Indexing, rerank and retrieval stacks can justify a CPU-plus-RAM node long before they justify a private generation server.
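
A rough way to see why RAM comes first for retrieval is to size the raw vector store. The sketch below assumes a flat, uncompressed index of float32 vectors; the function name and the `overhead` multiplier are illustrative, and real vector stores add their own index structures on top.

```python
def embedding_index_gb(num_vectors: int, dim: int,
                       bytes_per_value: int = 4,
                       overhead: float = 1.0) -> float:
    """Approximate RAM for a flat vector index, in GB (1 GB = 1e9 bytes).

    bytes_per_value=4 assumes float32 storage.
    overhead is an assumed multiplier for index structures and metadata.
    """
    return num_vectors * dim * bytes_per_value * overhead / 1e9


# Hypothetical corpus: 10 million chunks embedded at 768 dimensions.
print(embedding_index_gb(10_000_000, 768))  # about 30.72 GB before overhead
```

At that hypothetical scale the vectors alone exceed the memory of a 16-24GB GPU, while a RAM-heavy CPU node holds them without difficulty.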

Multimodal cost

Serious multimodal work punishes memory ceilings before it punishes raw hype metrics

Video, vision and long document flows usually fail on memory headroom, bandwidth and sustained serving, not on brand slogans.

Budget band: Under $1.5k

Stay API-first unless travel privacy, offline demos or retrieval indexing force a local lane. NPU laptops and RAM-heavy retrieval boxes belong here.

Budget band: $2k-$4k

This is the real prototyping band. Choose either a 16-24GB single-GPU workstation or a high-memory unified workstation, not both at once.

Budget band: $6k+

Buy a private 48-80GB serving node only when several internal flows already need stable private inference every week.
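
The bands above can be encoded as a lookup. This is a minimal sketch: the function name and band labels are hypothetical, and because the guide leaves the ranges between bands ($1.5k-$2k and $4k-$6k) undefined, the function returns `None` there instead of guessing.

```python
from typing import Optional

# Candidate lanes per band, taken from the band descriptions in this guide.
LANES_BY_BAND = {
    "under-1.5k": ["API-first", "NPU laptop", "CPU + RAM retrieval node"],
    "2k-4k": ["16-24GB single-GPU workstation", "Unified-memory workstation"],
    "6k-plus": ["48-80GB private inference node"],
}


def budget_band(budget_usd: float) -> Optional[str]:
    """Map a budget in USD to one of the guide's bands, or None in a gap."""
    if budget_usd < 1500:
        return "under-1.5k"
    if 2000 <= budget_usd <= 4000:
        return "2k-4k"
    if budget_usd >= 6000:
        return "6k-plus"
    return None  # between bands: the guide gives no recommendation


band = budget_band(3000)
print(band, LANES_BY_BAND[band])
```

The band only narrows the candidates. In the $2k-$4k band the guide still asks for one box or the other, and the $6k+ band still depends on stable weekly demand.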

VRAM

It decides model class and concurrency far earlier than marketing implies.
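
The arithmetic behind that claim is short: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below is an estimate only. The function names are hypothetical and the 1.2 overhead multiplier is an assumed allowance for KV cache and runtime buffers, not a measured figure.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: int,
                 vram_gb: float, overhead: float = 1.2) -> bool:
    """Check whether weights plus an assumed overhead fit in the given VRAM."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * overhead
    return needed <= vram_gb


print(weight_memory_gb(14, 4))   # 7.0 GB of weights for a 4-bit 14B model
print(fits_in_vram(14, 4, 16))   # True
print(fits_in_vram(70, 4, 24))   # False: 35 GB of weights before overhead
```

Under these assumptions a 4-bit 70B model needs about 35 GB for weights alone, which is why the 16-24GB lane is not a comfortable home for it. Concurrency raises the requirement further because each session adds its own KV cache.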

RAM

It matters for long documents, retrieval batches and unified-memory workstations.

Bandwidth

Memory bandwidth decides whether long context and multimodal feel usable.
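
A common back-of-envelope bound shows why. During single-stream decoding, each generated token has to read roughly the whole set of weights from memory, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes a memory-bound, batch-of-one decode; the function name and the example numbers are hypothetical, and real throughput lands below this ceiling.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float,
                                     model_gb: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Assumes every token reads all weights once and ignores KV cache traffic,
    compute limits and batching.
    """
    if bandwidth_gb_s <= 0 or model_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_gb


# Hypothetical box: 400 GB/s of memory bandwidth, 8 GB of quantized weights.
print(decode_tokens_per_second_ceiling(400, 8))  # 50.0
```

Doubling the model size at the same bandwidth halves the ceiling, which is one reason a machine with plenty of memory capacity can still feel slow on large models.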

Power and noise

A box that nobody wants near the desk stops being a practical tool fast.

Ops cost

Drivers, queues, rollouts and monitoring kill naive ROI calculations.
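
One way to keep the ROI calculation honest is to subtract recurring local ops cost from the API spend it replaces before computing payback. The sketch below is illustrative only: the function name is hypothetical, all dollar figures are made-up inputs, and it ignores depreciation, power pricing detail and staff time that is hard to quantify.

```python
from typing import Optional


def payback_months(capex_usd: float, api_monthly_usd: float,
                   local_monthly_ops_usd: float) -> Optional[float]:
    """Months until a local box pays for itself, or None if it never does."""
    monthly_saving = api_monthly_usd - local_monthly_ops_usd
    if monthly_saving <= 0:
        # Local ops cost eats the whole API bill: no payback at any horizon.
        return None
    return capex_usd / monthly_saving


# Naive view: a $6,000 box replacing $1,000 of monthly API spend, no ops cost.
print(payback_months(6000, 1000, 0))    # 6.0
# Same box with $400 of monthly ops cost (power, maintenance, monitoring).
print(payback_months(6000, 1000, 400))  # 10.0
```

With these hypothetical inputs, adding ops cost stretches payback from six months to ten. If API spend is below the ops cost, the box never pays back.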

Decision table

Hardware profiles by workload, bottleneck and budget

Operational snapshot

API-first without a local box

Managed APIs plus a normal dev workstation

Low capex / variable opex

Best for: Frontier reasoning, heavy multimodal, bursty traffic and teams with no ops bench.

LLM: Best default when value sits in frontier models rather than hosting control.

Multimodal: Very strong for audio, video and heavy vision without buying local memory first.

Embeddings: Good enough for pilots or moderate indexing through managed services.

Bottleneck: Variable cost, data residency and provider dependence.

Ops cost: Minimal local ops, but spend can spike when volume grows without discipline.

Power and noise: No meaningful local power or noise footprint.

Choose this lane when: You are still learning the traffic shape or need to ship fast.

Stay API when: Demand is bursty or depends on frontier multimodal.

Caution: Do not confuse zero capex with low cost once the team runs daily at real volume.

NPU laptop

Personal edge and on-device privacy

Low-mid

Best for: Private copilots, local notes, offline demos and lightweight mobile helpers.

LLM: Works for small quantized LLMs, simple routers and local tasks with little parallelism.

Multimodal: Acceptable only for light vision or audio; not for serious video or large batches.

Embeddings: Good for small embeddings and light rerank directly on the device.

Bottleneck: Shared memory, bandwidth and thermal throttling.

Ops cost: Very simple to operate, but with almost no headroom for concurrency or shared serving.

Power and noise: Excellent on power and noise.

Choose local when: Mobility, privacy and a reasonable single-user local UX matter most.

Stay API when: The flow needs long context, coding agents or several sessions at once.

Caution: Do not treat a laptop NPU as a universal replacement for a discrete GPU.

CPU + RAM retrieval node

Embeddings, rerank and internal indexing

Low-mid

Best for: Retrieval pipelines, large re-indexing jobs and internal services where RAM matters most.

LLM: Weak for heavy generation; not worth using as the main chat box.

Multimodal: Very limited for generative vision or serious multimodal analysis.

Embeddings: Excellent when RAM, disk and indexing batches matter more than generation.

Bottleneck: Poor tokens-per-second for generation; total latency falls apart quickly in real chat.

Ops cost: Cheap to run and easier to maintain than a large GPU rig.

Power and noise: Moderate power draw and low noise.

Choose local when: Retrieval is the important product layer, not long-form local reasoning.

Stay API when: You also need serious chat, multimodal work or high-quality generation.

Caution: Do not buy this lane expecting it to also solve your main serving problem.

16-24GB single-GPU workstation

Daily local LLM work and serious prototyping

Mid

Best for: Coding copilots, quantized 7B-14B models, local workers and a small team.

LLM: It is the sweet spot once API bills start to hurt and privacy already matters.

Multimodal: Fine for light vision and documents; short on heavy multimodal or video.

Embeddings: Very good for embeddings, rerank and cheap worker lanes.

Bottleneck: VRAM and concurrency; the ceiling arrives earlier than the marketing suggests.

Ops cost: Starts to require cooling, drivers, queues and some observability.

Power and noise: Noticeable desktop power draw and noise.

Choose local when: Daily local inference demand already exists and the use case is not just experimental.

Stay API when: Usage is still sporadic or you require frontier multimodal.

Caution: Do not buy this lane expecting comfortable 70B use or several concurrent teams without queues.

Unified-memory workstation

Single-operator multimodal and a quiet desk

Mid-high

Best for: Long documents, moderate multimodal analysis and local work where silence matters.

LLM: Very useful for larger quantized models when total memory and desk stability matter more than raw serving speed.

Multimodal: Better than a mid GPU for some single-operator multimodal flows.

Embeddings: Good for retrieval, medium batches and local document analysis.

Bottleneck: Lower sustained throughput than a GPU server; bandwidth plus price-per-capacity becomes the limit.

Ops cost: Clean single-user operations; a bad idea as the shared API for an entire team.

Power and noise: Very strong desk-level power and noise profile.

Choose local when: A quiet desk and lots of memory matter more than concurrency.

Stay API when: Usage becomes shared or continuous, or depends on hard frontier multimodal.

Caution: Do not accidentally turn a single-operator workstation into the team backend.

48-80GB private inference node

Shared internal API and serious private serving

High

Best for: Internal API, shared agents, serious multimodal and larger local models.

LLM: The best local lane once several flows depend on private generation every day.

Multimodal: The only reasonable lane if you truly want serious multimodal inside a private perimeter.

Embeddings: Excellent for retrieval, rerank and generation inside the same perimeter.

Bottleneck: Capex, cooling, queues, networking and operational discipline.

Ops cost: High complexity: monitoring, rollouts, security and continuity all matter now.

Power and noise: High power draw and serious noise.

Choose local when: Demand is stable, compliance pressure is real or a tightly watched spend ceiling applies. The node only pays off under those conditions.

Stay API when: Volume remains uncertain or nobody can operate the box.

Caution: If the team is still discovering prompts and use cases, this lane arrives too early.

Hardware route

Start from the lighter hardware route if you still need orientation before budget decisions.

LLM route

Choose the model lane before buying a box for the wrong workload.

Agent stack board

Use it when hardware is being justified by agents, browser workers or orchestration.

Workflow recipes

Move into operating recipes once the serving posture is narrow enough.