Benchmark pages should help you choose a stack, not just admire charts.
Every comparison in this section is framed around operational decisions: local vs API, risk vs speed, and whether the workflow still holds under real team usage.
Benchmarks: 1 published
Methodology: Fixed (latency, cost, stability)
Update rhythm: Weekly, on active stacks
Bias control: Human editorial verification
Benchmark rubric
| Layer | Metric | Why it matters |
|---|---|---|
| Latency | Time to first token (TTFT) and p95 latency | Determines whether an AI workflow feels responsive enough to be useful in practice |
| Cost | Per run and per million tokens | Prevents demo economics from leaking into production |
| Stability | Error rate and retry pressure | Shows what breaks when real load arrives |
| Governance | Privacy and routing | Determines whether the stack is viable for sensitive work |
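
As a rough illustration of how the quantitative rows of the rubric could be computed from raw run logs, here is a minimal sketch. The `Run` record, its field names, and the `summarize` helper are hypothetical and not part of any published methodology; the p95 uses the nearest-rank definition, which is one of several common conventions.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run. All field names are illustrative assumptions."""
    ttft_s: float    # time to first token, in seconds
    total_s: float   # end-to-end latency, in seconds
    tokens: int      # tokens processed (input + output)
    cost_usd: float  # cost attributed to this run
    retries: int     # retries needed before the run finished
    failed: bool     # True if the run ultimately failed


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(runs):
    """Aggregate latency, cost, and stability metrics over a list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    total_tokens = sum(r.tokens for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        # Latency layer
        "ttft_p95_s": p95([r.ttft_s for r in runs]),
        "latency_p95_s": p95([r.total_s for r in runs]),
        # Cost layer (assumes at least one token was processed)
        "cost_per_run_usd": total_cost / len(runs),
        "cost_per_million_tokens_usd": total_cost / total_tokens * 1_000_000,
        # Stability layer
        "error_rate": sum(r.failed for r in runs) / len(runs),
        "retries_per_run": sum(r.retries for r in runs) / len(runs),
    }
```

Running the same `summarize` call over logs from a local deployment and from an API would give directly comparable numbers for the first three layers. Governance is a qualitative judgment and is deliberately left out of the sketch.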
Move from the scorecard into the next decision desk
Benchmarks should stay connected to the directory, prompt library and archive so the decision does not stop at the chart.
Benchmark: Local LLM vs API in real-world latency
A technical comparison of latency, cost, and stability between local inference and API-based inference in product teams.