GPU Concurrency Benchmark — vLLM Serving Box (RTX PRO 6000 Blackwell)
Summary
Measured the concurrent-request ceiling of a single RTX PRO 6000 Blackwell Server Edition GPU serving Llama 3.3 70B Q4_K_M GGUF under vLLM 0.21.0, to replace the ~25-user-per-box estimate in the sovereign-compute architecture with a measured number. Result: the as-deployed ceiling is ~5 in-flight requests, which falsifies the architecture-doc hypothesis by ~5×; the gap is attributable to software configuration (GGUF format, fp16 KV cache, FlashAttention v2, native PyTorch sampler) rather than hardware. The derived per-user GPU cost of ~€16/month validates the €29 subscription placeholder within ~€2.
Claims
- On a single RTX PRO 6000 Blackwell Server Edition (96 GB) running Llama 3.3 70B Q4_K_M GGUF under vLLM 0.21.0 with FlashAttention v2 and fp16 KV cache, the median time-to-first-token (TTFT) crosses 5 seconds between 5 and 10 concurrent chat-sized requests, placing the pricing-relevant ceiling at ~5 in-flight requests.
- The original architecture hypothesis of a ~25 concurrent-request ceiling is falsified for the as-deployed stack by ~5×, but the gap is attributable to software configuration (GGUF instead of AWQ, fp16 instead of fp8 KV cache, FlashAttention v2 instead of v3, native sampler instead of flashinfer) rather than to the hardware.
- Long-context prompts (~3.5k input tokens) exceed the 5-second TTFT threshold even at C=1 because of prompt-evaluation cost (20.6 s measured), so long-context UX requires a smaller model, prefix caching, or a higher-throughput pipeline rather than a higher concurrent-request count.
- The derived per-user GPU cost at the measured ceiling and an assumed 0.7% duty cycle is ~€16/month; added to the €15 base, this puts the subscription at ~€31/month and validates the €29 placeholder within ~€2.
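The arithmetic behind these claims can be sketched as follows. This is a minimal illustration rather than the benchmark's own tooling: the function name `ceiling_from_ttft` and the constant names are hypothetical, and the selection rule (largest tested concurrency whose median TTFT stays at or below the threshold) is an assumed reading of "pricing-relevant ceiling". Only figures quoted in the claims are used.

```python
from typing import Mapping

TTFT_THRESHOLD_S = 5.0  # TTFT threshold named in the claims


def ceiling_from_ttft(median_ttft_s: Mapping[int, float],
                      threshold_s: float = TTFT_THRESHOLD_S) -> int:
    """Return the largest tested concurrency whose median TTFT is within
    the threshold, stopping at the first level that breaches it.

    Returns 0 when even the lowest tested level breaches the threshold.
    (Hypothetical helper; the selection rule is an assumption.)
    """
    ceiling = 0
    for concurrency in sorted(median_ttft_s):
        if median_ttft_s[concurrency] > threshold_s:
            break
        ceiling = concurrency
    return ceiling


# Figures quoted in the claims.
HYPOTHESIS_CEILING = 25   # architecture-doc estimate
MEASURED_CEILING = 5      # as-deployed result
BASE_EUR = 15             # subscription base, per month
GPU_EUR_PER_USER = 16     # derived per-user GPU cost, per month
PLACEHOLDER_EUR = 29      # subscription placeholder, per month

shortfall = HYPOTHESIS_CEILING / MEASURED_CEILING   # 5.0, the "~5x" gap
derived_price = BASE_EUR + GPU_EUR_PER_USER         # 31
gap = derived_price - PLACEHOLDER_EUR               # 2

# Long-context case: the only tested point cited is C=1 at 20.6 s,
# which already breaches the threshold, so no concurrency level qualifies.
long_context_ceiling = ceiling_from_ttft({1: 20.6})  # 0
```

The long-context result of 0 mirrors the third claim: when a single request already exceeds the threshold, raising the concurrent-request count cannot help.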
Assumptions
- The synthetic random-token prompts emitted by `vllm bench serve` are an acceptable proxy for real chat traffic at the resolution needed to size a single GPU; real traffic with persona prompts and conversation history may have better KV-cache reuse and slightly higher throughput.
- The 0.7% duty cycle used to derive users-per-box is a reasoned estimate for sovereign-brain engagement (chatbot floor is 0.3%); it has not been measured and is treated as a ±50% knob until Phase-1 cohort telemetry exists.
- The RTX PRO 6000 Blackwell Server Edition is a sufficiently close proxy to the production Hetzner GEX131 (Max-Q Workstation variant) for sizing decisions; a ~10-15% downward adjustment is applied to account for the Max-Q power envelope.
- `vllm bench serve` with `--request-rate inf` (worst-case burst) yields a lower-bound estimate of the real-world ceiling; production traffic arrives stochastically, so the true steady-state ceiling is somewhat higher.
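To see how much the sizing moves under these assumptions, the sketch below applies the ±50% duty-cycle knob and the 10-15% Max-Q adjustment to the measured ceiling. The relation `users_per_box = ceiling / duty_cycle` is an assumed simple model, not a formula stated in this record, and all function and constant names are hypothetical. The record does not give a box price, so the sketch does not attempt to reproduce the ~€16/month figure.

```python
MEASURED_CEILING = 5            # in-flight requests, from the claims
DUTY_CYCLE = 0.007              # 0.7% estimate, unmeasured
DUTY_CYCLE_KNOB = 0.5           # treated as a +/-50% knob
MAXQ_DERATING = (0.10, 0.15)    # ~10-15% downward adjustment


def users_per_box(ceiling: float, duty_cycle: float) -> float:
    """Subscribers one box can carry if each is in flight for
    `duty_cycle` of the time. Assumed model: ceiling / duty_cycle."""
    if duty_cycle <= 0:
        raise ValueError("duty_cycle must be positive")
    return ceiling / duty_cycle


def derated_ceiling(ceiling: float, derating: float) -> float:
    """Apply a fractional downward adjustment to the ceiling."""
    return ceiling * (1.0 - derating)


# Sweep the duty-cycle knob at the measured ceiling.
duty_cycles = {
    "low (-50%)": DUTY_CYCLE * (1 - DUTY_CYCLE_KNOB),
    "estimate": DUTY_CYCLE,
    "high (+50%)": DUTY_CYCLE * (1 + DUTY_CYCLE_KNOB),
}
sweep = {label: users_per_box(MEASURED_CEILING, dc)
         for label, dc in duty_cycles.items()}

# Ceiling range after the Max-Q adjustment.
maxq_ceilings = [derated_ceiling(MEASURED_CEILING, d) for d in MAXQ_DERATING]

if __name__ == "__main__":
    for label, users in sweep.items():
        print(f"duty cycle {label}: ~{users:.0f} users per box")
    print("Max-Q ceilings:", [round(c, 2) for c in maxq_ceilings])
```

Under this assumed model the duty cycle dominates: halving or raising it by half changes users-per-box far more than the Max-Q adjustment does, which is why the record flags it as the value to replace with Phase-1 telemetry.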
Context
Domain context: brainfoundry, sovereign-compute
Reproducibility
Deterministic seed: no
Replication status: none
Structural Metrics
Rigor Score: 0 / 8 (structural transparency index)
Tier T0 compliance: 1 / 1 (100% of declared tier requirements met)
✓ Claims documented (at least one)
Computed classification recommendation: mismatch
| Dimension | Declared | Recommended | Reasons |
|---|---|---|---|
| Zone | Evidence | Hypothesis | • Claims present but no deterministic seed or replication • Artifact appears to be a conceptual or theoretical claim without computational backing |
| Tier | T0 | T0 ✓ | • No deterministic seed — results cannot be reproduced deterministically • T1 requires a fixed random seed |
| AI Level | A0 | A0 ✓ | • No AI model disclosed — assuming no AI used (A0) |
Recommendations are heuristic — based on reproducibility fields and object type.