GPU memory, not FLOPs, is the hard ceiling on how large an LLM you can load and how long a context you can serve. BF16 models use 16 bits per weight; that doubles the footprint relative to INT8 quants but preserves training‑time fidelity. DFloat11 (DF11) compresses BF16 losslessly to ≈11 bits per weight by Huffman‑coding the low‑entropy exponent field: trained weights use only a narrow band of exponent values, so the 8 exponent bits compress to roughly 3 on average while the sign and mantissa bits pass through untouched.
Result: ~30 % smaller footprints at runtime, 100 % identical outputs.
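Why is there ~5 bits of slack in every weight? BF16 splits each value into 1 sign bit, 8 exponent bits and 7 mantissa bits, and in a trained network the exponents cluster tightly. The rough Python sketch below (illustrative only, not the DF11 kernels) measures that exponent entropy on a stand‑in weight tensor:

# Illustrative sketch: estimate how many bits an entropy coder needs for
# BF16 exponents. Not the DF11 implementation, just the intuition behind it.
import torch

def bf16_exponent_entropy(weights: torch.Tensor) -> float:
    """Shannon entropy (bits per weight) of the BF16 exponent field."""
    # Reinterpret the raw 16-bit pattern, then mask to an unsigned value.
    raw = weights.to(torch.bfloat16).view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = (raw >> 7) & 0xFF            # skip 7 mantissa bits, keep 8 exponent bits
    counts = torch.bincount(exponents.flatten().long(), minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * probs.log2()).sum())

w = torch.randn(1_000_000) * 0.02            # stand-in for a weight matrix
h = bf16_exponent_entropy(w)
# sign (1) + mantissa (7) + entropy-coded exponent (~h) ≈ 11 bits per weight
print(f"exponent entropy ≈ {h:.2f} bits -> ~{1 + 7 + h:.1f} bits/weight")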
KV‑cache wins. Because DF11 compresses the cached activations too, every token’s KV entries shrink by ~30 %. Fitting the same cache in 0.7× the bytes leaves room for 1 / 0.7 ≈ 1.43× as many tokens, so on long‑context workloads (chat history, RAG, ERP docs) that translates to roughly +43 % context length before you start evicting tokens.
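A quick back‑of‑the‑envelope check, with illustrative numbers only (Llama‑3.1‑8B‑style dimensions on a 48 GB card, not measured Buzz figures). Note that this estimate combines the smaller weights with the smaller KV entries, so the gain lands above the KV‑only +43 %:

# Back-of-the-envelope KV-cache budgeting. All numbers are illustrative
# assumptions, not measured Buzz figures.

def max_tokens(vram_gb, weight_gb, layers, kv_heads, head_dim,
               bytes_per_elem, overhead_gb=2.0):
    """Tokens that fit in the VRAM left after weights + runtime overhead."""
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    free_bytes = (vram_gb - weight_gb - overhead_gb) * 1e9
    return int(free_bytes // kv_bytes_per_token)

# Assumed 8B model: 32 layers, 8 KV heads (GQA), head_dim 128, 48 GB card
bf16 = max_tokens(48, weight_gb=16.0, layers=32, kv_heads=8, head_dim=128,
                  bytes_per_elem=2.0)    # BF16 weights + BF16 KV entries
df11 = max_tokens(48, weight_gb=11.2, layers=32, kv_heads=8, head_dim=128,
                  bytes_per_elem=1.4)    # ~30 % smaller weights and KV entries
print(f"BF16: ~{bf16:,} tokens   DF11: ~{df11:,} tokens   gain: {df11 / bf16:.2f}x")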
# quick start on a Buzz A6000 48 GB instance
pip install dfloat11 vllm==0.4.1
HF_MODEL="leanmodels/df11_llama-3.1-8b-it"
python -m vllm.entrypoints.openai.api_server \
  --model $HF_MODEL \
  --dtype df11 \
  --gpu-memory-utilization 0.9
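Once the server is up, any OpenAI‑compatible client can talk to it. Here is a minimal smoke test with requests, assuming the default vLLM port 8000 and no API key configured:

# Minimal smoke test against the OpenAI-compatible endpoint started above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "leanmodels/df11_llama-3.1-8b-it",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])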
To convert your own checkpoints, run the compression script:

dfloat11.compress.py your_model_dir

This produces a DF11 .pt checkpoint compatible with vLLM & TensorRT‑LLM.

Running Llama‑3‑70B‑Instruct on a single H100‑80 GB with DF11 versus dual H100s with BF16:
Annualised, that’s >$47k savings per replica before power rebates.
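The arithmetic behind that figure is simple. The hourly rate below is an assumed placeholder rather than a Buzz list price, but dropping the second H100 at roughly $5.50/hr lands in the quoted range:

# Sanity-check the annualised figure. The hourly H100 rate is an assumed
# placeholder, not a Buzz price -- substitute your actual rate.
HOURS_PER_YEAR = 24 * 365                # 8,760
h100_hourly_usd = 5.50                   # assumed on-demand rate per GPU

gpus_saved_per_replica = 1               # 2x H100 (BF16) -> 1x H100 (DF11)
annual_savings = gpus_saved_per_replica * h100_hourly_usd * HOURS_PER_YEAR
print(f"~${annual_savings:,.0f} per replica per year")   # ~$48,180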
In 1865, economist William Stanley Jevons observed that technical efficiencies tend to increase overall consumption of a resource—because lower cost unlocks new use‑cases. Buzz already sees this dynamic with DF11 pilots:
Takeaway: DF11 slashes unit cost, but aggregate demand will likely outpace the savings. Buzz’s upcoming datacenter expansion—plus fresh H100/H200 inventory—ensures capacity keeps pace with the Jevons‑curve uptick.
Design for scale: treat your DF11 migration as step 1. Step 2 is autoscaling policy, placement groups, and inter‑GPU speed (NVLink vs PCIe) so you can ride the demand wave without bottlenecks.