Landscape Report — Inference Engines

AI Inference Engines & Frameworks: The Technology Layer Powering the $126B Market

vLLM • SGLang • TensorRT-LLM • FireAttention • FlashAttention • llama.cpp — Performance, Adoption & Strategic Moats

Feb 2026 • MinjAI Agents • 75+ Sources • 13 Sections
Internal — Strategic Intelligence
Section 01

Why Engines Matter

15+
Major Inference Engines
3
Dominant Cloud Engines
2
Spinouts Jan 2026
$126B
Inference Market TAM 2030
Why This Layer Decides Winners

Models are commoditizing. Hardware is standardized around NVIDIA. The inference engine is the layer where cost advantages, latency guarantees, and throughput differentiation are actually created. A 2x engine optimization halves the compute cost per token, which shows up either as substantially higher gross margin or as a roughly 50% price cut at constant margin. Every platform profiled in our managed inference landscape report derives its competitive position from engine-layer decisions.

The inference engine sits between the model weights and the GPU hardware. It determines how tokens are batched, how memory is allocated, how attention is computed, and how multiple requests share finite GPU capacity.[1] Two providers running the same Llama 3.3 70B model on identical H200 GPUs can deliver throughput that differs by 3–5x depending on engine choice and optimization depth.[2]

Three dynamics define the February 2026 landscape: open-source engine commercialization (two $400M–$800M spinouts in January[3]), NVIDIA's vertical stack consolidation from silicon to orchestration[4], and disaggregated prefill-decode serving becoming the default architecture.[5] This report maps 15+ engines across 12 dimensions to inform MARA's engine-layer strategy for Project Sapien.

Section 02

Executive Summary

The production inference engine market has consolidated around three open-source projects and a handful of proprietary alternatives. Each engine makes different tradeoffs between performance, flexibility, and ecosystem lock-in.

Engine Maintainer GitHub Stars Key Innovation Primary Users License
vLLM Inferact (UC Berkeley) 70.8K[6] PagedAttention, continuous batching Modal, RunPod, Anyscale, BentoML Apache 2.0
SGLang RadixArk (UC Berkeley) 23.6K[7] RadixAttention, structured output LMSYS Chatbot Arena, xAI Apache 2.0
TensorRT-LLM NVIDIA 12.7K[8] FP8/NVFP4, EAGLE-3 speculative Baseten, DeepInfra, NVIDIA NIM Apache 2.0
llama.cpp Georgi Gerganov 95.2K[9] GGUF format, CPU-first inference Ollama, LM Studio, Jan MIT
Ollama Ollama Inc. 163K[10] One-command deployment on llama.cpp Individual devs, prototyping MIT
TGI Hugging Face 9.7K[11] Multi-backend (vLLM/TRT-LLM) HF Inference Endpoints HFOIL
FlashAttention Tri Dao (Princeton/Together) — IO-aware exact attention, FA-4 Blackwell Every major engine BSD 3-Clause
NVIDIA Dynamo NVIDIA New (2025) Disaggregated P/D, LLM-aware routing Azure AKS, NVIDIA NIM Apache 2.0
Key Finding: Convergence + Commercialization

The market is converging on vLLM (open-source default), TensorRT-LLM (NVIDIA-optimized), and SGLang (performance alternative). TGI entering maintenance mode in December 2025 confirms this consolidation.[12] Meanwhile, the January 2026 spinouts of Inferact ($800M valuation) and RadixArk ($400M valuation) mark the shift from academic projects to venture-backed companies with commercial interests that may increasingly diverge from pure open-source community needs.[13]

The proprietary vs. open-source dynamic is nuanced. Open-source engines match proprietary performance within 10–20% on most workloads.[14] But providers like Fireworks AI (FireAttention) and Together AI (custom kernels + FlashAttention) claim 3–4x advantages through end-to-end stack optimization that goes beyond any single engine component. The question is whether these advantages are durable or transient. Section 06 addresses this directly.

Section 03

Engine Landscape Snapshot

Twelve open-source and semi-open engines mapped across origin, scale, innovation, hardware support, and business model. Proprietary engines from Fireworks, Together, Crusoe, Nebius, and fal are profiled separately in Section 06.

Engine Origin Stars Version Key Innovation Hardware Quantization Structured Output Disagg. P/D License Funding / Val. Primary Users
vLLM UC Berkeley 70.8K v0.15.1 PagedAttention, continuous batching, prefix caching NVIDIA, AMD, Intel, TPU, Ascend, Gaudi[15] FP8, GPTQ, AWQ, Marlin Via Outlines Yes (experimental) Apache 2.0 Inferact: $150M seed, $800M val.[16] Modal, RunPod, Anyscale, BentoML
SGLang UC Berkeley LMSYS 23.6K v0.4+ RadixAttention, zero-overhead scheduling, XGrammar NVIDIA, TPU (SGLang-Jax), GB200 NVL72[17] FP8, GPTQ, AWQ Native (XGrammar, 10x faster)[18] Yes Apache 2.0 RadixArk: $400M val.[19] LMSYS Arena, xAI
TensorRT-LLM NVIDIA 12.7K v1.3.0rc4 FP8/NVFP4, EAGLE-3, Wide Expert Parallelism NVIDIA only (Hopper, Blackwell, Ada)[20] FP8, NVFP4, INT8, INT4 Limited Yes (via Dynamo) Apache 2.0 NVIDIA (corporate) Baseten, DeepInfra, NIM
llama.cpp Georgi Gerganov 95.2K Continuous GGUF format, CPU-first, 1.5–8-bit quant CPU (ARM, x86), Metal, CUDA, ROCm, Vulkan, WebGPU[21] GGUF (1.5–8-bit) Grammar-based No MIT Community-driven Ollama, LM Studio, Jan, GPT4All
Ollama Ollama Inc. 163K Continuous One-command deploy, 200+ models CPU, Metal, CUDA (via llama.cpp)[22] GGUF (via llama.cpp) Via llama.cpp No MIT Undisclosed VC Individual devs, prototyping
TGI Hugging Face 9.7K v3.3.5 Multi-backend (vLLM/TRT-LLM/llama.cpp) NVIDIA, AMD, TPU, Neuron[23] GPTQ, AWQ, EXL2, bitsandbytes Via Outlines No HFOIL HF: $4.5B val. HF Endpoints (maintenance mode)
Triton Server NVIDIA 8.7K v2.65.0 Multi-framework, dynamic batching, BLS NVIDIA (Hopper, Blackwell) Via backends No Via Dynamo BSD 3-Clause NVIDIA (corporate) Enterprise, SageMaker, Azure ML
NVIDIA Dynamo NVIDIA New v1.0 Disaggregated P/D, dynamic GPU scheduling NVIDIA (Hopper, Blackwell)[24] Via engine backends Via engine backends Core feature Apache 2.0 NVIDIA (corporate) Azure AKS, NIM, K8s
DeepSpeed-MII Microsoft 2.1K v0.2.x Dynamic SplitFuse, ZeroQuant NVIDIA GPUs ZeroQuant (INT8/INT4) No No Apache 2.0 Microsoft (corporate) Declining (MS shifting to ONNX)[25]
MLC LLM MLC AI (TVM) 19.5K v0.1.0 Compiler-driven, cross-platform (WebGPU) CUDA, OpenCL, Vulkan, Metal, WebGPU[26] TVM-based auto quant No No Apache 2.0 Community/OctoML On-device, browser (WebLLM)
ONNX Runtime Microsoft 15.3K v1.23.2 Universal ONNX format, cross-platform CUDA, TensorRT, DirectML, OpenVINO, CoreML[27] INT8, INT4 (MoE kernels) No No MIT Microsoft (corporate) Windows ML, Azure ML, enterprise
Exo EXO Labs 21.8K v0.0.15-alpha Peer-to-peer distributed inference Any (phones, laptops, DGX Spark)[28] Via backends No P2P distributed GPL-3.0 Undisclosed (startup) Consumer, heterogeneous clusters

Two patterns emerge from this landscape. First, the GitHub star distribution follows a power law: the top 3 engines (Ollama, llama.cpp, vLLM) hold 72% of total community attention, confirming that developer mindshare has concentrated rapidly. Second, every engine that matters now supports some form of continuous batching and KV cache management; the differentiation has moved to higher-level innovations like cache-aware routing, structured output, and disaggregated serving.

Section 04

The Big Three Cloud Engines

70.8K
vLLM GitHub Stars
23.6K
SGLang GitHub Stars
12.7K
TensorRT-LLM Stars
1,800+
Combined Contributors (Big Three)

For production cloud inference, three engines define the frontier. Each makes fundamentally different tradeoffs. vLLM optimizes for broad hardware compatibility and ecosystem adoption. SGLang optimizes for cache efficiency and structured generation. TensorRT-LLM optimizes for raw NVIDIA hardware utilization. The choice between them defines a provider's performance ceiling, operational complexity, and vendor lock-in.

Dimension vLLM SGLang TensorRT-LLM
Throughput (100 concurrent, H100) 4,741 tok/s[29] ~5,000 tok/s ~5,000 tok/s (short input)
TTFT Fastest across concurrency Moderate Slowest
Per-token latency stability Variable Most stable (4–21ms)[30] Good
Cache hit rate (few-shot) 15–25% 85–95%[31] Standard
Blackwell performance Good Good Best (deepest optimization)
Hardware breadth NVIDIA, AMD, Intel, TPU, Ascend, Gaudi NVIDIA, TPU (Jax) NVIDIA only
Structured output Via Outlines Native XGrammar (10x faster) Limited
Speculative decoding Draft model EAGLE integration EAGLE-3 native (up to 3.6x on B200)[32]
MoE optimization Wide-EP (2.2K tok/s/H200)[33] DP attention (1.9x decode) Wide EP (native)
Governance PyTorch Foundation[34] RadixArk (startup) NVIDIA (corporate)
January 2026: Two Spinouts, $1.2B in Valuation

In a single week, both dominant open-source inference projects commercialized. Inferact (vLLM) raised $150M at $800M valuation from a16z and Lightspeed, led by Ion Stoica (Databricks co-founder).[35] RadixArk (SGLang) secured an Accel-led round at $400M valuation, with Ying Sheng (ex-xAI) as CEO.[36] Both remain Apache 2.0 licensed, but the commercial entities will increasingly control roadmap priorities, enterprise features, and community governance. Monitor for licensing or feature gating changes.

vLLM: The De Facto Standard

vLLM is to LLM inference what Linux is to operating systems: the default choice that works everywhere. Its PagedAttention mechanism applies virtual memory principles to KV cache management, achieving near-zero memory waste and enabling significantly larger batch sizes.[37] Combined with continuous batching, it delivers 10–24x faster serving versus naive implementations.
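As a concrete reference point, the sketch below shows the minimal offline-serving path on which PagedAttention and continuous batching operate. The model id, parallelism, and sampling settings are illustrative, and flags such as enable_prefix_caching assume a recent vLLM release.

```python
from vllm import LLM, SamplingParams

# Offline batch serving: vLLM handles KV-cache paging and continuous
# batching internally; the caller just submits a batch of prompts.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",   # illustrative model id
    tensor_parallel_size=4,                      # shard across 4 GPUs
    gpu_memory_utilization=0.90,                 # leave headroom for activations
    enable_prefix_caching=True,                  # reuse KV blocks for shared prefixes
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```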

The V1 architecture (complete migration by v0.11.0) removed V0 code entirely, enabling mixed prefill+decode in the same step and cross-node KV cache reuse for disaggregated inference.[38] The v0.14.0 release introduced vLLM-Omni, the first open-source omni-modality serving framework (text, image, video, audio, TTS). vLLM's governance under the PyTorch Foundation, with maintainers from Anyscale, AWS, Databricks, IBM, and Snowflake, ensures no single corporate interest controls the project.

SGLang: The Performance Challenger

SGLang's core innovation is RadixAttention, which uses a radix tree data structure for automatic KV cache reuse across requests. Where vLLM achieves 15–25% cache hit rates on few-shot workloads, SGLang achieves 85–95%.[31] On cache-heavy workloads like multi-turn chat, SGLang delivers 3.1x higher throughput than vLLM on Llama-70B.[39]

SGLang also leads in structured generation, having moved to XGrammar as its default backend. XGrammar uses compressed finite state machines for constrained output decoding, delivering up to 10x performance improvement over regex-based approaches. On GB200 NVL72 hardware, SGLang achieves 3.8x prefill and 4.8x decode throughput versus H100.[40]
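A hedged sketch of how this is typically consumed: SGLang exposes an OpenAI-compatible endpoint (started with `python -m sglang.launch_server --model-path <model>`), and a JSON schema supplied in the request is compiled by XGrammar into a grammar that masks invalid tokens during decoding. The exact response_format field names follow the OpenAI convention and should be checked against the SGLang version in use.

```python
import json
from openai import OpenAI

# Assumes a local SGLang server, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"engine": {"type": "string"}, "stars_k": {"type": "number"}},
    "required": ["engine", "stars_k"],
}

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Return the most-starred inference engine."}],
    # Grammar-constrained decoding: the schema guarantees the output parses.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "engine_info", "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))
```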

SGLang's weaknesses: Narrower hardware support than vLLM (primarily NVIDIA, with experimental AMD/TPU); a smaller contributor base (~400 vs. vLLM's 1,000+); and governance now dependent on RadixArk, a seed-stage startup, rather than a foundation. If RadixArk's priorities diverge from the open-source community, SGLang's roadmap could fragment.

TensorRT-LLM: The NVIDIA Native

TensorRT-LLM trades hardware breadth for maximum NVIDIA performance. On B200 GPUs, it consistently outperforms both vLLM and SGLang across all metrics due to deep Blackwell kernel optimization.[41] Native NVFP4 support enables 4-bit inference with less than 1% accuracy degradation when properly calibrated. EAGLE-3 speculative decoding delivers up to 3.6x throughput boost on B200 at low batch sizes (the 2–6x range reported in literature varies by batch size, model size, and acceptance rate; see Section 09). Wide Expert Parallelism optimizes MoE model serving for architectures like DeepSeek and Mixtral.

The tradeoff is clear: TensorRT-LLM is NVIDIA-only, requires more setup complexity, and has a smaller contributor base. But for providers committed to NVIDIA hardware (which is most of the market), it offers the highest performance ceiling.
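For orientation, the sketch below uses TensorRT-LLM's high-level LLM API (available in recent 1.x releases and deliberately modeled on vLLM's interface); in production, most users pair it with a ModelOpt-quantized FP8/NVFP4 checkpoint and serve via Triton or trtllm-serve. Treat the exact arguments as an assumption to verify against the installed version.

```python
from tensorrt_llm import LLM, SamplingParams

# High-level PyTorch-backend API; engine build and kernel selection happen under the hood.
# An FP8- or NVFP4-quantized checkpoint can be passed the same way as a standard
# Hugging Face model id.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)

params = SamplingParams(temperature=0.0, max_tokens=128)
for out in llm.generate(["What does EAGLE-3 speculative decoding do?"], params):
    print(out.outputs[0].text)
```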

Deep Dive: vLLM V1 Architecture

The V1 architecture, completed by v0.11.0 in late 2025, represents the most significant structural change in vLLM's history. V0 code was fully removed.

  • Mixed prefill+decode: V0 could only do one operation at a time per step. V1 mixes both, improving GPU utilization during batched serving by 20–40%.
  • Disaggregated inference: KV cache is fetched from remote nodes, enabling cross-node cache reuse. Prefill and decode can run on separate GPU pools optimized for their respective compute profiles.
  • DeepSeek MoE support: Wide-EP (Expert Parallelism) delivers 2.2K tok/s per H200 for DeepSeek-style MoE models.[33]
  • Per-token latency reduction: v0.4 release achieved 40% reduction in per-output-token latency for DeepSeek V3.1 on H200.
  • Omni-modality (v0.14.0): vLLM-Omni serves text, image, video, audio, and TTS models through a unified serving framework.

The V1 migration signals vLLM's maturation from a research prototype to production infrastructure. It also raised the bar for SGLang, which must now match vLLM's disaggregated serving capabilities to maintain its performance advantage.

Section 05

NVIDIA's Inference Stack

NVIDIA is assembling the most vertically integrated inference stack in the industry: from silicon to orchestration, from kernel libraries to managed microservices. Understanding this stack is essential because every inference provider builds on top of it, and NVIDIA's decisions constrain or enable everyone else's options.

Layer 4: Dynamo (Orchestration)
Disaggregated Prefill/Decode
LLM-Aware Request Routing
Dynamic GPU Scheduling
30x Throughput (DeepSeek-R1)[42]
Layer 3: TensorRT-LLM (Engine)
FP8 / NVFP4 Quantization
EAGLE-3 Speculative Decoding
Wide Expert Parallelism
Inflight Batching
Layer 2: Triton Inference Server (Serving)
Multi-Framework Support
Dynamic Batching
Model Pipeline (BLS)
Prometheus / OpenTelemetry
Layer 1: CUDA / cuDNN / Blackwell Hardware
FlashAttention-4
H100 / H200 / B200
NVLink / NVSwitch
Vera Rubin (Late 2026)
$20B
Groq Acquisition Price[43]
30x
Dynamo DeepSeek-R1 Throughput
250K+
GPUs Deployed (NIM Ecosystem)
4
Vertically Integrated Stack Layers

NVIDIA's three-pronged inference strategy is unprecedented in scope. First, TensorRT-LLM provides deep kernel optimization for NVIDIA hardware, especially Blackwell. Second, Dynamo (released at GTC 2025) provides open-source datacenter-scale orchestration that supports all major backends (vLLM, TRT-LLM, SGLang).[44] Third, the Groq IP deal ($20B, December 2025, structured as a non-exclusive licensing agreement with significant talent transfer) gives NVIDIA access to an LPU architecture that delivers approximately 10x the throughput of GPUs at roughly 90% lower power.[45]

Dynamo's headline claim is 30x more requests served for DeepSeek-R1 on Blackwell hardware and 2x+ throughput on Llama 70B on Hopper. The mechanism: disaggregated prefill-decode with LLM-aware routing that dynamically allocates GPU resources based on workload characteristics. This is not just an engine; it is an orchestration layer that makes the engine choice less important by abstracting across vLLM, TRT-LLM, and SGLang.

FlashAttention-4, announced at Hot Chips in September 2025, runs exclusively on Blackwell and achieves 20–22% faster performance than cuDNN attention through a 5-stage pipeline with online softmax optimization that skips 90% of rescaling operations.[58] The Blackwell-only restriction is deliberate: it creates a hardware upgrade incentive that benefits NVIDIA's GPU sales.

NVIDIA also publishes its most performant inference kernels through FlashInfer, which won Best Paper at MLSys 2025.[62] FlashInfer is already integrated into SGLang, vLLM, and MLC-Engine as the default attention kernel library. This creates an interesting dynamic: NVIDIA funds and controls the kernel distribution channel that competing engines depend on.

Threat Assessment

NVIDIA controls the full inference stack from silicon to orchestration. The Groq acquisition ($20B) absorbs the most credible alternative silicon.[43] FlashAttention-4 is Blackwell-only. FlashInfer is NVIDIA's kernel distribution channel. Dynamo abstracts across engines, making NVIDIA the orchestration default. Providers who do not build proprietary optimization on top of this stack have zero engine differentiation. The window for non-NVIDIA inference silicon (SambaNova, Etched, Cerebras) is narrowing with each acquisition and integration cycle.

Section 06

Proprietary Engines & Provider Moats

Five providers have built proprietary inference engines that go beyond open-source defaults. Each claims meaningful performance advantages, but the durability of these moats varies significantly by depth of optimization and hardware coupling.

Provider Engine Type Key Differentiator Key Limitation Notable Metric
Fireworks AI FireAttention V4 Proprietary CUDA kernels TensorCore Gen 5 optimization, NVFP4 on B200 Closed-source, NVIDIA-only, single-vendor dependency 250+ tok/s on DeepSeek V3[48]
Together AI Together Engine FlashAttention + custom kernels Tri Dao (FA creator) as Chief Scientist Key-person risk (Tri Dao); FA-4 is Blackwell-only 4x faster than vLLM (claimed)[49]
Crusoe MemoryAlloy Proprietary distributed KV cache Cluster-wide KV sharing, peer-to-peer GPU memory Requires vertically integrated infra; unclear portability 9.9x faster TTFT[50]
Nebius Token Factory Proprietary stack on Aether MLPerf-validated, own data centers (Finland/Paris) Limited model breadth vs. open-source; geographic concentration MLPerf benchmark leader[51]
fal.ai fal Engine Proprietary Diffusion/generative media specialization LLM inference is secondary; narrow model focus Up to 10x faster (diffusion)[52]

Fireworks AI has the deepest engine investment. Founded by the ex-PyTorch team at Meta, Fireworks built FireAttention from scratch with custom CUDA kernels.[46] The company claims 4–15x faster performance than open-source alternatives, though independent benchmarks show open-source engines within 10–20% on standard workloads.[14] FireAttention V4 adds NVFP4 precision on B200 GPUs and claims 3.5x throughput improvement over SGLang on H200. The company processes 10+ trillion tokens per day for 10,000+ customers and raised $250M at $4B valuation in October 2025.[47] The moat is not any single kernel but the integrated optimization across scheduler, memory management, batching policy, and routing logic.

Crusoe's MemoryAlloy takes a fundamentally different approach: instead of optimizing single-node inference, it creates a cluster-wide distributed KV cache fabric with peer-to-peer GPU memory sharing. The result is 9.9x faster TTFT for multi-node inference workloads. This architectural innovation is harder to replicate than kernel-level optimization because it requires control of the network fabric between GPUs, which Crusoe has through its vertically integrated infrastructure.

Together AI's advantage is unique: Tri Dao, the creator of FlashAttention, serves as Chief Scientist. Together has early access to FlashAttention-4 and the deepest kernel expertise in the market. The company is deploying 36,000 GB200 GPUs, the largest single allocation by an independent provider.[55]

Proprietary Engine Advantages

The providers with proprietary engines share three characteristics: (1) founding teams with GPU kernel expertise (ex-PyTorch, ex-NVIDIA, FlashAttention creators), (2) tight hardware-software coupling (optimizing for specific GPU generations), and (3) end-to-end stack control (not just the engine, but scheduler, router, and memory manager). The moat is in the integration, not any single component.

Deep Dive: Can a Proprietary Engine Create a Sustainable Moat?

Short answer: engine-only moats last approximately 6 months. Integration-layer moats are durable. Evidence: vLLM's PagedAttention innovation (June 2023) was matched by SGLang and TensorRT-LLM within two release cycles. SGLang's RadixAttention advantage prompted vLLM to ship Automatic Prefix Caching within months.

The evidence for commoditization is strong. Open-source engines match proprietary performance within 10–20% on most workloads, and the gap closes with each release cycle.[14] NVIDIA Dynamo abstracts across engines, dissolving engine lock-in. TGI's move to maintenance mode confirms that even Hugging Face decided engine optimization is not where value accrues.[12]

But integration-layer moats are different. Consider the evidence:

  • Hardware-software co-design (Groq LPU, SambaNova dataflow, Etched ASIC) requires massive capital but creates multi-year advantages. NVIDIA validated this by paying $20B for Groq.
  • Operational excellence at scale (Fireworks) compounds over time. Each optimization feeds into the next. The moat is not a single kernel but thousands of micro-optimizations across the entire serving path.
  • Customer-specific optimization pipelines (automated quantization, model distillation, prompt optimization) create switching costs. Once a customer's models are profiled and optimized for a specific platform, migration requires re-profiling everything.
  • Sovereign/regulated deployment capability is a structural moat that open-source engines alone cannot provide. Air-gapped, on-premises, data-residency-compliant inference requires integrated infrastructure, not just software.

For MARA, the implication is clear: do not invest in building a custom inference engine. Use vLLM or SGLang as the base. Instead, invest in the orchestration layer (routing, scaling, caching), automated optimization pipelines, and sovereign deployment capabilities. These are the moats that open-source cannot easily replicate.[56]

Section 07

Attention Mechanisms & Kernels

Attention computation is the single most expensive operation in transformer inference. The evolution from naive attention requiring O(n²) memory to IO-aware tiled computation with O(n) memory, and now to hardware-specialized pipelining, has delivered cumulative 15x+ speedups in four years. Understanding this stack is critical because every engine's performance ceiling is ultimately set by its attention kernel.
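The core trick behind the IO-aware lineage in the table below is an online (streaming) softmax: scores are processed block by block with a running max and normalizer, so the full n×n score matrix is never written to memory. A minimal NumPy sketch for a single query vector, omitting all of the GPU-specific tiling and scheduling that makes the real kernels fast:

```python
import numpy as np

def online_attention(q, K, V, block=128):
    """Attention for one query vector with a streaming (online) softmax.

    K/V are consumed block by block while a running max `m` and normalizer `l`
    are maintained, so the n-by-n score matrix is never materialized. This is
    the numerical core of FlashAttention-style tiling.
    """
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)            # scores for this tile
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)             # rescale previously accumulated state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
s = K @ q / 8.0                                # reference: materialize everything
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(online_attention(q, K, V), ref)
```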

Mechanism Year Key Innovation Performance Hardware
FlashAttention-1 2022 IO-aware tiling, exact attention without materialization 2–4x speedup over PyTorch A100
FlashAttention-2 2023 Better work partitioning, reduced non-matmul FLOPs ~2x over FA-1 A100, H100
FlashAttention-3 2024 Warp-specialization, async TMA, FP8 support 740 TFLOPS (75% H100 utilization); FP8: 1.2 PFLOPS[57] H100 (Hopper)
FlashAttention-4 Sep 2025 5-stage pipeline, online softmax (90% rescaling skip), CUDA-based softmax 20–22% over cuDNN attention; 15x over FA-1[58] B200 (Blackwell only)
PagedAttention 2023 Virtual memory for KV cache, non-contiguous block allocation Near-zero memory waste[59] vLLM (any GPU)
RadixAttention 2024 Radix tree prefix caching, automatic KV reuse across requests 85–95% cache hit rate[60] SGLang
Multi-Head Latent Attention (MLA) 2024 Low-rank KV compression into shared latent space 57x KV cache reduction, 93.3% memory savings[61] DeepSeek V3/V3.2
FlashInfer 2024–2025 Unified attention kernel library, NVIDIA kernel release channel Best Paper MLSys 2025[62] SGLang, vLLM, MLC-Engine
The FlashAttention Lineage

FlashAttention is the most consequential inference optimization of the 2020s. Created by Tri Dao (Princeton professor, Together AI Chief Scientist), it has won Outstanding Paper awards at ICML 2022 and COLM 2024, plus an Honorable Mention at MLSys 2025. Every major inference engine uses FlashAttention or its derivatives. FA-4 is Blackwell-only and currently forward-pass-only (no backward pass, no GQA/MQA support), which limits training use but is sufficient for inference.[63]

FlashMLA: DeepSeek's Open-Source Contribution

FlashMLA, open-sourced during DeepSeek's Open Source Week (February 2025), provides optimized CUDA kernels for Multi-Head Latent Attention. On H800 SXM5, it achieves up to 3,000 GB/s memory-bound throughput and 580–660 TFLOPS compute-bound performance. On B200, it reaches 1,460 TFLOPS forward and 1,000 TFLOPS backward.[64] MLA's 93.3% KV cache memory savings means that models like DeepSeek V3 can serve dramatically more concurrent users on the same hardware. A recent ACL 2025 paper demonstrated "Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer," signaling that MLA will spread beyond DeepSeek to become an architectural standard.[65]
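A rough back-of-envelope shows where the headline reduction comes from, assuming DeepSeek-V3-style attention dimensions (128 heads of width 128, a 512-dim latent KV, and a 64-dim decoupled RoPE key); the actual savings figure depends on the baseline being compared against.

```python
heads, head_dim = 128, 128          # DeepSeek-V3-style multi-head width
kv_lora_rank, rope_dim = 512, 64    # MLA latent rank + decoupled RoPE key dim

mha_per_token = 2 * heads * head_dim        # plain MHA: full K and V per layer
mla_per_token = kv_lora_rank + rope_dim     # MLA: shared latent + one RoPE key

print(mha_per_token, mla_per_token, round(mha_per_token / mla_per_token, 1))
# 32768 576 56.9 -> roughly the 57x reduction cited for MLA vs. a plain MHA baseline
```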

For MARA, supporting MLA-based models efficiently via FlashMLA kernels will be table-stakes within 12 months as more model providers adopt this architecture for inference cost optimization.

Section 08

Quantization Landscape

Quantization is the single most impactful cost optimization lever for inference. Reducing precision from FP16 to FP8 cuts memory by 2x; to INT4, by 4x. The question is no longer whether to quantize, but which method delivers the best quality-speed-memory tradeoff for each deployment scenario.
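The memory math is worth making concrete. The figures below cover weights only (KV cache, activations, and quantization scale metadata add more) for a hypothetical dense 70B-parameter model.

```python
params = 70e9  # hypothetical dense 70B-parameter model

bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4 / NVFP4": 0.5}  # ignoring scale metadata
for fmt, b in bytes_per_param.items():
    print(f"{fmt:>12}: {params * b / 1e9:,.0f} GB of weights")
# FP16 ~140 GB, FP8 ~70 GB, 4-bit ~35 GB -- before KV cache and activations
```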

Method Bits Type Quality Retention Speed Gain Best For Engine Support
FP8 8-bit float Weight + Activation ~99.5% ~1.5–2x Production default on Hopper/Blackwell TRT-LLM, vLLM, SGLang
NVFP4 4-bit float Weight + KV Cache ~99% w/calibration ~2–3x Blackwell-native, MoE models TRT-LLM, FireAttention[66]
AWQ 4-bit int Weight-only (activation-aware) ~95% ~2–3x (10.9x w/Marlin kernel)[67] Best INT4 quality for GPU serving vLLM, TGI, SGLang
GPTQ 4-bit int Weight-only (post-training) ~90% ~2–3x Established, wide support vLLM, TGI, TRT-LLM, SGLang
GGUF 1.5–8-bit Weight-only (multi-format) ~92% CPU/Apple optimized Edge/local inference standard llama.cpp, Ollama
INT8 SmoothQuant 8-bit int Weight + Activation ~98% ~1.5–2x Safe starting point for enterprise TRT-LLM, DeepSpeed
Hybrid FP8+INT4 Mixed Per-layer precision ~97% Frontier Attention FP8 + MLP INT4 Experimental (research)
99.5%
FP8 Quality Retention
3.5x
NVFP4 Memory Reduction vs FP16
10.9x
Marlin Kernel Speedup (AWQ)
<1%
NVFP4 Accuracy Loss (Calibrated)
Production Recommendation

Use FP8 as the production default for Hopper/Blackwell. NVFP4 for Blackwell cost optimization (uses block size 16 vs. MXFP4's block size 32, reducing quantization error; Blackwell performs FP4 at double the rate of FP8; Blackwell Ultra at 3x). AWQ for 4-bit GPU serving when you must fit on fewer GPUs. GGUF for edge/local deployments. Kernel optimization matters more than quantization method choice: Marlin kernels provide 2.6–10.9x speedup over standard GPTQ/AWQ kernels on identical quantized weights.[68]
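In practice, consuming a pre-quantized checkpoint is a one-line change in the serving engine. The sketch below assumes a vLLM build with AWQ/Marlin kernels (recent versions select the Marlin path automatically on supported GPUs) and uses a hypothetical checkpoint name.

```python
from vllm import LLM, SamplingParams

# Loading a pre-quantized AWQ checkpoint; the optimized Marlin kernel path is where
# most of the speedup over the reference AWQ kernels comes from.
llm = LLM(
    model="org/llama-3.3-70b-instruct-awq",   # hypothetical AWQ checkpoint
    quantization="awq",
    kv_cache_dtype="fp8",                     # optional: FP8 KV cache cuts memory further
)
print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```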

Recommended Method by Use Case

Use Case Recommended Method Rationale
Production cloud (quality-sensitive) FP8 Near-lossless at 2x memory reduction; no calibration needed
Blackwell cost optimization NVFP4 Native hardware support, <1% degradation with calibration
GPU serving on fewer GPUs AWQ + Marlin kernel Best INT4 quality; 10.9x with optimized kernels
Edge / local / Apple Silicon GGUF Q4_K_M + iMatrix Broadest hardware support; importance-matrix calibration
Enterprise (conservative) INT8 SmoothQuant Proven, safe; 98% quality with minimal risk
Research / frontier Hybrid FP8+INT4 Per-layer precision; attention in FP8, MLP in INT4

The emerging pattern is format hybridization: different layers within a single model receive different quantization levels based on sensitivity analysis. A differentiated inference platform should offer automated quantization pipelines that profile models and select optimal per-layer precision. This is a real value-add over generic cloud inference APIs.[69]

Section 09

Optimization Techniques

Beyond attention kernels and quantization, a constellation of optimization techniques determines real-world throughput and latency. These techniques stack multiplicatively: a provider combining continuous batching, prefix caching, speculative decoding, and disaggregated serving can achieve 50–100x throughput versus a naive implementation.

Technique Speedup How It Works Engine Support Production Status
Speculative Decoding (EAGLE-3) 2–6x[70] Draft model generates tokens ahead; target model verifies in single pass TRT-LLM, vLLM Production
Continuous Batching 10–24x vs naive Dynamic request admission mid-batch; eliminates head-of-line blocking All major engines Standard
Prefix Caching (RadixAttention) Up to 3.1x Radix tree KV reuse for shared prompt prefixes SGLang (85–95% hit rate) Production
Prefix Caching (APC) Moderate Automatic prefix matching on hash-based lookup vLLM (15–25% hit rate) Production
Disaggregated P/D Up to 30x[71] Separate prefill (compute-heavy) and decode (memory-bound) onto different hardware Dynamo, vLLM, SGLang Standard
Structured Generation (XGrammar) 10x over regex[72] Compressed FSM for constrained output decoding (JSON, EBNF) SGLang (default) Production
Wide Expert Parallelism MoE-specific Route MoE experts to different GPUs; optimized all-to-all communication TRT-LLM, vLLM Production
Deep Dive: Speculative Decoding Evolution

Speculative decoding has evolved from a research curiosity to a production-critical optimization in 18 months. The EAGLE family dominates because it preserves output distribution guarantees for both greedy and non-greedy sampling (Medusa and Lookahead do not).

Method Speedup Key Property Status
EAGLE-1 1.5–2x Auto-regressive draft with tree attention Superseded
EAGLE-2 1.7–2.1x over Lookahead Improved tree structure, better acceptance rates vLLM, TRT-LLM
EAGLE-3 2–6x Token-level prediction, multi-layer fusion, TRT-LLM native Production
VSD (Variational) ~9.6% better acceptance than EAGLE-3 Variational approach to draft distribution Research (Feb 2026)
QuantSpec Up to 2.5x Self-speculative with 4-bit quantized KV cache, >90% acceptance Research
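To ground the comparison above, the sketch below shows the generic greedy draft-and-verify loop these methods share, using two Hugging Face-style causal LMs as stand-ins. EAGLE itself drafts from the target model's hidden features with tree attention rather than running a separate draft model, so this is illustrative only.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    """One greedy draft-and-verify step for batch size 1.

    The small draft model proposes k tokens autoregressively; the target model
    then scores the whole extended sequence in a single forward pass and keeps
    the longest prefix of draft tokens it agrees with, plus one token of its own.
    """
    proposal = input_ids
    for _ in range(k):                                        # cheap draft decoding
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)

    tgt_logits = target(proposal).logits                      # single target pass
    n = input_ids.shape[-1]
    accepted = []
    for i in range(k):                                        # verify token by token
        tgt_tok = tgt_logits[:, n - 1 + i].argmax(-1)
        accepted.append(tgt_tok)
        if tgt_tok.item() != proposal[0, n + i].item():       # mismatch: keep the target's
            break                                             # token and stop accepting
    else:
        accepted.append(tgt_logits[:, -1].argmax(-1))         # all accepted: bonus token
    return torch.cat([input_ids, torch.stack(accepted, dim=-1)], dim=-1)
```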

Critical insight for MARA: Speculative decoding benefits shrink at high batch sizes because the throughput bottleneck shifts from memory bandwidth to compute. The technique is most valuable for interactive, low-batch scenarios, which is exactly the target for low-latency interactive inference. Integrating EAGLE-3 for interactive workloads directly supports the core OKR.[73]

Disaggregated Inference: The New Default

Disaggregated prefill-decode serving has gone from research paper (DistServe, 2024) to default architecture across every major framework in under 18 months. The insight is simple: prefill is compute-bound (benefits from high-FLOPS hardware) while decode is memory-bandwidth-bound (benefits from high-bandwidth memory). Separating them onto optimized hardware pools yields dramatic improvements.
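A back-of-envelope calculation (weights only, ignoring attention FLOPs and KV-cache traffic) shows why the two phases want different hardware; the bandwidth figure is the published ~4.8 TB/s H200 HBM spec.

```python
params = 70e9            # dense 70B model
prompt_tokens = 8192     # long prefill

# Prefill: roughly 2 FLOPs per parameter per token (matmul-dominated estimate)
prefill_flops = 2 * params * prompt_tokens           # ~1.1e15 FLOPs -> compute-bound

# Decode at small batch: every step streams the weights from HBM
weight_bytes = params * 1.0                          # FP8 weights, ~70 GB per step
h200_bandwidth = 4.8e12                              # ~4.8 TB/s HBM3e
step_ms = weight_bytes / h200_bandwidth * 1e3        # ~15 ms/token, bandwidth-bound
print(f"prefill ~{prefill_flops:.1e} FLOPs; decode ~{step_ms:.0f} ms per token at batch 1")
```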

NVIDIA Dynamo's implementation on DeepSeek-R1 with Blackwell demonstrates the ceiling: 30x more requests served compared to baseline. On 96 H100s, disaggregated vLLM achieves 52.3K input tok/s + 22.3K output tok/s per node.[74] Every serious inference provider now either implements disaggregated serving or plans to. For MARA, this is not optional; it is a Day 1 architecture requirement.

Section 10

Edge & Local Inference

Edge inference is not MARA's target market. But understanding the edge landscape matters because enterprise customers will ask about hybrid edge+cloud architectures, edge devices will route complex tasks to cloud, and quantization techniques developed for edge (GGUF, iMatrix) directly apply to cloud cost optimization.

Engine GitHub Stars Target Key Feature Status
llama.cpp 95.2K CPU / Apple / cross-platform GGUF standard, 1.5–8-bit quantization. Not designed for multi-GPU cloud serving; use vLLM/SGLang/TRT-LLM for datacenter workloads. Dominant (edge)
Ollama 163K Desktop (Mac, Windows, Linux) One-command deployment (ollama run llama3.3) Most popular local
ExecuTorch N/A (Meta) Mobile / embedded 50KB base footprint, 12+ hardware backends GA (Oct 2025)
Apple MLX N/A (Apple) Apple Silicon Metal-optimized, Python-native, best M-series throughput Growing fast
MLC LLM 19.5K Cross-platform (iOS, Android, browser) Compiler-driven (TVM), MLCEngine API On-device focus
WebLLM N/A (MLC AI) Browser (WebGPU) ~80% native speed; Llama-3.1-8B at 41.1 tok/s in browser[75] Maturing
Exo 21.8K Peer-to-peer distributed Heterogeneous clustering, RDMA over Thunderbolt 5 Alpha
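For completeness, local GGUF inference through the llama.cpp Python bindings looks like the sketch below; the model path is hypothetical and flag names should be checked against the installed llama-cpp-python version.

```python
from llama_cpp import Llama

# Local GGUF inference via llama.cpp's Python bindings (llama-cpp-python).
llm = Llama(
    model_path="./llama-3.2-3b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to Metal/CUDA if available, else run on CPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is GGUF?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```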
Oct 2025
ExecuTorch GA — Meta's on-device framework reaches general availability. 50KB base footprint with support for Apple, Qualcomm, Arm, MediaTek, and Vulkan backends. Meta's play to make Llama the default on-device model.[76]
Dec 2025
llama.cpp Android/ChromeOS — Native app development via GUI binding. Brings CPU-first inference to mobile platforms with direct hardware access.
Jan 2026
Small on-device models practical — Llama 3.2 (1B/3B), Gemma 3 (270M), and Qwen2.5 (0.5–1.5B) handle many practical tasks. Combined with 4-bit quantization, these run on phones.
2026 Forecast
Hybrid edge+cloud becomes enterprise default — Edge handles simple tasks (classification, extraction); cloud handles complex reasoning. Privacy, latency, cost, and offline availability drive adoption.
Edge Is Complementary, Not Competitive

Edge inference does not threaten MARA's cloud inference business. The two are complementary: edge devices handle latency-sensitive, privacy-sensitive, or offline tasks while routing complex reasoning (70B+ models, long-context, multi-step) to cloud. The enterprise pattern emerging in 2026 is a routing layer that automatically selects edge vs. cloud based on task complexity, privacy requirements, and cost constraints. MARA should prepare for this hybrid architecture in its API design.[77]

Section 11

Provider-Engine Matrix

Every managed inference provider makes engine-layer decisions that define their competitive position. This matrix maps providers to their engines, hardware, and capability scores across six dimensions critical for enterprise buyers.

Provider Primary Engine Hardware Throughput Latency Structured Output Disaggregated Edge
Fireworks AI FireAttention V4 H200, B200 High High Yes Yes No
Together AI Together Engine + FA H100, H200, GB200 High High Moderate Yes No
Baseten TRT-LLM / Truss NVIDIA (AWS) High Moderate Moderate Via TRT-LLM No
Crusoe MemoryAlloy H100, H200 High 9.9x TTFT Limited Native No
Nebius Token Factory H100, H200, Blackwell Ultra MLPerf-validated High Moderate Yes No
Modal vLLM (primary) NVIDIA GPUs Moderate Moderate Via vLLM No No
fal.ai fal Engine H100, H200 10x (diffusion) High N/A (media) Partial No
DeepInfra TRT-LLM / Blackwell H100, Blackwell High Moderate Moderate Yes No
Inferact vLLM (commercial) Multi-hardware High High Via Outlines V1 Native No
Groq LPU Runtime LPU v2 (NVIDIA) 1,600+ tok/s Ultra-low Moderate N/A (ASIC) No
Cerebras WSE Runtime WSE-3 High Ultra-low Limited N/A (wafer) No
SambaNova RDU Runtime SN40L RDU High Moderate Limited Dataflow arch No
AWS (Inferentia/Neuron) Neuron SDK / Transformers NeuronX Inferentia2, Trainium Moderate Moderate Limited No No

Note: Amazon Inferentia2/Trainium with the Neuron SDK represents the most significant non-NVIDIA inference hardware effort from a hyperscaler. While its model support is narrower and tooling less mature than NVIDIA's stack, AWS's scale and pricing (up to 40% cheaper than comparable GPU instances) make it relevant for cost-optimized batch workloads. Next-generation Trainium silicon, expected in 2026, aims to close the performance gap.

Best Engine For Each Use Case

Use Case Recommended Engine Why
High-throughput batch processing vLLM Best concurrency scaling, broadest hardware support, PyTorch Foundation governance
Low-latency interactive serving SGLang or TRT-LLM RadixAttention cache efficiency (SGLang) or native NVIDIA kernel optimization (TRT-LLM)
Structured output (JSON/EBNF) SGLang + XGrammar 10x faster constrained generation; compressed FSM approach
Edge / local deployment llama.cpp + GGUF Broadest hardware support (CPU, Metal, CUDA, Vulkan, WebGPU)
Diffusion / media generation fal Engine or SGLang Diffusion Specialized optimization for image/video generation workloads
DeepSeek MoE models vLLM or SGLang Wide-EP + MLA support; DeepSeek-specific optimizations in both engines
Maximum NVIDIA optimization TensorRT-LLM Deepest kernel optimization; best B200 performance; NVFP4 native
Section 12

Strategic Implications

~6 mo
Engine Moat Half-Life
10–20%
Open vs Proprietary Gap
5
Durable Moat Areas
5–7x
Self-Host Cost Advantage*
Reality Check: Engine Commoditization

Engine-only moats last approximately 6 months before open-source catches up. vLLM's PagedAttention (June 2023) was matched by SGLang and TRT-LLM within two release cycles. NVIDIA's Dynamo abstracts across engines, dissolving lock-in. TGI's retreat to maintenance mode shows that even well-funded engines lose. Meta's Llama has commoditized model access: open models achieve ~90% of closed-model performance at 87% lower inference cost.[78][79] See Section 06 for the full moat sustainability analysis.

Opportunity: Five Durable Moat Areas

Despite engine-layer commoditization, durable advantages exist where open-source cannot easily follow:

  1. Hardware-software co-design — Custom silicon (Groq LPU, Etched ASIC) creates multi-year advantages. NVIDIA validated this by paying $20B for Groq.
  2. End-to-end stack optimization — Fireworks AI demonstrates that thousands of micro-optimizations compound into a moat no single open-source project can replicate.
  3. Customer-specific optimization pipelines — Automated quantization profiling, model distillation, and per-workload tuning create switching costs. Migration requires re-profiling everything.
  4. Sovereign / air-gapped deployment — Data-residency-compliant inference requires integrated infrastructure, not just software.
  5. Guaranteed latency SLAs — Contractual latency guarantees are a product, not infrastructure. Most providers offer best-effort only.[80]

Engine Switching Costs

Switching inference engines is not trivial. Practitioners report that migrating from vLLM to TensorRT-LLM (or vice versa) typically requires weeks of engineering effort for model re-optimization, batching policy tuning, and integration testing. Quantization profiles must be rebuilt from scratch. Monitoring and alerting pipelines require reconfiguration. For enterprises running 10+ models in production, a full engine migration is a quarter-long project. This switching cost is itself a moat for platforms that lock in early.

Commercialization Risk

Both Inferact (vLLM) and RadixArk (SGLang) are now venture-backed companies with commercial interests. The risk: enterprise features may be gated behind paid tiers, or open-source release cadence may slow to protect commercial offerings. MARA's contingency: maintain the ability to run either engine, avoid deep coupling to Inferact-specific or RadixArk-specific APIs, and monitor their licensing decisions quarterly. The Apache 2.0 license protects the current codebase, but future innovations may not be open-sourced.

Dimension Market Reality MARA Implication
Engine layer Commoditizing. vLLM/SGLang are "good enough" for 90% of workloads. Do not build a custom engine. Build ON open-source engines. Invest in orchestration.
Quantization FP8 default, NVFP4 emerging, AWQ for 4-bit. Kernel optimization > method choice. Offer automated quantization-as-a-service. Differentiate on per-layer optimization.
Speculative decoding EAGLE-3 at 2–6x. Best for low-batch interactive scenarios. Key differentiator for low-latency targets. Integrate EAGLE-3 for Day 1.
Hardware dependency NVIDIA controls the full stack. Blackwell dominates. Groq IP acquired. Optimize for Blackwell first. Plan for Vera Rubin. Accept NVIDIA dependency with eyes open.
Attention architecture MLA spreading beyond DeepSeek. FlashMLA open-sourced. Must support MLA models efficiently. This is table-stakes within 12 months.
Deployment model Sovereign/air-gapped is underserved. Hybrid edge+cloud emerging. Core differentiator. Build sovereign deployment as a product, not an afterthought.
Pricing dynamics Commoditizing fast. Self-hosting can be 5–7x cheaper than proprietary APIs at scale (varies by model size and utilization). Compete on value (SLAs, ease, optimization), not raw price per token.

The MLA Question

Supporting DeepSeek-style Multi-Head Latent Attention efficiently will be table-stakes for any serious inference platform. MLA's 93.3% KV cache memory savings and 57x cache reduction mean that models using this architecture can serve dramatically more concurrent users on the same hardware. As more model providers adopt MLA (signaled by the ACL 2025 paper on enabling MLA in any transformer), platforms that cannot serve MLA models efficiently will lose on cost-per-token for the fastest-growing model family in the market. MARA's engine selection must include FlashMLA kernel support from day one.

For detailed analysis of where sustainable moats exist (and where they do not), see the five integration-layer moat categories in Section 06.[81]

Section 13

Methodology & Sources

Research Methodology

This report synthesizes 75+ primary sources collected between February 15–20, 2026. Analysis covers the full inference engine landscape from low-level attention kernels to production serving frameworks and managed inference platforms. All GitHub star counts are snapshots from February 2026 and may vary by several hundred. Performance benchmarks are sourced from independent evaluations (Clarifai, Cerebrium) and official project blogs; proprietary engine claims (FireAttention, Together Engine, MemoryAlloy) are self-reported and not independently verified. Financial data comes from press reports and may not reflect final deal terms. Quantization quality retention percentages are averages across standard benchmarks (MMLU, HumanEval, GSM8K) and may differ significantly for specific enterprise tasks.

Source Categories

Source Type Count Examples
GitHub repositories & release notes 15+ vLLM, SGLang, TRT-LLM, llama.cpp, FlashAttention, FlashInfer, Ollama, Exo, WebLLM
Official project blogs 15+ vLLM blog, NVIDIA developer blog, PyTorch blog, LMSYS blog, Fireworks blog
Independent benchmarks & evaluations 10+ Clarifai, Cerebrium, JarvisLabs, NurbolSakenov, MarkTechPost, OpenRouter
Technical papers (arXiv, MLSys, ACL) 8+ FlashAttention-3, FlashInfer, EAGLE-3, VSD, MLA, on-device LLMs (Apple Silicon)
Financial & analyst reporting 10+ Sacra, Fortune, TechCrunch, SiliconAngle, SDxCentral, VentureBeat
Industry analysis & strategy 10+ California Management Review, WorkOS, Wing VC, IntuitionLabs, Edge AI Vision

All data points are tagged with footnote references to their primary source. Where multiple sources report conflicting figures (particularly around the NVIDIA-Groq deal structure and proprietary benchmark claims), we note the discrepancy and present the most conservative interpretation.

References & Footnotes

  1. [1] MarkTechPost, "Comparing the Top 6 Inference Runtimes for LLM Serving in 2025," Nov 2025. marktechpost.com
  2. [2] Clarifai, "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B," Feb 2026. clarifai.com
  3. [3] SiliconAngle, "Inferact Launches with $150M Funding to Commercialize vLLM," Jan 2026. siliconangle.com; TechCrunch, "Project SGLang Spins Out as RadixArk with $400M Valuation," Jan 2026. techcrunch.com
  4. [4] NVIDIA Developer Blog, "Introducing NVIDIA Dynamo," 2025. developer.nvidia.com
  5. [5] vLLM Blog, "Large-Scale Serving with Disaggregated Inference," Dec 2025. blog.vllm.ai
  6. [6] vLLM GitHub repository, star count as of Feb 2026. github.com/vllm-project/vllm
  7. [7] SGLang GitHub repository, star count as of Feb 2026. github.com/sgl-project/sglang
  8. [8] TensorRT-LLM GitHub repository. github.com/NVIDIA/TensorRT-LLM
  9. [9] llama.cpp GitHub repository, star count as of Feb 2026. github.com/ggml-org/llama.cpp
  10. [10] Ollama GitHub repository, star count as of Feb 2026. github.com/ollama/ollama
  11. [11] TGI GitHub repository; maintenance mode announced Dec 2025. github.com/huggingface/text-generation-inference
  12. [12] Hugging Face Blog, "TGI Multi-Backend Architecture," 2025. huggingface.co/blog/tgi-multi-backend
  13. [13] SiliconAngle, Inferact $150M seed announcement, Jan 2026. siliconangle.com
  14. [14] Cerebrium, "Benchmarking vLLM, SGLang, TensorRT for Llama 3.1 API," 2025. cerebrium.ai
  15. [15] vLLM documentation, supported hardware backends. github.com/vllm-project/vllm
  16. [16] SiliconAngle, "Inferact Launches with $150M Funding," Jan 2026. siliconangle.com
  17. [17] LMSYS Blog, "SGLang v0.4 Release," Dec 2024. lmsys.org
  18. [18] SGLang Documentation, "Structured Outputs with XGrammar." docs.sglang.io
  19. [19] TechCrunch, "SGLang Spins Out as RadixArk," Jan 2026. techcrunch.com
  20. [20] NVIDIA TensorRT-LLM Release Notes. nvidia.github.io/TensorRT-LLM
  21. [21] llama.cpp GitHub, hardware support documentation. github.com/ggml-org/llama.cpp
  22. [22] Ollama GitHub repository. github.com/ollama/ollama
  23. [23] Hugging Face Blog, "TGI Multi-Backend," supporting NVIDIA, AMD, TPU, Neuron. huggingface.co
  24. [24] NVIDIA Dynamo GitHub repository. github.com/ai-dynamo/dynamo
  25. [25] Microsoft ONNX Runtime GitHub repository. github.com/microsoft/onnxruntime
  26. [26] MLC LLM GitHub, WebGPU and cross-platform support. github.com/mlc-ai/web-llm
  27. [27] ONNX Runtime, hardware acceleration providers. github.com/microsoft/onnxruntime
  28. [28] Exo GitHub repository, peer-to-peer distributed inference. github.com/exo-explore/exo
  29. [29] Clarifai, "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B on 2x H100." clarifai.com
  30. [30] Cerebrium, "SGLang maintains stable 4-21ms per-token latency across concurrency levels." cerebrium.ai
  31. [31] LMSYS Blog, SGLang RadixAttention achieving 85-95% cache hit rates. lmsys.org
  32. [32] NVIDIA TensorRT-LLM Speculative Decoding Documentation, EAGLE-3 up to 3.6x. nvidia.github.io
  33. [33] vLLM Blog, "GPT-OSS Optimizations: DeepSeek MoE at 2.2K tok/s per H200 with Wide-EP," Feb 2026. blog.vllm.ai
  34. [34] PyTorch Blog, "PyTorch Foundation Expands, Welcomes vLLM and DeepSpeed," May 2025. pytorch.org
  35. [35] SiliconAngle, Inferact $150M seed at $800M valuation, a16z + Lightspeed. siliconangle.com
  36. [36] TechCrunch, RadixArk $400M valuation, Accel-led, Ying Sheng as CEO. techcrunch.com
  37. [37] vLLM original paper, PagedAttention for virtual memory KV cache management. github.com/vllm-project/vllm
  38. [38] vLLM Blog, "V1 Alpha Release: mixed prefill+decode, cross-node KV cache reuse," Jan 2025. blog.vllm.ai
  39. [39] LMSYS Blog, SGLang 3.1x higher throughput than vLLM on Llama-70B. lmsys.org
  40. [40] SGLang documentation, GB200 NVL72 performance: 3.8x prefill, 4.8x decode throughput. github.com/sgl-project/sglang
  41. [41] NVIDIA TensorRT-LLM Release Notes, B200 performance benchmarks. nvidia.github.io
  42. [42] NVIDIA Developer Blog, "Dynamo: 30x throughput on DeepSeek-R1 with Blackwell." developer.nvidia.com
  43. [43] Fortune, "After NVIDIA's Groq Deal, AI Chip Startups in Play," Jan 2026. fortune.com
  44. [44] NVIDIA Dynamo GitHub, multi-backend support (vLLM, TRT-LLM, SGLang). github.com/ai-dynamo/dynamo
  45. [45] Groq Newsroom, "Groq and NVIDIA Enter Non-Exclusive Inference Technology Licensing Agreement." groq.com; IntuitionLabs analysis. intuitionlabs.ai
  46. [46] Fireworks AI Blog, "FireAttention V4: FP4 on B200." fireworks.ai
  47. [47] Sacra, Fireworks AI company profile: $327M total funding, $4B valuation. sacra.com
  48. [48] WorkOS, "Fireworks AI: The PyTorch Team's Bet on Inference as the New Runtime." workos.com
  49. [49] Together AI, FlashAttention integration and Tri Dao as Chief Scientist. github.com/Dao-AILab/flash-attention
  50. [50] Crusoe, MemoryAlloy: cluster-wide distributed KV cache fabric for 9.9x faster TTFT. Internal product documentation.
  51. [51] Nebius, Token Factory launch with MLPerf-validated performance on Aether infrastructure.
  52. [52] fal.ai, proprietary inference engine for diffusion/generative media, $140M raise at $4.5B valuation.
  53. [53] California Management Review, "How Open-Source AI Will Challenge Closed-Model Giants," Jan 2026. cmr.berkeley.edu
  54. [54] SDxCentral, "AI Inferencing Will Define 2026 -- And the Market's Wide Open." sdxcentral.com
  55. [55] OpenRouter, "State of AI: 100T Token Study on Inference Patterns." openrouter.ai
  56. [56] VentureBeat, "NVIDIA, Groq, and the Race to Real-Time AI: Why Enterprises Win." venturebeat.com
  57. [57] PyTorch Blog, "FlashAttention-3: achieving 740 TFLOPS on H100 (75% utilization), 1.2 PFLOPS with FP8." pytorch.org; arXiv paper. arxiv.org
  58. [58] Modal, "Reverse Engineering FlashAttention-4: 5-stage pipeline, 20-22% over cuDNN, Blackwell-only." modal.com
  59. [59] vLLM original paper, PagedAttention for near-zero KV cache memory waste. github.com/vllm-project/vllm
  60. [60] LMSYS Blog, SGLang v0.4: RadixAttention achieves 85-95% cache hit rates on few-shot workloads. lmsys.org
  61. [61] DeepSeek FlashMLA GitHub, "Multi-Head Latent Attention: 57x KV cache reduction, 93.3% memory savings." github.com/deepseek-ai/FlashMLA
  62. [62] FlashInfer GitHub, Best Paper Award MLSys 2025; NVIDIA kernel release channel. github.com/flashinfer-ai/flashinfer
  63. [63] Dao-AILab FlashAttention GitHub, FA-4 forward-only limitation, BSD 3-Clause license. github.com/Dao-AILab/flash-attention
  64. [64] DeepSeek FlashMLA GitHub, "H800: 3000 GB/s memory-bound, 580-660 TFLOPS compute-bound; B200: 1460 TFLOPS forward." github.com/deepseek-ai/FlashMLA
  65. [65] ACL 2025, "Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer." aclanthology.org
  66. [66] NVIDIA Developer Blog, "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference." developer.nvidia.com; Fireworks AI FireAttention V4 with NVFP4. fireworks.ai
  67. [67] JarvisLabs, "vLLM Quantization Complete Guide: Marlin kernel achieves 10.9x speedup for AWQ (741 tok/s vs 68 tok/s)." docs.jarvislabs.ai
  68. [68] JarvisLabs, "Kernel optimization matters more than quantization method: Marlin provides 2.6-10.9x speedup." docs.jarvislabs.ai
  69. [69] LocalAIMaster, "Quantization Explained: AWQ vs GPTQ vs GGUF format comparison." localaimaster.com
  70. [70] E2E Networks, "Accelerating LLM Inference with EAGLE-3: 2-6x speedups." e2enetworks.com
  71. [71] NVIDIA Developer Blog, "Dynamo disaggregated serving: up to 30x more requests on DeepSeek-R1." developer.nvidia.com
  72. [72] SGLang Structured Outputs Documentation, "XGrammar: up to 10x faster than regex-based constrained generation." docs.sglang.io
  73. [73] NVIDIA, "An Introduction to Speculative Decoding for Reducing Latency in AI Inference." developer.nvidia.com
  74. [74] vLLM Blog, "Large-Scale Serving: 52.3K input tok/s + 22.3K output tok/s per node on 96 H100s." blog.vllm.ai
  75. [75] WebLLM GitHub, "Browser-based inference: Llama-3.1-8B at 41.1 tok/s, ~80% native speed." github.com/mlc-ai/web-llm
  76. [76] Edge AI Vision, "On-Device LLMs in 2026: ExecuTorch GA, 50KB footprint." edge-ai-vision.com; Vikas Chandra (Meta) on-device LLMs state of the union. v-chandra.github.io
  77. [77] arxiv, "Production-Grade Local LLM Inference on Apple Silicon," Nov 2025. arxiv.org
  78. [78] California Management Review, "The Coming Disruption: How Open-Source AI Will Challenge Closed-Model Giants," Jan 2026. cmr.berkeley.edu
  79. [79] OpenRouter, "State of AI: open models achieve ~90% of closed model performance; 80% of tokens still through closed APIs." openrouter.ai
  80. [80] SDxCentral, "AI Inferencing Will Define 2026 -- And the Market's Wide Open." sdxcentral.com
  81. [81] WorkOS, "Fireworks AI: The PyTorch Team's Bet on Inference as the New Runtime." workos.com