Landscape Report — Inference Engines

AI Inference Engines & Frameworks: The Technology Layer Powering the $126B Market

vLLM • SGLang • TensorRT-LLM • FireAttention • FlashAttention • llama.cpp — Performance, Adoption & Strategic Moats

Feb 2026 • MinjAI Agents • 75+ Sources • 13 Sections
Internal — Strategic Intelligence
Section 01

Why Engines Matter

15+
Major Inference Engines
3
Dominant Cloud Engines
2
Spinouts Jan 2026
$126B
Inference Market TAM 2030
Why This Layer Decides Winners

Models are commoditizing. Hardware is standardized around NVIDIA. The inference engine is the layer where cost advantages, latency guarantees, and throughput differentiation are actually created. A 2x engine optimization halves the compute cost per token, which shows up either as substantially higher gross margin or as a roughly 50% price cut at constant margin. Every platform profiled in our managed inference landscape report derives its competitive position from engine-layer decisions.

The inference engine sits between the model weights and the GPU hardware. It determines how tokens are batched, how memory is allocated, how attention is computed, and how multiple requests share finite GPU capacity.[1] Two providers running the same Llama 3.3 70B model on identical H200 GPUs can deliver throughput that differs by 3–5x depending on engine choice and optimization depth.[2]

Three dynamics define the February 2026 landscape: open-source engine commercialization (two $400M–$800M spinouts in January[3]), NVIDIA's vertical stack consolidation from silicon to orchestration[4], and disaggregated prefill-decode serving becoming the default architecture.[5] This report maps 15+ engines across 12 dimensions to inform MARA's engine-layer strategy for Project Sapien.

Section 02

Executive Summary

The production inference engine market has consolidated around three open-source projects and a handful of proprietary alternatives. Each engine makes different tradeoffs between performance, flexibility, and ecosystem lock-in.

Engine Maintainer GitHub Stars Key Innovation Primary Users License
vLLM Inferact (UC Berkeley) 70.8K[6] PagedAttention, continuous batching Modal, RunPod, Anyscale, BentoML Apache 2.0
SGLang RadixArk (UC Berkeley) 23.6K[7] RadixAttention, structured output LMSYS Chatbot Arena, xAI Apache 2.0
TensorRT-LLM NVIDIA 12.7K[8] FP8/NVFP4, EAGLE-3 speculative Baseten, DeepInfra, NVIDIA NIM Apache 2.0
llama.cpp Georgi Gerganov 95.2K[9] GGUF format, CPU-first inference Ollama, LM Studio, Jan MIT
Ollama Ollama Inc. 163K[10] One-command deployment on llama.cpp Individual devs, prototyping MIT
TGI Hugging Face 9.7K[11] Multi-backend (vLLM/TRT-LLM) HF Inference Endpoints HFOIL
FlashAttention Tri Dao (Princeton/Together) — IO-aware exact attention, FA-4 Blackwell Every major engine BSD 3-Clause
NVIDIA Dynamo NVIDIA New (2025) Disaggregated P/D, LLM-aware routing Azure AKS, NVIDIA NIM Apache 2.0
Key Finding: Convergence + Commercialization

The market is converging on vLLM (open-source default), TensorRT-LLM (NVIDIA-optimized), and SGLang (performance alternative). TGI entering maintenance mode in December 2025 confirms this consolidation.[12] Meanwhile, the January 2026 spinouts of Inferact ($800M valuation) and RadixArk ($400M valuation) mark the shift from academic projects to venture-backed companies with commercial interests that may increasingly diverge from pure open-source community needs.[13]

The proprietary vs. open-source dynamic is nuanced. Open-source engines match proprietary performance within 10–20% on most workloads.[14] But providers like Fireworks AI (FireAttention) and Together AI (custom kernels + FlashAttention) claim 3–4x advantages through end-to-end stack optimization that goes beyond any single engine component. The question is whether these advantages are durable or transient. Section 06 addresses this directly.

Section 03

Engine Landscape Snapshot

Twelve open-source and semi-open engines mapped across origin, scale, innovation, hardware support, and business model. Proprietary engines from Fireworks, Together, Crusoe, Nebius, and fal are profiled separately in Section 06.

Engine Origin Stars Version Key Innovation Hardware Quantization Structured Output Disagg. P/D License Funding / Val. Primary Users
vLLM UC Berkeley 70.8K v0.15.1 PagedAttention, continuous batching, prefix caching NVIDIA, AMD, Intel, TPU, Ascend, Gaudi[15] FP8, GPTQ, AWQ, Marlin Via Outlines Yes (experimental) Apache 2.0 Inferact: $150M seed, $800M val.[16] Modal, RunPod, Anyscale, BentoML
SGLang UC Berkeley LMSYS 23.6K v0.4+ RadixAttention, zero-overhead scheduling, XGrammar NVIDIA, TPU (SGLang-Jax), GB200 NVL72[17] FP8, GPTQ, AWQ Native (XGrammar, 10x faster)[18] Yes Apache 2.0 RadixArk: $400M val.[19] LMSYS Arena, xAI
TensorRT-LLM NVIDIA 12.7K v1.3.0rc4 FP8/NVFP4, EAGLE-3, Wide Expert Parallelism NVIDIA only (Hopper, Blackwell, Ada)[20] FP8, NVFP4, INT8, INT4 Limited Yes (via Dynamo) Apache 2.0 NVIDIA (corporate) Baseten, DeepInfra, NIM
llama.cpp Georgi Gerganov 95.2K Continuous GGUF format, CPU-first, 1.5–8-bit quant CPU (ARM, x86), Metal, CUDA, ROCm, Vulkan, WebGPU[21] GGUF (1.5–8-bit) Grammar-based No MIT Community-driven Ollama, LM Studio, Jan, GPT4All
Ollama Ollama Inc. 163K Continuous One-command deploy, 200+ models CPU, Metal, CUDA (via llama.cpp)[22] GGUF (via llama.cpp) Via llama.cpp No MIT Undisclosed VC Individual devs, prototyping
TGI Hugging Face 9.7K v3.3.5 Multi-backend (vLLM/TRT-LLM/llama.cpp) NVIDIA, AMD, TPU, Neuron[23] GPTQ, AWQ, EXL2, bitsandbytes Via Outlines No HFOIL HF: $4.5B val. HF Endpoints (maintenance mode)
Triton Server NVIDIA 8.7K v2.65.0 Multi-framework, dynamic batching, BLS NVIDIA (Hopper, Blackwell) Via backends No Via Dynamo BSD 3-Clause NVIDIA (corporate) Enterprise, SageMaker, Azure ML
NVIDIA Dynamo NVIDIA New v1.0 Disaggregated P/D, dynamic GPU scheduling NVIDIA (Hopper, Blackwell)[24] Via engine backends Via engine backends Core feature Apache 2.0 NVIDIA (corporate) Azure AKS, NIM, K8s
DeepSpeed-MII Microsoft 2.1K v0.2.x Dynamic SplitFuse, ZeroQuant NVIDIA GPUs ZeroQuant (INT8/INT4) No No Apache 2.0 Microsoft (corporate) Declining (MS shifting to ONNX)[25]
MLC LLM MLC AI (TVM) 19.5K v0.1.0 Compiler-driven, cross-platform (WebGPU) CUDA, OpenCL, Vulkan, Metal, WebGPU[26] TVM-based auto quant No No Apache 2.0 Community/OctoML On-device, browser (WebLLM)
ONNX Runtime Microsoft 15.3K v1.23.2 Universal ONNX format, cross-platform CUDA, TensorRT, DirectML, OpenVINO, CoreML[27] INT8, INT4 (MoE kernels) No No MIT Microsoft (corporate) Windows ML, Azure ML, enterprise
Exo EXO Labs 21.8K v0.0.15-alpha Peer-to-peer distributed inference Any (phones, laptops, DGX Spark)[28] Via backends No P2P distributed GPL-3.0 Undisclosed (startup) Consumer, heterogeneous clusters

Two patterns emerge from this landscape. First, the GitHub star distribution follows a power law: the top 3 engines (Ollama, llama.cpp, vLLM) hold 72% of total community attention, confirming that developer mindshare has concentrated rapidly. Second, every engine that matters now supports some form of continuous batching and KV cache management; the differentiation has moved to higher-level innovations like cache-aware routing, structured output, and disaggregated serving.

Section 04

The Big Three Cloud Engines

70.8K
vLLM GitHub Stars
23.6K
SGLang GitHub Stars
12.7K
TensorRT-LLM Stars
1,800+
Combined Contributors (Big Three)

For production cloud inference, three engines define the frontier. Each makes fundamentally different tradeoffs. vLLM optimizes for broad hardware compatibility and ecosystem adoption. SGLang optimizes for cache efficiency and structured generation. TensorRT-LLM optimizes for raw NVIDIA hardware utilization. The choice between them defines a provider's performance ceiling, operational complexity, and vendor lock-in.

Dimension vLLM SGLang TensorRT-LLM
Throughput (100 concurrent, H100) 4,741 tok/s[29] ~5,000 tok/s ~5,000 tok/s (short input)
TTFT Fastest across concurrency Moderate Slowest
Per-token latency stability Variable Most stable (4–21ms)[30] Good
Cache hit rate (few-shot) 15–25% 85–95%[31] Standard
Blackwell performance Good Good Best (deepest optimization)
Hardware breadth NVIDIA, AMD, Intel, TPU, Ascend, Gaudi NVIDIA, TPU (Jax) NVIDIA only
Structured output Via Outlines Native XGrammar (10x faster) Limited
Speculative decoding Draft model EAGLE integration EAGLE-3 native (up to 3.6x on B200)[32]
MoE optimization Wide-EP (2.2K tok/s/H200)[33] DP attention (1.9x decode) Wide EP (native)
Governance PyTorch Foundation[34] RadixArk (startup) NVIDIA (corporate)
January 2026: Two Spinouts, $1.2B in Valuation

In a single week, both dominant open-source inference projects commercialized. Inferact (vLLM) raised $150M at $800M valuation from a16z and Lightspeed, led by Ion Stoica (Databricks co-founder).[35] RadixArk (SGLang) secured an Accel-led round at $400M valuation, with Ying Sheng (ex-xAI) as CEO.[36] Both remain Apache 2.0 licensed, but the commercial entities will increasingly control roadmap priorities, enterprise features, and community governance. Monitor for licensing or feature gating changes.

vLLM: The De Facto Standard

vLLM is to LLM inference what Linux is to operating systems: the default choice that works everywhere. Its PagedAttention mechanism applies virtual memory principles to KV cache management, achieving near-zero memory waste and enabling significantly larger batch sizes.[37] Combined with continuous batching, it delivers 10–24x faster serving versus naive implementations.
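As a concrete reference point, the sketch below shows the minimal offline-serving path on which PagedAttention and continuous batching operate. The model id, parallelism, and sampling settings are illustrative, and flags such as enable_prefix_caching assume a recent vLLM release.

```python
from vllm import LLM, SamplingParams

# Offline batch serving: vLLM handles KV-cache paging and continuous
# batching internally; the caller just submits a batch of prompts.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",   # illustrative model id
    tensor_parallel_size=4,                      # shard across 4 GPUs
    gpu_memory_utilization=0.90,                 # leave headroom for activations
    enable_prefix_caching=True,                  # reuse KV blocks for shared prefixes
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```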

The V1 architecture (complete migration by v0.11.0) removed V0 code entirely, enabling mixed prefill+decode in the same step and cross-node KV cache reuse for disaggregated inference.[38] The v0.14.0 release introduced vLLM-Omni, the first open-source omni-modality serving framework (text, image, video, audio, TTS). vLLM's governance under the PyTorch Foundation, with maintainers from Anyscale, AWS, Databricks, IBM, and Snowflake, ensures no single corporate interest controls the project.

SGLang: The Performance Challenger

SGLang's core innovation is RadixAttention, which uses a radix tree data structure for automatic KV cache reuse across requests. Where vLLM achieves 15–25% cache hit rates on few-shot workloads, SGLang achieves 85–95%.[31] On cache-heavy workloads like multi-turn chat, SGLang delivers 3.1x higher throughput than vLLM on Llama-70B.[39]

SGLang also leads in structured generation, having moved to XGrammar as its default backend. XGrammar uses compressed finite state machines for constrained output decoding, delivering up to 10x performance improvement over regex-based approaches. On GB200 NVL72 hardware, SGLang achieves 3.8x prefill and 4.8x decode throughput versus H100.[40]
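A hedged sketch of how this is typically consumed: SGLang exposes an OpenAI-compatible endpoint (started with `python -m sglang.launch_server --model-path <model>`), and a JSON schema supplied in the request is compiled by XGrammar into a grammar that masks invalid tokens during decoding. The exact response_format field names follow the OpenAI convention and should be checked against the SGLang version in use.

```python
import json
from openai import OpenAI

# Assumes a local SGLang server, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {"engine": {"type": "string"}, "stars_k": {"type": "number"}},
    "required": ["engine", "stars_k"],
}

resp = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Return the most-starred inference engine."}],
    # Grammar-constrained decoding: the schema guarantees the output parses.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "engine_info", "schema": schema},
    },
)
print(json.loads(resp.choices[0].message.content))
```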

SGLang's weaknesses: Narrower hardware support than vLLM (primarily NVIDIA, with experimental AMD/TPU); a smaller contributor base (~400 vs. vLLM's 1,000+); and governance now dependent on RadixArk, a seed-stage startup, rather than a foundation. If RadixArk's priorities diverge from the open-source community, SGLang's roadmap could fragment.

TensorRT-LLM: The NVIDIA Native

TensorRT-LLM trades hardware breadth for maximum NVIDIA performance. On B200 GPUs, it consistently outperforms both vLLM and SGLang across all metrics due to deep Blackwell kernel optimization.[41] Native NVFP4 support enables 4-bit inference with less than 1% accuracy degradation when properly calibrated. EAGLE-3 speculative decoding delivers up to 3.6x throughput boost on B200 at low batch sizes (the 2–6x range reported in literature varies by batch size, model size, and acceptance rate; see Section 09). Wide Expert Parallelism optimizes MoE model serving for architectures like DeepSeek and Mixtral.

The tradeoff is clear: TensorRT-LLM is NVIDIA-only, requires more setup complexity, and has a smaller contributor base. But for providers committed to NVIDIA hardware (which is most of the market), it offers the highest performance ceiling.
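For orientation, the sketch below uses TensorRT-LLM's high-level LLM API (available in recent 1.x releases and deliberately modeled on vLLM's interface); in production, most users pair it with a ModelOpt-quantized FP8/NVFP4 checkpoint and serve via Triton or trtllm-serve. Treat the exact arguments as an assumption to verify against the installed version.

```python
from tensorrt_llm import LLM, SamplingParams

# High-level PyTorch-backend API; engine build and kernel selection happen under the hood.
# An FP8- or NVFP4-quantized checkpoint can be passed the same way as a standard
# Hugging Face model id.
llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct", tensor_parallel_size=4)

params = SamplingParams(temperature=0.0, max_tokens=128)
for out in llm.generate(["What does EAGLE-3 speculative decoding do?"], params):
    print(out.outputs[0].text)
```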

Deep Dive: vLLM V1 Architecture

The V1 architecture, completed by v0.11.0 in late 2025, represents the most significant structural change in vLLM's history. V0 code was fully removed.

  • Mixed prefill+decode: V0 could only do one operation at a time per step. V1 mixes both, improving GPU utilization during batched serving by 20–40%.
  • Disaggregated inference: KV cache is fetched from remote nodes, enabling cross-node cache reuse. Prefill and decode can run on separate GPU pools optimized for their respective compute profiles.
  • DeepSeek MoE support: Wide-EP (Expert Parallelism) delivers 2.2K tok/s per H200 for DeepSeek-style MoE models.[33]
  • Per-token latency reduction: v0.4 release achieved 40% reduction in per-output-token latency for DeepSeek V3.1 on H200.
  • Omni-modality (v0.14.0): vLLM-Omni serves text, image, video, audio, and TTS models through a unified serving framework.

The V1 migration signals vLLM's maturation from a research prototype to production infrastructure. It also raised the bar for SGLang, which must now match vLLM's disaggregated serving capabilities to maintain its performance advantage.

Section 05

NVIDIA's Inference Stack

NVIDIA is assembling the most vertically integrated inference stack in the industry: from silicon to orchestration, from kernel libraries to managed microservices. Understanding this stack is essential because every inference provider builds on top of it, and NVIDIA's decisions constrain or enable everyone else's options.

Layer 4: Dynamo (Orchestration)
Disaggregated Prefill/Decode
LLM-Aware Request Routing
Dynamic GPU Scheduling
30x Throughput (DeepSeek-R1)[42]
Layer 3: TensorRT-LLM (Engine)
FP8 / NVFP4 Quantization
EAGLE-3 Speculative Decoding
Wide Expert Parallelism
Inflight Batching
Layer 2: Triton Inference Server (Serving)
Multi-Framework Support
Dynamic Batching
Model Pipeline (BLS)
Prometheus / OpenTelemetry
Layer 1: CUDA / cuDNN / Blackwell Hardware
FlashAttention-4
H100 / H200 / B200
NVLink / NVSwitch
Vera Rubin (Late 2026)
$20B
Groq Acquisition Price[43]
30x
Dynamo DeepSeek-R1 Throughput
250K+
GPUs Deployed (NIM Ecosystem)
4
Vertically Integrated Stack Layers

NVIDIA's three-pronged inference strategy is unprecedented in scope. First, TensorRT-LLM provides deep kernel optimization for NVIDIA hardware, especially Blackwell. Second, Dynamo (released at GTC 2025) provides open-source datacenter-scale orchestration that supports all major backends (vLLM, TRT-LLM, SGLang).[44] Third, the Groq IP deal ($20B, December 2025, structured as a non-exclusive licensing agreement with significant talent transfer) gives NVIDIA access to an LPU architecture that delivers approximately 10x the throughput of GPUs at roughly 90% lower power.[45]

Dynamo's headline claim is 30x more requests served for DeepSeek-R1 on Blackwell hardware and 2x+ throughput on Llama 70B on Hopper. The mechanism: disaggregated prefill-decode with LLM-aware routing that dynamically allocates GPU resources based on workload characteristics. This is not just an engine; it is an orchestration layer that makes the engine choice less important by abstracting across vLLM, TRT-LLM, and SGLang.

FlashAttention-4, announced at Hot Chips in September 2025, runs exclusively on Blackwell and achieves 20–22% faster performance than cuDNN attention through a 5-stage pipeline with online softmax optimization that skips 90% of rescaling operations.[58] The Blackwell-only restriction is deliberate: it creates a hardware upgrade incentive that benefits NVIDIA's GPU sales.

NVIDIA also publishes its most performant inference kernels through FlashInfer, which won Best Paper at MLSys 2025.[62] FlashInfer is already integrated into SGLang, vLLM, and MLC-Engine as the default attention kernel library. This creates an interesting dynamic: NVIDIA funds and controls the kernel distribution channel that competing engines depend on.

Threat Assessment

NVIDIA controls the full inference stack from silicon to orchestration. The Groq acquisition ($20B) absorbs the most credible alternative silicon.[43] FlashAttention-4 is Blackwell-only. FlashInfer is NVIDIA's kernel distribution channel. Dynamo abstracts across engines, making NVIDIA the orchestration default. Providers who do not build proprietary optimization on top of this stack have zero engine differentiation. The window for non-NVIDIA inference silicon (SambaNova, Etched, Cerebras) is narrowing with each acquisition and integration cycle.

Section 06

Proprietary Engines & Provider Moats

Five providers have built proprietary inference engines that go beyond open-source defaults. Each claims meaningful performance advantages, but the durability of these moats varies significantly by depth of optimization and hardware coupling.

Provider Engine Type Key Differentiator Key Limitation Notable Metric
Fireworks AI FireAttention V4 Proprietary CUDA kernels TensorCore Gen 5 optimization, NVFP4 on B200 Closed-source, NVIDIA-only, single-vendor dependency 250+ tok/s on DeepSeek V3[48]
Together AI Together Engine FlashAttention + custom kernels Tri Dao (FA creator) as Chief Scientist Key-person risk (Tri Dao); FA-4 is Blackwell-only 4x faster than vLLM (claimed)[49]
Crusoe MemoryAlloy Proprietary distributed KV cache Cluster-wide KV sharing, peer-to-peer GPU memory Requires vertically integrated infra; unclear portability 9.9x faster TTFT[50]
Nebius Token Factory Proprietary stack on Aether MLPerf-validated, own data centers (Finland/Paris) Limited model breadth vs. open-source; geographic concentration MLPerf benchmark leader[51]
fal.ai fal Engine Proprietary Diffusion/generative media specialization LLM inference is secondary; narrow model focus Up to 10x faster (diffusion)[52]

Fireworks AI has the deepest engine investment. Founded by the ex-PyTorch team at Meta, Fireworks built FireAttention from scratch with custom CUDA kernels.[46] The company claims 4–15x faster performance than open-source alternatives, though independent benchmarks show open-source engines within 10–20% on standard workloads.[14] FireAttention V4 adds NVFP4 precision on B200 GPUs and claims 3.5x throughput improvement over SGLang on H200. The company processes 10+ trillion tokens per day for 10,000+ customers and raised $250M at $4B valuation in October 2025.[47] The moat is not any single kernel but the integrated optimization across scheduler, memory management, batching policy, and routing logic.

Crusoe's MemoryAlloy takes a fundamentally different approach: instead of optimizing single-node inference, it creates a cluster-wide distributed KV cache fabric with peer-to-peer GPU memory sharing. The result is 9.9x faster TTFT for multi-node inference workloads. This architectural innovation is harder to replicate than kernel-level optimization because it requires control of the network fabric between GPUs, which Crusoe has through its vertically integrated infrastructure.

Together AI's advantage is unique: Tri Dao, the creator of FlashAttention, serves as Chief Scientist. Together has early access to FlashAttention-4 and the deepest kernel expertise in the market. The company is deploying 36,000 GB200 GPUs, the largest single allocation by an independent provider.[55]

Proprietary Engine Advantages

The providers with proprietary engines share three characteristics: (1) founding teams with GPU kernel expertise (ex-PyTorch, ex-NVIDIA, FlashAttention creators), (2) tight hardware-software coupling (optimizing for specific GPU generations), and (3) end-to-end stack control (not just the engine, but scheduler, router, and memory manager). The moat is in the integration, not any single component.

Deep Dive: Can a Proprietary Engine Create a Sustainable Moat?

Short answer: engine-only moats last approximately 6 months. Integration-layer moats are durable. Evidence: vLLM's PagedAttention innovation (June 2023) was matched by SGLang and TensorRT-LLM within two release cycles. SGLang's RadixAttention advantage prompted vLLM to ship Automatic Prefix Caching within months.

The evidence for commoditization is strong. Open-source engines match proprietary performance within 10–20% on most workloads, and the gap closes with each release cycle.[14] NVIDIA Dynamo abstracts across engines, dissolving engine lock-in. TGI's move to maintenance mode confirms that even Hugging Face decided engine optimization is not where value accrues.[12]

But integration-layer moats are different. Consider the evidence:

  • Hardware-software co-design (Groq LPU, SambaNova dataflow, Etched ASIC) requires massive capital but creates multi-year advantages. NVIDIA validated this by paying $20B for Groq.
  • Operational excellence at scale (Fireworks) compounds over time. Each optimization feeds into the next. The moat is not a single kernel but thousands of micro-optimizations across the entire serving path.
  • Customer-specific optimization pipelines (automated quantization, model distillation, prompt optimization) create switching costs. Once a customer's models are profiled and optimized for a specific platform, migration requires re-profiling everything.
  • Sovereign/regulated deployment capability is a structural moat that open-source engines alone cannot provide. Air-gapped, on-premises, data-residency-compliant inference requires integrated infrastructure, not just software.

For MARA, the implication is clear: do not invest in building a custom inference engine. Use vLLM or SGLang as the base. Instead, invest in the orchestration layer (routing, scaling, caching), automated optimization pipelines, and sovereign deployment capabilities. These are the moats that open-source cannot easily replicate.[56]

Section 07

Attention Mechanisms & Kernels

Attention computation is the single most expensive operation in transformer inference. The evolution from naive attention requiring O(n²) memory to IO-aware tiled computation with O(n) memory, and now to hardware-specialized pipelining, has delivered cumulative 15x+ speedups in four years. Understanding this stack is critical because every engine's performance ceiling is ultimately set by its attention kernel.
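The core trick behind the IO-aware lineage in the table below is an online (streaming) softmax: scores are processed block by block with a running max and normalizer, so the full n×n score matrix is never written to memory. A minimal NumPy sketch for a single query vector, omitting all of the GPU-specific tiling and scheduling that makes the real kernels fast:

```python
import numpy as np

def online_attention(q, K, V, block=128):
    """Attention for one query vector with a streaming (online) softmax.

    K/V are consumed block by block while a running max `m` and normalizer `l`
    are maintained, so the n-by-n score matrix is never materialized. This is
    the numerical core of FlashAttention-style tiling.
    """
    d = q.shape[-1]
    m, l, acc = -np.inf, 0.0, np.zeros(d)
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)            # scores for this tile
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)             # rescale previously accumulated state
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
s = K @ q / 8.0                                # reference: materialize everything
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(online_attention(q, K, V), ref)
```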

Mechanism Year Key Innovation Performance Hardware
FlashAttention-1 2022 IO-aware tiling, exact attention without materialization 2–4x speedup over PyTorch A100
FlashAttention-2 2023 Better work partitioning, reduced non-matmul FLOPs ~2x over FA-1 A100, H100
FlashAttention-3 2024 Warp-specialization, async TMA, FP8 support 740 TFLOPS (75% H100 utilization); FP8: 1.2 PFLOPS[57] H100 (Hopper)
FlashAttention-4 Sep 2025 5-stage pipeline, online softmax (90% rescaling skip), CUDA-based softmax 20–22% over cuDNN attention; 15x over FA-1[58] B200 (Blackwell only)
PagedAttention 2023 Virtual memory for KV cache, non-contiguous block allocation Near-zero memory waste[59] vLLM (any GPU)
RadixAttention 2024 Radix tree prefix caching, automatic KV reuse across requests 85–95% cache hit rate[60] SGLang
Multi-Head Latent Attention (MLA) 2024 Low-rank KV compression into shared latent space 57x KV cache reduction, 93.3% memory savings[61] DeepSeek V3/V3.2
FlashInfer 2024–2025 Unified attention kernel library, NVIDIA kernel release channel Best Paper MLSys 2025[62] SGLang, vLLM, MLC-Engine
The FlashAttention Lineage

FlashAttention is the most consequential inference optimization of the 2020s. Created by Tri Dao (Princeton professor, Together AI Chief Scientist), it has won Outstanding Paper awards at ICML 2022 and COLM 2024, plus an Honorable Mention at MLSys 2025. Every major inference engine uses FlashAttention or its derivatives. FA-4 is Blackwell-only and currently forward-pass-only (no backward pass, no GQA/MQA support), which limits training use but is sufficient for inference.[63]

FlashMLA: DeepSeek's Open-Source Contribution

FlashMLA, open-sourced during DeepSeek's Open Source Week (February 2025), provides optimized CUDA kernels for Multi-Head Latent Attention. On H800 SXM5, it achieves up to 3,000 GB/s memory-bound throughput and 580–660 TFLOPS compute-bound performance. On B200, it reaches 1,460 TFLOPS forward and 1,000 TFLOPS backward.[64] MLA's 93.3% KV cache memory savings means that models like DeepSeek V3 can serve dramatically more concurrent users on the same hardware. A recent ACL 2025 paper demonstrated "Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer," signaling that MLA will spread beyond DeepSeek to become an architectural standard.[65]
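A rough back-of-envelope shows where the headline reduction comes from, assuming DeepSeek-V3-style attention dimensions (128 heads of width 128, a 512-dim latent KV, and a 64-dim decoupled RoPE key); the actual savings figure depends on the baseline being compared against.

```python
heads, head_dim = 128, 128          # DeepSeek-V3-style multi-head width
kv_lora_rank, rope_dim = 512, 64    # MLA latent rank + decoupled RoPE key dim

mha_per_token = 2 * heads * head_dim        # plain MHA: full K and V per layer
mla_per_token = kv_lora_rank + rope_dim     # MLA: shared latent + one RoPE key

print(mha_per_token, mla_per_token, round(mha_per_token / mla_per_token, 1))
# 32768 576 56.9 -> roughly the 57x reduction cited for MLA vs. a plain MHA baseline
```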

For MARA, supporting MLA-based models efficiently via FlashMLA kernels will be table-stakes within 12 months as more model providers adopt this architecture for inference cost optimization.

Section 08

Quantization Landscape

Quantization is the single most impactful cost optimization lever for inference. Reducing precision from FP16 to FP8 cuts memory by 2x; to INT4, by 4x. The question is no longer whether to quantize, but which method delivers the best quality-speed-memory tradeoff for each deployment scenario.
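The memory math is worth making concrete. The figures below cover weights only (KV cache, activations, and quantization scale metadata add more) for a hypothetical dense 70B-parameter model.

```python
params = 70e9  # hypothetical dense 70B-parameter model

bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4 / NVFP4": 0.5}  # ignoring scale metadata
for fmt, b in bytes_per_param.items():
    print(f"{fmt:>12}: {params * b / 1e9:,.0f} GB of weights")
# FP16 ~140 GB, FP8 ~70 GB, 4-bit ~35 GB -- before KV cache and activations
```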

Method Bits Type Quality Retention Speed Gain Best For Engine Support
FP8 8-bit float Weight + Activation ~99.5% ~1.5–2x Production default on Hopper/Blackwell TRT-LLM, vLLM, SGLang
NVFP4 4-bit float Weight + KV Cache ~99% w/calibration ~2–3x Blackwell-native, MoE models TRT-LLM, FireAttention[66]
AWQ 4-bit int Weight-only (activation-aware) ~95% ~2–3x (10.9x w/Marlin kernel)[67] Best INT4 quality for GPU serving vLLM, TGI, SGLang
GPTQ 4-bit int Weight-only (post-training) ~90% ~2–3x Established, wide support vLLM, TGI, TRT-LLM, SGLang
GGUF 1.5–8-bit Weight-only (multi-format) ~92% CPU/Apple optimized Edge/local inference standard llama.cpp, Ollama
INT8 SmoothQuant 8-bit int Weight + Activation ~98% ~1.5–2x Safe starting point for enterprise TRT-LLM, DeepSpeed
Hybrid FP8+INT4 Mixed Per-layer precision ~97% Frontier Attention FP8 + MLP INT4 Experimental (research)
99.5%
FP8 Quality Retention
3.5x
NVFP4 Memory Reduction vs FP16
10.9x
Marlin Kernel Speedup (AWQ)
<1%
NVFP4 Accuracy Loss (Calibrated)
Production Recommendation

Use FP8 as the production default for Hopper/Blackwell. NVFP4 for Blackwell cost optimization (uses block size 16 vs. MXFP4's block size 32, reducing quantization error; Blackwell performs FP4 at double the rate of FP8; Blackwell Ultra at 3x). AWQ for 4-bit GPU serving when you must fit on fewer GPUs. GGUF for edge/local deployments. Kernel optimization matters more than quantization method choice: Marlin kernels provide 2.6–10.9x speedup over standard GPTQ/AWQ kernels on identical quantized weights.[68]
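In practice, consuming a pre-quantized checkpoint is a one-line change in the serving engine. The sketch below assumes a vLLM build with AWQ/Marlin kernels (recent versions select the Marlin path automatically on supported GPUs) and uses a hypothetical checkpoint name.

```python
from vllm import LLM, SamplingParams

# Loading a pre-quantized AWQ checkpoint; the optimized Marlin kernel path is where
# most of the speedup over the reference AWQ kernels comes from.
llm = LLM(
    model="org/llama-3.3-70b-instruct-awq",   # hypothetical AWQ checkpoint
    quantization="awq",
    kv_cache_dtype="fp8",                     # optional: FP8 KV cache cuts memory further
)
print(llm.generate(["ping"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```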

Recommended Method by Use Case

Use Case Recommended Method Rationale
Production cloud (quality-sensitive) FP8 Near-lossless at 2x memory reduction; no calibration needed
Blackwell cost optimization NVFP4 Native hardware support, <1% degradation with calibration
GPU serving on fewer GPUs AWQ + Marlin kernel Best INT4 quality; 10.9x with optimized kernels
Edge / local / Apple Silicon GGUF Q4_K_M + iMatrix Broadest hardware support; importance-matrix calibration
Enterprise (conservative) INT8 SmoothQuant Proven, safe; 98% quality with minimal risk
Research / frontier Hybrid FP8+INT4 Per-layer precision; attention in FP8, MLP in INT4

The emerging pattern is format hybridization: different layers within a single model receive different quantization levels based on sensitivity analysis. A differentiated inference platform should offer automated quantization pipelines that profile models and select optimal per-layer precision. This is a real value-add over generic cloud inference APIs.[69]

Section 09

Optimization Techniques

Beyond attention kernels and quantization, a constellation of optimization techniques determines real-world throughput and latency. These techniques stack multiplicatively: a provider combining continuous batching, prefix caching, speculative decoding, and disaggregated serving can achieve 50–100x throughput versus a naive implementation.

Technique Speedup How It Works Engine Support Production Status
Speculative Decoding (EAGLE-3) 2–6x[70] Draft model generates tokens ahead; target model verifies in single pass TRT-LLM, vLLM Production
Continuous Batching 10–24x vs naive Dynamic request admission mid-batch; eliminates head-of-line blocking All major engines Standard
Prefix Caching (RadixAttention) Up to 3.1x Radix tree KV reuse for shared prompt prefixes SGLang (85–95% hit rate) Production
Prefix Caching (APC) Moderate Automatic prefix matching on hash-based lookup vLLM (15–25% hit rate) Production
Disaggregated P/D Up to 30x[71] Separate prefill (compute-heavy) and decode (memory-bound) onto different hardware Dynamo, vLLM, SGLang Standard
Structured Generation (XGrammar) 10x over regex[72] Compressed FSM for constrained output decoding (JSON, EBNF) SGLang (default) Production
Wide Expert Parallelism MoE-specific Route MoE experts to different GPUs; optimized all-to-all communication TRT-LLM, vLLM Production
Deep Dive: Speculative Decoding Evolution

Speculative decoding has evolved from a research curiosity to a production-critical optimization in 18 months. The EAGLE family dominates because it preserves output distribution guarantees for both greedy and non-greedy sampling (Medusa and Lookahead do not).

Method Speedup Key Property Status
EAGLE-1 1.5–2x Auto-regressive draft with tree attention Superseded
EAGLE-2 1.7–2.1x over Lookahead Improved tree structure, better acceptance rates vLLM, TRT-LLM
EAGLE-3 2–6x Token-level prediction, multi-layer fusion, TRT-LLM native Production
VSD (Variational) ~9.6% better acceptance than EAGLE-3 Variational approach to draft distribution Research (Feb 2026)
QuantSpec Up to 2.5x Self-speculative with 4-bit quantized KV cache, >90% acceptance Research
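To ground the comparison above, the sketch below shows the generic greedy draft-and-verify loop these methods share, using two Hugging Face-style causal LMs as stand-ins. EAGLE itself drafts from the target model's hidden features with tree attention rather than running a separate draft model, so this is illustrative only.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    """One greedy draft-and-verify step for batch size 1.

    The small draft model proposes k tokens autoregressively; the target model
    then scores the whole extended sequence in a single forward pass and keeps
    the longest prefix of draft tokens it agrees with, plus one token of its own.
    """
    proposal = input_ids
    for _ in range(k):                                        # cheap draft decoding
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)

    tgt_logits = target(proposal).logits                      # single target pass
    n = input_ids.shape[-1]
    accepted = []
    for i in range(k):                                        # verify token by token
        tgt_tok = tgt_logits[:, n - 1 + i].argmax(-1)
        accepted.append(tgt_tok)
        if tgt_tok.item() != proposal[0, n + i].item():       # mismatch: keep the target's
            break                                             # token and stop accepting
    else:
        accepted.append(tgt_logits[:, -1].argmax(-1))         # all accepted: bonus token
    return torch.cat([input_ids, torch.stack(accepted, dim=-1)], dim=-1)
```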

Critical insight for MARA: Speculative decoding benefits shrink at high batch sizes because the throughput bottleneck shifts from memory bandwidth to compute. The technique is most valuable for interactive, low-batch scenarios, which is exactly the target for low-latency interactive inference. Integrating EAGLE-3 for interactive workloads directly supports the core OKR.[73]

Disaggregated Inference: The New Default

Disaggregated prefill-decode serving has gone from research paper (DistServe, 2024) to default architecture across every major framework in under 18 months. The insight is simple: prefill is compute-bound (benefits from high-FLOPS hardware) while decode is memory-bandwidth-bound (benefits from high-bandwidth memory). Separating them onto optimized hardware pools yields dramatic improvements.
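A back-of-envelope calculation (weights only, ignoring attention FLOPs and KV-cache traffic) shows why the two phases want different hardware; the bandwidth figure is the published ~4.8 TB/s H200 HBM spec.

```python
params = 70e9            # dense 70B model
prompt_tokens = 8192     # long prefill

# Prefill: roughly 2 FLOPs per parameter per token (matmul-dominated estimate)
prefill_flops = 2 * params * prompt_tokens           # ~1.1e15 FLOPs -> compute-bound

# Decode at small batch: every step streams the weights from HBM
weight_bytes = params * 1.0                          # FP8 weights, ~70 GB per step
h200_bandwidth = 4.8e12                              # ~4.8 TB/s HBM3e
step_ms = weight_bytes / h200_bandwidth * 1e3        # ~15 ms/token, bandwidth-bound
print(f"prefill ~{prefill_flops:.1e} FLOPs; decode ~{step_ms:.0f} ms per token at batch 1")
```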

NVIDIA Dynamo's implementation on DeepSeek-R1 with Blackwell demonstrates the ceiling: 30x more requests served compared to baseline. On 96 H100s, disaggregated vLLM achieves 52.3K input tok/s + 22.3K output tok/s per node.[74] Every serious inference provider now either implements disaggregated serving or plans to. For MARA, this is not optional; it is a Day 1 architecture requirement.

Section 10

Edge & Local Inference

Edge inference is not MARA's target market. But understanding the edge landscape matters because enterprise customers will ask about hybrid edge+cloud architectures, edge devices will route complex tasks to cloud, and quantization techniques developed for edge (GGUF, iMatrix) directly apply to cloud cost optimization.

Engine GitHub Stars Target Key Feature Status
llama.cpp 95.2K CPU / Apple / cross-platform GGUF standard, 1.5–8-bit quantization. Not designed for multi-GPU cloud serving; use vLLM/SGLang/TRT-LLM for datacenter workloads. Dominant (edge)
Ollama 163K Desktop (Mac, Windows, Linux) One-command deployment (ollama run llama3.3) Most popular local
ExecuTorch N/A (Meta) Mobile / embedded 50KB base footprint, 12+ hardware backends GA (Oct 2025)
Apple MLX N/A (Apple) Apple Silicon Metal-optimized, Python-native, best M-series throughput Growing fast
MLC LLM 19.5K Cross-platform (iOS, Android, browser) Compiler-driven (TVM), MLCEngine API On-device focus
WebLLM N/A (MLC AI) Browser (WebGPU) ~80% native speed; Llama-3.1-8B at 41.1 tok/s in browser[75] Maturing
Exo 21.8K Peer-to-peer distributed Heterogeneous clustering, RDMA over Thunderbolt 5 Alpha
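For completeness, local GGUF inference through the llama.cpp Python bindings looks like the sketch below; the model path is hypothetical and flag names should be checked against the installed llama-cpp-python version.

```python
from llama_cpp import Llama

# Local GGUF inference via llama.cpp's Python bindings (llama-cpp-python).
llm = Llama(
    model_path="./llama-3.2-3b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to Metal/CUDA if available, else run on CPU
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is GGUF?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```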
Oct 2025
ExecuTorch GA — Meta's on-device framework reaches general availability. 50KB base footprint with support for Apple, Qualcomm, Arm, MediaTek, and Vulkan backends. Meta's play to make Llama the default on-device model.[76]
Dec 2025
llama.cpp Android/ChromeOS — Native app development via GUI binding. Brings CPU-first inference to mobile platforms with direct hardware access.
Jan 2026
Small on-device models practical — Llama 3.2 (1B/3B), Gemma 3 (270M), and Qwen2.5 (0.5–1.5B) handle many practical tasks. Combined with 4-bit quantization, these run on phones.
2026 Forecast
Hybrid edge+cloud becomes enterprise default — Edge handles simple tasks (classification, extraction); cloud handles complex reasoning. Privacy, latency, cost, and offline availability drive adoption.
Edge Is Complementary, Not Competitive

Edge inference does not threaten MARA's cloud inference business. The two are complementary: edge devices handle latency-sensitive, privacy-sensitive, or offline tasks while routing complex reasoning (70B+ models, long-context, multi-step) to cloud. The enterprise pattern emerging in 2026 is a routing layer that automatically selects edge vs. cloud based on task complexity, privacy requirements, and cost constraints. MARA should prepare for this hybrid architecture in its API design.[77]

Section 11

Provider-Engine Matrix

Every managed inference provider makes engine-layer decisions that define their competitive position. This matrix maps providers to their engines, hardware, and capability scores across six dimensions critical for enterprise buyers.

Provider Primary Engine Hardware Throughput Latency Structured Output Disaggregated Edge
Fireworks AI FireAttention V4 H200, B200 High High Yes Yes No
Together AI Together Engine + FA H100, H200, GB200 High High Moderate Yes No
Baseten TRT-LLM / Truss NVIDIA (AWS) High Moderate Moderate Via TRT-LLM No
Crusoe MemoryAlloy H100, H200 High 9.9x TTFT Limited Native No
Nebius Token Factory H100, H200, Blackwell Ultra MLPerf-validated High Moderate Yes No
Modal vLLM (primary) NVIDIA GPUs Moderate Moderate Via vLLM No No
fal.ai fal Engine H100, H200 10x (diffusion) High N/A (media) Partial No
DeepInfra TRT-LLM / Blackwell H100, Blackwell High Moderate Moderate Yes No
Inferact vLLM (commercial) Multi-hardware High High Via Outlines V1 Native No
Groq LPU Runtime LPU v2 (NVIDIA) 1,600+ tok/s Ultra-low Moderate N/A (ASIC) No
Cerebras WSE Runtime WSE-3 High Ultra-low Limited N/A (wafer) No
SambaNova RDU Runtime SN40L RDU High Moderate Limited Dataflow arch No
AWS (Inferentia/Neuron) Neuron SDK / Transformers NeuronX Inferentia2, Trainium Moderate Moderate Limited No No

Note: Amazon Inferentia2/Trainium with the Neuron SDK represents the most significant non-NVIDIA inference hardware effort from a hyperscaler. While its model support is narrower and tooling less mature than NVIDIA's stack, AWS's scale and pricing (up to 40% cheaper than comparable GPU instances) make it relevant for cost-optimized batch workloads. Next-generation Trainium silicon, expected in 2026, aims to close the performance gap.

Best Engine For Each Use Case

Use Case Recommended Engine Why
High-throughput batch processing vLLM Best concurrency scaling, broadest hardware support, PyTorch Foundation governance
Low-latency interactive serving SGLang or TRT-LLM RadixAttention cache efficiency (SGLang) or native NVIDIA kernel optimization (TRT-LLM)
Structured output (JSON/EBNF) SGLang + XGrammar 10x faster constrained generation; compressed FSM approach
Edge / local deployment llama.cpp + GGUF Broadest hardware support (CPU, Metal, CUDA, Vulkan, WebGPU)
Diffusion / media generation fal Engine or SGLang Diffusion Specialized optimization for image/video generation workloads
DeepSeek MoE models vLLM or SGLang Wide-EP + MLA support; DeepSeek-specific optimizations in both engines
Maximum NVIDIA optimization TensorRT-LLM Deepest kernel optimization; best B200 performance; NVFP4 native
Section 12

Strategic Implications

~6 mo
Engine Moat Half-Life
10–20%
Open vs Proprietary Gap
5
Durable Moat Areas
5–7x
Self-Host Cost Advantage*
Reality Check: Engine Commoditization

Engine-only moats last approximately 6 months before open-source catches up. vLLM's PagedAttention (June 2023) was matched by SGLang and TRT-LLM within two release cycles. NVIDIA's Dynamo abstracts across engines, dissolving lock-in. TGI's retreat to maintenance mode shows that even well-funded engines lose. Meta's Llama has commoditized model access: open models achieve ~90% of closed-model performance at 87% lower inference cost.[78][79] See Section 06 for the full moat sustainability analysis.

Opportunity: Five Durable Moat Areas

Despite engine-layer commoditization, durable advantages exist where open-source cannot easily follow:

  1. Hardware-software co-design — Custom silicon (Groq LPU, Etched ASIC) creates multi-year advantages. NVIDIA validated this by paying $20B for Groq.
  2. End-to-end stack optimization — Fireworks AI demonstrates that thousands of micro-optimizations compound into a moat no single open-source project can replicate.
  3. Customer-specific optimization pipelines — Automated quantization profiling, model distillation, and per-workload tuning create switching costs. Migration requires re-profiling everything.
  4. Sovereign / air-gapped deployment — Data-residency-compliant inference requires integrated infrastructure, not just software.
  5. Guaranteed latency SLAs — Contractual latency guarantees are a product, not infrastructure. Most providers offer best-effort only.[80]

Engine Switching Costs

Switching inference engines is not trivial. Practitioners report that migrating from vLLM to TensorRT-LLM (or vice versa) typically requires weeks of engineering effort for model re-optimization, batching policy tuning, and integration testing. Quantization profiles must be rebuilt from scratch. Monitoring and alerting pipelines require reconfiguration. For enterprises running 10+ models in production, a full engine migration is a quarter-long project. This switching cost is itself a moat for platforms that lock in early.

Commercialization Risk

Both Inferact (vLLM) and RadixArk (SGLang) are now venture-backed companies with commercial interests. The risk: enterprise features may be gated behind paid tiers, or open-source release cadence may slow to protect commercial offerings. MARA's contingency: maintain the ability to run either engine, avoid deep coupling to Inferact-specific or RadixArk-specific APIs, and monitor their licensing decisions quarterly. The Apache 2.0 license protects the current codebase, but future innovations may not be open-sourced.

Dimension Market Reality MARA Implication
Engine layer Commoditizing. vLLM/SGLang are "good enough" for 90% of workloads. Do not build a custom engine. Build ON open-source engines. Invest in orchestration.
Quantization FP8 default, NVFP4 emerging, AWQ for 4-bit. Kernel optimization > method choice. Offer automated quantization-as-a-service. Differentiate on per-layer optimization.
Speculative decoding EAGLE-3 at 2–6x. Best for low-batch interactive scenarios. Key differentiator for low-latency targets. Integrate EAGLE-3 for Day 1.
Hardware dependency NVIDIA controls the full stack. Blackwell dominates. Groq IP acquired. Optimize for Blackwell first. Plan for Vera Rubin. Accept NVIDIA dependency with eyes open.
Attention architecture MLA spreading beyond DeepSeek. FlashMLA open-sourced. Must support MLA models efficiently. This is table-stakes within 12 months.
Deployment model Sovereign/air-gapped is underserved. Hybrid edge+cloud emerging. Core differentiator. Build sovereign deployment as a product, not an afterthought.
Pricing dynamics Commoditizing fast. Self-hosting can be 5–7x cheaper than proprietary APIs at scale (varies by model size and utilization). Compete on value (SLAs, ease, optimization), not raw price per token.

The MLA Question

Supporting DeepSeek-style Multi-Head Latent Attention efficiently will be table-stakes for any serious inference platform. MLA's 93.3% KV cache memory savings and 57x cache reduction mean that models using this architecture can serve dramatically more concurrent users on the same hardware. As more model providers adopt MLA (signaled by the ACL 2025 paper on enabling MLA in any transformer), platforms that cannot serve MLA models efficiently will lose on cost-per-token for the fastest-growing model family in the market. MARA's engine selection must include FlashMLA kernel support from day one.

For detailed analysis of where sustainable moats exist (and where they do not), see the five integration-layer moat categories in Section 06.[81]

Section 13

Methodology & Sources

Research Methodology

This report synthesizes 75+ primary sources collected between February 15–20, 2026. Analysis covers the full inference engine landscape from low-level attention kernels to production serving frameworks and managed inference platforms. All GitHub star counts are snapshots from February 2026 and may vary by several hundred. Performance benchmarks are sourced from independent evaluations (Clarifai, Cerebrium) and official project blogs; proprietary engine claims (FireAttention, Together Engine, MemoryAlloy) are self-reported and not independently verified. Financial data comes from press reports and may not reflect final deal terms. Quantization quality retention percentages are averages across standard benchmarks (MMLU, HumanEval, GSM8K) and may differ significantly for specific enterprise tasks.

Source Categories

Source Type Count Examples
GitHub repositories & release notes 15+ vLLM, SGLang, TRT-LLM, llama.cpp, FlashAttention, FlashInfer, Ollama, Exo, WebLLM
Official project blogs 15+ vLLM blog, NVIDIA developer blog, PyTorch blog, LMSYS blog, Fireworks blog
Independent benchmarks & evaluations 10+ Clarifai, Cerebrium, JarvisLabs, NurbolSakenov, MarkTechPost, OpenRouter
Technical papers (arXiv, MLSys, ACL) 8+ FlashAttention-3, FlashInfer, EAGLE-3, VSD, MLA, on-device LLMs (Apple Silicon)
Financial & analyst reporting 10+ Sacra, Fortune, TechCrunch, SiliconAngle, SDxCentral, VentureBeat
Industry analysis & strategy 10+ California Management Review, WorkOS, Wing VC, IntuitionLabs, Edge AI Vision

All data points are tagged with footnote references to their primary source. Where multiple sources report conflicting figures (particularly around the NVIDIA-Groq deal structure and proprietary benchmark claims), we note the discrepancy and present the most conservative interpretation.

References & Footnotes

  1. [1] MarkTechPost, "Comparing the Top 6 Inference Runtimes for LLM Serving in 2025," Nov 2025. marktechpost.com
  2. [2] Clarifai, "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B," Feb 2026. clarifai.com
  3. [3] SiliconAngle, "Inferact Launches with $150M Funding to Commercialize vLLM," Jan 2026. siliconangle.com; TechCrunch, "Project SGLang Spins Out as RadixArk with $400M Valuation," Jan 2026. techcrunch.com
  4. [4] NVIDIA Developer Blog, "Introducing NVIDIA Dynamo," 2025. developer.nvidia.com
  5. [5] vLLM Blog, "Large-Scale Serving with Disaggregated Inference," Dec 2025. blog.vllm.ai
  6. [6] vLLM GitHub repository, star count as of Feb 2026. github.com/vllm-project/vllm
  7. [7] SGLang GitHub repository, star count as of Feb 2026. github.com/sgl-project/sglang
  8. [8] TensorRT-LLM GitHub repository. github.com/NVIDIA/TensorRT-LLM
  9. [9] llama.cpp GitHub repository, star count as of Feb 2026. github.com/ggml-org/llama.cpp
  10. [10] Ollama GitHub repository, star count as of Feb 2026. github.com/ollama/ollama
  11. [11] TGI GitHub repository; maintenance mode announced Dec 2025. github.com/huggingface/text-generation-inference
  12. [12] Hugging Face Blog, "TGI Multi-Backend Architecture," 2025. huggingface.co/blog/tgi-multi-backend
  13. [13] SiliconAngle, Inferact $150M seed announcement, Jan 2026. siliconangle.com
  14. [14] Cerebrium, "Benchmarking vLLM, SGLang, TensorRT for Llama 3.1 API," 2025. cerebrium.ai
  15. [15] vLLM documentation, supported hardware backends. github.com/vllm-project/vllm
  16. [16] SiliconAngle, "Inferact Launches with $150M Funding," Jan 2026. siliconangle.com
  17. [17] LMSYS Blog, "SGLang v0.4 Release," Dec 2024. lmsys.org
  18. [18] SGLang Documentation, "Structured Outputs with XGrammar." docs.sglang.io
  19. [19] TechCrunch, "SGLang Spins Out as RadixArk," Jan 2026. techcrunch.com
  20. [20] NVIDIA TensorRT-LLM Release Notes. nvidia.github.io/TensorRT-LLM
  21. [21] llama.cpp GitHub, hardware support documentation. github.com/ggml-org/llama.cpp
  22. [22] Ollama GitHub repository. github.com/ollama/ollama
  23. [23] Hugging Face Blog, "TGI Multi-Backend," supporting NVIDIA, AMD, TPU, Neuron. huggingface.co
  24. [24] NVIDIA Dynamo GitHub repository. github.com/ai-dynamo/dynamo
  25. [25] Microsoft ONNX Runtime GitHub repository. github.com/microsoft/onnxruntime
  26. [26] MLC LLM GitHub, WebGPU and cross-platform support. github.com/mlc-ai/web-llm
  27. [27] ONNX Runtime, hardware acceleration providers. github.com/microsoft/onnxruntime
  28. [28] Exo GitHub repository, peer-to-peer distributed inference. github.com/exo-explore/exo
  29. [29] Clarifai, "Comparing SGLang, vLLM, and TensorRT-LLM with GPT-OSS-120B on 2x H100." clarifai.com
  30. [30] Cerebrium, "SGLang maintains stable 4-21ms per-token latency across concurrency levels." cerebrium.ai
  31. [31] LMSYS Blog, SGLang RadixAttention achieving 85-95% cache hit rates. lmsys.org
  32. [32] NVIDIA TensorRT-LLM Speculative Decoding Documentation, EAGLE-3 up to 3.6x. nvidia.github.io
  33. [33] vLLM Blog, "GPT-OSS Optimizations: DeepSeek MoE at 2.2K tok/s per H200 with Wide-EP," Feb 2026. blog.vllm.ai
  34. [34] PyTorch Blog, "PyTorch Foundation Expands, Welcomes vLLM and DeepSpeed," May 2025. pytorch.org
  35. [35] SiliconAngle, Inferact $150M seed at $800M valuation, a16z + Lightspeed. siliconangle.com
  36. [36] TechCrunch, RadixArk $400M valuation, Accel-led, Ying Sheng as CEO. techcrunch.com
  37. [37] vLLM original paper, PagedAttention for virtual memory KV cache management. github.com/vllm-project/vllm
  38. [38] vLLM Blog, "V1 Alpha Release: mixed prefill+decode, cross-node KV cache reuse," Jan 2025. blog.vllm.ai
  39. [39] LMSYS Blog, SGLang 3.1x higher throughput than vLLM on Llama-70B. lmsys.org
  40. [40] SGLang documentation, GB200 NVL72 performance: 3.8x prefill, 4.8x decode throughput. github.com/sgl-project/sglang
  41. [41] NVIDIA TensorRT-LLM Release Notes, B200 performance benchmarks. nvidia.github.io
  42. [42] NVIDIA Developer Blog, "Dynamo: 30x throughput on DeepSeek-R1 with Blackwell." developer.nvidia.com
  43. [43] Fortune, "After NVIDIA's Groq Deal, AI Chip Startups in Play," Jan 2026. fortune.com
  44. [44] NVIDIA Dynamo GitHub, multi-backend support (vLLM, TRT-LLM, SGLang). github.com/ai-dynamo/dynamo
  45. [45] Groq Newsroom, "Groq and NVIDIA Enter Non-Exclusive Inference Technology Licensing Agreement." groq.com; IntuitionLabs analysis. intuitionlabs.ai
  46. [46] Fireworks AI Blog, "FireAttention V4: FP4 on B200." fireworks.ai
  47. [47] Sacra, Fireworks AI company profile: $327M total funding, $4B valuation. sacra.com
  48. [48] WorkOS, "Fireworks AI: The PyTorch Team's Bet on Inference as the New Runtime." workos.com
  49. [49] Together AI, FlashAttention integration and Tri Dao as Chief Scientist. github.com/Dao-AILab/flash-attention
  50. [50] Crusoe, MemoryAlloy: cluster-wide distributed KV cache fabric for 9.9x faster TTFT. Internal product documentation.
  51. [51] Nebius, Token Factory launch with MLPerf-validated performance on Aether infrastructure.
  52. [52] fal.ai, proprietary inference engine for diffusion/generative media, $140M raise at $4.5B valuation.
  53. [53] California Management Review, "How Open-Source AI Will Challenge Closed-Model Giants," Jan 2026. cmr.berkeley.edu
  54. [54] SDxCentral, "AI Inferencing Will Define 2026 -- And the Market's Wide Open." sdxcentral.com
  55. [55] OpenRouter, "State of AI: 100T Token Study on Inference Patterns." openrouter.ai
  56. [56] VentureBeat, "NVIDIA, Groq, and the Race to Real-Time AI: Why Enterprises Win." venturebeat.com
  57. [57] PyTorch Blog, "FlashAttention-3: achieving 740 TFLOPS on H100 (75% utilization), 1.2 PFLOPS with FP8." pytorch.org; arXiv paper. arxiv.org
  58. [58] Modal, "Reverse Engineering FlashAttention-4: 5-stage pipeline, 20-22% over cuDNN, Blackwell-only." modal.com
  59. [59] vLLM original paper, PagedAttention for near-zero KV cache memory waste. github.com/vllm-project/vllm
  60. [60] LMSYS Blog, SGLang v0.4: RadixAttention achieves 85-95% cache hit rates on few-shot workloads. lmsys.org
  61. [61] DeepSeek FlashMLA GitHub, "Multi-Head Latent Attention: 57x KV cache reduction, 93.3% memory savings." github.com/deepseek-ai/FlashMLA
  62. [62] FlashInfer GitHub, Best Paper Award MLSys 2025; NVIDIA kernel release channel. github.com/flashinfer-ai/flashinfer
  63. [63] Dao-AILab FlashAttention GitHub, FA-4 forward-only limitation, BSD 3-Clause license. github.com/Dao-AILab/flash-attention
  64. [64] DeepSeek FlashMLA GitHub, "H800: 3000 GB/s memory-bound, 580-660 TFLOPS compute-bound; B200: 1460 TFLOPS forward." github.com/deepseek-ai/FlashMLA
  65. [65] ACL 2025, "Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer." aclanthology.org
  66. [66] NVIDIA Developer Blog, "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference." developer.nvidia.com; Fireworks AI FireAttention V4 with NVFP4. fireworks.ai
  67. [67] JarvisLabs, "vLLM Quantization Complete Guide: Marlin kernel achieves 10.9x speedup for AWQ (741 tok/s vs 68 tok/s)." docs.jarvislabs.ai
  68. [68] JarvisLabs, "Kernel optimization matters more than quantization method: Marlin provides 2.6-10.9x speedup." docs.jarvislabs.ai
  69. [69] LocalAIMaster, "Quantization Explained: AWQ vs GPTQ vs GGUF format comparison." localaimaster.com
  70. [70] E2E Networks, "Accelerating LLM Inference with EAGLE-3: 2-6x speedups." e2enetworks.com
  71. [71] NVIDIA Developer Blog, "Dynamo disaggregated serving: up to 30x more requests on DeepSeek-R1." developer.nvidia.com
  72. [72] SGLang Structured Outputs Documentation, "XGrammar: up to 10x faster than regex-based constrained generation." docs.sglang.io
  73. [73] NVIDIA, "An Introduction to Speculative Decoding for Reducing Latency in AI Inference." developer.nvidia.com
  74. [74] vLLM Blog, "Large-Scale Serving: 52.3K input tok/s + 22.3K output tok/s per node on 96 H100s." blog.vllm.ai
  75. [75] WebLLM GitHub, "Browser-based inference: Llama-3.1-8B at 41.1 tok/s, ~80% native speed." github.com/mlc-ai/web-llm
  76. [76] Edge AI Vision, "On-Device LLMs in 2026: ExecuTorch GA, 50KB footprint." edge-ai-vision.com; Vikas Chandra (Meta) on-device LLMs state of the union. v-chandra.github.io
  77. [77] arxiv, "Production-Grade Local LLM Inference on Apple Silicon," Nov 2025. arxiv.org
  78. [78] California Management Review, "The Coming Disruption: How Open-Source AI Will Challenge Closed-Model Giants," Jan 2026. cmr.berkeley.edu
  79. [79] OpenRouter, "State of AI: open models achieve ~90% of closed model performance; 80% of tokens still through closed APIs." openrouter.ai
  80. [80] SDxCentral, "AI Inferencing Will Define 2026 -- And the Market's Wide Open." sdxcentral.com
  81. [81] WorkOS, "Fireworks AI: The PyTorch Team's Bet on Inference as the New Runtime." workos.com