Landscape Report — Managed Inference

Managed Inference Platform Landscape: Top 5 Competitive Analysis

Fireworks AI • Together AI • Baseten • Nebius • Crusoe — Engines, Pricing, Scale & Strategic Positioning

Feb 2026 • MinjAI Agents • 75 Sources • 12 Sections
Internal — Strategic Intelligence
Section 01

Market Context: The Inference Inflection

$20.6B
Inference Spending 2026 (Gartner)
55%
Share of AI IaaS Going to Inference
$105B
Inference PaaS TAM by 2030
41.1%
CAGR (2025–2030)

2026 marks the inflection point where inference overtakes training as the dominant AI infrastructure workload. Gartner projects $37.5B in AI-optimized IaaS spending in 2026, with 55% ($20.6B) flowing to inference—up from $9.2B in 2025.1 Deloitte estimates inference will consume 67% of all AI compute by end of 2026, up from 50% in 2025.2

The broader AI inference platform-as-a-service market is projected to grow from $18.84B in 2025 to $105.22B by 2030 at a 41.1% CAGR.3 Three forces are accelerating this: agentic AI workflows multiplying token volume per task, reasoning models consuming 10–100x more tokens per query, and enterprise migration from proprietary APIs to open-weight models for cost and control.4

Capital Concentration in Inference

The investment thesis has shifted decisively toward inference, with capital concentrating sharply in H2 2025.

Token Pricing Deflation

Per-token costs are declining at roughly 10x per year at equivalent model quality. GPT-3-equivalent inference fell from $60/M tokens in 2021 to $0.06/M tokens in 2024—a 1,000x reduction in three years.7 This deflation rewards platforms with proprietary engine optimizations that can maintain margins while prices compress.
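The quoted figures are mutually consistent — a 1,000x decline over three years implies roughly 10x per year. A quick check:

```python
# Sanity-check the token-price deflation figures quoted above.
start_price = 60.0   # $/M tokens, GPT-3-class inference, 2021
end_price = 0.06     # $/M tokens, equivalent quality, three years later
years = 3

total_reduction = start_price / end_price       # 1000x
annual_factor = total_reduction ** (1 / years)  # ~10x per year

print(f"total reduction: {total_reduction:.0f}x")
print(f"implied annual deflation: {annual_factor:.1f}x per year")
```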

Market Structure

The inference PaaS market is consolidated at the top: hyperscalers (AWS, GCP, Azure) hold 66–75% share. But the independent managed inference layer—the five platforms analyzed here—is where the fastest innovation is happening and where enterprises are increasingly deploying production workloads for speed, cost, and model flexibility advantages.

Section 02

Executive Summary: The Top 5

This report analyzes the five leading independent managed inference platforms by funding, revenue scale, and technical differentiation. Each operates a proprietary or optimized inference engine, offers per-token API pricing, and targets enterprise production workloads.

Platform Valuation Revenue Engine Models
Fireworks AI $4.0B >$280M ann. FireAttention V4 50+
Together AI $3.3B ~$300M ann. FlashAttention-4 + Kernel Collection 200+
Baseten $5.0B 10x YoY growth Custom C++ + TensorRT-LLM BYOM + APIs
Nebius ~$25B mkt cap $530M FY2025 Token Factory (vLLM+) 60+
Crusoe $10B+ 5x bookings growth MemoryAlloy 8+
Key Finding

The managed inference market has bifurcated into two tiers: API-first platforms (Fireworks, Together) competing on model breadth, developer experience, and token pricing; and infrastructure-first platforms (Baseten, Nebius, Crusoe) competing on custom deployment, BYOM, and cost-per-compute-hour. The platforms that bridge both—offering production APIs AND dedicated infrastructure—will capture the most enterprise value.

Section 03

Landscape Snapshot: Side-by-Side Comparison

Dimension Fireworks AI Together AI Baseten Nebius Crusoe
Founded 2022 2022 2019 2024 (ex-Yandex) 2018
HQ Redwood City, CA San Francisco, CA San Francisco, CA Amsterdam, NL Denver, CO
Total Funding ~$327M ~$534M ~$585M Public (NBIS) ~$3.9B
Employees ~166 ~320 ~100–150 ~1,371 ~1,000+
Inference Engine FireAttention (custom CUDA) FlashAttention + Kernel Collection TensorRT-LLM + Custom C++ Token Factory (vLLM+) MemoryAlloy (KV-cache)
GPU Support H100, H200, B200, MI300X H100, H200, B200, GB200 H100, H200, B200 H100, H200, GB300 H100, H200, B200, GB200, AMD (SkyPilot)
Llama 3.3 70B $/M $0.90 / $0.90 $0.88 / $0.88 Dedicated only $0.13 / $0.40 $0.25 / $0.75
Key Customers Cursor, Uber, Samsung, Notion Salesforce, Zoom, DuckDuckGo Cursor, Writer, Notion Microsoft, Meta Cursor, Fireworks, Together AI
Compliance SOC2, HIPAA, GDPR SOC2 Type II SOC2, HIPAA ISO 27001, SOC2 SOC2, ISO 27001, ISO 42001
BYOM Yes (On-Demand) Yes (Dedicated) Yes (Truss SDK) Yes (Enterprise) Yes (Contact Sales)
Fine-Tuning LoRA, DPO, RFT LoRA, Full FT Blueprint + Training Enterprise only Roadmap
Section 04

Technology Engine Comparison

Each platform has built or adopted a distinct inference optimization strategy. The engine choice defines their cost structure, performance ceiling, and hardware flexibility.

Fireworks: FireAttention V1–V4

Custom CUDA kernels written from scratch for each GPU generation. V4 introduces FP4 (NVFP4) precision on NVIDIA B200 Blackwell GPUs with TensorCore Gen 5 instructions. Achieves 3.5x throughput improvement versus SGLang on H200 and >250 tok/s sustained on B200.8 Speculative decoding enabled Cursor to reach ~1,000 tok/s on Llama 70B.9 Uniquely supports both NVIDIA (H100/H200/B200) and AMD (MI300X) hardware.
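Several engines in this section lean on speculative decoding. As an illustration of the general idea only — a toy greedy variant, not Fireworks' (or any vendor's) implementation — a cheap draft model proposes k tokens, the large target model verifies them in one pass, and the longest agreeing prefix is accepted, so one target-model step can emit several tokens:

```python
# Toy greedy speculative decoding (illustrative, not a vendor implementation).
def speculative_step(target, draft, prefix, k=4):
    # Draft model proposes k tokens autoregressively (cheap).
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft(proposal))
    drafted = proposal[len(prefix):]

    # Target model verifies each drafted slot (in a real engine this is a
    # single batched forward pass, which is where the speedup comes from).
    accepted = []
    for tok in drafted:
        if target(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            # First disagreement: take the target's own token and stop.
            accepted.append(target(prefix + accepted))
            break
    else:
        # All k drafts accepted: target contributes one bonus token.
        accepted.append(target(prefix + accepted))
    return accepted

# Toy "models": the next token is a function of the last one.
target = lambda seq: (seq[-1] + 1) % 100
draft = lambda seq: (seq[-1] + 1) % 100 if seq[-1] != 5 else 42

print(speculative_step(target, draft, [10]))  # [11, 12, 13, 14, 15]
print(speculative_step(target, draft, [3]))   # [4, 5, 6] (draft diverged)
```

When the draft agrees, five tokens come out of one target step; when it diverges, progress falls back to roughly one token per step — which is why draft-model quality governs the realized speedup.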

Together: FlashAttention-4 + Kernel Collection

Tri Dao's FlashAttention is the industry-standard attention kernel, used by virtually every LLM provider. FlashAttention-4 on Blackwell achieves 1,605 TFLOPS (71% of theoretical maximum), 22% faster than NVIDIA's own cuDNN library.10 The Together Kernel Collection provides up to 10% faster training and 75% faster inference on top of FlashAttention.11

Baseten: Custom C++ Server + TensorRT-LLM

Replaced Triton Inference Server with a custom C++ server integrating TensorRT-LLM at the executor API level. Builds TRT-LLM from source and contributes patches upstream. Adds custom CUDA kernels for structured output (via Outlines) and speculative decoding (EAGLE-3, Medusa). Engine Builder automates TRT-LLM engine creation in minutes.12 Deep NVIDIA partnership ($150M investment) ensures early access to optimizations.

Nebius: Token Factory (Managed vLLM+)

Token Factory runs on optimized vLLM with proprietary extensions: speculative decoding, PagedAttention, and KV-cache reuse achieving 4x cost reductions. Nebius designs their own server chassis and operates Europe's first GB300 NVL72 deployment in Finland.13 At ~70% gross margin, Token Factory demonstrates that managed vLLM can be a high-margin business at scale.

Crusoe: MemoryAlloy (Distributed KV-Cache)

MemoryAlloy is a distributed key-value cache architecture that decouples KV storage from GPU compute. Achieves 9.9x improvement in time-to-first-token (TTFT) and 5x throughput versus standard vLLM.14 This architecture is particularly effective for long-context and multi-turn workloads where KV-cache reuse creates compound performance gains.
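A toy model of why persisting KV-cache across requests helps (illustrative only — MemoryAlloy's internal design is not public at this level of detail): if the K/V entries for a shared prefix survive between turns, a follow-up request only pays prefill compute for its new tokens.

```python
# Toy prefix KV-cache: prefill cost = number of prompt tokens whose
# K/V entries are not already cached. (Illustrative, not MemoryAlloy.)
class PrefixKVCache:
    def __init__(self):
        self.cached = {}  # prefix tuple -> True (stands in for K/V tensors)

    def prefill_cost(self, tokens):
        # Find the longest already-cached prefix of this prompt.
        reuse = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self.cached:
                reuse = i
                break
        # Only the suffix beyond the cached prefix needs prefill compute.
        cost = len(tokens) - reuse
        for i in range(reuse + 1, len(tokens) + 1):
            self.cached[tuple(tokens[:i])] = True
        return cost

cache = PrefixKVCache()
system = list(range(200))            # 200-token shared system prompt
turn1 = system + [9001, 9002]        # first user turn
turn2 = turn1 + [9003, 9004, 9005]   # follow-up turn

print(cache.prefill_cost(turn1))  # 202: cold cache, full prefill
print(cache.prefill_cost(turn2))  # 3: only the new tokens
```

Because TTFT is dominated by prefill, skipping the cached prefix is exactly where a large time-to-first-token gain would show up in multi-turn workloads.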

Unified Performance Benchmarks

Direct head-to-head comparisons are limited by vendor-specific test conditions. The table below normalizes available benchmarks to the closest comparable workloads.

Metric Fireworks Together Baseten Nebius Crusoe
TTFT (70B-class) 0.30–0.40s ~0.25s (MinjAI est.) 0.13s (Mistral 7B)53 ~0.35s (MinjAI est.) 9.9x faster vs vLLM
Output Throughput >250 tok/s (B200) ~175 tok/s (H200) 650+ tok/s (GPT-OSS 120B)27 4x cost-perf via KV reuse 5x vs vLLM baseline
Peak Customer Deploy ~1,000 tok/s (Cursor) Claims 30%+ faster than Fireworks 78% lower latency (OpenEvidence) N/A (enterprise SLA) N/A (GA Nov 2025)
Speculative Decoding Yes (production) Yes (Medusa) Yes (EAGLE-3, Medusa) Yes (vLLM native) Roadmap
Multi-Hardware NVIDIA + AMD NVIDIA only NVIDIA only NVIDIA only NVIDIA + AMD (SkyPilot)64
Blackwell (B200/GB200) FP4 via FireAttention V4 FlashAttention-4 native 225% cost-perf gain28 First EU GB300 NVL72 B200 supported
Benchmark Methodology Note

These benchmarks are sourced from vendor claims, Artificial Analysis rankings, and customer case studies. No independent third party has tested all five platforms under identical conditions. Baseten's TTFT benchmark is on Mistral 7B (not 70B); Crusoe's metrics are relative improvements vs. vLLM baseline. Treat as directional, not absolute.

Inference Platform Technology Stack

Layer 4: Application & API
API Gateway & Rate Limiting
Model Catalog / Marketplace
BYOM Portal
Fine-Tuning Pipeline
Observability & Logging
Layer 3: Inference Engine (Differentiator)
Fireworks: FireAttention V4
Together: FlashAttention-4
Baseten: TensorRT-LLM + C++
Nebius: Token Factory
Crusoe: MemoryAlloy
Layer 2: Compute Infrastructure
H100 / H200 / B200 / GB200 GPUs
NVLink / InfiniBand Fabric
KV-Cache Management
Auto-Scaling & Load Balancing
Multi-Region Routing
Layer 1: Physical Infrastructure
Owned DCs: Nebius (Finland, NJ, KS), Crusoe (TX, WY), Together (MD, TN, Sweden)
Colocated: Fireworks (multi-cloud), Baseten (AWS SCA)48
Energy: Crusoe ~$0.03/kWh (stranded gas + renewables), Nebius (Nordic hydro), Others (grid power)
Engine Differentiation Map

Custom CUDA kernels (Fireworks, Baseten) → Maximum per-GPU performance, hardware-specific optimization
Research-grade kernels (Together) → Deepest attention-layer optimization, cross-platform portability
vLLM-based + extensions (Nebius) → Ecosystem compatibility, proven at scale, lower R&D cost
Architecture innovation (Crusoe) → System-level optimization, unique multi-turn advantage

Section 05

Fireworks AI: The Production Inference Leader

$4.0B
Valuation (Oct 2025)
>$280M
Annualized Revenue
10K+
Customer Companies
10T+
Tokens/Day Processed

Fireworks AI is the highest-revenue independent managed inference platform. Founded by ex-Meta PyTorch engineers (CEO Lin Qiao led 300+ engineers building PyTorch), the company raised $250M in Series C at $4B valuation in October 2025.15 Revenue grew roughly 20x year-over-year from ~$6.5M ARR (May 2024) to >$280M annualized (Oct 2025).16

Leadership

Name | Role | Background
Lin Qiao | CEO & Co-Founder | Head of PyTorch at Meta, Sr. Director Engineering (300+ eng); PhD UCSB
Dmytro Dzhulgakov | CTO & Co-Founder | PyTorch core maintainer at Meta
Chenyu Zhao | Co-Founder | Google Vertex AI lead

Product Suite

Performance Benchmarks

Metric | Value | Context
TTFT | 0.30–0.40s | Across models, faster than Groq (0.45s)
gpt-oss-120b | 960 tok/s | Artificial Analysis benchmark17
B200 Peak | >250 tok/s | FireAttention V4 with FP4
Cursor Deploy | ~1,000 tok/s | Speculative decoding on Llama 70B

Funding History

Round | Date | Amount | Lead Investors
Seed | 2022 | Undisclosed | Benchmark
Series A | Mar 2024 | $25M | Benchmark
Series B | Jul 2024 | $52M at $552M | Sequoia, NVIDIA
Series C | Oct 2025 | $250M at $4B | Lightspeed, Index, Evantic18

Customer Use Case Metrics

Customer | Use Case | Verified Outcome
Cursor | AI code completion (Llama 70B) | ~1,000 tok/s via speculative decoding; powers Tab autocomplete for millions of developers9
Notion | AI writing assistant | 4x latency reduction vs. previous provider; sub-second response times54
Uber | Compound AI for ride operations | Production-scale multi-model orchestration via FireFunction; specific metrics undisclosed
Samsung | On-device + cloud AI features | Galaxy AI integration via Fireworks serverless API; specific metrics undisclosed
Cresta | Contact center AI | ~100x cost savings vs. proprietary API providers55
Deep Dive: FireAttention Architecture Evolution & Competitive Moat

V1 (2023): Initial custom CUDA kernels for H100, replacing standard vLLM serving. Achieved ~2x throughput improvement over stock PyTorch inference.

V2 (2024): Added continuous batching, speculative decoding, and H200 support. Multi-tenant GPU sharing enabled the serverless pricing model.

V3 (2024): AMD MI300X support added—making Fireworks the only platform in this group to run on non-NVIDIA hardware. PagedAttention optimization and prefix caching.

V4 (2025): FP4 (NVFP4) precision on B200 Blackwell with TensorCore Gen 5. 3.5x throughput gain over SGLang on H200. This generation targets the AI agent/creation market where sustained high throughput matters more than single-request latency.

Moat assessment: Fireworks' moat is engineering velocity: 4 major engine versions in 3 years, each generation-specific. The risk is that NVIDIA's own TensorRT-LLM narrows the gap with each release. The AMD support is a strategic hedge—if MI300X/MI400 gain traction, Fireworks is the only independent platform ready.

Key risk: Lin Qiao's PyTorch team culture means Fireworks optimizes at the kernel level, not the system level. MemoryAlloy (Crusoe) and Token Factory (Nebius) attack efficiency from the architecture layer—a different competitive angle that kernel optimization alone can't match.

Competitive Threat Assessment: Very High

Fireworks has the strongest combination of scale (10T tokens/day), revenue ($280M+), and customer logos (Cursor, Uber, Samsung, Notion). Their PyTorch founding team has deep inference optimization expertise. Multi-hardware support (NVIDIA + AMD) is unique. Primary weakness: not the cheapest on per-token pricing; competes on speed and reliability.

Section 06

Together AI: The Research-Driven Inference Cloud

$3.3B
Valuation
~$300M
Annualized Revenue (Sep 2025)
600K+
Developers
200+
Models Available

Together AI combines academic research credibility with production-scale infrastructure. Chief Scientist Tri Dao created FlashAttention, the industry-standard attention kernel used by virtually every LLM provider globally. The company raised $305M in February 2025 and has $534M in total funding.19

Leadership

Name | Role | Background
Vipul Ved Prakash | CEO & Co-Founder | Founder of Topsy (acquired by Apple), serial entrepreneur
Tri Dao | Chief Scientist | Creator of FlashAttention 1–4; Stanford/Princeton PhD
Ce Zhang | Co-Founder & President | ETH Zurich professor, data systems researcher

Product Suite

Business Model

Revenue splits approximately 30–40% API inference and 60–70% GPU cluster rentals. Gross margins are ~45%, with infrastructure ownership (data centers in Maryland, Memphis, Sweden) expected to improve unit economics.22 Together claims 80% lower costs than hyperscalers on equivalent workloads.

Pricing Highlights

Model | Input $/M | Output $/M
Llama 3.1 8B | $0.18 | $0.18
Llama 3.3 70B | $0.88 | $0.88
DeepSeek-R1 | $3.00 | $7.00
Llama 4 Maverick (400B MoE) | $0.27 | $0.27

Funding History

Round | Date | Amount | Lead Investors
Seed | May 2023 | $20M | Lux Capital
Series A | Nov 2023 | $102.5M | Kleiner Perkins56
Series B | Mar 2024 | $106M | Salesforce Ventures57
Series C | Feb 2025 | $305M at $3.3B | Prosperity7, Coatue, a16z19

European Expansion

Partnered with Hypertec/5C for up to 100,000 GPUs in European data centers ($5B total investment). Positions Together for EU data residency requirements and sovereign AI demand.23

Customer Use Case Metrics

Customer | Use Case | Verified Outcome
Salesforce | Enterprise AI features (Agentforce) | Strategic investor ($106M Series B lead); Together powers inference workloads
Zoom | AI Companion features | Meeting summarization, real-time AI assistance at scale
DuckDuckGo | AI-powered search answers | Privacy-first inference via Together API; open-weight models for data control
Pika Labs | AI video generation | GPU clusters for video model training and inference at scale
Meta | Llama launch partner | Day-one availability of Llama 4 Maverick/Scout; co-marketing partnership20
Deep Dive: Research-to-Product Pipeline & Open-Source Influence

FlashAttention's industry impact: Tri Dao's FlashAttention is used by virtually every LLM provider—including Fireworks, Baseten, and Nebius from this report. This gives Together unparalleled visibility into attention kernel optimization requirements across the industry.

The Together Kernel Collection goes beyond FlashAttention: it includes optimized kernels for MLP layers, normalization, and embedding operations. Together claims 10% faster training and 75% faster inference vs. stock implementations. This collection is proprietary (unlike FlashAttention itself).

Acquisition strategy: The Refuel.ai acquisition (May 2025) added data quality/structuring capabilities, enabling a train→evaluate→infer loop. This is Together's answer to Baseten's Parsed acquisition—both racing to own the full model lifecycle.

Revenue composition risk: ~60–70% of revenue comes from GPU cluster rentals, not inference API. This means Together's inference margins are less proven at scale than Fireworks'. The shift to owned infrastructure (Maryland, Memphis, Sweden data centers) should improve unit economics but requires massive capex.

Open-source ecosystem leverage: Together's open models (RedPajama, OpenChatKit) and research papers (FlashAttention 1–4, Monarch Mixer) create developer mindshare that converts to paying API customers. This research-to-revenue flywheel is unique in this landscape.

Competitive Threat Assessment: Very High

Together's research moat (FlashAttention is literally the kernel everyone uses) gives them unique credibility. 200+ models is the broadest catalog among independents. European expansion addresses sovereignty demand. Primary weakness: training-heavy revenue mix means inference margins are still maturing. Aggressive pricing compresses margins.

Section 07

Baseten: The NVIDIA-Aligned Full-Lifecycle Platform

$5.0B
Valuation (Jan 2026)
$585M
Total Raised
10x
Revenue Growth (2025 YoY)
100x
Inference Volume Growth (2025)

Baseten is the highest-valued independent inference platform at $5B, driven by NVIDIA's $150M strategic investment as part of the $300M Series E (Jan 2026).24 Founded in 2019 by ex-Gumroad and ex-Clover Health engineers, Baseten pivoted from ML app building to production inference infrastructure and has seen explosive growth: 100x inference volume increase in 2025.

Leadership

Name | Role | Background
Tuhin Srivastava | CEO & Co-Founder | Ex-Gumroad (data scientist/fraud ML), ex-Macquarie (IB); USC
Amir Haghighat | CTO & Co-Founder | Ex-Clover Health (ML engineering), ex-Yelp; MS CS UC Irvine

Product Suite

Performance Benchmarks

Workload | Metric | Result
GPT-OSS 120B | Throughput | 650+ tok/s (Artificial Analysis #1 on OpenRouter)27
Mistral 7B | TTFT | 130ms
Mistral 7B | Throughput | 170 tok/s
Embeddings (B200) | vs. vLLM | 3.3x higher throughput
B200 Blackwell | Cost-performance | 225% improvement (validated by Google Cloud)28

Funding History

Round | Date | Amount | Lead
Series C | Feb 2025 | $75M at $825M | Spark Capital
Series D | Sep 2025 | $150M at $2.15B | BOND
Series E | Jan 2026 | $300M at $5B | IVP, CapitalG, NVIDIA ($150M)29

Customer Use Case Metrics

Customer | Use Case | Verified Outcome
Cursor | AI code editor inference | Primary inference provider alongside Fireworks; production-scale code completion
Writer | Enterprise AI writing platform | Custom Palmyra model deployed via Truss; dedicated GPU deployment58
Zed | AI-powered code editor | 45% lower latency vs. previous provider with dedicated B200 deployment
OpenEvidence | Medical AI platform | 78% lower latency, enabling real-time clinical decision support59
Patreon | Creator platform AI features | ~$600K/year savings vs. proprietary APIs; migrated to open-weight models on Baseten
Deep Dive: NVIDIA Partnership Depth & Product Evolution

NVIDIA's $150M bet: This is NVIDIA's largest known investment in a managed inference startup. The strategic rationale: Baseten validates TensorRT-LLM as the enterprise inference standard. Every Baseten deployment runs NVIDIA's software stack, creating lock-in at the engine layer.

Product pivot history: Baseten started in 2019 as an ML app builder (think: Streamlit for ML). The pivot to inference infrastructure happened in 2023 when they realized the bottleneck for ML deployment wasn't the app layer but the serving layer. This pivot explains why their developer experience (Truss SDK, Chains) is best-in-class—they came from a developer tools background.
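The Truss workflow packages a model as a directory containing a config and a `model/model.py` that exposes a `Model` class. Below is a minimal sketch following Truss's documented interface (`load()` runs once at container start, `predict()` per request); the upper-casing "model" is a stand-in so the sketch stays self-contained — a real deployment would load weights inside `load()`.

```python
# Minimal sketch of a Truss-style model/model.py (illustrative stub).
class Model:
    def __init__(self, **kwargs):
        # Truss passes deployment config/secrets via kwargs.
        self._config = kwargs.get("config", {})
        self._model = None

    def load(self):
        # Called once at startup; load weights here so cold-start cost
        # is paid before traffic arrives. Stub stands in for real weights.
        self._model = lambda text: text.upper()

    def predict(self, model_input: dict) -> dict:
        # Called per request with the deserialized JSON body.
        prompt = model_input["prompt"]
        return {"completion": self._model(prompt)}

# Local smoke test (in production, Truss's server drives this lifecycle):
m = Model()
m.load()
print(m.predict({"prompt": "hello"}))  # {'completion': 'HELLO'}
```

The split between `load()` and `predict()` is what lets the platform scale replicas and route traffic without re-paying model-load latency on every request.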

The Parsed acquisition (Dec 2025) is strategically important: it adds RL (reinforcement learning) and evaluation capabilities. Combined with Baseten Training (closed beta), this gives Baseten the only complete train→evaluate→deploy→improve loop among the five platforms.

AWS Strategic Collaboration Agreement (Dec 2025): Baseten is available on AWS Marketplace with Savings Plans support.48 This is unusual—most inference startups compete against AWS, not partner with them. It signals AWS sees Baseten as complementary (custom model serving) rather than competitive (they don't replicate SageMaker).

Key risk: Baseten's dedicated GPU model means they don't benefit from multi-tenant efficiency the way serverless platforms (Fireworks, Together) do. At small scale, customers pay for idle GPU time. This makes Baseten most compelling for customers with consistent, high-volume workloads.

Competitive Threat Assessment: High

Baseten's $150M NVIDIA investment creates a deep technical moat around TensorRT-LLM optimization. Three funding rounds in 12 months ($75M → $150M → $300M) shows exceptional velocity. The Parsed acquisition gives them the only end-to-end inference + training + RL pipeline among the five. Primary weakness: revenue scale likely smaller than Fireworks/Together; enterprise customer base still growing.

Section 08

Nebius: The Scale-First European Neocloud

~$25B
Market Cap (NASDAQ: NBIS)
$530M
FY2025 Revenue (+479% YoY)
$1.25B
Year-End ARR (2025)
~70%
Gross Margin

Nebius is the only publicly traded company in this comparison and the largest by market capitalization. Spun out of Yandex's cloud infrastructure business, it is led by Arkady Volozh (ex-Yandex CEO). Revenue grew 479% YoY to $530M in FY2025, with Q4 alone at $228M (+547% YoY).30

Leadership

Name | Role | Background
Arkady Volozh | CEO | Founded Yandex (Russia's Google); built $25B+ enterprise
Andrey Korolenko | Chief Product & Infrastructure Officer | 28-year Yandex/Nebius veteran (since 1998); leads data center buildouts & capacity planning74
Roman Chernin | Chief Business Officer & Co-Founder | 12 years heading Yandex digital services (Search, Maps); spearheading AI cloud business since 2023
Ophir Nave | COO & Executive Director | M&A lawyer; ex-Arnon Tadmor-Levy, ex-Wachtell Lipton; JSD Harvard Law75

Financial Scale

Metric | FY2025 | 2026 Guidance
Revenue | $529.8M | $3.0–3.4B31
ARR | $1.25B | $7–9B
EBITDA Margin | Improving | ~40% target
Cash | $3.7B | —

Strategic Partnerships

Infrastructure

Data Center | Capacity | Status
Finland (Mäntsälä) | 60,000 GPUs, 75 MW | Operational + expanding
New Jersey | — | Operational (live)
Kansas City | 35,000 GPUs, 40 MW | Coming online
Iceland | Planned | Under development

Token Factory Pricing

Model | Input $/M | Output $/M
Llama 3.1 8B | $0.02 | $0.06
Llama 3.3 70B | $0.13 | $0.40
DeepSeek-V3 | $0.50 | $1.50
DeepSeek-R1 | $0.80 | $2.40

Batch inference at 50% of base pricing.33

Capital Structure & Funding

Event | Date | Amount / Detail
Yandex Restructuring | Jul 2024 | Spun out of Yandex NV; listed on NASDAQ as NBIS
NVIDIA Investment | Dec 2024 | $350M from NVIDIA & Accel; earmarked for GPU procurement60
Secondary Offering | Feb 2025 | $700M raised; shares priced at $43
Cash Position | End FY2025 | $3.7B total cash & equivalents
Microsoft Deal | 2025 | $17.4B (up to $19.4B) five-year infrastructure agreement32
Meta Deal | 2025 | ~$3B infrastructure partnership

Customer Use Case Metrics

Customer | Use Case | Verified Outcome
Microsoft | AI infrastructure capacity | $17.4B five-year deal; largest known Nebius engagement32
Meta | GPU cluster capacity | ~$3B deal for training and inference infrastructure
Tavily | AI search & retrieval (acquired) | Acquired to add agentic AI search capabilities to Nebius platform
Enterprise customers | Token Factory API | Demand exceeded capacity in Q4 2025; the sell-out drove 547% YoY Q4 growth
Deep Dive: Yandex Heritage, European Positioning & Scale Economics

The Yandex advantage: Nebius inherited Yandex's 25+ years of large-scale infrastructure operations. Yandex was Russia's Google—search, cloud, self-driving cars, e-commerce. This means Nebius entered the AI infrastructure market with mature operational playbooks that startups lack: data center design, GPU procurement at scale, and network engineering.

European sovereign play: Nebius is headquartered in Amsterdam and operates Europe's largest GPU cluster in Finland (60K GPUs). The EU AI Act and GDPR create demand for European-domiciled inference. Nebius is the only platform in this group with production infrastructure in the EU, giving them a first-mover advantage in the $80B sovereign cloud market.39

Unit economics at scale: ~70% gross margin on $530M revenue ($371M gross profit) is remarkable for infrastructure. The economics work because Nebius owns their data centers, procures GPUs at hyperscaler volume, and runs Token Factory at high utilization. Guidance of 40% EBITDA margin on $3.0–3.4B 2026 revenue implies ~$1.2–1.4B EBITDA potential.

Capacity as the constraint: Nebius was sold out in Q4 2025. The Kansas City DC (35K GPUs, 40 MW) coming online in H1 2026 and Iceland expansion should relieve this, but demand from Microsoft/Meta absorbs most new capacity. Token Factory for external customers competes with hyperscaler contracts for GPU allocation.

Risk factors: Concentration risk (Microsoft = majority of revenue), geopolitical perception (Yandex heritage), and the Arkady Volozh single-founder dependency. EU sanctions compliance adds operational complexity.

Competitive Threat Assessment: Very High

Nebius operates at a fundamentally different scale: $3.7B cash, $17.4B Microsoft deal, publicly traded, ~70% gross margins. Token Factory pricing is the most aggressive in this group (Llama 70B at $0.13/$0.40). Their European infrastructure positions them for the $80B sovereign cloud opportunity. Primary weakness: capacity-constrained (sold out in Q4 2025), limited model catalog (60+ vs 200+ at Together).

Section 09

Crusoe: The Energy-Advantaged Inference Platform

$10B+
Valuation
$3.9B
Total Raised
9.9x
TTFT Improvement (MemoryAlloy)
$0.03
kWh Energy Cost

Crusoe is the most heavily capitalized company in this group ($3.9B total raised) and uniquely positioned as both a GPU cloud and a managed inference platform. Managed Inference reached general availability in November 2025, powered by the proprietary MemoryAlloy engine.34 Crusoe's structural energy cost advantage (~$0.03/kWh) underpins its long-term margin thesis.

Leadership

Name | Role | Background
Chase Lochmiller | CEO & Co-Founder | Stanford CS; former quant trader
Erwan Menard | SVP Product | Ex-Google Cloud AI (Vertex AI Director of PM); CEO of Elastifile (acquired by Google)35
Eesha Pathak | Sr. Director PM | Ex-Google Cloud AI (Head of Product, Enterprise AI & International Expansion); 15+ years36
Aditya Shanker | GPM, Inference | Inference product lead
Omar Lari | Sr. Director PM, IaaS | Infrastructure product lead

Product Suite

Compliance & Certifications

SOC2 + ISO 27001 + ISO 42001 (Feb 2026). Crusoe achieved ISO 27001 (information security management) and ISO 42001 (AI governance) certifications, significantly closing the compliance gap with Fireworks and Baseten.68 ISO 42001 is notable—it's the first AI-specific governance standard, and Crusoe is the only platform in this group to hold it.

Pricing

Model | Input $/M | Output $/M
Llama 3.3 70B | $0.25 | $0.75
DeepSeek R1 | $1.35 | $5.40
Qwen3 235B | $0.22 | $0.80
Kimi-K2 | $0.60 | $2.50

Funding History

Round | Date | Amount | Key Investors / Notes
Series A | Apr 2022 | $128M | Valor Equity Partners
Series B | Sep 2022 | $350M | G2 Venture Partners
Series C | Aug 2024 | $600M at ~$3B | Fidelity, NEA, Founders Fund69
Debt Facility | 2024 | $225M | Infrastructure financing
Series D+E | 2025 | Undisclosed | Valuation reported at $10B+52
Total Raised | — | ~$3.9B | Includes equity + debt

Performance Benchmarks

Metric | Value | Source / Context
TTFT (MemoryAlloy) | 9.9x faster vs vLLM | Internal benchmark, Nov 202514
Throughput | 5x vs vLLM baseline | MemoryAlloy cluster-scale test
Llama 3.1 Fine-Tuning (GB200) | 3x faster vs H100 | GB200 NVL72 benchmark, Feb 202670
InferenceMAX | Benchmark co-creator | Partnership with SemiAnalysis, Oct 202571

Customer Use Case Metrics

Customer | Use Case | Verified Outcome
Cursor | AI code editor infrastructure | Multi-provider strategy; Crusoe as GPU infrastructure layer (shared with Fireworks/Baseten)37
Together AI | GPU cloud customer | Runs training & inference workloads on Crusoe H100/H200 clusters (metrics undisclosed)
Fireworks AI | GPU cloud customer | Uses Crusoe infrastructure for compute capacity scaling (metrics undisclosed)
Odyssey | General-purpose world models | Pioneering world model training on Crusoe's scalable GPU cloud; featured case study Jan 202672
Decart (MirageLSD) | Real-time AI video generation | MirageLSD model deployed on Crusoe Cloud; real-time video synthesis73
Sony, Databricks, MIT | Enterprise AI / research | GPU cloud customers (specific metrics undisclosed)

Energy Infrastructure

Crusoe's foundational advantage is structural energy cost. The company originally ran on stranded natural gas and is now transitioning to renewable sources. At ~$0.03/kWh, Crusoe operates at roughly 50–60% lower energy cost than hyperscaler data centers, creating a durable margin advantage that compounds as inference workloads scale.

Deep Dive: Energy Economics, MemoryAlloy Architecture & Recent Product Velocity

The energy moat quantified: At $0.03/kWh vs. ~$0.06–0.08/kWh for hyperscalers, Crusoe saves ~$0.03–0.05/kWh. A single H100 draws ~0.7 kW; running 24/7, that is ~6,132 kWh/year, or ~$184–307/year per GPU in energy savings. At 10,000 GPUs: $1.8–3.1M/year in structural cost advantage. At 100,000 GPUs: $18–31M/year. This advantage scales linearly and compounds as GPU power draw increases with each generation (a B200 draws ~1 kW; GB200 even higher).
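The per-GPU arithmetic above is easy to reproduce:

```python
# Reproduce the per-GPU energy-savings arithmetic from the paragraph above.
gpu_draw_kw = 0.7                # approximate H100 draw under load
hours_per_year = 24 * 365        # 8,760 hours
kwh_per_year = gpu_draw_kw * hours_per_year  # ~6,132 kWh/year

for delta in (0.03, 0.05):       # $/kWh advantage vs. hyperscalers
    per_gpu = kwh_per_year * delta
    print(f"${delta:.2f}/kWh edge -> ${per_gpu:,.0f}/GPU/yr, "
          f"${per_gpu * 10_000 / 1e6:.1f}M/yr at 10k GPUs")
```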

MemoryAlloy architecture: Unlike other engines that optimize per-GPU efficiency, MemoryAlloy operates at the system level by decoupling KV-cache storage from GPU compute. In multi-turn conversations or long-context workloads, KV-cache data is persisted across requests, eliminating redundant prefill computation. This is why the 9.9x TTFT improvement is on time-to-first-token specifically—it's the prefill step that benefits most from cache reuse.

Product velocity (Nov 2025 – Feb 2026): Crusoe shipped an extraordinary amount in 90 days: Managed Inference GA (Nov 20), MemoryAlloy engine paper (Nov 20), Run:ai certification (Nov 17), BYOM formal launch (Feb 6), Command Center (Feb 18), AutoClusters (Feb 3), MCP Server (Feb 11), GB200 NVL72 fine-tuning benchmarks (Feb 6), AMD GPU support (Jan 13), and ISO 27001+42001 (Feb 13). This cadence suggests a well-staffed product org executing at startup speed despite 1,000+ employees.

Compliance leapfrog: The ISO 42001 certification is strategic. It's the world's first AI governance standard (ISO/IEC 42001:2023). No other platform in this group holds it. For enterprises evaluating AI risk governance, this is a differentiator—particularly in regulated industries and government contracts where AI-specific compliance frameworks are emerging requirements.

Platform customer dynamics: Crusoe's most interesting competitive dynamic is that two of its biggest competitors (Fireworks and Together) are also customers of its GPU cloud. This creates an unusual relationship: Crusoe provides the infrastructure that powers competing managed inference APIs. Erwan Menard's Feb 2026 blog framing ("Building the world's favorite AI cloud") suggests Crusoe sees this as a feature, not a conflict—the IaaS revenue from competitors funds managed inference R&D.

Go-to-market evolution: With only 8 models in the catalog vs. 200+ at Together, Crusoe is leaning into BYOM + Command Center as the enterprise play. The combination of "bring your fine-tuned model + run it on MemoryAlloy + monitor via Command Center" creates an end-to-end value proposition for enterprises that want performance without managing infrastructure. The InferenceMAX benchmark partnership with SemiAnalysis also positions Crusoe as a thought leader on inference performance measurement.

Leadership signal: Hiring Erwan Menard (ex-Vertex AI Director of PM) and Eesha Pathak (ex-Google Cloud AI, Head of Product) signals Crusoe is serious about building a Google Cloud-caliber product organization. The shipping velocity since their arrival validates this thesis.

Competitive Position Assessment (Updated Feb 2026)

Crusoe is uniquely positioned as the only platform in this group that owns its energy infrastructure AND holds ISO 42001 (AI governance) certification. The product velocity since Nov 2025 has been exceptional: 10+ major launches in 90 days. ISO 27001+42001 closes the compliance gap significantly. Command Center + MCP Server address the developer experience gap. GB200 NVL72 and AMD GPU support via SkyPilot expand hardware flexibility. The remaining gaps: model catalog depth (8 vs. 200+ at Together) and proven production scale at token volume comparable to Fireworks' 10T tokens/day.

Section 10

Pricing Benchmarks: Token-Level Comparison

Pricing is the most visible competitive dimension in managed inference. The table below normalizes per-token costs across the five platforms for comparable models.

Llama 3.3 70B ($/M Tokens)

Platform Input Output Blended (1:1) vs. Cheapest
Nebius $0.13 $0.40 $0.265 Cheapest
Crusoe $0.25 $0.75 $0.50 +89%
Together AI $0.88 $0.88 $0.88 +232%
Fireworks AI $0.90 $0.90 $0.90 +240%
Baseten Dedicated GPU deployments only (not per-token)
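
The blended figures above follow directly from the list prices. A minimal sketch of the arithmetic, using the Llama 3.3 70B prices from the table (a Feb 2026 snapshot, subject to change):

```python
# Blended $/M-token price at a 1:1 input:output ratio, plus premium vs. the
# cheapest provider. Prices are the Llama 3.3 70B list prices quoted above.
PRICES = {
    "Nebius": (0.13, 0.40),
    "Crusoe": (0.25, 0.75),
    "Together AI": (0.88, 0.88),
    "Fireworks AI": (0.90, 0.90),
}

def blended(inp: float, out: float, ratio: float = 1.0) -> float:
    """Weighted per-token price assuming `ratio` input tokens per output token."""
    return (ratio * inp + out) / (ratio + 1)

floor = min(blended(i, o) for i, o in PRICES.values())
for name, (i, o) in sorted(PRICES.items(), key=lambda kv: blended(*kv[1])):
    b = blended(i, o)
    print(f"{name:13s} ${b:.3f}/M  (+{(b / floor - 1) * 100:.0f}% vs. cheapest)")
```

Changing `ratio` matters for real workloads: agentic and reasoning traffic skews heavily toward output tokens, which widens the gap for providers with high output prices.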

DeepSeek-R1 ($/M Tokens)

Platform Input Output Blended (1:1)
Nebius $0.80 $2.40 $1.60
Crusoe $1.35 $5.40 $3.38
Together AI $3.00 $7.00 $5.00
Fireworks AI ~$8.00 ~$8.00 ~$8.00

GPU Hourly Rates (Where Available)

GPU Fireworks Baseten (per-min) Crusoe
H100 80GB $4.00/hr $6.48/hr ($0.108/min) $3.90/hr
B200 180GB $9.00/hr $9.96/hr ($0.166/min) TBD

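
Baseten bills dedicated deployments per active minute, so a quick conversion makes its rates comparable to the flat $/hr providers. A sketch using the list rates above:

```python
# Convert Baseten's per-minute dedicated GPU rates to effective hourly rates
# for side-by-side comparison with flat $/hr providers.
PER_MIN = {"H100 80GB": 0.108, "B200 180GB": 0.166}  # $/GPU-minute (list)

for gpu, rate in PER_MIN.items():
    hourly = rate * 60
    print(f"{gpu}: ${rate:.3f}/min -> ${hourly:.2f}/hr")
```

The per-minute model favors bursty workloads that scale to zero; at sustained 24/7 utilization the flat-rate providers come out cheaper on paper.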
Pricing Intelligence

Nebius is the price leader on per-token models, leveraging scale (30K+ GPUs) and ~70% gross margins. Crusoe is positioned mid-market: 89% above Nebius but roughly 44% below Fireworks and Together on Llama 3.3 70B. Fireworks and Together compete on speed and reliability, not price. Baseten avoids per-token comparison entirely by focusing on dedicated deployments, where customers control cost per GPU-hour. With token pricing deflating roughly 10x per year, today's prices should be read as a ceiling, not a floor.
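
The ~10x/year deflation rule compounds quickly. A back-of-envelope sketch (the one-year projection is illustrative only, not a forecast from the report):

```python
# At ~10x/year deflation, a price compresses by a factor of rate**n over n years.
def deflated(price: float, years: float, rate: float = 10.0) -> float:
    """Projected $/M-token price after `years` at `rate`x annual deflation."""
    return price / rate ** years

# GPT-3-class inference: $60/M compressed 1,000x over three years
print(f"${deflated(60.0, 3):.2f}/M")
# Illustrative: today's $0.90/M Llama 70B price would imply ~$0.09/M in a
# year at equivalent model quality, if the trend holds.
print(f"${deflated(0.90, 1):.2f}/M")
```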

Section 11

Head-to-Head Competitive Matrix

This matrix rates each platform across eight dimensions critical to enterprise managed inference buyers. Ratings are relative within this five-platform set (5 = best-in-class, 1 = weakest).

Dimension Fireworks Together Baseten Nebius Crusoe
Engine Performance 5 4 4 3 4
Model Catalog 4 5 3 3 2
Per-Token Pricing 2 3 N/A 5 4
Enterprise Compliance 5 3 4 4 4
Developer Experience 4 4 5 3 3
BYOM / Customization 3 4 5 3 3
Infrastructure Scale 3 4 3 5 4
Cost Structure Moat 2 3 3 4 5
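
One hedged way to read the matrix is a weighted composite score. The weights below are illustrative placeholders, not from the report (a real buyer would set them from their own priorities), and only a subset of the eight dimensions is shown for brevity; "N/A" cells are excluded and the remaining weights renormalized:

```python
# Ratings copied from a subset of the matrix above; None marks an N/A cell.
RATINGS = {
    "engine":     {"Fireworks": 5, "Together": 4, "Baseten": 4, "Nebius": 3, "Crusoe": 4},
    "catalog":    {"Fireworks": 4, "Together": 5, "Baseten": 3, "Nebius": 3, "Crusoe": 2},
    "pricing":    {"Fireworks": 2, "Together": 3, "Baseten": None, "Nebius": 5, "Crusoe": 4},
    "compliance": {"Fireworks": 5, "Together": 3, "Baseten": 4, "Nebius": 4, "Crusoe": 4},
}
WEIGHTS = {"engine": 0.3, "catalog": 0.2, "pricing": 0.3, "compliance": 0.2}  # hypothetical

def composite(platform: str) -> float:
    """Weighted mean rating, renormalized over the dimensions that apply."""
    pairs = [(WEIGHTS[d], RATINGS[d][platform])
             for d in WEIGHTS if RATINGS[d][platform] is not None]
    total = sum(w for w, _ in pairs)
    return sum(w * r for w, r in pairs) / total

for p in ["Fireworks", "Together", "Baseten", "Nebius", "Crusoe"]:
    print(f"{p:10s} {composite(p):.2f}")
```

Renormalizing over non-N/A dimensions avoids silently penalizing Baseten for not publishing per-token prices, though it also means the scores are not strictly comparable across platforms with different coverage.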

SLA & Reliability Comparison

Dimension Fireworks Together Baseten Nebius Crusoe
Uptime SLA 99.9% (Enterprise) Best effort 99.9% (dedicated)61 99.95% (cloud SLA) TBD (new GA)
Production Validation 10T tok/day proven 600K+ developers 100x volume growth '25 Sold out Q4 2025 GA Nov 2025; 10+ launches in 90 days
Compliance SOC2 + HIPAA + GDPR SOC2 Type II SOC2 + HIPAA ISO 27001 + SOC2 SOC2 + ISO 27001 + ISO 42001
Dedicated Capacity On-Demand deployments GPU clusters Per-GPU dedicated Enterprise tiers BYOM (contact sales)
Multi-Region 18+ regions, 8+ clouds US + EU (expanding) US (AWS SCA) Finland, NJ, KS US (TX, WY)
Rate Limits Custom (enterprise) Tier-based Dedicated = unlimited Custom quotas Contact sales

Enterprise Readiness Assessment

Fireworks leads on enterprise compliance (the SOC2+HIPAA+GDPR trifecta) and multi-region availability. Baseten offers the strongest dedicated SLA for custom model deployments. Nebius has the highest uptime target (99.95%), backed by its infrastructure ownership. Crusoe has closed its compliance gap significantly with ISO 27001+42001 (the only AI governance certification in this group). Together is still maturing its enterprise compliance posture (SOC2 only), which limits uptake in regulated industries.

Winner By Use Case

Use Case Best Platform Why
High-volume production API Fireworks 10T tokens/day proven scale, fastest engines, SOC2+HIPAA+GDPR
Research & experimentation Together 200+ models, FlashAttention pedigree, broadest catalog
Custom model deployment Baseten Truss SDK, Chains for pipelines, Engine Builder, best DX
Cost-optimized at scale Nebius Lowest per-token pricing, 70% gross margins, owned DCs
Energy-advantaged inference Crusoe Structural $0.03/kWh cost, MemoryAlloy architecture, BYOM

Section 12

Market Outlook & Strategic Positioning

Three Forces Reshaping the Landscape (2026–2027)

1. NVIDIA's Inference Ecosystem Play

The $20B Groq acquisition (Dec 2025) and $150M Baseten investment signal NVIDIA is building a vertically integrated inference ecosystem.38 Platforms aligned with NVIDIA (Baseten, Nebius) gain preferential access to TensorRT-LLM optimizations, Blackwell/Vera Rubin early access, and co-marketing. Platforms with custom engines (Fireworks, Crusoe) must maintain parity independently.

2. Sovereign AI Demand Explosion

The sovereign cloud market is projected to reach $80B in 2026 and $823B by 2032.39 65% of governments will introduce sovereignty requirements by 2028 (Gartner). Platforms with physical infrastructure (Nebius, Crusoe) have an inherent advantage over API-only providers. Together's European expansion with 100K GPUs addresses this but through colocation, not owned infrastructure.

3. Full-Lifecycle Convergence

The market is converging toward platforms that own the complete model lifecycle: inference + fine-tuning + evaluation + post-training (RL). Baseten (via Parsed) and Together (via Refuel) have made acquisitions specifically to close this loop. Platforms offering inference-only will face pressure to expand.

Competitive Dynamics to Watch

Q1 2026: Baseten Series E ($300M) deployment—watch for enterprise logo acceleration
H1 2026: Nebius Kansas City DC comes online (35K GPUs)—capacity constraints ease
H1 2026: Together European GPUs deploy via Hypertec—sovereign AI offering matures
H2 2026: NVIDIA Vera Rubin early access—watch which platforms get first allocation
2026: Fireworks 3–4x infrastructure expansion + full AI creation toolchain40

Strategic Assessment

The managed inference market is large enough ($20.6B in 2026) and growing fast enough (41% CAGR) to support multiple winners. No single platform dominates all dimensions. The sustainable winners will be those that combine proprietary engine optimization (Fireworks, Crusoe) with infrastructure scale (Nebius, Crusoe) and full-lifecycle capabilities (Baseten, Together). The next 12 months will determine whether the market consolidates around 2–3 platforms or remains pluralistic.

Where Crusoe Fits

Crusoe occupies a unique position as the only platform in this group with both proprietary engine technology (MemoryAlloy) and owned energy infrastructure. This creates a structural cost advantage that scales with inference volume. Since the Managed Inference GA in November 2025, Crusoe has shipped at extraordinary velocity: 10+ major product launches in 90 days, including ISO 27001+42001 certifications, Command Center, BYOM, AutoClusters, MCP Server, and GB200 NVL72 benchmarks.

The compliance picture has changed significantly. ISO 27001 + ISO 42001 now puts Crusoe ahead of Together (SOC2 only) and at parity with Nebius (ISO 27001 + SOC2) on security certifications. The ISO 42001 AI governance certification is unique in this landscape—a differentiator for regulated enterprises and government contracts.

The hiring of ex-Google Cloud AI leadership (Erwan Menard, Eesha Pathak) signaled a deliberate pivot from infrastructure-company-that-does-inference to inference-platform-that-owns-infrastructure. The shipping cadence in the 90 days since their arrival shows that strategy executing.

The Crusoe Opportunity (Updated Feb 2026)

With compliance gaps largely closed (ISO 27001+42001) and developer experience improving (Command Center, MCP Server), Crusoe's remaining strategic priorities narrow to two: (1) expand the Intelligence Foundry model catalog from 8 to 30+ models to compete with Together/Fireworks on breadth, and (2) prove production token volume at scale comparable to Fireworks' 10T tokens/day. The combination of MemoryAlloy performance + $0.03/kWh energy + ISO 42001 + owned infrastructure creates a defensible position that no other platform in this landscape can replicate.

Sources & Footnotes

  [1] Gartner, "AI-Optimized IaaS Poised to Become Next Growth Engine," Oct 2025. gartner.com
  [2] Deloitte, "Compute Power AI Predictions 2026." deloitte.com
  [3] MarketsandMarkets, "AI Inference Platform-as-a-Service Market, $105.22B by 2030." marketsandmarkets.com
  [4] a16z, "State of AI / 100T Token Study," Jan 2026. a16z.com
  [5] CNBC, "Nvidia buying Groq's assets for about $20 billion," Dec 2025. cnbc.com
  [6] CoreWeave IPO pricing announcement, Mar 2025. coreweave.com
  [7] a16z, "LLMflation: LLM Inference Cost Trends." a16z.com
  [8] Fireworks AI, "FireAttention V4: FP4 on B200." fireworks.ai
  [9] Fireworks AI, "Cursor Case Study: 1,000 tok/s." fireworks.ai
  [10] SemiAnalysis via X, "FlashAttention v4 at HotChips: 22% faster than cuDNN." x.com
  [11] Together AI, "Tri Dao and FlashAttention." together.ai
  [12] Baseten Docs, "Model Deployment Overview." docs.baseten.co
  [13] Nebius, "Token Factory: Managed Inference." nebius.com
  [14] Crusoe, "Managed Inference: MemoryAlloy." crusoe.ai
  [15] Fireworks AI, "Series C Announcement." fireworks.ai
  [16] Sacra, "Fireworks AI Revenue & Funding." sacra.com
  [17] LLM Benchmarks, "Fireworks Provider Performance." llm-benchmarks.com
  [18] BusinessWire, "Fireworks AI Raises $250M Series C." businesswire.com
  [19] Together AI, "Series A: $102.5M." together.ai; Crunchbase: $534M total. crunchbase.com
  [20] Together AI, "Llama 4 Launch Partner." together.ai
  [21] Together AI, "Acquires Refuel.ai," May 2025. yahoo.com
  [22] Sacra, "Together AI Revenue & Valuation." sacra.com
  [23] Together AI European GPU expansion with Hypertec/5C. together.ai
  [24] Yahoo Finance, "NVIDIA Invests $150M in Baseten." yahoo.com
  [25] Baseten, "Chains for Production Compound AI Systems." baseten.co
  [26] BusinessWire, "Baseten Acquires Parsed," Dec 2025. businesswire.com
  [27] Baseten, "GPT-OSS 120B at 500+ TPS." baseten.co
  [28] Google Cloud Blog, "Baseten 225% Cost-Performance." cloud.google.com
  [29] SiliconANGLE, "Baseten hits $5B valuation, $300M round." siliconangle.com
  [30] Nebius FY2025 earnings: $529.8M revenue, +479% YoY. nebius.com
  [31] Nebius 2026 guidance: $3.0-3.4B revenue, $7-9B ARR, 40% EBITDA. nebius.com
  [32] Microsoft-Nebius $17.4B infrastructure agreement. nebius.com
  [33] Nebius, "Token Factory Pricing." nebius.com
  [34] Crusoe, "Managed Inference GA," Nov 2025. crusoe.ai
  [35] Erwan Menard, SVP Product, Crusoe. Ex-Google Cloud AI (Vertex AI). linkedin.com
  [36] Eesha Pathak, Sr. Director PM, Crusoe. Ex-Google Cloud AI. zoominfo.com
  [37] Crusoe customer announcements: Series E, inference launch. crusoe.ai
  [38] CNBC, "Nvidia-Groq deal structured to keep 'fiction of competition alive.'" cnbc.com
  [39] Gartner, "Sovereign Cloud IaaS Spending $80B in 2026." gartner.com
  [40] Fireworks AI, "Series C: 3-4x Infrastructure Expansion." fireworks.ai
  [41] MarketsandMarkets, "AI Inference Market Size, $255B by 2030." marketsandmarkets.com
  [42] Epoch AI, "LLM Inference Price Trends." epoch.ai
  [43] Menlo Ventures, "2025 State of Generative AI in the Enterprise." menlovc.com
  [44] Sequoia, "AI's $600B Question." sequoiacap.com
  [45] Introl, "Sovereign Cloud AI Infrastructure Data Residency." introl.com
  [46] IDC, "AI Infrastructure Spending $86B in Q3 2025." idc.com
  [47] SDxCentral, "AI Inferencing Will Define 2026." sdxcentral.com
  [48] Baseten, "AWS Strategic Collaboration Agreement," Dec 2025. businesswire.com
  [49] Sacra, "Baseten Valuation & Funding." sacra.com
  [50] Index Ventures, "Inference is the New Runtime." indexventures.com
  [51] CapitalG, "Baseten: The Foundation for Production AI." capitalg.com
  [52] Crusoe, "Series E and Bookings Growth." crusoe.ai
  [53] Baseten, "Performance Benchmarks: Mistral 7B at 130ms TTFT." docs.baseten.co
  [54] Fireworks AI, "Notion Case Study: 4x Latency Reduction." fireworks.ai
  [55] Fireworks AI, "Cresta Case Study: 100x Cost Savings." fireworks.ai
  [56] Together AI, "Series A: $102.5M led by Kleiner Perkins." together.ai
  [57] Salesforce Ventures, "Together AI Series B: $106M." salesforce.com
  [58] Baseten, "Writer Case Study: Custom Palmyra Model Deployment." baseten.co
  [59] Baseten, "OpenEvidence: 78% Lower Latency for Clinical AI." baseten.co
  [60] Nebius, "$350M Investment from NVIDIA & Accel, Dec 2024." nebius.com
  [61] Baseten, "Enterprise SLA: 99.9% Uptime for Dedicated Deployments." docs.baseten.co
  [62] Erwan Menard, "Building the world's favorite AI cloud" (BYOM announcement), Feb 6, 2026. crusoe.ai
  [63] Crusoe, "Introducing Command Center: Unified operations platform for AI workloads," Feb 18, 2026. crusoe.ai
  [64] Crusoe, "Running AI workloads on AMD GPUs with SkyPilot," Jan 13, 2026. crusoe.ai
  [65] Crusoe, "AutoClusters: Minimizing impact of hardware failures in large GPU clusters," Feb 3, 2026. crusoe.ai
  [66] Crusoe, "Introducing the Crusoe Cloud MCP server," Feb 11, 2026. crusoe.ai
  [67] Crusoe, "Crusoe Managed Kubernetes (CMK) now partner-certified for NVIDIA Run:ai," Nov 17, 2025. crusoe.ai
  [68] Crusoe, "Security you can trust: Crusoe Cloud achieves ISO 27001 and 42001 certifications," Feb 13, 2026. crusoe.ai
  [69] TechCrunch, "Crusoe Energy raises $600M Series C at ~$3B valuation," 2024. techcrunch.com
  [70] Crusoe, "Up to 3X faster: Benchmarking Llama 3.1 fine-tuning on NVIDIA GB200 NVL72," Feb 6, 2026. crusoe.ai
  [71] Crusoe, "The new AI benchmark: Unlocking real-world performance with InferenceMAX by SemiAnalysis," Oct 16, 2025. crusoe.ai
  [72] Crusoe, "Odyssey is pioneering general-purpose world models with Crusoe's AI cloud," Jan 21, 2026. crusoe.ai
  [73] Crusoe, "MirageLSD: Decart's real-time AI video model now available on Crusoe Cloud," Oct 8, 2025. crusoe.ai
  [74] Andrey Korolenko, Chief Product & Infrastructure Officer, Nebius. linkedin.com
  [75] Ophir Nave, COO & Executive Director, Nebius. nebius.com

MinjAI Competitive Intelligence Platform • Managed Inference Landscape Report • February 2026

This report is for strategic intelligence purposes. Market data and pricing are subject to change.