2026 marks the inflection point where inference overtakes training as the dominant AI infrastructure workload. Gartner projects $37.5B in AI-optimized IaaS spending in 2026, with 55% ($20.6B) flowing to inference—up from $9.2B in 2025.1 Deloitte estimates inference will consume 67% of all AI compute by end of 2026, up from 50% in 2025.2
The broader AI inference platform-as-a-service market is projected to grow from $18.84B in 2025 to $105.22B by 2030 at a 41.1% CAGR.3 Three forces are accelerating this: agentic AI workflows multiplying token volume per task, reasoning models consuming 10–100x more tokens per query, and enterprise migration from proprietary APIs to open-weight models for cost and control.4
The investment thesis has shifted decisively toward inference. In H2 2025 alone, Baseten raised $150M at a $2.15B valuation (Sep), Fireworks closed a $250M Series C at $4B (Oct), and Crusoe's valuation was reported at $10B+.
Per-token costs are declining at roughly 10x per year at equivalent model quality. GPT-3-equivalent inference fell from $60/M tokens in 2021 to $0.06/M tokens in 2025—a 1,000x reduction in four years.7 This deflation rewards platforms with proprietary engine optimizations that can maintain margins while prices compress.
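As a sanity check on the deflation arithmetic, the quoted endpoints imply the following total and average annual reduction factors (a sketch using only the figures above):

```python
# Implied deflation from the GPT-3-class data points quoted above:
# $60/M tokens (2021) -> $0.06/M tokens (2025). Endpoints are the
# report's figures; the annual factor is a geometric average.
start_price, end_price = 60.0, 0.06   # $/M tokens
years = 2025 - 2021                   # 4-year span

total_reduction = start_price / end_price          # 1000x overall
annual_factor = total_reduction ** (1 / years)     # geometric mean per year

print(f"total reduction: {total_reduction:.0f}x")
print(f"implied average annual factor: {annual_factor:.1f}x")
```

Note that this single data point averages out to roughly 5.6x per year; the "roughly 10x per year" figure is an industry-wide approximation, not derived from these two endpoints alone.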
The inference PaaS market is consolidated at the top: hyperscalers (AWS, GCP, Azure) hold 66–75% share. But the independent managed inference layer—the five platforms analyzed here—is where the fastest innovation is happening and where enterprises are increasingly deploying production workloads for speed, cost, and model flexibility advantages.
This report analyzes the five leading independent managed inference platforms by funding, revenue scale, and technical differentiation. Each operates a proprietary or optimized inference engine, offers per-token API pricing, and targets enterprise production workloads.
| Platform | Valuation | Revenue | Engine | Models |
|---|---|---|---|---|
| Fireworks AI | $4.0B | >$280M ann. | FireAttention V4 | 50+ |
| Together AI | $3.3B | ~$300M ann. | FlashAttention-4 + Kernel Collection | 200+ |
| Baseten | $5.0B | 10x YoY growth | Custom C++ + TensorRT-LLM | BYOM + APIs |
| Nebius | ~$25B mkt cap | $530M FY2025 | Token Factory (vLLM+) | 60+ |
| Crusoe | $10B+ | 5x bookings growth | MemoryAlloy | 8+ |
The managed inference market has bifurcated into two tiers: API-first platforms (Fireworks, Together) competing on model breadth, developer experience, and token pricing; and infrastructure-first platforms (Baseten, Nebius, Crusoe) competing on custom deployment, BYOM, and cost-per-compute-hour. The platforms that bridge both—offering production APIs AND dedicated infrastructure—will capture the most enterprise value.
| Dimension | Fireworks AI | Together AI | Baseten | Nebius | Crusoe |
|---|---|---|---|---|---|
| Founded | 2022 | 2022 | 2019 | 2024 (ex-Yandex) | 2018 |
| HQ | Redwood City, CA | San Francisco, CA | San Francisco, CA | Amsterdam, NL | Denver, CO |
| Total Funding | ~$327M | ~$534M | ~$585M | Public (NBIS) | ~$3.9B |
| Employees | ~166 | ~320 | ~100–150 | ~1,371 | ~1,000+ |
| Inference Engine | FireAttention (custom CUDA) | FlashAttention + Kernel Collection | TensorRT-LLM + Custom C++ | Token Factory (vLLM+) | MemoryAlloy (KV-cache) |
| GPU Support | H100, H200, B200, MI300X | H100, H200, B200, GB200 | H100, H200, B200 | H100, H200, GB300 | H100, H200, B200, GB200, AMD (SkyPilot) |
| Llama 3.3 70B $/M | $0.90 / $0.90 | $0.88 / $0.88 | Dedicated only | $0.13 / $0.40 | $0.25 / $0.75 |
| Key Customers | Cursor, Uber, Samsung, Notion | Salesforce, Zoom, DuckDuckGo | Cursor, Writer, Notion | Microsoft, Meta | Cursor, Fireworks, Together AI |
| Compliance | SOC2, HIPAA, GDPR | SOC2 Type II | SOC2, HIPAA | ISO 27001, SOC2 | SOC2, ISO 27001, ISO 42001 |
| BYOM | Yes (On-Demand) | Yes (Dedicated) | Yes (Truss SDK) | Yes (Enterprise) | Yes (Contact Sales) |
| Fine-Tuning | LoRA, DPO, RFT | LoRA, Full FT | Blueprint + Training | Enterprise only | Roadmap |
Each platform has built or adopted a distinct inference optimization strategy. The engine choice defines their cost structure, performance ceiling, and hardware flexibility.
Custom CUDA kernels written from scratch for each GPU generation. V4 introduces FP4 (NVFP4) precision on NVIDIA B200 Blackwell GPUs with TensorCore Gen 5 instructions. Achieves 3.5x throughput improvement versus SGLang on H200 and >250 tok/s sustained on B200.8 Speculative decoding enabled Cursor to reach ~1,000 tok/s on Llama 70B.9 Uniquely supports both NVIDIA (H100/H200/B200) and AMD (MI300X) hardware.
Tri Dao's FlashAttention is the industry-standard attention kernel, used by virtually every LLM provider. FlashAttention-4 on Blackwell achieves 1,605 TFLOPS (71% of theoretical maximum), 22% faster than NVIDIA's own cuDNN library.10 The Together Kernel Collection provides up to 10% faster training and 75% faster inference on top of FlashAttention.11
Replaced Triton Inference Server with a custom C++ server integrating TensorRT-LLM at the executor API level. Builds TRT-LLM from source and contributes patches upstream. Adds custom CUDA kernels for structured output (via Outlines) and speculative decoding (EAGLE-3, Medusa). Engine Builder automates TRT-LLM engine creation in minutes.12 Deep NVIDIA partnership ($150M investment) ensures early access to optimizations.
Token Factory runs on optimized vLLM with proprietary extensions: speculative decoding, PagedAttention, and KV-cache reuse achieving 4x cost reductions. Nebius designs their own server chassis and operates Europe's first GB300 NVL72 deployment in Finland.13 At ~70% gross margin, Token Factory demonstrates that managed vLLM can be a high-margin business at scale.
MemoryAlloy is a distributed key-value cache architecture that decouples KV storage from GPU compute. Achieves 9.9x improvement in time-to-first-token (TTFT) and 5x throughput versus standard vLLM.14 This architecture is particularly effective for long-context and multi-turn workloads where KV-cache reuse creates compound performance gains.
Direct head-to-head comparisons are limited by vendor-specific test conditions. The table below normalizes available benchmarks to the closest comparable workloads.
| Metric | Fireworks | Together | Baseten | Nebius | Crusoe |
|---|---|---|---|---|---|
| TTFT (70B-class) | 0.30–0.40s | ~0.25s (MinjAI est.) | 0.13s (Mistral 7B)53 | ~0.35s (MinjAI est.) | 9.9x faster vs vLLM |
| Output Throughput | >250 tok/s (B200) | ~175 tok/s (H200) | 650+ tok/s (GPT-OSS 120B)27 | 4x cost-perf via KV reuse | 5x vs vLLM baseline |
| Peak Customer Deploy | ~1,000 tok/s (Cursor) | Claims 30%+ faster than Fireworks | 78% lower latency (OpenEvidence) | N/A (enterprise SLA) | N/A (GA Nov 2025) |
| Speculative Decoding | Yes (production) | Yes (Medusa) | Yes (EAGLE-3, Medusa) | Yes (vLLM native) | Roadmap |
| Multi-Hardware | NVIDIA + AMD | NVIDIA only | NVIDIA only | NVIDIA only | NVIDIA + AMD (SkyPilot)64 |
| Blackwell (B200/GB200) | FP4 via FireAttention V4 | FlashAttention-4 native | 225% cost-perf gain28 | First EU GB300 NVL72 | B200 supported |
These benchmarks are sourced from vendor claims, Artificial Analysis rankings, and customer case studies. No independent third-party has tested all five platforms under identical conditions. Baseten's TTFT benchmark is on Mistral 7B (not 70B); Crusoe's metrics are relative improvements vs. vLLM baseline. Treat as directional, not absolute.
Custom CUDA kernels (Fireworks, Baseten) → Maximum per-GPU performance, hardware-specific optimization
Research-grade kernels (Together) → Deepest attention-layer optimization, cross-platform portability
vLLM-based + extensions (Nebius) → Ecosystem compatibility, proven at scale, lower R&D cost
Architecture innovation (Crusoe) → System-level optimization, unique multi-turn advantage
Fireworks AI is the highest-revenue independent managed inference platform. Founded by ex-Meta PyTorch engineers (CEO Lin Qiao led 300+ engineers building PyTorch), the company raised $250M in Series C at a $4B valuation in October 2025.15 Revenue grew from ~$6.5M ARR (May 2024) to >$280M annualized (Oct 2025), roughly a 43x increase in 17 months.16
| Name | Role | Background |
|---|---|---|
| Lin Qiao | CEO & Co-Founder | Head of PyTorch at Meta, Sr. Director Engineering (300+ eng); PhD UCSB |
| Dmytro Dzhulgakov | CTO & Co-Founder | PyTorch core maintainer at Meta |
| Chenyu Zhao | Co-Founder | Google Vertex AI lead |
| Metric | Value | Context |
|---|---|---|
| TTFT | 0.30–0.40s | Across models, faster than Groq (0.45s) |
| gpt-oss-120b | 960 tok/s | Artificial Analysis benchmark17 |
| B200 Peak | >250 tok/s | FireAttention V4 with FP4 |
| Cursor Deploy | ~1,000 tok/s | Speculative decoding on Llama 70B |
| Round | Date | Amount | Lead Investors |
|---|---|---|---|
| Seed | 2022 | Undisclosed | Benchmark |
| Series A | Mar 2024 | $25M | Benchmark |
| Series B | Jul 2024 | $52M at $552M | Sequoia, NVIDIA |
| Series C | Oct 2025 | $250M at $4B | Lightspeed, Index, Evantic18 |
| Customer | Use Case | Verified Outcome |
|---|---|---|
| Cursor | AI code completion (Llama 70B) | ~1,000 tok/s via speculative decoding; powers Tab autocomplete for millions of developers9 |
| Notion | AI writing assistant | 4x latency reduction vs. previous provider; sub-second response times54 |
| Uber | Compound AI for ride operations | Production-scale multi-model orchestration via FireFunction; specific metrics undisclosed |
| Samsung | On-device + cloud AI features | Galaxy AI integration via Fireworks serverless API; specific metrics undisclosed |
| Cresta | Contact center AI | ~100x cost savings vs. proprietary API providers55 |
V1 (2023): Initial custom CUDA kernels for H100, replacing standard vLLM serving. Achieved ~2x throughput improvement over stock PyTorch inference.
V2 (2024): Added continuous batching, speculative decoding, and H200 support. Multi-tenant GPU sharing enabled the serverless pricing model.
V3 (2024): AMD MI300X support added—making Fireworks the only platform in this group to run on non-NVIDIA hardware. PagedAttention optimization and prefix caching.
V4 (2025): FP4 (NVFP4) precision on B200 Blackwell with TensorCore Gen 5. 3.5x throughput gain over SGLang on H200. This generation targets the AI agent/creation market where sustained high throughput matters more than single-request latency.
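Speculative decoding (introduced in V2 and behind the Cursor deployment) gains throughput by having a small draft model propose several tokens that the target model then verifies in a single forward pass. A minimal expected-value sketch following the standard speculative sampling analysis; the acceptance rate and draft length below are illustrative assumptions, not Fireworks' figures:

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass when the
    draft model proposes draft_len tokens, each accepted independently
    with probability accept_rate (one bonus token on full acceptance).
    Assumes accept_rate < 1."""
    a, k = accept_rate, draft_len
    return (1 - a ** (k + 1)) / (1 - a)

# Illustrative: an 80% acceptance rate with 4 draft tokens yields
# ~3.4 tokens per target-model pass, i.e. ~3.4x fewer expensive passes.
print(f"{expected_tokens_per_pass(0.80, 4):.2f}")
```

The lever is clear from the formula: raising the acceptance rate (a better-matched draft model) pays off faster than simply lengthening the draft.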
Moat assessment: Fireworks' moat is engineering velocity, with four major engine versions in three years, each tuned to a specific GPU generation. The risk is that NVIDIA's own TensorRT-LLM narrows the gap with each release. AMD support is a strategic hedge: if MI300X/MI400 gain traction, Fireworks is the only independent platform ready.
Key risk: Lin Qiao's PyTorch team culture means Fireworks optimizes at the kernel level, not the system level. MemoryAlloy (Crusoe) and Token Factory (Nebius) attack efficiency from the architecture layer—a different competitive angle that kernel optimization alone can't match.
Fireworks has the strongest combination of scale (10T tokens/day), revenue ($280M+), and customer logos (Cursor, Uber, Samsung, Notion). Their PyTorch founding team has deep inference optimization expertise. Multi-hardware support (NVIDIA + AMD) is unique. Primary weakness: not the cheapest on per-token pricing; competes on speed and reliability.
Together AI combines academic research credibility with production-scale infrastructure. Chief Scientist Tri Dao created FlashAttention, the industry-standard attention kernel used by virtually every LLM provider globally. The company raised $305M in February 2025 at a $3.3B valuation and has ~$534M total funding.19
| Name | Role | Background |
|---|---|---|
| Vipul Ved Prakash | CEO & Co-Founder | Founder of Topsy (acquired by Apple), serial entrepreneur |
| Tri Dao | Chief Scientist | Creator of FlashAttention 1–4; Stanford/Princeton PhD |
| Ce Zhang | Co-Founder & President | ETH Zurich professor, data systems researcher |
Revenue splits approximately 30–40% API inference and 60–70% GPU cluster rentals. Gross margins are ~45%, with infrastructure ownership (data centers in Maryland, Memphis, Sweden) expected to improve unit economics.22 Claims 80% cheaper than hyperscalers on equivalent workloads.
| Model | Input $/M | Output $/M |
|---|---|---|
| Llama 3.1 8B | $0.18 | $0.18 |
| Llama 3.3 70B | $0.88 | $0.88 |
| DeepSeek-R1 | $3.00 | $7.00 |
| Llama 4 Maverick (400B MoE) | $0.27 | $0.27 |
| Round | Date | Amount | Lead Investors |
|---|---|---|---|
| Seed | May 2023 | $20M | Lux Capital |
| Series A | Nov 2023 | $102.5M | Kleiner Perkins56 |
| Series B | Mar 2024 | $106M | Salesforce Ventures57 |
| Series C | Feb 2025 | $305M at $3.3B | Prosperity7, Coatue, a16z19 |
Partnered with Hypertec/5C for up to 100,000 GPUs in European data centers ($5B total investment). Positions Together for EU data residency requirements and sovereign AI demand.23
| Customer | Use Case | Verified Outcome |
|---|---|---|
| Salesforce | Enterprise AI features (Agentforce) | Strategic investor ($106M Series B lead); Together powers inference workloads |
| Zoom | AI Companion features | Meeting summarization, real-time AI assistance at scale |
| DuckDuckGo | AI-powered search answers | Privacy-first inference via Together API; open-weight models for data control |
| Pika Labs | AI video generation | GPU clusters for video model training and inference at scale |
| Meta | Llama launch partner | Day-one availability of Llama 4 Maverick/Scout; co-marketing partnership20 |
FlashAttention's industry impact: Tri Dao's FlashAttention is used by virtually every LLM provider—including Fireworks, Baseten, and Nebius from this report. This gives Together unparalleled visibility into attention kernel optimization requirements across the industry.
The Together Kernel Collection goes beyond FlashAttention: it includes optimized kernels for MLP layers, normalization, and embedding operations. Together claims 10% faster training and 75% faster inference vs. stock implementations. This collection is proprietary (unlike FlashAttention itself).
Acquisition strategy: The Refuel.ai acquisition (May 2025) added data quality/structuring capabilities, enabling a train→evaluate→infer loop. This is Together's answer to Baseten's Parsed acquisition—both racing to own the full model lifecycle.
Revenue composition risk: ~60-70% of revenue comes from GPU cluster rentals, not inference API. This means Together's inference margins are less proven at scale than Fireworks'. The shift to owned infrastructure (Maryland, Memphis, Sweden data centers) should improve unit economics but requires massive capex.
Open-source ecosystem leverage: Together's open models (RedPajama, OpenChatKit) and research papers (FlashAttention 1-4, Monarch Mixer) create developer mindshare that converts to paying API customers. This research-to-revenue flywheel is unique in this landscape.
Together's research moat (FlashAttention is literally the kernel everyone uses) gives them unique credibility. 200+ models is the broadest catalog among independents. European expansion addresses sovereignty demand. Primary weakness: training-heavy revenue mix means inference margins are still maturing. Aggressive pricing compresses margins.
Baseten carries a $5B valuation, the highest among the venture-backed platforms whose core business is managed inference (Crusoe's $10B+ valuation spans its broader energy-and-GPU-cloud business), driven by NVIDIA's $150M strategic investment as part of the $300M Series E (Jan 2026).24 Founded in 2019 by ex-Gumroad and ex-Clover Health engineers, Baseten pivoted from ML app building to production inference infrastructure and has seen explosive growth: a 100x increase in inference volume in 2025.
| Name | Role | Background |
|---|---|---|
| Tuhin Srivastava | CEO & Co-Founder | Ex-Gumroad (data scientist/fraud ML), ex-Macquarie (IB); USC |
| Amir Haghighat | CTO & Co-Founder | Ex-Clover Health (ML engineering), ex-Yelp; MS CS UC Irvine |
| Workload | Metric | Result |
|---|---|---|
| GPT-OSS 120B | Throughput | 650+ tok/s (Artificial Analysis #1 on OpenRouter)27 |
| Mistral 7B | TTFT | 130ms |
| Mistral 7B | Throughput | 170 tok/s |
| Embeddings (B200) | vs. vLLM | 3.3x higher throughput |
| B200 Blackwell | Cost-performance | 225% improvement (validated by Google Cloud)28 |
| Round | Date | Amount | Lead |
|---|---|---|---|
| Series C | Feb 2025 | $75M at $825M | Spark Capital |
| Series D | Sep 2025 | $150M at $2.15B | BOND |
| Series E | Jan 2026 | $300M at $5B | IVP, CapitalG, NVIDIA ($150M)29 |
| Customer | Use Case | Verified Outcome |
|---|---|---|
| Cursor | AI code editor inference | Primary inference provider alongside Fireworks; production-scale code completion |
| Writer | Enterprise AI writing platform | Custom Palmyra model deployed via Truss; dedicated GPU deployment58 |
| Zed | AI-powered code editor | 45% lower latency vs. previous provider with dedicated B200 deployment |
| OpenEvidence | Medical AI platform | 78% lower latency, enabling real-time clinical decision support59 |
| Patreon | Creator platform AI features | ~$600K/year savings vs. proprietary APIs; migrated to open-weight models on Baseten |
NVIDIA's $150M bet: This is NVIDIA's largest known investment in a managed inference startup. The strategic rationale: Baseten validates TensorRT-LLM as the enterprise inference standard. Every Baseten deployment runs NVIDIA's software stack, creating lock-in at the engine layer.
Product pivot history: Baseten started in 2019 as an ML app builder (think: Streamlit for ML). The pivot to inference infrastructure happened in 2023 when they realized the bottleneck for ML deployment wasn't the app layer but the serving layer. This pivot explains why their developer experience (Truss SDK, Chains) is best-in-class—they came from a developer tools background.
The Parsed acquisition (Dec 2025) is strategically important: it adds RL (reinforcement learning) and evaluation capabilities. Combined with Baseten Training (closed beta), this gives Baseten the only complete train→evaluate→deploy→improve loop among the five platforms.
AWS Strategic Collaboration Agreement (Dec 2025): Baseten is available on AWS Marketplace with Savings Plans support.48 This is unusual—most inference startups compete against AWS, not partner with them. It signals AWS sees Baseten as complementary (custom model serving) rather than competitive (they don't replicate SageMaker).
Key risk: Baseten's dedicated GPU model means they don't benefit from multi-tenant efficiency the way serverless platforms (Fireworks, Together) do. At small scale, customers pay for idle GPU time. This makes Baseten most compelling for customers with consistent, high-volume workloads.
Baseten's $150M NVIDIA investment creates a deep technical moat around TensorRT-LLM optimization. Three funding rounds in 12 months ($75M → $150M → $300M) show exceptional velocity. The Parsed acquisition gives them the only end-to-end inference + training + RL pipeline among the five. Primary weakness: revenue scale is likely smaller than Fireworks' or Together's, and the enterprise customer base is still growing.
Nebius is the only publicly traded company in this comparison and the largest by market capitalization. The company was spun out of Yandex's cloud infrastructure business and is led by Arkady Volozh (ex-Yandex CEO). Revenue grew 479% YoY to $530M in FY2025, with Q4 alone at $228M (+547% YoY).30
| Name | Role | Background |
|---|---|---|
| Arkady Volozh | CEO | Founded Yandex (Russia's Google); built $25B+ enterprise |
| Andrey Korolenko | Chief Product & Infrastructure Officer | 28-year Yandex/Nebius veteran (since 1998); leads data center buildouts & capacity planning74 |
| Roman Chernin | Chief Business Officer & Co-Founder | 12 years heading Yandex digital services (Search, Maps); spearheading AI cloud business since 2023 |
| Ophir Nave | COO & Executive Director | M&A lawyer; ex-Arnon Tadmor-Levy, ex-Wachtell Lipton; JSD Harvard Law75 |
| Metric | FY2025 | 2026 Guidance |
|---|---|---|
| Revenue | $529.8M | $3.0–3.4B31 |
| ARR | $1.25B | $7–9B |
| EBITDA Margin | Improving | ~40% target |
| Cash | $3.7B | — |
| Data Center | Capacity | Status |
|---|---|---|
| Finland (Mäntsälä) | 60,000 GPUs, 75 MW | Operational + expanding |
| New Jersey | Operational | Live |
| Kansas City | 35,000 GPUs, 40 MW | Coming online |
| Iceland | Planned | Under development |
| Model | Input $/M | Output $/M |
|---|---|---|
| Llama 3.1 8B | $0.02 | $0.06 |
| Llama 3.3 70B | $0.13 | $0.40 |
| DeepSeek-V3 | $0.50 | $1.50 |
| DeepSeek-R1 | $0.80 | $2.40 |
Batch inference at 50% of base pricing.33
| Event | Date | Amount / Detail |
|---|---|---|
| Yandex Restructuring | Jul 2024 | Spun out of Yandex NV; listed on NASDAQ as NBIS |
| NVIDIA Investment | Dec 2024 | $350M from NVIDIA & Accel; earmarked for GPU procurement60 |
| Secondary Offering | Feb 2025 | $700M raised; shares priced at $43 |
| Cash Position | End FY2025 | $3.7B total cash & equivalents |
| Microsoft Deal | 2025 | $17.4B (up to $19.4B) five-year infrastructure agreement32 |
| Meta Deal | 2025 | ~$3B infrastructure partnership |
| Customer | Use Case | Verified Outcome |
|---|---|---|
| Microsoft | AI infrastructure capacity | $17.4B five-year deal; largest known Nebius engagement32 |
| Meta | GPU cluster capacity | ~$3B deal for training and inference infrastructure |
| Tavily | AI search & retrieval (acquired) | Acquired to add agentic AI search capabilities to Nebius platform |
| Enterprise customers | Token Factory API | Demand exceeded capacity in Q4 2025; sold out driving 547% YoY Q4 growth |
The Yandex advantage: Nebius inherited Yandex's 25+ years of large-scale infrastructure operations. Yandex was Russia's Google—search, cloud, self-driving cars, e-commerce. This means Nebius entered the AI infrastructure market with mature operational playbooks that startups lack: data center design, GPU procurement at scale, and network engineering.
European sovereign play: Nebius is headquartered in Amsterdam and operates Europe's largest GPU cluster in Finland (60K GPUs). The EU AI Act and GDPR create demand for European-domiciled inference. Nebius is the only platform in this group with production infrastructure in the EU, giving it a first-mover advantage in the $80B sovereign cloud market.39
Unit economics at scale: ~70% gross margin on $530M revenue ($371M gross profit) is remarkable for infrastructure. The economics work because Nebius owns their data centers, procures GPUs at hyperscaler volume, and runs Token Factory at high utilization. Guidance of 40% EBITDA margin on $3.0-3.4B 2026 revenue implies ~$1.2-1.4B EBITDA potential.
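The margin arithmetic above follows directly from the reported figures:

```python
# Nebius unit economics, reproduced from the report's own numbers.
revenue_fy2025 = 530e6        # FY2025 revenue
gross_margin = 0.70           # ~70% gross margin
gross_profit = revenue_fy2025 * gross_margin        # ~$371M

guidance_2026 = (3.0e9, 3.4e9)   # 2026 revenue guidance range
ebitda_margin = 0.40             # ~40% EBITDA margin target
ebitda_range = [r * ebitda_margin for r in guidance_2026]  # ~$1.2-1.36B

print(f"FY2025 gross profit: ${gross_profit / 1e6:.0f}M")
print(f"2026 implied EBITDA: ${ebitda_range[0] / 1e9:.2f}B"
      f"-{ebitda_range[1] / 1e9:.2f}B")
```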
Capacity as the constraint: Nebius was sold out in Q4 2025. The Kansas City DC (35K GPUs, 40 MW) coming online in H1 2026 and Iceland expansion should relieve this, but demand from Microsoft/Meta absorbs most new capacity. Token Factory for external customers competes with hyperscaler contracts for GPU allocation.
Risk factors: Concentration risk (Microsoft = majority of revenue), geopolitical perception (Yandex heritage), and the Arkady Volozh single-founder dependency. EU sanctions compliance adds operational complexity.
Nebius operates at a fundamentally different scale: $3.7B cash, $17.4B Microsoft deal, publicly traded, ~70% gross margins. Token Factory pricing is the most aggressive in this group (Llama 70B at $0.13/$0.40). Their European infrastructure positions them for the $80B sovereign cloud opportunity. Primary weakness: capacity-constrained (sold out in Q4 2025), limited model catalog (60+ vs 200+ at Together).
Crusoe is the most heavily capitalized private company in this group (~$3.9B raised across equity and debt) and uniquely positioned as both a GPU cloud and a managed inference platform. Managed Inference reached general availability in November 2025, powered by the proprietary MemoryAlloy engine.34 Crusoe's structural energy cost advantage (~$0.03/kWh) underpins its long-term margin thesis.
| Name | Role | Background |
|---|---|---|
| Chase Lochmiller | CEO & Co-Founder | Stanford CS; former quant trader |
| Erwan Menard | SVP Product | Ex-Google Cloud AI (Vertex AI Director of PM); CEO of Elastifile (acquired by Google)35 |
| Eesha Pathak | Sr. Director PM | Ex-Google Cloud AI (Head of Product, Enterprise AI & International Expansion); 15+ years36 |
| Aditya Shanker | GPM, Inference | Inference product lead |
| Omar Lari | Sr. Director PM, IaaS | Infrastructure product lead |
Compliance: SOC2, ISO 27001, and ISO 42001 (Feb 2026). Crusoe achieved ISO 27001 (information security management) and ISO 42001 (AI governance) certifications, significantly closing the compliance gap with Fireworks and Baseten.68 ISO 42001 is notable: it is the first AI-specific governance standard, and Crusoe is the only platform in this group to hold it.
| Model | Input $/M | Output $/M |
|---|---|---|
| Llama 3.3 70B | $0.25 | $0.75 |
| DeepSeek R1 | $1.35 | $5.40 |
| Qwen3 235B | $0.22 | $0.80 |
| Kimi-K2 | $0.60 | $2.50 |
| Round | Date | Amount | Key Investors / Notes |
|---|---|---|---|
| Series A | Apr 2022 | $128M | Valor Equity Partners |
| Series B | Sep 2022 | $350M | G2 Venture Partners |
| Series C | Aug 2024 | $600M at ~$3B | Fidelity, NEA, Founders Fund69 |
| Debt Facility | 2024 | $225M | Infrastructure financing |
| Series D+E | 2025 | Undisclosed | Valuation reported at $10B+52 |
| Total Raised | — | ~$3.9B | Includes equity + debt |
| Metric | Value | Source / Context |
|---|---|---|
| TTFT (MemoryAlloy) | 9.9x faster vs vLLM | Internal benchmark, Nov 202514 |
| Throughput | 5x vs vLLM baseline | MemoryAlloy cluster-scale test |
| Llama 3.1 Fine-Tuning (GB200) | 3x faster vs H100 | GB200 NVL72 benchmark, Feb 202670 |
| InferenceMAX | Benchmark co-creator | Partnership with SemiAnalysis, Oct 202571 |
| Customer | Use Case | Verified Outcome |
|---|---|---|
| Cursor | AI code editor infrastructure | Multi-provider strategy; Crusoe as GPU infrastructure layer (shared with Fireworks/Baseten)37 |
| Together AI | GPU cloud customer | Runs training & inference workloads on Crusoe H100/H200 clusters (metrics undisclosed) |
| Fireworks AI | GPU cloud customer | Uses Crusoe infrastructure for compute capacity scaling (metrics undisclosed) |
| Odyssey | General-purpose world models | Pioneering world model training on Crusoe's scalable GPU cloud; featured case study Jan 202672 |
| Decart (MirageLSD) | Real-time AI video generation | MirageLSD model deployed on Crusoe Cloud; real-time video synthesis73 |
| Sony, Databricks, MIT | Enterprise AI / research | GPU cloud customers (specific metrics undisclosed) |
Crusoe's foundational advantage is structural energy cost. Originally built on stranded natural gas, now transitioning to renewable sources. At ~$0.03/kWh, Crusoe operates at roughly 50–60% lower energy cost than hyperscaler data centers, creating a durable margin advantage that compounds as inference workloads scale.
The energy moat quantified: At $0.03/kWh vs. ~$0.06-0.08/kWh for hyperscalers, Crusoe saves ~$0.03-0.05/kWh. A single H100 draws ~0.7 kW, running 24/7/365 = ~6,132 kWh/year. That's ~$184-307/year per GPU in energy savings. At 10,000 GPUs: $1.8-3.1M/year in structural cost advantage. At 100,000 GPUs: $18-31M/year. This advantage scales linearly and compounds as GPU power draw increases with each generation (B200 draws ~1kW, GB200 even higher).
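The per-GPU arithmetic above can be checked directly; the power draw and $/kWh deltas are the report's estimates, not measured values:

```python
# Back-of-envelope check of the structural energy savings described above.
gpu_draw_kw = 0.7                       # approximate H100 draw under load
hours_per_year = 24 * 365               # continuous operation
kwh_per_gpu = gpu_draw_kw * hours_per_year     # ~6,132 kWh/GPU/year

savings_per_kwh = (0.03, 0.05)          # $/kWh delta vs. hyperscalers
per_gpu = [s * kwh_per_gpu for s in savings_per_kwh]   # ~$184-307/GPU/year
fleet_10k = [p * 10_000 for p in per_gpu]              # ~$1.8-3.1M/year

print(f"kWh per GPU-year: {kwh_per_gpu:,.0f}")
print(f"per-GPU savings: ${per_gpu[0]:.0f}-{per_gpu[1]:.0f}/year")
print(f"10k-GPU fleet: ${fleet_10k[0] / 1e6:.1f}M"
      f"-{fleet_10k[1] / 1e6:.1f}M/year")
```

Because the savings scale linearly with both fleet size and per-GPU power draw, the advantage widens with each GPU generation (B200 at ~1 kW and beyond).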
MemoryAlloy architecture: Unlike other engines that optimize per-GPU efficiency, MemoryAlloy operates at the system level by decoupling KV-cache storage from GPU compute. In multi-turn conversations or long-context workloads, KV-cache data is persisted across requests, eliminating redundant prefill computation. This is why the 9.9x TTFT improvement is on time-to-first-token specifically—it's the prefill step that benefits most from cache reuse.
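The prefill-dominance argument above can be made concrete with a toy model; the conversation sizes here are hypothetical and chosen only to illustrate the mechanism, not Crusoe's benchmark setup:

```python
def prefill_work(history_tokens: int, new_turn_tokens: int,
                 kv_cache_hit: bool) -> int:
    """Toy model: prompt tokens that must go through prefill for one
    request. TTFT scales roughly with this number. On a cache hit the
    conversation history's KV entries are reused, so only the new turn
    needs prefill; on a miss the full prompt is recomputed."""
    if kv_cache_hit:
        return new_turn_tokens
    return history_tokens + new_turn_tokens

# Hypothetical multi-turn chat: 9,000 tokens of prior conversation
# plus a 300-token new user message.
cold = prefill_work(9_000, 300, kv_cache_hit=False)   # 9,300 tokens
warm = prefill_work(9_000, 300, kv_cache_hit=True)    #   300 tokens
print(f"prefill reduction: {cold / warm:.0f}x")
```

This is why the headline gain lands on TTFT specifically: decode throughput is unaffected by cache reuse, but the prefill step shrinks to just the uncached suffix.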
Product velocity (Nov 2025 – Feb 2026): Crusoe shipped an extraordinary amount in 90 days: Managed Inference GA (Nov 20), MemoryAlloy engine paper (Nov 20), Run:ai certification (Nov 17), BYOM formal launch (Feb 6), Command Center (Feb 18), AutoClusters (Feb 3), MCP Server (Feb 11), GB200 NVL72 fine-tuning benchmarks (Feb 6), AMD GPU support (Jan 13), and ISO 27001+42001 (Feb 13). This cadence suggests a well-staffed product org executing at startup speed despite 1,000+ employees.
Compliance leapfrog: The ISO 42001 certification is strategic. It's the world's first AI governance standard (ISO/IEC 42001:2023). No other platform in this group holds it. For enterprises evaluating AI risk governance, this is a differentiator—particularly in regulated industries and government contracts where AI-specific compliance frameworks are emerging requirements.
Platform customer dynamics: Crusoe's most interesting competitive dynamic is that two of its biggest competitors (Fireworks and Together) are also customers of its GPU cloud. This creates an unusual relationship: Crusoe provides the infrastructure that powers competing managed inference APIs. Erwan Menard's Feb 2026 blog framing ("Building the world's favorite AI cloud") suggests Crusoe sees this as a feature, not a conflict—the IaaS revenue from competitors funds managed inference R&D.
Go-to-market evolution: With only 8 models in the catalog vs. 200+ at Together, Crusoe is leaning into BYOM + Command Center as the enterprise play. The combination of "bring your fine-tuned model + run it on MemoryAlloy + monitor via Command Center" creates an end-to-end value proposition for enterprises that want performance without managing infrastructure. The InferenceMAX benchmark partnership with SemiAnalysis also positions Crusoe as a thought leader on inference performance measurement.
Leadership signal: Hiring Erwan Menard (ex-Vertex AI Director of PM) and Eesha Pathak (ex-Google Cloud AI, Head of Product) signals Crusoe is serious about building a Google Cloud-caliber product organization. The shipping velocity since their arrival validates this thesis.
Crusoe is uniquely positioned as the only platform in this group that owns its energy infrastructure AND holds ISO 42001 (AI governance) certification. The product velocity since Nov 2025 has been exceptional: 10+ major launches in 90 days. ISO 27001+42001 closes the compliance gap significantly. Command Center + MCP Server address the developer experience gap. GB200 NVL72 and AMD GPU support via SkyPilot expand hardware flexibility. The remaining gaps: model catalog depth (8 vs. 200+ at Together) and proven production scale at token volume comparable to Fireworks' 10T tokens/day.
Pricing is the most visible competitive dimension in managed inference. The table below normalizes per-token costs across the five platforms for comparable models.
| Platform | Input | Output | Blended (1:1) | vs. Cheapest |
|---|---|---|---|---|
| Nebius | $0.13 | $0.40 | $0.265 | Cheapest |
| Crusoe | $0.25 | $0.75 | $0.50 | +89% |
| Together AI | $0.88 | $0.88 | $0.88 | +232% |
| Fireworks AI | $0.90 | $0.90 | $0.90 | +240% |
| Baseten | Dedicated GPU deployments only (not per-token) | |||
| Platform | Input | Output | Blended (1:1) |
|---|---|---|---|
| Nebius | $0.80 | $2.40 | $1.60 |
| Together AI | $3.00 | $7.00 | $5.00 |
| Crusoe | $1.35 | $5.40 | $3.38 |
| Fireworks AI | ~$8.00 | ~$8.00 | $8.00 |
| GPU | Fireworks | Baseten (per-min) | Crusoe |
|---|---|---|---|
| H100 80GB | $4.00/hr | $6.48/hr ($0.108/min) | $3.90/hr |
| B200 180GB | $9.00/hr | $9.96/hr ($0.166/min) | TBD |
Nebius is the price leader on per-token models, leveraging owned infrastructure at scale (60K+ GPUs in Finland alone) and ~70% gross margins. Crusoe is positioned mid-market: 89% above Nebius but ~44% below Fireworks/Together on Llama 70B. Fireworks and Together compete on speed and reliability, not price. Baseten avoids per-token comparison by focusing on dedicated deployments where customers control cost per GPU-hour. Token-pricing deflation of ~10x/year means today's prices will be tomorrow's floor.
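The blended figures and premiums in the Llama 3.3 70B table can be reproduced with a small helper, assuming the 1:1 input:output token mix the table header specifies:

```python
# Per-token prices ($/M tokens) from the Llama 3.3 70B comparison table.
prices = {
    "Nebius":       (0.13, 0.40),
    "Crusoe":       (0.25, 0.75),
    "Together AI":  (0.88, 0.88),
    "Fireworks AI": (0.90, 0.90),
}

def blended(inp: float, out: float) -> float:
    """Blended $/M tokens at a 1:1 input:output mix."""
    return (inp + out) / 2

cheapest = min(blended(*p) for p in prices.values())
for name, p in prices.items():
    b = blended(*p)
    premium = (b / cheapest - 1) * 100
    print(f"{name:13s} ${b:.3f}/M  (+{premium:.0f}% vs. cheapest)")
```

Real traffic is rarely 1:1 (agentic workloads skew output-heavy), so re-running the blend with your own input:output ratio can reorder the middle of the table.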
This matrix rates each platform across eight dimensions critical to enterprise managed inference buyers. Ratings are relative within this five-platform set (5 = best-in-class, 1 = weakest).
| Dimension | Fireworks | Together | Baseten | Nebius | Crusoe |
|---|---|---|---|---|---|
| Engine Performance | 5 | 4 | 4 | 3 | 4 |
| Model Catalog | 4 | 5 | 3 | 3 | 2 |
| Per-Token Pricing | 2 | 3 | N/A | 5 | 4 |
| Enterprise Compliance | 5 | 3 | 4 | 4 | 4 |
| Developer Experience | 4 | 4 | 5 | 3 | 3 |
| BYOM / Customization | 3 | 4 | 5 | 3 | 3 |
| Infrastructure Scale | 3 | 4 | 3 | 5 | 4 |
| Cost Structure Moat | 2 | 3 | 3 | 4 | 5 |
| Dimension | Fireworks | Together | Baseten | Nebius | Crusoe |
|---|---|---|---|---|---|
| Uptime SLA | 99.9% (Enterprise) | Best effort | 99.9% (dedicated)61 | 99.95% (cloud SLA) | TBD (new GA) |
| Production Validation | 10T tok/day proven | 600K+ developers | 100x volume growth '25 | Sold out Q4 2025 | GA Nov 2025; 10+ launches in 90 days |
| Compliance | SOC2 + HIPAA + GDPR | SOC2 Type II | SOC2 + HIPAA | ISO 27001 + SOC2 | SOC2 + ISO 27001 + ISO 42001 |
| Dedicated Capacity | On-Demand deployments | GPU clusters | Per-GPU dedicated | Enterprise tiers | BYOM (contact sales) |
| Multi-Region | 18+ regions, 8+ clouds | US + EU (expanding) | US (AWS SCA) | Finland, NJ, KS | US (TX, WY) |
| Rate Limits | Custom (enterprise) | Tier-based | Dedicated = unlimited | Custom quotas | Contact sales |
Fireworks leads on enterprise compliance (the SOC2 + HIPAA + GDPR trifecta) and multi-region availability. Baseten offers the strongest dedicated SLA for custom model deployments. Nebius has the highest uptime target (99.95%), backed by its infrastructure ownership. Crusoe has closed its compliance gap significantly with ISO 27001 + 42001 (the only AI governance certification in this group). Together is still maturing its enterprise compliance posture (SOC2 only), which limits uptake in regulated industries.
| Use Case | Best Platform | Why |
|---|---|---|
| High-volume production API | Fireworks | 10T tokens/day proven scale, fastest engines, SOC2+HIPAA+GDPR |
| Research & experimentation | Together | 200+ models, FlashAttention pedigree, broadest catalog |
| Custom model deployment | Baseten | Truss SDK, Chains for pipelines, Engine Builder, best DX |
| Cost-optimized at scale | Nebius | Lowest per-token pricing, 70% gross margins, owned DCs |
| Energy-advantaged inference | Crusoe | Structural $0.03/kWh cost, MemoryAlloy architecture, BYOM |
The $20B Groq acquisition (Dec 2025) and $150M Baseten investment signal NVIDIA is building a vertically integrated inference ecosystem.38 Platforms aligned with NVIDIA (Baseten, Nebius) gain preferential access to TensorRT-LLM optimizations, Blackwell/Vera Rubin early access, and co-marketing. Platforms with custom engines (Fireworks, Crusoe) must maintain parity independently.
The sovereign cloud market is projected to reach $80B in 2026 and $823B by 2032.39 65% of governments will introduce sovereignty requirements by 2028 (Gartner). Platforms with physical infrastructure (Nebius, Crusoe) have an inherent advantage over API-only providers. Together's European expansion with 100K GPUs addresses this but through colocation, not owned infrastructure.
The market is converging toward platforms that own the complete model lifecycle: inference + fine-tuning + evaluation + post-training (RL). Baseten (via Parsed) and Together (via Refuel) have made acquisitions specifically to close this loop. Platforms offering inference-only will face pressure to expand.
The managed inference market is large enough ($20.6B in 2026) and growing fast enough (41% CAGR) to support multiple winners. No single platform dominates all dimensions. The sustainable winners will be those that combine proprietary engine optimization (Fireworks, Crusoe) with infrastructure scale (Nebius, Crusoe) and full-lifecycle capabilities (Baseten, Together). The next 12 months will determine whether the market consolidates around 2–3 platforms or remains pluralistic.
Crusoe occupies a unique position as the only platform in this group with both proprietary engine technology (MemoryAlloy) and owned energy infrastructure. This creates a structural cost advantage that scales with inference volume. Since the Managed Inference GA in November 2025, Crusoe has shipped at extraordinary velocity: 10+ major product launches in 90 days, including ISO 27001+42001 certifications, Command Center, BYOM, AutoClusters, MCP Server, and GB200 NVL72 benchmarks.
The compliance picture has changed significantly. ISO 27001 + ISO 42001 now puts Crusoe ahead of Together (SOC2 only) and at parity with Nebius (ISO 27001 + SOC2) on security certifications. The ISO 42001 AI governance certification is unique in this landscape—a differentiator for regulated enterprises and government contracts.
The hiring of ex-Google Cloud AI leadership (Erwan Menard, Eesha Pathak) signaled a deliberate pivot from an infrastructure company that does inference to an inference platform that owns its infrastructure. The shipping cadence in the 90 days since validates that this strategy is being executed.
With compliance gaps largely closed (ISO 27001+42001) and developer experience improving (Command Center, MCP Server), Crusoe's remaining strategic priorities narrow to two: (1) expand the Intelligence Foundry model catalog from 8 to 30+ models to compete with Together/Fireworks on breadth, and (2) prove production token volume at scale comparable to Fireworks' 10T tokens/day. The combination of MemoryAlloy performance + $0.03/kWh energy + ISO 42001 + owned infrastructure creates a defensible position that no other platform in this landscape can replicate.
MinjAI Competitive Intelligence Platform • Managed Inference Landscape Report • February 2026
75 Sources • 12 Sections • 5 Companies Analyzed
This report is for strategic intelligence purposes. Market data and pricing are subject to change.