Fireworks AI is the platform's most direct competitor in the inference-as-a-service market. Founded in 2022 by seven ex-Meta engineers who built PyTorch,[1] the company has raised $327M+ across four rounds, achieved a $4.0B valuation,[2] and grown to >$280M ARR in under 3 years.[3] With 10K+ customers including Cursor, Uber, DoorDash, Samsung, and Shopify,[4] Fireworks processes 10 trillion tokens per day[3] through a proprietary FireAttention engine that delivers 3-4x throughput improvements over open-source inference stacks.[5]
Fireworks has a 3+ year head start, proven product-market fit at $280M revenue, and the PyTorch team's inference expertise. They serve the exact same customer segment the platform is targeting (AI-native startups, enterprise developers) with a mature, battle-tested platform. The platform must differentiate on sovereign deployment, energy cost advantage, and regulated verticals, where Fireworks' hyperscaler-style cloud model is weakest. Do NOT compete head-to-head on serverless API pricing.
Fireworks AI was founded in 2022 by seven Meta/PyTorch engineers who saw the inference bottleneck coming before the ChatGPT moment.[1] The team brings 20+ combined years of deep-learning systems experience and built the PyTorch framework used by millions of ML engineers worldwide.[8]
| Name | Role | Background |
|---|---|---|
| Lin Qiao | CEO & Co-Founder | Head of PyTorch at Meta (2019-2022), built PyTorch to 1M+ users[8] |
| Dmytro Dzhulgakov | CTO & Co-Founder | PyTorch core maintainer at Meta, inference optimization expert[1] |
| Benny Chen | Co-Founder | Meta Ads ML Infrastructure (2017-2022)[1] |
| Chenyu Zhao | Co-Founder | Google Vertex AI, ML platform engineering[1] |
| Dmytro Ivchenko | Co-Founder | PyTorch ranking systems at Meta[1] |
| James Reed | Co-Founder | PyTorch compiler team at Meta[1] |
| Pawel Garbacki | Co-Founder | Meta Newsfeed ML Infrastructure[1] |
| Attribute | Details |
|---|---|
| Headquarters | Redwood City, California[1] |
| Founded | 2022[1] |
| Employees | 166 (as of January 2026)[1] |
| Mission | "Make generative AI accessible, fast, and reliable"[7] |
| Core Technology | FireAttention engine (proprietary CUDA kernels, NOT vLLM)[5] |
This is not a cloud infrastructure company pivoting to AI inference. The founding team built PyTorch, the dominant framework for deep learning. Their expertise in model architecture, GPU optimization, and distributed systems gives Fireworks a compounding technical moat that the platform cannot easily replicate through engineering hiring alone.
Fireworks has raised $327M+ across four rounds, growing from a 2022 seed to a $4.0B Series C valuation in October 2025.[2] The company's valuation grew roughly 2.7x in five months (from $1.5B in May 2025 to $4.0B in October 2025), driven by explosive revenue growth and strategic partnerships with NVIDIA and AMD.[2]
| Round | Date | Amount | Valuation | Lead Investors |
|---|---|---|---|---|
| Seed | 2022 | Undisclosed | -- | -- |
| Series A[9] | March 2024 | $25M | -- | Benchmark |
| Series B[10] | July 2024 | $52M | $552M | Sequoia Capital |
| Series C[2] | October 2025 | $250M | $4.0B | Lightspeed Venture Partners, Index Ventures |
| **Total** | | $327M+ | | |
| Investor | Round | Strategic Value |
|---|---|---|
| NVIDIA[10] | Series B | Early access to H100/B200 GPUs, co-marketing, enterprise distribution |
| AMD[10] | Series B | MI300X GPU access, multi-hardware strategy validation |
| MongoDB Ventures[2] | Series C | Enterprise sales channel, database integration |
Fireworks attracted high-profile angel investors with deep AI/infrastructure expertise, including Frank Slootman and Sheryl Sandberg.[2] The combination of Sequoia and Benchmark backing (top-tier VCs), NVIDIA/AMD strategic investment (preferential hardware access), and Slootman/Sandberg angel participation (enterprise sales expertise) gives Fireworks durable advantages in capital access, hardware supply, and enterprise go-to-market.
Fireworks' core technical moat is the FireAttention engine, a fully proprietary inference stack built from custom CUDA kernels.[5] Unlike most competitors, which rely on vLLM or TensorRT-LLM, Fireworks writes GPU kernels optimized for each hardware generation (H100, H200, B200, AMD MI300X).[11]
| Version | Launch | Improvement | Key Feature |
|---|---|---|---|
| FireAttention V1[12] | Q1 2023 | 4x vs vLLM | Fused attention kernels, dynamic batching |
| FireAttention V2[13] | Q3 2023 | 12x for long-context | Paged attention for 32K+ tokens (see sketch below) |
| FireAttention V3[11] | Q2 2024 | AMD MI300X support | Multi-hardware abstraction layer |
| FireAttention V4[5] | Q4 2025 | 3.5x on NVIDIA B200 | FP4 quantization, tensor parallelism |
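The paged-attention entry in the V2 row is worth a concrete illustration. Below is a minimal sketch of the general paged KV-cache idea popularized in the open-source vLLM literature, not Fireworks' proprietary kernels: a per-sequence block table maps logical token positions onto fixed-size physical cache blocks, so a 32K-token context requires no contiguous GPU allocation and freed blocks return to a shared pool.

```python
# Minimal paged KV-cache sketch (illustrative of the general technique only;
# Fireworks' FireAttention kernels are proprietary and certainly differ).
BLOCK_SIZE = 16  # tokens per physical cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # physical block pool
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token `pos` of `seq_id` lives."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE == len(table):           # crossed into a new block
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Reclaim all blocks when a sequence finishes."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
for pos in range(40):                                 # a 40-token sequence
    block, offset = cache.append_token(seq_id=0, pos=pos)
print(cache.block_tables[0])                          # 3 blocks cover 40 tokens
```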
| Product | Description | Target Customer |
|---|---|---|
| Serverless Inference[6] | Pay-per-token API, 50+ models, OpenAI-compatible | Developers, AI-native startups |
| On-Demand Deployments[6] | Dedicated clusters (1-1000 GPUs), reserved capacity | Scale-ups, production workloads |
| Virtual Cloud[6] | 18+ global regions, multi-cloud (AWS/GCP/Azure) | Latency-sensitive apps |
| Fine-Tuning[6] | LoRA (see sketch after this table), DPO, RFT (Reinforcement Fine-Tuning) | Custom model development |
| Compound AI[6] | Multi-model workflows, prompt chaining, agents | Complex AI applications |
| Batch Inference[6] | 50% discount for async workloads | Data processing, embeddings |
| Enterprise BYOC[7] | Deploy Fireworks into customer's cloud account | Regulated industries, data sovereignty |
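Since the Fine-Tuning row lists LoRA, and LoRA reappears later as the platform's minimum fine-tuning capability, here is a minimal numpy sketch of the published LoRA technique itself; nothing below is Fireworks-specific code. A frozen weight matrix W gains a trainable rank-r update B·A, so only a small fraction of parameters trains and adapters can be merged or hot-swapped at serving time.

```python
import numpy as np

# LoRA low-rank adapter sketch (the published technique, not Fireworks code).
# The frozen base weight W gets a trainable rank-r update: W' = W + (alpha/r) * B @ A
d_out, d_in, r, alpha = 4096, 4096, 16, 32

W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01       # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init -> no drift at step 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Base path plus low-rank path; only A and B receive gradients in training.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = np.random.randn(d_in)
y = lora_forward(x)
# Trainable params: r*(d_in + d_out) = 131k vs d_in*d_out = 16.8M for full fine-tune.
print(f"trainable fraction: {r * (d_in + d_out) / (d_in * d_out):.4%}")
```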
FireAttention is fully proprietary, NOT built on vLLM or TensorRT-LLM. Custom CUDA kernels optimized per GPU generation give Fireworks the 3-4x throughput edge over open-source stacks cited above, and a moat that deepens with each new hardware release.
The engineering team should carefully evaluate build vs. partner for inference optimization. Catching up to 3 years of CUDA kernel development is a multi-year, high-risk investment.
Fireworks reported >$280M annualized recurring revenue (ARR) at its Series C announcement in October 2025,[3] up from an estimated $6.5M in May 2024.[14] Independent estimates from Sacra placed ARR at $130M in May 2025 (20x year over year),[6] suggesting the true figure likely lies between $130M and $280M.
| Date | ARR | Source | Growth Rate |
|---|---|---|---|
| May 2024 | $6.5M | Sacra estimate[14] | -- |
| May 2025 | $130M | Sacra estimate[6] | 20x YoY |
| October 2025 | $280M+ | Company-reported[3] | 2.2x in 5 months |
| Metric | Value | Date |
|---|---|---|
| Total Customers | 10,000+[4] | Oct 2025 |
| Daily Token Throughput | 10 trillion[3] | Oct 2025 |
| Peak Requests/Sec | 100,000[3] | Oct 2025 |
| Models Available | 50+[6] | Feb 2026 |
| Cloud Regions | 18+[6] | Feb 2026 |
Fireworks operates a multi-tier revenue model spanning pay-per-token serverless inference, reserved on-demand deployments, discounted batch processing, and enterprise BYOC contracts.[6]
The $280M ARR figure is self-reported annualized revenue from the Series C announcement, not audited financials. Sacra's independent estimate was $130M ARR five months earlier (May 2025).[6] The true ARR is likely between $130M and $280M; even at the low end, this represents 20x YoY growth and validates strong product-market fit.
| Metric | Estimate | Basis |
|---|---|---|
| Gross Margin | 40-50% | Typical for cloud inference services[15] |
| CAC Payback | 6-12 months | Developer-led PLG motion |
| Net Revenue Retention | 120-150% | Cursor case study (1000 tok/s growth)[4] |
Fireworks serves three primary customer segments: AI-native companies (Cursor, Genspark, Retell AI), enterprise tech (Uber, DoorDash, Samsung), and developer tools (GitLab, Notion, Upwork).[4] The company's case studies reveal aggressive land-and-expand patterns, with customers like Cursor growing from pilot to 1,000 tokens/sec throughput in 12 months.[4]
| Customer | Use Case | Scale/Impact |
|---|---|---|
| Cursor[4] | AI-powered code editor | 1,000 tok/s throughput with speculative decoding |
| Uber[7] | Customer support automation | Multi-region deployment for low latency |
| DoorDash[7] | Restaurant recommendation engine | Real-time inference at scale |
| Samsung[7] | Device-side AI features | Fine-tuned models for mobile deployment |
| Shopify[7] | E-commerce personalization | Batch inference for product recommendations |
| Notion[7] | AI writing assistant | Low-latency text generation |
| Upwork[7] | Freelancer matching | Semantic search and embeddings |
| Genspark[7] | AI search engine | Multi-model orchestration |
| Retell AI[7] | Voice AI for call centers | Sub-100ms latency requirements |
| GitLab[7] | Code completion (DevSecOps) | Secure on-prem deployment |
| Segment | Customer Type | Go-to-Market |
|---|---|---|
| AI-Native Startups | Cursor, Genspark, Retell AI | Product-led growth, developer community |
| Enterprise Tech | Uber, DoorDash, Samsung, Shopify | Direct sales, NVIDIA co-marketing |
| Developer Tools | Notion, GitLab, Upwork | API-first, OpenAI migration path |
Cursor's growth illustrates Fireworks' land-and-expand motion: the AI code-editor startup went from pilot to ~1,000 tokens/sec throughput with speculative decoding within 12 months.[4]
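Speculative decoding, the technique behind that ~1,000 tok/s figure, is a published method, not Fireworks-specific: a cheap draft model proposes a few tokens, the target model verifies them in one pass, and the longest agreeing prefix is accepted plus one corrected token, so output quality matches the target model while most steps emit several tokens. A minimal greedy sketch with toy stand-in models:

```python
from typing import Callable, List

# Greedy speculative decoding sketch (the general published technique, not
# Fireworks' implementation). `draft` and `target` map a token prefix to the
# next token; in practice both are LLM forward passes, and the target scores
# all k draft positions in a single batched pass.
def speculative_decode(
    target: Callable[[List[int]], int],
    draft: Callable[[List[int]], int],
    prompt: List[int],
    k: int = 4,
    max_new: int = 32,
) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. Target model verifies each position (one parallel pass in practice).
        accepted = 0
        for i in range(k):
            if target(out + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        out += proposal[:accepted]
        if accepted < k:
            # 3. On the first disagreement, emit the target's own token instead.
            out.append(target(out))
    return out[: len(prompt) + max_new]

# Toy models: target cycles through 0..9; draft agrees except at multiples of 5.
target = lambda seq: len(seq) % 10
draft = lambda seq: len(seq) % 10 if len(seq) % 5 else 9
print(speculative_decode(target, draft, prompt=[0], k=4, max_new=12))
```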
Fireworks competes on developer experience (OpenAI-compatible API, fast onboarding) and performance (FireAttention), not price. Their customers are willing to pay 10-20% more than hyperscalers for speed, reliability, and enterprise-grade features.
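That OpenAI compatibility is concrete. Per Fireworks' public documentation, existing OpenAI SDK code can target Fireworks by swapping the base URL and model name; the model identifier below is illustrative and should be checked against the live catalog.

```python
# Minimal OpenAI-compatible call against Fireworks' serverless API, per their
# public docs. The model name is illustrative; check the current catalog.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # documented endpoint
    api_key="fw-...",                                   # Fireworks API key
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative
    messages=[{"role": "user", "content": "Summarize paged attention in one line."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```

For an OpenAI customer the migration is effectively a two-line diff, which is the "OpenAI migration path" the segment table above refers to.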
Fireworks uses a tiered pricing model based on model size and complexity.[16] Notably, Fireworks is NOT the cheapest provider in the market — they compete on speed, reliability, and enterprise features rather than rock-bottom pricing.
| Model Tier | Input (per 1M tokens) | Output (per 1M tokens) | Example Models |
|---|---|---|---|
| Small (<4B params)[16] | $0.10 | $0.10 | Llama 3.2 3B, Gemma 2 2B |
| Medium (4-16B)[16] | $0.20 | $0.20 | Llama 3.1 8B, Mistral 7B |
| Large (>16B)[16] | $0.90 | $0.90 | Llama 3.3 70B |
| MoE 0-56B[16] | $0.50 | $0.50 | Mixtral 8x7B |
| MoE 56-176B[16] | $1.20 | $1.20 | Mixtral 8x22B |
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| DeepSeek V3[16] | $0.56 | $1.68 | 671B total params, 37B active |
| DeepSeek R1[16] | ~$8.00 | ~$8.00 | Reasoning model (o1 competitor) |
| Kimi K2[16] | $0.60 | $2.50 | Long-context specialist (128K tokens) |
| GPU | Hourly Rate | Use Case |
|---|---|---|
| NVIDIA A100 80GB[16] | $2.90 | Training, batch inference |
| NVIDIA H100 80GB[16] | $4.00 | Real-time inference |
| NVIDIA H200[16] | $6.00 | Large-scale training |
| NVIDIA B200[16] | $9.00 | Next-gen inference (2026) |
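A back-of-envelope calculation ties the per-token and per-GPU tables together: at what sustained load does a dedicated H100 undercut the serverless tier? The rates below come from the tables above; the batched-throughput figure is purely an assumption for illustration, since realized throughput varies widely with batch size and sequence length.

```python
# Back-of-envelope: when does a dedicated H100 beat serverless per-token pricing?
# Rates from the tables above; the batched-throughput figure is an assumption.
serverless_per_m = 0.20        # $/1M tokens, Llama 3.1 8B tier (input or output)
h100_per_hour = 4.00           # $/hr, on-demand H100 80GB

breakeven_tok_per_hr = h100_per_hour / serverless_per_m * 1_000_000
print(f"break-even: {breakeven_tok_per_hr:,.0f} tokens/hour")   # 20,000,000

# Assume a batched H100 sustains ~6,000 tok/s on an 8B model (illustrative).
assumed_tok_per_s = 6_000
dedicated_per_m = h100_per_hour / (assumed_tok_per_s * 3600) * 1_000_000
print(f"dedicated cost at that load: ${dedicated_per_m:.3f}/1M tokens")  # ~$0.185
```

Below roughly 20M tokens/hour of sustained load, serverless wins on cost, which is one reason per-token pricing stays sticky for all but the heaviest customers.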
| Provider | Llama 3.1 8B (per 1M tok) | Positioning |
|---|---|---|
| Fireworks[16] | $0.20/$0.20 | Performance + reliability |
| Together AI[17] | $0.18/$0.18 | Price + open-source models |
| Groq[17] | $0.05/$0.08 | Ultra-low latency (custom chips) |
| Cerebras[17] | $0.10/$0.10 | Fast inference (wafer-scale) |
| AWS Bedrock[18] | $0.30/$0.60 | Enterprise ecosystem |
Fireworks is NOT the cheapest inference provider; Groq, Cerebras, and Together AI all undercut Fireworks on price.[17] Fireworks competes instead on GPU-based speed, reliability, and enterprise features.
Strategic Takeaway: Do NOT compete on serverless API pricing. Fireworks' $0.20/M pricing for Llama 8B is sustainable because of their strong gross margins and premium positioning. The platform should lead with its sovereign/compliance value proposition.
Fireworks' technical performance is the foundation of its competitive advantage. Independent benchmarks from Artificial Analysis[17] and LLM Benchmarks[19] consistently rank Fireworks in the top 3 for latency and throughput.
| Metric | Value | Model | Source |
|---|---|---|---|
| Time-to-First-Token (TTFT) | 0.4s | Llama 3.1 70B | Artificial Analysis[17] |
| Median Throughput | 66.49 tok/s | Llama 3.1 70B | Artificial Analysis[17] |
| Llama 3.1 8B Throughput | 127 tok/s | Llama 3.1 8B | LLM Benchmarks[19] |
| B200 Peak Throughput | >250 tok/s | DeepSeek V3 | Company blog[5] |
| Speculative Decoding (Cursor) | ~1,000 tok/s | Custom model | Cursor case study[4] |
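TTFT and decode throughput, as defined in the table above, are easy to reproduce against any OpenAI-compatible endpoint. A minimal streaming sketch (endpoint and model illustrative, as in the earlier example; chunk counts only approximate tokens):

```python
# Measure time-to-first-token (TTFT) and decode rate over a streaming
# OpenAI-compatible call. Endpoint and model are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.fireworks.ai/inference/v1", api_key="fw-...")

start = time.perf_counter()
first_token_at, chunks = None, 0
stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # illustrative
    messages=[{"role": "user", "content": "Write 200 words about GPUs."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()   # TTFT: first content chunk
        chunks += 1                                 # ~1 token per chunk (approx)

if first_token_at is None:
    raise RuntimeError("no content received")
ttft = first_token_at - start
rate = (chunks - 1) / (time.perf_counter() - first_token_at)
print(f"TTFT: {ttft:.2f}s, decode rate: ~{rate:.0f} chunks/s")
```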
| Metric | Value | Date |
|---|---|---|
| Daily Token Throughput | 10 trillion[3] | Oct 2025 |
| Peak Requests/Sec | 100,000[3] | Oct 2025 |
| Cloud Regions | 18+[6] | Feb 2026 |
| Cloud Providers | 8+ (AWS, GCP, Azure, OCI, etc.)[6] | Feb 2026 |
| GPU Fleet | NVIDIA H100/H200, AMD MI300X[11] | Feb 2026 |
Fireworks' marketing materials claim 99.9%+ availability and SOC 2, HIPAA, and GDPR compliance.[7]
| Provider | TTFT (Llama 70B) | Throughput (tok/s) | Ranking |
|---|---|---|---|
| Groq[17] | 0.2s | >200 | 1st (custom chip) |
| Cerebras[17] | 0.3s | ~150 | 2nd (wafer-scale) |
| Fireworks[17] | 0.4s | 66 | 3rd (GPU-based) |
| Together AI[17] | 0.6s | 55 | 4th |
| AWS Bedrock[18] | 1.2s | 45 | 8th |
Fireworks is the fastest GPU-based inference provider, but they're beaten by custom hardware (Groq's LPUs, Cerebras' wafer-scale chips). This validates a multi-chip strategy: pairing NVIDIA GPUs with alternative silicon, as reflected in the platform's own hardware plans.
The platform should target sub-500ms TTFT and 100+ tok/s throughput as minimum viable performance. Anything slower will not be competitive with Fireworks.
| Competitor | Differentiation | Overlap with Fireworks |
|---|---|---|
| Together AI[21] | Open-source models, lower pricing | High (same customer segment) |
| Groq[22] | Custom LPU chips, ultra-low latency | Medium (developer-first) |
| Cerebras[23] | Wafer-scale chips, fast inference | Medium (enterprise focus) |
| Nebius AI[24] | Ex-Yandex team, GPU cloud | Low (Europe/Asia market) |
| Baseten | Model deployment platform | Low (MLOps focus) |
| Modal | Serverless Python runtime | Low (developer tooling) |
| Provider | Product | Fireworks' Advantage |
|---|---|---|
| AWS | Bedrock[18] | Faster, cheaper, more models |
| Google Cloud | Vertex AI | Better DX, open-source models |
| Azure | AI Inference | No vendor lock-in, multi-cloud |
| OpenAI | API | Llama/Mixtral access, lower cost |
| Dimension | Fireworks AI | The Platform |
|---|---|---|
| Core Offering | Serverless + On-Demand Inference[6] | Sovereign AI Infrastructure |
| Target Market | AI-native startups, enterprise tech | Regulated verticals (finance, gov, defense) |
| Hardware | NVIDIA H100/H200/B200, AMD MI300X[11] | NVIDIA H100/H200, alternative silicon |
| Deployment | 18+ cloud regions, BYOC[6] | On-prem, air-gapped, modular containers |
| Revenue | $280M+ ARR[3] | TBD (MVP Aug-Sep 2025) |
| Scale | 10T tokens/day, 10K customers[3] | TBD |
| Funding | $327M, $4B valuation[2] | TBD |
| Capability | Fireworks Benchmark | Target Minimum |
|---|---|---|
| Time-to-First-Token | 0.4s (Llama 70B)[17] | <0.5s |
| Throughput | 66 tok/s (Llama 70B)[17] | >100 tok/s (with alternative silicon) |
| Availability | 99.9%+[7] | 99.9%+ (enterprise SLA) |
| Compliance | SOC 2, HIPAA, GDPR[7] | SOC 2, HIPAA, FedRAMP, ITAR |
| API Compatibility | OpenAI-compatible[6] | OpenAI-compatible (drop-in) |
| Model Selection | 50+ models[6] | 10+ models (curated for verticals) |
| Fine-Tuning | LoRA, DPO, RFT[6] | LoRA minimum |
| Autoscaling | Zero-to-inference in <1s[6] | Dedicated deployments (no cold start) |
| Risk | Impact | Mitigation |
|---|---|---|
| Timing | Fireworks has 3+ year head start[2] | Focus on underserved verticals (sovereign AI) |
| Talent | PyTorch founding team is hard to replicate[1] | Partner with inference optimization vendors |
| Scale | 10T tokens/day creates data flywheel[3] | Lead with quality (compliance, SLAs) over scale |
| Ecosystem | NVIDIA/AMD strategic investors[10] | Leverage alternative silicon partnerships |
| Pricing Pressure | Fireworks can afford to undercut on price[16] | Do NOT compete on serverless API pricing |
| Execution Speed | 166 employees, mature product org[1] | Hire experienced PM/Eng leads ASAP |
Do NOT compete head-to-head with Fireworks on serverless API pricing or developer cloud features. Fireworks has a 3+ year head start, $327M in funding, the PyTorch founding team, and $280M ARR proving product-market fit.
The platform's winning strategy: lead with sovereign deployment, energy cost advantage, and regulated-vertical positioning. Target customers who cannot use Fireworks due to compliance, data sovereignty, or on-prem requirements. Charge a premium for these capabilities; do NOT race to the bottom on price.