Baseten is a serverless AI inference platform that deploys and runs machine learning models in production.[1] Founded in 2019 by Tuhin Srivastava (CEO), Amir Haghighat (CTO), Philip Howes (Chief Scientist), and Pankaj Gupta,[2] the company has evolved from a no-code ML app builder into a leading independent inference-as-a-service provider. With NVIDIA's $150M strategic investment in January 2026[3] and a custom C++ inference engine replacing the standard Triton server,[4] Baseten directly targets the same enterprise inference workloads The platform is pursuing.
Baseten is a direct, high-threat competitor to The platform's inference-as-a-service ambitions. Key concerns: (1) NVIDIA's $150M strategic investment signals intent to make Baseten a preferred inference partner;[3] (2) their custom C++ engine and TensorRT-LLM integration achieve 2-3x throughput improvements over vLLM;[4] (3) expansion into training creates a full-stack competitor capturing the entire model lifecycle;[11] (4) multi-cloud capacity management across 10+ providers gives them geographic reach The platform lacks.[9]
| Name | Title | Background |
|---|---|---|
| Tuhin Srivastava | CEO, Co-Founder[2] | Former Data Scientist at Gumroad; ML fraud detection and content moderation[14] |
| Amir Haghighat | CTO, Co-Founder[2] | Led ML teams at Clover Health; population health management[14] |
| Philip Howes | Chief Scientist, Co-Founder[2] | Former Data Scientist at Gumroad; ML infrastructure[14] |
| Pankaj Gupta | Co-Founder[2] | Engineering leadership |
Baseten raised three rounds totaling $525M in under 12 months (Feb 2025 to Jan 2026), going from ~$1B valuation to $5B. This velocity signals extreme investor confidence in the inference market and Baseten's position within it. The platform's fundraising narrative should emphasize the differentiated sovereign/multi-chip angle that Baseten cannot replicate.
| Round | Date | Amount | Valuation | Lead Investors |
|---|---|---|---|---|
| Seed[5] | 2021 | $2.5M | -- | First Round Capital |
| Series A[5] | 2023 | $13.5M | -- | Sequoia |
| Series A+ (Greylock)[14] | 2023 | $20M | -- | Greylock |
| Series B[5] | Mar 2024 | $40M | ~$400M (est.) | IVP, Spark Capital |
| Series C[5] | Feb 2025 | $75M | ~$1B (est.) | IVP, Spark Capital |
| Series D[17] | Sep 2025 | $150M | $2.15B | BOND |
| Series E[3] | Jan 2026 | $300M | $5B | IVP, CapitalG |
| Total | -- | ~$601M | -- | -- |
| Investor | Type | Strategic Significance |
|---|---|---|
| NVIDIA[3] | Strategic ($150M) | Secures GPU supply + TensorRT-LLM integration. Part of NVIDIA's inference ecosystem strategy. |
| CapitalG (Alphabet)[3] | Strategic | Google Cloud partnership. Baseten on Google Cloud Marketplace.[18] |
| IVP[3] | Financial (multi-round) | Led or co-led Series B, C, and E. Deep conviction. |
| Spark Capital[5] | Financial (multi-round) | Series B and C lead. Early growth investor. |
| Greylock[14] | Financial | $20M follow-on. Enterprise SaaS expertise. |
| BOND[17] | Financial | Series D lead. Late-stage growth. |
| Conviction[5] | Financial | Sarah Guo. AI infrastructure thesis. |
NVIDIA's $150M investment in Baseten is not just financial. It complements NVIDIA's $20B acquisition of Groq and signals a deliberate strategy to control the inference stack.[19] For the first time, inference surpassed training in NVIDIA's total data center revenue in late 2025. The investment positions Baseten as NVIDIA's preferred software layer for enterprise inference deployment. A multi-chip strategy (alternative silicon) directly challenges this lock-in; for The platform, that lock-in is both the biggest risk and the biggest opportunity.
Baseten has evolved from a model deployment tool into a full inference platform with four product tiers.[1] The architecture is designed around serverless GPU orchestration with multi-cloud capacity management.
Baseten's May 2025 launch of Training[11] transforms them from an inference-only provider into a full model lifecycle platform: train, fine-tune, optimize, deploy, serve. Their strategy: customers own model weights, Baseten captures the inference revenue. This is the same value-chain play The platform needs to execute.
Baseten's core technical differentiator is replacing the standard NVIDIA Triton Inference Server with a custom C++ server built directly on TensorRT-LLM.[4] This gives them tighter control over streaming output, structured generation, and request scheduling. The custom gRPC-based server eliminates Triton overhead while maintaining compatibility with TensorRT-LLM's kernel optimizations. A client-side view of the streaming path follows the benchmark table below.
| Metric | Improvement | Benchmark Context |
|---|---|---|
| Throughput (tokens/sec) | 2-3x vs. vLLM[4] | TensorRT-LLM Engine Builder |
| Time-to-First-Token | 30% faster vs. vLLM[4] | Engine Builder deployments |
| LLM Inference Speed | 33% faster[12] | TensorRT-LLM vs. default |
| SDXL Inference | 40% faster[12] | Image generation workloads |
| Cost-Performance (Blackwell) | 225% better[18] | Google Cloud A4 VMs, high-throughput |
| Cost-Per-Token (Blackwell) | Up to 10x reduction[12] | vs. Hopper platform |
| Writer (Customer) | 60% higher tokens/sec[12] | Palmyra LLMs on Baseten |
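From the client side, the practical benefit of the custom server is token-by-token streaming behind an OpenAI-compatible API (see the capability matrix below). A minimal sketch of consuming that stream; the base URL, model id, and API key are hypothetical placeholders, not Baseten-specific values:

```python
# Minimal streaming client sketch against an OpenAI-compatible endpoint.
# The base_url, api_key, and model id below are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                       # placeholder credential
)

stream = client.chat.completions.create(
    model="example-llm",  # placeholder model id
    messages=[{"role": "user", "content": "One-line summary of TensorRT-LLM?"}],
    stream=True,  # exercises the server's token-by-token streaming path
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Time-to-first-token improvements like the 30% figure in the table above show up directly in this loop: the first chunk simply arrives sooner.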
Truss is Baseten's open-source standard for packaging models.[15] It creates containerized model servers without requiring Docker knowledge, supporting any framework (PyTorch, TensorFlow, TensorRT, Triton). With 6,000+ GitHub stars,[15] Truss serves as the developer acquisition funnel: open-source users convert to paid Baseten deployments.
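The packaging surface Truss exposes is deliberately small: a model directory containing a config.yaml plus a model/model.py that implements load() and predict(). A minimal sketch of the model/model.py half, using an illustrative Hugging Face sentiment pipeline (the model choice is an assumption for demonstration, not from Baseten's docs):

```python
# model/model.py -- minimal Truss model sketch (illustrative).
# Truss calls load() once at container startup and predict() per request.
from transformers import pipeline


class Model:
    def __init__(self, **kwargs):
        # Truss injects config and secrets via kwargs; unused in this sketch.
        self._pipeline = None

    def load(self):
        # Heavy initialization runs once, before the server accepts traffic.
        self._pipeline = pipeline("sentiment-analysis")

    def predict(self, model_input: dict) -> dict:
        # model_input is the deserialized JSON request body.
        return {"prediction": self._pipeline(model_input["text"])}
```

Deploying the directory to Baseten is then a single CLI step (`truss push`), which is what keeps the open-source-to-paid conversion funnel low-friction.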
Baseten's custom C++ engine is a meaningful technical moat. By replacing Triton and building directly on TensorRT-LLM, they achieve performance gains that competitors using off-the-shelf serving stacks cannot easily replicate. However, this architecture is NVIDIA-locked: Baseten has no equivalent optimization for AMD, Intel, or custom ASIC chips. A multi-chip architecture (H100/H200 + alternative silicon) creates a fundamentally different, and potentially more defensible, competitive position.
Baseten serves 100+ enterprise customers along with hundreds of smaller companies.[8] Their customer base is concentrated in AI-native companies building production applications, with a growing presence in regulated industries.
| Customer | Segment | Use Case | Relationship |
|---|---|---|---|
| Cursor[23] | AI-Native (Code) | AI code editor inference | Production inference |
| Notion[23] | Enterprise SaaS | AI features in productivity suite | Production inference |
| Writer[8] | Enterprise AI | Custom 70B LLM inference, 100% of inference on Baseten | Deep partnership |
| Abridge[8] | Healthcare AI | Clinical documentation, 100% inference on Baseten | Deep partnership |
| Clay[23] | Sales Tech | AI-powered sales intelligence | Production inference |
| Descript[8] | Media/Creative | Audio/video AI processing | Inference at scale |
| Superhuman[24] | Enterprise SaaS | 80% faster embedding inference | Custom model deployment |
| Sully AI[25] | Healthcare | Clinical AI, open-source model migration | Full inference stack |
| Patreon[8] | Creator Economy | Content moderation, recommendations | ML deployment |
Baseten's customer base skews toward AI startups and tech companies. They lack sovereign, air-gapped deployment capabilities. The platform should target defense, government, financial services, and healthcare enterprises that require physically isolated inference. Baseten's public cloud architecture cannot serve these customers without fundamental redesign.
Baseten operates two pricing models: pay-per-token for Model APIs (serverless) and pay-per-minute for dedicated GPU deployments.[13] This dual approach captures both bursty startup workloads and sustained enterprise demand. A back-of-envelope break-even sketch follows the GPU table below.
| Model | Input Pricing | Output Pricing | Notes |
|---|---|---|---|
| DeepSeek V3 | Pay-per-use | Pay-per-use | Served on Blackwell A4 VMs[18] |
| DeepSeek R1 | Pay-per-use | Pay-per-use | Reasoning model[18] |
| Llama 4 Maverick | Pay-per-use | Pay-per-use | Latest Meta model[18] |
| Custom/Fine-tuned | GPU-minute billing | GPU-minute billing | Dedicated deployment[13] |
| GPU Type | Pricing Model | Key Features |
|---|---|---|
| NVIDIA A10G | Per-minute billing | Entry-level, image/audio models |
| NVIDIA A100 (40/80 GB) | Per-minute billing | Standard LLM serving |
| NVIDIA H100 | Per-minute billing | High-performance LLMs |
| NVIDIA B200 | Per-minute billing | Latest Blackwell, training + inference[11] |
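To quantify where the dual pricing model flips, a back-of-envelope break-even between per-token and per-GPU-minute billing is useful. Every number in this sketch is an assumed placeholder, not a quoted Baseten rate:

```python
# Hypothetical break-even: serverless per-token vs. dedicated per-GPU-minute.
# All three constants are assumptions for illustration, not quoted prices.

SERVERLESS_PER_1M_TOKENS = 2.50   # $ per 1M output tokens (assumed)
DEDICATED_PER_MINUTE = 0.11       # $ per GPU-minute, e.g. one H100 (assumed)
THROUGHPUT_TOKENS_PER_SEC = 1500  # sustained decode throughput (assumed)

# Effective dedicated cost per 1M tokens at 100% utilization.
tokens_per_minute = THROUGHPUT_TOKENS_PER_SEC * 60
dedicated_per_1m = DEDICATED_PER_MINUTE / tokens_per_minute * 1_000_000

# A dedicated GPU bills for idle minutes, so its effective per-token rate
# scales inversely with utilization; below this threshold, serverless wins.
break_even_utilization = dedicated_per_1m / SERVERLESS_PER_1M_TOKENS

print(f"Dedicated at full utilization: ${dedicated_per_1m:.2f} per 1M tokens")
print(f"Serverless is cheaper below ~{break_even_utilization:.0%} utilization")
```

With these placeholder numbers, dedicated capacity wins above roughly 50% sustained utilization, which is why serverless captures bursty startup traffic while dedicated deployments capture sustained enterprise demand.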
Baseten, Together AI, Fireworks AI, and DeepInfra are in an active price war on per-token inference.[26] NVIDIA Blackwell is enabling up to 10x cost-per-token reductions.[12] The platform should not compete on per-token pricing for open-source models. Instead, differentiate on: (1) dedicated, predictable pricing for enterprise; (2) multi-chip cost optimization; (3) sovereign deployment premiums for regulated industries.
The AI inference market is projected to account for two-thirds of all AI compute by end of 2026, up from one-third in 2023.[6] The market exceeds $100B and is one of the largest and fastest-growing in tech history.[6]
| Metric | Baseten | Together AI | Fireworks AI | Replicate |
|---|---|---|---|---|
| Valuation | $5B[3] | ~$3.3B[26] | ~$2B (est.)[26] | ~$1B (est.) |
| Total Funding | $585M[5] | $400M+[26] | $250M+ | $100M+ |
| Employees | ~191[7] | ~200 | ~100 | ~80 |
| Key Differentiator | Custom C++ engine[4] | 200+ models[26] | Flash-Attention v2[26] | Easy prototyping |
| Custom Models | Yes (Truss)[15] | Yes | Yes | Limited |
| Training | Yes (GA)[11] | Yes | No | No |
| Multi-Cloud | 10+ providers[9] | Limited | Limited | Single |
| Self-Hosted | Yes (VPC)[9] | No | No | No |
| NVIDIA Investment | $150M[3] | No | No | No |
| Uptime SLA | 99.99%[10] | 99.9% | 99.9% | Best effort |
| Enterprise Focus | High | Medium | Medium | Low (prototyping) |
| Dimension | Baseten | The platform |
|---|---|---|
| Architecture | Serverless, multi-cloud[9] | Dedicated, sovereign-ready |
| Chip Strategy | NVIDIA only (TensorRT-LLM)[12] | Multi-chip architecture |
| Inference Engine | Custom C++[4] | In Development |
| Training | GA (B200)[11] | Planned |
| Infrastructure | No owned DCs; uses public cloud[9] | Owned DCs + air-cooled containers |
| Sovereignty | No | Yes (Air-gapped) |
| Revenue | 10x YoY growth[6] | Pre-revenue (inference) |
| Customers | 100+ enterprise[8] | Design partner stage |
| # | Decision | Impact |
|---|---|---|
| 1 | Built a custom inference engine[4] | 2-3x throughput improvement over vLLM. Real technical moat vs. commoditized serving stacks. |
| 2 | Secured NVIDIA as a strategic investor[3] | $150M investment + preferential access to Blackwell GPUs + TensorRT-LLM co-development. |
| 3 | Developer-first GTM via Truss[15] | 6,000+ GitHub stars. Open-source funnel converts developers to paid customers. Zero enterprise sales friction. |
| 4 | Multi-cloud capacity management[9] | 10+ cloud providers = resilience + cost optimization + 99.99% uptime SLA. |
| 5 | Expanded from inference to training[11] | Full lifecycle capture. Customers who train on Baseten have zero friction deploying inference. |
| 6 | Raised aggressively in a hot market[3] | $525M in 12 months. Capital advantage over smaller competitors. Runway to sustain price wars. |
| # | Vulnerability | Opportunity |
|---|---|---|
| 1 | No owned infrastructure[9] | The platform's owned DCs and energy assets = 30-50% lower cost. Structural advantage at scale. |
| 2 | NVIDIA-only GPU dependency[12] | A multi-chip strategy hedges supply risk and offers workload-optimal routing. |
| 3 | No sovereign/air-gapped capability | Defense, government, healthcare enterprises need physically isolated inference. Baseten cannot serve them. |
| 4 | ~191 employees = thin engineering bench[7] | Rapid expansion creates execution risk. The platform can target Baseten's underserved customer segments. |
| 5 | Pricing compression from competitors[26] | Baseten in price war with Together AI, Fireworks, DeepInfra. The platform competes on value, not price. |
| 6 | Public cloud cost structure | Every GPU-minute includes cloud provider markup. The platform's owned infrastructure avoids this margin drag. |
Baseten proved that replacing Triton with a custom C++ server delivers 2-3x gains.[4] The platform needs equivalent proprietary optimization, ideally across multiple chip architectures.
Baseten cannot offer air-gapped, physically isolated inference. The platform should make sovereign deployment the primary differentiator for enterprise sales.
Baseten's serverless model is expensive at sustained volume.[13] The platform should offer flat-rate, dedicated pricing that undercuts Baseten by 30-50% for enterprise workloads.
Position multi-chip architecture as enterprise risk mitigation. NVIDIA supply constraints and Baseten's single-vendor dependency are real customer concerns.[12]
Baseten's training launch[11] captures the full model lifecycle. The platform needs integrated fine-tuning-to-inference before Baseten matures this capability.
Baseten's healthcare customers (Abridge, Sully AI)[8] run on public cloud. The platform can win these verticals with HIPAA-ready, air-gapped inference.
| Capability | Baseten | Platform (Target) |
|---|---|---|
| Serverless Inference | Live | Planned |
| Dedicated Inference | Live | In Dev |
| Custom Model Deployment | Truss Framework | Planned |
| Training (Fine-Tuning) | GA | Planned |
| Multi-Chip Support | NVIDIA Only | Multi-chip |
| Sovereign/Air-Gapped | No | Yes |
| Owned Data Centers | No | Yes |
| Energy Assets | No | Yes |
| OpenAI-Compatible API | Yes | Planned |
| Self-Hosted (Customer VPC) | Yes | Planned |
| Multi-Cloud | 10+ Providers | Owned Infra Only |
| Speculative Decoding | Production | Evaluating |
| Disaggregated Serving | Production | Evaluating |
| Segment | Customers | Inference Use Case |
|---|---|---|
| AI-Native | Cursor, Writer, Descript, Clay[8][23] | Core product inference at scale |
| Enterprise SaaS | Notion, Superhuman, Patreon[23][24] | AI feature integration |
| Healthcare | Abridge, Sully AI[8][25] | Clinical documentation, patient AI |
| Developer Tools | Oxen AI[27] | Dataset-to-model pipeline |
| Metric | Value | Source |
|---|---|---|
| Inference share of AI compute (2023) | ~33% | Industry estimates[6] |
| Inference share of AI compute (end 2026) | ~66% | Analyst projections[6] |
| Total inference market size | $100B+ | Industry estimates[6] |
| NVIDIA inference vs. training revenue | Inference surpassed training (late 2025) | Deloitte[19] |
| Blackwell cost-per-token improvement | Up to 10x vs. Hopper | NVIDIA[12] |
This report was compiled from 28 primary sources including Baseten's corporate website, blog posts, customer stories, pricing page, official press releases (BusinessWire), investor announcements (CapitalG, Greylock), third-party research (Tracxn, Fortune, CNBC, SiliconANGLE, VentureBeat, TechFundingNews, AInvest), NVIDIA official blog and case study, Google Cloud blog, AWS partner success story, Vultr blog, ZenML analysis, and GitHub repository data. Revenue figures are estimated from investor disclosures and press reports. Employee count from Tracxn (Jan 31, 2026). Performance claims are self-reported by Baseten unless otherwise noted. Report accessed and compiled February 16, 2026.