Together AI is a research-driven AI cloud company[1] that combines open-source AI research with a commercial inference and training platform. Founded in June 2022 by a team of Stanford professors and serial entrepreneurs,[2] the company is best known for FlashAttention, an IO-aware attention algorithm created by Chief Scientist Tri Dao that is now used by virtually every major AI lab, including OpenAI, Anthropic, Meta, Google, and DeepSeek.[3]
Together AI operates an "AI Native Cloud" offering serverless API access to 200+ open-source models, GPU cluster rentals, and enterprise fine-tuning services.[4] The company has grown rapidly from ~$44M revenue in 2024 to an annualized run rate of ~$300M by September 2025,[5] and reached a $3.3B valuation after its $305M Series B in February 2025.[6]
Together AI validates the market for inference-as-a-service built on open-source models, which is exactly the segment The platform is targeting. Despite FlashAttention-level optimization, Together AI prices at roughly breakeven, which suggests that software-level kernel optimizations alone do not create sustainable margins. The platform's energy cost advantage, which Together AI cannot replicate, is therefore the key to sustainable margins. At the same time, Together AI could be a potential integration partner for model serving, given their deep FlashAttention expertise and model optimization stack.
Together AI was founded in June 2022 by five Stanford-affiliated researchers and entrepreneurs who believed open-source foundation models represented a generational shift in technology.[2] The founding team combines deep academic AI research (Chris Re, Percy Liang, Ce Zhang, Tri Dao) with proven entrepreneurial track records (Vipul Ved Prakash).
| Name | Title | Background |
|---|---|---|
| Vipul Ved Prakash | Co-Founder & CEO[10] | Founded Topsy (acquired by Apple for $200M+ in 2013), co-founded Cloudmark (acquired by Proofpoint for $110M in 2017). Director of Engineering at Apple post-acquisition.[10] |
| Ce Zhang | Co-Founder & CTO[11] | Post-doc at Stanford advised by Chris Re. Professor at ETH Zurich. ML systems researcher.[2] |
| Tri Dao | Co-Founder & Chief Scientist[3] | PhD Stanford (co-advised by Chris Re & Stefano Ermon). BS Mathematics, Stanford. Assistant Professor at Princeton. Creator of FlashAttention and Mamba. AI2050 Fellow (Schmidt Sciences).[12] |
| Chris Re | Co-Founder[2] | Stanford Professor of CS. MacArthur Fellow. Founded data/ML startup acquired by Apple in 2017. Advisor to multiple AI companies.[2] |
| Percy Liang | Co-Founder[2] | Stanford Professor of CS. Director of Stanford CRFM (Center for Research on Foundation Models). Led HELM benchmark.[2] |
| Kai Mak | Chief Revenue Officer[11] | Enterprise sales leadership |
| Charles Srisuwananukorn | Founding VP Engineering[11] | Engineering leadership |
| James Barker | VP EMEA[11] | European expansion lead |
Together AI's leadership is uniquely research-heavy compared to competitors like Fireworks AI or Groq. Four of the five co-founders hold professorships: Re and Liang at Stanford, Dao at Princeton, and Zhang at ETH Zurich. Tri Dao's FlashAttention is arguably the single most impactful open-source contribution to LLM inference performance in the last three years. This research DNA drives their core technical advantage but may also explain a weaker enterprise GTM motion compared to more commercially-oriented competitors.
| Round | Date | Amount | Valuation | Lead Investors |
|---|---|---|---|---|
| Seed[6] | 2022 | Undisclosed | -- | Lux Capital, others |
| Series A[6] | Nov 2023 | ~$100M (est.) | Undisclosed | Prosperity7, NVIDIA |
| Series A Extension[6] | 2024 | ~$129M (est.) | ~$1.25B | Emergence Capital, Kleiner Perkins |
| Series B[6] | Feb 2025 | $305M | $3.3B | General Catalyst, Prosperity7 |
| Total (disclosed) | -- | ~$534M | -- | -- |
| Investor | Type | Strategic Significance |
|---|---|---|
| NVIDIA | Strategic[6] | GPU supply priority, technical partnership, Blackwell early access |
| Prosperity7 (Saudi Aramco) | Strategic[6] | Middle East sovereign AI ambitions, led Series A and co-led Series B |
| Salesforce Ventures | Strategic[6] | Enterprise distribution, CRM integration potential |
| General Catalyst | Financial[6] | Led Series B, strong AI portfolio (Anduril, Stripe) |
| Kleiner Perkins | Financial[6] | Tier-1 VC validation |
| SK Telecom | Strategic[6] | APAC distribution, telecom AI workloads |
| DAMAC Capital | Strategic[6] | Middle East infrastructure and data center capacity |
| Period | Revenue | Notes |
|---|---|---|
| 2024 (Full Year) | ~$44M[7] | Early commercialization stage |
| End of 2024 | $130M ARR[5] | Rapid acceleration in H2 2024 |
| Sep 2025 | $300M ARR[5] | 130% growth from year-end 2024 |
Together AI generates revenue through two primary lines: per-token API usage (~30-40% of revenue) and GPU server rentals (~60-70% of revenue).[5] The GPU rental business is lower-margin and more commodity-like. The API/inference business is higher-margin but faces intense pricing pressure from competitors. This revenue mix matters for understanding their true competitive position in inference specifically.
Together AI's $300M ARR on ~320 employees implies ~$937K revenue per employee, a strong efficiency metric. However, their heavy reliance on GPU rental revenue (60-70%) means their inference-specific revenue is closer to ~$90-120M ARR. The platform should benchmark against the inference slice, not total revenue, when sizing the opportunity.
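The arithmetic behind these estimates, as a quick sanity check (the 30-40% inference share is derived from the reported 60-70% GPU rental mix, not a disclosed figure):

```python
arr = 300e6                        # Sep 2025 annualized run rate [5]
employees = 320                    # Jan 2026 headcount [7]
gpu_rental_share = (0.60, 0.70)    # estimated revenue mix [5]

print(f"Revenue per employee: ${arr / employees:,.0f}")             # ~$937,500
low, high = (arr * (1 - s) for s in reversed(gpu_rental_share))
print(f"Inference-specific ARR: ${low/1e6:.0f}M to ${high/1e6:.0f}M")  # $90M to $120M
```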
Together AI positions as an "AI Native Cloud" with four core product pillars: Inference, Fine-Tuning, Training, and GPU Clusters.[4] Below is the full product architecture.
Unlike Crusoe or CoreWeave, Together AI does not own data centers or energy assets. They partner with infrastructure providers (Hypertec, 5C Group) for GPU capacity and focus their engineering on software-level optimization (FlashAttention, inference engine, model serving). This is a fundamentally different strategy from The platform's vertically integrated approach: Together AI's margin is in the software layer; The platform's is in the energy layer.
FlashAttention is the single most important open-source contribution to LLM inference optimization in the last three years. Created by Tri Dao, it is used by OpenAI, Anthropic, Meta, Google, NVIDIA, Mistral, DeepSeek, Tencent, and Alibaba.[3] Understanding FlashAttention is critical for The platform's inference engine strategy.
| Version | Date | Target Hardware | Key Innovation | Performance |
|---|---|---|---|---|
| FlashAttention-1[3] | May 2022 | NVIDIA A100 | IO-aware tiling: reduces HBM reads/writes from quadratic to linear | 2-4x speedup vs. standard attention |
| FlashAttention-2[3] | Jul 2023 | NVIDIA A100 | Better work partitioning, reduced non-matmul FLOPs, parallelism over sequence length | ~2x over FA-1; 72% FLOPs utilization on A100 |
| FlashAttention-3[14] | Jul 2024 | NVIDIA H100 (Hopper) | Warp specialization, WGMMA tensor cores, TMA async data movement, FP8 support | 1.5-2x over FA-2; 740 TFLOPS FP16 (75% util); ~1.2 PFLOPS FP8 |
| FlashAttention-4[20] | 2025 (research) | NVIDIA Blackwell (SM100) | 5-stage warp-specialized pipeline, software exp2() on CUDA cores, adaptive rescaling | First attention kernel to exceed 1 PFLOPS on single GPU (target) |
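The "IO-aware tiling" named in the table is the heart of all four versions: keys and values are processed in blocks, with a running max and normalizer so the softmax never materializes the full N x N score matrix. A minimal single-head PyTorch sketch of that online-softmax recurrence (illustrative only; the real gains come from fusing this loop into one CUDA kernel that keeps each tile in on-chip SRAM):

```python
import torch

def tiled_attention(q, k, v, block=128):
    """Blockwise attention with an online softmax (FlashAttention's core idea).

    q, k, v: (seq_len, head_dim) for a single head. The full
    (seq_len, seq_len) score matrix is never materialized.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    m = torch.full((q.shape[0], 1), float("-inf"))  # running row max
    l = torch.zeros((q.shape[0], 1))                # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                      # scores for this K/V tile
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                    # tile's softmax numerator
        corr = torch.exp(m - m_new)                 # rescale earlier partial sums
        l = l * corr + p.sum(dim=-1, keepdim=True)
        out = out * corr + p @ vb
        m = m_new
    return out / l

# Sanity check against the naive quadratic implementation.
q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5)
```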
FlashAttention-3 exploits three key features of the NVIDIA Hopper architecture:[14]

- WGMMA (warpgroup matrix multiply-accumulate) instructions for substantially higher Tensor Core throughput
- TMA (Tensor Memory Accelerator) for asynchronous data movement between HBM and shared memory, overlapped with compute via warp specialization
- FP8 Tensor Cores, paired with block quantization to limit the accuracy cost of low precision
FlashAttention is freely available under BSD license.[21] It is already integrated into PyTorch, Hugging Face Transformers, and vLLM. The platform can and should use FlashAttention in its inference engine without any licensing cost. Together AI's competitive advantage is not FlashAttention itself (it is open source), but rather their proprietary optimizations built on top of it (Together Inference Engine, custom kernels, speculative decoding integration). The platform should integrate FlashAttention-3 immediately and build proprietary optimizations for its multi-chip architecture.
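Integration can indeed be a near one-line change, since FlashAttention already ships inside PyTorch's scaled_dot_product_attention and Hugging Face Transformers. A hedged sketch (the model id is illustrative, and the backend-selection API assumes PyTorch 2.3+ and a recent Transformers release):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Route attention through PyTorch's fused FlashAttention backend
# (requires a CUDA GPU and fp16/bf16 tensors).
q = k = v = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# In Hugging Face Transformers, the flash-attn kernels are opt-in per model:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",       # illustrative model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires pip install flash-attn
    device_map="auto",
)
```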
FlashAttention is currently optimized for NVIDIA GPUs only (CUDA kernels). A multi-chip strategy (H100/H200 + alternative silicon) creates an opportunity to build hardware-aware attention kernels optimized for non-NVIDIA accelerators. This would be a genuine technical differentiator that Together AI's NVIDIA-only approach cannot match.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Notes |
|---|---|---|---|
| Llama 3.1 8B Instruct | $0.18 | $0.18 | Entry-level, high-volume workloads |
| Llama 3.3 70B Instruct | $0.88 | $0.88 | Most popular mid-tier model |
| Llama 3.1 405B | $3.50 | $3.50 | Largest open-source model |
| Llama 4 Maverick | $0.27 | $0.85 | MoE: 17B of 400B params active |
| DeepSeek-R1-0528 | $3.00 | $7.00 | Reasoning model, premium pricing |
| DeepSeek-V3.1 | $0.60 | $1.70 | Latest general-purpose |
| Qwen3.5-397B-A17B | $0.60 | $3.60 | Alibaba's flagship MoE |
| Mistral 7B v0.2 | $0.20 | $0.20 | Budget inference option |
| GPT-OSS 120B | $0.15 | $0.60 | OpenAI's open-source release |
| GPU | Memory (per GPU) | Pricing Model | Notes |
|---|---|---|---|
| NVIDIA GB200 NVL72 | 186 GB HBM3e | Contact Sales | Via Hypertec partnership[8] |
| NVIDIA B200 HGX | 180 GB | Contact Sales | Blackwell generation |
| NVIDIA H200 | 141 GB HBM3e | Contact Sales | Available in clusters |
| NVIDIA H100 | 80 GB HBM3 | Contact Sales | Existing fleet |
| Method | Cost | Notes |
|---|---|---|
| LoRA Fine-Tuning | $0.80 - $2.00 per 1M training tokens | Most cost-effective option |
| Full Fine-Tuning | GPU-hour based | For maximum customization |
Together AI's pricing on small models (Llama 3.1 8B at $0.18/M tokens, Mistral 7B at $0.20/M) is near break-even or below. This is a deliberate developer acquisition strategy: attract developers with ultra-low prices on commodity models, then monetize through GPU cluster rentals and enterprise contracts. The inference API is a loss leader for the GPU rental business. This means the platform cannot compete on price alone for commodity inference; instead, focus on larger models, enterprise SLAs, and compliance-ready environments where pricing power exists.
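To make the loss-leader dynamic concrete, a back-of-envelope bill for a hypothetical high-volume workload at the list prices above (the traffic numbers are invented for illustration):

```python
# Hypothetical workload: 2B input and 0.5B output tokens/month on Llama 3.3 70B.
PRICE_PER_M_TOKENS = {"input": 0.88, "output": 0.88}   # $/1M tokens (table above)

input_m, output_m = 2_000, 500   # millions of tokens per month
bill = (input_m * PRICE_PER_M_TOKENS["input"]
        + output_m * PRICE_PER_M_TOKENS["output"])
print(f"Monthly API bill: ${bill:,.0f}")   # $2,200
```

Even a workload in the billions of tokens per month yields only thousands of dollars of revenue at these prices, which is why this report treats commodity-model inference as a developer-acquisition loss leader rather than a margin business.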
| Segment | Details | Revenue Contribution |
|---|---|---|
| Individual Developers | 450K+ developers using API[5] | Long-tail, low ARPU |
| AI Startups | Fine-tuning, custom model hosting | Medium ARPU, high volume |
| Enterprise | Dedicated endpoints, VPC, SSO[19] | High ARPU, growing segment |
| GPU Cluster Buyers | Training/inference at scale (Hypertec infra)[8] | ~60-70% of total revenue[5] |
Together AI's inference performance comes from the Together Inference Engine, a proprietary serving stack that combines FlashAttention with custom optimizations.[18]
The engine is built on CUDA and layers multiple optimization techniques on top of one another:[18]

- FlashAttention-3 kernels for the core attention computation
- Flash-Decoding for long-context inference
- Medusa-style speculative decoding to amortize decode steps
- Quantization (used in the Turbo endpoints) to trade a small accuracy cost for throughput
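Of these techniques, speculative decoding is the least self-explanatory, so a toy sketch of the accept/verify loop follows (greedy decoding only, with hypothetical stand-in "models"; Medusa replaces the separate draft model with extra decoding heads on the target model, but the verification logic is the same idea):

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]   # greedy "model": context -> next token

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], k: int = 4,
                       max_new: int = 32) -> List[Token]:
    """Draft proposes k tokens per round; target verifies them.

    With greedy decoding, keeping the longest prefix the target agrees with
    (plus the target's own token at the first mismatch) reproduces the target
    model's output exactly. The win is latency: in a real engine the k
    verifications are one batched target pass instead of k serial ones.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        proposal, ctx = [], list(seq)
        for _ in range(k):                        # cheap draft proposals
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        accepted, fix = 0, None
        for i, tok in enumerate(proposal):        # target verification
            if target(seq + proposal[:i]) != tok:
                fix = target(seq + proposal[:i])  # keep target's token instead
                break
            accepted += 1
        seq += proposal[:accepted] + ([fix] if fix is not None else [])
    return seq[:len(prompt) + max_new]

# Toy check: a draft that matches the target most of the time still yields
# output identical to decoding with the target alone.
target = lambda ctx: (sum(ctx) * 31 + 7) % 100
draft = lambda ctx: target(ctx) if len(ctx) % 4 else (target(ctx) + 1) % 100
assert speculative_decode(target, draft, [1, 2, 3]) == \
       speculative_decode(target, target, [1, 2, 3])
```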
| Metric | Claim | Benchmark Context |
|---|---|---|
| vs. vLLM (open source) | 3-4x faster | Same hardware, same models[18] |
| vs. Serverless APIs | 2x faster | vs. Perplexity, Anyscale, Fireworks AI, MosaicML[18] |
| Llama 3.1 8B Throughput | High tokens/sec | Turbo endpoint optimized for speed |
| FlashAttention-3 on H100 | 740 TFLOPS FP16 | 75% GPU utilization[14] |
| FlashAttention-3 FP8 | ~1.2 PFLOPS | H100 with block quantization[14] |
| Tier | Optimization | Use Case | Pricing |
|---|---|---|---|
| Turbo | Maximum speed, quantized[18] | Real-time applications, chatbots | Lower per-token |
| Lite | Cost-optimized[18] | Batch processing, background tasks | Lowest per-token |
| Standard | Balanced | General-purpose inference | Standard per-token |
| Dedicated | Reserved GPU capacity[19] | Enterprise with SLA requirements | Hourly + token |
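Because every tier sits behind the same OpenAI-compatible API, switching tiers is just a model-id change. A hedged sketch using the openai Python client (the base URL and model id are illustrative and should be checked against Together AI's current documentation):

```python
from openai import OpenAI

# Together AI exposes an OpenAI-compatible API; only base_url and key change.
client = OpenAI(
    base_url="https://api.together.xyz/v1",   # assumed endpoint; verify in docs
    api_key="YOUR_TOGETHER_API_KEY",
)

resp = client.chat.completions.create(
    # The "-Turbo" suffix selects the speed-optimized tier (illustrative id).
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize FlashAttention in one line."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```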
Independent benchmarks (Artificial Analysis, LLMPerf) place Together AI in a second tier on raw speed, behind Cerebras and Groq but ahead of Fireworks AI and Perplexity.[22] Groq's LPU has posted up to 18x faster output throughput than GPU-based providers on latency-critical workloads, and Cerebras' wafer-scale chip likewise outperforms GPU-based solutions on raw speed. Together AI's advantage is the breadth of model support (200+ models) rather than raw speed on any single model.
| Metric | Current | Planned (2026-2028) |
|---|---|---|
| GPU Fleet | 36K+ NVIDIA GPUs (GB200, B200, H200, H100)[8] | 100K+ GPUs via Hypertec/5C[16] |
| DC Capacity | 800+ MW (leased, not owned)[16] | 2+ GW across NA, Europe, Asia[16] |
| Regions | North America (primary) | Europe (via 5C Group), Asia[16] |
| Models Served | 200+ open-source models[4] | Expanding catalog |
Together AI's open-source strategy is core to their business model: publish foundational research freely to build developer adoption, then monetize through the optimized commercial platform.[1]
| Project | Year | Impact |
|---|---|---|
| FlashAttention (1-4)[3] | 2022-2025 | Used by every major AI lab. Stanford Open Source Software Prize. Foundational to all LLM inference. |
| RedPajama[13] | 2023 | 1.2T-token open training dataset. 500+ models built on it. Used by Snowflake Arctic, Salesforce XGen, AI2 OLMo. |
| RedPajama-V2[13] | 2024 | 100T+ token web dataset with quality signals. NeurIPS 2024 Datasets Track publication. |
| Mamba[12] | 2023 | State-space model alternative to Transformers. Created by Tri Dao. Linear-time sequence modeling. |
| Together Inference Engine[18] | 2024 | Proprietary (not open source). Combines FA, Flash-Decoding, Medusa. Commercial moat. |
| CodeSandbox SDK[15] | 2024 | Code execution API acquired and integrated for code interpreter capability. |
| Period | Employees | YoY Growth |
|---|---|---|
| 2023 | ~107[7] | -- |
| End of 2024 | ~287[7] | 165% |
| Jan 2026 | ~320[7] | ~11% |
Together AI's headcount growth decelerated sharply from 165% (2023-2024) to ~11% (2024-2026). With $300M ARR on ~320 employees, the company is prioritizing revenue efficiency over headcount growth. This suggests either capital discipline or difficulty hiring in a competitive AI talent market. The platform should track whether this signals a mature operating model or a constraint on growth.
Together AI competes in the inference API market alongside several specialized providers and hyperscalers.[22]
| Metric | Together AI | Fireworks AI | Groq | Cerebras |
|---|---|---|---|---|
| Revenue (ARR) | ~$300M[5] | ~$100M (est.) | ~$50M (est.) | Pre-revenue (inference) |
| Valuation | $3.3B[6] | $3.2B (est.) | $14B+ | $8.3B (public filing) |
| Employees | ~320[7] | ~150 | ~400 | ~450 |
| Technical Moat | FlashAttention[3] | FireAttention[22] | LPU (custom chip)[22] | Wafer-scale chip |
| Models Supported | 200+[4] | 100+ | 30+ | Limited |
| Own Hardware | No (leased via Hypertec) | No | Yes (LPU) | Yes (Wafer) |
| Own Data Centers | No | No | No | No |
| GPU Cluster Rental | Yes | No | No | No |
| Fine-Tuning | Yes[17] | Yes | No | No |
| Compliance (SOC2/HIPAA) | Limited | SOC2 + HIPAA[22] | Limited | Limited |
| Raw Speed (TTFT) | Middle tier[22] | Middle tier | Fastest | Near-fastest |
| Dimension | Together AI | The platform |
|---|---|---|
| Business Model | Software-layer optimization (leased infra)[8] | Vertically integrated (owned infra) |
| Energy | No owned energy assets | Owned energy assets (structural cost advantage) |
| Technical Moat | FlashAttention, Inference Engine[18] | Multi-chip architecture, energy cost |
| Model Breadth | 200+ models[4] | In development (targeting 3+ LLMs) |
| Developer Base | 450K+ developers[5] | Building |
| Compliance | Limited (no SOC2, no HIPAA) | Sovereign-ready positioning |
| Pricing Strategy | Near-breakeven on commodity models[9] | Margin-sustainable via energy advantage |
| Infrastructure | Leased GPU capacity (Hypertec)[8] | Owned data centers, modular containers |
| # | Decision | Impact |
|---|---|---|
| 1 | Research-first positioning[3] | FlashAttention gives them credibility with every AI developer on the planet. Open-source builds trust. |
| 2 | Developer-first GTM[4] | 450K developers on platform. OpenAI-compatible API. Simple pricing. Frictionless onboarding. |
| 3 | Broad model catalog[4] | 200+ models = one-stop shop. Developers try models and stay on platform. |
| 4 | Strategic investor mix[6] | NVIDIA (GPU access), Prosperity7 (sovereign AI), Salesforce (enterprise distribution). |
| 5 | Acquisitions for capabilities[15] | CodeSandbox (code interpreter) and Refuel (data labeling) expand platform without building from scratch. |
| # | Vulnerability | Opportunity |
|---|---|---|
| 1 | No owned infrastructure (all leased)[8] | The platform's owned energy and DCs create structural cost advantage impossible to replicate via leasing |
| 2 | Near-breakeven pricing[9] | Together AI cannot sustain low pricing without margin compression. The platform can price competitively AND maintain strong gross margins |
| 3 | NVIDIA-only chip dependency | A multi-chip strategy hedges against NVIDIA supply constraints and enables workload-optimal routing |
| 4 | No sovereign/compliance positioning | The platform can win regulated enterprise, government, and healthcare deals that Together AI cannot serve |
| 5 | GPU rental revenue concentration (~60-70%)[5] | GPU rental is commoditizing. The platform should focus on managed inference (higher margin, stickier) from day one |
| 6 | Slowing headcount growth (~11% YoY)[7] | May signal execution constraints. The platform can recruit aggressively from the talent pool |
Together AI's trajectory shows that commodity inference pricing trends toward breakeven. The platform's energy ownership is the only path to sustainably strong gross margins. Make this the centerpiece of every sales conversation and investor pitch.
Together AI's model serving expertise + The platform's cost-optimized infrastructure could create a mutually beneficial integration. Explore Together AI as a model serving layer on the platform hardware. They need cheap GPUs; The platform has them.
FlashAttention-3 is open source (BSD license).[21] The platform's inference engine must include it on day one. Build additional optimizations for alternative silicon accelerators on top.
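A minimal sketch of calling the kernels directly via the flash-attn package (this is the package's general-purpose entry point; the Hopper-specific FlashAttention-3 build ships separately, so treat the call below as the integration pattern rather than the exact FA-3 API):

```python
import torch
from flash_attn import flash_attn_func  # pip install flash-attn (CUDA only)

# flash-attn expects (batch, seq_len, n_heads, head_dim) in fp16/bf16 on GPU.
q = torch.randn(2, 4096, 32, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")  # GQA: 8 KV heads
v = torch.randn(2, 4096, 8, 128, dtype=torch.bfloat16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # fused kernel, no N x N score matrix
print(out.shape)                             # torch.Size([2, 4096, 32, 128])
```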
Together AI has no compliance story. The platform should ship SOC 2, HIPAA, and sovereign-ready inference environments before Together AI builds them. This is a segment they are structurally unable to serve with leased infrastructure.
Together AI is a MEDIUM threat to the platform. They are not a direct competitor in the sovereign/enterprise inference segment The platform is targeting, but they set the pricing floor for open-source model inference that the platform must be aware of. Their FlashAttention technology is freely available and should be adopted, not competed against. The greatest risk is that Together AI's developer ecosystem and 200+ model catalog become the default API for inference, making it harder for the platform to win developer mindshare. The greatest opportunity is a potential partnership where Together AI's software optimization runs on The platform's cost-advantaged hardware.
This report was compiled from 24 primary sources including Together AI's corporate website, product pages, engineering blog posts, press releases, investor announcements, third-party research (Contrary Research, Sacra, Tracxn, Growjo), academic publications (ArXiv), independent benchmarks (Helicone, Artificial Analysis), and industry publications (Data Center Dynamics, SiliconANGLE, PRNewswire). Revenue estimates rely on Sacra Research and Tracxn data. Organizational structure is inferred from the official about page and The Org. All performance claims are self-reported by Together AI unless otherwise noted. Report accessed and compiled February 16, 2026.