Competitive Intelligence Report

Together AI Strategy Analysis

How a research-first AI cloud leverages FlashAttention to build a full-stack inference platform — and what it means for the platform

February 16, 2026
Analyst: MinjAI Agents
For: AI Infrastructure Strategy & Product Leaders
24 Footnoted Sources

Executive Summary

Together AI is a research-driven AI cloud company[1] that combines open-source AI research with a commercial inference and training platform. Founded in June 2022 by a team of Stanford professors and serial entrepreneurs,[2] the company's core technical moat is FlashAttention, an IO-aware attention algorithm created by Chief Scientist Tri Dao that is now used by virtually every major AI lab in the world, including OpenAI, Anthropic, Meta, Google, and DeepSeek.[3]

Together AI operates an "AI Native Cloud" that serves 200+ open-source models via serverless API endpoints, GPU cluster rentals, and enterprise fine-tuning services.[4] The company has grown rapidly from ~$44M revenue in 2024 to an annualized run rate of ~$300M by September 2025,[5] and reached a $3.3B valuation after its $305M Series B in February 2025.[6]

  • Valuation (Feb 2025): $3.3B[6]
  • Annualized Revenue: ~$300M[5]
  • Total Funding Raised: $534M[6]
  • Developers on Platform: 450K+[5]
  • Models Available: 200+[4]
  • Employees (Jan 2026): ~320[7]
  • NVIDIA GB200 GPUs (Hypertec): 36K[8]
  • Threat Level to the platform: MEDIUM
Strategic Implications

Together AI validates the market for inference-as-a-service built on open-source models, which is exactly the segment the platform targets. Even with FlashAttention optimization, Together AI prices commodity inference at roughly breakeven, which suggests that software-level kernel optimizations alone do not create sustainable margins. The platform's energy cost advantage is therefore the key to sustainable margins that Together AI cannot replicate. That said, Together AI could also be a potential integration partner for model serving, given their deep FlashAttention expertise and model optimization stack.

Five Action Items

  1. Exploit the energy cost gap. Together AI prices near breakeven.[9] The platform's owned energy infrastructure creates structural margin advantage they cannot match.
  2. Evaluate FlashAttention integration. FlashAttention-3/4 is used by every major AI lab.[3] The platform's inference engine should integrate these optimizations on day one.
  3. Target sovereign/compliance use cases. Together AI is developer-first with no sovereign positioning. The platform can win regulated enterprise deals they cannot serve.
  4. Consider partnership before competition. Together AI's model serving expertise + the platform's cost-optimized infrastructure = potential integration opportunity.
  5. Match their developer experience. Together AI's 450K developer base[5] was won with excellent DX: OpenAI-compatible API, 200+ models, simple pricing. The platform must match this.
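Action item 5 hinges on matching Together AI's OpenAI-compatible developer experience. As a minimal sketch of what that contract looks like: the payload below follows the standard chat-completions shape; the endpoint URL and model id are illustrative, and no SDK is required, though the official OpenAI client can target such an endpoint by overriding its base_url.

```python
# Minimal sketch of the OpenAI-compatible /chat/completions contract that
# providers like Together AI expose. Endpoint URL and model id are
# illustrative. Any HTTP client can POST this payload to BASE_URL, and the
# official OpenAI SDK works against it by overriding base_url.
import json

BASE_URL = "https://api.together.xyz/v1"  # illustrative compatible endpoint

def chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = chat_request(
    "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model id
    "One sentence on FlashAttention.",
)
print(json.dumps(payload, indent=2))
```

Because the contract is identical across compatible providers, a platform that ships this same API surface inherits every existing OpenAI-client integration for free.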

Company Overview and Leadership

Founding Story

Together AI was founded in June 2022 by five Stanford-affiliated researchers and entrepreneurs who believed open-source foundation models represented a generational shift in technology.[2] The founding team combines deep academic AI research (Chris Re, Percy Liang, Ce Zhang, Tri Dao) with proven entrepreneurial track records (Vipul Ved Prakash).

Leadership Team

| Name | Title | Background |
| --- | --- | --- |
| Vipul Ved Prakash | Co-Founder & CEO[10] | Founded Topsy (acquired by Apple for $200M+ in 2013), co-founded Cloudmark (acquired by Proofpoint for $110M in 2017). Director of Engineering at Apple post-acquisition.[10] |
| Ce Zhang | Co-Founder & CTO[11] | Post-doc at Stanford advised by Chris Re. Professor at ETH Zurich. ML systems researcher.[2] |
| Tri Dao | Co-Founder & Chief Scientist[3] | PhD Stanford (co-advised by Chris Re & Stefano Ermon). BS Mathematics, Stanford. Assistant Professor at Princeton. Creator of FlashAttention and Mamba. AI2050 Fellow (Schmidt Sciences).[12] |
| Chris Re | Co-Founder[2] | Stanford Professor of CS. MacArthur Fellow. Founded data/ML startup acquired by Apple in 2017. Advisor to multiple AI companies.[2] |
| Percy Liang | Co-Founder[2] | Stanford Professor of CS. Director of Stanford CRFM (Center for Research on Foundation Models). Led HELM benchmark.[2] |
| Kai Mak | Chief Revenue Officer[11] | Enterprise sales leadership |
| Charles Srisuwananukorn | Founding VP Engineering[11] | Engineering leadership |
| James Barker | VP EMEA[11] | European expansion lead |
Leadership Insight

Together AI's leadership is uniquely research-heavy compared to competitors like Fireworks AI or Groq. Three of five co-founders are Stanford professors. Tri Dao's FlashAttention is arguably the single most impactful open-source contribution to LLM inference performance in the last three years. This research DNA drives their core technical advantage but may also explain a weaker enterprise GTM motion compared to more commercially-oriented competitors.

Timeline: Key Milestones

May 2022
Tri Dao publishes FlashAttention v1, achieving 2-4x speedup in attention computation with IO-aware memory management.[3]
Jun 2022
Together AI incorporated by Vipul Ved Prakash, Ce Zhang, Chris Re, Percy Liang, and Tri Dao.[2]
Apr 2023
Launched RedPajama, an open-source 1.2T-token training dataset replicating Meta's LLaMA recipe. Over 500 models later built on it.[13]
Jul 2023
FlashAttention-2 released. Up to 4x speedup over FlashAttention-1, 72% model FLOPs utilization on A100s.[3]
Nov 2023
Series A funding round. Investors include Prosperity7 Ventures (Saudi Aramco) and NVIDIA.[6]
Jul 2024
FlashAttention-3 released. 1.5-2x faster on H100 FP16 (740 TFLOPS, 75% utilization). FP8 reaches ~1.2 PFLOPS.[14]
Nov 2024
Partnership with Hypertec Cloud to co-build 36,000 NVIDIA GB200 NVL72 GPU cluster.[8]
Dec 2024
Acquired CodeSandbox for built-in code interpretation on inference platform.[15]
Feb 2025
$305M Series B at $3.3B valuation. Led by General Catalyst & Prosperity7. Launched GPU Clusters powered by NVIDIA Blackwell.[6]
May 2025
Acquired Refuel (data labeling/processing startup) to strengthen enterprise data pipeline.[15]
Jun 2025
Partnership with Hypertec & 5C Group to deploy up to 100,000 GPUs across Europe through 2028.[16]
Sep 2025
Hit $300M ARR milestone. Fine-tuning platform upgraded with larger models and longer contexts.[5]

Funding History and Financial Profile

Funding Rounds

| Round | Date | Amount | Valuation | Lead Investors |
| --- | --- | --- | --- | --- |
| Seed[6] | 2022 | Undisclosed | -- | Lux Capital, others |
| Series A[6] | Nov 2023 | ~$100M (est.) | Undisclosed | Prosperity7, NVIDIA |
| Series A Extension[6] | 2024 | ~$129M (est.) | ~$1.25B | Emergence Capital, Kleiner Perkins |
| Series B[6] | Feb 2025 | $305M | $3.3B | General Catalyst, Prosperity7 |
| Total | | $534M | | |

Notable Investors

| Investor | Type | Strategic Significance |
| --- | --- | --- |
| NVIDIA | Strategic[6] | GPU supply priority, technical partnership, Blackwell early access |
| Prosperity7 (Saudi Aramco) | Strategic[6] | Middle East sovereign AI ambitions, led Series A and co-led Series B |
| Salesforce Ventures | Strategic[6] | Enterprise distribution, CRM integration potential |
| General Catalyst | Financial[6] | Led Series B, strong AI portfolio (Anduril, Stripe) |
| Kleiner Perkins | Financial[6] | Tier-1 VC validation |
| SK Telecom | Strategic[6] | APAC distribution, telecom AI workloads |
| DAMAC Capital | Strategic[6] | Middle East infrastructure and data center capacity |

Revenue Growth Trajectory

| Period | Revenue | Notes |
| --- | --- | --- |
| 2024 (Full Year) | ~$44M[7] | Early commercialization stage |
| End of 2024 | $130M ARR[5] | Rapid acceleration in H2 2024 |
| Sep 2025 | $300M ARR[5] | 130% growth from year-end 2024 |
Revenue Composition Warning

Together AI generates revenue through two primary lines: per-token API usage (~30-40% of revenue) and GPU server rentals (~60-70% of revenue).[5] The GPU rental business is lower-margin and more commodity-like. The API/inference business is higher-margin but faces intense pricing pressure from competitors. This revenue mix matters for understanding their true competitive position in inference specifically.

Strategic Implication

Together AI's $300M ARR on ~320 employees implies ~$937K revenue per employee, a strong efficiency metric. However, their heavy reliance on GPU rental revenue (60-70%) means their inference-specific revenue is closer to ~$90-120M ARR. The platform should benchmark against the inference slice, not total revenue, when sizing the opportunity.
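The inference-slice estimate follows directly from the stated revenue mix; a quick arithmetic check using the figures from sources [5] and [7]:

```python
# Sanity check of the estimates above: ~$300M ARR, ~30-40% of it from the
# per-token API (source [5]), ~320 employees (source [7]).
arr = 300e6
api_share = (0.30, 0.40)

inference_slice = tuple(arr * s for s in api_share)
revenue_per_employee = arr / 320

print(f"inference ARR: ${inference_slice[0]/1e6:.0f}M-${inference_slice[1]/1e6:.0f}M")
print(f"revenue per employee: ${revenue_per_employee/1e3:.1f}K")
```

This confirms the ~$90-120M inference-specific ARR band and the ~$937K revenue-per-employee figure cited above.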


Product Architecture and Platform Stack

Together AI positions as an "AI Native Cloud" with four core product pillars: Inference, Fine-Tuning, Training, and GPU Clusters.[4] Below is the full product architecture.

Layer 4: Managed AI Services[4]
  • Serverless Inference (200+ models, OpenAI-compatible API)[4]
  • Together Fine-Tuning (LoRA, QLoRA, full fine-tuning)[17]
  • Together Inference Engine (FlashAttention + proprietary optimizations)[18]
  • Code Interpreter (via CodeSandbox acquisition)[15]
  • Enterprise Platform (SSO, VPC, dedicated endpoints)[19]

Layer 3: Platform & Developer Tools[4]
  • Together API (Chat, Completions, Embeddings, Images)[4]
  • Python & TypeScript SDKs[4]
  • AWS Marketplace Integration[6]
  • Weights & Biases Integration[17]
  • Hugging Face Model Hub Integration[17]

Layer 2: Inference & Training Infrastructure
  • FlashAttention-3 (IO-aware, 75% GPU utilization on H100)[14]
  • FlashAttention-4 (Blackwell-first, >1 PFLOPS target)[20]
  • Flash-Decoding[18]
  • Speculative Decoding (Medusa)[18]
  • Custom CUDA Kernels[18]
  • FP8 / Low-Precision Inference[14]

Layer 1: GPU Infrastructure[8]
  • NVIDIA GB200 NVL72 (36K GPUs via Hypertec)[8]
  • NVIDIA HGX B200 (Blackwell clusters)[16]
  • NVIDIA H200 / H100 (existing fleet)[8]
  • 800 MW+ capacity (NA, Europe, Asia)[16]
  • Hypertec Cloud Partnership (infrastructure)[8]
  • European Expansion (100K GPUs by 2029 via 5C Group)[16]
Key Architecture Insight

Unlike Crusoe or CoreWeave, Together AI does not own data centers or energy assets. They partner with infrastructure providers (Hypertec, 5C Group) for GPU capacity and focus their engineering on software-level optimization (FlashAttention, inference engine, model serving). This is a fundamentally different strategy from a vertically integrated approach. Together AI's margin is in the software layer; the platform's is in the energy layer.


FlashAttention: The Technical Moat

FlashAttention is the single most important open-source contribution to LLM inference optimization in the last three years. Created by Tri Dao, it is used by OpenAI, Anthropic, Meta, Google, NVIDIA, Mistral, DeepSeek, Tencent, and Alibaba.[3] Understanding FlashAttention is critical for the platform's inference engine strategy.

FlashAttention Evolution

| Version | Date | Target Hardware | Key Innovation | Performance |
| --- | --- | --- | --- | --- |
| FlashAttention-1[3] | May 2022 | NVIDIA A100 | IO-aware tiling: reduces HBM reads/writes from quadratic to linear | 2-4x speedup vs. standard attention |
| FlashAttention-2[3] | Jul 2023 | NVIDIA A100 | Better work partitioning, reduced non-matmul FLOPs, parallelism over sequence length | Up to 4x over FA-1; 72% FLOPs utilization on A100 |
| FlashAttention-3[14] | Jul 2024 | NVIDIA H100 (Hopper) | Warp specialization, WGMMA tensor cores, TMA async data movement, FP8 support | 1.5-2x over FA-2; 740 TFLOPS FP16 (75% util); ~1.2 PFLOPS FP8 |
| FlashAttention-4[20] | 2025 (research) | NVIDIA Blackwell (SM100) | 5-stage warp-specialized pipeline, software exp2() on CUDA cores, adaptive rescaling | First attention kernel to exceed 1 PFLOPS on single GPU (target) |

How FlashAttention-3 Works (Technical Architecture)

FlashAttention-3 exploits three key features of NVIDIA Hopper architecture:[14]

  1. Asynchronous Computation via Warp Specialization: Overlaps computation and data movement by assigning different warps to producer (data loading) and consumer (computation) roles. Exploits WGMMA (Warpgroup Matrix Multiply-Accumulate) for higher tensor core throughput.
  2. TMA (Tensor Memory Accelerator): Hardware unit that accelerates global-to-shared memory transfers, handling all index calculation and out-of-bound predication automatically.
  3. FP8 Low-Precision: Doubles tensor core throughput (989 TFLOPS FP16 vs. 1978 TFLOPS FP8). Uses block quantization and incoherent processing to maintain accuracy.
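The IO-aware idea common to all FlashAttention versions, computing softmax incrementally over tiles so the full attention matrix never touches slow memory, can be sketched in plain NumPy. This is an illustration of the math only, not the CUDA implementation:

```python
# Illustrative NumPy sketch of the tiling + online-softmax idea underlying
# FlashAttention: K/V are streamed in tiles and partial softmax sums are
# rescaled on the fly, so the full (seq x seq) score matrix is never
# materialized. The real kernels run this per-tile in on-chip SRAM.
import numpy as np

def tiled_attention(q, k, v, tile=64):
    seq, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    m = np.full(seq, -np.inf)   # running row-max (softmax stability)
    l = np.zeros(seq)           # running softmax normalizer
    for s0 in range(0, seq, tile):
        kt, vt = k[s0:s0 + tile], v[s0:s0 + tile]
        s = (q @ kt.T) * scale                  # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        corr = np.exp(m - m_new)                # rescale earlier partials
        p = np.exp(s - m_new[:, None])
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ vt
        m = m_new
    return out / l[:, None]

# Check against naive attention that materializes the full score matrix.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(32)
p_full = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (p_full / p_full.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), naive))  # True
```

The rescaling step (`corr`) is what lets the kernel merge per-tile softmax results exactly; FlashAttention-3's contribution is overlapping this arithmetic with asynchronous tile loads on Hopper hardware.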
Strategic Implication: FlashAttention is Open Source

FlashAttention is freely available under BSD license.[21] It is already integrated into PyTorch, Hugging Face Transformers, and vLLM. The platform can and should use FlashAttention in its inference engine without any licensing cost. Together AI's competitive advantage is not FlashAttention itself (it is open source), but rather their proprietary optimizations built on top of it (Together Inference Engine, custom kernels, speculative decoding integration). The platform should integrate FlashAttention-3 immediately and build proprietary optimizations for its multi-chip architecture.
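On integration cost: stock PyTorch already routes attention through fused FlashAttention kernels on supported GPUs, so adoption can be as small as one call. A minimal sketch with illustrative shapes:

```python
# Minimal sketch: FlashAttention via stock PyTorch. On supported CUDA GPUs,
# scaled_dot_product_attention dispatches to fused FlashAttention kernels;
# on CPU it falls back to a reference path, so the same code runs anywhere.
# Shapes are illustrative: (batch, heads, seq_len, head_dim).
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 256, 64)
k = torch.randn(1, 8, 256, 64)
v = torch.randn(1, 8, 256, 64)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(tuple(out.shape))  # (1, 8, 256, 64)
```

Proprietary value would then come from what is layered on top of this free baseline, exactly as Together AI does with its inference engine.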

Opportunity: Multi-Chip FlashAttention

FlashAttention is currently optimized for NVIDIA GPUs only (CUDA kernels). A multi-chip strategy (H100/H200 + alternative silicon) creates an opportunity to build hardware-aware attention kernels optimized for non-NVIDIA accelerators. This would be a genuine technical differentiator that Together AI's NVIDIA-only approach cannot match.


Pricing Strategy and Customer Analysis

Serverless Inference Pricing (Per 1M Tokens)[9]

| Model | Input | Output | Notes |
| --- | --- | --- | --- |
| Llama 3.1 8B Instruct | $0.18 | $0.18 | Entry-level, high-volume workloads |
| Llama 3.3 70B Instruct | $0.88 | $0.88 | Most popular mid-tier model |
| Llama 3.1 405B | $3.50 | $3.50 | Largest open-source model |
| Llama 4 Maverick | $0.27 | $0.85 | MoE: 17B of 400B params active |
| DeepSeek-R1-0528 | $3.00 | $7.00 | Reasoning model, premium pricing |
| DeepSeek-V3.1 | $0.60 | $1.70 | Latest general-purpose |
| Qwen3.5-397B-A17B | $0.60 | $3.60 | Alibaba's flagship MoE |
| Mistral 7B v0.2 | $0.20 | $0.20 | Budget inference option |
| GPT-OSS 120B | $0.15 | $0.60 | OpenAI's open-source release |

GPU Cluster Pricing

| GPU | Memory | Pricing Model | Notes |
| --- | --- | --- | --- |
| NVIDIA GB200 NVL72 | 186 GB HBM3e | Contact Sales | Via Hypertec partnership[8] |
| NVIDIA B200 HGX | 180 GB | Contact Sales | Blackwell generation |
| NVIDIA H200 | 141 GB HBM3e | Contact Sales | Available in clusters |
| NVIDIA H100 | 80 GB HBM3 | Contact Sales | Existing fleet |

Fine-Tuning Pricing[17]

| Method | Cost | Notes |
| --- | --- | --- |
| LoRA Fine-Tuning | $0.80 - $2.00 per 1M training tokens | Most cost-effective option |
| Full Fine-Tuning | GPU-hour based | For maximum customization |
Pricing Strategy Analysis

Together AI's pricing on small models (Llama 3.1 8B at $0.18/M tokens, Mistral 7B at $0.20/M) is near break-even or below. This is a deliberate developer acquisition strategy: attract developers with ultra-low prices on commodity models, then monetize through GPU cluster rentals and enterprise contracts. The inference API is a loss leader for the GPU rental business. This means the platform cannot compete on price alone for commodity inference; instead, focus on larger models, enterprise SLAs, and compliance-ready environments where pricing power exists.
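A hypothetical unit-economics sketch shows why $0.18/M tokens sits near breakeven. The GPU cost and throughput figures below are illustrative round numbers, not measured values:

```python
# Back-of-envelope serving economics for an 8B model at $0.18 per 1M tokens
# (price from source [9]). GPU cost and throughput are HYPOTHETICAL round
# numbers chosen for illustration only.
price_per_m = 0.18        # $ per 1M tokens (blended input/output)
gpu_cost_hr = 2.50        # hypothetical all-in $/GPU-hour (H100-class)
tok_per_sec = 4500        # hypothetical sustained batched throughput

revenue_hr = tok_per_sec * 3600 / 1e6 * price_per_m
margin = 1 - gpu_cost_hr / revenue_hr

print(f"revenue per GPU-hour: ${revenue_hr:.2f}")
print(f"gross margin under these assumptions: {margin:.0%}")
```

Under these assumptions a provider paying market GPU rates keeps only a thin gross margin; lowering the hardware-layer cost (for example via owned energy) is what moves that number, not kernel work alone.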

Customer Segments

| Segment | Details | Revenue Contribution |
| --- | --- | --- |
| Individual Developers | 450K+ developers using API[5] | Long-tail, low ARPU |
| AI Startups | Fine-tuning, custom model hosting | Medium ARPU, high volume |
| Enterprise | Dedicated endpoints, VPC, SSO[19] | High ARPU, growing segment |
| GPU Cluster Buyers | Training/inference at scale (Hypertec infra)[8] | ~60-70% of total revenue[5] |

Together Inference Engine: Performance Benchmarks

Together AI's inference performance comes from the Together Inference Engine, a proprietary serving stack that combines FlashAttention with custom optimizations.[18]

Together Inference Engine Architecture

The engine is built on CUDA and combines multiple optimization techniques:[18]

  1. FlashAttention-2/3: IO-aware attention computation
  2. Flash-Decoding: Optimized autoregressive decoding
  3. Medusa (Speculative Decoding): Multiple candidate tokens generated per forward pass
  4. Custom CUDA Kernels: Proprietary optimizations beyond open-source FlashAttention
  5. Dynamic Batching: Configurable tradeoff between latency and throughput
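Of these, speculative decoding is the least intuitive. A toy sketch of the draft-then-verify loop follows; both "models" are stand-in functions, and real systems verify all drafted tokens in a single batched forward pass rather than one at a time:

```python
# Toy sketch of speculative decoding's draft/verify loop. Medusa-style heads
# draft several cheap candidate tokens; the target model then verifies them,
# keeping the longest agreeing prefix. Both "models" here are stand-in toy
# functions over a 7-token vocabulary, not real networks.
def draft_tokens(prev, n=4):
    toks = [(prev + i) % 7 for i in range(1, n + 1)]
    toks[2] = 0  # deliberate drafter mistake to show a rejection
    return toks

def target_next(prev):
    return (prev + 1) % 7  # what the target model would actually emit

def speculative_step(prev):
    """Emit the longest drafted run the target agrees with, plus one fix."""
    emitted = []
    for tok in draft_tokens(prev):
        want = target_next(prev)
        if tok == want:
            emitted.append(tok)   # accepted draft token
            prev = tok
        else:
            emitted.append(want)  # rejection: take the target token, stop
            break
    return emitted

print(speculative_step(3))  # [4, 5, 6] -> three tokens from one "step"
```

The payoff is that one verification pass can emit several tokens, which is where the engine's latency wins on autoregressive decoding come from.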

Performance Claims[18]

| Metric | Claim | Benchmark Context |
| --- | --- | --- |
| vs. vLLM (open source) | 3-4x faster | Same hardware, same models[18] |
| vs. Serverless APIs | 2x faster | vs. Perplexity, Anyscale, Fireworks AI, Mosaic ML[18] |
| Llama 3.1 8B Throughput | High tokens/sec | Turbo endpoint optimized for speed |
| FlashAttention-3 on H100 | 740 TFLOPS FP16 | 75% GPU utilization[14] |
| FlashAttention-3 FP8 | ~1.2 PFLOPS | H100 with block quantization[14] |

Model Serving Tiers

| Tier | Optimization | Use Case | Pricing |
| --- | --- | --- | --- |
| Turbo | Maximum speed, quantized[18] | Real-time applications, chatbots | Lower per-token |
| Lite | Cost-optimized[18] | Batch processing, background tasks | Lowest per-token |
| Standard | Balanced | General-purpose inference | Standard per-token |
| Dedicated | Reserved GPU capacity[19] | Enterprise with SLA requirements | Hourly + token |
Third-Party Benchmark Context

Independent benchmarks (Artificial Analysis, LLMPerf) show Together AI in the second tier of speed behind Cerebras and Groq but ahead of Fireworks AI and Perplexity.[22] Groq's LPU achieves up to 18x faster processing for latency-critical applications, while Cerebras' wafer-scale chip also outperforms GPU-based solutions on raw speed. Together AI's advantage is in the breadth of model support (200+ models) rather than raw speed on any single model.

Infrastructure Scale

| Metric | Current | Planned (2026-2028) |
| --- | --- | --- |
| GPU Fleet | 36K+ NVIDIA GPUs (GB200, B200, H200, H100)[8] | 100K+ GPUs via Hypertec/5C[16] |
| DC Capacity | 800+ MW (leased, not owned)[16] | 2+ GW across NA, Europe, Asia[16] |
| Regions | North America (primary) | Europe (via 5C Group), Asia[16] |
| Models Served | 200+ open-source models[4] | Expanding catalog |

Organization and Open-Source Strategy

Organizational Structure

  • Vipul Ved Prakash, CEO & Co-Founder[10]
  • Ce Zhang, CTO & Co-Founder[11]
  • Tri Dao, Chief Scientist & Co-Founder[12]
  • Kai Mak, Chief Revenue Officer[11]
  • Charles Srisuwananukorn, Founding VP Engineering[11]
  • James Barker, VP EMEA[11]

Academic Advisory (Co-Founders)
  • Chris Re, Stanford Professor[2]
  • Percy Liang, Stanford Professor, CRFM Director[2]

Open-Source Research Contributions

Together AI's open-source strategy is core to their business model: publish foundational research freely to build developer adoption, then monetize through the optimized commercial platform.[1]

| Project | Year | Impact |
| --- | --- | --- |
| FlashAttention (1-4)[3] | 2022-2025 | Used by every major AI lab. Stanford Open Source Software Prize. Foundational to all LLM inference. |
| RedPajama[13] | 2023 | 1.2T-token open training dataset. 500+ models built on it. Used by Snowflake Arctic, Salesforce XGen, AI2 OLMo. |
| RedPajama-V2[13] | 2024 | 100T+ token web dataset with quality signals. NeurIPS 2024 Datasets Track publication. |
| Mamba[12] | 2023 | State-space model alternative to Transformers. Created by Tri Dao. Linear-time sequence modeling. |
| Together Inference Engine[18] | 2024 | Proprietary (not open source). Combines FA, Flash-Decoding, Medusa. Commercial moat. |
| CodeSandbox SDK[15] | 2024 | Code execution API acquired and integrated for code interpreter capability. |

Headcount and Growth

| Period | Employees | YoY Growth |
| --- | --- | --- |
| 2023 | ~107[7] | -- |
| End of 2024 | ~287[7] | 165% |
| Jan 2026 | ~320[7] | ~11% |
Headcount Growth Slowing

Together AI's headcount growth decelerated sharply from 165% (2023-2024) to ~11% (2024-2026). With $300M ARR on ~320 employees, the company is prioritizing revenue efficiency over headcount growth. This suggests either capital discipline or difficulty hiring in a competitive AI talent market. The platform should track whether this signals a mature operating model or a constraint on growth.


Competitive Positioning: Inference API Landscape

Together AI competes in the inference API market alongside several specialized providers and hyperscalers.[22]

Together AI vs. Inference API Peers

| Metric | Together AI | Fireworks AI | Groq | Cerebras |
| --- | --- | --- | --- | --- |
| Revenue (ARR) | ~$300M[5] | ~$100M (est.) | ~$50M (est.) | Pre-revenue (inference) |
| Valuation | $3.3B[6] | $3.2B (est.) | $14B+ | $8.3B (public filing) |
| Employees | ~320[7] | ~150 | ~400 | ~450 |
| Technical Moat | FlashAttention[3] | FireAttention[22] | LPU (custom chip)[22] | Wafer-scale chip |
| Models Supported | 200+[4] | 100+ | 30+ | Limited |
| Own Hardware | No (leased via Hypertec) | No | Yes (LPU) | Yes (Wafer) |
| Own Data Centers | No | No | No | No |
| GPU Cluster Rental | Yes | No | No | No |
| Fine-Tuning | Yes[17] | Yes | No | No |
| Compliance (SOC2/HIPAA) | Limited | SOC2 + HIPAA[22] | Limited | Limited |
| Raw Speed (TTFT) | Middle tier[22] | Middle tier | Fastest | Near-fastest |

Together AI vs. Inference Platform: Head-to-Head

Together AI

  • Business Model: Software-layer optimization (leased infra)[8]
  • Energy: No owned energy assets
  • Technical Moat: FlashAttention, Inference Engine[18]
  • Model Breadth: 200+ models[4]
  • Developer Base: 450K+ developers[5]
  • Compliance: Limited (no SOC2, no HIPAA)
  • Pricing Strategy: Near-breakeven on commodity models[9]
  • Infrastructure: Leased GPU capacity (Hypertec)[8]

The inference platform

  • Business Model: Vertically integrated (owned infra)
  • Energy: Owned energy assets (structural cost advantage)
  • Technical Moat: Multi-chip architecture, energy cost
  • Model Breadth: In development (targeting 3+ LLMs)
  • Developer Base: Building
  • Compliance: Sovereign-ready positioning
  • Pricing Strategy: Margin-sustainable via energy advantage
  • Infrastructure: Owned data centers, modular containers
The platform's Structural Advantages over Together AI
  • Energy cost ownership: Together AI leases all infrastructure. The platform owns energy assets, creating 30-50% lower cost of compute at the hardware layer. This is the single most important differentiator.
  • Multi-chip flexibility: Together AI is NVIDIA-only. The platform's architecture with alternative silicon enables workload-optimal routing and hedges against single-vendor dependency.
  • Sovereign/compliance positioning: Together AI has no sovereign AI story. The platform's physically isolated, compliance-ready infrastructure serves a segment Together AI cannot address.
  • Sustainable margins: Together AI prices at breakeven on commodity inference. The platform's energy advantage enables strong gross margins at competitive price points.

Strategic Implications

What Together AI Got Right (Lessons)

| # | Decision | Impact |
| --- | --- | --- |
| 1 | Research-first positioning[3] | FlashAttention gives them credibility with every AI developer on the planet. Open-source builds trust. |
| 2 | Developer-first GTM[4] | 450K developers on platform. OpenAI-compatible API. Simple pricing. Frictionless onboarding. |
| 3 | Broad model catalog[4] | 200+ models = one-stop shop. Developers try models and stay on platform. |
| 4 | Strategic investor mix[6] | NVIDIA (GPU access), Prosperity7 (sovereign AI), Salesforce (enterprise distribution). |
| 5 | Acquisitions for capabilities[15] | CodeSandbox (code interpreter) and Refuel (data labeling) expand platform without building from scratch. |

Together AI's Vulnerabilities (Opportunities for the platform)

| # | Vulnerability | Opportunity |
| --- | --- | --- |
| 1 | No owned infrastructure (all leased)[8] | The platform's owned energy and DCs create a structural cost advantage impossible to replicate via leasing |
| 2 | Near-breakeven pricing[9] | Together AI cannot sustain low pricing without margin compression. The platform can price competitively AND maintain strong gross margins |
| 3 | NVIDIA-only chip dependency | A multi-chip strategy hedges against NVIDIA supply constraints and enables workload-optimal routing |
| 4 | No sovereign/compliance positioning | The platform can win regulated enterprise, government, and healthcare deals that Together AI cannot serve |
| 5 | GPU rental revenue concentration (~60-70%)[5] | GPU rental is commoditizing. The platform should focus on managed inference (higher margin, stickier) from day one |
| 6 | Slowing headcount growth (~11% YoY)[7] | May signal execution constraints. The platform can recruit aggressively from the talent pool |

Recommended Actions

1. Lead with Energy-Based Margin Story

Together AI proves inference pricing trends toward breakeven. The platform's energy ownership is the only path to sustainably strong gross margins. Make this the centerpiece of every sales conversation and investor pitch.

2. Evaluate Partnership Opportunity

Together AI's model serving expertise combined with the platform's cost-optimized infrastructure could create a mutually beneficial integration. Explore Together AI as a model serving layer on the platform's hardware: they need cheap GPUs, and the platform has them.

3. Integrate FlashAttention Immediately

FlashAttention-3 is open source (BSD license).[21] The platform's inference engine must include it on day one. Build additional optimizations for alternative silicon accelerators on top.

4. Win on Compliance and Sovereignty

Together AI has no compliance story. The platform should ship SOC 2, HIPAA, and sovereign-ready inference environments before Together AI builds them. This is a segment they are structurally unable to serve with leased infrastructure.

Threat Assessment Summary

Overall: MEDIUM Threat

Together AI is a MEDIUM threat to the platform. They are not a direct competitor in the sovereign/enterprise inference segment the platform is targeting, but they set the pricing floor for open-source model inference that the platform must price against. Their FlashAttention technology is freely available and should be adopted, not competed against. The greatest risk is that Together AI's developer ecosystem and 200+ model catalog become the default API for inference, making it harder for the platform to win developer mindshare. The greatest opportunity is a potential partnership where Together AI's software optimization runs on the platform's cost-advantaged hardware.

Sources & Footnotes

  1. [1] Together AI company overview, "The AI Native Cloud," product positioning and mission, together.ai
  2. [2] Contrary Research, "Together AI Business Breakdown, Valuation, & Founding Story," founding team backgrounds, Stanford connections, company history, research.contrary.com/company/together-ai
  3. [3] Together AI Blog, "Introducing Together AI Chief Scientist Tri Dao," FlashAttention history, adoption by OpenAI/Anthropic/Meta/Google/DeepSeek, Stanford Open Source Software Prize, together.ai/blog/tri-dao-flash-attention
  4. [4] Together AI Products Page, "Inference, Fine-Tuning, Training, and GPU Clusters," 200+ models, OpenAI-compatible API, full product catalog, together.ai/products
  5. [5] Sacra Research, Together AI revenue analysis: $130M ARR end of 2024, $300M ARR by Sep 2025, 450K+ developers, revenue composition (30-40% API, 60-70% GPU rental), sacra.com/c/together-ai
  6. [6] Together AI Blog & PRNewswire, "$305M Series B to Scale AI Acceleration Cloud," $3.3B valuation, investor list (General Catalyst, Prosperity7, Salesforce Ventures, NVIDIA, Kleiner Perkins, DAMAC Capital, SK Telecom, Coatue, etc.), total $534M raised, together.ai/blog/together-ai-announcing-305m-series-b
  7. [7] Tracxn & Growjo, Together AI employee count and growth: ~107 (2023), ~287 (2024), ~320 (Jan 2026), 165% growth 2023-2024, taptwicedigital.com/stats/together-ai
  8. [8] Together AI Blog & PRNewswire, "Together AI and Hypertec Cloud Join Forces to Co-Build Turbocharged NVIDIA GB200 Cluster of 36K Blackwell GPUs," Nov 2024, deployment Q1 2025, complementing existing H100/H200 fleet, together.ai/blog/nvidia-gb200-together-gpu-cluster-36k
  9. [9] Together AI Pricing Page, all serverless inference pricing by model (Llama, DeepSeek, Qwen, Mistral), fine-tuning pricing, GPU cluster pricing, together.ai/pricing
  10. [10] Wikipedia & Clay, "Vipul Ved Prakash," founded Topsy (acquired by Apple $200M+, 2013), co-founded Cloudmark (acquired by Proofpoint $110M, 2017), Director of Engineering at Apple, en.wikipedia.org/wiki/Vipul_Ved_Prakash
  11. [11] Together AI About Page & The Org, leadership team: Ce Zhang (CTO), Kai Mak (CRO), Charles Srisuwananukorn (Founding VP Eng), James Barker (VP EMEA), together.ai/about
  12. [12] Princeton CS & AI2050, "Tri Dao," PhD Stanford (co-advised by Chris Re & Stefano Ermon), BS Math Stanford, Assistant Professor Princeton, creator of FlashAttention & Mamba, AI2050 Fellow, cs.princeton.edu/people/profile/td8762
  13. [13] Together AI Blog & ArXiv, "RedPajama: an Open Dataset for Training Large Language Models," 1.2T-token dataset, V2 100T+ tokens, 500+ models built on it, used by Snowflake Arctic/Salesforce XGen/AI2 OLMo, NeurIPS 2024 Datasets Track, together.ai/blog/redpajama
  14. [14] Together AI Blog & ArXiv, "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision," H100 Hopper architecture, warp specialization, WGMMA, TMA, FP8, 740 TFLOPS FP16, ~1.2 PFLOPS FP8, together.ai/blog/flashattention-3
  15. [15] PRNewswire & SiliconANGLE, "Together AI Acquires CodeSandbox" (Dec 2024) for code interpreter, "Together AI acquires Refuel" (May 2025) for data labeling, prnewswire.com
  16. [16] Data Center Dynamics & Hypertec, "Hypertec and 5C target Europe, plan 2GW roll-out for Together AI," 100K GPUs Europe by 2029, 800+ MW capacity across NA/Europe/Asia, datacenterdynamics.com
  17. [17] Together AI Fine-Tuning Page & Blog, "Fine-Tuning Platform Upgrades," LoRA/QLoRA support, $0.80-$2.00/M training tokens, W&B integration, Hugging Face integration, together.ai/fine-tuning
  18. [18] Together AI Blog, "Together Inference Engine v1" & "Inference Engine 2.0," 3-4x faster than vLLM, 2x faster than serverless APIs, FlashAttention + Flash-Decoding + Medusa, custom CUDA kernels, dynamic batching, Turbo/Lite endpoints, together.ai/blog/together-inference-engine-v1
  19. [19] Together AI Enterprise Page, "The Enterprise Platform for Inference & Fine-tuning," dedicated endpoints, VPC, SSO, enterprise security features, together.ai/enterprise
  20. [20] DigitalOcean & NVIDIA Developer Blog, "FlashAttention-4: Faster, Memory-Efficient Attention for LLMs," Blackwell SM100 target, 5-stage warp-specialized pipeline, first kernel >1 PFLOPS target, forward-only (research stage), digitalocean.com
  21. [21] GitHub, Dao-AILab/flash-attention, BSD license, open-source repository, integrated into PyTorch/Hugging Face/vLLM, github.com/Dao-AILab/flash-attention
  22. [22] Helicone, Northflank, & Artificial Analysis, "LLM API Providers Comparison 2025," Together AI vs. Fireworks AI vs. Groq vs. Cerebras speed and pricing benchmarks, Fireworks HIPAA/SOC2 compliance, helicone.ai/blog/llm-api-providers
  23. [23] Latent Space Podcast, "FlashAttention 2: making Transformers 800% faster w/o approximation - with Tri Dao of Together AI," technical deep-dive on FlashAttention architecture and research philosophy, latent.space/p/flashattention
  24. [24] Eesel.ai, "A complete guide to Together AI pricing in 2025," pricing tiers, model categories, fine-tuning costs analysis, eesel.ai/blog/together-ai-pricing

Methodology

This report was compiled from 24 primary sources including Together AI's corporate website, product pages, engineering blog posts, press releases, investor announcements, third-party research (Contrary Research, Sacra, Tracxn, Growjo), academic publications (ArXiv), independent benchmarks (Helicone, Artificial Analysis), and industry publications (Data Center Dynamics, SiliconANGLE, PRNewswire). Revenue estimates rely on Sacra Research and Tracxn data. Organizational structure is inferred from the official about page and The Org. All performance claims are self-reported by Together AI unless otherwise noted. Report accessed and compiled February 16, 2026.