FAST AI INFERENCE
SIMPLE API
Build powerful AI applications with leading models such as DeepSeek R1 and Llama 3.3 using our fast, simple APIs. Set up in minutes. Scale forever.

DeepSeek R1
DeepSeek-R1 is an open-source, first-generation reasoning model trained with large-scale reinforcement learning. It achieves state-of-the-art performance on math, code, and reasoning tasks, and its release includes smaller distilled models suited to a range of applications.

DeepSeek V3
DeepSeek-V3 is a Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated for each token. Optimized for efficiency and performance and pre-trained on 14.8 trillion high-quality tokens, it outperforms other open-source models across a wide range of benchmarks.

Llama 3.1 70B Instruct
The Meta Llama 3.1 collection consists of high-performing, multilingual large language models optimized for dialogue. The models handle text and code across 8 languages, come in 8B, 70B, and 405B parameter sizes, and are built with a focus on safety, inclusivity, and societal benefit.

Llama 3.1 8B Instruct
Meta Llama 3.1 is a collection of advanced, multilingual large language models optimized for dialogue and available in 8B, 70B, and 405B sizes. The models outperform many available chat models on common industry benchmarks and are designed for safe, responsible use across a variety of applications.

Llama 3.2 11B Vision Instruct
Llama 3.2-Vision, developed by Meta, is a multimodal large language model optimized for image recognition, image reasoning, and captioning. It outperforms many available open and closed multimodal models on common industry benchmarks.

Mistral Nemo 12B Instruct
Mistral-NeMo-12B-Instruct is a 12-billion-parameter large language model designed primarily for English-language chat applications, with strong multilingual and code comprehension. It can be customized with NVIDIA's NeMo Framework.
PLAYGROUND
The playground displays the total cost, time to first token, tokens per second, and total token count.
Playground settings let you tweak the overall style and tone of the conversation, control how creative the model is when responding, and set the maximum token length of generated text.
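On an OpenAI-compatible chat completions API, the three playground settings (style and tone, creativity, maximum length) most plausibly correspond to a system message, the `temperature` parameter, and `max_tokens`. The sketch below assumes that mapping, which this page does not confirm, and only builds the request body without sending it. `PlaygroundSettings` and `buildChatRequest` are hypothetical names.

```typescript
// Hypothetical shape of the three playground settings.
interface PlaygroundSettings {
  systemPrompt: string; // overall style and tone of the conversation
  temperature: number; // higher values produce more varied, "creative" output
  maxTokens: number; // upper bound on the number of generated tokens
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature: number;
  max_tokens: number;
  stream: boolean;
}

// Assumption: settings map onto standard OpenAI-style chat completion fields.
function buildChatRequest(
  model: string,
  settings: PlaygroundSettings,
  userMessage: string,
): ChatRequest {
  if (!Number.isInteger(settings.maxTokens) || settings.maxTokens <= 0) {
    throw new Error("maxTokens must be a positive integer");
  }
  return {
    model,
    messages: [
      { role: "system", content: settings.systemPrompt },
      { role: "user", content: userMessage },
    ],
    temperature: settings.temperature,
    max_tokens: settings.maxTokens,
    stream: true,
  };
}

const request = buildChatRequest(
  "meta-llama/llama-3.1-8b-instruct/fp-8",
  { systemPrompt: "Answer concisely in a friendly tone.", temperature: 0.7, maxTokens: 256 },
  "What is the meaning of life?",
);
console.log(JSON.stringify(request, null, 2));
```

Keeping the settings in one typed object makes it easy to reuse the same playground configuration across requests.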
Calculate Your Savings
Price on Together.ai: $4.2K/mo
Input: $3 / million tokens | Output: $7 / million tokens
Price on Inference.net: $1.4K/mo (save 66%)
Input: $0.75 / million tokens | Output: $3 / million tokens
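Monthly cost follows directly from the per-million-token rates: cost = (input tokens / 10^6) × input rate + (output tokens / 10^6) × output rate. The page does not state the token volume behind the monthly figures, so the sketch below uses a hypothetical workload of 500 million input and 100 million output tokens per month. `Pricing` and `monthlyCost` are illustrative names.

```typescript
// Per-million-token prices in USD, taken from the comparison above.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

const togetherPricing: Pricing = { inputPerMillion: 3, outputPerMillion: 7 };
const lowerPricing: Pricing = { inputPerMillion: 0.75, outputPerMillion: 3 };

// Cost in USD for a given number of input and output tokens.
function monthlyCost(p: Pricing, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * p.inputPerMillion + (outputTokens / 1e6) * p.outputPerMillion;
}

// Hypothetical workload: 500M input tokens and 100M output tokens per month.
const inputTokens = 500_000_000;
const outputTokens = 100_000_000;

const currentCost = monthlyCost(togetherPricing, inputTokens, outputTokens); // 2200
const newCost = monthlyCost(lowerPricing, inputTokens, outputTokens); // 675
const savingsPct = (1 - newCost / currentCost) * 100;

console.log(`$${currentCost} -> $${newCost} (save ${savingsPct.toFixed(1)}%)`);
```

Savings depend on the input/output mix: with these rates they range from about 57% for an output-only workload (1 - 3/7) to 75% for an input-only workload (1 - 0.75/3).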
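```typescript
import OpenAI from "openai";

// Point the OpenAI SDK at the OpenAI-compatible endpoint.
const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

// Request a streamed chat completion (top-level await requires an ES module).
const completion = await openai.chat.completions.create({
  model: "meta-llama/llama-3.1-8b-instruct/fp-8",
  messages: [
    {
      role: "user",
      content: "What is the meaning of life?",
    },
  ],
  stream: true,
});

// Print each text delta as it arrives; some chunks (e.g. the last) carry no content.
for await (const chunk of completion) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```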
REAL-TIME CHAT
Powerful serverless inference APIs that scale from zero to billions.
Top-Tier Performance
Industry-leading latency and throughput powered by highly optimized GPU infrastructure.
Unbeatable Pricing
Up to 90% cost savings vs legacy providers. Only pay for what you use, and never a penny more.
Easy Integration
First-class support for LangChain, LlamaIndex and other popular LLM frameworks.
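Latency and throughput are easy to check from the client side. The sketch below measures time to first token and an approximate tokens-per-second rate over any async iterable of text deltas, such as the `content` values from a streamed chat completion. It counts each non-empty chunk as one token, which is only an approximation, and uses a fake stream in place of a real API call. `measureStream` and `fakeStream` are hypothetical helpers.

```typescript
interface StreamStats {
  timeToFirstTokenMs: number; // wait before the first non-empty delta arrived
  tokensPerSecond: number; // approximate: chunks after the first, per second
  totalChunks: number; // non-empty chunks, used as a stand-in for tokens
  text: string; // concatenated output
}

// Consumes a stream of text deltas and records timing.
async function measureStream(stream: AsyncIterable<string>): Promise<StreamStats> {
  const start = Date.now();
  let firstAt: number | undefined;
  let totalChunks = 0;
  let text = "";

  for await (const delta of stream) {
    if (delta.length === 0) continue; // ignore empty deltas (e.g. role-only or final chunks)
    if (firstAt === undefined) firstAt = Date.now();
    totalChunks += 1;
    text += delta;
  }

  const end = Date.now();
  if (firstAt === undefined) {
    // Nothing was generated: report the total wait and zero throughput.
    return { timeToFirstTokenMs: end - start, tokensPerSecond: 0, totalChunks, text };
  }
  const generationSeconds = (end - firstAt) / 1000;
  return {
    timeToFirstTokenMs: firstAt - start,
    tokensPerSecond: generationSeconds > 0 ? (totalChunks - 1) / generationSeconds : 0,
    totalChunks,
    text,
  };
}

// Hypothetical stand-in for a real API stream: yields each delta after a delay.
async function* fakeStream(deltas: string[], delayMs: number): AsyncGenerator<string> {
  for (const delta of deltas) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    yield delta;
  }
}

measureStream(fakeStream(["Streaming", " tokens", "", " arrive", " in", " pieces."], 20)).then(
  (stats) => console.log(stats),
);
```

Separating time to first token from the generation rate matters because the two numbers usually answer different questions: how responsive a chat UI feels versus how long a long answer takes to finish.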
BATCH INFERENCE
Process millions of requests per batch with a single API call.
Unmatched Scale & Cost
We handle the largest asynchronous LLM workloads at the lowest prices on the market.
Build Advanced Workflows
Power massive-scale data analysis, synthetic data generation, document processing, and more with our batch API.
Built for Developers
Easy to integrate. Find the code samples and documentation you need, when you need it.
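OpenAI-style batch APIs take a JSONL file in which each line is one request carrying a `custom_id`, an HTTP `method`, a target `url`, and a request `body`. This page does not document the batch input format, so the sketch below assumes that OpenAI-compatible layout and only builds the file contents. `toBatchJsonl`, the `doc-` IDs, and the summarization prompt are hypothetical.

```typescript
interface BatchLine {
  custom_id: string; // lets you match each result back to its input
  method: "POST";
  url: string;
  body: {
    model: string;
    messages: { role: "system" | "user"; content: string }[];
    max_tokens: number;
  };
}

// Assumption: OpenAI-style batch lines targeting /v1/chat/completions.
function toBatchJsonl(documents: string[], model: string): string {
  const lines = documents.map(
    (doc, i): BatchLine => ({
      custom_id: `doc-${i}`,
      method: "POST",
      url: "/v1/chat/completions",
      body: {
        model,
        messages: [
          { role: "system", content: "Summarize the document in one sentence." },
          { role: "user", content: doc },
        ],
        max_tokens: 128,
      },
    }),
  );
  // One JSON object per line, with a trailing newline.
  return lines.map((line) => JSON.stringify(line)).join("\n") + "\n";
}

const jsonl = toBatchJsonl(
  [
    "Customer reports login failure after password reset.",
    "Invoice total does not match the order summary.",
  ],
  "meta-llama/llama-3.1-8b-instruct/fp-8",
);
console.log(jsonl);
```

Stable `custom_id` values matter because batch results are not guaranteed to come back in input order.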
DATA EXTRACTION
Transform unstructured data into actionable insights with powerful schema validation and parsing.
Precise Extraction
Extract structured data with guaranteed schema compliance using JSON Schema validation. Handle complex nested objects with confidence.
Flexible Processing
Process data at scale with our Batch API, or stream response objects in real-time as they are generated.
Familiar Tooling
First-class SDK support for TypeScript, Python, and more. Support for popular validation tools like Pydantic and Zod.
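On OpenAI-compatible APIs, schema-constrained output is usually requested through `response_format` with `type: "json_schema"`. Whether this endpoint accepts exactly that field is an assumption, as are the `Invoice` type and the `isInvoice` guard below. The sketch builds the request body and checks the parsed reply on the client, the role Zod or Pydantic would play in a real project.

```typescript
// Hypothetical extraction target: an invoice with nested line items.
interface LineItem {
  description: string;
  amount: number;
}
interface Invoice {
  vendor: string;
  total: number;
  items: LineItem[];
}

// JSON Schema mirroring the Invoice type.
const invoiceSchema = {
  type: "object",
  properties: {
    vendor: { type: "string" },
    total: { type: "number" },
    items: {
      type: "array",
      items: {
        type: "object",
        properties: { description: { type: "string" }, amount: { type: "number" } },
        required: ["description", "amount"],
        additionalProperties: false,
      },
    },
  },
  required: ["vendor", "total", "items"],
  additionalProperties: false,
};

// Assumption: the API accepts OpenAI-style `response_format` with a JSON Schema.
const extractionRequest = {
  model: "meta-llama/llama-3.1-8b-instruct/fp-8",
  messages: [
    { role: "user", content: "Extract the invoice: ACME Corp, 2 widgets at $5 each, total $10." },
  ],
  response_format: {
    type: "json_schema",
    json_schema: { name: "invoice", schema: invoiceSchema, strict: true },
  },
};

function isLineItem(value: unknown): value is LineItem {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.description === "string" && typeof v.amount === "number";
}

// Client-side type guard for the parsed model reply.
function isInvoice(value: unknown): value is Invoice {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const items = v.items;
  return (
    typeof v.vendor === "string" &&
    typeof v.total === "number" &&
    Array.isArray(items) &&
    items.every(isLineItem)
  );
}

// Example reply content as a model might return it.
const reply =
  '{"vendor":"ACME Corp","total":10,"items":[{"description":"widget","amount":5},{"description":"widget","amount":5}]}';
const parsed: unknown = JSON.parse(reply);
console.log(isInvoice(parsed) ? `Total: ${parsed.total}` : "Schema mismatch");
```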
2 MINUTES TO INTEGRATE
We designed our API from scratch to make integration as easy as possible. It takes only two minutes to fully integrate. Switch today. Satisfaction guaranteed.
END-TO-END GENERATIVE AI
We power the most comprehensive generative AI workflows for your application without missing a beat. Get tokens to your users at blazing fast speeds.
AFFORDABLE AT EVERY SCALE
We built custom AI-native orchestration and scheduling software to ensure you always get the best prices without compromising on performance.
TO INFINITY AND BEYOND
We regularly update our model catalog when new models are released, so you always have access to the latest and greatest AI models.