Qwen 3 30B A3B
Qwen3-30B-A3B-FP8 is a cutting-edge large language model that offers seamless switching between complex reasoning and general-purpose dialogue, excelling in reasoning, instruction-following, multilingual support, and agent integration, with a robust capacity for handling long contexts and supporting over 100 languages.

API USAGE
API IDENTIFIER
qwen/qwen3-30b-a3b/fp8
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.inference.net/v1",
  apiKey: process.env.INFERENCE_API_KEY,
});

const completion = await openai.chat.completions.create({
  model: "qwen/qwen3-30b-a3b/fp8",
  messages: [
    {
      role: "user",
      content: "What is the meaning of life?",
    },
  ],
  stream: true,
});

// Print tokens as they arrive; some chunks carry no content, so fall back to "".
for await (const chunk of completion) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
RELATED MODELS

DeepSeek R1
DeepSeek-R1 is an open-source first-generation reasoning model leveraging large-scale reinforcement learning to achieve state-of-the-art performance in math, code, and reasoning tasks, and includes distilled models suitable for various applications.

DeepSeek V3 0324
DeepSeek-V3-0324 is an advanced language model with improved reasoning capabilities, enhanced web development support, superior Chinese writing proficiency, and refined function calling accuracy, designed to provide detailed search analysis and high-quality interactive experiences.

Llama 3.3 70B Instruct
Meta's Llama 3.3 is a 70B parameter multilingual instruction-tuned language model designed for dialogue use, outperforming many open and closed-source models and incorporating safety features such as supervised fine-tuning and reinforcement learning with human feedback.
Qwen3-30B-A3B-FP8
Qwen3 Highlights
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers advancements in reasoning, instruction-following, agent capabilities, and multilingual support, with the following key features:
- Unique support for seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across various scenarios.
- Significant enhancement of its reasoning capabilities, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- Superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- Expertise in agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- Support for 100+ languages and dialects with strong capabilities for multilingual instruction following and translation.
Model Overview
This repo contains the FP8 version of Qwen3-30B-A3B, which has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 32,768 natively and 131,072 tokens with YaRN.
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
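The figures in the overview imply a few ratios that can help when estimating compute and memory. The following is a minimal sketch that uses only the numbers listed above; the variable names are illustrative and not part of any library.

```python
# Figures copied from the model overview above.
total_params_b = 30.5      # total parameters, in billions
active_params_b = 3.3      # parameters activated per token, in billions
num_experts = 128
active_experts = 8
q_heads = 32
kv_heads = 4
native_context = 32_768
yarn_context = 131_072

# Fraction of the weights that participate in each forward pass.
active_fraction = active_params_b / total_params_b

# Fraction of experts routed to per token.
expert_fraction = active_experts / num_experts

# Grouped-query attention: number of query heads sharing each key/value head.
q_per_kv = q_heads // kv_heads

# Context extension factor implied by the YaRN figure.
yarn_factor = yarn_context // native_context

print(f"active parameter fraction: {active_fraction:.1%}")
print(f"active expert fraction:    {expert_fraction:.2%}")
print(f"query heads per KV head:   {q_per_kv}")
print(f"YaRN extension factor:     {yarn_factor}x")
```

The active parameter fraction comes out higher than the active expert fraction, which is consistent with some weights (such as attention and embeddings) being used for every token regardless of routing.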
Quickstart
The code for Qwen3-MoE is included in the latest Hugging Face transformers, and we advise you to use the latest version of transformers.
With transformers<4.51.0, you will encounter the following error:
KeyError: 'qwen3_moe'
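To catch this before loading the model, one option is to check the installed version up front. The sketch below uses only the standard library; `parse_version` and `supports_qwen3_moe` are hypothetical helpers written for this example, and the parser deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
from importlib import metadata

MIN_TRANSFORMERS = (4, 51, 0)  # threshold stated in the text above


def parse_version(version: str) -> tuple:
    """Return the leading numeric components as a tuple, padded to three parts."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(digits))
    # Pad so that "4.51" compares equal to "4.51.0".
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_qwen3_moe(version: str) -> bool:
    """True if the given transformers version is at least 4.51.0."""
    return parse_version(version) >= MIN_TRANSFORMERS


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if supports_qwen3_moe(installed) else "too old, please upgrade"
        print(f"transformers {installed}: {status}")
```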
The following code snippet illustrates how to use the model to generate content based on given inputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-FP8"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switch between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
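The parsing step in the snippet above splits the generated ids at the last occurrence of the `</think>` token. Here is that logic in isolation as a small sketch that runs without the model; `split_thinking` is a hypothetical helper name, and the ids in the example are toy values rather than real tokens.

```python
END_THINK_ID = 151668  # token id of </think>, as given in the snippet above


def split_thinking(output_ids, end_think_id=END_THINK_ID):
    """Split generated ids into (thinking_ids, answer_ids).

    The split point is just after the LAST occurrence of end_think_id.
    If the marker is absent, everything is treated as the answer.
    """
    try:
        # Search the reversed list so the last occurrence is found first.
        index = len(output_ids) - output_ids[::-1].index(end_think_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]


# Toy ids: 1 and 2 stand in for reasoning tokens, 7 and 8 for the answer.
thinking, answer = split_thinking([1, 2, END_THINK_ID, 7, 8])
print(thinking)  # [1, 2, 151668]
print(answer)    # [7, 8]
```

Note that the thinking slice includes the marker id itself, and that an output with no marker yields an empty thinking slice.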
For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.5 to create an OpenAI-compatible API endpoint:
- SGLang:
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-FP8 --reasoning-parser qwen3
- vLLM:
vllm serve Qwen/Qwen3-30B-A3B-FP8 --enable-reasoning --reasoning-parser deepseek_r1
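Once a server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library and assumes the standard `/v1/chat/completions` route; `build_chat_request` and `chat` are hypothetical helper names, and the base URL is an assumption that should be changed to match the host and port your server actually listens on.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=600):
    """Send the request and return the parsed JSON response."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Assumed local address; adjust host and port to match your server.
BASE_URL = "http://localhost:8000"
MODEL = "Qwen/Qwen3-30B-A3B-FP8"

# Example call (requires a running server):
# reply = chat(BASE_URL, MODEL, "Give me a short introduction to large language models.")
# print(reply["choices"][0]["message"]["content"])
```

When the server is started with a reasoning parser, the reasoning text is typically returned separately from the final answer; the exact response field depends on the server and its version, so inspect the returned JSON before relying on it.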