This is a comprehensive installation and execution guide for GPT OSS (gpt-oss-120b and gpt-oss-20b), OpenAI's first open-weight models since GPT-2. Both are released under the Apache 2.0 license for commercial use and are built on a Mixture-of-Experts (MoE) architecture that activates only a small fraction of their parameters per token, giving substantial efficiency gains over dense models of comparable capability.
This guide covers everything from beginner-friendly basic setup to advanced production environment deployment, suitable for all skill levels.
Model Overview and Features
The GPT OSS series consists of reasoning-specialized MoE models that generate responses in the Harmony chat format (see the sketch after this list):
- gpt-oss-120b: 117B parameters (5.1B active) – Production-ready, runs on single H100
- gpt-oss-20b: 21B parameters (3.6B active) – Consumer-friendly, runs on 16GB RAM
- Reasoning benchmarks: 94.2% on MMLU, 96.6% on math-contest (AIME) problems
- Tool use: native web search and Python code execution
- Quantization optimization: Significant memory reduction through MXFP4
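The Harmony format mentioned above is the structured prompt format the models were trained on; the openai-harmony package (installed in Method 1 below) renders conversations into it. A minimal sketch based on the package's documented API, with a placeholder prompt:
from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
)
# Load the Harmony encoding used by the gpt-oss models
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
# Build a one-turn conversation and render it into prompt tokens
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "Explain MoE routing in one sentence."),
])
prompt_tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
print(f"prompt length: {len(prompt_tokens)} tokens")
When you use the Transformers chat template, Ollama, or vLLM as shown later in this guide, this formatting is applied automatically; direct use of openai-harmony is mainly needed for custom serving stacks.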
System Requirements
GPT-OSS-120B
Minimum Requirements:
- GPU: 80GB VRAM (NVIDIA H100, A100 80GB, GB200)
- Multi-GPU Alternative: 4× 24GB GPUs (e.g., four RTX 3090 or RTX 4090 cards)
- Recommended: Single H100 80GB – Optimal performance
GPT-OSS-20B
Minimum Requirements:
- GPU: 16GB VRAM minimum
- Compatible Cards: RTX 4080, RTX 5070 Ti, RTX 3090 (24GB), AMD RX 9070 XT 16GB
- Apple Silicon: M1/M2/M3/M4 Mac support (Metal backend)
Memory Usage
- MXFP4 Quantization: Approximately 12.8-13GB of actual VRAM usage (see the rough estimate below)
- BF16: Approximately 48GB (not recommended for general users)
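As a back-of-envelope check (an approximation, not an official figure), MXFP4 stores the MoE weights at roughly 4.25 bits per parameter, which is where the ~13GB figure for the 20B model comes from:
# Rough VRAM estimate for gpt-oss-20b under MXFP4
# Assumption: ~4.25 bits per parameter for the quantized weights,
# plus 1-2GB of runtime overhead (embeddings, KV cache, activations)
params = 21e9
bits_per_param = 4.25
weight_gb = params * bits_per_param / 8 / 1e9
print(f"approx. weight memory: {weight_gb:.1f} GB (+1-2 GB runtime overhead)")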
Installation Methods
Method 1: Transformers Library (Most Common)
# Create virtual environment
python -m venv gpt-oss-env
source gpt-oss-env/bin/activate # Linux/Mac
# gpt-oss-env\Scripts\activate # Windows
# Install basic dependencies
pip install -U transformers accelerate torch kernels
pip install openai-harmony
# For MXFP4 quantization (optional, for Hopper GPUs)
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Method 2: vLLM (High-Performance Server)
# Optimized vLLM version
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
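After installation, the server is started with `vllm serve openai/gpt-oss-20b` (or the 120b model ID), which exposes an OpenAI-compatible endpoint on port 8000 by default. A minimal client sketch, assuming those defaults:
from openai import OpenAI
# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)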
Method 3: Easy Installation for General Users
Ollama (Easiest)
# Install and run immediately
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
ollama pull gpt-oss:120b
ollama run gpt-oss:120b
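Ollama also serves a local OpenAI-compatible API on port 11434, so the same client pattern works once a model has been pulled. A minimal sketch, assuming the 20B model from above:
from openai import OpenAI
# Ollama's OpenAI-compatible endpoint; the API key is ignored but must be non-empty
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Summarize MXFP4 quantization in two sentences."}],
)
print(response.choices[0].message.content)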
LM Studio (GUI)
lms get openai/gpt-oss-20b
lms get openai/gpt-oss-120b
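After downloading, LM Studio's built-in local server can serve the model to other applications; the following is a sketch assuming a recent version of the lms CLI:
# Start LM Studio's local server (OpenAI-compatible, default port 1234)
lms server start
The same OpenAI-client code shown for vLLM and Ollama above then works with base_url="http://localhost:1234/v1".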
Basic Usage Examples
Simple Transformers Pipeline
from transformers import pipeline
import torch
# Model selection
model_id = "openai/gpt-oss-20b" # or "openai/gpt-oss-120b"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Explain quantum mechanics in simple terms"},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
Manual Loading with Optimized Settings
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Flash Attention 3 for Hopper GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",  # Hopper GPU
)
messages = [{"role": "user", "content": "How many r's are in strawberry?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
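gpt-oss models support adjustable reasoning depth (low, medium, high), which per OpenAI's model card is requested through the system message. A minimal sketch that reuses the model and tokenizer loaded above:
# Reasoning depth is requested via the system message (low / medium / high)
messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "How many r's are in strawberry?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)  # higher effort produces longer reasoning
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))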
Platform-Specific Installation
Windows Setup
Prerequisites: WSL2, Docker Desktop, and NVIDIA GPU paravirtualization (GPU-PV) support
# Enable GPU in WSL2
wsl --update
# Configure GPU in .wslconfig
# Direct Ollama installation
ollama run gpt-oss:20b
# Microsoft AI Foundry Local
foundry model run gpt-oss-20b
macOS (Apple Silicon)
Native Support: Metal acceleration on M1/M2/M3/M4 chips
# Install with Homebrew
brew install ollama
ollama run gpt-oss:20b
# 16GB+ RAM recommended, 32GB+ for optimal performance
Linux
Full Container Support: Native NVIDIA Container Toolkit
# Direct transformers installation
pip install --upgrade accelerate transformers kernels
# GPU-enabled Docker container
docker run --gpus all -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
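Once the container is running, the model is pulled and run inside it; the sketch below assumes the container was started with `--name ollama` added to the command above:
# Pull and run the model inside the running container
docker exec -it ollama ollama run gpt-oss:20b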
Quantization and Memory Optimization
GPTQ 4-bit Quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
# facebook/opt-125m is used here as a small demonstration model for the GPTQ workflow
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    block_name_to_quantize="model.decoder.layers",  # the transformer block list for OPT
    backend="marlin"  # Marlin kernel (Ampere or newer GPUs such as the A100)
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    device_map="auto",
    quantization_config=gptq_config
)
Optimization Benefits: GPTQ 4-bit quantization roughly quarters weight memory, with reported inference speedups of up to about 3.25x depending on the GPU and kernel backend.
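After calibration finishes, the quantized model can be saved and reloaded like any other Transformers checkpoint, so quantization only has to run once; the output path below is a placeholder:
# Persist the quantized weights and tokenizer (placeholder path)
quantized_model.save_pretrained("./opt-125m-gptq-4bit")
tokenizer.save_pretrained("./opt-125m-gptq-4bit")
# Reload later without re-running calibration
reloaded = AutoModelForCausalLM.from_pretrained("./opt-125m-gptq-4bit", device_map="auto")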
API Server Configuration
Production FastAPI Implementation
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from contextlib import asynccontextmanager
# Global model storage
ml_models = {}
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load model at startup
    ml_models["tokenizer"] = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
    ml_models["model"] = AutoModelForCausalLM.from_pretrained(
        "openai/gpt-oss-20b",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="flash_attention_2"
    )
    yield
    ml_models.clear()

app = FastAPI(
    title="GPT OSS API Server",
    description="Production LLM API",
    lifespan=lifespan
)

class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
@app.post("/generate")
async def generate_text(request: GenerationRequest):
# Inference processing implementation
pass
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000, workers=1)
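With the server running (for example via python app.py, assuming the code above is saved as app.py), the endpoint can be exercised with a plain HTTP request whose fields match GenerationRequest:
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain MoE routing briefly", "max_tokens": 128, "temperature": 0.7}'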
Performance Benchmarks
Inference Speed (Tokens/Second)
GPT-OSS-120B
- H100 (Single): 45 tokens/sec
- GB200 NVL72: 1.5M tokens/sec (aggregate throughput of the full rack-scale system; enterprise deployments)
- AMD Ryzen AI Max+ 395: 30 tokens/sec
- Multi-GPU H100 (2 units): ~68 tokens/sec
GPT-OSS-20B
- RTX 4080/5070 Ti: 68 tokens/sec
- AMD RX 9070 XT 16GB: High throughput
- Apple Silicon M3: Variable performance with Metal backend
Accuracy Benchmarks
Industry-Leading Performance:
- MMLU: gpt-oss-120b 94.2% (vs GPT-4’s 95.1%)
- AIME Math: 96.6% mathematical contest accuracy
- Codeforces: 2622 Elo rating (competitive programming)
- HumanEval: 87.3% code generation success rate
Production Environment Deployment Checklist
Pre-Deployment
- Verify GPU instance service quotas
- Configure security VPC/network settings
- Set up model artifact storage (S3/GCS/Blob)
- Configure monitoring and logging
- Set up cost budgets and alerts
During Deployment
- Configure appropriate instance sizing based on model requirements
- Set up auto-scaling policies
- Configure health checks and monitoring
- Configure security groups/firewall settings
- Set up SSL/TLS certificates for production
Post-Deployment
- Performance benchmarking and optimization
- Cost analysis and optimization review
- Disaster recovery and backup procedures
- Regular security updates and patch application
- Scaling policy adjustments based on usage patterns
Summary and Recommendations
GPT OSS models deliver strong performance and cost efficiency as open-weight, reasoning-specialized LLMs. Key recommendations:
- Beginners: Start easily with Ollama or LM Studio
- Developers: Customize with Transformers library
- Production Environment: Utilize vLLM or cloud-managed services
- Cost Optimization: Use quantization (MXFP4/GPTQ) and appropriate hardware selection
- Scalability: Implement continuous batching and auto-scaling
From personal learning projects to sophisticated production AI applications, GPT OSS is a strong choice among commercially usable open-weight LLMs.