AI Infrastructure Planner


GenAI Sizing Calculator

Calculate the optimal GPU configuration for your GenAI workloads. Input your requirements and get recommendations for hardware sizing, expected throughput, and cost estimates.

Model Configuration


Memory Requirements

Model Weights: N/A (based on 16-bit precision)
KV Cache per Token: N/A (based on 16-bit precision)
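
Once a model is selected, both figures follow directly from the model architecture. A minimal sketch of the arithmetic (the function names and the Llama-2-70B-style example values are illustrative assumptions, not the calculator's internals):

    def weights_memory_gb(num_params_billions, bits=16):
        # Weights footprint = parameter count x bytes per parameter.
        return num_params_billions * 1e9 * (bits / 8) / 1e9

    def kv_cache_per_token_kb(num_layers, num_kv_heads, head_dim, bits=16):
        # Per token, each layer stores one K and one V vector per KV head.
        # Counting KV heads (not attention heads) captures the GQA case.
        return 2 * num_layers * num_kv_heads * head_dim * (bits / 8) / 1e3

    # Illustrative Llama-2-70B-like architecture (assumed values):
    print(weights_memory_gb(70))                 # 140.0 GB at 16-bit
    print(kv_cache_per_token_kb(80, 8, 128))     # 327.68 KB per token (GQA)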

Workload Requirements

GPU Selection

Efficiency Modifier (%): Real-world efficiency modifier for throughput estimates (default: 70%)

GPUs per Server: Number of GPUs in each server (default: 8 for DGX/HGX systems)

Performance Specifications

Compute (16-bit): 989 TFLOPS
Memory Bandwidth: 4.8 TB/s
Memory Capacity: 141 GB HBM3e
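
These figures correspond to an NVIDIA H200-class GPU. Because autoregressive decode streams the full weight set from HBM for every generated token, the bandwidth number alone yields a quick single-stream throughput ceiling. A sketch under that assumption (the 140 GB example model is assumed, not part of the tool):

    H200 = {"fp16_tflops": 989, "mem_bw_tb_s": 4.8, "mem_gb": 141}

    # Bandwidth-bound ceiling: tokens/s <= memory bandwidth / model bytes.
    model_gb = 140  # e.g. a 70B model at 16-bit precision (assumed example)
    ceiling = H200["mem_bw_tb_s"] * 1e12 / (model_gb * 1e9)
    print(round(ceiling, 1))  # ~34.3 tokens/s per model replica, batch size 1

The Key Assumptions below apply the configurable efficiency modifier on top of this kind of bandwidth-bound estimate.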

Sizing Results

Select a model to see GPU requirements.

Key Assumptions

Memory-First Sizing: GPU count is determined by total memory requirements (model weights + KV cache), not compute requirements (see the sketch after this list).

KV Cache Calculation: Uses actual model architecture details (layers, hidden size, attention heads) with support for GQA and MoE models.

Memory-Bandwidth Bound: Throughput estimates assume memory bandwidth is the bottleneck, as is typical for autoregressive inference.

Real-World Efficiency: Throughput includes a configurable efficiency modifier (default 70%) to account for batching and scheduling overhead.

Model Parameters: Extracted directly from HuggingFace model configs when available, with manual input fallback.

Precision Support: Supports 4-bit through 32-bit precision for both model weights and KV cache independently.

Model Instance Deployment: Small models use 1 instance per GPU. Models using >80% of GPU memory span 2 GPUs. Large models can span up to 4 servers per instance.

Server Configuration: Default 8 GPUs per server. Large models span multiple servers (2-4) per instance. Smaller models achieve better utilization with multiple instances per server.
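
Taken together, these assumptions amount to a small sizing routine. The sketch below illustrates that logic under the stated defaults; the function name, signature, and the assumption that bandwidth aggregates across the GPUs of a multi-GPU instance are mine, not the calculator's:

    import math

    def size_deployment(weights_gb, kv_per_token_kb, concurrent_tokens,
                        gpu_mem_gb=141, mem_bw_tb_s=4.8,
                        efficiency=0.70, gpus_per_server=8):
        # Memory-first sizing: GPU count comes from total memory
        # (weights + KV cache), not from compute.
        kv_gb = kv_per_token_kb * concurrent_tokens / 1e6
        total_gb = weights_gb + kv_gb
        # The >80% rule: each GPU contributes 80% of its capacity, so a
        # model needing more than that spills onto additional GPUs.
        gpus = math.ceil(total_gb / (gpu_mem_gb * 0.80))

        # Bandwidth-bound throughput with the efficiency modifier applied,
        # assuming bandwidth aggregates across the instance's GPUs.
        tok_s = efficiency * gpus * mem_bw_tb_s * 1e12 / (weights_gb * 1e9)

        # Placement rules from the assumptions above.
        if gpus <= 1:
            placement = "1 instance per GPU"
        elif gpus <= gpus_per_server:
            placement = f"{gpus} GPUs per instance, within one server"
        else:
            servers = math.ceil(gpus / gpus_per_server)
            placement = f"{min(servers, 4)} servers per instance"
        return gpus, round(tok_s, 1), placement

    # Example: 140 GB weights, 327.68 KB/token KV cache, 50k in-flight tokens
    print(size_deployment(140, 327.68, 50_000))
    # -> (2, 48.0, '2 GPUs per instance, within one server')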

Roadmap

Launch v1.0!

Multiple model instance support

Add additional GPU options

Add time per output token

Differentiate between prefill and decode timing