April 2026

LLM Benchmarks for WGSL Geometry Generation

I benchmarked nine open-weight models on the task of generating 3D geometry in WGSL. This page documents the methodology and results.

Methodology

A fixed benchmark of 46 prompts across five difficulty categories was run against nine open-weight models. WGSL validity was checked programmatically; geometric accuracy was rated by Claude Opus 4.6 on a 1–5 scale, using both the generated code and a 2D image of the rendered result.
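As a rough illustration of how the two reported numbers per model could be derived from per-prompt records, here is a minimal Python sketch. The `Result` record, the `summarise` function and the model names are hypothetical, and it assumes that outputs which failed validation still received an accuracy rating that counts toward the mean; the actual harness may have handled those differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    """One model output for one prompt (hypothetical record layout)."""
    model: str
    category: str   # "B1" .. "B5"
    valid: bool     # did the WGSL pass the programmatic validity check?
    accuracy: int   # judge rating on the 1-5 scale


def summarise(results):
    """Return {model: {"valid_pct": ..., "mean_accuracy": ...}}."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)

    summary = {}
    for model, rows in by_model.items():
        summary[model] = {
            # Share of outputs that passed validation, as a percentage.
            "valid_pct": 100.0 * sum(r.valid for r in rows) / len(rows),
            # Assumption: invalid outputs are included in the mean.
            "mean_accuracy": mean(r.accuracy for r in rows),
        }
    return summary


if __name__ == "__main__":
    demo = [
        Result("model-a", "B1", True, 5),
        Result("model-a", "B2", False, 1),
        Result("model-b", "B1", True, 4),
    ]
    for name, stats in summarise(demo).items():
        print(name, stats)
```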

Prompt categories

B1 — Classic CAD with specified dimensions + numbered steps
B2 — Classic CAD with specified dimensions, no steps
B3 — Classic CAD, no dimensions specified
B4 — Vague / short prompts
B5 — Organic / SDF-native shapes (smooth blending, gyroids, procedural)
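To give a sense of what the B5 category asks for, the sketch below shows two common SDF building blocks: a polynomial smooth minimum (for blending two shapes) and the usual trigonometric gyroid approximation. It is written in Python rather than WGSL purely for illustration, the function names are my own, and none of it is taken from the benchmark prompts or from any model's output.

```python
import math


def smooth_min(a, b, k):
    """Polynomial smooth minimum of two distances.

    k controls the width of the blend region; where |a - b| >= k the
    result is the ordinary min(a, b).
    """
    h = max(k - abs(a - b), 0.0) / k
    return min(a, b) - h * h * k * 0.25


def gyroid(x, y, z):
    """Implicit gyroid surface: zero on the surface.

    This is a field value, not an exact distance.
    """
    return (math.sin(x) * math.cos(y)
            + math.sin(y) * math.cos(z)
            + math.sin(z) * math.cos(x))


if __name__ == "__main__":
    # Two surfaces at equal distance blend to a value below either one.
    print(smooth_min(0.0, 0.0, 1.0))
    # Far apart, the blend has no effect.
    print(smooth_min(0.0, 5.0, 1.0))
    print(gyroid(0.0, 0.0, 0.0))
```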

Results

Model	Valid WGSL	Mean Accuracy (1–5)	Grade
Qwen2.5-Coder-32B-Instruct	100%	4.35	A
Qwen3-14B-FP8	100%	4.28	A
Qwen3-32B-FP8	100%	4.15	A
GLM-4.7-Flash-FP8	100%	4.00	A
GLM-4-32B-0414	100%	3.96	A
DeepSeek-R1-Distill-Qwen-32B	100%	3.87	A
GLM-Z1-32B-0414	67%	3.02	C
llava-v1.6-mistral-7b-hf	96%	2.37	B
llava-onevision-qwen2-7b-ov-hf	83%	2.15	B

Key Findings

The dominant failure mode across all models was a lapse in precise spatial reasoning (e.g. incorrect half-extents, wrong positional offsets, rotation errors) rather than a syntax failure.
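The half-extent mistake is easy to show with the standard box SDF, which takes half-sizes rather than full sizes. The sketch below is a Python port written for illustration (the benchmark itself targets WGSL), and the 2 x 2 x 2 cube is a made-up example rather than one of the benchmark prompts. Passing the full size still produces perfectly valid code, but the box comes out twice as large in every dimension.

```python
import math


def sd_box(p, half_extents):
    """Signed distance from point p to an origin-centred axis-aligned box."""
    q = [abs(pi) - bi for pi, bi in zip(p, half_extents)]
    outside = math.sqrt(sum(max(qi, 0.0) ** 2 for qi in q))
    inside = min(max(q), 0.0)
    return outside + inside


# Hypothetical prompt: "a cube 2 units on each side, centred at the origin".
size = (2.0, 2.0, 2.0)
correct_half = tuple(s / 2.0 for s in size)  # (1, 1, 1)
wrong_half = size                            # full size passed as half-extents

# A point that should lie exactly on the +X face of the cube.
face_point = (1.0, 0.0, 0.0)

if __name__ == "__main__":
    print(sd_box(face_point, correct_half))  # 0.0: on the surface
    print(sd_box(face_point, wrong_half))    # -1.0: one unit inside the oversized box
```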

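Qwen2.5-Coder-32B-Instruct outperforms both newer Qwen3 models, neither of which is larger (Qwen3-14B-FP8 and Qwen3-32B-FP8). This suggests that code-specialised training matters more than raw scale for this task, a reading consistent with Qwen3-14B-FP8 scoring above Qwen3-32B-FP8. The model also produced the most creative solutions for complex SDF compositions.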
