April 2026
LLM Benchmarks for WGSL Geometry Generation
I benchmarked nine leading open-weight models on the task of generating 3D geometry in WGSL. This page documents the methodology and results.
Methodology
A fixed benchmark of 46 prompts across five difficulty categories was run against nine open-weight models. WGSL validity was checked programmatically; geometric accuracy was rated by Claude Opus 4.6 on a 1–5 scale, based on both the generated code and a 2D image of the result.
Prompt categories
- B1 — Classic CAD with specified dimensions + numbered steps
- B2 — Classic CAD with specified dimensions, no steps
- B3 — Classic CAD, no dimensions specified
- B4 — Vague / short prompts
- B5 — Organic / SDF-native shapes (smooth blending, gyroids, procedural)
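As a rough illustration of how the per-model figures in a benchmark like this can be aggregated, here is a minimal Python sketch. The record layout, the model names, and the scores are all hypothetical placeholders, not benchmark data. It also assumes that invalid outputs still count towards the mean accuracy, which the methodology above does not specify.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-prompt records: (model, category, wgsl_valid, accuracy 1-5).
# These values are made-up placeholders, NOT results from the benchmark.
records = [
    ("model-a", "B1", True, 5),
    ("model-a", "B4", True, 3),
    ("model-a", "B5", False, 1),
    ("model-b", "B1", True, 4),
    ("model-b", "B5", True, 2),
]


def summarise(rows):
    """Return {model: (valid_rate, mean_accuracy)} over all prompts.

    Assumption: prompts with invalid WGSL are kept in the accuracy mean.
    """
    by_model = defaultdict(list)
    for model, _category, valid, score in rows:
        by_model[model].append((valid, score))

    summary = {}
    for model, results in by_model.items():
        valid_rate = sum(v for v, _ in results) / len(results)
        mean_accuracy = mean(s for _, s in results)
        summary[model] = (valid_rate, mean_accuracy)
    return summary


summary = summarise(records)
for model, (valid_rate, accuracy) in summary.items():
    print(f"{model}: {valid_rate:.0%} valid, mean accuracy {accuracy:.2f}")
```

Grouping on the category field instead of the model name would give a per-category breakdown (B1 to B5) in the same way.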
Results
| Model | Valid WGSL | Mean Accuracy | Grade |
|---|---|---|---|
| Qwen2.5-Coder-32B-Instruct | 100% | 4.35 | A |
| Qwen3-14B-FP8 | 100% | 4.28 | A |
| Qwen3-32B-FP8 | 100% | 4.15 | A |
| GLM-4.7-Flash-FP8 | 100% | 4.00 | A |
| GLM-4-32B-0414 | 100% | 3.96 | A |
| DeepSeek-R1-Distill-Qwen-32B | 100% | 3.87 | A |
| GLM-Z1-32B-0414 | 67% | 3.02 | C |
| llava-v1.6-mistral-7b-hf | 96% | 2.37 | B |
| llava-onevision-qwen2-7b-ov-hf | 83% | 2.15 | B |
Key Findings
The dominant failure mode across all models was a breakdown in precise spatial reasoning (e.g. incorrect half-extents, wrong positional offsets, rotation errors) rather than syntax errors.
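To make the half-extent error concrete, here is a small Python sketch of the standard axis-aligned box SDF. It is written in Python rather than WGSL so it can be run directly, and it is an illustration only, not output from any of the benchmarked models. The box dimensions are hypothetical. Passing the full dimensions where half-extents are expected yields code that is perfectly valid but describes a box twice as large along every axis.

```python
import math


def sd_box(p, half_extents):
    """Signed distance from point p to an axis-aligned box centred at the origin."""
    q = [abs(pc) - hc for pc, hc in zip(p, half_extents)]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))  # distance when outside
    inside = min(max(q), 0.0)                               # negative depth when inside
    return outside + inside


# Hypothetical prompt: "a box 20 x 10 x 4 units" (full dimensions).
size = (20.0, 10.0, 4.0)
correct = tuple(s / 2 for s in size)  # half-extents (10, 5, 2)
wrong = size                          # bug: full size used as half-extents

surface_point = (10.0, 0.0, 0.0)      # lies on the +x face of the intended box

d_correct = sd_box(surface_point, correct)
d_wrong = sd_box(surface_point, wrong)
print(d_correct)  # 0.0: the point is on the surface
print(d_wrong)    # -4.0: the point is 4 units inside the oversized box
```

Because both variants compile and render a box, this kind of mistake passes a validity check and only shows up in the accuracy rating.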
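Qwen2.5-Coder-32B-Instruct outperforms both Qwen3 models, neither of which is larger: Qwen3-32B-FP8 is the same size and Qwen3-14B-FP8 is smaller. This suggests that code-specialised training matters more than raw scale for this task, a reading supported by Qwen3-14B-FP8 scoring higher than Qwen3-32B-FP8 (4.28 vs 4.15). Qwen2.5-Coder-32B-Instruct also produced the most creative solutions for complex SDF compositions.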