LLM Benchmarks for WGSL Geometry Generation

I benchmarked nine leading open-weight models on the task of generating 3D geometry in WGSL (the WebGPU Shading Language). This page documents the methodology and results.

Methodology

A fixed benchmark of 46 prompts across five difficulty categories was run against nine open-weight models. WGSL validity was checked programmatically; geometric accuracy was rated on a 1–5 scale by Claude Opus 4.6, which was given both the generated code and a 2D image of the rendered result.
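As a rough illustration of how the two reported numbers per model could be derived from per-prompt records, here is a minimal Python sketch. The `Sample` record layout and the `summarise` helper are hypothetical, and the sketch assumes the mean is taken over every response that received a rating; the write-up does not say how invalid outputs were treated in the mean.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Sample:
    """One model response to one benchmark prompt (hypothetical record layout)."""
    prompt_id: int
    valid_wgsl: bool          # outcome of the programmatic validity check
    accuracy: Optional[int]   # 1-5 rating from the judge, None if not rated


def summarise(samples: list[Sample]) -> tuple[int, float]:
    """Return (valid WGSL percentage, mean accuracy over rated samples)."""
    if not samples:
        raise ValueError("no samples to summarise")
    valid_pct = round(100 * sum(s.valid_wgsl for s in samples) / len(samples))
    # Assumption: every rated sample counts toward the mean, valid or not.
    rated = [s.accuracy for s in samples if s.accuracy is not None]
    mean_accuracy = round(mean(rated), 2) if rated else float("nan")
    return valid_pct, mean_accuracy


# Made-up data for one model: 46 prompts, 44 valid responses rated 4,
# and 2 invalid responses rated 1.
demo = [Sample(i, True, 4) for i in range(44)]
demo += [Sample(44 + i, False, 1) for i in range(2)]
print(summarise(demo))  # (96, 3.87)
```

With 46 prompts, each invalid response costs roughly 2.2 percentage points of validity, which is why 44 valid responses round to 96%.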

Prompt categories

Results

Model                           Valid WGSL  Mean Accuracy  Grade
Qwen2.5-Coder-32B-Instruct      100%        4.35           A
Qwen3-14B-FP8                   100%        4.28           A
Qwen3-32B-FP8                   100%        4.15           A
GLM-4.7-Flash-FP8               100%        4.00           A
GLM-4-32B-0414                  100%        3.96           A
DeepSeek-R1-Distill-Qwen-32B    100%        3.87           A
GLM-Z1-32B-0414                 67%         3.02           C
llava-v1.6-mistral-7b-hf        96%         2.37           B
llava-onevision-qwen2-7b-ov-hf  83%         2.15           B

Key Findings

The dominant failure mode across all models was imprecise spatial reasoning (for example incorrect half-extents, wrong positional offsets, and rotation errors) rather than invalid syntax.
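To make the half-extent error concrete, the sketch below ports the standard axis-aligned box signed distance function to Python (rather than WGSL, so it can be run directly). The box size and probe point are illustrative and not taken from the benchmark outputs. Passing the full box size where half-extents are expected yields a shape that still compiles and renders but is twice as large along every axis.

```python
import math


def sd_box(p, half_extents):
    """Signed distance from point p to an axis-aligned box centred at the origin."""
    # Per-axis distance from the point to the box faces (negative when inside).
    q = [abs(pc) - h for pc, h in zip(p, half_extents)]
    outside = math.sqrt(sum(max(c, 0.0) ** 2 for c in q))
    inside = min(max(q), 0.0)
    return outside + inside


size = (2.0, 2.0, 2.0)             # intended full size of the box
half = tuple(s / 2 for s in size)  # what the SDF actually expects
p = (1.5, 0.0, 0.0)                # probe point just outside the intended box

correct = sd_box(p, half)  # 0.5: the point is outside, as intended
wrong = sd_box(p, size)    # -0.5: the point now lies inside an oversized box
print(correct, wrong)
```

Both calls are syntactically valid, which is consistent with this class of mistake lowering the accuracy rating while leaving the validity check unaffected.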

Qwen2.5-Coder-32B-Instruct outperforms both Qwen3 models, the same-size Qwen3-32B-FP8 and the smaller Qwen3-14B-FP8, suggesting that code-specialised training matters more than raw scale for this task. The model also produced the most creative solutions for complex SDF compositions.