Clojure is almost as fast as C (with some help)

I have a stress test written in C: 100,000 cubes flying around in space. The CPU rebuilds every cube’s 4x4 transform matrix on every frame and sends all of them to the GPU. That is around 900,000 sine evaluations and 6 MB of matrix data per frame, and after that the GPU still has to draw 1.2 million triangles (3.6 million vertices). So the frame is half CPU work, half GPU work.
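The per-frame numbers follow from the per-cube work in the listing further down: nine polynomial sine evaluations per cube (three for the position oscillation, then a sine and a cosine for each of the three rotation angles) and one 16-float matrix per cube. A small sketch of that arithmetic, with hypothetical names chosen only for this example:

```clojure
;; Back-of-the-envelope numbers for one frame. All names are illustrative.
(def num-cubes 100000)
(def sines-per-cube 9)      ; 3 position oscillations + (sin, cos) x 3 rotation angles
(def floats-per-matrix 16)  ; one column-major mat4 per cube
(def bytes-per-float 4)

(def sines-per-frame (* num-cubes sines-per-cube))
(def matrix-bytes-per-frame (* num-cubes floats-per-matrix bytes-per-float))

sines-per-frame        ; => 900000
matrix-bytes-per-frame ; => 6400000, i.e. about 6.4 MB
```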

I ported it to Clojure and wanted to see how close I could get to the C version’s FPS. I should say up front that I did not do the optimization work alone: I paired with Claude Code on it, and most of the digging in this post (the benchmarks, the JIT logs, the failed attempts) comes from that session.

I did not expect much. The C version is built with clang at -O2, and at that level clang auto-vectorizes the transform loop with NEON SIMD instructions without telling you anything. So when you benchmark a language against C, you are not really competing with the loop in the source file. You are competing with whatever the optimizer turned it into.

The first measurements confirmed this. C computes all 100K matrices in 0.70 ms on a single thread. My best scalar Clojure loop - primitive arrays, type hints, unchecked math, every trick I knew - took 2.6 ms. Almost four times slower, and I had nothing left to try. HotSpot does not auto-vectorize a loop like this. Clang does. That is the entire gap.
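To make "primitive arrays, type hints, unchecked math" concrete, here is a minimal sketch of that style on a toy loop. It is not the actual kernel: `scale-add!` is a hypothetical function made up for this example, and it calls `unchecked-inc` explicitly so it stands alone, whereas the real file turns on `*unchecked-math*` for the whole namespace.

```clojure
;; Toy example of the scalar style: hinted primitive arrays, a primitive
;; double argument, and a loop/recur over a primitive long index.
(defn scale-add!
  "Writes out[i] = k * a[i] + b[i] for every index and returns out."
  [^floats a ^floats b ^floats out ^double k]
  (let [n (alength a)]
    (loop [i 0]
      (when (< i n)
        ;; aget on a hinted float[] yields a primitive, so nothing is boxed here
        (aset out i (float (+ (* k (aget a i)) (aget b i))))
        (recur (unchecked-inc i))))
    out))

(vec (scale-add! (float-array [1 2 3]) (float-array [10 20 30]) (float-array 3) 2.0))
;; => [12.0 24.0 36.0]
```

Each iteration handles exactly one element, so without auto-vectorization a loop of this shape runs one lane at a time.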

The JVM does have an answer though: the Vector API from Project Panama. Instead of hoping the JIT vectorizes your loop, you write the SIMD operations yourself, and since it is just a Java API, it works from Clojure too.
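One practical detail: the Vector API lives in the incubator module jdk.incubator.vector (the package imported in the namespace below), and incubator modules are not resolved by default, so the JVM has to be started with that module added. A possible deps.edn fragment, assuming a tools.deps project; the alias name `:simd` is made up for this sketch and the post does not say which build tool was actually used:

```clojure
;; deps.edn fragment (hypothetical alias name)
{:aliases
 {:simd {:jvm-opts ["--add-modules=jdk.incubator.vector"]}}}
```

With such an alias the program would be started with something like `clojure -M:simd -m cpu-stress-test`, assuming the namespace exposes a `-main`.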

My first attempt with the Vector API was a disaster. 7.7 ms. Slower than the scalar loop, ten times behind C. The code was correct, I could even see the vector intrinsics being compiled in the JIT logs, so for a while I was staring at it with no idea what was wrong. The problem is that the API only becomes fast when the JIT can treat the “species” (the descriptor that says how wide your vectors are) as a compile-time constant. I had stored it in a Clojure var. A var is a field lookup, the JIT cannot fold it, and every single vector operation silently fell back to a slow path that allocates objects. One indirection, 10x. Nothing warns you about this.
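The Var indirection can be seen in plain Clojure, without the Vector API. The sketch below is only an analogy with hypothetical names: an ordinary Var is read at run time and can change under the caller, which is why it cannot be treated as a constant, while a `^:const` Var is inlined when the reading code is compiled. `^:const` only works for values the compiler can embed, so it does not apply to the species itself; for that, the listing below refers to the static field FloatVector/SPECIES_128 directly.

```clojure
;; Hypothetical names, for illustration only.
(def lanes-var 4)            ; ordinary Var: every read goes through the Var
(def ^:const lanes-const 4)  ; ^:const: the value is inlined where it is used

(defn read-var [] lanes-var)
(defn read-const [] lanes-const)

;; Changing the root values shows which read was fixed at compile time.
(alter-var-root #'lanes-var (constantly 8))
(alter-var-root #'lanes-const (constantly 8))

(read-var)   ; => 8, looked up at call time
(read-const) ; => 4, the value inlined when read-const was compiled
```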

Three things fixed it:

1. Reference the species as a static final field (FloatVector/SPECIES_128) at every call site, so the JIT sees a constant.
2. Write the helper math as macros instead of functions, so the whole kernel ends up in one inlinable body.
3. Use a recent JDK. On JDK 21 the shuffle operations I use for transposing matrices in registers do not compile to the right NEON instructions; on JDK 25 they do. That upgrade alone took the pass from 2.5 ms to about 1 ms. Same code.
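The in-register transpose mentioned above is easier to follow on plain data. The sketch below models a 4-lane two-vector rearrange with ordinary Clojure vectors, using the same index convention and the same four shuffle patterns as the listing further down. `rearrange2` and `transpose4` are hypothetical helpers written for this illustration; they are not part of the Vector API and say nothing about how the real operations compile.

```clojure
(defn rearrange2
  "Plain-data model of a 4-lane two-vector rearrange: an index k in [0,4)
  picks (this k); a negative index k picks (other (+ k 4))."
  [this idx other]
  (mapv (fn [k] (if (neg? k) (other (+ k 4)) (this k))) idx))

;; Same shuffle index patterns as in the kernel.
(def zip-lo [0 -4 1 -3])
(def zip-hi [2 -2 3 -1])
(def cat-lo [0 1 -4 -3])
(def cat-hi [2 3 -2 -1])

(defn transpose4
  "Columns a b c w (lane j = cube j) -> four rows [a_j b_j c_j w_j]."
  [a b c w]
  (let [ab-lo (rearrange2 a zip-lo b)   ; [a0 b0 a1 b1]
        ab-hi (rearrange2 a zip-hi b)   ; [a2 b2 a3 b3]
        cw-lo (rearrange2 c zip-lo w)   ; [c0 w0 c1 w1]
        cw-hi (rearrange2 c zip-hi w)]  ; [c2 w2 c3 w3]
    [(rearrange2 ab-lo cat-lo cw-lo)
     (rearrange2 ab-lo cat-hi cw-lo)
     (rearrange2 ab-hi cat-lo cw-hi)
     (rearrange2 ab-hi cat-hi cw-hi)]))

(transpose4 [:a0 :a1 :a2 :a3] [:b0 :b1 :b2 :b3]
            [:c0 :c1 :c2 :c3] [:w0 :w1 :w2 :w3])
;; => [[:a0 :b0 :c0 :w0] [:a1 :b1 :c1 :w1] [:a2 :b2 :c2 :w2] [:a3 :b3 :c3 :w3]]
```

Each output row is one cube's four consecutive matrix entries, which is why every row can be written with a single contiguous store.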

After adding fused multiply-adds on top (clang was already doing that for the C side), the Clojure pass landed at 0.86 ms against C’s 0.70 ms, single-threaded. When I run both apps side by side they average around 370 FPS on my M3 MacBook. At this point neither version is limited by the CPU pass anymore; the GPU is the bottleneck for both, which is what parity means for this test.
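A fused multiply-add computes a*b + c with a single rounding at the end, where a separate multiply and add round twice. As far as I know, Java's floating-point rules do not allow the JIT to merge the two on its own, so on the JVM the fusion has to be requested explicitly, either with `Math/fma` (available since Java 9) or with the `.fma` method on vectors as in the kernel below. A small scalar sketch with values picked so that the two results differ; `fma-demo` is a hypothetical name:

```clojure
(defn fma-demo
  "Compares a*a + c computed with two roundings against the fused version."
  []
  (let [eps (/ 1.0 1073741824.0)       ; 2^-30, exactly representable
        a   (+ 1.0 eps)                ; 1 + 2^-30
        c   (- (+ 1.0 (* 2.0 eps)))    ; -(1 + 2^-29)
        ;; a*a is exactly 1 + 2^-29 + 2^-60; the 2^-60 term is lost when the
        ;; product is rounded to a double before the add.
        unfused (+ (* a a) c)
        ;; fma rounds only once, after the add, so the 2^-60 term survives.
        fused   (Math/fma a a c)]
    {:unfused unfused :fused fused}))

(fma-demo)
;; :unfused is 0.0, :fused is 2^-60
```

Apart from saving an instruction per operation, this also means the fused and unfused kernels do not produce bit-identical matrices.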

The other thing I kept an eye on was garbage, because a hot loop that allocates will eventually stutter, and 0.86 ms means nothing if the GC interrupts you every second. The final kernel reads from plain float arrays, keeps everything in SIMD registers, and writes into one preallocated float array that goes directly to OpenGL. There is nothing for the collector to do. The heap sits flat at about 134 MB the whole time. For comparison, the broken first attempt was producing roughly 7.5 GB per second of temporary vector objects. Same algorithm, same API. The entire difference is whether the JIT gets to do its job.

Nobody would call this idiomatic Clojure. There is no immutability in the hot path, no laziness, no sequences. It reads like C with parentheses. I am fine with that. You write normal Clojure for the 99% of the program where performance does not matter, and for the one loop where it does, the language lets you go this low without leaving it.

The real thanks goes to the JVM developers. Project Panama’s Vector API is what made this possible: explicit SIMD from a dynamic language, landing within 20% of clang’s auto-vectorized output. Ten years ago my answer to this problem would have been writing the kernel in C and calling it through JNI. I am glad I do not have to do that anymore.

Clojure code:

(ns cpu-stress-test
    (:import [org.lwjgl BufferUtils]
      [org.lwjgl.glfw GLFW GLFWErrorCallback Callbacks
       GLFWKeyCallbackI GLFWCursorPosCallbackI GLFWFramebufferSizeCallbackI]
      [org.lwjgl.opengl GL GL11 GL15 GL20 GL30 GL31 GL33]
      [org.lwjgl.system MemoryUtil]
      [java.nio FloatBuffer]
      [java.util.concurrent Executors ExecutorService Callable]
      [jdk.incubator.vector FloatVector VectorShuffle])
    (:gen-class))

(set! *warn-on-reflection* true)
(set! *unchecked-math* :warn-on-boxed)

;; ── configuration ───────────────────────────────────────────────────────────

(def ^:const NUM-CUBES 100000)
(def ^:const CUBE-SPREAD 200.0)
(def ^:const ROTATION-SPEED 0.4)
(def ^:const INST-MOVE-SPEED 0.5)
(def ^:const OSC-AMPLITUDE 3.0)
(def ^:const WINDOW-W 1920)
(def ^:const WINDOW-H 1080)
(def ^:const NEAR-PLANE 0.1)
(def ^:const FAR-PLANE 500.0)
(def ^:const FPS-SAMPLE-COUNT 120)
(def ^:const STATIC-FLOATS 10)                              ; per cube: bp.xyz, rs.xyz, mf.xyz, scale
(def ^:const MODEL-FLOATS 16)                               ; per cube: column-major mat4
(def ^:const TAU 6.28318530717958647692)
(def ^:const CAM-MOVE-SPEED 50.0)
(def ^:const CAM-SENSITIVITY 0.002)
(def CAM-FOV-RAD (* Math/PI (/ 70.0 180.0)))

(def ncores (.availableProcessors (Runtime/getRuntime)))
;; thread counts to cycle through with T: 1,2,4,8,... clamped to core count
(def thread-counts
  (vec (sort (distinct (filter #(<= ^long % ^long ncores) [1 2 4 8 16 (long ncores)])))))

;; ── shaders ──────────────────────────────────────────────────────────────────

(def vertex-shader-src "
#version 330 core
layout (location = 0) in vec3 aPos;
layout (location = 1) in vec3 aNormal;
layout (location = 3) in vec4 aModel0;
layout (location = 4) in vec4 aModel1;
layout (location = 5) in vec4 aModel2;
layout (location = 6) in vec4 aModel3;
layout (location = 7) in vec3 aColor;
uniform mat4 uView;
uniform mat4 uProjection;
out vec3 vColor;
out vec3 vNormal;
out vec3 vFragPos;
void main() {
    mat4 model = mat4(aModel0, aModel1, aModel2, aModel3);
    vec4 worldPos = model * vec4(aPos, 1.0);
    gl_Position = uProjection * uView * worldPos;
    vColor = aColor;
    vNormal = mat3(model) * aNormal;
    vFragPos = worldPos.xyz;
}
")

(def fragment-shader-src "
#version 330 core
in vec3 vColor;
in vec3 vNormal;
in vec3 vFragPos;
out vec4 FragColor;
uniform vec3 uLightDir;
uniform vec3 uViewPos;
void main() {
    vec3 norm = normalize(vNormal);
    vec3 lightDir = normalize(uLightDir);
    vec3 ambient = 0.15 * vColor;
    float diff = max(dot(norm, lightDir), 0.0);
    vec3 diffuse = diff * vColor;
    vec3 viewDir = normalize(uViewPos - vFragPos);
    vec3 halfDir = normalize(lightDir + viewDir);
    float spec = pow(max(dot(norm, halfDir), 0.0), 32.0);
    vec3 specular = 0.3 * spec * vec3(1.0);
    float dist = length(vFragPos - uViewPos);
    float fog = exp(-dist * 0.008);
    fog = clamp(fog, 0.0, 1.0);
    vec3 result = ambient + diffuse + specular;
    result = mix(vec3(0.02, 0.02, 0.05), result, fog);
    FragColor = vec4(result, 1.0);
}
")

;; ── cube geometry (pos.xyz, normal.xyz) ─────────────────────────────────────

(def cube-vertices
  (float-array
    [-0.5 -0.5 0.5 0 0 1 0.5 -0.5 0.5 0 0 1 0.5 0.5 0.5 0 0 1
     0.5 0.5 0.5 0 0 1 -0.5 0.5 0.5 0 0 1 -0.5 -0.5 0.5 0 0 1
     -0.5 -0.5 -0.5 0 0 -1 -0.5 0.5 -0.5 0 0 -1 0.5 0.5 -0.5 0 0 -1
     0.5 0.5 -0.5 0 0 -1 0.5 -0.5 -0.5 0 0 -1 -0.5 -0.5 -0.5 0 0 -1
     -0.5 0.5 -0.5 0 1 0 -0.5 0.5 0.5 0 1 0 0.5 0.5 0.5 0 1 0
     0.5 0.5 0.5 0 1 0 0.5 0.5 -0.5 0 1 0 -0.5 0.5 -0.5 0 1 0
     -0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 -0.5 0 -1 0 0.5 -0.5 0.5 0 -1 0
     0.5 -0.5 0.5 0 -1 0 -0.5 -0.5 0.5 0 -1 0 -0.5 -0.5 -0.5 0 -1 0
     0.5 -0.5 -0.5 1 0 0 0.5 0.5 -0.5 1 0 0 0.5 0.5 0.5 1 0 0
     0.5 0.5 0.5 1 0 0 0.5 -0.5 0.5 1 0 0 0.5 -0.5 -0.5 1 0 0
     -0.5 -0.5 -0.5 -1 0 0 -0.5 -0.5 0.5 -1 0 0 -0.5 0.5 0.5 -1 0 0
     -0.5 0.5 0.5 -1 0 0 -0.5 0.5 -0.5 -1 0 0 -0.5 -0.5 -0.5 -1 0 0]))

;; ── deterministic PRNG (SplitMix64, seed 42) ─────────────────────────────────

(def ^:const SM-GAMMA (unchecked-long (Long/parseUnsignedLong "9E3779B97F4A7C15" 16)))
(def ^:const SM-MIX1 (unchecked-long (Long/parseUnsignedLong "BF58476D1CE4E5B9" 16)))
(def ^:const SM-MIX2 (unchecked-long (Long/parseUnsignedLong "94D049BB133111EB" 16)))
(def ^:const INV-2P24 (/ 1.0 16777216.0))

;; Returns ^double, not ^float: Clojure has no primitive float fn return, so a
;; ^float hint would box every result and poison downstream arithmetic. The cloud
;; needn't be bit-identical to C (the integer RNG state stream is), so double is fine.
(defn sm-mix ^double [^long s]
      (let [z (unchecked-multiply (bit-xor s (unsigned-bit-shift-right s 30)) SM-MIX1)
            z (unchecked-multiply (bit-xor z (unsigned-bit-shift-right z 27)) SM-MIX2)
            z (bit-xor z (unsigned-bit-shift-right z 31))]
           (* (double (unsigned-bit-shift-right z 40)) INV-2P24)))

;; ── instance generation: static CPU inputs + static GPU color buffer ─────────

(defn generate-instance-data []
      (let [^floats statics (float-array (* NUM-CUBES STATIC-FLOATS))
            ^FloatBuffer color (BufferUtils/createFloatBuffer (* NUM-CUBES 3))
            half (/ CUBE-SPREAD 2.0)]
           (loop [i 0, st (long 42)]
                 (if (< i NUM-CUBES)
                   (let [s (unchecked-add st SM-GAMMA) px (- (* (sm-mix s) CUBE-SPREAD) half)
                         s (unchecked-add s SM-GAMMA) py (- (* (sm-mix s) CUBE-SPREAD) half)
                         s (unchecked-add s SM-GAMMA) pz (- (* (sm-mix s) CUBE-SPREAD) half)
                         s (unchecked-add s SM-GAMMA) hue (* (sm-mix s) TAU)
                         r (+ 0.5 (* 0.5 (Math/sin hue)))
                         g (+ 0.5 (* 0.5 (Math/sin (+ hue 2.094))))
                         b (+ 0.5 (* 0.5 (Math/sin (+ hue 4.188))))
                         s (unchecked-add s SM-GAMMA) sc (+ 0.3 (* (sm-mix s) 0.7))
                         s (unchecked-add s SM-GAMMA) rsx (* (- (sm-mix s) 0.5) (* ROTATION-SPEED 2.0))
                         s (unchecked-add s SM-GAMMA) rsy (* (- (sm-mix s) 0.5) (* ROTATION-SPEED 2.0))
                         s (unchecked-add s SM-GAMMA) rsz (* (- (sm-mix s) 0.5) (* ROTATION-SPEED 2.0))
                         s (unchecked-add s SM-GAMMA) mfx (* (+ 0.5 (sm-mix s)) INST-MOVE-SPEED)
                         s (unchecked-add s SM-GAMMA) mfy (* (+ 0.5 (sm-mix s)) INST-MOVE-SPEED)
                         s (unchecked-add s SM-GAMMA) mfz (* (+ 0.5 (sm-mix s)) INST-MOVE-SPEED)
                         si (* i STATIC-FLOATS)]
                        (aset statics si (float px))
                        (aset statics (+ si 1) (float py))
                        (aset statics (+ si 2) (float pz))
                        (aset statics (+ si 3) (float rsx))
                        (aset statics (+ si 4) (float rsy))
                        (aset statics (+ si 5) (float rsz))
                        (aset statics (+ si 6) (float mfx))
                        (aset statics (+ si 7) (float mfy))
                        (aset statics (+ si 8) (float mfz))
                        (aset statics (+ si 9) (float sc))
                        (.put color (float r)) (.put color (float g)) (.put color (float b))
                        (recur (inc i) s))
                   (do
                     (.flip color)
                     ;; transpose AoS -> SoA (one-time) so the SIMD path can do contiguous
                     ;; per-field vector loads: soa = [bpx[] bpy[] bpz[] rsx[]..rsz[] mfx[]..mfz[] scale[]]
                     (let [soa (object-array STATIC-FLOATS)]
                          (dotimes [k STATIC-FLOATS] (aset soa (int k) (float-array NUM-CUBES)))
                          (dotimes [c NUM-CUBES]
                                   (let [cs (* c STATIC-FLOATS)]
                                        (dotimes [k STATIC-FLOATS]
                                                 (aset ^floats (aget soa (int k)) (int c) (aget statics (int (+ cs k)))))))
                          [statics soa color]))))))

;; ── shared polynomial sin/cos (identical algorithm to the C version) ─────────
;; Replaces Math/sin|cos (fdlibm — not a HW intrinsic on Apple Silicon) so the
;; C-vs-Clojure comparison isolates language/JIT instead of measuring two
;; different transcendental implementations. Range-reduce to [-pi, pi], parabola
;; + one refinement. ^double on the arg vector ⇒ true primitive in/out, no boxing.

(def ^:const INV-TWO-PI 0.15915494309189535)
(def ^:const TWO-PI 6.283185307179586)
(def ^:const HALF-PI 1.5707963267948966)
(def ^:const B-SIN 1.2732395447351628)                      ;  4/pi
(def ^:const C-SIN -0.40528473456935109)                    ; -4/pi^2
(def ^:const P-SIN 0.225)

(defn fast-sin ^double [^double x]
      (let [k (double (long (+ (* x INV-TWO-PI) (if (>= x 0.0) 0.5 -0.5)))) ; round to nearest
            r (- x (* k TWO-PI))                            ; r in [-pi, pi]
            ar (Math/abs r)
            y (+ (* B-SIN r) (* C-SIN r ar))
            ay (Math/abs y)]
           (+ (* P-SIN (- (* y ay) y)) y)))

;; ── SIMD: the same polynomial sin on a FloatVector (JDK Panama Vector API) ───
;; Lane-parallel version of fast-sin. Range reduction uses the float "magic
;; number" trick: (v + 1.5*2^23) - 1.5*2^23 == round-to-nearest, all add/sub, so
;; it lowers to NEON — no per-lane libm and no copysign branch.
;;
;; CRITICAL for performance: the Vector API only compiles to actual SIMD when
;; C2 can constant-fold the species. A species stored in a Clojure Var is an
;; opaque field load, which silently drops every op onto the boxed Java fallback
;; path (~10x slower than working SIMD, ~3x slower than scalar). So the species is referenced as the static
;; final field FloatVector/SPECIES_128 at every call site (NEON = 128-bit = 4
;; float lanes — this project is pinned to Apple Silicon), and the sin polynomial
;; is a macro, not a fn, so the whole kernel is one inlinable method body.

(def ^:const LANES 4)                                       ; SPECIES_128 float lanes
(def ^:const MAGIC 12582912.0)                              ; 1.5 * 2^23

(defmacro vbc
          "Broadcast a compile-time constant to a FloatVector. Inside a loop C2 hoists
          the resulting Replicate node out as loop-invariant, so this costs nothing."
          [c]
          `(FloatVector/broadcast FloatVector/SPECIES_128 (float ~c)))

(defmacro vsin
          "Inline lanewise polynomial sin (same algorithm as fast-sin) on a FloatVector.
          Uses fma like the clang -O2 build of the C kernel (fp-contract is on there)."
          [x]
          `(let [x# ~x
                 v# (.mul x# (float INV-TWO-PI))
                 k# (.sub (.add v# (float MAGIC)) (float MAGIC)) ; round-to-nearest, lanewise
                 r# (.fma k# (vbc (- TWO-PI)) x#)           ; r = x - k*2pi, in [-pi, pi]
                 ar# (.abs r#)
                 y# (.mul r# (.fma ar# (vbc C-SIN) (vbc B-SIN))) ; y = B*r + C*r*|r|
                 ay# (.abs y#)]
                (.fma (.fma y# ay# (.neg y#)) (vbc P-SIN) y#))) ; P*(y*|y| - y) + y

;; ── THE CPU WORKLOAD ─────────────────────────────────────────────────────────
;; Build column-major model matrices for cubes [start, end) at time t, writing
;; into the heap float[] `model` (handed to glBufferData via LWJGL's array
;; overload — a pinned zero-copy JNI pass, symmetric with C handing malloc'd
;; memory to GL). Params are left unhinted (primitive-hinted fns are limited to 4
;; args) and re-bound to primitives inside, so the body is fully primitive with no boxing.

(defn compute-range! [model statics start end t]
      (let [^floats model model
            ^floats statics statics
            end (long end)
            t (double t)
            osc (double OSC-AMPLITUDE)]
           (loop [i (long start)]
                 (when (< i end)
                       (let [si (* i STATIC-FLOATS)
                             apx (+ (aget statics si) (* osc (fast-sin (* (aget statics (+ si 6)) t))))
                             apy (+ (aget statics (+ si 1)) (* osc (fast-sin (* (aget statics (+ si 7)) t))))
                             apz (+ (aget statics (+ si 2)) (* osc (fast-sin (* (aget statics (+ si 8)) t))))
                             ax (* (aget statics (+ si 3)) t)
                             ay (* (aget statics (+ si 4)) t)
                             az (* (aget statics (+ si 5)) t)
                             scale (double (aget statics (+ si 9)))
                             cx (fast-sin (+ ax HALF-PI)) sx (fast-sin ax)
                             cy (fast-sin (+ ay HALF-PI)) sy (fast-sin ay)
                             cz (fast-sin (+ az HALF-PI)) sz (fast-sin az)
                             base (* i MODEL-FLOATS)]
                            (aset model base (float (* scale (* cy cz))))
                            (aset model (+ base 1) (float (* scale (+ (* sx sy cz) (* cx sz)))))
                            (aset model (+ base 2) (float (* scale (+ (* (- cx) sy cz) (* sx sz)))))
                            (aset model (+ base 3) (float 0.0))
                            (aset model (+ base 4) (float (* scale (- (* cy sz)))))
                            (aset model (+ base 5) (float (* scale (+ (* (- sx) sy sz) (* cx cz)))))
                            (aset model (+ base 6) (float (* scale (+ (* cx sy sz) (* sx cz)))))
                            (aset model (+ base 7) (float 0.0))
                            (aset model (+ base 8) (float (* scale sy)))
                            (aset model (+ base 9) (float (* scale (- (* sx cy)))))
                            (aset model (+ base 10) (float (* scale (* cx cy))))
                            (aset model (+ base 11) (float 0.0))
                            (aset model (+ base 12) (float apx))
                            (aset model (+ base 13) (float apy))
                            (aset model (+ base 14) (float apz))
                            (aset model (+ base 15) (float 1.0))
                            (recur (inc i)))))))

;; SIMD variant: same math, LANES cubes at a time, reading SoA arrays with
;; contiguous vector loads. The 16 matrix elements are computed as 16 FloatVectors
;; (lane j = cube i+j), then transposed in registers — four 4x4 transposes built
;; from two-vector rearranges (NEON tbl) — so each cube's 16 floats are written
;; with 4 contiguous vector stores. No scratch buffer, no scalar scatter.
;; run-pass! hands out LANES-aligned chunks, so the vector loop normally covers
;; the whole range; a scalar tail covers any remainder at the end.
;;
;; transpose4 turns column vectors A,B,C,W (lane j = cube j) into row vectors
;; T_j = [A_j, B_j, C_j, W_j] and stores them at model[(i+j)*16 + off]:
;;   ab-lo = [A0 B0 A1 B1]   cw-lo = [C0 W0 C1 W1]   ->  T0 = [A0 B0 C0 W0]
;;   ab-hi = [A2 B2 A3 B3]   cw-hi = [C2 W2 C3 W3]       T1 = [A1 B1 C1 W1] ...
;; In a two-vector rearrange, shuffle index k in [0,4) picks this[k] and the
;; "exceptional" index k-4 (negative) picks other[k].
(defmacro ^:private transpose4-store!
          [model i off A B C W zip-lo zip-hi cat-lo cat-hi]
          `(let [ab-lo# (.rearrange ~A ~zip-lo ~B)
                 ab-hi# (.rearrange ~A ~zip-hi ~B)
                 cw-lo# (.rearrange ~C ~zip-lo ~W)
                 cw-hi# (.rearrange ~C ~zip-hi ~W)]
                (.intoArray (.rearrange ab-lo# ~cat-lo cw-lo#) ~model (int (+ (* ~i 16) ~off)))
                (.intoArray (.rearrange ab-lo# ~cat-hi cw-lo#) ~model (int (+ (* (+ ~i 1) 16) ~off)))
                (.intoArray (.rearrange ab-hi# ~cat-lo cw-hi#) ~model (int (+ (* (+ ~i 2) 16) ~off)))
                (.intoArray (.rearrange ab-hi# ~cat-hi cw-hi#) ~model (int (+ (* (+ ~i 3) 16) ~off)))))

(defn compute-range-simd! [model soa start end t]
      (let [^floats model model
            ^objects soa soa
            start (long start)
            end (long end)
            tf (float t)
            osc (float OSC-AMPLITUDE)
            hp (float HALF-PI)
            lanes (long LANES)
            ^floats bpx (aget soa 0) ^floats bpy (aget soa 1) ^floats bpz (aget soa 2)
            ^floats rsx (aget soa 3) ^floats rsy (aget soa 4) ^floats rsz (aget soa 5)
            ^floats mfx (aget soa 6) ^floats mfy (aget soa 7) ^floats mfz (aget soa 8)
            ^floats scl (aget soa 9)
            zero (FloatVector/zero FloatVector/SPECIES_128)
            one (FloatVector/broadcast FloatVector/SPECIES_128 (float 1.0))
            zip-lo (VectorShuffle/fromValues FloatVector/SPECIES_128 (int-array [0 -4 1 -3]))
            zip-hi (VectorShuffle/fromValues FloatVector/SPECIES_128 (int-array [2 -2 3 -1]))
            cat-lo (VectorShuffle/fromValues FloatVector/SPECIES_128 (int-array [0 1 -4 -3]))
            cat-hi (VectorShuffle/fromValues FloatVector/SPECIES_128 (int-array [2 3 -2 -1]))
            simd-end (+ start (* (quot (- end start) lanes) lanes))]
           (loop [i start]
                 (when (< i simd-end)
                       (let [ii (int i)
                             BPX (FloatVector/fromArray FloatVector/SPECIES_128 bpx ii)
                             BPY (FloatVector/fromArray FloatVector/SPECIES_128 bpy ii)
                             BPZ (FloatVector/fromArray FloatVector/SPECIES_128 bpz ii)
                             RSX (FloatVector/fromArray FloatVector/SPECIES_128 rsx ii)
                             RSY (FloatVector/fromArray FloatVector/SPECIES_128 rsy ii)
                             RSZ (FloatVector/fromArray FloatVector/SPECIES_128 rsz ii)
                             MFX (FloatVector/fromArray FloatVector/SPECIES_128 mfx ii)
                             MFY (FloatVector/fromArray FloatVector/SPECIES_128 mfy ii)
                             MFZ (FloatVector/fromArray FloatVector/SPECIES_128 mfz ii)
                             SCALE (FloatVector/fromArray FloatVector/SPECIES_128 scl ii)
                             APX (.add BPX (.mul (vsin (.mul MFX tf)) osc))
                             APY (.add BPY (.mul (vsin (.mul MFY tf)) osc))
                             APZ (.add BPZ (.mul (vsin (.mul MFZ tf)) osc))
                             AX (.mul RSX tf) AY (.mul RSY tf) AZ (.mul RSZ tf)
                             CX (vsin (.add AX hp)) SX (vsin AX)
                             CY (vsin (.add AY hp)) SY (vsin AY)
                             CZ (vsin (.add AZ hp)) SZ (vsin AZ)
                             SXSY (.mul SX SY)
                             CXSY (.mul CX SY)
                             M0 (.mul SCALE (.mul CY CZ))
                             M1 (.mul SCALE (.fma SXSY CZ (.mul CX SZ)))
                             M2 (.mul SCALE (.fma (.neg CXSY) CZ (.mul SX SZ)))
                             M4 (.mul SCALE (.neg (.mul CY SZ)))
                             M5 (.mul SCALE (.fma (.neg SXSY) SZ (.mul CX CZ)))
                             M6 (.mul SCALE (.fma CXSY SZ (.mul SX CZ)))
                             M8 (.mul SCALE SY)
                             M9 (.mul SCALE (.neg (.mul SX CY)))
                             M10 (.mul SCALE (.mul CX CY))]
                            (transpose4-store! model ii 0 M0 M1 M2 zero zip-lo zip-hi cat-lo cat-hi)
                            (transpose4-store! model ii 4 M4 M5 M6 zero zip-lo zip-hi cat-lo cat-hi)
                            (transpose4-store! model ii 8 M8 M9 M10 zero zip-lo zip-hi cat-lo cat-hi)
                            (transpose4-store! model ii 12 APX APY APZ one zip-lo zip-hi cat-lo cat-hi)
                            (recur (+ i lanes)))))
           ;; scalar tail (only runs if the range wasn't LANES-aligned)
           (loop [i simd-end]
                 (when (< i end)
                       (let [ix (int i)
                             td (double tf)
                             od (double osc)
                             apx (+ (aget bpx ix) (* od (fast-sin (* (aget mfx ix) td))))
                             apy (+ (aget bpy ix) (* od (fast-sin (* (aget mfy ix) td))))
                             apz (+ (aget bpz ix) (* od (fast-sin (* (aget mfz ix) td))))
                             ax (* (aget rsx ix) td) ay (* (aget rsy ix) td) az (* (aget rsz ix) td)
                             sc (double (aget scl ix))
                             cx (fast-sin (+ ax HALF-PI)) sx (fast-sin ax)
                             cy (fast-sin (+ ay HALF-PI)) sy (fast-sin ay)
                             cz (fast-sin (+ az HALF-PI)) sz (fast-sin az)
                             base (* i MODEL-FLOATS)]
                            (aset model base (float (* sc (* cy cz))))
                            (aset model (+ base 1) (float (* sc (+ (* sx sy cz) (* cx sz)))))
                            (aset model (+ base 2) (float (* sc (+ (* (- cx) sy cz) (* sx sz)))))
                            (aset model (+ base 3) (float 0.0))
                            (aset model (+ base 4) (float (* sc (- (* cy sz)))))
                            (aset model (+ base 5) (float (* sc (+ (* (- sx) sy sz) (* cx cz)))))
                            (aset model (+ base 6) (float (* sc (+ (* cx sy sz) (* sx cz)))))
                            (aset model (+ base 7) (float 0.0))
                            (aset model (+ base 8) (float (* sc sy)))
                            (aset model (+ base 9) (float (* sc (- (* sx cy)))))
                            (aset model (+ base 10) (float (* sc (* cx cy))))
                            (aset model (+ base 11) (float 0.0))
                            (aset model (+ base 12) (float apx))
                            (aset model (+ base 13) (float apy))
                            (aset model (+ base 14) (float apz))
                            (aset model (+ base 15) (float 1.0)))
                       (recur (inc i))))))

;; Run the transform pass for the whole cloud, fanned across `nt` threads, in
;; either scalar (AoS `statics`) or SIMD (SoA `soa`) mode. The calling thread
;; computes chunk 0 itself while the pool runs chunks 1..nt-1 (cheaper than
;; parking the caller in invokeAll — closer to OpenMP's fork-join cost).
(defn run-pass! [^ExecutorService pool model statics soa nt t simd?]
      (let [n (long NUM-CUBES)
            nt (long nt)
            lanes (long LANES)
            run (fn [s e] (if simd?
                            (compute-range-simd! model soa s e t)
                            (compute-range! model statics s e t)))]
           (if (== nt 1)
             (run 0 n)
             (let [base (quot (+ n (dec nt)) nt)            ; ceil(n / nt)
                   chunk (long (if simd?
                                 (* (quot (+ base (dec lanes)) lanes) lanes) ; round up to LANES
                                 base))
                   futs (mapv (fn [c]
                                  (let [start (* (long c) chunk)
                                        e (+ start chunk)
                                        end (if (< e n) e n)]
                                       (.submit pool (reify Callable
                                                            (call [_] (when (< start n) (run start end)) nil)))))
                              (range 1 nt))]
                  (run 0 (min chunk n))
                  (run! (fn [^java.util.concurrent.Future f] (.get f)) futs)
                  nil))))

;; ── minimal column-major math (view + projection only) ───────────────────────

(defn perspective! [^FloatBuffer b fovy aspect near far]
      (let [fovy (double fovy) aspect (double aspect) near (double near) far (double far)
            t (Math/tan (/ fovy 2.0))]
           (dotimes [i 16] (.put b (int i) (float 0)))
           (doto b
                 (.put 0 (float (/ 1.0 (* aspect t))))
                 (.put 5 (float (/ 1.0 t)))
                 (.put 10 (float (/ (- (+ far near)) (- far near))))
                 (.put 11 (float -1.0))
                 (.put 14 (float (/ (- (* 2.0 far near)) (- far near)))))))

(defn look-at! [^FloatBuffer b ex ey ez cx cy cz ux uy uz]
      (let [ex (double ex) ey (double ey) ez (double ez)
            cx (double cx) cy (double cy) cz (double cz)
            ux (double ux) uy (double uy) uz (double uz)
            dx (- cx ex) dy (- cy ey) dz (- cz ez)
            dl (Math/sqrt (+ (* dx dx) (* dy dy) (* dz dz)))
            fx (/ dx dl) fy (/ dy dl) fz (/ dz dl)
            ax (- (* fy uz) (* fz uy)) ay (- (* fz ux) (* fx uz)) az (- (* fx uy) (* fy ux))
            al (Math/sqrt (+ (* ax ax) (* ay ay) (* az az)))
            sx (/ ax al) sy (/ ay al) sz (/ az al)
            vx (- (* sy fz) (* sz fy)) vy (- (* sz fx) (* sx fz)) vz (- (* sx fy) (* sy fx))]
           (doto b
                 (.put 0 (float sx)) (.put 1 (float vx)) (.put 2 (float (- fx))) (.put 3 (float 0))
                 (.put 4 (float sy)) (.put 5 (float vy)) (.put 6 (float (- fy))) (.put 7 (float 0))
                 (.put 8 (float sz)) (.put 9 (float vz)) (.put 10 (float (- fz))) (.put 11 (float 0))
                 (.put 12 (float (- (+ (* sx ex) (* sy ey) (* sz ez)))))
                 (.put 13 (float (- (+ (* vx ex) (* vy ey) (* vz ez)))))
                 (.put 14 (float (+ (* fx ex) (* fy ey) (* fz ez))))
                 (.put 15 (float 1)))))

;; ── shader compilation ────────────────────────────────────────────────────────

(defn compile-shader [^CharSequence src ^long kind ^String label]
      (let [id (GL20/glCreateShader (int kind))]
           (GL20/glShaderSource id src)
           (GL20/glCompileShader id)
           (when (zero? (GL20/glGetShaderi id GL20/GL_COMPILE_STATUS))
                 (binding [*out* *err*]
                          (println (str "ERROR::SHADER::" label "::COMPILATION_FAILED"))
                          (println (GL20/glGetShaderInfoLog id)))
                 (throw (ex-info "shader compilation failed" {:label label})))
           id))

(defn build-program [vs-src fs-src]
      (let [vs (compile-shader vs-src GL20/GL_VERTEX_SHADER "VERTEX")
            fs (compile-shader fs-src GL20/GL_FRAGMENT_SHADER "FRAGMENT")
            program (GL20/glCreateProgram)]
           (GL20/glAttachShader program vs)
           (GL20/glAttachShader program fs)
           (GL20/glLinkProgram program)
           (GL20/glDeleteShader vs)
           (GL20/glDeleteShader fs)
           (when (zero? (GL20/glGetProgrami program GL20/GL_LINK_STATUS))
                 (binding [*out* *err*]
                          (println "ERROR::SHADER::PROGRAM::LINKING_FAILED")
                          (println (GL20/glGetProgramInfoLog program)))
                 (throw (ex-info "program linking failed" {})))
           program))

;; ── shared state for GLFW callbacks ──────────────────────────────────────────

(def cam-pos (atom [0.0 50.0 150.0]))
(def cam-yaw (atom 0.0))
(def cam-pitch (atom 0.0))
(def first-mouse (atom true))
(def last-mouse (atom [0.0 0.0]))
(def mouse-captured (atom true))
(def vsync-on (atom false))
(def show-stats (atom true))
(def fb-size (atom [WINDOW-W WINDOW-H]))
(def thread-idx (atom 0))                                   ; index into thread-counts
(def simd? (atom true))                                     ; true = Panama SIMD (default), false = scalar

(defn current-threads ^long [] (long (nth thread-counts @thread-idx)))

(defn process-keyboard [^long window ^double dt]
      (let [yaw (double @cam-yaw)
            pitch (double @cam-pitch)
            speed (* CAM-MOVE-SPEED dt)
            cp (Math/cos pitch)
            fx (* (Math/sin yaw) cp)
            fz (* (- (Math/cos yaw)) cp)
            len (Math/sqrt (+ (* fx fx) (* fz fz)))
            nfx (if (zero? len) fx (/ fx len))
            nfz (if (zero? len) fz (/ fz len))
            rx (Math/cos yaw)
            rz (Math/sin yaw)
            down? (fn [k] (= (GLFW/glfwGetKey window (int k)) GLFW/GLFW_PRESS))
            w? (down? GLFW/GLFW_KEY_W) s? (down? GLFW/GLFW_KEY_S)
            a? (down? GLFW/GLFW_KEY_A) d? (down? GLFW/GLFW_KEY_D)
            up? (down? GLFW/GLFW_KEY_SPACE) dn? (down? GLFW/GLFW_KEY_LEFT_SHIFT)
            ddx (+ (if w? (* nfx speed) 0.0) (if s? (- (* nfx speed)) 0.0)
                   (if d? (* rx speed) 0.0) (if a? (- (* rx speed)) 0.0))
            ddz (+ (if w? (* nfz speed) 0.0) (if s? (- (* nfz speed)) 0.0)
                   (if d? (* rz speed) 0.0) (if a? (- (* rz speed)) 0.0))
            ddy (+ (if up? speed 0.0) (if dn? (- speed) 0.0))]
           (when (or w? s? a? d? up? dn?)
                 (swap! cam-pos (fn [[x y z]] [(+ (double x) ddx) (+ (double y) ddy) (+ (double z) ddz)])))))

(defn make-key-callback []
      (reify GLFWKeyCallbackI
             (invoke [_ window key _scancode action _mods]
                     (when (= action GLFW/GLFW_PRESS)
                           (condp = key
                                  GLFW/GLFW_KEY_ESCAPE (GLFW/glfwSetWindowShouldClose window true)
                                  GLFW/GLFW_KEY_TAB (do (swap! mouse-captured not)
                                                        (if @mouse-captured
                                                          (do (GLFW/glfwSetInputMode window GLFW/GLFW_CURSOR GLFW/GLFW_CURSOR_DISABLED)
                                                              (reset! first-mouse true))
                                                          (GLFW/glfwSetInputMode window GLFW/GLFW_CURSOR GLFW/GLFW_CURSOR_NORMAL)))
                                  GLFW/GLFW_KEY_T (do (swap! thread-idx (fn [i] (mod (inc (long i)) (count thread-counts))))
                                                      (println "Worker threads:" (current-threads)))
                                  GLFW/GLFW_KEY_M (do (swap! simd? not)
                                                      (println "Compute mode:" (if @simd? "SIMD (Panama)" "scalar")))
                                  GLFW/GLFW_KEY_V (do (swap! vsync-on not)
                                                      (GLFW/glfwSwapInterval (int (if @vsync-on 1 0)))
                                                      (println "VSync:" (if @vsync-on "ON" "OFF")))
                                  GLFW/GLFW_KEY_F1 (swap! show-stats not)
                                  nil)))))

(defn make-cursor-callback []
      (reify GLFWCursorPosCallbackI
             (invoke [_ _window xpos ypos]
                     (if @first-mouse
                       (do (reset! last-mouse [xpos ypos]) (reset! first-mouse false))
                       (when @mouse-captured
                             (let [[lx ly] @last-mouse
                                   xoff (- (double xpos) (double lx))
                                   yoff (- (double ypos) (double ly))
                                   max-pitch (- (/ Math/PI 2.0) 0.1)]
                                  (reset! last-mouse [xpos ypos])
                                  (swap! cam-yaw (fn [y] (+ (double y) (* xoff CAM-SENSITIVITY))))
                                  (swap! cam-pitch (fn [p]
                                                       (let [np (- (double p) (* yoff CAM-SENSITIVITY))]
                                                            (cond (> np max-pitch) max-pitch
                                                                  (< np (- max-pitch)) (- max-pitch)
                                                                  :else np))))))))))

(defn make-fb-callback []
      (reify GLFWFramebufferSizeCallbackI
             (invoke [_ _window w h]
                     (reset! fb-size [w h])
                     (GL11/glViewport 0 0 (int w) (int h)))))

(def log-stats? (some? (System/getenv "STRESS_STATS")))     ; STRESS_STATS=1 -> also print stats to stdout

(defn update-title [window threads cpu-ms fps frame-ms]
      (let [window (long window)
            mode (if @simd? "SIMD" "scalar")]
           (if @show-stats
             (let [rt (Runtime/getRuntime)
                   used-mb (quot (- (.totalMemory rt) (.freeMemory rt)) (* 1024 1024))
                   max-mb (quot (.maxMemory rt) (* 1024 1024))
                   title (format "CPU Stress (Clojure, %s, %d thread%s) | %d cubes | CPU: %.2f ms | FPS: %.1f | %.2f ms | Mem: %d/%d MB | VSync: %s"
                                 mode threads (if (== (long threads) 1) "" "s") NUM-CUBES cpu-ms fps frame-ms used-mb max-mb (if @vsync-on "ON" "OFF"))]
                  (when log-stats? (println title) (flush))
                  (GLFW/glfwSetWindowTitle window title))
             (GLFW/glfwSetWindowTitle window "CPU Stress Test - 100K Cubes"))))

;; ── main ─────────────────────────────────────────────────────────────────────

(defn -main [& _args]
      (.set (GLFWErrorCallback/createPrint System/err))
      (when-not (GLFW/glfwInit)
                (binding [*out* *err*] (println "GLFW init failed"))
                (System/exit 1))

      (GLFW/glfwDefaultWindowHints)
      (GLFW/glfwWindowHint GLFW/GLFW_CONTEXT_VERSION_MAJOR 3)
      (GLFW/glfwWindowHint GLFW/GLFW_CONTEXT_VERSION_MINOR 3)
      (GLFW/glfwWindowHint GLFW/GLFW_OPENGL_PROFILE GLFW/GLFW_OPENGL_CORE_PROFILE)
      (GLFW/glfwWindowHint GLFW/GLFW_OPENGL_FORWARD_COMPAT GLFW/GLFW_TRUE)

      (let [window (GLFW/glfwCreateWindow (int WINDOW-W) (int WINDOW-H)
                                          "CPU Stress Test - 100K Cubes (Clojure)"
                                          MemoryUtil/NULL MemoryUtil/NULL)]
           (when (= window MemoryUtil/NULL)
                 (binding [*out* *err*] (println "Window creation failed"))
                 (GLFW/glfwTerminate)
                 (System/exit 1))

           (GLFW/glfwMakeContextCurrent window)
           (GLFW/glfwSwapInterval 0)
           (let [wbuf (int-array 1) hbuf (int-array 1)]
                (GLFW/glfwGetFramebufferSize window wbuf hbuf)
                (reset! fb-size [(aget wbuf 0) (aget hbuf 0)]))
           (GLFW/glfwSetFramebufferSizeCallback window (make-fb-callback))
           (GLFW/glfwSetCursorPosCallback window (make-cursor-callback))
           (GLFW/glfwSetKeyCallback window (make-key-callback))
           (GLFW/glfwSetInputMode window GLFW/GLFW_CURSOR GLFW/GLFW_CURSOR_DISABLED)

           (GL/createCapabilities)

           (println (format (str "\n"
                                 "CPU STRESS TEST - %d CPU-TRANSFORMED CUBES (Clojure / LWJGL)\n"
                                 "OpenGL: %s\nGPU:    %s\n"
                                 "Cores: %d | thread cycle (T): %s | SIMD lanes: %d (M to toggle Panama SIMD)\n"
                                 "WASD/Space/Shift fly | Mouse look | Tab cursor | T threads | M SIMD | V vsync | F1 stats | Esc quit\n")
                            NUM-CUBES
                            (GL11/glGetString GL11/GL_VERSION)
                            (GL11/glGetString GL11/GL_RENDERER)
                            ncores (pr-str thread-counts) LANES))

           (GL11/glEnable GL11/GL_DEPTH_TEST)
           (GL11/glDepthFunc GL11/GL_LESS)
           (GL11/glEnable GL11/GL_CULL_FACE)
           (GL11/glCullFace GL11/GL_BACK)

           (let [shader (int (build-program vertex-shader-src fragment-shader-src))
                 [statics soa color-buf] (generate-instance-data)
                 model-bytes (long (* NUM-CUBES MODEL-FLOATS 4))
                 ^floats model (float-array (* NUM-CUBES MODEL-FLOATS))
                 ^ExecutorService pool (Executors/newFixedThreadPool ncores)
                 cube-buf (doto (BufferUtils/createFloatBuffer (alength ^floats cube-vertices))
                                (.put ^floats cube-vertices) (.flip))
                 stride-cube (int (* 6 4))
                 stride-model (int (* MODEL-FLOATS 4))
                 vao (GL30/glGenVertexArrays)
                 vbo-cube (GL15/glGenBuffers)
                 vbo-model (GL15/glGenBuffers)
                 vbo-color (GL15/glGenBuffers)]

                (println (format "Generated %d cubes (streaming %.2f MB/frame of model matrices)"
                                 NUM-CUBES (/ (double model-bytes) (* 1024.0 1024.0))))
                (let [[fw fh] @fb-size] (println (format "Framebuffer: %dx%d" (long fw) (long fh))))

                (GL30/glBindVertexArray vao)

                ;; per-vertex cube geometry
                (GL15/glBindBuffer GL15/GL_ARRAY_BUFFER vbo-cube)
                (GL15/glBufferData GL15/GL_ARRAY_BUFFER ^FloatBuffer cube-buf GL15/GL_STATIC_DRAW)
                (GL20/glVertexAttribPointer 0 3 GL11/GL_FLOAT false stride-cube (long 0))
                (GL20/glEnableVertexAttribArray 0)
                (GL20/glVertexAttribPointer 1 3 GL11/GL_FLOAT false stride-cube (long (* 3 4)))
                (GL20/glEnableVertexAttribArray 1)

                ;; per-instance model matrix (mat4 = locations 3..6), streamed each frame
                (GL15/glBindBuffer GL15/GL_ARRAY_BUFFER vbo-model)
                (GL15/glBufferData GL15/GL_ARRAY_BUFFER model-bytes GL15/GL_STREAM_DRAW)
                (dotimes [col 4]
                         (let [loc (int (+ 3 col))]
                              (GL20/glVertexAttribPointer loc 4 GL11/GL_FLOAT false stride-model (long (* col 4 4)))
                              (GL20/glEnableVertexAttribArray loc)
                              (GL33/glVertexAttribDivisor loc 1)))

                ;; per-instance color (location 7), static
                (GL15/glBindBuffer GL15/GL_ARRAY_BUFFER vbo-color)
                (GL15/glBufferData GL15/GL_ARRAY_BUFFER ^FloatBuffer color-buf GL15/GL_STATIC_DRAW)
                (GL20/glVertexAttribPointer 7 3 GL11/GL_FLOAT false (int (* 3 4)) (long 0))
                (GL20/glEnableVertexAttribArray 7)
                (GL33/glVertexAttribDivisor 7 1)

                (GL30/glBindVertexArray 0)
                (GL15/glBindBuffer GL15/GL_ARRAY_BUFFER 0)

                (let [u-view (GL20/glGetUniformLocation shader "uView")
                      u-proj (GL20/glGetUniformLocation shader "uProjection")
                      u-light (GL20/glGetUniformLocation shader "uLightDir")
                      u-viewpos (GL20/glGetUniformLocation shader "uViewPos")
                      view-buf (BufferUtils/createFloatBuffer 16)
                      proj-buf (BufferUtils/createFloatBuffer 16)
                      frame-times (double-array FPS-SAMPLE-COUNT)]

                     ;; ── render loop ──────────────────────────────────────────────────────
                     (loop [last-t (GLFW/glfwGetTime)
                            idx 0
                            cnt 0
                            elapsed 0.0
                            title-t 0.0
                            cpu-accum 0.0
                            cpu-frames 0]
                           (if (GLFW/glfwWindowShouldClose window)
                             (do
                               (.shutdown pool)
                               (GL30/glDeleteVertexArrays vao)
                               (GL15/glDeleteBuffers vbo-cube)
                               (GL15/glDeleteBuffers vbo-model)
                               (GL15/glDeleteBuffers vbo-color)
                               (GL20/glDeleteProgram shader)
                               (Callbacks/glfwFreeCallbacks window)
                               (GLFW/glfwDestroyWindow window)
                               (GLFW/glfwTerminate)
                               (some-> (GLFW/glfwSetErrorCallback nil) (.free)))

                             (let [now (GLFW/glfwGetTime)
                                   dt (- now last-t)
                                   _ (aset frame-times idx dt)
                                   idx' (rem (inc idx) FPS-SAMPLE-COUNT)
                                   cnt' (if (< cnt FPS-SAMPLE-COUNT) (inc cnt) cnt)
                                   total (double (loop [i 0 acc 0.0]
                                                       (if (< i cnt') (recur (inc i) (+ acc (aget frame-times i))) acc)))
                                   fps (if (> total 0.0) (/ (double cnt') total) 0.0)
                                   frame-ms (* dt 1000.0)
                                   elapsed' (+ elapsed dt)
                                   title-t' (+ title-t dt)
                                   nt (current-threads)]

                                  (process-keyboard window dt)

                                  ;; ── the CPU pass (timed) ──────────────────────────────────────
                                  (let [c0 (GLFW/glfwGetTime)
                                        _ (run-pass! pool model statics soa nt (* elapsed' 5.0) @simd?)
                                        c1 (GLFW/glfwGetTime)
                                        cpu-accum' (+ cpu-accum (* (- c1 c0) 1000.0))
                                        cpu-frames' (inc cpu-frames)]

                                       (GL11/glClearColor (float 0.02) (float 0.02) (float 0.05) (float 1.0))
                                       (GL11/glClear (bit-or GL11/GL_COLOR_BUFFER_BIT GL11/GL_DEPTH_BUFFER_BIT))

                                       (let [[fw fh] @fb-size
                                             aspect (/ (double fw) (double (max 1 (long fh))))
                                             [cx cy cz] @cam-pos
                                             cx (double cx) cy (double cy) cz (double cz)
                                             yaw (double @cam-yaw) pitch (double @cam-pitch)
                                             cp (Math/cos pitch)
                                             fwx (* (Math/sin yaw) cp)
                                             fwy (Math/sin pitch)
                                             fwz (* (- (Math/cos yaw)) cp)]
                                            (look-at! view-buf cx cy cz
                                                      (+ cx (* fwx 20.0)) (+ cy (* fwy 20.0)) (+ cz (* fwz 20.0))
                                                      0.0 1.0 0.0)
                                            (perspective! proj-buf CAM-FOV-RAD aspect NEAR-PLANE FAR-PLANE)

                                            ;; stream the freshly-computed matrices (orphan + upload)
                                            (GL15/glBindBuffer GL15/GL_ARRAY_BUFFER vbo-model)
                                            (GL15/glBufferData GL15/GL_ARRAY_BUFFER model GL15/GL_STREAM_DRAW)

                                            (GL20/glUseProgram shader)
                                            (GL20/glUniformMatrix4fv u-view false ^FloatBuffer view-buf)
                                            (GL20/glUniformMatrix4fv u-proj false ^FloatBuffer proj-buf)
                                            (GL20/glUniform3f u-light (float 0.5) (float 0.8) (float 0.3))
                                            (GL20/glUniform3f u-viewpos (float cx) (float cy) (float cz))

                                            (GL30/glBindVertexArray vao)
                                            (GL31/glDrawArraysInstanced GL11/GL_TRIANGLES 0 36 (int NUM-CUBES))
                                            (GL30/glBindVertexArray 0))

                                       (let [refresh? (>= title-t' 0.5)
                                             _ (when refresh?
                                                     (update-title window nt (/ cpu-accum' cpu-frames') fps frame-ms))
                                             title-t'' (if refresh? 0.0 title-t')
                                             cpu-accum'' (if refresh? 0.0 cpu-accum')
                                             cpu-frames'' (if refresh? 0 cpu-frames')]
                                            (GLFW/glfwSwapBuffers window)
                                            (GLFW/glfwPollEvents)
                                            (recur now idx' cnt' elapsed' title-t'' cpu-accum'' cpu-frames''))))))))))

C code:

#define GL_SILENCE_DEPRECATION 1
#define GLFW_INCLUDE_NONE 1
#include <OpenGL/gl3.h>
#include <GLFW/glfw3.h>

#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <mach/mach.h>
#include <sys/sysctl.h>

// ── configuration ───────────────────────────────────────────────────────────

#define NUM_CUBES 100000
#define CUBE_SPREAD 200.0f
#define ROTATION_SPEED 0.4f
#define MOVEMENT_SPEED 0.5f
#define OSC_AMPLITUDE 3.0f
#define WINDOW_W 1920
#define WINDOW_H 1080
#define NEAR_PLANE 0.1f
#define FAR_PLANE 500.0f
#define FPS_SAMPLE_COUNT 120

// per-cube static CPU inputs: bpx,bpy,bpz, rsx,rsy,rsz, mfx,mfy,mfz, scale
#define STATIC_FLOATS 10
// per-cube streamed output: a column-major mat4
#define MODEL_FLOATS 16

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#define TAU 6.28318530717958647692f

// ── shaders ──────────────────────────────────────────────────────────────────
// Vertex shader is now trivial: the model matrix arrives per-instance (built on
// the CPU) and we just transform by it. Fragment shader is unchanged.

static const char *vertex_shader_src =
    "#version 330 core\n"
    "layout (location = 0) in vec3 aPos;\n"
    "layout (location = 1) in vec3 aNormal;\n"
    "layout (location = 3) in vec4 aModel0;\n"
    "layout (location = 4) in vec4 aModel1;\n"
    "layout (location = 5) in vec4 aModel2;\n"
    "layout (location = 6) in vec4 aModel3;\n"
    "layout (location = 7) in vec3 aColor;\n"
    "uniform mat4 uView;\n"
    "uniform mat4 uProjection;\n"
    "out vec3 vColor;\n"
    "out vec3 vNormal;\n"
    "out vec3 vFragPos;\n"
    "void main() {\n"
    "    mat4 model = mat4(aModel0, aModel1, aModel2, aModel3);\n"
    "    vec4 worldPos = model * vec4(aPos, 1.0);\n"
    "    gl_Position = uProjection * uView * worldPos;\n"
    "    vColor = aColor;\n"
    "    vNormal = mat3(model) * aNormal;\n"
    "    vFragPos = worldPos.xyz;\n"
    "}\n";

static const char *fragment_shader_src =
    "#version 330 core\n"
    "in vec3 vColor;\n"
    "in vec3 vNormal;\n"
    "in vec3 vFragPos;\n"
    "out vec4 FragColor;\n"
    "uniform vec3 uLightDir;\n"
    "uniform vec3 uViewPos;\n"
    "void main() {\n"
    "    vec3 norm = normalize(vNormal);\n"
    "    vec3 lightDir = normalize(uLightDir);\n"
    "    vec3 ambient = 0.15 * vColor;\n"
    "    float diff = max(dot(norm, lightDir), 0.0);\n"
    "    vec3 diffuse = diff * vColor;\n"
    "    vec3 viewDir = normalize(uViewPos - vFragPos);\n"
    "    vec3 halfDir = normalize(lightDir + viewDir);\n"
    "    float spec = pow(max(dot(norm, halfDir), 0.0), 32.0);\n"
    "    vec3 specular = 0.3 * spec * vec3(1.0);\n"
    "    float dist = length(vFragPos - uViewPos);\n"
    "    float fog = exp(-dist * 0.008);\n"
    "    fog = clamp(fog, 0.0, 1.0);\n"
    "    vec3 result = ambient + diffuse + specular;\n"
    "    result = mix(vec3(0.02, 0.02, 0.05), result, fog);\n"
    "    FragColor = vec4(result, 1.0);\n"
    "}\n";

// ── cube geometry (pos.xyz, normal.xyz) ─────────────────────────────────────

static const float cube_vertices[] = {
    -0.5f, -0.5f,  0.5f,  0,  0,  1,   0.5f, -0.5f,  0.5f,  0,  0,  1,   0.5f,  0.5f,  0.5f,  0,  0,  1,
     0.5f,  0.5f,  0.5f,  0,  0,  1,  -0.5f,  0.5f,  0.5f,  0,  0,  1,  -0.5f, -0.5f,  0.5f,  0,  0,  1,
    -0.5f, -0.5f, -0.5f,  0,  0, -1,  -0.5f,  0.5f, -0.5f,  0,  0, -1,   0.5f,  0.5f, -0.5f,  0,  0, -1,
     0.5f,  0.5f, -0.5f,  0,  0, -1,   0.5f, -0.5f, -0.5f,  0,  0, -1,  -0.5f, -0.5f, -0.5f,  0,  0, -1,
    -0.5f,  0.5f, -0.5f,  0,  1,  0,  -0.5f,  0.5f,  0.5f,  0,  1,  0,   0.5f,  0.5f,  0.5f,  0,  1,  0,
     0.5f,  0.5f,  0.5f,  0,  1,  0,   0.5f,  0.5f, -0.5f,  0,  1,  0,  -0.5f,  0.5f, -0.5f,  0,  1,  0,
    -0.5f, -0.5f, -0.5f,  0, -1,  0,   0.5f, -0.5f, -0.5f,  0, -1,  0,   0.5f, -0.5f,  0.5f,  0, -1,  0,
     0.5f, -0.5f,  0.5f,  0, -1,  0,  -0.5f, -0.5f,  0.5f,  0, -1,  0,  -0.5f, -0.5f, -0.5f,  0, -1,  0,
     0.5f, -0.5f, -0.5f,  1,  0,  0,   0.5f,  0.5f, -0.5f,  1,  0,  0,   0.5f,  0.5f,  0.5f,  1,  0,  0,
     0.5f,  0.5f,  0.5f,  1,  0,  0,   0.5f, -0.5f,  0.5f,  1,  0,  0,   0.5f, -0.5f, -0.5f,  1,  0,  0,
    -0.5f, -0.5f, -0.5f, -1,  0,  0,  -0.5f, -0.5f,  0.5f, -1,  0,  0,  -0.5f,  0.5f,  0.5f, -1,  0,  0,
    -0.5f,  0.5f,  0.5f, -1,  0,  0,  -0.5f,  0.5f, -0.5f, -1,  0,  0,  -0.5f, -0.5f, -0.5f, -1,  0,  0,
};

// ── minimal column-major math ────────────────────────────────────────────────

typedef float vec3[3];
typedef float mat4[16];

static void vec3_normalize(const vec3 v, vec3 out) {
    float len = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    if (len == 0.0f) { out[0] = v[0]; out[1] = v[1]; out[2] = v[2]; return; }
    float inv = 1.0f / len;
    out[0] = v[0] * inv; out[1] = v[1] * inv; out[2] = v[2] * inv;
}
static float vec3_dot(const vec3 a, const vec3 b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static void vec3_cross(const vec3 a, const vec3 b, vec3 out) {
    out[0] = a[1]*b[2] - a[2]*b[1];
    out[1] = a[2]*b[0] - a[0]*b[2];
    out[2] = a[0]*b[1] - a[1]*b[0];
}
static void mat4_perspective(float fovy_rad, float aspect, float near, float far, mat4 m) {
    float t = tanf(fovy_rad / 2.0f);
    for (int i = 0; i < 16; i++) m[i] = 0.0f;
    m[0] = 1.0f / (aspect * t);
    m[5] = 1.0f / t;
    m[10] = -(far + near) / (far - near);
    m[11] = -1.0f;
    m[14] = -(2.0f * far * near) / (far - near);
}
static void mat4_look_at(const vec3 eye, const vec3 center, const vec3 up, mat4 m) {
    vec3 f, s, u, dir;
    dir[0] = center[0]-eye[0]; dir[1] = center[1]-eye[1]; dir[2] = center[2]-eye[2];
    vec3_normalize(dir, f);
    vec3 fxup; vec3_cross(f, up, fxup); vec3_normalize(fxup, s);
    vec3_cross(s, f, u);
    m[0]=s[0]; m[1]=u[0]; m[2]=-f[0]; m[3]=0;
    m[4]=s[1]; m[5]=u[1]; m[6]=-f[1]; m[7]=0;
    m[8]=s[2]; m[9]=u[2]; m[10]=-f[2]; m[11]=0;
    m[12]=-vec3_dot(s, eye); m[13]=-vec3_dot(u, eye); m[14]=vec3_dot(f, eye); m[15]=1;
}

// ── deterministic PRNG (SplitMix64, seed 42) ─────────────────────────────────

static uint64_t rng_state = 42;
static float rng_float(void) {
    rng_state += 0x9E3779B97F4A7C15ULL;
    uint64_t z = rng_state;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    z = z ^ (z >> 31);
    return (float)(z >> 40) * (1.0f / 16777216.0f);
}

// ── fly camera ───────────────────────────────────────────────────────────────

typedef struct { vec3 pos; float yaw, pitch, move_speed, sensitivity, fov_rad; } FlyCam;

static void flycam_forward(const FlyCam *self, vec3 out) {
    float cp = cosf(self->pitch);
    out[0] = sinf(self->yaw) * cp;
    out[1] = sinf(self->pitch);
    out[2] = -cosf(self->yaw) * cp;
}
static void flycam_right(const FlyCam *self, vec3 out) {
    out[0] = cosf(self->yaw); out[1] = 0.0f; out[2] = sinf(self->yaw);
}
static void flycam_view_matrix(const FlyCam *self, mat4 m) {
    vec3 fwd, target;
    flycam_forward(self, fwd);
    target[0] = self->pos[0] + fwd[0]*20.0f;
    target[1] = self->pos[1] + fwd[1]*20.0f;
    target[2] = self->pos[2] + fwd[2]*20.0f;
    vec3 world_up = {0.0f, 1.0f, 0.0f};
    mat4_look_at(self->pos, target, world_up, m);
}
static void flycam_process_keyboard(FlyCam *self, GLFWwindow *window, float dt) {
    float speed = self->move_speed * dt;
    vec3 fwd, r;
    flycam_forward(self, fwd);
    vec3 fwd_xz_raw = {fwd[0], 0.0f, fwd[2]}, fwd_xz;
    vec3_normalize(fwd_xz_raw, fwd_xz);
    flycam_right(self, r);
    if (glfwGetKey(window, GLFW_KEY_W) == GLFW_PRESS) { self->pos[0]+=fwd_xz[0]*speed; self->pos[2]+=fwd_xz[2]*speed; }
    if (glfwGetKey(window, GLFW_KEY_S) == GLFW_PRESS) { self->pos[0]-=fwd_xz[0]*speed; self->pos[2]-=fwd_xz[2]*speed; }
    if (glfwGetKey(window, GLFW_KEY_A) == GLFW_PRESS) { self->pos[0]-=r[0]*speed; self->pos[2]-=r[2]*speed; }
    if (glfwGetKey(window, GLFW_KEY_D) == GLFW_PRESS) { self->pos[0]+=r[0]*speed; self->pos[2]+=r[2]*speed; }
    if (glfwGetKey(window, GLFW_KEY_SPACE) == GLFW_PRESS) self->pos[1]+=speed;
    if (glfwGetKey(window, GLFW_KEY_LEFT_SHIFT) == GLFW_PRESS) self->pos[1]-=speed;
}

// ── memory usage (for the title bar): resident set size / physical RAM ───────

static size_t process_rss_mb(void) {
    struct mach_task_basic_info info;
    mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;
    if (task_info(mach_task_self(), MACH_TASK_BASIC_INFO, (task_info_t)&info, &count) != KERN_SUCCESS)
        return 0;
    return (size_t)(info.resident_size / (1024 * 1024));
}
static size_t physical_ram_mb(void) {
    uint64_t mem = 0;
    size_t len = sizeof(mem);
    if (sysctlbyname("hw.memsize", &mem, &len, NULL, 0) != 0) return 0;
    return (size_t)(mem / (1024 * 1024));
}

// ── FPS counter ───────────────────────────────────────────────────────────────

typedef struct {
    double last_time, frame_times[FPS_SAMPLE_COUNT], fps, frame_ms;
    size_t frame_index, frame_count;
} FpsCounter;

static void fps_init(FpsCounter *self) {
    self->last_time = glfwGetTime();
    for (int i = 0; i < FPS_SAMPLE_COUNT; i++) self->frame_times[i] = 0.0;
    self->frame_index = 0; self->frame_count = 0; self->fps = 0.0; self->frame_ms = 0.0;
}
static float fps_update(FpsCounter *self) {
    double now = glfwGetTime();
    double dt = now - self->last_time;
    self->frame_times[self->frame_index] = dt;
    self->frame_index = (self->frame_index + 1) % FPS_SAMPLE_COUNT;
    if (self->frame_count < FPS_SAMPLE_COUNT) self->frame_count++;
    double total = 0.0;
    for (size_t i = 0; i < self->frame_count; i++) total += self->frame_times[i];
    self->fps = (total > 0.0) ? (double)self->frame_count / total : 0.0;
    self->frame_ms = dt * 1000.0;
    self->last_time = now;
    return (float)dt;
}

// ── threading (OpenMP) ────────────────────────────────────────────────────────
// Cycled live with T. g_num_threads feeds the `num_threads(...)` clause below.

static int g_thread_counts[16];
static int g_thread_count_n = 0;
static int g_thread_idx = 0;
static int g_num_threads = 1;

// ── shared polynomial sin/cos (identical algorithm in C and Clojure) ─────────
// Replaces libm sinf/cosf so the C-vs-Clojure comparison isolates language/JIT
// instead of measuring two different transcendental implementations. Range-reduce
// to [-pi, pi] (in double, robust as the animation time grows unbounded), then a
// parabola + one Newton-ish refinement. Max abs error ~1.1e-3, invisible here.

#define INV_TWO_PI 0.15915494309189535
#define TWO_PI_D   6.283185307179586
#define HALF_PI_F  1.5707963267948966f
#define B_SIN      1.2732395447351628f   //  4/pi
#define C_SIN      (-0.40528473456935109f) // -4/pi^2
#define P_SIN      0.225f

static inline float fast_sinf(float x) {
    double xd = (double)x;
    double k = (double)(long long)(xd * INV_TWO_PI + (xd >= 0.0 ? 0.5 : -0.5)); // round to nearest
    float r = (float)(xd - k * TWO_PI_D);                                       // r in [-pi, pi]
    float y = B_SIN * r + C_SIN * r * fabsf(r);
    return P_SIN * (y * fabsf(y) - y) + y;
}
static inline float fast_cosf(float x) { return fast_sinf(x + HALF_PI_F); }

// ── static instance inputs + streamed model output ───────────────────────────

static float *cpu_static = NULL;  // [NUM_CUBES][STATIC_FLOATS]
static float *model_data = NULL;  // [NUM_CUBES][MODEL_FLOATS]  recomputed each frame
static float *color_data = NULL;  // [NUM_CUBES][3]             static

static bool generate_instance_data(void) {
    cpu_static = malloc((size_t)NUM_CUBES * STATIC_FLOATS * sizeof(float));
    model_data = malloc((size_t)NUM_CUBES * MODEL_FLOATS * sizeof(float));
    color_data = malloc((size_t)NUM_CUBES * 3 * sizeof(float));
    if (!cpu_static || !model_data || !color_data) return false;
    const float half = CUBE_SPREAD / 2.0f;
    for (size_t i = 0; i < NUM_CUBES; i++) {
        float px = rng_float() * CUBE_SPREAD - half;
        float py = rng_float() * CUBE_SPREAD - half;
        float pz = rng_float() * CUBE_SPREAD - half;
        float hue = rng_float() * TAU;
        float r = 0.5f + 0.5f * sinf(hue);
        float g = 0.5f + 0.5f * sinf(hue + 2.094f);
        float b = 0.5f + 0.5f * sinf(hue + 4.188f);
        float sc = 0.3f + rng_float() * 0.7f;
        float rsx = (rng_float() - 0.5f) * ROTATION_SPEED * 2.0f;
        float rsy = (rng_float() - 0.5f) * ROTATION_SPEED * 2.0f;
        float rsz = (rng_float() - 0.5f) * ROTATION_SPEED * 2.0f;
        float mfx = (0.5f + rng_float()) * MOVEMENT_SPEED;
        float mfy = (0.5f + rng_float()) * MOVEMENT_SPEED;
        float mfz = (0.5f + rng_float()) * MOVEMENT_SPEED;
        float *s = &cpu_static[i * STATIC_FLOATS];
        s[0]=px; s[1]=py; s[2]=pz; s[3]=rsx; s[4]=rsy; s[5]=rsz;
        s[6]=mfx; s[7]=mfy; s[8]=mfz; s[9]=sc;
        color_data[i*3+0]=r; color_data[i*3+1]=g; color_data[i*3+2]=b;
    }
    return true;
}

// THE CPU WORKLOAD: build every cube's column-major model matrix for time `t`.
// This is exactly what the GPU vertex shader used to do per vertex; here it runs
// once per cube on the CPU each frame. 9 transcendentals + ~30 flops per cube.
static void compute_models(float elapsed) {
    const float t = elapsed * 5.0f;
    #pragma omp parallel for schedule(static) num_threads(g_num_threads)
    for (int i = 0; i < NUM_CUBES; i++) {
        const float *s = &cpu_static[i * STATIC_FLOATS];
        float apx = s[0] + OSC_AMPLITUDE * fast_sinf(s[6] * t);
        float apy = s[1] + OSC_AMPLITUDE * fast_sinf(s[7] * t);
        float apz = s[2] + OSC_AMPLITUDE * fast_sinf(s[8] * t);
        float ax = s[3] * t, ay = s[4] * t, az = s[5] * t;
        float scale = s[9];
        float cx = fast_cosf(ax), sx = fast_sinf(ax);
        float cy = fast_cosf(ay), sy = fast_sinf(ay);
        float cz = fast_cosf(az), sz = fast_sinf(az);
        float *m = &model_data[i * MODEL_FLOATS];
        m[0]  = scale * (cy * cz);
        m[1]  = scale * (sx*sy*cz + cx*sz);
        m[2]  = scale * (-cx*sy*cz + sx*sz);
        m[3]  = 0.0f;
        m[4]  = scale * (-cy * sz);
        m[5]  = scale * (-sx*sy*sz + cx*cz);
        m[6]  = scale * (cx*sy*sz + sx*cz);
        m[7]  = 0.0f;
        m[8]  = scale * (sy);
        m[9]  = scale * (-sx * cy);
        m[10] = scale * (cx * cy);
        m[11] = 0.0f;
        m[12] = apx; m[13] = apy; m[14] = apz; m[15] = 1.0f;
    }
}

// ── shader compilation ────────────────────────────────────────────────────────

static GLuint compile_shader(const char *src, GLenum kind, const char *label) {
    GLuint id = glCreateShader(kind);
    glShaderSource(id, 1, &src, NULL);
    glCompileShader(id);
    GLint ok = 0; glGetShaderiv(id, GL_COMPILE_STATUS, &ok);
    if (!ok) {
        char log[1024]; GLsizei len = 0;
        glGetShaderInfoLog(id, sizeof(log), &len, log);
        fprintf(stderr, "ERROR::SHADER::%s::COMPILATION_FAILED\n%.*s\n", label, (int)len, log);
        glDeleteShader(id); return 0;
    }
    return id;
}
static GLuint build_program(const char *vs_src, const char *fs_src) {
    GLuint vs = compile_shader(vs_src, GL_VERTEX_SHADER, "VERTEX");
    if (!vs) return 0;
    GLuint fs = compile_shader(fs_src, GL_FRAGMENT_SHADER, "FRAGMENT");
    if (!fs) { glDeleteShader(vs); return 0; }
    GLuint program = glCreateProgram();
    glAttachShader(program, vs); glAttachShader(program, fs);
    glLinkProgram(program);
    glDeleteShader(vs); glDeleteShader(fs);
    GLint ok = 0; glGetProgramiv(program, GL_LINK_STATUS, &ok);
    if (!ok) {
        char log[1024]; GLsizei len = 0;
        glGetProgramInfoLog(program, sizeof(log), &len, log);
        fprintf(stderr, "ERROR::SHADER::PROGRAM::LINKING_FAILED\n%.*s\n", (int)len, log);
        glDeleteProgram(program); return 0;
    }
    return program;
}

// ── shared state for GLFW callbacks ──────────────────────────────────────────

static FlyCam cam = {
    .pos = {0.0f, 50.0f, 150.0f}, .yaw = 0.0f, .pitch = 0.0f,
    .move_speed = 50.0f, .sensitivity = 0.002f, .fov_rad = (float)(M_PI * 70.0 / 180.0),
};
static bool first_mouse = true;
static double last_mouse_x = 0.0, last_mouse_y = 0.0;
static bool mouse_captured = true, vsync_on = false, show_stats = true;
static int fb_w = WINDOW_W, fb_h = WINDOW_H;

static void framebuffer_size_callback(GLFWwindow *window, int w, int h) {
    (void)window; fb_w = w; fb_h = h; glViewport(0, 0, w, h);
}
static void cursor_pos_callback(GLFWwindow *window, double xpos, double ypos) {
    (void)window;
    if (first_mouse) { last_mouse_x = xpos; last_mouse_y = ypos; first_mouse = false; return; }
    if (!mouse_captured) return;
    float xoff = (float)(xpos - last_mouse_x), yoff = (float)(ypos - last_mouse_y);
    last_mouse_x = xpos; last_mouse_y = ypos;
    cam.yaw += xoff * cam.sensitivity;
    cam.pitch -= yoff * cam.sensitivity;
    float max_pitch = (float)(M_PI / 2.0) - 0.1f;
    if (cam.pitch > max_pitch) cam.pitch = max_pitch;
    if (cam.pitch < -max_pitch) cam.pitch = -max_pitch;
}
static void key_callback(GLFWwindow *window, int key, int scancode, int action, int mods) {
    (void)scancode; (void)mods;
    if (action != GLFW_PRESS) return;
    switch (key) {
        case GLFW_KEY_ESCAPE: glfwSetWindowShouldClose(window, GLFW_TRUE); break;
        case GLFW_KEY_TAB:
            mouse_captured = !mouse_captured;
            if (mouse_captured) { glfwSetInputMode(window, GLFW_CURSOR, GLFW_CURSOR_DISABLED); first_mouse = true; }
            else glfwSetInputMode(window, GLFW_CURSOR, GLFW_CURSOR_NORMAL);
            break;
        case GLFW_KEY_T:
            g_thread_idx = (g_thread_idx + 1) % g_thread_count_n;
            g_num_threads = g_thread_counts[g_thread_idx];
            printf("Worker threads: %d\n", g_num_threads);
            break;
        case GLFW_KEY_V:
            vsync_on = !vsync_on; glfwSwapInterval(vsync_on ? 1 : 0);
            printf("VSync: %s\n", vsync_on ? "ON" : "OFF");
            break;
        case GLFW_KEY_F1: show_stats = !show_stats; break;
        default: break;
    }
}

// ── main ─────────────────────────────────────────────────────────────────────

int main(void) {
    if (glfwInit() == GLFW_FALSE) { fprintf(stderr, "GLFW init failed\n"); return 1; }

    glfwDefaultWindowHints();
    glfwWindowHint(GLFW_CONTEXT_VERSION_MAJOR, 3);
    glfwWindowHint(GLFW_CONTEXT_VERSION_MINOR, 3);
    glfwWindowHint(GLFW_OPENGL_PROFILE, GLFW_OPENGL_CORE_PROFILE);
    glfwWindowHint(GLFW_OPENGL_FORWARD_COMPAT, GLFW_TRUE);

    GLFWwindow *window = glfwCreateWindow(WINDOW_W, WINDOW_H, "CPU Stress Test - 100K Cubes (C)", NULL, NULL);
    if (!window) { fprintf(stderr, "Window creation failed\n"); glfwTerminate(); return 1; }

    glfwMakeContextCurrent(window);
    glfwSwapInterval(0);
    glfwGetFramebufferSize(window, &fb_w, &fb_h);
    glfwSetFramebufferSizeCallback(window, framebuffer_size_callback);
    glfwSetCursorPosCallback(window, cursor_pos_callback);
    glfwSetKeyCallback(window, key_callback);
    glfwSetInputMode(window, GLFW_CURSOR, GLFW_CURSOR_DISABLED);

    // build the worker-thread cycle: powers of two up to omp_get_max_threads(),
    // plus that maximum itself when it is not a power of two (e.g. 10 -> 1,2,4,8,10)
    int max_threads = omp_get_max_threads();
    for (int v = 1; v <= max_threads && g_thread_count_n < 15; v *= 2)
        g_thread_counts[g_thread_count_n++] = v;
    if (g_thread_count_n == 0 || g_thread_counts[g_thread_count_n - 1] != max_threads)
        g_thread_counts[g_thread_count_n++] = max_threads;
    g_thread_idx = 0;
    g_num_threads = g_thread_counts[0];

    printf("\nCPU STRESS TEST - %d CPU-TRANSFORMED CUBES (C + OpenMP, poly sin/cos)\n"
           "OpenGL: %s\nGPU:    %s\n"
           "Cores (omp_get_max_threads): %d | press T to cycle worker threads\n"
           "WASD/Space/Shift fly | Mouse look | Tab cursor | T threads | V vsync | F1 stats | Esc quit\n\n",
           NUM_CUBES, (const char *)glGetString(GL_VERSION), (const char *)glGetString(GL_RENDERER), max_threads);

    glEnable(GL_DEPTH_TEST); glDepthFunc(GL_LESS);
    glEnable(GL_CULL_FACE); glCullFace(GL_BACK);

    GLuint shader = build_program(vertex_shader_src, fragment_shader_src);
    if (!shader) { glfwTerminate(); return 1; }

    if (!generate_instance_data()) {
        fprintf(stderr, "Out of memory generating instance data\n"); glfwTerminate(); return 1;
    }
    printf("Generated %d cubes (streaming %.2f MB/frame of model matrices)\n",
           NUM_CUBES, (double)((size_t)NUM_CUBES * MODEL_FLOATS * sizeof(float)) / (1024.0 * 1024.0));
    printf("Framebuffer: %dx%d\n", fb_w, fb_h);

    GLuint vao = 0, vbo_cube = 0, vbo_model = 0, vbo_color = 0;
    glGenVertexArrays(1, &vao);
    glGenBuffers(1, &vbo_cube);
    glGenBuffers(1, &vbo_model);
    glGenBuffers(1, &vbo_color);

    const GLsizei stride_cube = 6 * sizeof(float);
    const GLsizei stride_model = MODEL_FLOATS * sizeof(float);

    glBindVertexArray(vao);

    // per-vertex cube geometry
    glBindBuffer(GL_ARRAY_BUFFER, vbo_cube);
    glBufferData(GL_ARRAY_BUFFER, sizeof(cube_vertices), cube_vertices, GL_STATIC_DRAW);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride_cube, (void *)0);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, stride_cube, (void *)(3 * sizeof(float)));
    glEnableVertexAttribArray(1);

    // per-instance model matrix (mat4 = 4 vec4 attributes, locations 3..6), streamed
    glBindBuffer(GL_ARRAY_BUFFER, vbo_model);
    glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)((size_t)NUM_CUBES * MODEL_FLOATS * sizeof(float)),
                 NULL, GL_STREAM_DRAW);
    for (int col = 0; col < 4; col++) {
        GLuint loc = 3 + col;
        glVertexAttribPointer(loc, 4, GL_FLOAT, GL_FALSE, stride_model, (void *)(size_t)(col * 4 * sizeof(float)));
        glEnableVertexAttribArray(loc);
        glVertexAttribDivisor(loc, 1);
    }

    // per-instance color (location 7), static
    glBindBuffer(GL_ARRAY_BUFFER, vbo_color);
    glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)((size_t)NUM_CUBES * 3 * sizeof(float)),
                 color_data, GL_STATIC_DRAW);
    glVertexAttribPointer(7, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void *)0);
    glEnableVertexAttribArray(7);
    glVertexAttribDivisor(7, 1);

    glBindVertexArray(0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    GLint u_view = glGetUniformLocation(shader, "uView");
    GLint u_proj = glGetUniformLocation(shader, "uProjection");
    GLint u_light = glGetUniformLocation(shader, "uLightDir");
    GLint u_viewpos = glGetUniformLocation(shader, "uViewPos");

    FpsCounter fps; fps_init(&fps);
    double elapsed = 0.0, title_timer = 0.0;
    double cpu_ms_accum = 0.0; int cpu_ms_frames = 0; double cpu_ms_avg = 0.0;
    char title_buf[256];

    while (glfwWindowShouldClose(window) == GLFW_FALSE) {
        float dt = fps_update(&fps);
        elapsed += dt; title_timer += dt;

        flycam_process_keyboard(&cam, window, dt);

        // ── the CPU pass (timed) ──────────────────────────────────────────────
        double c0 = glfwGetTime();
        compute_models((float)elapsed);
        double c1 = glfwGetTime();
        cpu_ms_accum += (c1 - c0) * 1000.0; cpu_ms_frames++;

        glClearColor(0.02f, 0.02f, 0.05f, 1.0f);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        float aspect = (float)fb_w / (float)(fb_h > 1 ? fb_h : 1);
        mat4 view, proj;
        flycam_view_matrix(&cam, view);
        mat4_perspective(cam.fov_rad, aspect, NEAR_PLANE, FAR_PLANE, proj);

        // stream the freshly-computed matrices (orphan + upload)
        glBindBuffer(GL_ARRAY_BUFFER, vbo_model);
        glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)((size_t)NUM_CUBES * MODEL_FLOATS * sizeof(float)),
                     model_data, GL_STREAM_DRAW);

        glUseProgram(shader);
        glUniformMatrix4fv(u_view, 1, GL_FALSE, view);
        glUniformMatrix4fv(u_proj, 1, GL_FALSE, proj);
        glUniform3f(u_light, 0.5f, 0.8f, 0.3f);
        glUniform3f(u_viewpos, cam.pos[0], cam.pos[1], cam.pos[2]);

        glBindVertexArray(vao);
        glDrawArraysInstanced(GL_TRIANGLES, 0, 36, NUM_CUBES);
        glBindVertexArray(0);

        if (title_timer >= 0.5) {
            title_timer = 0.0;
            cpu_ms_avg = cpu_ms_frames > 0 ? cpu_ms_accum / cpu_ms_frames : 0.0;
            cpu_ms_accum = 0.0; cpu_ms_frames = 0;
            if (show_stats) {
                snprintf(title_buf, sizeof(title_buf),
                         "CPU Stress (C+OpenMP, %d thread%s) | %d cubes | CPU: %.2f ms | FPS: %.1f | %.2f ms | Mem: %zu/%zu MB | VSync: %s",
                         g_num_threads, g_num_threads == 1 ? "" : "s",
                         NUM_CUBES, cpu_ms_avg, fps.fps, fps.frame_ms,
                         process_rss_mb(), physical_ram_mb(), vsync_on ? "ON" : "OFF");
                if (getenv("STRESS_STATS")) { puts(title_buf); fflush(stdout); } // STRESS_STATS=1 -> stats on stdout too
                glfwSetWindowTitle(window, title_buf);
            } else {
                glfwSetWindowTitle(window, "CPU Stress Test - 100K Cubes");
            }
        }

        glfwSwapBuffers(window);
        glfwPollEvents();
    }

    glDeleteVertexArrays(1, &vao);
    glDeleteBuffers(1, &vbo_cube);
    glDeleteBuffers(1, &vbo_model);
    glDeleteBuffers(1, &vbo_color);
    glDeleteProgram(shader);
    free(cpu_static); free(model_data); free(color_data);
    glfwTerminate();
    return 0;
}