Most “AI-powered” developer tools are a thin client around someone else’s API. You paste code in, it goes to a datacenter, an answer comes back. That’s a fine shape for a chat box. It’s the wrong shape for a code scanner that people run on proprietary repositories, in CI, on machines we don’t control.
So two of ik’s checks don’t call a model. They are a model — a
Qwen2.5-Coder-0.5B GGUF that
ships inside the binary and runs on the CPU sitting in front of you. No API key.
No network request. Your code never leaves the machine.
This post is about why we did that, what a model that small can actually do, and the week we lost to a GPU we couldn’t see.
Two jobs for one small model
The model does two unrelated things, using two different capabilities.
Semantic duplication (semdup) — embeddings. Two functions can be
copy-paste-and-rename clones without sharing a single token a regex would catch.
So semdup embeds every function in the repo into a vector, buckets them by
token count, and runs a pairwise cosine sweep inside each bucket. Functions that
mean the same thing land close together regardless of how they’re spelled.
We calibrated the threshold on a corpus of hand-labelled Go function pairs. The separation was cleaner than we expected: the lowest-scoring true duplicate sat at cosine 0.944, the highest-scoring non-duplicate at 0.725 — a 0.22 gap with no overlap at all. We set the default cut at 0.85, biased slightly toward fewer false positives. Cost on an M1 Pro: ~92 ms per function.
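To make the bucket-and-sweep idea concrete, here is a minimal Go sketch. It is not ik's actual code: the `Func` and `Pair` types, the `findDuplicates` helper, the bucket width of 16 tokens, and the toy three-dimensional vectors are all assumptions made for illustration. Only the 0.85 threshold comes from the calibration above.

```go
package main

import (
	"fmt"
	"math"
)

// Func is a hypothetical record for one function: its name, its token
// count, and the embedding vector the model produced for it.
type Func struct {
	Name   string
	Tokens int
	Vec    []float64
}

// Pair is one reported candidate duplicate.
type Pair struct {
	A, B string
	Sim  float64
}

// cosine returns the cosine similarity of two vectors. It returns 0 when
// the lengths differ or either vector has zero magnitude.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// findDuplicates groups functions into buckets of bucketWidth tokens
// (an assumed parameter) and compares every pair inside each bucket.
func findDuplicates(funcs []Func, bucketWidth int, threshold float64) []Pair {
	buckets := map[int][]Func{}
	for _, f := range funcs {
		key := f.Tokens / bucketWidth
		buckets[key] = append(buckets[key], f)
	}
	var out []Pair
	for _, b := range buckets {
		for i := 0; i < len(b); i++ {
			for j := i + 1; j < len(b); j++ {
				if s := cosine(b[i].Vec, b[j].Vec); s >= threshold {
					out = append(out, Pair{b[i].Name, b[j].Name, s})
				}
			}
		}
	}
	return out
}

func main() {
	// Toy embeddings; real ones have hundreds of dimensions.
	funcs := []Func{
		{"parseConfig", 40, []float64{0.9, 0.1, 0.4}},
		{"loadSettings", 42, []float64{0.88, 0.12, 0.41}},
		{"renderPage", 45, []float64{0.1, 0.9, 0.2}},
	}
	for _, p := range findDuplicates(funcs, 16, 0.85) {
		fmt.Printf("%s ~ %s (cosine %.3f)\n", p.A, p.B, p.Sim)
	}
}
```

Bucketing keeps the sweep from being quadratic over the whole repo, at a price: with fixed-width buckets like these, two near-identical functions that straddle a bucket boundary are never compared. Overlapping buckets would close that gap for more comparisons.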
Magic-number context (magic-numbers) — generation. A magic number flagged
as “literal 86400 on line 42” is technically correct and practically useless.
So we hand the model a window around the finding and ask it for a short label.
It comes back with seconds in a day. On a calibration set of 13 fixtures,
10 labels (77%) were rated “good,” and the misses weren’t wrong so much as
under-specified: device width for 768 instead of mobile breakpoint.
Review-aiding, not misleading. p95 latency: 175 ms.
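The "window around the finding" step is simple enough to sketch. The Go below is illustrative only: `contextWindow`, `buildPrompt`, the radius of one line, and the prompt wording are all hypothetical, since this post doesn't show the template ik actually sends to the model.

```go
package main

import (
	"fmt"
	"strings"
)

// contextWindow returns up to radius lines on each side of the 1-based
// target line, clamped to the bounds of the source. It returns "" when
// the line number is out of range.
func contextWindow(src string, line, radius int) string {
	lines := strings.Split(src, "\n")
	if line < 1 || line > len(lines) {
		return ""
	}
	start := line - 1 - radius
	if start < 0 {
		start = 0
	}
	end := line + radius // exclusive upper bound
	if end > len(lines) {
		end = len(lines)
	}
	return strings.Join(lines[start:end], "\n")
}

// buildPrompt is a hypothetical template asking for a short label.
func buildPrompt(literal, window string) string {
	return fmt.Sprintf(
		"Code:\n%s\n\nIn a few words, what does the literal %s represent?\nLabel:",
		window, literal)
}

func main() {
	src := "package cache\n\nfunc ttl() int {\n\treturn 86400\n}\n"
	fmt.Println(buildPrompt("86400", contextWindow(src, 4, 1)))
}
```

Keeping the window small matters more than it looks: a 0.5B model has little capacity to ignore irrelevant code, and every extra line of context adds to the latency.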
Same model file, two llama.cpp contexts: one with embeddings on and pooling
off, one with a greedy sampler for text. That symmetry is the whole pitch — one
~469 MB download earns its keep twice.
Why embed it instead of calling an API
Three reasons, in priority order.
- Privacy is the product. People scan code they can’t email to a third party. “We never send your source anywhere” is only true if it’s structurally true: there must be no code path that could send it. An in-process model turns the promise from a policy into a property of the binary.
- It never skips. This is the same principle behind moving our complexity analysis in-process: a check that depends on a reachable API is a check that silently does nothing when the API is down, the key is missing, or the box is offline. A model compiled into the binary is always there.
- It’s free to run. No per-token cost means we can label every magic number in a repo, or embed every function, without watching a meter. The only budget is wall-clock time, which we cap per scan.
The trade is real: a 0.5B model is not GPT-4. It will never reason about your architecture. But for “is this function a clone of that one” and “what does this constant mean,” it’s the right tool — small enough to ship, good enough to help.
Running a model on hardware you didn’t pick
Shipping inference to other people’s laptops is where the fun starts. Two findings worth passing on.
macOS scrubs DYLD_*. gollama.cpp only looks for the llama.cpp shared
library in a relative libs/<arch>/ path from the working directory, and on
macOS you can’t fix that by setting DYLD_LIBRARY_PATH — dyld strips DYLD_*
variables on startup for many binaries, signed or not. The fix is grubby but
robust: stage a temp directory of symlinks pointing at the real library, chdir
into it just for Backend_init, then chdir back.
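For the curious, the staging trick looks roughly like this in Go. Treat it as a sketch under assumptions, not ik's code: `withStagedLib` is a hypothetical helper, `initFn` stands in for the real backend init call, and the library path and `darwin_arm64` directory name in `main` are made up for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// withStagedLib creates a temp directory containing libs/<arch>/<name> as a
// symlink to realLib, changes into it, runs initFn, and then restores the
// previous working directory and removes the staging directory.
func withStagedLib(realLib, arch string, initFn func() error) (err error) {
	prev, err := os.Getwd()
	if err != nil {
		return err
	}
	stage, err := os.MkdirTemp("", "ik-libs-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(stage) // runs after the chdir-back below

	dir := filepath.Join(stage, "libs", arch)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	link := filepath.Join(dir, filepath.Base(realLib))
	if err := os.Symlink(realLib, link); err != nil {
		return err
	}
	if err := os.Chdir(stage); err != nil {
		return err
	}
	// Always change back, even when initFn fails.
	defer func() {
		if cerr := os.Chdir(prev); cerr != nil && err == nil {
			err = cerr
		}
	}()
	return initFn()
}

func main() {
	// Hypothetical path; the callback only checks that the relative
	// symlink is visible from the staged working directory.
	err := withStagedLib("/opt/ik/libllama.dylib", "darwin_arm64", func() error {
		_, statErr := os.Lstat(filepath.Join("libs", "darwin_arm64", "libllama.dylib"))
		return statErr
	})
	fmt.Println("init error:", err)
}
```

One caveat: the working directory is process-wide, so a helper like this has to run before any other goroutine depends on relative paths.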
The GPU we couldn’t turn off. A user on an Apple M5 got a hard SIGABRT
mid-scan: computeFunction must not be nil inside ggml_metal_init. The
bundled llama.cpp didn’t have Metal kernels for the M5’s GPU family yet. We
set NGpuLayers = 0 to force CPU — and it still crashed, because that flag only
controls layer offload; the Metal backend gets registered and initialized
regardless. We hid the sidecar libggml-metal dylib from the loader — still
crashed, because the Metal shaders are baked inside libllama.dylib itself,
not in a separate file. Every layer we peeled, the crash was one layer deeper.
The honest fix shipped first: probe the machdep.cpu.brand_string sysctl and,
on a chip whose GPU family the bundled kernels don’t support, skip the check
with a friendly one-line reason instead of dumping a 500-line goroutine trace. A scanner that
says “semdup unsupported on this hardware” is doing its job; one that aborts is
not. We later bumped the underlying llama.cpp build to one with M5 kernels and
shrank the block list — but the graceful-skip path stays, because the next chip
family will do this to us again.
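A graceful-skip check of this kind can be very small. The Go sketch below is an assumption-laden illustration: the `unsupportedChips` list, the `skipReason` and `cpuBrand` helpers, and the prefix match are hypothetical, and it shells out to `sysctl` rather than showing how ik actually reads the value.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// unsupportedChips is a hypothetical block list of brand-string prefixes
// whose GPU family the bundled kernels don't cover.
var unsupportedChips = []string{"Apple M5"}

// skipReason returns a one-line message when brand starts with a blocked
// chip name, and "" when the check can run.
func skipReason(brand string, blocked []string) string {
	for _, chip := range blocked {
		if strings.HasPrefix(brand, chip) {
			return "semdup unsupported on this hardware (" + brand + ")"
		}
	}
	return ""
}

// cpuBrand asks sysctl for the brand string. On hosts where the command or
// the key is missing it returns "", which never matches the block list.
func cpuBrand() string {
	out, err := exec.Command("sysctl", "-n", "machdep.cpu.brand_string").Output()
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(out))
}

func main() {
	if reason := skipReason(cpuBrand(), unsupportedChips); reason != "" {
		fmt.Println("skip:", reason)
		return
	}
	fmt.Println("running semdup")
}
```

Matching on a prefix means variants such as "Apple M5 Pro" are caught by a single entry, which is what makes shrinking the list later a one-line change.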
What this buys you
When you run ik on a fresh machine with no network, you still get semantic
duplicate detection and labelled magic numbers — because the intelligence is in
the binary, not behind a key. The model is small on purpose: small enough that
shipping it is reasonable, private by construction, and free to run as often as
the scan needs.
It can’t tell you whether your design is good. We’re not pretending it can. But the boring, mechanical judgements — these two functions are the same, this number is a timeout in milliseconds — are exactly the ones a small local model can make, on your hardware, without your code ever leaving the room.
Want to see what it finds in your codebase? Install the CLI — one line, about a minute, no account required.