Running a local 0.5B LLM inside a code scanner

Two of ik's checks — semantic duplication and magic-number labelling — run a 0.5B local LLM that ships with the binary. No API key, no network call, your code never leaves the machine. Here's why we embedded a model instead of calling one, what it does well, and the Metal crash that ate a week.

Most “AI-powered” developer tools are a thin client around someone else’s API. You paste code in, it goes to a datacenter, an answer comes back. That’s a fine shape for a chat box. It’s the wrong shape for a code scanner that people run on proprietary repositories, in CI, on machines we don’t control.

So two of ik’s checks don’t call a model. They are a model — a Qwen2.5-Coder-0.5B GGUF that ships inside the binary and runs on the CPU sitting in front of you. No API key. No network request. Your code never leaves the machine.

This post is about why we did that, what a model that small can actually do, and the week we lost to a GPU we couldn’t see.

Two jobs for one small model

The model does two unrelated things, using two different capabilities.

Semantic duplication (semdup) — embeddings. Two functions can be copy-paste-and-rename clones without sharing a single token a regex would catch. So semdup embeds every function in the repo into a vector, buckets them by token count, and runs a pairwise cosine sweep inside each bucket. Functions that mean the same thing land close together regardless of how they’re spelled.
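A rough sketch of that sweep in Go, for readers who want the shape of it. The Func and Pair types, the fixed-width bucketing scheme, and the toy three-dimensional vectors are assumptions for illustration, not ik’s actual code; real vectors would come from the model.

```go
package main

import (
	"fmt"
	"math"
)

// Func is a hypothetical record for one embedded function.
type Func struct {
	Name   string
	Tokens int       // token count, used only for bucketing
	Vec    []float64 // embedding vector (toy values in this sketch)
}

// Pair is one candidate duplicate reported by the sweep.
type Pair struct {
	A, B string
	Sim  float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if the lengths differ or either vector has zero norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// sweep groups functions into buckets of bucketWidth tokens (an assumed
// scheme) and compares every pair inside each bucket against threshold.
func sweep(funcs []Func, bucketWidth int, threshold float64) []Pair {
	if bucketWidth <= 0 {
		bucketWidth = 1
	}
	buckets := map[int][]Func{}
	var order []int // remember first-seen order so output is deterministic
	for _, f := range funcs {
		k := f.Tokens / bucketWidth
		if _, ok := buckets[k]; !ok {
			order = append(order, k)
		}
		buckets[k] = append(buckets[k], f)
	}
	var out []Pair
	for _, k := range order {
		b := buckets[k]
		for i := 0; i < len(b); i++ {
			for j := i + 1; j < len(b); j++ {
				if s := cosine(b[i].Vec, b[j].Vec); s >= threshold {
					out = append(out, Pair{b[i].Name, b[j].Name, s})
				}
			}
		}
	}
	return out
}

func main() {
	funcs := []Func{
		{"parseUser", 40, []float64{1, 0.1, 0}},
		{"parseAccount", 42, []float64{0.9, 0.12, 0}},
		{"renderPage", 45, []float64{0, 1, 0.5}},
		{"hugeHandler", 900, []float64{1, 0.1, 0}}, // other bucket: never compared
	}
	for _, p := range sweep(funcs, 64, 0.85) {
		fmt.Printf("%s ~ %s (%.3f)\n", p.A, p.B, p.Sim)
	}
}
```

Bucketing keeps the quadratic comparison inside each bucket instead of across the whole repo. One side effect of fixed-width buckets as sketched here is that two functions sitting either side of a bucket boundary are never compared.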

We calibrated the threshold on a corpus of hand-labelled Go function pairs. The separation was cleaner than we expected: the lowest-scoring true duplicate sat at cosine 0.944, the highest-scoring non-duplicate at 0.725 — a 0.22 gap with no overlap at all. We set the default cut at 0.85, biased slightly toward fewer false positives. Cost on an M1 Pro: ~92 ms per function.
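To make the arithmetic concrete, here is a minimal sketch of how a cut can be checked against labelled pairs. The LabelledPair type and both helpers are hypothetical, and only the two boundary scores (0.944 and 0.725) come from our calibration; the other values are made up to fill out the example.

```go
package main

import "fmt"

// LabelledPair is a hypothetical calibration record: one hand-labelled
// function pair together with its measured cosine similarity.
type LabelledPair struct {
	Sim       float64
	Duplicate bool
}

// separation returns the lowest-scoring true duplicate, the highest-scoring
// non-duplicate, and whether the two classes do not overlap.
// With no pairs of a class, that class keeps its sentinel value (1 or -1).
func separation(pairs []LabelledPair) (minDup, maxNon float64, clean bool) {
	minDup, maxNon = 1, -1
	for _, p := range pairs {
		if p.Duplicate && p.Sim < minDup {
			minDup = p.Sim
		}
		if !p.Duplicate && p.Sim > maxNon {
			maxNon = p.Sim
		}
	}
	return minDup, maxNon, minDup > maxNon
}

// errorsAt counts false positives and false negatives when every pair
// scoring at or above cut is flagged as a duplicate.
func errorsAt(pairs []LabelledPair, cut float64) (falsePos, falseNeg int) {
	for _, p := range pairs {
		flagged := p.Sim >= cut
		switch {
		case flagged && !p.Duplicate:
			falsePos++
		case !flagged && p.Duplicate:
			falseNeg++
		}
	}
	return falsePos, falseNeg
}

func main() {
	pairs := []LabelledPair{
		{0.991, true}, {0.944, true}, // 0.944 is the real lower bound
		{0.725, false}, {0.402, false}, // 0.725 is the real upper bound
	}
	minDup, maxNon, clean := separation(pairs)
	fmt.Printf("gap %.3f, midpoint %.4f, clean %v\n",
		minDup-maxNon, (minDup+maxNon)/2, clean)
	fp, fn := errorsAt(pairs, 0.85)
	fmt.Printf("cut 0.85: %d false positives, %d false negatives\n", fp, fn)
}
```

The midpoint of the gap is about 0.8345, so a cut of 0.85 sits slightly on the strict side of centre, which is what “biased toward fewer false positives” means in practice.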

Magic-number context (magic-numbers): generation. A magic number flagged as “literal 86400 on line 42” is technically correct and practically useless. So we hand the model a window around the finding and ask it for a short label. It comes back with “seconds in a day.” On a calibration set of 13 fixtures, 10 of its labels (77%) were rated “good,” and the misses weren’t wrong so much as under-specified: “device width” for 768 instead of “mobile breakpoint.” Review-aiding, not misleading. p95 latency: 175 ms.
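The “window around the finding” step might look something like the sketch below. The window radius, both function names, and especially the prompt wording are assumptions made up for this example; they are not the prompt ik ships.

```go
package main

import (
	"fmt"
	"strings"
)

// window returns the source lines within radius of a 1-based line number,
// clamped to the start and end of the file. It returns nil if line is out
// of range or radius is negative.
func window(src string, line, radius int) []string {
	lines := strings.Split(src, "\n")
	if line < 1 || line > len(lines) || radius < 0 {
		return nil
	}
	start := line - 1 - radius
	if start < 0 {
		start = 0
	}
	end := line + radius
	if end > len(lines) {
		end = len(lines)
	}
	return lines[start:end]
}

// buildPrompt wraps the context lines in a short instruction.
// The wording here is hypothetical.
func buildPrompt(literal string, ctx []string) string {
	var b strings.Builder
	b.WriteString("In the code below, what does the literal " + literal +
		" mean? Answer in a few words.\n\n")
	b.WriteString(strings.Join(ctx, "\n"))
	b.WriteString("\n\nLabel:")
	return b.String()
}

func main() {
	src := "func ttl() time.Duration {\n" +
		"\t// cache entries live for one day\n" +
		"\treturn 86400 * time.Second\n" +
		"}"
	ctx := window(src, 3, 1) // finding on line 3, one line either side
	fmt.Println(buildPrompt("86400", ctx))
}
```

A small radius keeps the prompt short, which matters when the only budget is wall-clock time; the right size is something to tune against fixtures rather than guess.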

Same model file, two llama.cpp contexts: one with embeddings on and pooling off, one with a greedy sampler for text. That reuse is the whole pitch: one ~469 MB download earns its keep twice.

Why embed it instead of calling an API

Three reasons, in priority order.

1. Privacy is the product. People scan code they can’t email to a third party. “We never send your source anywhere” is only true if it’s structurally true, meaning there’s no code path that could. An in-process model turns the promise from a policy into a structural guarantee.
2. It never skips. This is the same principle behind moving our complexity analysis in-process: a check that depends on a reachable API is a check that silently does nothing when the API is down, the key is missing, or the box is offline. A model compiled into the binary is always there.
3. It’s free to run. No per-token cost means we can label every magic number in a repo, or embed every function, without watching a meter. The only budget is wall-clock time, which we cap per scan.
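A per-scan wall-clock cap can be as simple as a context deadline checked between items. This is a minimal sketch under that assumption; labelWithin and its signature are hypothetical, and the label callback stands in for a model call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// labelWithin labels items in order until the budget d is spent, then
// stops and reports how many items it had to skip. The deadline is only
// checked between items, so one slow call can overrun the budget.
func labelWithin(d time.Duration, items []string, label func(string) string) (map[string]string, int) {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	labels := make(map[string]string, len(items))
	for i, it := range items {
		if ctx.Err() != nil {
			return labels, len(items) - i // budget spent: say so, don't hide it
		}
		labels[it] = label(it)
	}
	return labels, 0
}

func main() {
	literals := []string{"86400", "768", "1024"}
	// Stand-in for the model: a real label call would run inference here.
	fake := func(lit string) string { return "label for " + lit }

	labels, skipped := labelWithin(50*time.Millisecond, literals, fake)
	fmt.Printf("labelled %d, skipped %d\n", len(labels), skipped)
}
```

Returning the skipped count keeps the cap honest: a scan that ran out of time can report it rather than quietly labelling less.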

The trade is real: a 0.5B model is not GPT-4. It will never reason about your architecture. But for “is this function a clone of that one” and “what does this constant mean,” it’s the right tool — small enough to ship, good enough to help.

Running a model on hardware you didn’t pick

Shipping inference to other people’s laptops is where the fun starts. Two findings worth passing on.

macOS scrubs DYLD_*. gollama.cpp only looks for the llama.cpp shared library in a relative libs/<arch>/ path from the working directory, and on macOS you can’t fix that by setting DYLD_LIBRARY_PATH — dyld strips DYLD_* variables on startup for many binaries, signed or not. The fix is grubby but robust: stage a temp directory of symlinks pointing at the real library, chdir into it just for Backend_init, then chdir back.
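For the curious, the staging trick is roughly the following. This is a sketch, not our loader: withStagedLib and demo are hypothetical names, the backend initializer is passed in as a plain callback rather than calling gollama.cpp directly, and the arch directory name is a parameter because the exact value the library expects is not shown here. It also assumes the library stays loaded once the symlink is removed.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
)

// withStagedLib builds <tmp>/libs/<arch>/<name> as a symlink to realLib,
// changes into <tmp> for the duration of initFn, then restores the
// original working directory and removes the staging directory.
// os.Chdir is process-wide, so call this before starting other goroutines
// that depend on the working directory.
func withStagedLib(realLib, arch string, initFn func() error) (err error) {
	orig, err := os.Getwd()
	if err != nil {
		return err
	}
	stage, err := os.MkdirTemp("", "ik-libs-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(stage) // runs after the chdir back (defers are LIFO)

	dir := filepath.Join(stage, "libs", arch)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	abs, err := filepath.Abs(realLib)
	if err != nil {
		return err
	}
	if err := os.Symlink(abs, filepath.Join(dir, filepath.Base(realLib))); err != nil {
		return err
	}
	if err := os.Chdir(stage); err != nil {
		return err
	}
	defer func() {
		// Always go back; surface the chdir error only if initFn succeeded.
		if cerr := os.Chdir(orig); err == nil {
			err = cerr
		}
	}()
	return initFn()
}

// demo exercises the helper with a stub file in place of the real dylib.
func demo() error {
	before, err := os.Getwd()
	if err != nil {
		return err
	}
	tmp, err := os.MkdirTemp("", "ik-real-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tmp)
	lib := filepath.Join(tmp, "libllama.dylib")
	if err := os.WriteFile(lib, []byte("stub"), 0o644); err != nil {
		return err
	}
	arch := runtime.GOARCH // placeholder directory name for the sketch
	err = withStagedLib(lib, arch, func() error {
		// Stand-in for Backend_init: the relative path must resolve.
		_, statErr := os.Stat(filepath.Join("libs", arch, "libllama.dylib"))
		return statErr
	})
	if err != nil {
		return err
	}
	after, err := os.Getwd()
	if err != nil {
		return err
	}
	if after != before {
		return fmt.Errorf("working directory not restored: %s", after)
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("staging failed:", err)
		return
	}
	fmt.Println("staged, initialized, and restored")
}
```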

The GPU we couldn’t turn off. A user on an Apple M5 got a hard SIGABRT mid-scan: computeFunction must not be nil inside ggml_metal_init. The bundled llama.cpp didn’t have Metal kernels for the M5’s GPU family yet. We set NGpuLayers = 0 to force CPU — and it still crashed, because that flag only controls layer offload; the Metal backend gets registered and initialized regardless. We hid the sidecar libggml-metal dylib from the loader — still crashed, because the Metal shaders are baked inside libllama.dylib itself, not in a separate file. Every layer we peeled, the crash was one layer deeper.

The honest fix shipped first: probe the machdep.cpu.brand_string sysctl as a proxy for the GPU family, and on a chip the bundled kernels don’t support, skip the check with a friendly one-line reason instead of dumping a 500-line goroutine trace. A scanner that says “semdup unsupported on this hardware” is doing its job; one that aborts is not. We later bumped the underlying llama.cpp build to one with M5 kernels and shrank the block list, but the graceful-skip path stays, because the next chip family will do this to us again.
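The graceful-skip path amounts to a string check before any model code runs. A minimal sketch, assuming the brand string looks like “Apple M5 Pro” and shelling out to sysctl for simplicity; the block list contents, skipReason, and cpuBrand are hypothetical names, not ik’s internals.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// blockedFamilies is a hypothetical block list of chip families whose GPU
// the bundled kernels are assumed not to support.
var blockedFamilies = []string{"Apple M5"}

// skipReason reports whether brand belongs to a blocked family and, if so,
// returns a one-line reason suitable for printing. A family matches the
// exact name or the name followed by a variant such as "Pro" or "Max".
func skipReason(check, brand string, blocked []string) (string, bool) {
	for _, fam := range blocked {
		if brand == fam || strings.HasPrefix(brand, fam+" ") {
			return fmt.Sprintf("%s unsupported on this hardware (%s)", check, brand), true
		}
	}
	return "", false
}

// cpuBrand reads machdep.cpu.brand_string via the sysctl command.
// It returns "" when the key or the command is unavailable, for example
// on a non-macOS machine.
func cpuBrand() string {
	out, err := exec.Command("sysctl", "-n", "machdep.cpu.brand_string").Output()
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(out))
}

func main() {
	brand := cpuBrand()
	if reason, skip := skipReason("semdup", brand, blockedFamilies); skip {
		fmt.Println("skipped:", reason)
		return
	}
	// An unknown or empty brand falls through and runs the check; whether
	// that is the right default is a judgement call.
	fmt.Println("running semdup")
}
```

Keeping the match in a pure function means the block list can be tested without owning the hardware it describes.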

What this buys you

When you run ik on a fresh machine with no network, you still get semantic duplicate detection and labelled magic numbers — because the intelligence is in the binary, not behind a key. The model is small on purpose: small enough that shipping it is reasonable, private by construction, and free to run as often as the scan needs.

It can’t tell you whether your design is good. We’re not pretending it can. But the boring, mechanical judgements — these two functions are the same, this number is a timeout in milliseconds — are exactly the ones a small local model can make, on your hardware, without your code ever leaving the room.

Want to see what it finds in your codebase? Install the CLI — one line, about a minute, no account required.

