OpenAI Codex CLI
openai/codex · Open-source coding agent · Rust + Python + TypeScript
Codex still scores F — 28/100 — but
for different reasons than our first pass found. The generated
layer continues to dominate the structural checks: a single
TypeScript barrel (v2/index.ts) now re-exports 472
schema modules, and 609 files exceed the 500-line threshold.
What changed is the complexity picture. inkode now analyses
cyclomatic complexity, nesting depth, function length and
parameter counts in-process across every language —
including the 2,100+ Rust files that make up 86% of the repo
— and that pushes the complexity category to 0:
481 functions sit over the cyclomatic threshold, 1,639 sites
nest too deeply, and 881 functions run long. The secret scanner
still finds 7 hardcoded credentials, two of them in
cli/src/login.rs. Dependency CVEs are clean for the
ecosystems we could audit. The headline from our last study
— “100% AI co-authored” — no longer holds:
the project has matured to 50 contributors, and only a small
fraction of recent commits carry an AI trailer.
What changed since our last scan
We first analysed Codex in May. Re-running the latest inkode against the current repo shows two kinds of movement: the codebase grew (4,472 → 4,674 files, the god-package barrel from 461 → 473), and inkode itself got sharper. Three engine changes account for most of the difference in what we report:
- Complexity is now first-class. Cyclomatic complexity, nesting depth, function length and parameter count are analysed in-process for Rust, Python, TypeScript and JavaScript — no external tool required. Last time complexity quietly passed (Python-only, via radon); this time the full-language view drops the category to 0. This is the single biggest reason the internal scorecard shifted.
- Magic-number detection got precise. The check is now driven by pylint and eslint instead of a regex heuristic, and the magic-number count fell from 122 to 8; the old number was mostly false positives in Python. Fewer, truer findings.
- Semantic duplication landed. A new embedding-based check finds near-duplicate functions that token-level tools miss — 15 candidate pairs here.
Net effect: the overall grade is unchanged (28→28), but the story underneath moved from “an AI wrote all of this” to “a now-mature project carrying real handwritten-Rust complexity debt.”
Category Scores
AI footprint
The repo still declares its lineage — AGENTS.md
sits at the root and the Codex assistant ships with it — but
the picture has shifted since our first study. Then, every commit
in the sampled window carried a Co-Authored-By: Codex
trailer. Now the project has 50 contributors and only a handful of
the most recent commits (about 3%) carry an AI co-author trailer.
Codex grew up: humans are reviewing and gating the changes that an
AI assistant once shipped end-to-end.
What this changes about the review. The structural debt we flag below — the 473-degree barrel, the 609 oversized files, the 481 over-complex functions — is the class of decision an AI assistant won't push back on unless asked. That debt accreted during the period when the AI was doing most of the writing; the human contributors arriving now inherit it. The good news is that there are humans to inherit it: a second opinion exists where, a year ago, the writer and the reviewer were the same model.
Key Findings
Complexity: the check that flipped from pass to 0
Cyclomatic complexity, nesting depth, function length, parameter count — analysed in-process across all languages
481 functions exceed the cyclomatic-complexity threshold of 10
— 441 of them in Rust, 39 in Python, 1 in JavaScript. The
worst single function is handle_event in
codex-rs/tui/src/app/event_dispatch.rs at complexity
122. The supporting checks tell the same story: 1,639 sites nest
too deeply (1,546 in Rust), 881 functions run past the length
threshold (875 Rust), and 23 functions take too many parameters.
This is genuine, handwritten complexity in the core — not a
generated-code artifact.
Why this is new. Our first study scored
complexity as a pass, because at the time inkode only measured
cyclomatic complexity for Python. The engine now does the full
analysis in-process for Rust, TypeScript and JavaScript too. The
Rust core was always this complex — we just couldn't see it
before. The takeaway is that the agent loop
(event_dispatch, parse_command, the
session machinery) carries the kind of branching density that
makes every change a careful one.
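To make the threshold of 10 concrete: cyclomatic complexity is roughly one plus the number of decision points in a function. The sketch below counts them for Python source with the standard `ast` module. It is an illustrative approximation, not inkode's implementation; the set of counted node types and the helper names `cyclomatic_complexity` and `over_threshold` are assumptions.

```python
import ast

# Node types treated as decision points. This set is an assumption made for
# illustration; real analysers differ in exactly what they count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                  ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Return 1 + the number of decision points found inside func."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, DECISION_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per boolean operator.
            score += len(node.values) - 1
    return score


def over_threshold(source: str, threshold: int = 10) -> dict[str, int]:
    """Map function name -> complexity for functions above the threshold."""
    flagged = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            cc = cyclomatic_complexity(node)
            if cc > threshold:
                flagged[node.name] = cc
    return flagged


SAMPLE = '''
def route(kind, urgent):
    if kind == "a" and urgent:
        return 1
    elif kind == "b":
        return 2
    for _ in range(3):
        pass
    return 0
'''

sample_func = ast.parse(SAMPLE).body[0]
print(cyclomatic_complexity(sample_func))  # if + and + elif + for, plus 1
```

The toy `route` function scores 5: one base path, plus one each for the `if`, the `and`, the `elif` and the `for`. A function at 122 has more than twenty times that many branches.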
Secret Scanning: 7, mostly in test fixtures
Detects hardcoded API keys, tokens, and credentials
Seven hardcoded credentials. Two are in
codex-rs/cli/src/login.rs (lines 462 and 468) —
the login flow itself — flagged as generic API keys. The
rest are spread across the codebase: a generic key in
core-plugins/src/remote/catalog_cache.rs, one in a
WebSocket connection-handling test, one in the generated
Cargo.lock, and two private keys in
agent-identity/src/lib.rs and
login/src/auth/auth_tests.rs.
How to read this set. Test-fixture and
lockfile secrets are normally low-risk — disposable keys
nobody is actually using. The two in cli/src/login.rs
and the one in catalog_cache.rs are the ones worth a
closer look: a generic-API-key pattern in production code is
usually either a sample constant that belongs in a test, or a
hardcoded fallback that needs to go away. Triage by opening the
file; the answer takes 30 seconds. (inkode redacts the matched
value before anything leaves the machine, so the secret itself is
never uploaded.)
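The match-then-redact idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the regex is a deliberately simple stand-in for a "generic API key" rule, and `find_secrets`, `redact` and the output format are hypothetical names, not inkode's rules or API.

```python
import re

# Hypothetical "generic API key" rule: an identifier containing api_key, token
# or secret, assigned a quoted string of 16+ characters. Real scanners use
# much richer rule sets and entropy checks.
GENERIC_KEY = re.compile(
    r"""\b\w*(?:api_?key|token|secret)\w*\s*[:=]\s*["']([A-Za-z0-9_\-]{16,})["']""",
    re.IGNORECASE,
)


def redact(value: str, keep: int = 4) -> str:
    """Keep a short prefix so a human can recognise the finding, mask the rest."""
    return value[:keep] + "*" * (len(value) - keep)


def find_secrets(path: str, text: str) -> list[dict]:
    """Return findings whose matched value has already been redacted."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = GENERIC_KEY.search(line)
        if match:
            findings.append(
                {"file": path, "line": lineno, "match": redact(match.group(1))}
            )
    return findings


# Made-up sample input, not taken from the Codex repository.
sample = 'api_key = "sk_test_0123456789abcdef"\nretries = 3\n'
findings = find_secrets("example.py", sample)
print(findings)
```

Only the file, line number and masked value end up in the finding, so the report is enough to locate the match without carrying the credential itself.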
Import Graph: the 473-degree god package
Analyses package dependencies for coupling, cycles, and god packages
The single biggest structural issue is still a generated
TypeScript barrel. codex-rs/app-server-protocol/schema/typescript/v2/index.ts
now re-exports 472 modules — combined degree 473 against a
default threshold of 20, up from 461 last time. Two sibling union
types (ClientRequest.ts at 82 and
ServerNotification.ts at 67) repeat the pattern at
smaller scale. On the inbound side,
AbsolutePathBuf.ts is imported by 51 other modules.
No circular dependencies across 2,814 packages.
Why this matters. A barrel that re-exports 472 modules pulls every consumer into the entire schema. Editor tooling slows down, tree-shaking becomes less effective, and any change to the schema invalidates the build cache for every dependent. It's also a refactor blocker: nobody can split this file cleanly because every moved or renamed re-export is a breaking change for consumers. The good news is that the cause is generation, not handwriting: the generator template can produce domain-scoped sub-barrels in an afternoon.
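For readers unfamiliar with the metric, combined degree is fan-out plus fan-in: 472 re-exports plus one inbound edge would give the reported 473. The sketch below computes it from a list of import edges and flags anything over the default threshold of 20. The function names and the toy graph are assumptions for illustration; how inkode builds or deduplicates its edges is not stated in this report.

```python
from collections import Counter


def combined_degree(edges: list[tuple[str, str]]) -> Counter:
    """Fan-out plus fan-in per module, given (importer, imported) edges."""
    degree = Counter()
    for importer, imported in edges:
        degree[importer] += 1  # outbound edge
        degree[imported] += 1  # inbound edge
    return degree


def god_packages(edges: list[tuple[str, str]], threshold: int = 20) -> dict[str, int]:
    """Modules whose combined degree exceeds the threshold."""
    return {m: d for m, d in combined_degree(edges).items() if d > threshold}


# Toy graph: a barrel re-exporting 25 hypothetical modules, plus one consumer.
edges = [("v2/index.ts", f"v2/Schema{i}.ts") for i in range(25)]
edges.append(("client.ts", "v2/index.ts"))

print(god_packages(edges))
```

In the toy graph only the barrel is flagged, with 25 outbound edges and 1 inbound. Splitting it into domain-scoped barrels lowers each barrel's fan-out without changing the leaf modules.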
Test Presence — a strong ratio masking 112 untested dirs
Measures test-to-source file ratio and per-directory test absence
Headline: 1,360 test files for 1,452 source files, a 94%
ratio. The per-directory breakdown is
less flattering: 112 directories have source code but no tests
next to them, including the generated TypeScript schema layer
and most of the Python SDK. The Rust crates lean hard on Rust's
convention of inline #[cfg(test)] modules, which
the check counts at the file level; that's where the 94%
comes from. Strip the inline tests out and the picture changes
considerably.
How to read this. A high test-to-source ratio is a necessary condition for refactor-safety, not a sufficient one. The 112 untested directories include the entire generated-TypeScript surface (lower priority — tests belong on the generator, not the output) and the Python SDK (higher priority — this is the interface third parties consume). The Rust core is genuinely well-tested. The risk to watch is regression in the Python SDK, where there's no safety net.
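The two numbers in this section measure different things, which a short sketch makes visible. It assumes a hypothetical naming rule for what counts as a test file (`is_test`) and a hypothetical helper `presence_report`; inkode's actual classification rules are not described here.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def is_test(path: str) -> bool:
    """Hypothetical rule: lives under tests/, or has a test-like file name."""
    p = PurePosixPath(path)
    stem = p.stem
    return (
        "tests" in p.parts[:-1]
        or stem == "tests"
        or stem.startswith("test_")
        or stem.endswith(("_test", "_tests"))
    )


def presence_report(paths: list[str]) -> tuple[float, list[str]]:
    """Return (test-to-source ratio, directories with source but no tests)."""
    by_dir = defaultdict(lambda: {"src": 0, "test": 0})
    for path in paths:
        kind = "test" if is_test(path) else "src"
        by_dir[str(PurePosixPath(path).parent)][kind] += 1
    sources = sum(c["src"] for c in by_dir.values())
    tests = sum(c["test"] for c in by_dir.values())
    untested = sorted(d for d, c in by_dir.items() if c["src"] and not c["test"])
    ratio = tests / sources if sources else 0.0
    return ratio, untested


# Made-up file list for illustration.
paths = [
    "core/session/mod.rs",
    "core/session/tests.rs",
    "sdk/python/client.py",
    "sdk/python/models.py",
]
ratio, untested = presence_report(paths)
print(round(ratio, 2), untested)

# The headline ratio from the scan: 1,360 test files over 1,452 source files.
print(round(1360 / 1452 * 100))
```

A repo-wide ratio can look healthy while whole directories contribute nothing to the numerator, which is why the per-directory list is the more actionable output.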
Line Count — generated dumps and one big handwritten file
Flags files exceeding recommended length thresholds
609 of 3,969 scanned files exceed the 500-line warning threshold;
256 of those exceed 1,000 lines. The worst offenders are
generated: the JSON schema dumps
(codex_app_server_protocol.schemas.json at 19,513
lines), Cargo.lock, and the
v2_all.py SDK bundle. But not all of it is
machine-emitted: codex-rs/tui/src/bottom_pane/chat_composer.rs
is a handwritten Rust file at 11,189 lines. A
.ik.yaml exclusion on the schema and lock paths
would collapse this finding by an order of magnitude and surface
the handwritten outliers worth splitting.
Why we still flag generated files. Big files are an onboarding tax. Every new contributor has to scroll through them; every diff against them is harder to review. We don't auto-exclude generated code — the existence of a 19,000-line file in a repo is useful signal, even if the explanation is “the generator emits it.” The takeaway is operational: tell the scanner what's generated so the next scan can focus on what humans wrote.
Hotspots — where the change actually lands
Files ranked by git change frequency
Out of 3,867 changed files in the analysed window, a handful
carry most of the churn — and they cluster in two
directories. codex-rs/core/src/session/ (tests.rs,
mod.rs, turn.rs, turn_context.rs) is one;
codex-rs/core/src/config/ (mod.rs, config_tests.rs)
is the other, alongside the
app-server-protocol JSON schema bundle. Together they
account for the bulk of the change risk. High-churn +
high-complexity is the architectural pressure point worth watching
— and the session machinery sits at the top of both lists.
How to use this. Hotspot × complexity is
where bugs land. If you have one person-week to harden this
codebase, spend it on
codex-rs/core/src/session/ and
config/: that's where the change is happening,
that's where the product behaviour lives, and that's where a
regression has the biggest blast radius. The schema churn is a
different story — it's machine-driven and predictable; you
don't review the diffs, you review the generator.
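One simple way to combine the two signals is to multiply them and sort. The sketch below assumes you already have a per-file churn count (for example from counting paths in `git log --name-only` output) and a per-file complexity total. The function name, the scoring formula and all numbers are illustrative assumptions, not inkode output.

```python
def rank_pressure_points(
    churn: dict[str, int],
    complexity: dict[str, int],
    top: int = 3,
) -> list[tuple[str, int]]:
    """Rank files by (commits touching the file) x (summed complexity)."""
    scores = {
        path: churn[path] * complexity[path]
        # Files without a complexity figure (e.g. JSON dumps) drop out here.
        for path in churn.keys() & complexity.keys()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Made-up numbers for illustration only; not measurements from the Codex repo.
churn = {"session/mod.rs": 40, "config/mod.rs": 25, "schema.json": 60, "util.rs": 2}
complexity = {"session/mod.rs": 30, "config/mod.rs": 12, "util.rs": 50}

ranking = rank_pressure_points(churn, complexity)
print(ranking)
```

Note how the highest-churn file in the toy data, the schema dump, never appears in the ranking: it has no branching to get wrong, which matches the advice to review the generator and not the diffs.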
Change Coupling
File pairs that co-change in git history — hidden dependencies
17 file pairs change together more often than chance. Most pair a Rust source file with its generated TypeScript schema sibling — the schema is regenerated whenever the Rust type changes, which is the design. The pattern to watch is the few cross-cutting pairs that aren't schema-driven: those are real hidden dependencies the codebase doesn't declare.
What's signal vs noise here. Schema ↔ type co-change is expected and tells you nothing new. The interesting pairs are the ones where two files in different domains keep moving in lockstep without an explicit import between them — that's the “temporal contract” that bites later when one of them is touched by someone who doesn't know the other exists. Worth filtering the full coupling list by “neither file imports the other” before triaging.
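The suggested filter is easy to express once you have the coupling pairs and an import map. The sketch below is a hypothetical helper, not an inkode feature. It also drops pairs that touch generated paths, since a Rust type and its generated schema file do not import each other either; that extra filter and every file name in the example are assumptions.

```python
def hidden_couplings(
    pairs: list[tuple[str, str]],
    imports: dict[str, set[str]],
    generated_prefixes: tuple[str, ...] = (),
) -> list[tuple[str, str]]:
    """Keep co-changing pairs with no import between them and no generated file."""

    def linked(a: str, b: str) -> bool:
        return b in imports.get(a, set()) or a in imports.get(b, set())

    def generated(path: str) -> bool:
        # str.startswith accepts a tuple; an empty tuple matches nothing.
        return path.startswith(generated_prefixes)

    return [
        (a, b)
        for a, b in pairs
        if not linked(a, b) and not generated(a) and not generated(b)
    ]


# Made-up pairs and import map for illustration.
pairs = [
    ("core/protocol.rs", "schema/typescript/Protocol.ts"),  # schema-driven
    ("core/session.rs", "core/config.rs"),                  # no declared link
    ("tui/app.rs", "tui/widgets.rs"),                       # explicit import
]
imports = {"tui/app.rs": {"tui/widgets.rs"}}

suspects = hidden_couplings(pairs, imports, generated_prefixes=("schema/",))
print(suspects)
```

What survives the filter is the short list worth reading by hand: files that move together with nothing in the code saying they should.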
The smaller-but-real findings
Magic numbers, shell-script issues, Python error handling
8 magic-number literals (7 in JS, 1 in Python) — far fewer
than the 122 we reported last time, because the check now leans on
pylint and eslint rather than a regex heuristic that over-counted.
41 shell-script issues across 17 scripts (ShellCheck-driven:
unrecognised shebangs, brace-group parse errors, unquoted
expansions). And 22 bare except: clauses in the
Python SDK and the bundled skill samples. None of these alone
moves the score; together they're the kind of paper-cut work a
reviewer would flag on a first pass.
Why bare-except is the one to fix first.
Magic numbers are an audit problem. Shell-script warnings are a
CI-reliability problem. But bare except: is an
in-production problem: it swallows KeyboardInterrupt,
SystemExit, and every real bug along with the one
error you meant to catch. In a long-running process, that's how
this-should-have-thrown turns into silent data
corruption. 22 of them in an SDK that other people will
embed in their software is the finding here.
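The difference is easy to demonstrate. In the sketch below, a bare `except:` swallows a `SystemExit`, while a typed handler catches only the error it names and lets the exit request propagate. The function names are made up for illustration and are not taken from the Codex Python SDK.

```python
def swallow_everything(action):
    try:
        action()
    except:  # bare: catches BaseException, so SystemExit is caught too
        return "swallowed"
    return "ok"


def catch_expected(action):
    try:
        action()
    except ValueError as exc:  # only the error we know how to handle
        return f"handled: {exc}"
    return "ok"


def request_exit():
    raise SystemExit(1)


def bad_input():
    raise ValueError("bad payload")


print(swallow_everything(request_exit))  # the exit request is silently lost
print(catch_expected(bad_input))         # the expected error is handled

try:
    catch_expected(request_exit)
except SystemExit:
    print("SystemExit propagated as intended")
```

If a catch-all is genuinely needed, `except Exception:` is the narrower choice, because `KeyboardInterrupt` and `SystemExit` derive from `BaseException` and are not caught by it.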
Checks that passed
No known CVEs across the npm and pip ecosystems. (Rust cargo-audit not run in this scan.)
No unused symbols detected by vulture (Python) or knip (JS/TS).
No token-level duplicate blocks across the sampled files; 15 semantic near-duplicate pairs surfaced for review.
Zero circular dependencies across 2,814 packages and 1,293 edges.
210 TODOs across 2,163 files — well below the density threshold.
Map of the top connectors
The seven files below carry most of the structural weight in the import graph. A change here ripples; a deletion is a multi-day project.
Takeaways
- Open codex-rs/cli/src/login.rs and core-plugins/src/remote/catalog_cache.rs: if the generic-API-key matches there are production constants, rotate and refactor; if they're samples, move them into test files.
- Replace the 22 bare except: blocks in the Python SDK and skill samples with typed exceptions; bare-except is how production bugs become silent data corruption.
- Break down the highest-complexity functions in the core: handle_event (CC 122), parse_command, the session machinery. These sit at the intersection of high churn and high complexity.
- Wire cargo-audit into the Rust build so the dependency-audit picture is complete; the current pass excludes the largest dependency ecosystem in the repo.
- Add codex-rs/app-server-protocol/schema/** and **/*.schemas.json to .ik.yaml's line-count and import-graph excludes. That alone collapses most of the structural noise.
- Split v2/index.ts into domain-scoped re-exports (auth, session, exec, fs). A 472-fan-out barrel is a refactor blocker.
- Split the handwritten outliers the exclusions reveal, starting with chat_composer.rs at 11,189 lines.
- Add per-directory tests for the 112 untested directories, prioritising the Python SDK; Rust's inline tests cover most of that side.