
OpenAI Codex CLI

openai/codex · Open-source coding agent · Rust + Python + TypeScript

Rust Python TypeScript GitHub Actions
Score 28 · Grade F · 4,674 files
Functions over the complexity threshold 481
Highest god-package degree 473
Oversized files 609
Hardcoded secrets (mostly test fixtures) 7

Codex still scores F — 28/100 — but for different reasons than our first pass found. The generated layer continues to dominate the structural checks: a single TypeScript barrel (v2/index.ts) now re-exports 472 schema modules, and 609 files exceed the 500-line threshold. What changed is the complexity picture. inkode now analyses cyclomatic complexity, nesting depth, function length and parameter counts in-process across every language — including the 2,100+ Rust files that make up 86% of the repo — and that pushes the complexity category to 0: 481 functions sit over the cyclomatic threshold, 1,639 sites nest too deeply, and 881 functions run long. The secret scanner still finds 7 hardcoded credentials, two of them in cli/src/login.rs. Dependency CVEs are clean for the ecosystems we could audit. The headline from our last study — “100% AI co-authored” — no longer holds: the project has matured to 50 contributors, and only a small fraction of recent commits carry an AI trailer.

What changed since our last scan

We first analysed Codex in May. Re-running the latest inkode against the current repo shows two kinds of movement: the codebase grew (4,472 → 4,674 files; god-package degree 461 → 473), and inkode itself got sharper. Three engine changes account for most of the difference in what we report:

  • Complexity is now first-class. Cyclomatic complexity, nesting depth, function length and parameter count are analysed in-process for Rust, Python, TypeScript and JavaScript — no external tool required. Last time complexity quietly passed (Python-only, via radon); this time the full-language view drops the category to 0. This is the single biggest reason the internal scorecard shifted.
  • Magic-number detection got precise. The check is now driven by pylint and eslint instead of a regex heuristic, and the count fell from 122 to 8; the old number was mostly false positives in Python. Fewer, truer findings.
  • Semantic duplication landed. A new embedding-based check finds near-duplicate functions that token-level tools miss — 15 candidate pairs here.

Net effect: the overall grade is unchanged (28→28), but the story underneath moved from “an AI wrote all of this” to “a now-mature project carrying real handwritten-Rust complexity debt.”

Category Scores

Security 25
30% weight
Testing 0
20% weight
Maintainability 67
20% weight
Complexity 0
15% weight
Change Risk 45
15% weight
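
The overall 28 is consistent with a plain weighted average of the five category scores above. The formula is not stated in this study, so treat the sketch below as an assumption that happens to reproduce the number; the names `CATEGORIES` and `overall_score` are illustrative.

```python
# Category scores and weights (in percent) as listed above.
# Assumption: the overall score is the weighted average, rounded to an integer.
CATEGORIES = {
    "Security": (25, 30),
    "Testing": (0, 20),
    "Maintainability": (67, 20),
    "Complexity": (0, 15),
    "Change Risk": (45, 15),
}


def overall_score(categories):
    """Weighted average of (score, weight_percent) pairs."""
    total_weight = sum(weight for _, weight in categories.values())
    weighted_sum = sum(score * weight for score, weight in categories.values())
    return weighted_sum / total_weight


score = overall_score(CATEGORIES)
print(f"{score:.2f} -> {round(score)}")  # 27.65 -> 28
```

Under this reading, the two zero-scored categories carry 35% of the weight between them, which is why the grade stays at F even with Maintainability at 67.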

AI footprint

The repo still declares its lineage — AGENTS.md sits at the root and the Codex assistant ships with it — but the picture has shifted since our first study. Then, every commit in the sampled window carried a Co-Authored-By: Codex trailer. Now the project has 50 contributors and only a handful of the most recent commits (about 3%) carry an AI co-author trailer. Codex grew up: humans are reviewing and gating the changes that an AI assistant once shipped end-to-end.

What this changes about the review. The structural debt we flag below — the 473-degree barrel, the 609 oversized files, the 481 over-complex functions — is the class of decision an AI assistant won't push back on unless asked. That debt accreted during the period when the AI was doing most of the writing; the human contributors arriving now inherit it. The good news is that there are humans to inherit it: a second opinion exists where, a year ago, the writer and the reviewer were the same model.

AI co-authors detected Codex, OpenAI
AI commit trailers ~3% of recent commits (was 100%)
AI rules file AGENTS.md present
Signal count 4 (3 tools, 1 rules file)

Key Findings

Complexity — the check that flips the grade

Cyclomatic complexity, nesting depth, function length, parameter count — analysed in-process across all languages

481 findings

481 functions exceed the cyclomatic-complexity threshold of 10 — 441 of them in Rust, 39 in Python, 1 in JavaScript. The worst single function is handle_event in codex-rs/tui/src/app/event_dispatch.rs at complexity 122. The supporting checks tell the same story: 1,639 sites nest too deeply (1,546 in Rust), 881 functions run past the length threshold (875 Rust), and 23 functions take too many parameters. This is genuine, handwritten complexity in the core — not a generated-code artifact.

Why this is new. Our first study scored complexity as a pass, because at the time inkode only measured cyclomatic complexity for Python. The engine now does the full analysis in-process for Rust, TypeScript and JavaScript too. The Rust core was always this complex — we just couldn't see it before. The takeaway is that the agent loop (event_dispatch, parse_command, the session machinery) carries the kind of branching density that makes every change a careful one.

Over CC threshold 481 (441 Rust)
Deep nesting 1,639 sites
Long functions 881
Worst function handle_event — CC 122
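
For readers unfamiliar with the metric: cyclomatic complexity is roughly one plus the number of decision points in a function. The sketch below counts them for Python source using the standard-library `ast` module. It illustrates the metric only and is not inkode's implementation; the set of node types counted is a simplifying assumption, and the sample function is made up.

```python
import ast

# Node types treated as one decision point each (a simplifying assumption;
# real tools differ on details such as match statements and comprehensions).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func):
    """Return 1 + the number of decision points found inside `func`."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths, one per operator.
            complexity += len(node.values) - 1
    return complexity


def over_threshold(source, threshold=10):
    """Map function name -> complexity for functions above `threshold`."""
    results = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            cc = cyclomatic_complexity(node)
            if cc > threshold:
                results[node.name] = cc
    return results


SAMPLE = '''
def classify(n):
    if n < 0 and n != -1:
        return "negative"
    for d in (2, 3):
        if n % d == 0:
            return "composite"
    return "other"
'''

print(over_threshold(SAMPLE, threshold=3))  # {'classify': 5}
```

The sample function scores 5 (two `if`s, one `and`, one `for`, plus the base path). A function at 122, like handle_event, has more than a hundred such decision points in a single body.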

Secret Scanning — 7, mostly in test fixtures

Detects hardcoded API keys, tokens, and credentials

7 findings

Seven hardcoded credentials. Two are in codex-rs/cli/src/login.rs (lines 462 and 468) — the login flow itself — flagged as generic API keys. The rest are spread across the codebase: a generic key in core-plugins/src/remote/catalog_cache.rs, one in a WebSocket connection-handling test, one in the generated Cargo.lock, and two private keys in agent-identity/src/lib.rs and login/src/auth/auth_tests.rs.

How to read this set. Test-fixture and lockfile secrets are normally low-risk — disposable keys nobody is actually using. The two in cli/src/login.rs and the one in catalog_cache.rs are the ones worth a closer look: a generic-API-key pattern in production code is usually either a sample constant that belongs in a test, or a hardcoded fallback that needs to go away. Triage by opening the file; the answer takes 30 seconds. (inkode redacts the matched value before anything leaves the machine, so the secret itself is never uploaded.)

Production-code findings 3 — login.rs, catalog_cache.rs
Test / generated findings 4 — private keys + Cargo.lock + a test
Rules generic-api-key, private-key
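
To make "generic API key" and "redacted before upload" concrete, here is a minimal sketch of a pattern-based detector that masks the matched value. The regex, the `redact` helper and the sample input are all hypothetical; inkode's actual rules and redaction format are not described in this study.

```python
import re

# Hypothetical "generic API key" rule: a key-like name assigned a long
# quoted token. Real scanners use larger, tuned rule sets.
GENERIC_KEY = re.compile(
    r"""(?ix)
    \b(api[_-]?key|secret|token)\b   # key-like identifier
    \s*[:=]\s*
    ["']([A-Za-z0-9_\-]{16,})["']    # quoted value of 16+ characters
    """
)


def redact(value, keep=4):
    """Keep the first `keep` characters and mask the rest."""
    return value[:keep] + "*" * (len(value) - keep)


def scan(text):
    """Yield (line_number, rule, redacted_value) for each match."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in GENERIC_KEY.finditer(line):
            yield lineno, "generic-api-key", redact(match.group(2))


# Made-up input; the "key" is a placeholder, not a real credential.
SAMPLE = 'let name = "codex";\nconst API_KEY = "example_0123456789abcdef";\n'
findings = list(scan(SAMPLE))
print(findings)
```

A rule this broad cannot tell a sample constant from a live credential, which is why the production-code matches still need a human to open the file.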

Import Graph — the 473-degree god package

Analyses package dependencies for coupling, cycles, and god packages

19 findings

The single biggest structural issue is still a generated TypeScript barrel. codex-rs/app-server-protocol/schema/typescript/v2/index.ts now re-exports 472 modules — combined degree 473 against a default threshold of 20, up from 461 last time. Two sibling union types (ClientRequest.ts at 82 and ServerNotification.ts at 67) repeat the pattern at smaller scale. On the inbound side, AbsolutePathBuf.ts is imported by 51 other modules. No circular dependencies across 2,814 packages.

Why this matters. A barrel that re-exports 472 modules pulls every consumer into the entire schema. Editor tooling slows down, tree-shaking stops working, and any change to the schema invalidates the build cache for every dependent. It's also a refactor blocker: nobody can split this file cleanly because every renamed or relocated re-export is a breaking change for consumers. The good news is that the cause is generation, not handwriting: the generator template can be changed to emit domain-scoped sub-barrels in an afternoon.

Packages 2,814
Edges 1,293
Cycles 0
Top god package v2/index.ts — degree 473
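
The degree figures are fan-in plus fan-out over the import graph. A minimal sketch of that calculation follows, assuming the graph is available as a list of (importer, imported) pairs; the function name and the toy module names are hypothetical, while the threshold of 20 is the default quoted above.

```python
from collections import Counter


def god_packages(edges, threshold=20):
    """Return {module: (fan_in, fan_out, degree)} for modules over `threshold`.

    `edges` is an iterable of (importer, imported) pairs.
    """
    edges = list(edges)  # allow generators; we iterate twice
    fan_out = Counter(src for src, _ in edges)
    fan_in = Counter(dst for _, dst in edges)
    result = {}
    for module in set(fan_in) | set(fan_out):
        degree = fan_in[module] + fan_out[module]
        if degree > threshold:
            result[module] = (fan_in[module], fan_out[module], degree)
    return result


# Toy graph shaped like the barrel: one consumer imports the barrel,
# and the barrel re-exports 472 (hypothetically named) schema modules.
EDGES = [("client.ts", "v2/index.ts")]
EDGES += [("v2/index.ts", f"v2/Schema{i}.ts") for i in range(472)]

print(god_packages(EDGES))  # {'v2/index.ts': (1, 472, 473)}
```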

Test Presence — a strong ratio masking 112 untested dirs

Measures test-to-source file ratio and per-directory test absence

112 findings

1,360 test files for 1,452 source files: a 94% ratio. That's the headline. The per-directory breakdown is less flattering: 112 directories have source code but no tests next to them, including the generated TypeScript schema layer and most of the Python SDK. The Rust crates lean hard on Rust's convention of inline #[cfg(test)] modules, which the check counts at the file level, and that's where the 94% comes from. Strip the inline tests out and the picture changes considerably.

How to read this. A high test-to-source ratio is a necessary condition for refactor-safety, not a sufficient one. The 112 untested directories include the entire generated-TypeScript surface (lower priority — tests belong on the generator, not the output) and the Python SDK (higher priority — this is the interface third parties consume). The Rust core is genuinely well-tested. The risk to watch is regression in the Python SDK, where there's no safety net.

Test files 1,360
Source files 1,452
Ratio 94%
Untested directories 112
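
The gap between a healthy global ratio and many untested directories is easy to reproduce. The sketch below computes both from a list of paths. The file-naming heuristic in `is_test_file` and the sample paths are assumptions made for illustration, not the check's real rules.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def is_test_file(path):
    """Hypothetical naming heuristic for test files."""
    name = PurePosixPath(path).name
    stem = name.split(".")[0]
    return (
        stem in {"test", "tests"}
        or stem.startswith("test_")
        or stem.endswith(("_test", "_tests"))
        or ".test." in name
    )


def summarize_tests(paths):
    """Return (test_to_source_ratio, untested_directories)."""
    tests_by_dir = defaultdict(int)
    sources_by_dir = defaultdict(int)
    for path in paths:
        directory = str(PurePosixPath(path).parent)
        if is_test_file(path):
            tests_by_dir[directory] += 1
        else:
            sources_by_dir[directory] += 1
    total_tests = sum(tests_by_dir.values())
    total_sources = sum(sources_by_dir.values())
    ratio = total_tests / total_sources if total_sources else 0.0
    untested = sorted(d for d in sources_by_dir if tests_by_dir.get(d, 0) == 0)
    return ratio, untested


# Made-up paths: one tested Rust directory, one untested SDK directory.
PATHS = [
    "core/src/session/mod.rs",
    "core/src/session/tests.rs",
    "sdk/python/client.py",
    "sdk/python/models.py",
]
ratio, untested = summarize_tests(PATHS)
print(untested)              # ['sdk/python']
print(f"{1360 / 1452:.0%}")  # the study's headline ratio: 94%
```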

Line Count — generated dumps and one big handwritten file

Flags files exceeding recommended length thresholds

609 findings

609 of 3,969 scanned files exceed the 500-line warning threshold; 256 of those exceed 1,000 lines. The worst offenders are generated: the JSON schema dumps (codex_app_server_protocol.schemas.json at 19,513 lines), Cargo.lock, the v2_all.py SDK bundle. But not all of it is machine-emitted: codex-rs/tui/src/bottom_pane/chat_composer.rs is a handwritten Rust file at 11,189 lines. A .ik.yaml exclusion on the schema and lock paths would collapse this finding by an order of magnitude and surface the handwritten outliers worth splitting.

Why we still flag generated files. Big files are an onboarding tax. Every new contributor has to scroll through them; every diff against them is harder to review. We don't auto-exclude generated code — the existence of a 19,000-line file in a repo is useful signal, even if the explanation is “the generator emits it.” The takeaway is operational: tell the scanner what's generated so the next scan can focus on what humans wrote.

Files scanned 3,969
Above warning (500) 609
Above error (1,000) 256
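
The effect of telling the scanner what is generated can be sketched with a small length classifier that honours exclude globs. The thresholds and the two large line counts come from this study; the glob patterns, the third file and the function name are hypothetical, and this is not the .ik.yaml syntax.

```python
from fnmatch import fnmatch

WARN_LINES = 500     # warning threshold used in the study
ERROR_LINES = 1000   # error threshold used in the study


def classify(line_counts, exclude=()):
    """Bucket files by length, skipping paths that match an exclude glob.

    As in the counts above, files over the error threshold are also
    included in the warning bucket.
    """
    buckets = {"warning": [], "error": []}
    for path, lines in line_counts.items():
        if any(fnmatch(path, pattern) for pattern in exclude):
            continue
        if lines > WARN_LINES:
            buckets["warning"].append(path)
        if lines > ERROR_LINES:
            buckets["error"].append(path)
    return buckets


LINE_COUNTS = {
    "codex_app_server_protocol.schemas.json": 19513,         # from the study
    "codex-rs/tui/src/bottom_pane/chat_composer.rs": 11189,  # from the study
    "example/short.rs": 120,                                 # hypothetical
}
# Hypothetical exclude globs; fnmatch's `*` also matches "/" characters.
EXCLUDE = ["*.schemas.json", "*.lock"]

print(classify(LINE_COUNTS))           # both large files in both buckets
print(classify(LINE_COUNTS, EXCLUDE))  # only chat_composer.rs remains
```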

Hotspots — where the change actually lands

Files ranked by git change frequency

11 findings

Out of 3,867 changed files in the analysed window, a handful carry most of the churn — and they cluster in two directories. codex-rs/core/src/session/ (tests.rs, mod.rs, turn.rs, turn_context.rs) is one; codex-rs/core/src/config/ (mod.rs, config_tests.rs) is the other, alongside the app-server-protocol JSON schema bundle. Together they account for the bulk of the change risk. High-churn + high-complexity is the architectural pressure point worth watching — and the session machinery sits at the top of both lists.

How to use this. Hotspot × complexity is where bugs land. If you have one person-week to harden this codebase, spend it on codex-rs/core/src/session/ and config/: that's where the change is happening, that's where the product behaviour lives, and that's where a regression has the biggest blast radius. The schema churn is a different story — it's machine-driven and predictable; you don't review the diffs, you review the generator.

Top file codex-rs/core/src/session/tests.rs
Top non-test file codex-rs/core/src/session/mod.rs
Commits analysed 250
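
A hotspot ranking of this kind can be approximated from per-commit file lists, for example parsed from `git log --name-only`. The sketch below multiplies churn by a complexity score. That product is a common heuristic we assume here rather than inkode's documented formula, and the commit data and complexity values are invented.

```python
from collections import Counter


def churn(commits):
    """Count how many commits touched each file.

    `commits` is a list of file-path lists, one list per commit.
    """
    counts = Counter()
    for files in commits:
        counts.update(set(files))  # count a file once per commit
    return counts


def hotspots(commits, complexity, top=3):
    """Rank files by churn x complexity, highest first."""
    scores = {
        path: n * complexity.get(path, 1)
        for path, n in churn(commits).items()
    }
    return sorted(scores, key=lambda p: (-scores[p], p))[:top]


# Invented history and complexity values, for illustration only.
COMMITS = [
    ["session/mod.rs", "session/turn.rs"],
    ["session/mod.rs", "config/mod.rs"],
    ["session/mod.rs"],
    ["README.md"],
]
COMPLEXITY = {"session/mod.rs": 40, "session/turn.rs": 25, "config/mod.rs": 12}

print(hotspots(COMMITS, COMPLEXITY))
# ['session/mod.rs', 'session/turn.rs', 'config/mod.rs']
```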

Change Coupling

File pairs that co-change in git history — hidden dependencies

17 findings

17 file pairs change together more often than chance. Most pair a Rust source file with its generated TypeScript schema sibling — the schema is regenerated whenever the Rust type changes, which is the design. The pattern to watch is the few cross-cutting pairs that aren't schema-driven: those are real hidden dependencies the codebase doesn't declare.

What's signal vs noise here. Schema ↔ type co-change is expected and tells you nothing new. The interesting pairs are the ones where two files in different domains keep moving in lockstep without an explicit import between them — that's the “temporal contract” that bites later when one of them is touched by someone who doesn't know the other exists. Worth filtering the full coupling list by “neither file imports the other” before triaging.
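
The suggested filter, "neither file imports the other", is straightforward to apply once co-change counts and an import map are available. The sketch below uses raw co-change counts with a minimum threshold instead of a statistical "more often than chance" test, which is a simplification. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_pairs(commits, min_count=2):
    """Count unordered file pairs that appear in the same commit."""
    pairs = Counter()
    for files in commits:
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}


def hidden_couplings(commits, imports, min_count=2):
    """Keep only co-changing pairs where neither file imports the other.

    `imports` maps each file to the set of files it imports.
    """
    return {
        (a, b): n
        for (a, b), n in co_change_pairs(commits, min_count).items()
        if b not in imports.get(a, set()) and a not in imports.get(b, set())
    }


# Invented history: a.py imports b.py; c.py and d.py share no import.
COMMITS = [
    ["a.py", "b.py"],
    ["a.py", "b.py", "c.py", "d.py"],
    ["c.py", "d.py"],
]
IMPORTS = {"a.py": {"b.py"}}

print(co_change_pairs(COMMITS))            # both pairs co-change twice
print(hidden_couplings(COMMITS, IMPORTS))  # {('c.py', 'd.py'): 2}
```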

The smaller-but-real findings

Magic numbers, shell-script issues, Python error handling

71 findings

8 magic-number literals (7 in JS, 1 in Python) — far fewer than the 122 we reported last time, because the check now leans on pylint and eslint rather than a regex heuristic that over-counted. 41 shell-script issues across 17 scripts (ShellCheck-driven: unrecognised shebangs, brace-group parse errors, unquoted expansions). And 22 bare except: clauses in the Python SDK and the bundled skill samples. None of these alone moves the score; together they're the kind of paper-cut work a reviewer would flag on a first pass.

Why bare-except is the one to fix first. Magic numbers are an audit problem. Shell-script warnings are a CI-reliability problem. But bare except: is an in-production problem: it swallows KeyboardInterrupt, SystemExit, and every real bug along with the one error you meant to catch. In a long-running process, that's how this-should-have-thrown turns into silent data corruption. 22 of them in an SDK that other people will embed in their software is the finding here.
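
The difference is easy to demonstrate. In the sketch below, the function names are invented and the code is not taken from the Codex SDK. A bare `except:` traps a shutdown request, while a typed handler lets it propagate.

```python
def parse_port_bare(read):
    """Anti-pattern: the bare except also traps SystemExit and KeyboardInterrupt."""
    try:
        return int(read())
    except:  # noqa: E722 - shown only to illustrate the problem
        return None


def parse_port_typed(read):
    """Catch only the error we expect; everything else propagates."""
    try:
        return int(read())
    except ValueError:
        return None


def request_shutdown():
    """Stand-in for code that asks the process to exit mid-call."""
    raise SystemExit(1)


# The bare handler swallows the shutdown request and carries on.
print(parse_port_bare(request_shutdown))  # None

# The typed handler lets it through to the caller.
try:
    parse_port_typed(request_shutdown)
except SystemExit:
    print("shutdown propagated")
```

Both SystemExit and KeyboardInterrupt derive from BaseException rather than Exception, so even `except Exception:` would be an improvement over the bare form where a specific type is hard to pick.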

Checks that passed

Dependency Audit Pass

No known CVEs across the npm and pip ecosystems. (Rust cargo-audit not run in this scan.)

Dead Code Pass

No unused symbols detected by vulture (Python) or knip (JS/TS).

Duplication Pass

No token-level duplicate blocks across the sampled files; 15 semantic near-duplicate pairs surfaced for review.

Cycles Pass

Zero circular dependencies across 2,814 packages and 1,293 edges.

TODO Density Pass

210 TODOs across 2,163 files — well below the density threshold.

Map of the top connectors

The seven files below carry most of the structural weight in the import graph. A change here ripples; a deletion is a multi-day project.

| # | File | Language | Fan-in | Fan-out | Total degree |
|---|------|----------|-------:|--------:|-------------:|
| 1 | codex-rs/app-server-protocol/schema/typescript/v2/index.ts | TypeScript | 1 | 472 | 473 |
| 2 | codex-rs/app-server-protocol/schema/typescript/ClientRequest.ts | TypeScript | 1 | 81 | 82 |
| 3 | codex-rs/app-server-protocol/schema/typescript/ServerNotification.ts | TypeScript | 1 | 66 | 67 |
| 4 | codex-rs/app-server-protocol/schema/typescript/AbsolutePathBuf.ts | TypeScript | 51 | 0 | 51 |
| 5 | codex-rs/app-server-protocol/schema/typescript/serde_json/JsonValue.ts | TypeScript | 27 | 0 | 27 |
| 6 | codex-rs/app-server-protocol/schema/typescript/v2/ThreadItem.ts | TypeScript | 4 | 21 | 25 |
| 7 | codex-rs/app-server-protocol/schema/typescript/v2/Thread.ts | TypeScript | 1 | 16 | 17 |

Takeaways

Immediate action needed
  • Open codex-rs/cli/src/login.rs and core-plugins/src/remote/catalog_cache.rs — if the generic-API-key matches there are production constants, rotate and refactor; if they're samples, move them into test files.
  • Replace the 22 bare except: blocks in the Python SDK and skill samples with typed exceptions — bare-except is how production bugs become silent data corruption.
  • Break down the highest-complexity functions in the core — handle_event (CC 122), parse_command, the session machinery. These sit at the intersection of high churn and high complexity.
  • Wire cargo-audit into the Rust build so the dependency-audit picture is complete; the current pass excludes the largest dependency ecosystem in the repo.
Strategic improvements
  • Add codex-rs/app-server-protocol/schema/** and **/*.schemas.json to .ik.yaml's line-count and import-graph excludes. That alone collapses most of the structural noise.
  • Split v2/index.ts into domain-scoped re-exports (auth, session, exec, fs). A 472-fan-out barrel is a refactor blocker.
  • Split the handwritten outliers the exclusions reveal — starting with chat_composer.rs at 11,189 lines.
  • Add per-directory tests for the 112 untested directories, prioritising the Python SDK; the Rust crates are already largely covered by inline tests, and the generated TypeScript is better tested at the generator.
