dkls23ctl · Opus 4.7 vs GPT-5 vs DeepSeek v4-pro

Three frontier models were given the same prompt: Opus 4.7 (Claude Code, x-high effort), GPT-5 (Codex, high effort), and DeepSeek v4-pro (OpenCode, high/max effort). The task: build a t-of-n threshold ECDSA CLI on top of Silence Laboratories' dkls23, with iroh/mDNS for peer discovery. Same machine, same hour, no other context. Here's how each fared.

Headline numbers

QA scenarios (out of 16)

Opus 4.7  12 / 4 / 0
GPT-5  14 / 2 / 0
DeepSeek  6 / 2 / 8
Format: pass / partial / fail. Opus 4.7's and GPT-5's "partial" entries are intentional rejections of unimplemented reshare modes; DeepSeek's are genuine failures.

Active session time (excl. >2 min idle)

Opus 4.7  65m · GPT-5  26m · DeepSeek  95m
GPT-5 was 2.5× faster than Opus, 3.6× faster than DeepSeek.

Tool calls / user interventions

Opus 4.7  337 / 2 · GPT-5  217 / 3 · DeepSeek  294 / 1
Opus and GPT-5 needed almost no user input on the task itself; DeepSeek asked one question (the user disagreed). But see §4.10 for Opus's permission-prompt cost.

1. QA — does it actually work?

Detail per scenario

| Scenario | Opus 4.7 | GPT-5 | DeepSeek v4-pro |
| --- | --- | --- | --- |

2. Architecture & code quality

Opus 4.7 · Claude Code

  • 9 source files, clean module split: cli, commands/, discovery, transport, keyshare, singleton.
  • Uses dkls23's official SimpleMessageRelay + a thin InterceptRelay wrapper. The "obvious" idiomatic integration.
  • Real mDNS discovery scoped via blake3(tool_id|key_id)[..6] — peers from different keys never see each other.
  • Hello handshake with peer-id collision check.
  • Library + binary split, tests use the library half.
  • 2 cargo tests + 4 shell QA scripts (incl. run_all.sh orchestrator).
  • Detailed README.md.
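The blake3 scoping trick is simple to sketch. A minimal std-only version, assuming the 32-byte digest comes from hashing tool_id|key_id with the blake3 crate (the function name here is illustrative, not Opus's actual code):

```rust
/// Derive a short mDNS service scope from a 32-byte digest, e.g. the output
/// of blake3 over "tool_id|key_id" (hashing step omitted; std only here).
/// Truncating to 6 bytes (12 hex chars) keeps service names short while
/// making accidental collisions between different keys vanishingly unlikely.
fn discovery_scope(digest: &[u8; 32]) -> String {
    digest[..6].iter().map(|b| format!("{b:02x}")).collect()
}
```

Because the scope is derived from the key identity, two peers holding different keys advertise under different service names and simply never see each other.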

GPT-5 · Codex

  • Single 1254-line file (main.rs) — extreme density, no module split.
  • Custom IrohRelay built directly on Sink+Stream; bypasses dkls23's SimpleMessageRelay. Riskier, but works.
  • Carries discovery metadata (peer_id, party_id, pubkey, encryption pubkey) inline as UserData on each mDNS record.
  • Implements all four reshare transitions incl. (1,1)→(t,n), (t,n)→(1,1) via key_export, and committee-size changes.
  • 2 cargo tests; no shell scripts.
  • No README.

DeepSeek v4-pro · OpenCode

  • 4 source files (~940 lines).
  • Uses dkls23-secp256k1 (a different upstream from the spec's silence-laboratories/dkls23) with its phase-1/2/3/4 message API.
  • File-system based discovery via /tmp/dkls23ctl/<key>/<peer>.json — spec violation (mDNS was required).
  • iroh endpoint exists but addresses are advertised by writing to disk and polling. The user explicitly called this out mid-session; the model never fixed it.
  • Reshare cannot change t/n (only refresh) and cannot bring in fresh peers.
  • main.rs contains a quietly broken normalisation: if n == 1 || t == 1 { t = 1; n = 1; } silently overrides user-supplied params.
  • Visible bug: protocol.rs sends sign-phase-1 messages twice (loop duplicated).
  • 0 cargo tests; 3 shell scripts (qa_test.sh admits its own race conditions).
  • ~495 KB per share file (vs. a few KB for the other two); the wrong serialisation level.
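The silent parameter override is exactly the kind of bug that explicit validation prevents. A hedged sketch of what the check could look like (names are illustrative, not DeepSeek's code):

```rust
/// Validate threshold parameters instead of silently rewriting them.
/// A t-of-n scheme needs 1 <= t <= n; anything else is a user error,
/// not something to paper over with `t = 1; n = 1`.
fn validate_params(t: u16, n: u16) -> Result<(u16, u16), String> {
    if t == 0 || n == 0 {
        return Err("t and n must be at least 1".into());
    }
    if t > n {
        return Err(format!("threshold t={t} cannot exceed party count n={n}"));
    }
    Ok((t, n))
}
```

Rejecting bad input loudly costs four lines; the shipped normalisation instead turned a typo into a different (1,1) key with no warning.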

3. Time, iterations, and where each model got stuck

Tool-call breakdown

Wall vs active time

Tokens used

Idle gaps (>30 s) — where each model waited

| Model | # gaps > 30 s | Longest gap | Note |
| --- | --- | --- | --- |
| Opus 4.7 | 37 | ~80 min (13:56→15:17 UTC) | Same global pause (user lunch / break) seen across all three sessions. |
| GPT-5 | 8 | ~81 min (13:55→15:17 UTC) | Fewest mid-session waits; the model rarely paused on its own. |
| DeepSeek v4-pro | 69 | ~81 min (13:55→15:17 UTC) | Many small mid-session pauses: long generations, repeated retries. |

4. Key observations

4.1 All three picked the same iroh primitive — but only two used the dkls23 relay correctly

Opus 4.7 and GPT-5 both used iroh::address_lookup::MdnsAddressLookup and a real ALPN-based QUIC connection between peers. Opus plugged dkls23's own SimpleMessageRelay into iroh via a sink interceptor (the "official" path). GPT-5 built a from-scratch relay, which is more code but enables features Opus skipped.

DeepSeek v4-pro bound an iroh endpoint, but its wait_for_peers() loop reads /tmp/dkls23ctl/<key>/<peer>.json files instead of using mDNS. The user pushed back on this mid-session — the model acknowledged but did not fix it. The iroh dial path uses the loopback IP and port written to those files, defeating the whole point of mDNS / LAN discovery.
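To make the failure concrete, here is an illustrative reconstruction (not DeepSeek's actual source) of the file-polling pattern: "discovery" means waiting for peer JSON files to appear in a shared directory, which by construction can only ever find peers on the same machine.

```rust
use std::{fs, path::Path, thread, time::Duration};

/// Poll a local directory until `expected` peer records appear.
/// This only works when every peer writes to the *same* filesystem,
/// which is why it defeats the point of mDNS / LAN discovery.
fn wait_for_peers(dir: &Path, expected: usize) -> Vec<String> {
    loop {
        let peers: Vec<String> = fs::read_dir(dir)
            .into_iter() // tolerate the directory not existing yet
            .flatten()   // iterate directory entries
            .flatten()   // skip unreadable entries
            .filter_map(|e| e.file_name().into_string().ok())
            .filter(|name| name.ends_with(".json"))
            .collect();
        if peers.len() >= expected {
            return peers;
        }
        thread::sleep(Duration::from_millis(200)); // poll again
    }
}
```

Swapping this loop for real mDNS answers is precisely what the user asked for mid-session and never got.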

4.2 GPT-5 is the most feature-complete

Only GPT-5 implements the full reshare matrix: same-params refresh, (1,1)→(t,n), (t,n)→(1,1) via key_export, and committee-size changes.

Opus 4.7 explicitly errors on the last two; DeepSeek cannot do any reshare beyond a same-params refresh, and even that hung in our test.
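The matrix is small enough to write down. A hedged sketch of the four transitions as a Rust enum (variant names are illustrative, not GPT-5's actual types):

```rust
/// The four reshare transitions a complete implementation covers.
#[derive(Debug)]
enum Reshare {
    /// Same (t, n); shares are re-randomised so old shares stop working.
    Refresh,
    /// Promote a plain single key into a committee: (1, 1) -> (t, n).
    Import { t: u16, n: u16 },
    /// Collapse a committee back to a single full key via key_export.
    Export,
    /// Change threshold and/or committee size: (t, n) -> (t', n').
    Resize { t: u16, n: u16 },
}

fn describe(r: &Reshare) -> &'static str {
    match r {
        Reshare::Refresh => "refresh shares, same (t, n)",
        Reshare::Import { .. } => "(1,1) -> (t,n)",
        Reshare::Export => "(t,n) -> (1,1) via key_export",
        Reshare::Resize { .. } => "change threshold / committee size",
    }
}
```

Seen this way, the gap is clear: GPT-5 handles all four variants, Opus two, and DeepSeek only Refresh.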

4.3 Speed vs polish

4.4 DeepSeek explicitly asked the user a question — and ignored the answer

DeepSeek invoked OpenCode's question tool once:

"would you accept a simpler networking approach (TCP streams with file-based discovery) that's more reliable, or do you specifically need iroh QUIC for this tool?"

The user's reply was emphatic: "Initial request states clearly that this tool should work on localhost AND LAN, so file-based discovery is a critical flaw. iroh and related libs provide all the required functionality, you just didn't manage to use it correctly." The shipped code still uses file-based discovery. Hypothesis: DeepSeek v4-pro repeatedly failed to figure out iroh's mDNS API, and the model treated the rebuke as guidance to "keep trying" rather than as a hard constraint.

4.5 Pubkey serialisation diverges from the spec for DeepSeek

The spec calls for showing pubkey on stdout. Opus 4.7 and GPT-5 print compressed SEC1 (33 bytes / 66 hex chars). DeepSeek prints raw uncompressed coordinates without the 04 prefix (64 bytes / 128 hex chars). This is technically a public key, but every downstream tool will choke. It's a leaky abstraction over dkls23-secp256k1's API surface.
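The fix is mechanical. A hedged sketch of converting raw (x || y) coordinates into the standard SEC1 encodings, assuming big-endian coordinates as secp256k1 libraries conventionally emit (function names are illustrative):

```rust
/// Prefix raw big-endian (x || y) coordinates with 0x04 to get the
/// standard 65-byte uncompressed SEC1 form.
fn raw_to_sec1_uncompressed(raw_xy: &[u8; 64]) -> [u8; 65] {
    let mut out = [0u8; 65];
    out[0] = 0x04;
    out[1..].copy_from_slice(raw_xy);
    out
}

/// Compressed SEC1 keeps only x, tagged 0x02 (y even) or 0x03 (y odd);
/// y's parity is the low bit of its last big-endian byte.
fn raw_to_sec1_compressed(raw_xy: &[u8; 64]) -> [u8; 33] {
    let mut out = [0u8; 33];
    out[0] = if raw_xy[63] & 1 == 0 { 0x02 } else { 0x03 };
    out[1..].copy_from_slice(&raw_xy[..32]);
    out
}
```

A couple of lines of prefixing would have made DeepSeek's output interoperable with every downstream tool.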

4.6 Library choice mattered enormously

Spec said github.com/silence-laboratories/dkls23. Opus 4.7 and GPT-5 picked sl-dkls23 on crates.io — Silence Labs' v1 beta of the same code. DeepSeek picked dkls23-secp256k1 — a different SL crate, multi-curve, with a much chattier phase-by-phase API. This forced DeepSeek to manually wire eight separate message types per DKG round, which it did adequately, then again for sign, then attempted reshare and got stuck. Opus and GPT-5 handed messages to dkls23's protocol task and let the library do the choreography.

4.7 Why is DeepSeek v4-pro's output the worst?

Several reinforcing factors combine here. The probable root causes: a weaker base model, the dkls23-secp256k1 library pick (§4.6), and the file-based discovery shortcut it refused to abandon (§4.4).

4.8 Why is GPT-5 the fastest?

4.9 What Opus 4.7 did better than GPT-5

4.10 The harness tax: Claude Code is the most annoying to operate

While Opus 4.7 itself rarely needed user input on the actual task, the Claude Code harness demanded the most permission-prompt clicks of any of the three. After the session ended, cc/.claude/settings.local.json contained 30 persisted "always allow" entries — and that's only the requests where the user picked the persistent option. The session log shows 50 permissionMode transitions through acceptEdits on top of those.

Worse, many of the persistent entries are uselessly narrow; they won't match a similar future command.

Many tool calls also offered no persistence option at all (or only a one-shot accept), so the user kept re-clicking through similar variants. By contrast, Codex defers to its sandbox profile (one decision at session start), and OpenCode's permission table for this session has zero persisted rules. Net effect: Opus 4.7 required dozens of approval clicks the other two harnesses never asked for, and the resulting allow-list is mostly cruft that won't help a future session.

5. Strengths & weaknesses summary

| Aspect | Opus 4.7 · Claude Code | GPT-5 · Codex | DeepSeek v4-pro · OpenCode |
| --- | --- | --- | --- |
| Spec compliance (mDNS) | Yes | Yes | No (filesystem) |
| Reshare completeness | Partial (no n-change, no export) | Full matrix | Refresh only, hangs |
| Code structure | Modular, idiomatic | Monolithic but tight | Modular but dense network.rs |
| Testing | 2 tests + 4 scripts + run_all | 2 tests, no scripts | 0 tests, scripts admit failures |
| Documentation | README + module docs | None | None |
| Time efficiency | 65 min active | 26 min active | 95 min active |
| Bugs | None observed | None observed | Silent param override; duplicated send loop; reshare hangs |
| API surface choice | SimpleMessageRelay (canonical) | Custom relay (works, more code) | Wrong upstream lib |
| Operator UX (output) | Tagged stdout (PUBKEY/SHARE/SIGNATURE) | Single hex line, no tag | Single hex line, wrong format |
| Operator UX (permissions) | ~30 persisted grants, many over-narrow; lots of mid-session prompts | Sandbox profile, one decision at start | Zero persisted rules in session |

6. Verdict

GPT-5 (Codex, high effort) is the most spec-complete and the fastest to produce a working tool. If the only criterion is "does it pass the QA matrix", GPT-5 wins.

Opus 4.7 (Claude Code, x-high effort) wins on engineering quality — readable code, real tests, scripts, README, idiomatic library use — at the cost of skipping two reshare paths and spending more time. Best for handing off to another engineer. Operator caveat: the Claude Code harness's permission UX makes this the most interaction-heavy option, and most of its "always allow" rules end up too narrow to reuse.

DeepSeek v4-pro (OpenCode, high/max effort) failed to deliver a tool that meets the spec. The combination of a weaker base model, an unfortunate library pick, and a stubborn refusal to fix the discovery layer after explicit user feedback makes this the clear loser. The lesson: when a model gets stuck on a constraint it doesn't understand, escalating to the user works only if the model is then willing to reverse course.