dkls23ctl · Opus 4.7 vs GPT-5 vs DeepSeek v4-pro

Three frontier models were given the same prompt: Opus 4.7 (Claude Code, x-high effort), GPT-5 (Codex, high effort), and DeepSeek v4-pro (OpenCode, high/max effort). The task: build a t-of-n threshold ECDSA CLI on top of Silence Laboratories' dkls23, with iroh/mDNS for peer discovery. Same machine, same hour, no other context. Here's how each fared.

Headline numbers

QA scenarios (out of 16)

Opus 4.7  12 / 4 / 0
GPT-5  14 / 2 / 0
DeepSeek  6 / 2 / 8
Format: pass / partial / fail. Opus 4.7's and GPT-5's "partial" entries are intentional rejections of unimplemented reshare modes; DeepSeek's are genuine failures.

Active session time (excl. >2 min idle)

Opus 4.7  65m · GPT-5  26m · DeepSeek  95m
GPT-5 was 2.5× faster than Opus, 3.6× faster than DeepSeek.

Tool calls / user interventions

Opus 4.7  337 / 2 · GPT-5  217 / 3 · DeepSeek  294 / 1
Opus and GPT-5 needed almost no user input on the task itself; DeepSeek asked one question (the user disagreed). But see §4.10 for Opus's permission-prompt cost.

1. QA — does it actually work?

Detail per scenario

| Scenario | Opus 4.7 | GPT-5 | DeepSeek v4-pro |
| --- | --- | --- | --- |

2. Architecture & code quality

Opus 4.7 · Claude Code

  • 9 source files, clean module split: cli, commands/, discovery, transport, keyshare, singleton.
  • Uses dkls23's official SimpleMessageRelay + a thin InterceptRelay wrapper. The "obvious" idiomatic integration.
  • Real mDNS discovery scoped via blake3(tool_id|key_id)[..6] — peers from different keys never see each other.
  • Hello handshake with peer-id collision check.
  • Library + binary split, tests use the library half.
  • 2 cargo tests + 4 shell QA scripts (incl. run_all.sh orchestrator).
  • Detailed README.md.
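The blake3 scoping trick is simple to sketch. A minimal std-only version, assuming the 32-byte digest comes from hashing tool_id|key_id with the blake3 crate (the function name here is illustrative, not Opus's actual code):

```rust
/// Derive a short mDNS service scope from a 32-byte digest, e.g. the output
/// of blake3 over "tool_id|key_id" (hashing step omitted; std only here).
/// Truncating to 6 bytes (12 hex chars) keeps service names short while
/// making accidental collisions between different keys vanishingly unlikely.
fn discovery_scope(digest: &[u8; 32]) -> String {
    digest[..6].iter().map(|b| format!("{b:02x}")).collect()
}
```

Because the scope is derived from the key identity, two peers holding different keys advertise under different service names and simply never see each other.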

GPT-5 · Codex

  • Single 1254-line file (main.rs) — extreme density, no module split.
  • Custom IrohRelay built directly on Sink+Stream; bypasses dkls23's SimpleMessageRelay. Riskier, but works.
  • Carries discovery metadata (peer_id, party_id, pubkey, encryption pubkey) inline as UserData on each mDNS record.
  • Implements all four reshare transitions incl. (1,1)→(t,n), (t,n)→(1,1) via key_export, and committee-size changes.
  • 2 cargo tests; no shell scripts.
  • No README.

DeepSeek v4-pro · OpenCode

  • 4 source files (~940 lines).
  • Uses dkls23-secp256k1 (a different upstream from the spec's silence-laboratories/dkls23) with its phase-1/2/3/4 message API.
  • File-system based discovery via /tmp/dkls23ctl/<key>/<peer>.json — spec violation (mDNS was required).
  • iroh endpoint exists but addresses are advertised by writing to disk and polling. The user explicitly called this out mid-session; the model never fixed it.
  • Reshare cannot change t/n (only refresh) and cannot bring in fresh peers.
  • main.rs contains a quietly broken normalisation: if n == 1 || t == 1 { t = 1; n = 1; } silently overrides user-supplied params.
  • Visible bug: protocol.rs sends sign-phase-1 messages twice (loop duplicated).
  • 0 cargo tests; 3 shell scripts (qa_test.sh admits its own race conditions).
  • ~495 KB per share file (vs. a few KB for the other two); the wrong serialisation level.
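The silent parameter override is exactly the kind of bug that explicit validation prevents. A hedged sketch of what the check could look like (names are illustrative, not DeepSeek's code):

```rust
/// Validate threshold parameters instead of silently rewriting them.
/// A t-of-n scheme needs 1 <= t <= n; anything else is a user error,
/// not something to paper over with `t = 1; n = 1`.
fn validate_params(t: u16, n: u16) -> Result<(u16, u16), String> {
    if t == 0 || n == 0 {
        return Err("t and n must be at least 1".into());
    }
    if t > n {
        return Err(format!("threshold t={t} cannot exceed party count n={n}"));
    }
    Ok((t, n))
}
```

Rejecting bad input loudly costs four lines; the shipped normalisation instead turned a typo into a different (1,1) key with no warning.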

3. Time, iterations, and where each model got stuck

Tool-call breakdown

Wall vs active time

Tokens used

Idle gaps (>30 s) — where each model waited

| Model | # gaps > 30 s | Longest gap | Note |
| --- | --- | --- | --- |
| Opus 4.7 | 37 | ~80 min (13:56→15:17 UTC) | Same global pause (user lunch / break) seen across all three sessions. |
| GPT-5 | 8 | ~81 min (13:55→15:17 UTC) | Fewest mid-session waits; the model rarely paused on its own. |
| DeepSeek v4-pro | 69 | ~81 min (13:55→15:17 UTC) | Many small mid-session pauses: long generations, repeated retries. |

4. Key observations

4.1 All three picked the same iroh primitive — but only two used the dkls23 relay correctly

Opus 4.7 and GPT-5 both used iroh::address_lookup::MdnsAddressLookup and a real ALPN-based QUIC connection between peers. Opus plugged dkls23's own SimpleMessageRelay into iroh via a sink interceptor (the "official" path). GPT-5 built a from-scratch relay, which is more code but enables features Opus skipped.

DeepSeek v4-pro bound an iroh endpoint, but its wait_for_peers() loop reads /tmp/dkls23ctl/<key>/<peer>.json files instead of using mDNS. The user pushed back on this mid-session — the model acknowledged but did not fix it. The iroh dial path uses the loopback IP and port written to those files, defeating the whole point of mDNS / LAN discovery.
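To make the failure concrete, here is an illustrative reconstruction (not DeepSeek's actual source) of the file-polling pattern: "discovery" means waiting for peer JSON files to appear in a shared directory, which by construction can only ever find peers on the same machine.

```rust
use std::{fs, path::Path, thread, time::Duration};

/// Poll a local directory until `expected` peer records appear.
/// This only works when every peer writes to the *same* filesystem,
/// which is why it defeats the point of mDNS / LAN discovery.
fn wait_for_peers(dir: &Path, expected: usize) -> Vec<String> {
    loop {
        let peers: Vec<String> = fs::read_dir(dir)
            .into_iter() // tolerate the directory not existing yet
            .flatten()   // iterate directory entries
            .flatten()   // skip unreadable entries
            .filter_map(|e| e.file_name().into_string().ok())
            .filter(|name| name.ends_with(".json"))
            .collect();
        if peers.len() >= expected {
            return peers;
        }
        thread::sleep(Duration::from_millis(200)); // poll again
    }
}
```

Swapping this loop for real mDNS answers is precisely what the user asked for mid-session and never got.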

4.2 GPT-5 is the most feature-complete

Only GPT-5 implements the full reshare matrix: same-params refresh, (1,1)→(t,n), (t,n)→(1,1) via key_export, and committee-size changes.

Opus 4.7 explicitly errors on the last two; DeepSeek cannot do any reshare beyond a same-params refresh, and even that hung in our test.
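The matrix is small enough to write down. A hedged sketch of the four transitions as a Rust enum (variant names are illustrative, not GPT-5's actual types):

```rust
/// The four reshare transitions a complete implementation covers.
#[derive(Debug)]
enum Reshare {
    /// Same (t, n); shares are re-randomised so old shares stop working.
    Refresh,
    /// Promote a plain single key into a committee: (1, 1) -> (t, n).
    Import { t: u16, n: u16 },
    /// Collapse a committee back to a single full key via key_export.
    Export,
    /// Change threshold and/or committee size: (t, n) -> (t', n').
    Resize { t: u16, n: u16 },
}

fn describe(r: &Reshare) -> &'static str {
    match r {
        Reshare::Refresh => "refresh shares, same (t, n)",
        Reshare::Import { .. } => "(1,1) -> (t,n)",
        Reshare::Export => "(t,n) -> (1,1) via key_export",
        Reshare::Resize { .. } => "change threshold / committee size",
    }
}
```

Seen this way, the gap is clear: GPT-5 handles all four variants, Opus two, and DeepSeek only Refresh.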

4.3 Speed vs polish

4.4 DeepSeek explicitly asked the user a question — and ignored the answer

DeepSeek invoked OpenCode's question tool once:

"would you accept a simpler networking approach (TCP streams with file-based discovery) that's more reliable, or do you specifically need iroh QUIC for this tool?"

The user's reply was emphatic: "Initial request states clearly that this tool should work on localhost AND LAN, so file-based discovery is a critical flaw. iroh and related libs provide all the required functionality, you just didn't manage to use it correctly." The shipped code still uses file-based discovery. Hypothesis: DeepSeek v4-pro repeatedly failed to figure out iroh's mDNS API, and the model treated the rebuke as guidance to "keep trying" rather than as a hard constraint.

4.5 Pubkey serialisation diverges from the spec for DeepSeek

The spec calls for showing pubkey on stdout. Opus 4.7 and GPT-5 print compressed SEC1 (33 bytes / 66 hex chars). DeepSeek prints raw uncompressed coordinates without the 04 prefix (64 bytes / 128 hex chars). This is technically a public key, but every downstream tool will choke. It's a leaky abstraction over dkls23-secp256k1's API surface.
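The fix is mechanical. A hedged sketch of converting raw (x || y) coordinates into the standard SEC1 encodings, assuming big-endian coordinates as secp256k1 libraries conventionally emit (function names are illustrative):

```rust
/// Prefix raw big-endian (x || y) coordinates with 0x04 to get the
/// standard 65-byte uncompressed SEC1 form.
fn raw_to_sec1_uncompressed(raw_xy: &[u8; 64]) -> [u8; 65] {
    let mut out = [0u8; 65];
    out[0] = 0x04;
    out[1..].copy_from_slice(raw_xy);
    out
}

/// Compressed SEC1 keeps only x, tagged 0x02 (y even) or 0x03 (y odd);
/// y's parity is the low bit of its last big-endian byte.
fn raw_to_sec1_compressed(raw_xy: &[u8; 64]) -> [u8; 33] {
    let mut out = [0u8; 33];
    out[0] = if raw_xy[63] & 1 == 0 { 0x02 } else { 0x03 };
    out[1..].copy_from_slice(&raw_xy[..32]);
    out
}
```

A couple of lines of prefixing would have made DeepSeek's output interoperable with every downstream tool.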

4.6 Library choice mattered enormously

Spec said github.com/silence-laboratories/dkls23. Opus 4.7 and GPT-5 picked sl-dkls23 on crates.io — Silence Labs' v1 beta of the same code. DeepSeek picked dkls23-secp256k1 — a different SL crate, multi-curve, with a much chattier phase-by-phase API. This forced DeepSeek to manually wire eight separate message types per DKG round, which it did adequately, then again for sign, then attempted reshare and got stuck. Opus and GPT-5 handed messages to dkls23's protocol task and let the library do the choreography.

4.7 Why is DeepSeek v4-pro's output the worst?

Several reinforcing factors combine here. The probable root causes: a weaker base model, the dkls23-secp256k1 library pick (§4.6), and the file-based discovery shortcut it refused to abandon (§4.4).

4.8 Why is GPT-5 the fastest?

4.9 What Opus 4.7 did better than GPT-5

4.10 The harness tax: Claude Code is the most annoying to operate

While Opus 4.7 itself rarely needed user input on the actual task, the Claude Code harness demanded the most permission-prompt clicks of any of the three. After the session ended, cc/.claude/settings.local.json contained 30 persisted "always allow" entries — and that's only the requests where the user picked the persistent option. The session log shows 50 permissionMode transitions through acceptEdits on top of those.

Worse, many of the persistent entries are uselessly narrow; they won't match a similar future command.

Many tool calls also offered no persistence option at all (or only a one-shot accept), so the user kept re-clicking through similar variants. By contrast, Codex defers to its sandbox profile (one decision at session start), and OpenCode's permission table for this session has zero persisted rules. Net effect: Opus 4.7 required dozens of approval clicks the other two harnesses never asked for, and the resulting allow-list is mostly cruft that won't help a future session.

5. Strengths & weaknesses summary

| Aspect | Opus 4.7 · Claude Code | GPT-5 · Codex | DeepSeek v4-pro · OpenCode |
| --- | --- | --- | --- |
| Spec compliance (mDNS) | Yes | Yes | No (filesystem) |
| Reshare completeness | Partial (no n-change, no export) | Full matrix | Refresh only, hangs |
| Code structure | Modular, idiomatic | Monolithic but tight | Modular but dense network.rs |
| Testing | 2 tests + 4 scripts + run_all | 2 tests, no scripts | 0 tests, scripts admit failures |
| Documentation | README + module docs | None | None |
| Time efficiency | 65 min active | 26 min active | 95 min active |
| Bugs | None observed | None observed | Silent param override; duplicated send loop; reshare hangs |
| API surface choice | SimpleMessageRelay (canonical) | Custom relay (works, more code) | Wrong upstream lib |
| Operator UX (output) | Tagged stdout (PUBKEY/SHARE/SIGNATURE) | Single hex line, no tag | Single hex line, wrong format |
| Operator UX (permissions) | ~30 persisted grants, many over-narrow; lots of mid-session prompts | Sandbox profile, one decision at start | Zero persisted rules in session |

6. Verdict

GPT-5 (Codex, high effort) is the most spec-complete and the fastest to produce a working tool. If the only criterion is "does it pass the QA matrix", GPT-5 wins.

Opus 4.7 (Claude Code, x-high effort) wins on engineering quality — readable code, real tests, scripts, README, idiomatic library use — at the cost of skipping two reshare paths and spending more time. Best for handing off to another engineer. Operator caveat: the Claude Code harness's permission UX makes this the most interaction-heavy option, and most of its "always allow" rules end up too narrow to reuse.

DeepSeek v4-pro (OpenCode, high/max effort) failed to deliver a tool that meets the spec. The combination of a weaker base model, an unfortunate library pick, and a stubborn refusal to fix the discovery layer after explicit user feedback makes this the clear loser. The lesson: when a model gets stuck on a constraint it doesn't understand, escalating to the user works only if the model is then willing to reverse course.