The task: a CLI (`dkls23ctl`) built on dkls23, with iroh/mDNS for peer discovery. Three model-and-harness pairings, same machine, same hour, no other context. Here's how each fared.
How each pairing approached the build:

**Opus 4.7 · Claude Code**

- Modules: `cli`, `commands/`, `discovery`, `transport`, `keyshare`, `singleton`.
- Relay: `SimpleMessageRelay` + a thin `InterceptRelay` wrapper. The "obvious" idiomatic integration.
- Discovery scoping: `blake3(tool_id|key_id)[..6]` — peers from different keys never see each other.
- Tests: 2 tests and 4 scripts (plus a `run_all.sh` orchestrator).
- Docs: `README.md`.

**GPT-5 · Codex**

- Layout: a single `main.rs` — extreme density, no module split.
- Relay: `IrohRelay` built directly on Sink+Stream; bypasses dkls23's `SimpleMessageRelay`. Riskier, but works.
- Discovery scoping: `UserData` on each mDNS record.
- Reshare: the full matrix, including `key_export` and committee-size changes.

**DeepSeek v4-pro · OpenCode**

- Library: `dkls23-secp256k1` (a different upstream than the spec's silence-laboratories/dkls23) — phase-1/2/3/4 message API.
- Discovery: `/tmp/dkls23ctl/<key>/<peer>.json` — spec violation (mDNS was required).
- Bug: `main.rs` contains a quietly broken normalisation: `if n == 1 || t == 1 { t = 1; n = 1; }` silently overrides user-supplied params (see the sketch after this list).
- Bug: `protocol.rs` sends sign-phase-1 messages twice (loop duplicated).
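A safer normalisation, as a minimal sketch (the function name and error handling here are illustrative, not taken from any of the three codebases):

```rust
/// Reject inconsistent threshold parameters instead of silently rewriting
/// them (the shipped code forces t = n = 1 whenever either value is 1).
fn validate_params(t: usize, n: usize) -> Result<(), String> {
    if t == 0 || n == 0 {
        return Err("threshold and committee size must be at least 1".into());
    }
    if t > n {
        return Err(format!("threshold {t} cannot exceed committee size {n}"));
    }
    Ok(())
}
```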
Session pacing — gaps longer than 30 s:

| Model | # gaps > 30 s | Longest gap | Note |
|---|---|---|---|
| Opus 4.7 | 37 | ~80 min (13:56→15:17 UTC) | Same global pause (user lunch / break) seen across all three sessions. |
| GPT-5 | 8 | ~81 min (13:55→15:17 UTC) | Fewest mid-session waits — the model rarely paused on its own. |
| DeepSeek v4-pro | 69 | ~81 min (13:55→15:17 UTC) | Many small mid-session pauses: long generations, repeated retries. |
Opus 4.7 and GPT-5 both used iroh::address_lookup::MdnsAddressLookup and a real ALPN-based QUIC connection between peers. Opus plugged dkls23's own SimpleMessageRelay into iroh via a sink interceptor (the "official" path). GPT-5 built a from-scratch relay, which is more code but enables features Opus skipped.
DeepSeek v4-pro bound an iroh endpoint, but its wait_for_peers() loop reads /tmp/dkls23ctl/<key>/<peer>.json files instead of using mDNS. The user pushed back on this mid-session — the model acknowledged but did not fix it. The iroh dial path uses the loopback IP and port written to those files, defeating the whole point of mDNS / LAN discovery.
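To make the contrast concrete, here is a minimal sketch of LAN-scoped discovery in the spirit of the two compliant solutions. It is not taken from either codebase: iroh's builder methods have moved between releases, so `discovery_local_network()` (mDNS-style local discovery behind iroh's `discovery-local-network` feature) and the ALPN string are assumptions, and the blake3-derived tag mirrors Opus's `blake3(tool_id|key_id)[..6]` scoping idea.

```rust
// Sketch only — assumes a recent iroh with the "discovery-local-network"
// feature; exact method names may differ in the versions the agents pinned.
use anyhow::Result;
use iroh::{Endpoint, NodeAddr};

const ALPN: &[u8] = b"dkls23ctl/0"; // hypothetical ALPN for this tool

/// Short per-key discovery tag so peers working on other keys are ignored
/// (the scoping trick Opus used).
fn discovery_tag(tool_id: &str, key_id: &str) -> String {
    let digest = blake3::hash(format!("{tool_id}|{key_id}").as_bytes());
    hex::encode(&digest.as_bytes()[..6])
}

/// Bind a QUIC endpoint that announces itself over local-network (mDNS-style)
/// discovery instead of dropping JSON files under /tmp.
async fn bind_lan_endpoint() -> Result<Endpoint> {
    let ep = Endpoint::builder()
        .alpns(vec![ALPN.to_vec()])
        .discovery_local_network()
        .bind()
        .await?;
    Ok(ep)
}

/// Dial a discovered peer; the address comes from discovery, not from disk.
async fn dial(ep: &Endpoint, peer: NodeAddr) -> Result<iroh::endpoint::Connection> {
    Ok(ep.connect(peer, ALPN).await?)
}
```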
Only GPT-5 implements the full reshare matrix:
- `key_import::ecdsa_secret_shares` + `key_refresh` ✅
- `key_refresh` ✅
- `quorum_change` ✅
- `quorum_change` with mixed old/new committees ✅ (verified: (2,3)→(3,4) works end-to-end)
- `key_export` with x25519 export key ✅ (verified: receiver becomes singleton)

Opus 4.7 explicitly errors on the last two; DeepSeek cannot do any reshare beyond a same-params refresh, and even that hung in our test.
DeepSeek invoked OpenCode's question tool once:
"would you accept a simpler networking approach (TCP streams with file-based discovery) that's more reliable, or do you specifically need iroh QUIC for this tool?"
The user's reply was emphatic: "Initial request states clearly that this tool should work on localhost AND LAN, so file-based discovery is a critical flaw. iroh and related libs provide all the required functionality, you just didn't manage to use it correctly." The shipped code still uses file-based discovery. Hypothesis: DeepSeek v4-pro repeatedly failed to figure out iroh's mDNS API, and the model treated the rebuke as guidance to "keep trying" rather than as a hard constraint.
The spec calls for printing the public key on stdout. Opus 4.7 and GPT-5 print compressed SEC1 (33 bytes / 66 hex chars). DeepSeek prints raw uncompressed coordinates without the 04 prefix (64 bytes / 128 hex chars). It is technically a public key, but every downstream tool will choke on it. It's a leaky abstraction over dkls23-secp256k1's API surface.
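For concreteness, a small sketch with the k256 crate (not necessarily the crate any of the three solutions used) showing the three encodings in play:

```rust
// Illustrative only: why a 64-byte x||y dump trips up SEC1-aware tooling.
use k256::{elliptic_curve::sec1::ToEncodedPoint, PublicKey};

fn print_encodings(pk: &PublicKey) {
    let compressed = pk.to_encoded_point(true);    // 33 bytes, 02/03 prefix — what Opus and GPT-5 print
    let uncompressed = pk.to_encoded_point(false); // 65 bytes, 04 prefix
    println!("compressed:   {}", hex::encode(compressed.as_bytes()));
    println!("uncompressed: {}", hex::encode(uncompressed.as_bytes()));
    // DeepSeek's output: x || y with the leading 0x04 dropped (64 bytes).
    println!("raw x||y:     {}", hex::encode(&uncompressed.as_bytes()[1..]));
}
```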
Spec said github.com/silence-laboratories/dkls23. Opus 4.7 and GPT-5 picked sl-dkls23 on crates.io — Silence Labs' v1 beta of the same code. DeepSeek picked dkls23-secp256k1 — a different SL crate, multi-curve, with a much chattier phase-by-phase API. This forced DeepSeek to manually wire eight separate message types per DKG round, which it did adequately, then again for sign, then attempted reshare and got stuck. Opus and GPT-5 handed messages to dkls23's protocol task and let the library do the choreography.
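The shape of that difference, as a hypothetical sketch (names are illustrative, not dkls23's real API): in the "protocol task" style, the MPC library runs as one future that consumes and produces opaque byte messages, and the transport layer only moves bytes; phase ordering never touches user code.

```rust
// Hypothetical illustration of the byte-pump pattern Opus and GPT-5 leaned on.
use tokio::sync::mpsc;

struct NetworkMsg {
    to: usize,        // receiver party index
    payload: Vec<u8>, // opaque protocol bytes
}

async fn pump(
    mut from_protocol: mpsc::Receiver<NetworkMsg>,
    to_wire: mpsc::Sender<NetworkMsg>,
) {
    // No per-phase logic lives here — the duplicate-send bug in DeepSeek's
    // protocol.rs is exactly the kind of mistake hand-rolled phase loops invite.
    while let Some(msg) = from_protocol.recv().await {
        if to_wire.send(msg).await.is_err() {
            break;
        }
    }
}
```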
Several reinforcing factors. Probable root causes of DeepSeek's failure:

- `dkls23-secp256k1` required hand-assembling 4 DKG phases × N messages each. The error budget compounded.
- The model latched onto the `/tmp` rendezvous and never recovered, even after the user objected.

Beyond the failure itself, a few harness and design observations stand out:

- GPT-5's `apply_patch` calls produced 1254 lines that build and pass the QA matrix on first try.
- The `exec_command`+`write_stdin` pair (193 calls together) is GPT-5's primary tool — it can keep a long-running shell, drive cargo + tail logs without re-spawning. Claude Code instead fires 211 individual one-shot Bash calls.
- In Opus's layout a reviewer finds `commands/sign.rs` in seconds; GPT-5's monolithic `main.rs` requires scrolling.
- Scoping discovery by `tool_id|key_id` means peers running other key sessions don't show up in your discovery.

While Opus 4.7 itself rarely needed user input on the actual task, the Claude Code harness demanded the most permission-prompt clicks of any of the three. After the session ended, `cc/.claude/settings.local.json` contained 30 persisted "always allow" entries — and that's only the requests where the user picked the persistent option. The session log shows 50 `permissionMode` transitions through `acceptEdits` on top of those.
Worse, many of the persistent entries are uselessly narrow — they won't match a similar future command:
- `Bash(/home/<user>/src/<project>/target/debug/dkls23ctl verify *)` — absolute path baked in; useless if the dir ever moves.
- `Bash(echo "exits: $?")`, `Bash(echo "exit=$?")`, `Bash(wait)` — exact-string matches on one-off shell snippets.
- `Bash(rm -rf .secrets)`, `Bash(pkill -f 'reshare --key-id sk1')` — pinned to the test's literal key-id; won't apply to any other run.
- `Bash(python3 -c '…literal 100-char snippet…')` — granted with the script body baked in.
- `Bash(DKLS23CTL_BIN=$(readlink -f ./target/debug/dkls23ctl) bash scripts/test_reshare.sh)` — a single command line frozen as a permission rule.

And many tool calls offered no persistence option at all (or only a one-shot accept), so the user kept re-clicking through similar variants. By contrast, Codex defers to its sandbox profile (one decision at session start), and OpenCode's permission table for this session has zero persisted rules. Net effect: Opus 4.7 made the user click through dozens of prompts the other two harnesses never raised — and the resulting allow-list is mostly cruft that won't help a future session.
| Aspect | Opus 4.7 · Claude Code | GPT-5 · Codex | DeepSeek v4-pro · OpenCode |
|---|---|---|---|
| Spec compliance (mDNS) | Yes | Yes | No (filesystem) |
| Reshare completeness | Partial (no n-change, no export) | Full matrix | Refresh only, hangs |
| Code structure | Modular, idiomatic | Monolithic but tight | Modular but dense network.rs |
| Testing | 2 tests + 4 scripts + run_all | 2 tests, no scripts | 0 tests, scripts admit failures |
| Documentation | README + module docs | None | None |
| Time efficiency | 65 min active | 26 min active | 95 min active |
| Bugs | None observed | None observed | Silent param override; duplicated send loop; reshare hangs |
| API surface choice | SimpleMessageRelay (canonical) | Custom relay (works, more code) | Wrong upstream lib |
| Operator UX (output) | Tagged stdout (PUBKEY/SHARE/SIGNATURE) | Single hex line, no tag | Single hex line, wrong format |
| Operator UX (permissions) | ~30 persisted grants, many over-narrow; lots of mid-session prompts | Sandbox profile, one decision at start | Zero persisted rules in session |
GPT-5 (Codex, high effort) is the most spec-complete and the fastest to produce a working tool. If the only criterion is "does it pass the QA matrix", GPT-5 wins.
Opus 4.7 (Claude Code, x-high effort) wins on engineering quality — readable code, real tests, scripts, README, idiomatic library use — at the cost of skipping two reshare paths and spending more time. Best for handing off to another engineer. Operator caveat: the Claude Code harness's permission UX makes this the most interaction-heavy option, and most of its "always allow" rules end up too narrow to reuse.
DeepSeek v4-pro (OpenCode, high/max effort) failed to deliver a tool that meets the spec. The combination of a weaker base model, an unfortunate library pick, and a stubborn refusal to fix the discovery layer after explicit user feedback makes this the clear loser. The lesson: when a model gets stuck on a constraint it doesn't understand, escalating to the user works only if the model is then willing to reverse course.