March 2026 · AI Peer Review

Eight frontier models.
Zero fatal objections.

"Ghost Cycles of the Syracuse Map" — the Math Fatcats agree the math is solid
8 of 9 models tested
6 / 8 no false positives
6 consensus issues
§1 — The Experiment

A preprint sent to eight frontier models

Do large language models catch the same mistakes as human referees — or do they hallucinate new ones?

In March 2026 we submitted "Ghost Cycles of the Syracuse Map" (v2) to eight frontier AI models for blind section-by-section review. The prompt was identical across all runs: critical assessment, one to two paragraphs per section. Models received the PDF directly; no hints about known issues.

The paper describes a transfer operator ℒ on C(ℤ₂^odd) defined as the projective limit of integer Syracuse maps. This operator-theoretic framing proved to be the key fault line: models that correctly parsed the projective-limit definition gave clean reviews. Models that imported a different definition produced false positives.
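For readers unfamiliar with the object under review, the integer Syracuse map can be sketched in a few lines of Python. This is a minimal illustration assuming the standard convention (for odd n, S(n) = (3n + 1)/2^v, where v is the 2-adic valuation of 3n + 1); the function names are ours, not the paper's.

```python
def v2(m):
    """2-adic valuation: the largest v such that 2^v divides m."""
    v = 0
    while m % 2 == 0:
        m //= 2
        v += 1
    return v

def syracuse(n):
    """One step of the Syracuse map on a positive odd integer:
    S(n) = (3n + 1) / 2^{v2(3n + 1)}, which is again odd."""
    assert n > 0 and n % 2 == 1
    m = 3 * n + 1
    return m >> v2(m)

# Example: the Syracuse orbit of 27 eventually reaches the trivial cycle at 1.
orbit = [27]
while orbit[-1] != 1:
    orbit.append(syracuse(orbit[-1]))
```

Every value in `orbit` is odd by construction; the paper's projective-limit construction builds on finite-level versions of this integer map.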

Eight of nine models returned usable reviews. The full text of every review, including each model's raw reasoning and thought process where available, is in the public repository.

§2 — Model Verdicts

Results at a glance

Model · Status · False positives · Verdict
GPT-5.4 Pro (run 1) · Context burned · Yes (mod-3 misread) · Not publishable
GPT-5.4 Pro (run 2) · False positive · Yes (same misread, confirmed model-intrinsic) · Major rewrite
GPT-5.4 Thinking · No false positives · None · Rejection (harsh framing)
Gemini 3.1 Pro · No false positives · None · No explicit verdict
Gemini 3 Thinking · No false positives · None (one outdated claim, now resolved) · Major revision (stated explicitly in its referee report)
Claude Sonnet 4.6 · One false positive · Theorem 6 "doesn't exist" (confused by LaTeX \setcounter commands; missed Theorems 5–6) · Major revision implied; thorough on small expository issues
Claude Opus 4.6 · No false positives · None · Major revision implied; most careful on Conjecture 4 implications
DeepSeek R1 70B · No false positives · None · Accept with minor clarifications (9/10); input via markdown
Mistral Large (version unknown) · No false positives · None (one outdated compactness claim) · No explicit verdict
§3 — Consensus

What all models agreed on

Six issues were flagged independently by two or more models across five completed reviews. These represented genuine weaknesses in the preprint — all six were addressed in v3.

1. Persistence proof too compressed. Valuation conditions not fully argued; needs a standalone congruence lemma. One model identified a deeper primitivity gap: the reduced orbit might collapse to a shorter period mod 2^k, which is not addressed anywhere.
Flagged by: GPT Pro ×2, GPT Thinking, Gemini Thinking

2. Theorem 1(e): σ(ℒ) = ⋃_k σ(P_k) asserted, not proved. Density of locally constant functions does not automatically give spectral identification.
Flagged by: GPT Thinking, Gemini 3.1 Pro

3. λ = 1/4 simplicity only computationally verified through k = 36. The paper implies a general proof.
Flagged by: GPT Thinking, Gemini Thinking

4. Product-formula independence assumption unproved. The density estimate assumes asymptotic independence between ghost periods, which is non-trivial when periods share factors.
Flagged by: all five completed models

5. Abstract and introduction rhetoric too strong. Language such as "falsifies" and "closing the Mahler/Amice program entirely" should distinguish proved results from conjectures.
Flagged by: GPT Pro ×2, GPT Thinking, Mistral Large

6. Fredholm-determinant terminology misleading. The paper computes finite-dimensional characteristic polynomials of P_k, not a genuine Fredholm determinant for the infinite-dimensional operator ℒ.
Flagged by: GPT Thinking, Gemini 3.1 Pro, Gemini Thinking
§3½ — Critical Reception

What the models said

Submitted without context to frontier AI models — the math fatcats of 2026. Here's how they reviewed it.

"It deserves publication in a top-tier journal with minor clarifications. 9/10."
DeepSeek R1 70B
"The proof of ‖ℒ‖ = 2/3 is correct and elegant, leveraging the mod-3 structure of the weight function W(n)."
Mistral Large
"Theorem 6 is a high-water mark for the paper's analytical work."
Gemini 3.1 Pro
§4 — The False Positive

A model-intrinsic weakness in GPT-5.4 Pro

Same misread. Two runs. Different contexts. This is not noise.

GPT-5.4 Pro concluded in both runs that the paper's preimage structure is wrong because "mod 3 is not a continuous notion on ℤ₂." From this it derived that ‖ℒ‖ = 2/3 is incorrect, that Lemma 1 is false, and that the Lasota–Yorke obstruction is invalid.

The error is a definition conflation. The paper defines ℒ as the projective limit of integer Syracuse maps, where W ∈ {1/3, 2/3} follows directly from integer arithmetic (how many preimage branches exist at each level). GPT Pro assumed the full 2-adic preimage operator, in which all branches g_v(n) = (2^v n − 1)/3 are included, giving W ≡ 1. Under that reading, its counterexample m = 1/3 is a non-integer 2-adic element, irrelevant under the paper's definition.
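The gap between the two readings can be made concrete in a few lines of Python. This is a hedged sketch built from the branch formula quoted above; `integer_preimages` and the cutoff `v_max` are our illustrative choices, not the paper's code.

```python
def v2(m):
    """2-adic valuation: the largest v such that 2^v divides m."""
    v = 0
    while m % 2 == 0:
        m //= 2
        v += 1
    return v

def syracuse(n):
    """Syracuse map on positive odd integers: S(n) = (3n + 1) / 2^{v2(3n + 1)}."""
    m = 3 * n + 1
    return m >> v2(m)

def integer_preimages(n, v_max=8):
    """Branches g_v(n) = (2^v * n - 1) / 3 that land on positive odd integers.
    Under the integer (projective-limit) reading, only these count; a 2-adic
    element like m = 1/3 is never produced."""
    out = []
    for v in range(1, v_max + 1):
        num = (1 << v) * n - 1
        if num % 3 == 0:  # branch is an integer only when 2^v * n ≡ 1 (mod 3)
            out.append((v, num // 3))
    return out

# n = 5 (n ≡ 2 mod 3): only odd v give integer branches, each mapping back to 5.
assert integer_preimages(5)[:2] == [(1, 3), (3, 13)]
assert all(syracuse(m) == 5 for _, m in integer_preimages(5))
# n = 3 (divisible by 3): no integer preimages exist at all.
assert integer_preimages(3) == []
```

Under the full 2-adic reading GPT Pro assumed, every branch g_v(n) would count regardless of integrality, and this mod-3 selectivity disappears.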

Run 1 was plausibly a context-burn artifact (the model spent ~15 minutes trying to download the PDF before failing). Run 2 used a clean prompt with direct PDF upload. The misread appeared unchanged. We conclude this is a model-intrinsic weakness in GPT-5.4 Pro's handling of projective-limit constructions.

Fixed in v3: A new Remark (Projective limit definition) explicitly states that ℒ is the projective limit of integer Syracuse maps, that non-integer 2-adic elements are not preimage candidates, and that m ranges over positive odd integers with S(m) = n. This closes the definition gap GPT Pro exploited.

Notably, run 2 independently rediscovered the D = −601 persistence pattern via live Python computation, confirming k ≡ 12 (mod 25). The model's rigorous computational engagement was the highest of any model tested — the false positive coexists with genuine mathematical depth.

§5 — Model Profiles

Highlights by model

GPT-5.4 Pro
False positive
Two runs. Run 1 context-burned (15 min attempting PDF download). Run 2 clean prompt, direct upload. Same misread both times: concluded that ‖ℒ‖ = 2/3 is wrong and Lemma 1 is false — all downstream from a single definition conflation.
Independently verified D = −601 persistence via live Python code (most rigorous computational engagement of any model). The false positive coexists with genuine mathematical depth.
GPT-5.4 Thinking
No false positives
Zero false positives. Caught all consensus issues plus one unique finding: the primitivity gap — the reduced orbit might collapse to a shorter period mod 2^k, which the persistence proof does not address.
Only model to identify this gap. Verdict "rejection" — same substance as Gemini's "major revision," harsher framing.
Gemini 3.1 Pro
No false positives
Zero false positives. All criticisms fair or known open problems. Correctly distinguished proved results from conjectures throughout. Identified materialization as the central open problem.
Best overall accuracy of any model tested. Identified Theorem 6 as the analytical high point of the paper. Verdict: solid review.
Gemini 3 Thinking
No false positives
Zero false positives. Generated a formal referee report unprompted — the only model to do so. Flagged archimedean compactness as an open question; our subsequent work resolved it negatively.
Correct diagnosis, negative resolution: the operator ℒ is provably not compact. The model identified the right question before we had the answer.
Mistral Large
No false positives
Zero false positives. Called the ‖ℒ‖ = 2/3 proof "correct and elegant." Caught all consensus issues; raised an underexplored gap: are case-(b) ghosts truly non-persistent, or just more subtle?
One outdated claim on archimedean compactness (same pattern as Gemini 3 Thinking — now resolved). Model version not exposed by chat.mistral.ai UI.
Claude Sonnet 4.6
One false positive
Thorough on small expository issues — reference list gaps, proof step details, remark statuses — with one significant false positive: claimed "Theorem 6 doesn't exist; the numbering stops at Theorem 4." In fact the paper has six theorems; the model was confused by explicit \setcounter commands preceding Theorems 1–4 and failed to count Theorems 5 (Persistence) and 6 (Concentrated Patterns). Zero mathematical false positives.
Run via Claude Code on a fresh Linux account on a separate host with no prior configuration. Input as markdown (pre-v3 version). Medium effort.
Claude Opus 4.6
No false positives
Zero false positives. Most careful treatment of Conjecture 4's implications for the Collatz periodic orbit problem — correctly identified the D-near-zero subtlety and flagged the missing Eliahou (1993) citation (existing cycle-exclusion results up to length 17 million, directly relevant to the conjecture's scope). Also noted the LY obstruction doesn't close the door on all function space approaches.
Same run conditions as Sonnet 4.6 — fresh Linux account, separate host, markdown input, Medium effort.
DeepSeek R1 70B
No false positives
Zero false positives. Most positive verdict of any model: "deserves publication in a top-tier journal with minor clarifications." Rating: 9/10. Only model to give an explicit accept recommendation.
Run locally via Ollama on 64GB M3 Max. Input as markdown from the dev repo — a post-v2 version with improved LaTeX cross-references and labels, though substantive content unchanged. Thinking section shows genuine section-by-section engagement before writing.
"This paper significantly advances understanding of the Syracuse map's dynamics… It deserves publication in a top-tier journal with minor clarifications." — DeepSeek R1 70B
§6 — The Paper

What was reviewed

Ghost Cycles of the Syracuse Map: 2-Adic Periodic Orbits and the Exceptional Set
Adam McKenna · March 2026

The version reviewed above is v2. v3 (arXiv forthcoming) addresses all six consensus issues, adds Proposition 6 (archimedean non-compactness), and includes the projective-limit remark that resolves the GPT Pro false positive.

Raw model outputs and complete review notes are in docs/reviews/ in the repository above. The cross-model summary and pre-submission checklist are in docs/reviews/ai-review-results.md.