Recently, the application of AI tools to Erdos problems passed a milestone: an Erdos problem (#728 https://www.erdosproblems.com/728) was solved more or less autonomously by AI (after some feedback from an initial attempt), in the spirit of the problem (as reconstructed by the Erdos problem website community), with the result (to the best of our knowledge) not replicated in existing literature (although similar results proven by similar methods were located).
This is a demonstration of the genuine increase in capability of these tools in recent months, and is largely consistent with other recent demonstrations of AI using existing methods to resolve Erdos problems, although in most previous cases a solution to these problems was later located in the literature, as discussed in https://mathstodon.xyz/deck/@tao/115788262274999408 . This particular case was unusual in that the problem as stated by Erdos was misformulated, with a reconstruction of the problem in the intended spirit only obtained in the last few months, which helps explain the lack of prior literature on the problem. However, I would like to talk here about another aspect of the story which I find more interesting than the solution itself, which is the emerging AI-powered capability to rapidly write and rewrite expositions of the solution. (1/5)
Let me begin by quickly recapping the history of this problem. In 1975, Erdos, Graham, Ruzsa, and Strauss studied the prime factorization of binomial coefficients such as
A few months ago, a team associated with the AI tool AlphaProof observed that the problem admitted several trivial solutions, if 𝑎 or 𝑏 were allowed to be large compared with 𝑛. This technically solved the problem, but was deemed not in the spirit of the question, and an additional constraint 𝑎,𝑏≤(1−ε)𝑛 was imposed to rule out these solutions. Further AI-assisted literature search did not turn up significant work on this problem. (2/5)
On Jan 4, ChatGPT was able to produce a proof even with the adjusted constraint, but with 𝐶 taken to be a small constant: https://chatgpt.com/s/t_695bdbf3047c8191af842d03db356b1a ; this was then formalized by Aristotle into Lean. However, it was determined on closer reading of the source paper that 𝐶 was intended to be large; indeed, the results in the original paper already established the small 𝐶 claim, although this was not evident to us until a few days later.
Shortly afterwards, a different website participant ran the Lean proof through ChatGPT to rewrite it in natural language https://drive.google.com/file/d/1ejHqEddpD52SYubZlK8eYx2aoanFmMs8/view?usp=sharing , and then after further conversation obtained an improved writeup https://drive.google.com/file/d/1HAoRuYiUTN0PNOhjY96-dWKNqJ1X4Ucy/view?usp=sharing , in which several gaps in the original proof were filled in. The exposition was still somewhat clunky and "AI" in feel, and lacked many of the remarks and references to the literature that would organically accompany a human-written proof, but was readable enough that the general ideas of the proof could be readily extracted. (3/5)
Meanwhile, with further prompting, ChatGPT was also able to adapt the argument to handle large 𝐶 as well as small 𝐶, thus finally producing a new result in the spirit of the intended question https://drive.google.com/file/d/1xRw8_o2C8HwmxMDnBR5OJlxXaW7jlYbz/view?usp=sharing . Interestingly, the proof contained some minor errors in it, but the AI tool Aristotle was able to automatically repair these gaps and produce a Lean-verified proof.
At this point, a third participant ran Aristotle again on the existing Lean proof to produce a shorter version, which a different participant then fed into a lengthy back-and-forth ChatGPT session https://chatgpt.com/share/695e7cbd-605c-8010-809b-ccba75560c76 to turn it into a much more fully fleshed-out article, with a tighter narrative structure, that described not just the proof itself but also its connection with the prior literature. This resulted in a new writeup of the proof https://drive.google.com/file/d/1MRQfcHhrYMfMTvlZcMC3zEK7aOrUyHiQ/view?usp=sharing that had less of the feel of a generic AI-produced document, and which I judge to be at a level of writing within the ballpark of an acceptable standard for a research paper, although there is still room for further improvement. (I review this text at https://www.erdosproblems.com/forum/thread/728#post-2852 ). (4/5)
My preference would still be for the final writeup for this result to be primarily human-generated in the most essential portions of the paper, though I can see a case for delegating routine proofs to some combination of AI-generated text and Lean code. But to me, the more interesting capability revealed by these events is the ability to rapidly write and rewrite new versions of a text as needed, even if one was not the original author of the argument.
This is in sharp contrast to existing practice, where producing even one readable manuscript is quite time-consuming, and subsequent revisions (in response to referee reports, for instance) are largely confined to local changes (e.g., modifying the proof of a single lemma), with large-scale reworking of the paper often avoided due both to the work required and the significant risk of introducing new errors. However, the combination of reasonably competent AI text generation and modification capabilities, paired with the ability of formal proof assistants to verify the informal arguments thus generated, allows for a much more dynamic and high-multiplicity conception of what a writeup of an argument is, with the ability for individual participants to rapidly create tailored expositions of the argument at whatever level of rigor and precision is desired.
Presumably one would still want to have a singular "official" paper artefact that is held to the highest standards of writing; but this primary paper could now be accompanied by a large number of secondary alternate versions of the paper that may be somewhat looser and AI-generated in nature, but could hold additional value beyond the primary document. (5/5)
Addendum: as portions of my text above have been quoted out of context, I would like to also draw attention to the various caveats listed at https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-problems regarding the extent to which one can draw broader conclusions about AI mathematics capabilities from the progress in solving Erdos problems.
@tao What would the additional value of the generated documents be?
@tao Are there any current plans on creating objective measures of 'similarity' to existing results? For myself and maybe others unfamiliar with these specific fields, this progress is difficult to gauge independently of these posts and the Erdos problem forum.
For example, on this thread, https://www.erdosproblems.com/forum/thread/AI%20Contributions#post-2455 it was commented that
"[When] an AI can one-shot a proof/counterexample—producing it straight away in a way that even humans find genuinely clever—I take it as a strong hint that the result is probably already in the literature."
which seemed to garner some agreement. Measuring to what precise extent an AI result was or was not already present in the literature seems important for claiming there has been an improvement in 'genuine cleverness'. Considering that similar results were found in the literature, as you stated above and as stated here https://www.erdosproblems.com/forum/thread/728#post-2824, this question of similarity seems important now and in the future.
The kind of thing I have in mind is something along the lines of mutual information, but in a proof-theoretic context (e.g., how long is the shortest Lean-formalized proof, given Lean-formalized versions of existing similar proofs in the literature, or something like that). This seems more feasible to measure empirically now than it once was, given the nature of the AI-assisted Lean formalization work being done.
@tao I guess we should fasten our seatbelts; this is a very bright signal!
A rather philosophical question: is 'informal' always a suitable adjective (when no explicit context is given), or can we always make precise arguments about whether 'semiformal' fits the landscape better? Although I have some critiques of the MIT paper on the 'platonic representation hypothesis', there seem to be at least some weak correspondence(s) across likelihood spaces, independent of solid/minimal model assumptions.
@tao we have a similar problem with coding AI assistants. The ability of AI to continually refactor globally makes it hard for coders to maintain a handle on the codebase.
In both cases it suggests that the code/paper is now a secondary format and that people should be working on a higher-level document. Something that AI can work from to generate various incarnations of code/proofs.
@TomL The "problem" you speak of is not a failing of programmers, it's a failing of the random text generator. The "changes" are large because there is no "understanding" of the existing code, just a different input and different internal state (that you don't control) resulting in a different outcome that requires complete revalidation of the newly generated system. Great! You've replaced engineering with roulette and the monkeys are controlling your pacemaker. Best of luck with that.
Aye! An amazing display of new technology allowing better collaboration among diverse groups of scientists, plus assisting in rethinking!
Q: How much do the rewriting capabilities depend on keeping context, and how much on improved prompts?
Today I also enjoyed reading this post, which beautifully explains some similar approaches, challenges, and quality assurance measures, but for using #LLMs #vibecoding #GenAI for #softwareengineering in corporate regulated-domain solutions:
'Here’s the nuance: while generation is probabilistic, the produced code can still be deterministic once compiled and executed — if we enforce the right controls.'
by #AjitPatil
Loving the vibe where every sentence feels like an AI trying to win a grant proposal from another AI: just sprinkle enough 'nuance' and 'regulated domains' and boom—funded. #LLM #GenAI #buzzwordSoup
Prompting ChatGPT 5.2 ExtThk produced a one-shot suitable proof for Open Erdős Problem 460, best summarized as:
• For every n ≥ 3, the "good-index" restricted sum
  S≤(n) := ∑_{i ≥ 1 : ∃ prime p ≤ a_i with p | (n − a_i)} 1/a_i
  also diverges to +∞.
• For every n ∈ N, the complementary "bad-index" subseries
  S>(n) := ∑_{i ≥ 1 : ∀ primes p ≤ a_i, p ∤ (n − a_i)} 1/a_i
  is finite (hence convergent).
My favorite part about this proof is how many times ai says ai to solve for ai. I believe it is not coincidental that this recursiveness is quietly beautiful.
For n ≥ 3, the greedy coprimality condition forces the difference values b_i := n − a_i to be pairwise coprime and nonzero. This makes it impossible to "avoid" b = −q once q is a sufficiently large prime: any earlier b_i is too small in absolute value (and nonzero) to be divisible by q. Therefore a = n + q must occur for every prime q > n − 1. The sum S(n) then dominates a shifted tail of ∑_{q prime} 1/q, which diverges. A technical rigor point is that the clean inequality 1/(n + q) ≥ (1/2)(1/q) is used only for primes q > n.
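The prime-embedding step above is easy to probe numerically. Below is a minimal Python sketch (my own illustration, not part of the posted proof; function names and search bounds are arbitrary) that builds the greedy sequence under the database normalization recorded on erdosproblems.com/460 (a0 = n, a1 = 1, each later term the least m above the previous with gcd(n − m, n − a_i) = 1 for all earlier i ≥ 1) and checks that n + q indeed appears for every prime q > n − 1 in a small range:

```python
from math import gcd

def greedy_terms(n, last):
    """Greedy sequence, database normalization: a0 = n, a1 = 1, and each
    later term is the least m above the previous one such that
    gcd(n - m, n - a_i) = 1 for every earlier index i >= 1."""
    seq = [n, 1]
    while seq[-1] < last:
        m = seq[-1] + 1
        while not all(gcd(abs(n - m), abs(n - a)) == 1 for a in seq[1:]):
            m += 1
        seq.append(m)
    return seq

def is_prime(k):
    return k >= 2 and all(k % d for d in range(2, int(k**0.5) + 1))

# The argument predicts that n + q is a term for every prime q > n - 1.
n, Q = 5, 60
terms = set(greedy_terms(n, n + Q))
missing = [q for q in range(n, Q + 1) if is_prime(q) and (n + q) not in terms]
print(missing)  # expected: [] (every such prime is embedded)
```

For n = 5 and primes up to 60 the list of missing primes comes back empty, consistent with the claim that the reciprocal sum dominates a shifted tail of the prime harmonic series.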
@tao For problems like the Erdos problems, how would you analyze or visualize a large-scale AI sweep of, say, thousands of problems? I know for the equational theories project you used a graph, but how could you do that for something like number theory or combinatorics?
@allendist57 We have some statistics and graphs for the Erdos problem database as a whole; see https://github.com/teorth/erdosproblems . But for now, it seems the most productive setup is to have individual web pages and discussion forums for each problem, see https://www.erdosproblems.com/forum/
@tao do you think you could use an embedding to group similar problems together
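As a toy sketch of this suggestion (the embedding vectors below are made-up placeholders; in practice they would come from an embedding model run on the problem statements, and the similarity threshold is arbitrary), one could group problems by cosine similarity with single-linkage merging:

```python
from math import sqrt
from itertools import combinations

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

# Hypothetical embedding vectors for four problems (placeholders only).
emb = {
    "728": [0.9, 0.1, 0.0],
    "729": [0.88, 0.15, 0.02],
    "401": [0.7, 0.3, 0.1],
    "460": [0.1, 0.2, 0.95],
}

# Single-linkage grouping: union any pair above the threshold
# using a tiny union-find structure.
parent = {p: p for p in emb}
def find(p):
    while parent[p] != p:
        p = parent[p]
    return p

for p, q in combinations(emb, 2):
    if cosine(emb[p], emb[q]) > 0.9:
        parent[find(p)] = find(q)

groups = {}
for p in emb:
    groups.setdefault(find(p), []).append(p)
print(sorted(sorted(g) for g in groups.values()))
```

With these placeholder vectors, 728, 729, and 401 land in one group and 460 in another; real embeddings would of course give different groupings.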
@tao @allendist57 I've been working on leangenius.org with the idea that people might enjoy commenting on lines of the proof as a UX (similar to rap genius) - e.g. https://leangenius.org/proof/erdos-728-factorial-divisibility
It's been a fun side project and might spark more ideas on how to communicate results in the age of generative AI. (One idea: perhaps users could configure their desired level of pedantry, and AI annotations could automatically be shown or hidden.)
@tao My apologies, it looks like you already have the problems public on GitHub, which is great. That said, my earlier comment might be useful for determining the difficulty of a problem. I think math should have its own GitHub, call it mathhub or something lol. GitHub is good, but I don't think it's the best system for your goals.
@tao You mentioned, “Unfortunately, it is hard to tell in advance which category a given problem falls into, short of an expert literature review.”
I was thinking: what if we had a public system where Erdos problems are written in Lean? People could create accounts and self-report their level of math education, then attack whichever problems they want. You could then determine the difficulty of a problem based on how many people attempted it and what their education levels were.
Once you have that system set up, you could incorporate AI running in the background to try and solve these Erdos problems. If it succeeds, great. If it fails, it could generate a report on the problem, which would be useful for a human when they eventually encounter it.
I was also thinking about having the AI create something like a web of related problems. That way, when a human solves a problem, the AI might attack other connected problems using the ideas from the human’s proof.
Basically, the AI tries to solve the problems first; if it works, good, but if not, there is a report. Then, the AI maps out which problems are related to each other (it’s not a perfect connection, but it’s better than nothing). Then, the public attacks those problems. If they succeed at one, the AI "turns on" like the terminator lol and tries to attack all the connected problems using the ideas in the human-generated proof. It might create a domino effect.
Jan 11th: Prompting ChatGPT 5.2 ExtThk produced a one-shot suitable proof for Open Erdős Problem 460.
https://www.erdosproblems.com/460
Erdős posed Problem #460 in the setting of greedy coprimality constructions in combinatorial number theory (see, e.g., [Erd77, EG80]), and a current formulation is recorded in the Erdős Problems database [Blo]. The problem concerns a parameter n ∈ N and a greedily defined integer sequence (a_k), and asks whether a reciprocal series diverges as n → ∞, as well as the behavior of two natural restricted subseries.
A key point for robustness is that the problem appears in the literature in (at least) two closely related normalizations:
• Erdős Problems Database normalization (used in [Blo]): a0 = n, a1 = 1, and for k ≥ 2 choose the least integer m > ak−1 such that gcd(n − m, n − ai) = 1 for all 1 ≤ i < k.
• Erdős–1977 normalization (as commonly quoted from [Erd77]): a0 = 0, a1 = 1, and for k ≥ 2 choose the least integer m > ak−1 such that gcd(n − m, n − ai) = 1 for all 0 ≤ i < k (i.e. including i = 0, which enforces gcd(n − m, n) = 1). These choices affect small-n behavior (notably n = 2), so results are stated with the normalization explicit.
For the remainder of the paper we present the full proof under the database normalization (since it matches [Blo]), and we then record the precise changes under the Erdős–1977 normalization. In particular, the core prime-embedding argument is proved in a form that simultaneously covers both normalizations for all n ≥ 3 by embedding primes q > n.
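To make the two normalizations concrete, here is a minimal Python sketch (interface and names are my own, assuming n ≥ 3; this is not code from the paper) that generates the first few terms under either convention:

```python
from math import gcd

def greedy_sequence(n, count, norm="db"):
    """Greedy sequence for Erdős problem 460 (sketch; assumes n >= 3).

    norm="db":   a0 = n, a1 = 1; the coprimality condition runs over i >= 1.
    norm="1977": a0 = 0, a1 = 1; the condition also includes i = 0, which
                 additionally enforces gcd(n - m, n) = 1.
    Returns [a0, a1, ..., a_count].
    """
    seq = [n, 1] if norm == "db" else [0, 1]
    start = 1 if norm == "db" else 0  # first index the condition checks
    while len(seq) <= count:
        m = seq[-1] + 1
        # least m above the previous term with gcd(n - m, n - a_i) = 1
        # for all earlier indices i >= start
        while not all(gcd(abs(n - m), abs(n - a)) == 1 for a in seq[start:]):
            m += 1
        seq.append(m)
    return seq

print(greedy_sequence(3, 6))          # database normalization
print(greedy_sequence(3, 4, "1977"))  # Erdős-1977 normalization
```

For n = 3 the database normalization gives 3, 1, 2, 4, 6, 8, 10, …, while the 1977 normalization also forbids common factors with n itself and so produces a different sequence for the same n, illustrating why the two conventions must be kept explicit.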
Can someone in Number Theory arXiv please endorse me so I can post the paper for review?
@tao Since this is pretty far from my field, I was wondering if you had any sense of how significant a result this is? If there were no AI involvement (say a grad student solved this problem) what kind of journal would you suggest submitting it to? Proceedings of the AMS?
@noahsnyder @tao And this is great news because?
I have yet to meet the mathematician who is looking forward to becoming an LLM prompter and verifier instead of actually proving things themselves. I guess I'm not sufficiently AI-hyped to understand why this will be a good thing...
@noahsnyder By itself, I would not rate this as a publishable result; papers are not just about solutions to problems, but placing these solutions in the broader context of the field. So I could see this solution becoming part of a more substantial paper surveying the state of various related problems involving these sorts of binomial coefficient divisibility problems, describing what the standard methods can accomplish (including the solution to this problem as an example), and discussing the remaining open problems in the field (https://www.erdosproblems.com/376 is a fairly well known one for instance). I could see AI tools playing a role in systematically exploring the application of such well understood methods to reasonably large classes of problems at a scale which would either be impracticable or inefficient to have human mathematicians work out by traditional means.
@noahsnyder An update: as it turns out, the AI-generated solution to https://www.erdosproblems.com/728 can be modified to also answer the closely related problem https://www.erdosproblems.com/729; and as part of the community discussion of these results, a previously unreported link to another similar problem https://www.erdosproblems.com/401 was unearthed, which seems likely to be solved by these AI tools soon as well. A more challenging variant problem https://www.erdosproblems.com/400 is now being looked at, which would likely require more sophisticated arguments. With these developments, I would revise my previous assessment; while the solution of just 728 would not be sufficient for a research-level paper, a broader paper developing some general results that can handle all of these problems (and perhaps some other questions not posed by Erdos) in a unified fashion would be a reasonable candidate for a paper publishable in a reputable journal.
@tao Do you have some sense of how similar these problems are to existing literature, as well as to each other? They seem quite similar, and were also part of larger papers.