An interesting (unscientific) experiment on #MathOverflow from a few months ago, where a user gave 15 different MO problems for o1 to answer, with the aim of verifying and then rewriting the answer into a presentable form if the AI-generated answer was correct. The outcome was: one question answered correctly, verified, and rewritten; one question given a useful lead, which led the experimenter to find a more direct answer; one possibly correct answer that the experimenter was not able to verify; and the remainder described as "a ton of time consuming chaos", in which the experimenter spent much time trying to verify a hallucinated response before giving up.

https://meta.mathoverflow.net/questions/6114/capabilities-and-limits-of-ai-on-mathoverflow

This success rate largely tracks with my own experience with these tools. At present this workflow remains less efficient than traditional pen-and-paper approaches; but with some improvement in the success rate, and (more importantly) an improved ability to detect (and then reject) hallucinated responses, I could see one soon reaching a point where a non-trivial fraction of the easier problems on MO could be resolved by a semi-automated method.
I also found the discussion in the post about possible AI disclosure policies for MO to be interesting.
@tao As a moderator, let me stress that this experiment is not something I'd like to see replicated en masse. Maybe in a year or two it might be worth revisiting to check on the progress of the tools.
@tao You have to get it to code. o1 does not code. Use Claude.
@brandon According to the original post, Claude and Gemini performed strictly worse than o1 for this experiment.
@tao https://claude.ai/share/e75d2ad3-0ebf-4a63-a527-9ab5a96ef002
You have to get it to code, and there are different models. He doesn't say which models he used for either Gemini or Claude. For Google he should have used Google AI Studio. Google AI Studio degrades quickly, but it's good at validating hallucinations from other models. It's all about the prompt engineering, although it has its runs of misleading information and it's not great at all problems. Different models for different situations. The theory can look strong, but if you don't get it to code the solution and validate that the code actually ran, you can't trust it. Giving it specific problems without sharing the prompts is a bit misleading. I agree with you, though. But I do think the models that code out the solution are a lot more consistent in their answers.
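For illustration, here's a minimal sketch of the kind of "make it code and check it" step I mean; the identity below is just a stand-in for whatever claim a model might actually make, and the check only covers finitely many cases:

```python
# Sanity-check an AI-claimed closed form against a running brute-force sum.
# Stand-in claim: 1^3 + 2^3 + ... + n^3 = (n(n+1)/2)^2.

def claimed_form(n: int) -> int:
    """The closed form the model claimed."""
    return (n * (n + 1) // 2) ** 2

running_sum = 0
mismatches = []
for n in range(1, 10_001):
    running_sum += n ** 3          # direct computation, no closed form assumed
    if running_sum != claimed_form(n):
        mismatches.append(n)

print("mismatches among the first 10,000 cases:", mismatches)  # expect []
```

Passing a finite check like this is evidence, not a proof, but it catches a lot of confidently stated nonsense.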
Also, I read your paper on the Hall-Montgomery constant for n greater than or equal to five. What a trip. Then I spent all day trying to prove Hall-Montgomery wrong. Computational overload; I had no idea. I thought I had something, and learned a new lesson about computational sampling. Great job on making that argument airtight. I also spent a little time trying to figure out why your constant exists for your open question. So far, a dead end.
@brandon Most questions on MathOverflow cannot be solved through code, as they often involve infinite spaces, or unbounded numerical quantities. In the long term, when formal proof assistants improve and become more integrated with AI, it may become possible for AI to generate answers to MathOverflow type questions written in such languages; but this is still several years away from being feasible at current trends.
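For a sense of what an answer "written in such languages" looks like in practice, here is a toy machine-checked statement in Lean 4 with Mathlib (exact syntax depends on the Mathlib version, and this is a textbook identity, far simpler than anything on MO):

```lean
import Mathlib

-- The sum of the first n odd numbers is n^2: a machine-checked statement and proof.
theorem sum_first_odds (n : ℕ) :
    ∑ k ∈ Finset.range n, (2 * k + 1) = n ^ 2 := by
  induction n with
  | zero => simp
  | succ n ih =>
    rw [Finset.sum_range_succ, ih]
    ring
```

The point is that the proof checker, not a human reader, certifies the argument, which is exactly the property that would make AI-generated answers trustworthy without hours of manual verification.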
@tao It sure is a lot of fun though. Whatever it can't do, it makes up for in entertainment value. I would even cancel my Netflix if my wife would let me. lol
@tao I wonder whether such AI answers would be a good thing. I can see AI in the near future being used to verify proofs written in a human way rather than in Lean, for example (although my experience with Lean is limited). That being said, I don't really know what to think of the use of AI for MO answers. In my personal experience, MO and IRC (not as good as MO) are better for developing one's mathematical abilities. AIs are usually susceptible to agreeing with whatever one writes, and given that so-called prompt engineering really does matter, this can also mislead those who don't know how to write the right prompt to get a correct answer.
It's a good thing that the OP did this unscientific experiment; at least the general public can become more aware of the current state of AI answers.