Gemini Pro (the medium-sized version of the model) is now available to interact with via Bard.

Here’s a fun and impressive demo video showing off Gemini’s multi-modal capabilities:

[Edit, December 8, 2023 at 5:54am EST: This demo video is potentially misleading.]

[Edit #2, October 16, 2025 at 2:10pm EST: Google eventually unlisted the video, which means it's still playable if you already have the link.]

How Gemini compares to GPT-4, according to Google DeepMind:

Comments (7)

warning - mildly spicy take

In the wake of the release, I was a bit perplexed by how much of Tech Twitter (answered my own question there) really thought this was a major advance.

But in actuality a lot of the demo was, shall we say, not consistently candid about Gemini's capabilities (see here for discussion and here for the original).

At the moment, all Google have released is a model inferior to GPT-4 (though the multi-modality does look cool), along with an I.O.U. for a totally-superior-model-trust-me-bro to come out some time next year.

Previously some AI risk people confidently thought that Gemini would be substantially superior to GPT-4. As of this year, it's clearly not. Some EAs were not sceptical enough of a for-profit company hosting a product announcement dressed up as a technical demo and report.

There have been a couple of other cases of this overhype recently, notably 'AGI has been achieved internally' and 'What did Ilya see?!!?!?', where people jumped to assuming a massive jump in capability on the back of very little evidence when in actuality there hadn't been one. That should set off warning flags about 'epistemics' tbh.

On the 'Benchmarks' - I think most benchmarks that large LLMs use, while they contain some signal, are mostly noisy due to the significant issue of data contamination (papers like The Reversal Curse indicate this imo), and that since LLMs don't think as humans do we shouldn't be testing them in similar ways. Here are two recent papers - one from Melanie Mitchell about LLMs failing to abstract and generalise, and another by Jones & Bergen[1] from UC San Diego actually empirically performing the Turing Test with LLMs (the results will shock you).
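
(For concreteness, here is a minimal sketch of the sort of n-gram overlap check that technical reports gesture at when they discuss screening for contamination. The window size, threshold, and toy corpus are purely illustrative assumptions on my part, not any lab's actual deduplication pipeline, which would run over tokenised corpora at vastly larger scale.)

```python
# Minimal sketch of an n-gram overlap contamination check.
# The 8-gram window and toy corpus are illustrative assumptions only.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def looks_contaminated(eval_item: str, training_docs: Iterable[str], n: int = 8) -> bool:
    """Flag an eval item if any of its n-grams appears verbatim in a training document."""
    item_grams = ngrams(eval_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)


# Toy usage: an SAT-style item that also got scraped into the training corpus.
train_docs = ["practice thread: if 3x + 5 = 20 then what is the value of x ? answer: x = 5"]
question = "if 3x + 5 = 20 then what is the value of x ?"
print(looks_contaminated(question, train_docs))  # True -> this item's benchmark score is suspect
```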

I think this announcement should make people think near term AGI, and thus AIXR, is less likely. To me this is what a relatively continuous takeoff world looks like, if there's a take off at all. If Google had announced and proved a massive leap forward, then people would have shrunk their timelines even further. So why, given this was a PR-fueled disappointment, should we not update in the opposite direction?

Finally, to get on my favourite soapbox, dunking on the Metaculus 'Weakly General AGI' forecast:

  • 13% of the community prediction is already in the past (x < Dec 2023). Lol, lmao.
  • Also, judging by the Cumulative Probability (the arithmetic is sketched just after this list):
    • ~20% likely in 2024 (really??!?!?!? if only this was real that'd be free money for sceptics)
    • ~16% likely in 2025
    • Median Date March 2026
    • The operationalisation of points 1 and 3 to my mind makes this nearly ~0-1% in that time frame
      • Number 1 is an adversarial Turing Test. LLMs, especially with RLHF, are like the worst possible systems at this. I'm not even kidding - in the paper I linked above, sometimes ELIZA does better
      • Number 3 requires SAT tests (or, I guess, tests with overlapping questions and answers) not be in the training data. The current paradigm relies on scooping up everything, and I don't know how much fidelity the model makers have in filtering data out. Also, it's unlikely they'd ever show you the data they trained on, as these models are proprietary. So there's no way of knowing if a model can meet point 3!
      • 1 & 3 make me think a lot of AGI forecasts are from vibes rather than from looking at the question operationalisations and the technical performance of models
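
(To spell out the cumulative-probability arithmetic from the bullets above: the per-year figures are just differences of the community CDF at year boundaries. The CDF readings in the sketch are rough values as quoted above, so treat them as illustrative rather than exact.)

```python
# Rough arithmetic behind the per-year figures above: difference the community
# CDF at year boundaries. CDF readings are approximate values as quoted in the
# bullets (illustrative, not exact numbers pulled from Metaculus).
cdf = {
    "Dec 2023": 0.13,  # ~13% of the distribution already in the past
    "Dec 2024": 0.33,
    "Dec 2025": 0.49,
}

mass_2024 = cdf["Dec 2024"] - cdf["Dec 2023"]  # probability it resolves during 2024
mass_2025 = cdf["Dec 2025"] - cdf["Dec 2024"]  # probability it resolves during 2025

print(f"2024: ~{mass_2024:.0%}, 2025: ~{mass_2025:.0%}")  # 2024: ~20%, 2025: ~16%
```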

tl;dr - Gemini release is disappointing. Below many people's expectations of its performance. Should downgrade future expectations. Near term AGI takeoff v unlikely. Update downwards on AI risk (YMMV).

  1. ^

    I originally thought this was a paper by Mitchell; that was a quick system-1 take that was incorrect, and I apologise to Jones and Bergen.

I think this announcement should make people think near term AGI, and thus AIXR, is less likely. To me this is what a relatively continuous takeoff world looks like, if there's a take off at all. If Google had announced and proved a massive leap forward, then people would have shrunk their timelines even further. So why, given this was a PR-fueled disappointment, should we not update in the opposite direction?

[...]
Gemini release is disappointing. Below many people's expectations of its performance. Should downgrade future expectations. Near term AGI takeoff v unlikely. Update downwards on AI risk (YMMV).

I think the update here should be pretty small. I'm unsure if you disagree. I would also think the update should be pretty small if gemini is notably better than GPT4, but not wildly better. It seems plausible to me that people would (incorrectly) have a large update toward shorter timelines if gemini was merely substantially better than GPT4, but we don't have to make the same mistake in the other direction.

It's worth noting there is some asymmetry in the likely updates with a high probability of a mild negative update on near term AI and a low probability of a large positive update toward powerful near term AI. E.g., even if google were to explode and never release a better LLM than gemini, this would be a relatively smaller update than if they were to release transformatively powerful AI.
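
(A toy illustration of that asymmetry, with numbers I've made up purely for the example: by conservation of expected evidence, a low-probability branch that would move you a lot forces the high-probability branch to move you only a little.)

```python
# Toy numbers (my own assumptions, purely illustrative) for the update asymmetry.
# Conservation of expected evidence:
#   prior = P(big leap) * posterior_if_big + P(mild result) * posterior_if_mild
prior_near_term_agi = 0.10  # hypothetical prior on near-term AGI
p_transformative = 0.05     # chance the release would have been a transformative leap
posterior_if_big = 0.40     # where such a leap would have pushed the probability

posterior_if_mild = (prior_near_term_agi - p_transformative * posterior_if_big) / (1 - p_transformative)
print(round(posterior_if_mild, 3))  # ~0.084 -> the disappointing-release branch is only a small update
```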

Hey Ryan, thanks for your engagement :) I'm going to respond to your replies in one go if that's ok

#1:

It's worth noting there is some asymmetry in the likely updates with a high probability of a mild negative update on near term AI and a low probability of a large positive update toward powerful near term AI.

This is a good point. I think my argument would point to larger updates for people who put substantial probability on near-term AGI in 2024 (or even 2023)! Where do they shift that probability in their forecast? I think just dropping it uniformly over their current probability would be suspect to me. So maybe it wouldn't be a large update for somebody already unsure what to expect from AI development, but I think it should probably be a large update for the ~20% expecting 'weak AGI' in 2024 (more in response #3).

#2:

Further, manifold doesn't seem that wrong here on GPT4 vs gemini? See for instance, this market: 

Yeah, I suppose ~80% -> ~60% is a decent update, thanks for showing me the link! My issue here would be that the resolution criterion really seems to be CoT on GSM8K, which is almost orthogonal to 'better' imho, especially given issues accounting for dataset contamination - though I suppose the market is technically about wider perception rather than technical accuracy. I think I was basing a lot of my take on the response on Tech Twitter, which is obviously unrepresentative and prone to hype. But there were a lot of people I generally regard as smart and switched-on who really over-reacted in my opinion. Perhaps the median community/AI-Safety researcher response was more measured.

#3:

As in, the operationalization seems like a very poor definition for "weakly general AGI" and the tasks being forecast don't seem very important or interesting.

I'm sympathetic to this, but Metaculus questions are generally meant to be resolved according to strict and unambiguous criteria afaik. So if someone thinks that weakly general AGI is near, but that it wouldn't do well at the criteria in the question, then they should have longer timelines than the current community response to that question imho. The fact that this isn't the case indicates to me that many people who made a forecast on this market aren't paying attention to the details of the resolution, or to how LLMs are trained and their strengths/limitations in practice. (Of course, if these predictors think that weak AGI will happen from a non-LLM paradigm then fine, but then I'd expect the forecasting community to react less to LLM releases.)

I think where I absolutely agree with you is that we need different criteria to actually track the capabilities and properties of general AI systems that we're concerned about! The current benchmarks available seem to have many flaws and don't really work to distinguish interesting capabilities in the trained-on-everything era of LLMs. I think funding, supporting, and popularising research into what 'good' benchmarks would be and creating a new test would be high impact work for the AI field - I'd love to see orgs look into this!


Can't we just use an SAT test created after the data cutoff?...You can see the technical report for more discussion on data contamination (though account for bias accordingly etc.)

For the Metaculus question? I'd be very upset if I had a longer-timeline prediction that failed because this resolution got changed - it says 'less than 10 SAT exams' in the training data in black and white! The fact that these systems need such masses of data to do well is a sign against their generality to me.

I don't doubt that the Gemini team is aware of issues of data contamination (they even say so at the end of page 7 in the technical report), but I've become very sceptical about the state of public science on Frontier AI this year. I'm very much in a 'trust, but verify' mode and the technical report is to me more of a fancy press-release that accompanied the marketing than an honest technical report. (which is not to doubt the integrity of the Gemini research and dev team, just to say that I think they're losing the internal tug-of-war with Google marketing & strategy)

#4:

This doesn't seem to be by Melanie Mitchell FYI. At least she isn't an author.

Ah, good spot. I think I saw Melanie share it on Twitter and assumed she was sharing some new research of hers (I pulled together the references fairly quickly). I still think the results stand, but I appreciate the correction and have amended my post.

<>    <>    <>    <>    <>

I want to thank you again for the interesting and insightful questions and prompts. They definitely made me think about how to express my position slightly more clearly (at least, I hope I make more sense to you after this response, even if we don't agree on everything) :)

Thanks for the response!

A few quick responses:

it says 'less than 10 SAT exams' in the training data in black and white

Good to know! That certainly changes my view of whether or not this will happen soon, but also makes me think the resolution criteria are poor.

I think funding, supporting, and popularising research into what 'good' benchmarks would be and creating a new test would be high impact work for the AI field - I'd love to see orgs look into this!

You might be interested in the recent OpenPhil RFP on benchmarks and forecasting.

Perhaps the median community/AI-Safety researcher response was more measured.

People around me seemed to have a reasonably measured response.

I think we'll probably get a pretty big update about the power of LLM scaling in the next 1-2 years with the release of GPT5. Like, in the same way that each of GPT3 and GPT4 were quite informative even for the relatively savvy. 

[Unimportant]

here are two recent papers from Melanie Mitchell, [...] and another actually empirically performing the Turing Test with LLMs (the results will shock you

This doesn't seem to be by Melanie Mitchell FYI. At least she isn't an author.

Previously some AI risk people confidently thought that Gemini would be substantially superior to GPT-4.

I think this slightly misrepresents the corresponding article and the state of the forecasts. The quote from the linked article is:

By all reports, and as one would expect, Google’s Gemini looks to be substantially superior to GPT-4. We now have more details on that, and also word that Google plans to deploy it in December, Manifold gives it 82% to happen this year and similar probability of being superior to GPT-4 on release.

This doesn't seem to exhibit that much confidence in "gemini being substantially superior"? I expect that if Zvi gave specific probabilities, they would be pretty reasonable.

ETA: I retract my claim about Zvi; on further examination, he seems pretty wrong here. That said, manifold doesn't seem to have done too badly.

Further, manifold doesn't seem that wrong here on GPT4 vs gemini? See for instance, this market: 

The forecast has updated from 80% to about 60%, which doesn't seem like much of an update.

I agree that we should update down on google competence and near term AGI, but it just doesn't seem like that big of an update yet?

Finally, to get on my favourite soapbox, dunking on the Metaculus 'Weakly General AGI' forecast:

I think the forecast seems broadly reasonable, but the question and title seem quite poor. As in, the operationalization seems like a very poor definition for "weakly general AGI" and the tasks being forecast don't seem very important or interesting.

I think GPT-4V likely already achieves 2 (winograd) and 3 (SAT) while 4 (montezuma's revenge) seems plausible for GPT-4V, though unclear. Beyond this, 1 (turing test) seems to be extremely dependent on the extent to which the judge is competently adversarial and whether or not anyone actually finetunes a powerful model to perform well on this task. This makes me think that this could plausibly resolve without any more powerful models, but might not happen because no one bothers running a turing test seriously.

  • Number 3 requires SAT tests (or, I guess, tests with overlapping questions and answers) not be in the training data. The current paradigm relies on scooping up everything, and I don't know how much fidelity the model makers have in filtering data out. Also, it's unlikely they'd ever show you the data they trained on, as these models are proprietary. So there's no way of knowing if a model can meet point 3!

Can't we just use an SAT test created after the data cutoff? Also, my guess is that the SAT results discussed in the GPT-4 blog post (which are >75th percentile) aren't particularly data contaminated (aside from the fact that different SAT exams are quite similar which is the same for human students). You can see the technical report for more discussion on data contamination (though account for bias accordingly etc.)
