Regarding AI & actual comprehension: Physics IQ Benchmark

As the topics of LLMs and “AI” are very prominent around here, I’d like to point to this current study:

“Physics IQ” is a benchmark study by a few researchers from an institute in Sofia, Bulgaria. They formulate a spectrum for analyzing how much video-generating AI systems actually comprehend physics. To do this, they compare several such systems and point out the difference between being able to “predict” and understand physics by watching videos (as humans do), and merely producing something statistically that looks good.
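Out of curiosity, here is a minimal sketch of what an evaluation loop in this spirit could look like - entirely my own illustration, not the authors' code; `model.generate_continuation` and the clip layout are assumed names:

```python
# Illustrative only: show the model the start of a real clip, let it
# generate the continuation, and score that against what really happened.
import numpy as np

def evaluate_model(model, clips, context_frames=24):
    """Average continuation error over a set of real clips."""
    scores = []
    for clip in clips:                    # clip: (T, H, W, C) uint8 array
        context = clip[:context_frames]   # frames the model gets to see
        real = clip[context_frames:]      # ground-truth continuation
        # Hypothetical API: any video model with a conditioning interface.
        generated = model.generate_continuation(context, num_frames=len(real))
        # Simplest possible score: per-pixel mean squared error.
        scores.append(np.mean((generated.astype(float) - real.astype(float)) ** 2))
    return float(np.mean(scores))
```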

The paper itself I found easy to follow and well presented. Of course, it only covers video generation, but the broader point applies to every LLM/AI system out there right now: still just a statistical model without actual comprehension.

1 Like

So it was really interesting at first. I listened to the podcast and thought, man, these people are really funny and engaging, almost as if the discussion were scripted. It turned out to be an AI-generated podcast. The voices were reminiscent of a tech podcast I used to listen to some 20 years ago. But there were a few things missing.

  1. There was no advertising, and I don’t mean that they cut the commercials out. I mean they never mentioned their names or the name of the podcast. Especially at the very end, this was sus: if you like a podcast, you always want listeners to know how to find more.
  2. It was completely scripted, with none of the human foibles you typically hear in a podcast. Podcasters typically have little quirks, clearly mistakes, that get left in on purpose because they help us connect to the people.
  3. They never mentioned the qualifications of the male voice, who sounded like a deep-voiced podcaster, was introduced at the end as “a guest”, and yet also thanked people for being there.
  4. It had that LLM “Oh yes, I agree with everything you say!” vibe, especially at the end when they got off-topic and were talking more generally about physics.
  5. The expertise and understanding of the female voice changed over the duration of the podcast - not as if they were learning, just all of a sudden they knew more or less. (Again, it felt very scripted - specifically by an AI.)

Having said all that, I am very suspicious of this entire thing. I do not think it’s real.

  1. There is a white paper abstract with no attached white paper. A fancy one-page website does not a white paper make.
  2. The “authors” are hyperlinked, and if you look at them, the GitHub pages are SUUUUPER fancy. Almost no one does that with their GitHub profile unless they’re selling something.
  3. The scholar.google.com pages are for unverified people. If they worked for Google, they would have better profiles.
  4. The only person who seems real is the last author, especially since the repo is his.
  5. There are no references for the “podcast”. (Again indicating it was AI-generated.)
  6. The “authors” do not match the contributors in the GitHub repo.

My conclusion is that this tool might be real. I’ve never heard of ELLIS, so I do not know if it is real. But I believe this is one guy making the project look more “official” than it is in order to get traction on it. That makes me suspicious about the validity of the work, and it makes me wonder whether an agenda is being pushed.

4 Likes

Those podcast voices are 100% AI-generated, lol, and going through that GitHub page, it felt way too sterile for a collaboration project.

2 Likes

Thank you for your chain of thought - I think it is very valuable to be critical nowadays. I share your critique of the whole podcast thing, maybe even of their presentation of the project. I did not listen to the podcast fully (maybe the first minute). I, too, was put off by the voices. I cannot say whether they are generated or just heavily processed (which is also not unusual for amateur recordings nowadays).

Why they call it a podcast is beyond me: it is not a series or anything, just one 15-minute piece.

All in all, I still appreciate their effort to make their work more accessible - even if it involves the use of generative tools. At least they try.

Still, it is real and also valuable. Research is not only about finding new things but also about establishing standards that help future research analyze and communicate. This paper does this by introducing metrics for measuring video generation and an evaluation process for them - while also summarizing other existing methods.
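To make the metric idea concrete, here is a rough sketch of one measure in that family: checking whether motion occurs in the same image regions in the generated clip as in the real one. The thresholding and mask construction here are my own assumptions for illustration, not necessarily the paper's exact procedure:

```python
# Sketch of a spatial-IoU-style score: binarize "where motion happened"
# in real vs. generated footage, then compare the two masks.
import numpy as np

def motion_mask(frames, thresh=10.0):
    """Binary (H, W) mask of pixels that changed at any point in the clip."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))  # (T-1, H, W, C)
    return diffs.max(axis=(0, 3)) > thresh

def spatial_iou(real, generated, thresh=10.0):
    """Intersection-over-union of the two motion masks: 1.0 means motion
    happened in exactly the same places, 0.0 in entirely different ones."""
    a = motion_mask(real, thresh)
    b = motion_mask(generated, thresh)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0
```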

You can find the whole paper here: https://arxiv.org/pdf/2501.09038

There is a link right beneath the title :wink:

Again: I think you are right to be critical towards... well, anything :smiley:

But the paper still seems fine.

2 Likes

As stated above, I had not listened to the podcast. I have now skimmed it, and it’s horrible. The paper is not, though.

1 Like

Thanks, totally missed that link.

I will concede the paper seems real. I looked into the site, and it appears to be a Cornell-run repository for non-peer-reviewed scholarly articles. So it’s a very well-researched forum post, IMO.

I am still suspicious of how it was advertised. I honestly wonder whether the page itself was some sort of A/B test of how people will take real content published using a bunch of AI tools.

2 Likes