Near the end of the AI puzzle competition between various models, I also asked the following:
Q: Which mouse walks on 2 legs?
A: Mickey Mouse
Q: Which duck walks on 2 legs?
A:
Think about the answer and then see how the models responded at https://mihai.page/ai-2025-9/
5.5.2025 05:48
It's finally here. I analyze QwQ and DeepSeek on the 3 math puzzles and finish the round of benchmarks I ran in January. It was interesting to see how all these models behave on easy, moderate, and hard math puzzles.
Read the last article at https://mihai.page/ai-2025-8/
14.4.2025 06:03
Last week we launched v1.0 of the model_signing library (and CLI). A blog post with more technical details and links to a demo notebook can be found on the Sigstore blog: https://blog.sigstore.dev/model-transparency-v1.0/
12.4.2025 15:54
Yesterday we launched v1.0 of the model signing library, taming the wild west of model formats and deserialization vulnerabilities. You can read more about why this is needed and why we picked Sigstore as the main signing method at https://security.googleblog.com/2025/04/taming-wild-west-of-ml-practical-model.html
5.4.2025 23:00
Although this week I found out that Gemini 2.5 Pro solves 2 out of the 3 problems correctly and nearly gets there for the hardest one, I still continue to analyze the answers from the models that were tested back in January. Today, I look at 4 llama models (via Perplexity) and there are 2 more models left for the next article.
29.3.2025 23:29
Since Gemini 2.5 Pro was released today I asked it (once, no prompt engineering) to solve the problems from my currently running series of tests.
P1: solved after 59s of thinking
P2: partial solution after 67s of thinking
P3: solved after 153s of thinking
For P2: the model identified some patterns, but not the exact one that would give the answer.
For P3: the model identified the sequences and intersected them, and it also computed the sums correctly, using modular arithmetic.
Overall, this model is really strong! The best I've seen so far in the competition I run on my blog. Congratulations to all who worked on it!
See https://mihai.page/ai-2025-1/ for the problems, and other articles on my blog for the performance of the other tested models.
26.3.2025 03:01
After testing OpenAI, Gemini, and Claude models, it is time to look at how 5 different models from the Mistral family answer the 3 math puzzles I proposed nearly 2 months ago. Almost done with reporting all these experiments.
9.3.2025 05:58
After testing OpenAI and Gemini models on the 3 puzzle problems proposed in January on my blog, it is time to look at how Claude models answer them. I tested only versions 3 and 3.5, since I ran the scripts back in January, but even so the models performed quite well.
Read more on my blog: https://mihai.page/ai-2025-5/
4.3.2025 01:51
It's finally here. I managed to read 10 million characters from Gemini models trying to answer my puzzles and published all the scores and some outputs to my blog.
In fact, I even learned something from this. Gemini taught me 2 different math theorems.
24.2.2025 02:40
Over the past 9 days I read over 173k lines of output produced by OpenAI models to score them on the 3 problems I proposed on my blog. Read the entire summary, with scores, model outputs, and easter eggs at https://mihai.page/ai-2025-3/
27.1.2025 15:12
Second part of the 2025 AI puzzle competition on my blog introduces the 3 problems that I will ask various LLMs to solve: https://mihai.page/ai-2025-1/
16.1.2025 15:35
It's time to start blogging again. Like the last time I restarted blogging, I'll play with AIs and look at how LLMs solve puzzles. Unlike the last article (1.5 years ago!) I won't post everything in a humongous article. So, today is just the first part: setting up the context, the rules, and the scaffolding: https://mihai.page/ai-2025-0/
15.1.2025 15:36
Really looking forward to the AI and security work and to the new blog posts. I have some in draft already.
Also, two nice curiosities to end the post: there are exactly 10 abelian groups of order 2025, and if you add all the products in the multiplication table you get 2025 (related to the initial fact of this post).
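Both curiosities can be checked with a few lines of Python. This is only a sketch under two assumptions: "the multiplication table" is taken to be the usual 1 through 9 table, and the helper names (prime_exponents, partitions, abelian_groups) are made up for illustration. The group count relies on the standard fact that the number of abelian groups of order n is the product of the partition numbers of the exponents in the prime factorization of n; here 2025 = 3^4 * 5^2, so the count is p(4) * p(2) = 5 * 2.

```python
from math import isqrt


def prime_exponents(n):
    """Return {prime: exponent} for n, using trial division."""
    exps = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            exps[p] = exps.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        exps[n] = exps.get(n, 0) + 1
    return exps


def partitions(k):
    """Count the integer partitions of k with a small dynamic program."""
    ways = [1] + [0] * k
    for part in range(1, k + 1):
        for total in range(part, k + 1):
            ways[total] += ways[total - part]
    return ways[k]


def abelian_groups(n):
    """Number of abelian groups of order n, up to isomorphism."""
    count = 1
    for exponent in prime_exponents(n).values():
        count *= partitions(exponent)
    return count


# Assumption: the multiplication table covers 1..9 on both axes.
table_sum = sum(a * b for a in range(1, 10) for b in range(1, 10))

print(prime_exponents(2025))   # factorization of 2025
print(abelian_groups(2025))    # number of abelian groups of order 2025
print(table_sum)               # sum of all entries in the 9x9 table
print(isqrt(2025) ** 2 == 2025)  # is 2025 a perfect square?
```

The table sum factors as (1 + 2 + ... + 9) * (1 + 2 + ... + 9) = 45 * 45, which is why it ties back to 2025 being a perfect square.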
1.1.2025 16:02
This year I felt a lot like IPv4. Exhausted for a long time, yet more and more was wanted. I somehow got into a state where I had to work for two separate teams, on slightly related projects, and the other team both doesn't care about OSS and insists that their project be finished yesterday. I need to fix this by April at the latest.
1.1.2025 16:02
Happy new year. Since 2025 is a perfect square (45 * 45), let 2025 be a perfect year for you too!
I know I want to manage my time better this year, do more OSS contributions (2853 on GitHub in the past year), read more books (barely 22), listen to more podcasts (2213 hours), attend more OSS conferences, and do more AI and security work. And more blog posts; sadly, I wrote exactly 0 in the last year.
1.1.2025 16:02
Model storage under attack (https://techcrunch.com/2024/05/31/hugging-face-says-it-detected-unauthorized-access-to-its-ai-model-hosting-platform/). Models are uninspectable, so the only solution to prevent tampering is to sign them.
OpenSSF has a model signing SIG as part of the AI/ML WG. Both biweekly meetings are in the OpenSSF calendar.
Also, https://github.com/sigstore/model-transparency
2.6.2024 13:14
Given that the field of AI is evolving rapidly, it is also repeating the same mistakes of traditional software (with regard to security at least), but at an accelerated pace. To secure applications that use AI, we have to think holistically, starting from the data that is used to train the model and protecting the entire chain all the way to the application built around the model.
https://research.google/pubs/securing-the-ai-software-supply-chain/
2.5.2024 13:11
My talk at PackagingCon 2023 (Berlin) got published to YouTube: https://www.youtube.com/watch?v=oIAJLV9l01Q&list=PLl386dCR5QGTElF3MbltCJupNG1lHK4Nr&index=20
I recommend watching the entire playlist, and let me know if you're more interested in the ML supply chain. Looking forward to seeing you at the AI/ML working group under OpenSSF and/or on the repository itself.
30.11.2023 15:26
How do you react to vulnerabilities? Do you wait until a scanner picks them up (O(weeks))? Do you grep SBOMs (assuming you follow the EO)?
Why not use GUAC and get answers fast? You get both the blast radius and an update plan, so once the patch drops you can apply it.
https://www.kusari.dev/blog/terror-of-curl
16.10.2023 23:22
The new CISA #JCDC guidelines on "Improving Security of Open Source Software in Operational Technology and Industrial Control Systems" came together with input from Google's Open Source Security Team (GOSST)
https://cisa.gov/sites/default/files/2023-10/Fact_Sheet_Improving_OSS_in_OT_ICS_508c.pdf
16.10.2023 23:01