The "learning through making" approach is really good. When I've explained LLMs to non-technical people, the breakthrough moment is usually when they see temperature in action. High temperature = creative but chaotic, low temperature = predictable but boring. You can't just describe that.
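The "see it in action" point is easy to make concrete: temperature just rescales logits before the softmax. A minimal sketch with made-up logits (not any real model's output):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharp: top token dominates
high = softmax_with_temperature(logits, 2.0)  # flat: more randomness

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At temperature 0.2 the top token gets essentially all the probability mass (predictable but boring); at 2.0 the distribution flattens out and sampling picks the weaker candidates far more often (creative but chaotic).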
What I'd add: the lesson about hallucinations should come early, not just in the RAG module. Kids (and adults) need to internalize "confident-sounding doesn't mean correct" before they get too comfortable. The gap between fluency and accuracy is the thing that trips everyone up.
Growing up in Indonesia, we had "Kuku kaki kakekku kayak kuku kaki kakakku" (my grandfather's toenails look like my older sibling's toenails). The repetitive k-sounds are brutal.
What's interesting is how tongue twisters reveal what's phonetically tricky in each language. English struggles with s/sh transitions ("she sells seashells"). Indonesian targets the k-cluster combinations.
Curious if there's research on whether practicing tongue twisters in a second language actually helps with accent reduction, or if it's just party tricks.
> "So DuckDB was developed to allow queries for bigish data finally without the need for a cluster to simplify data analysis... and we now put it to a cluster?"
This is a fair point, but I think there's a middle ground. DuckDB handles surprisingly large datasets on a single machine, but "surprisingly large" still has limits. If you're querying 10TB of parquet files across S3, even DuckDB needs help.
The question is whether Ray is the right distributed layer for this. Curious what the alternative would be—Spark feels like overkill, but rolling your own coordination is painful.
The Unicode approach seems backwards in hindsight, but I wonder if it was the only practical path forward at the time. Getting Apple, Google, Samsung, and Microsoft to agree on exact pixel-level designs would've been a nightmare. Code points at least let everyone participate without vendor lock-in.
What's interesting is how the market solved it anyway—everyone just converged on Apple's designs because that's what users expected. Not through spec, but through sheer gravity.
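The "everyone participates" design is visible from the standard library: an emoji is just a code point (or a sequence of them), and each vendor supplies its own glyph. A quick stdlib check:

```python
import unicodedata

# An emoji character is one (or more) Unicode code points;
# the rendering is entirely up to the vendor's font.
grin = "\U0001F600"
print(hex(ord(grin)))          # 0x1f600
print(unicodedata.name(grin))  # GRINNING FACE

# Many "single" emoji are really ZWJ sequences of several code
# points, which vendors again render however they like.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466"  # man ZWJ woman ZWJ boy
print(len(family))  # 5 code points, usually shown as one glyph
```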
The market hasn't solved it, though: there are plenty of emoji where the Apple, Google, Samsung, Microsoft, and Twemoji designs diverge enough to express different sentiments.
The trajectory export feature is smart: evaluation and training data collection in the same tool.
I'm curious how the benchmarks handle non-determinism. Real GUIs have loading states, animations, popups that appear sometimes but not always. Does cuabench control for that, or is variance just part of the measurement?
Also interested in what "Windows Arena" tests specifically. Windows has so many edge cases - UAC prompts, driver install dialogs, random update notifications. Those feel like the hard mode for computer-use agents.
Thanks - trajectory export was key for us since most teams want both eval and training data.
On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.
Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.
The shift from "click and hope" to explicit post-conditions is the right framing.
We've been building agent-based automation and the reliability problem is brutal. An agent can be 95% accurate on each step, but chain ten steps together and you're at 60% success rate. That's not usable.
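The arithmetic behind that drop-off is worth making explicit: per-step successes compound multiplicatively, so even small per-step error rates wreck long chains.

```python
# If each step succeeds independently with probability p,
# a chain of n steps succeeds with probability p ** n.
for p in (0.99, 0.95, 0.90):
    for n in (5, 10, 20):
        print(f"p={p}, n={n}: {p ** n:.1%}")

# 0.95 ** 10 is roughly 59.9% -- the ~60% figure above.
```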
Curious about the failure modes though. What happens when the verification itself is wrong? Like, the cart shows updated on screen but the verification layer checks a stale element?
Absolutely agree on the compounding error point - that’s exactly what pushed us toward verification.
On “verification wrong”: we try hard to keep predicates grounded and re-evaluated, not “check a cached handle”. Assertions do re-snapshot / re-query during each retry, and we scope them to signals that should change (URL, existence/state of an element, text/value).
If the page is flaky or stale, the assertion just won't prove the condition within the retry window, and we fail with artifacts such as captured video frames (if ffmpeg is available) rather than claiming success.
There are still edge cases (virtualized DOM, optimistic UI, async updates), but in those cases the goal is the same: make the failure explicit and debuggable with artifacts and time-travel traces, not silently drift.
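The "re-query on every retry" idea can be sketched in a few lines. This is just an illustrative polling loop, not the project's actual code, and `page.query` / `save_artifacts` in the usage comment are hypothetical names:

```python
import time

def wait_for(predicate, timeout=5.0, interval=0.25):
    """Re-evaluate `predicate` from scratch on each retry instead of
    checking a handle captured before the action ran. Returns True only
    if the condition is proven within the window; otherwise False, so
    the caller can fail loudly and attach artifacts."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():  # fresh query every time, so no stale reads
            return True
        time.sleep(interval)
    return False

# Hypothetical usage: page.query re-reads the live DOM on each call.
# ok = wait_for(lambda: page.query("#cart-count").text == "2")
# if not ok:
#     save_artifacts()  # screenshots / trace for debugging
```

The key design choice is that the lambda closes over the query, not its result: stale elements can't satisfy the assertion because nothing is cached between retries.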
I love projects like this. Taking something trivially simple and asking "but what if we really optimized it?"
The material science discussion in these comments is fascinating. Never thought about how the contact point geometry matters so much. Diamond tip makes intuitive sense for hardness, but then you need something it can spin on without scratching...
The Chrome Extensions support is the interesting part here. That's often the dealbreaker for using mobile devices as computer replacements.
Google's had this weird situation where Android and ChromeOS overlap more every year. At some point maintaining two operating systems with converging feature sets seems wasteful.
My guess: ChromeOS probably survives for the education market where manageability matters more than capabilities. But for consumers? Android on a big screen with keyboard and mouse might just be good enough.
I'm running AdGuard in Chromium right now and I don't see any ads, even on YouTube. May I ask what you meant?
Not that I don't think MV3 is limited, but... we're comparing this against MV2, right? It was already missing basic functionality like full filtering of HTTP responses; I remember a bug about not being able to see POST bodies that stayed open for 10+ years.
Nice work on getting this running in the browser. The fact that it works at all with WebGL/camera APIs is impressive. I always expect browser-based video stuff to be janky, but the demos look smooth.
Thank you! I only had macOS to test with. After all the effort going back and forth making sure fixes for one browser didn't break another, it was a rude awakening that the same APIs in the same browser worked differently on different OSs.
Nice writeup. The Exists subquery approach is definitely the cleanest.
One thing worth mentioning: if you're hitting this problem frequently, it might be worth reconsidering the query patterns themselves. We had a similar issue at work where we kept adding `.distinct()` everywhere, and eventually realized we were doing the filtering wrong upstream.
The PostgreSQL-specific `distinct(*fields)` with the ORDER BY restriction is one of those things that trips people up. The error message isn't great either. "SELECT DISTINCT ON expressions must match initial ORDER BY expressions" is technically correct but doesn't explain why or what to do about it.
Good call recommending Exists as the default approach. It's more explicit about intent too.
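For anyone who wants the underlying SQL intuition, here's a minimal sqlite3 sketch (toy schema, not the article's models): a join against a to-many table multiplies rows, which DISTINCT then papers over, while EXISTS asks a yes/no question per row and never produces duplicates in the first place.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO book VALUES (1, 1, 'A'), (2, 1, 'B');  -- Ann has two books
""")

# JOIN multiplies rows: Ann appears once per matching book.
joined = con.execute(
    "SELECT author.name FROM author JOIN book ON book.author_id = author.id"
).fetchall()
print(joined)  # [('Ann',), ('Ann',)] -> needs DISTINCT to dedupe

# EXISTS filters without ever multiplying rows.
exists = con.execute(
    "SELECT name FROM author WHERE EXISTS "
    "(SELECT 1 FROM book WHERE book.author_id = author.id)"
).fetchall()
print(exists)  # [('Ann',)]
```

In Django terms, the second query corresponds to filtering on an `Exists(...)` subquery with `OuterRef` instead of joining through the relation and calling `.distinct()`.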