Hacker Newsnew | past | comments | ask | show | jobs | submit | rchaves's commentslogin

Hello HN!

tl;dr: We built Scenario, an open-source testing library for AI agents. It simulates real conversations with your agent, its code-driven, and lets you assert anything mid-dialogue. Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/

I'm Rogerio, founder of LangWatch, I've been helping many customers building LLM applications in this past two years and worked with Alex on this.

Most of the efforts for LLM quality so far were about evaluations, single-turn, there was nothing actually good to test agents, it all felt forced, but we believe we cracked it now, we have built an agent testing library that test your agent by simulating a user and playing a conversation back and forth with it.

One of the key challenges there was that we had to make it compatible with all the 273+ AI frameworks (and counting) there are. Luckliy AG-UI protocol popped up recently, standardizing agents frameworks and UI interactions, this is perfect, because at the end of the day, we want our user simulator to "see" just the same that the user sees.

So we made Scenario in a way that is really easy to connect to any agent no matter the tech stack, from a simple string <-> string connection, to openai standard messages format, to AG-UI.

The other key challenge was to balance testing the open-endedness of agents vs having reliable cases you want to test, so we worked a lot on thinking through the autopilot simulation vs the fully scripted one, and here again, the goal was complete interoperability. At the end of the day, the design we achieved was simply having lambdas, that you can call at any point of the test, so it's just code, where you can connect any other evaluation or assertion tool you want, we are not restrictive.

Check out the repo and the docs, we would love to get some feedback in here!

Repo: https://github.com/langwatch/scenario Docs: https://scenario.langwatch.ai/


well I think hype is not bad per se, I'd do it even if not trying to make a buck, it's okay (up to a point) to hype up something so that eventually it finds a problem where it fits well, but yeah, I'm still waiting on this one


same here, but I would even avoid "strong arguments" because that's what we all have been doing so far

what I want is real use cases, show me real-world production examples from established companies where multi-agent collaboration helped them better than a simple agent + tools and deterministic workflows


is this multi-agent collaboration though, or is it just a workflow? All examples you listed seem to have pretty deterministic control flows (write then validade, context exceeded, after each response, etc)

when I think of multi-agent collaboration I think of also the control flow and handover to be defined by the agents themselves, this is the thing I have yet to see examples of in production, and the premise that I also don't buy yet


You’re right that it’s a fuzzy line. That said, if you can make the contract/handoff between agents deterministic, you’ll always get better results by doing that, compared to letting the agents try to handle it through inference, since there will always be some error rate.

For this reason, I think that for at least the next couple years, even very advanced agent systems are likely to have a lot of deterministic control flow and glue in their guts. To me, that doesn’t make them “not multi-agent”. Rather, this is how you can build multi-agent systems that actually work in reality. But much of it comes down to semantics, admittedly.


Nah it's just a marketing problem, "GPT" and "ChatGPT" names is the biggest asset OpenAI has, people have expectations so high for GPT-5 that they cannot burn this name unless it's something truly majestic, bordering AGI at the very least. Until they are confident enough that people will be blown off by it, it's better to continue building up the hype


The Half Life 3 of the SaaS/zirp era.


I’ve seen people spending 10 minutes to test things by hand, would have taken them less to write and run a test, specially with AI now

When writing test actually makes it faster to code, THEN it’s worth it. You can even throw the tests away later, doesn’t matter


Not true. Kent Back’s 3X is a much better take, test and good practices for what is high risk and hard to change, move fast for most of it on the rest to try to find that black swan as soon as possible.

Yes, I do feel this time is different and that I am a top-notch coder (I feel more comfortable sounding like a jerk when on hackernews), 1 year later into my startup, codebase is big, but I’m actually coding faster than ever, as foundation code is more and more complete.

One of the huge reasons for it beyond the right architecture is type safety. Someone that was well seasoned in strongly-typed FP but is now pragmatic can move incredibly fast with enormous safety by just adding the most cost-benefit type strictness, and being flexible on where it doesn’t pay off.



Yes exactly, from the experience he had as Facebook massively scaling up, while all good practices were thrown down the window (except for foundation and critical parts) and extreme go horse php being written as fast as possible, until it was too big and a migration was needed

I had the exact same experience at Booking, but with terrible Perl instead of php, still, built massively successful business until one day the movement to better practices becomes inevitable

The opposite of what uncle bob says


Lol why is this comment written as if the article is about your startup?

> Yes, I do feel this time is different and that I am a top-notch coder (I feel more comfortable sounding like a jerk when on hackernews)

Cool then I guess I can be a jerk back You built a TS frontend on top of a much much more popular framework. You have no USP. So how exactly is "this time different" for you?

> One of the huge reasons for it beyond the right architecture is type safety

The delulu is very high - as if this matters in the least.


Can you please not post like this? We're trying for curious, respectful conversation here—not people putting each other down or ridiculing others.

https://news.ycombinator.com/newsguidelines.html


Erm, he wrote the article with “you” to invoke the feeling of the reader thinking about their own use case, which I did

Different because I ran without good practices before, got more and more messed up over time, grinded to a halt, and it was all terrible. Then worked for 10 years, doing TDD for most of it, as well as pairing and other good practices, everything was better, started condemning people that don’t do it similar to uncle bob. Kept moving forward, entered an environment where everything was “wrong” but still somehow worked very well. Now I’m mature enough that I can have both: skip “good practices” and still not break things

Finally, yes of course the types fucking matter


yeah I guess base models without built-it CoT are not going away, exactly because you might want to tune it yourself. If DSPy (or similar) evolves to allow the same or similar than OpenAI did with o1, that will be quite powerful, but we still need the big foundational models powering it all

on the other hand, if cementing techniques in the models becomes a trend, we might see various models around with each technique for us to pick and choose beyond CoT without need for us to guide the model ourselves, then what's left for us to optimize is the prompts on what we want, and the routing the combination of those in a nice pipeline

still the principle of DSPy stays the same, have a dataset to evaluate, let the machine trial an error prompts, hyperparameters and so on, just switch around different techniques (possibly automating that too), and get measurable, optimizable results


Nope, the trains are not often late, this is just in Germany


Is there a pattern in Germany engineering? Airport Berlin, Stuttgart 21, constantly late running trains? Is Germany over-engineered and over-regulated, what's the issue?


Trains being late is simply an overrated problem. This is not something that DB should optimize for. It would be easy to make sure every train is punctual by running one Munich-Berlin-Hamburg train a day.

Trains are late because they run a demanding schedule, trying to serve every big city (this is not like France where everything is supposed to go through Paris, this is a country that unified late with many powerful regions who all want direct connections between them).


Trains being late isn't that overrated. Its not the most important thing, but it is important. Because being late messes up the whole system. One train being late can cascade to lots of other trains being late. Causing system degridation.

There is not one single issue with the German network. Yes, the multi polarity you mentioned is one of them. But that's not the main issue, really more a feature.

One of the issues is that early on DB wanted to continue to make money with cargo, so they insisted on having cargo (and other trains) use the same lines. Rather then building new lines that nobody else uses. This choice over long term degraded to a situation where lots of high speed lines are just major upgrades of existing lines.

That leads to the network not having as much capacity as it could have had. And you have ICE trains blocking lots of the tracks for other trains, and the other way around. And even if the ICE can go, it can't go full speed because some parts aren't upgrade enough. You constantly go from a speed line, to an upgraded line, to a normal line, to a city approach.

While its great that in Germany the ICE goes to the main station, unlike in France where Paris doesn't even have a real main station. It also leads to ICE trains having to use an excessive amount of the old city network. And there are always issues there. Just last week our train was lolligaging around the approaches to Hamburg.

There is also just a general maintenance issue, personal and so on.

That all said, having traveled threw Germany a lot with the train just in the last couple months, overall, I think its amazing. But could be so much better still.


Schedule isn’t the root cause. DB has serious problems with staffing, because their workforce is going to retire soon in large numbers and finding replacement is hard. They have problems with infrastructure: it is telling that a heavy rain, moderate snowfall or burning grass can knock out a major route. Both are management problems.


Not really. It's just heavily used and so the faults are very obvious, and we Germans are extremely quick to complain. Objectively there's not much of a difference and for almost all European countries it's complaint at a high level[1]

Of course Stuttgart 21 is objectively a mess but Germany is big, there's more than just a handful of engineering projects at any given time. We brought a pretty substantial amount of LNG capacity online in just two years but that never gets any press compared to the latest airport chaos.

[1]https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....


It's just poorly governed at every level by geriatric politicians/managers and has an enormous older population keeping them there.


If you look closely it actually does give multiple instructions per screenshot! However it cannot get too far, because the screen changes under it. For example when it starts typing a tweet, the tweet box expands and the send button moves, so it tries to click it but it's not longer there, it needs to take another screenshot to see because it's kinda executing those steps "in the dark"

we could try to patch an "interpolation" kinda of thing for change, but also, I'm curious to see if the multi-modal models that are coming out supporting video would be able to actually just "watch the video" in real time, this would be the ultimate solution


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: