Benchmarking AI with Stanislav Khromov
---
00:00:18 – 00:00:20  
Welcome to Svelte Radio.
00:00:22 – 00:00:34  
Hello everyone! Welcome to another episode of Svelte Radio. Today we bring you yet another guest. But before we get to him, I'm going to say hello to my beautiful co-host, Anthony. Hello.
00:00:35 – 00:00:48  
Hi. Brittany isn’t here today, so we only have one extra person other than me, of course. But we have a very exciting guest today—Stanislav! Hello, welcome.
00:00:49 – 00:01:02  
Hello. Thanks. Nice to be here.
00:01:04 – 00:01:06  
You’re a Svelte ambassador?
00:01:07 – 00:01:12  
You’ve been heavily involved in the SvelteBench stuff—benchmarking Svelte and LLMs.
00:01:13 – 00:01:17  
But before we get into that, let’s hear from you.
00:01:18 – 00:01:22  
Who are you and how did you get into Svelte?
00:01:23 – 00:01:32  
We were talking about this just before the show and trying to determine when I actually started using Svelte and got to know it.
00:01:33 – 00:01:35  
It was around 2022.
00:01:36 – 00:01:38  
I work in an editorial product.
00:01:39 – 00:01:41  
We do newspaper stuff, visualizations, and articles.
00:01:42 – 00:01:50  
I heard about this amazing new framework by Rich Harris, from when he worked at the New York Times, I believe, and we started experimenting with it.
00:01:51 – 00:02:04  
It was pre‑SvelteKit 1.0; we used a 0.x version at work and also a lot of vanilla Svelte in our visualizations.
00:02:06 – 00:02:21  
The kind of visualizations you're talking about are election graphics and so on, the stuff you'd do on a newspaper website. Those were the things Rich originally made Svelte for: really performant and easy to write quickly.
00:02:22 – 00:02:36  
Now SvelteKit is much bigger than that; you can basically build anything with it.
00:02:37 – 00:02:47  
You got into Svelte around the SvelteKit beta. Was this before they switched from the old routing system to the new one?
00:02:48 – 00:03:07  
I barely remember the old system. Was it underscore-underscore page or something? We didn't have +page yet. I think routes were just the route name as a .svelte file, plus index. Or maybe .html, but that's Svelte 2.
00:03:07 – 00:03:12  
Right, that was Svelte 2. I mean, it was Sapper really, I suppose. But yeah.
00:03:12 – 00:03:17  
We used Sapper, so I know all about it, and it's not much fun. It's crazy to hear you're still using Sapper.
00:03:18 – 00:03:24  
Well, we’re migrating away.
00:03:25 – 00:03:30  
Wow. Puru is migrating away almost single‑handedly—brilliant.
00:03:31 – 00:03:35  
But the underscore was for certain files, like layouts or something?
00:03:36 – 00:03:40  
I can’t remember.
00:03:41 – 00:03:42  
Or was it canceling a layout?
00:03:43 – 00:03:44  
It’s been a while since I did that stuff. But yeah, it had a weird syntax.
00:03:45 – 00:03:47  
Yeah, but it was still great.
00:03:48 – 00:04:07  
The new system is better for sure. I just remember that Sapper was already on the way out when SvelteKit 0.x was in the making; it was basically copy-pasted into a new project that used Snowpack.
00:04:08 – 00:04:14  
It’s so cool to hear these OG people remembering all that stuff.
00:04:15 – 00:04:20  
I came in like unbothered, you know, fresh to the scene with SvelteKit 1.0.
00:04:21 – 00:04:55  
Yeah, but look at what you've done since; it's amazing to see how quickly you've skilled up. And I'd say you're right about being OG. We meet a lot of people who have been around for various amounts of time, but the thing that amazes me is when someone asks, "do you remember the old-school days of Svelte?" and they say yes, and then we end up talking about double braces and all that.
00:04:56 – 00:05:02  
I think I joined the Discord when it had just been moved over from whatever they were using before, so I don't remember what that was.
00:05:03 – 00:05:17  
There was something before—Gitter or something like that. Yes, Gitter is the one. They tried a couple of different things.
00:05:18 – 00:05:31  
How many people are on the Discord today? Does anyone know?
00:05:32 – 00:05:34  
I think it’s like 80,000.
00:05:35 – 00:05:38  
Yes, some crazy amount. And that's the number of active people at any one time as well, which is amazing.
00:05:39 – 00:06:03  
Super fun to see how big it’s been growing over the years.
00:06:04 – 00:06:59  
So you got into Svelte and have been using it for a while, and it feels like you not only used it at work but also got interested in side projects, right? You built some Capacitor apps; we mentioned that before the show started. But now, most recently, you've worked on SvelteBench and the MCP.
00:07:00 – 00:07:13  
So let’s talk a bit about SvelteBench. What is it?
00:07:14 – 00:07:17  
Well, SvelteBench is a way to grade an LLM on how well it knows Svelte 5 specifically. It goes back to when we still had Svelte 4 and ChatGPT came out, in 2022 or '23, somewhere around there.
00:07:18 – 00:07:36  
When did the world change? It feels like it’s been around forever, but it really hasn’t, right?
00:07:37 – 00:07:44  
But somehow, because I guess Svelte was kind of HTML-like, I feel Svelte 4 was pretty okay for LLMs to write.
00:07:45 – 00:07:55  
Then when Svelte 5 came and I started testing it out, it was so rough to do AI-assisted coding with it. I was just hitting my head against a brick wall for like two weeks.
00:07:56 – 00:08:13  
Then I was on the Svelte.dev site and loaded it up in Network Inspector and found this file called content.json, which powers the search in the top menu. I realized that all of the Svelte docs are in that file.
00:08:14 – 00:08:31  
So I tried dragging and dropping that file into my AI and it got a lot better at doing Svelte 5. That was when I first realized we need to maybe teach LLMs stuff for it to work really well with Svelte 5.
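As a side note, that trick is easy to automate. A minimal sketch of grabbing the docs payload so it can be attached to a prompt; the exact content.json URL is an assumption here, so check the Network tab on svelte.dev yourself.

```ts
// Sketch: download the Svelte docs payload so it can be fed to an AI as context.
// The exact path of content.json is an assumption; verify it in the Network inspector.
import { writeFile } from "node:fs/promises";

const res = await fetch("https://svelte.dev/content.json"); // hypothetical path
if (!res.ok) throw new Error(`Failed to fetch docs: ${res.status}`);
const docs = await res.json();

// Persist it locally so it can be dragged into a chat or embedded in a system prompt.
await writeFile("svelte-docs.json", JSON.stringify(docs, null, 2));
console.log("Saved svelte-docs.json");
```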
00:08:32 – 00:08:43  
Yeah, interesting to see.
00:08:44 – 00:09:03  
So I guess Svelte hadn't changed a lot since Svelte 3 came out in 2019, right? From Svelte 3 to Svelte 5, that was probably five years with no big changes, other than small changes to how transitions worked.
00:09:04 – 00:09:20  
So the LLMs also had a long time to train on Svelte 4 or Svelte 3 code. They probably have a lot of the old stuff in their memory, I don’t know if you would call it layers or learnings.
00:09:21 – 00:09:41  
It’s understandable that LLMs have a hard time writing Svelte 5 because there isn’t a lot of Svelte 5 out there yet compared to Svelte 3 and Svelte 4.
00:09:42 – 00:09:59  
What surprised you?
00:10:00 – 00:10:07  
Actually, before that, how did you build SvelteBench? Did you use LLMs a lot to scaffold the UI?
00:10:08 – 00:10:12  
And yeah, that’s meta.
00:10:13 – 00:10:17  
That’s very meta. I did.
00:10:18 – 00:10:28  
When I sat down to do SvelteBench the first time, I needed a way to basically ask the LLM to write the Svelte component and then test it. I thought maybe we could use something like Vitest for this.
00:10:29 – 00:10:32  
I wrote my prompt and pressed enter, and it was a total disaster, because the LLM (Claude, I think, at that point) couldn't figure out how to hook it all up: dynamically get code from an LLM and put it into Vitest. That was something the LLM had never seen before, or didn't have enough examples of.
00:10:33 – 00:10:50  
So I ended up taking a weekend and spending real engineering time, in quotes, on actually hooking Vitest up with an AI SDK that could dynamically generate a component, which Vitest could then test. Once that worked, the AI saved so much time because the base was already done.
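A minimal sketch of that kind of harness, assuming the Vercel AI SDK with an Anthropic provider; the prompt, file paths, and model ID are illustrative, not the actual SvelteBench code.

```ts
// generate.ts - ask an LLM for a Svelte 5 component and write it where a Vitest test expects it.
// Illustrative harness only, not the real SvelteBench implementation.
import { writeFile } from "node:fs/promises";
import { generateText } from "ai";              // Vercel AI SDK
import { anthropic } from "@ai-sdk/anthropic";  // provider package (assumption)

const prompt = `
Write a Svelte 5 component using the $state rune that renders a to-do list.
Add data-testid="add-button" to the add button and data-testid="todo-item" to each item.
Return only the component code.`;

const { text } = await generateText({
  model: anthropic("claude-sonnet-4-5"), // model ID is illustrative
  prompt,
});

// Strip a possible markdown fence before saving the component for the test run.
const code = text.replace(/^```(svelte)?\n?/, "").replace(/```$/, "").trim();
await writeFile("generated/TodoList.svelte", code);
```

A separate Vitest run can then import generated/TodoList.svelte, as in the test sketch a bit further down.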
00:10:51 – 00:11:01  
I needed to scale up and write new tests for different runes and Svelte 5 functions. It did perfectly. I made a beautiful dashboard to show off the data—amazing, saved a lot of time.
00:11:02 – 00:11:10  
Over time, I learned when you should stop trying to get the LLM to do something and just do it yourself because it won’t figure some stuff out. I’ve recently learned that writing PRDs (product requirements documents) is a great way to work with LLMs.
00:11:11 – 00:11:20  
A PRD outlines functionality you want and how it should be implemented, splitting the work into phases. Before coding, iterate on the PRD document, maybe split up some functionality even more. That helps a lot.
00:11:21 – 00:11:30  
You don’t want to give the AI a prompt like “build me a tool for communicating with people” and have it generate the whole application. You need to be very specific.
00:11:31 – 00:12:05  
So how do you usually work with AIs or LLMs? Do you split it up? Write out a list of requirements or just YOLO it?
00:12:06 – 00:12:44  
I think everyone has a different way that works for them. I have two modes: an exploratory mode where you’re not coding, just asking to come up with ideas and features; and a low‑level mode.
00:12:45 – 00:13:01  
So I don’t do exactly what you do, Kev. I don’t write those descriptions but it’s more like when I’ve decided that, “okay, I want this feature,” I brainstormed it. Then I’ll write a pretty low‑level step‑by‑step thing—maybe not file level, but refactor this function, add this one, do it in this way that already exists in a similar structure elsewhere.
00:13:02 – 00:13:48  
I give it guidance and see it more like you have the dough that is your source code. You’re trying to mold it into shape over time with different prompts rather than just updating one function at a time. At the end, you check if it’s what you want—yes or no.
00:13:49 – 00:14:02  
If you’re interested, I wanted to talk a little bit about how the actual SvelteBench works.
00:14:03 – 00:14:07  
I was going to get there. I was just interested in hearing how you worked with LLMs before because we got into that. So yeah, tell us about it. Because before I started this, I had only the vaguest idea of what an LLM benchmark even is.
00:14:08 – 00:14:19  
It’s actually quite interesting to become more skeptical of benchmarks. When I see a really good benchmark score now, I’m thinking “okay, but what is it actually testing?” Some benchmarks have the LLM write a piece of code, like a to‑do app, and another LLM grades that code. Is that a good benchmark? It depends on how generous the grading LLM is.
00:14:20 – 00:15:03  
I wanted something fairly reproducible: run the same code multiple times and always get the same score. That's why I hooked in Vitest.
00:15:04 – 00:15:05  
So what the benchmark is, it's a bunch of prompts, like "use the state rune to make a to-do list." Then I ask it to add some data attributes on certain elements so we can test those with Vitest. The code comes in, and we check that there's a button, and that when you click that button in Vitest, a new entry is added to the to-do list.
00:15:06 – 00:15:30  
We’re doing proper unit testing. But the downside is you can’t go completely crazy with it—you can’t tell it to make a full to‑do app however it wants and then objectively test it. You need to be a little bit fenced in and tell it, “okay, add these test tags on these different elements so we can actually test it properly.”
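A minimal sketch of what one of those unit tests could look like, assuming @testing-library/svelte, a jsdom-style Vitest environment, and the illustrative data-testid values from the harness sketch above; it is not the actual SvelteBench test suite.

```ts
// todo.test.ts - check the generated component with Vitest (illustrative, not the real suite).
import { describe, it, expect } from "vitest";
import { render, screen, fireEvent } from "@testing-library/svelte";
// The generated component path and the test ids are assumptions from the harness sketch.
import TodoList from "../generated/TodoList.svelte";

describe("generated to-do list", () => {
  it("adds a new entry when the add button is clicked", async () => {
    render(TodoList);

    const before = screen.queryAllByTestId("todo-item").length;
    await fireEvent.click(screen.getByTestId("add-button"));

    // One more item should be rendered after the click.
    expect(screen.queryAllByTestId("todo-item")).toHaveLength(before + 1);
  });
});
```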
00:15:31 – 00:15:55  
But surprisingly, that has worked fairly well, because now many LLMs basically have a perfect score on the benchmark. I wasn't sure if it would work, but over time the scores have climbed; the top-scoring one is at 93.3%.
00:15:56 – 00:16:18  
That's Sonnet 4.5, actually. So that one basically has an almost perfect score. It still messes up some things.
00:16:19 – 00:16:32  
LLMs are inherently not deterministic. If you run the same prompt through, you’ll get different results every time. I was reading about how different benchmarks work. I found a paper by OpenAI where they proposed this metric called pass@k.
00:16:33 – 00:17:02  
All it really means is you run the test several times and then check how many times it actually succeeded. If you have pass@1, it means that you ran it once and it worked. Pass@10 means you ran it ten times and it worked at least one time.
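That can be written down concretely. A small sketch of the unbiased pass@k estimator from that OpenAI paper, pass@k = 1 - C(n-c, k) / C(n, k) for n runs with c passes; the example numbers below are made up.

```ts
// pass@k: the chance that at least one of k sampled generations passes the tests.
// Unbiased estimator from the OpenAI paper: 1 - C(n-c, k) / C(n, k),
// where n is how many runs you did for a task and c is how many of them passed.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer than k failures, so any k-sized draw must contain a pass
  let failAllK = 1;
  for (let i = n - c + 1; i <= n; i++) {
    failAllK *= 1 - k / i; // numerically stable product form of C(n-c, k) / C(n, k)
  }
  return 1 - failAllK;
}

// Example: 10 runs of one test, 7 passed.
console.log(passAtK(10, 7, 1).toFixed(3));  // 0.700 -> pass@1
console.log(passAtK(10, 7, 10).toFixed(3)); // 1.000 -> pass@10 (at least one of ten passed)
```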
00:17:03 – 00:17:07  
So I'm just going to check here so that I'm not lying to you. Yeah, exactly. Pass@1 is the hardest score to clear, then pass@10; we have pass@10 in the benchmark as well. And we do actually see that some LLMs succeed maybe 90% or 70% of the time.
00:17:08 – 00:18:07  
And that's quite interesting data. For example, Claude 4.5 passes almost everything except the inspect rune. Then there's Sonnet 4. I was thinking about it... oh, I see, so 4.5 is the latest one, I guess.
00:18:08 – 00:18:23  
There is a button at the top that says V1 results, because we changed the tests a little bit to fix a bug. In the V1 results there are more models, but the inspect test is actually broken, so no model could pass it. That's why the max score there is 89% and nothing goes above 90%, since there were ten tests in total.
00:18:24 – 00:19:02  
How many models have you tested now? I know that every time there’s a new open‑source model, I just come running and say, “Stanislav, you need to benchmark this.” Yeah, I didn’t know how fun it was going to be. In the end, I was just testing new major models that came out because it’s pretty easy.
00:19:03 – 00:19:05  
Once you have the benchmark, you just change the model name parameter and rerun it. It gives you a result that you can commit into the repo.
00:19:06 – 00:19:12  
But then I also had this guy, Max, come in recently. Max has been really good about testing models from Open Router. Open Router has a lot of open-source models, including a lot of these Chinese models that don't have a big foothold in the Western market; Z.ai is one of those companies and Qwen is another.
00:19:13 – 00:19:26  
And those models are really good as well. And you can run them locally if you have a very good computer.
00:19:27 – 00:20:05  
What's been really cool is seeing how the closed-source models have improved over time, especially the Sonnet and Opus models from Anthropic. Gemini has also improved a lot. And then recently we actually got some of these open-weights models scoring really well. The second model now is Kimi K2, which is an open-weights model. You can't realistically run it locally, because it's one terabyte in size and needs about a terabyte of graphics memory, but you could technically download it and run it yourself.
00:20:06 – 00:20:13  
You could rent a couple of servers and then run it, right? That means you can ask it anything without worrying about leaking data to all these providers and having them train on your code. That’s pretty big for a lot of companies, I’d say.
00:20:14 – 00:21:04  
And it's pretty interesting that we have an open-weights one as number two. And you're not seeing... I don't see OpenAI. Well, okay, GPT-5 is number nine. Yeah. Somehow, I don't know what it is with OpenAI.
00:21:05 – 00:22:04  
I think it's really sad. I actually think it has probably done a lot of damage to the Svelte ecosystem that OpenAI models, for some reason, have been so terrible at Svelte 5. They've all scored basically at the bottom. They can't even write a state rune or a derived rune; they just write React components.
00:22:05 – 00:23:03  
They’re over‑tuned for React because it’s the most common use case, right? So if most people use React anyway, why waste time training on Svelte?
00:23:04 – 00:24:07  
I mean, in a sense, it would make sense to train it for the most common use case. Because you need fewer training parameters and the model is probably smaller so they can run it more cheaply.
00:24:08 – 00:25:03  
So if most people use React anyway, why waste time training on Svelte? I don’t know. But it kind of makes sense if they want to run it as cheaply as possible.
00:25:04 – 00:26:07  
And that's what I mean about them doing a kind of damage to the Svelte community: their models are the default models in 99% of tools. If you use Copilot, you get GPT-4.1, which, by the way, is a terrible model.
00:26:08 – 00:27:04  
It's like a tiny, cheap, awful model. It's basically at the bottom of the SvelteBench list, all the way at the bottom, which is quite impressive; worse than even some 8-billion-parameter models that you can run on a toaster or your phone.
00:27:05 – 00:28:07  
And I mean, they have gotten better. Like I said, GPT‑5 scores pretty well on the benchmarks. So it’s not all bad, but it took them so long to get this working, which is a bit sad.
00:28:08 – 00:29:02  
So you mentioned this guy, Max, that has come in and done Open Router benchmarks. How do you look at the…?
00:29:03 – 00:30:04  
Open Router is interesting because you can find models that are not quantized and you can find models that are quantized. So do you pick the least quantized model or the most quantized? Because that’s also something that you should probably present on the benchmark as well, right?
00:30:05 – 00:31:04  
Sometimes the provider that you get through Open Router tells you what quantization it is. I've seen some benchmarks where, whether it's Kimi K2 or Z.ai, if you run the model through the official API it's much better than if you run it through Open Router.
00:31:05 – 00:32:07  
We’re trying to learn and understand. Open Router is like a marketplace. So if you have an open model, you have a bunch of providers, and they will all host it. When you call the Open Router API, it basically gives you a round‑robin—whoever is cheapest right now or whoever has the least load right now.
00:32:08 – 00:33:07  
And what we noticed was that the Open Router version of Kimi 2 was much worse than the official. I don’t remember if it was Kimi or another model, but it was one of the big open‑source models.
00:33:08 – 00:34:04  
Working with Max, we learned about this quantization thing: some providers quantize, and some don't even specify the quantization. Quantization basically means trimming off some precision so the model fits into a smaller amount of memory, which makes it cheaper to run but also worsens performance.
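A toy illustration of that precision trade-off, quantizing a few weights to 8-bit integers and back; real inference stacks do this per channel or per block with more sophisticated schemes, so this only shows where the rounding error comes from.

```ts
// Toy 8-bit quantization: map float weights onto the int8 range [-127, 127] with one scale factor.
// Real schemes (per-channel, 4-bit, etc.) are more involved; this just demonstrates the precision loss.
function quantizeInt8(weights: number[]): { q: Int8Array; scale: number } {
  const maxAbs = Math.max(...weights.map(Math.abs)) || 1;
  const scale = maxAbs / 127;
  const q = Int8Array.from(weights.map(w => Math.round(w / scale)));
  return { q, scale };
}

function dequantize(q: Int8Array, scale: number): number[] {
  return Array.from(q, v => v * scale);
}

const weights = [0.1234, -0.9876, 0.0005, 0.4567];
const { q, scale } = quantizeInt8(weights);
const restored = dequantize(q, scale);

// Each weight now takes 1 byte instead of 4 (float32), at the cost of rounding error.
weights.forEach((w, i) =>
  console.log(`${w} -> ${restored[i].toFixed(4)} (error ${(w - restored[i]).toExponential(2)})`)
);
```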
00:34:05 – 00:35:04  
So what we do now is try to choose either the official provider, which we know will be the best because that’s their official one. If there isn’t one available in Open Router, we try to choose the one that clearly states what the precision level of the model is—usually 8‑bit or sometimes 16‑bit.
00:35:05 – 00:36:04  
No, that totally makes sense. And it’s like you can’t really do anything about the cases where they don’t tell you that it’s quantized, but it is, right? That’s just fraud basically.
00:36:05 – 00:37:02  
We try to choose a single provider now when we run these tests so we won’t get a mixture of different providers. It’s actually sad also because open models have gotten a worse score historically just because they don’t have one provider; they have many providers and it’s harder to verify the quality.
00:37:03 – 00:38:07  
So anything else about how this works? No, I think we covered it and it’s pretty easy. Maybe we’ll talk more in the future, but we’re thinking about how to improve SvelteBench. We’re thinking about version 2 with MCP support.
00:38:08 – 00:39:02  
We might talk about that later. Oh, interesting. Yeah, we’re going to talk about the MCP, right? So I was going to ask, what’s next for SvelteBench?
00:39:03 – 00:40:04  
So speaking of the MCP, should we talk about it? Does that make sense? Yeah.
00:40:05 – 00:41:02  
Paolo… Have any of you tried it? Oh, I use it all the time, like with Claude. It's great. So I usually use Claude Code when I do features. Most recently, for the Svelte Society website, I started implementing social media post scheduling in the admin dashboard.
00:41:03 – 00:42:07  
The MCP has been great because, even with Sonnet 4.5, it still writes weird Svelte code sometimes, and it auto-fixes itself until it's correct. That's very nice.
00:42:08 – 00:43:02  
What does the MCP do? I mentioned autofixing, but maybe we should just tell people why we use it. We use the documentation to get better results: basically, we merge our docs into a markdown file and serve it at a known location like /llms.txt. The LLM can call the list-documentation-sections tool and fetch sections.
00:43:03 – 00:44:07  
We made an open‑source MCP for Svelte that had the documentation in it. You could list all the documentation sections and fetch one or more. Then, say, “get the documentation for these things” or let the LLM decide what you need.
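A minimal sketch of what tools like that can look like with the MCP TypeScript SDK; the tool names follow what's described here, but the code, the hard-coded sections, and the wiring are illustrative rather than the official Svelte MCP implementation.

```ts
// docs-mcp.ts - illustrative MCP server exposing Svelte docs sections (not the official server).
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// In a real server the sections would come from the merged docs / llms.txt; hard-coded for brevity.
const sections: Record<string, string> = {
  "runes-state": "# $state\nDeclares reactive state in Svelte 5...",
  "runes-derived": "# $derived\nComputes values from other state...",
};

const server = new McpServer({ name: "svelte-docs", version: "0.0.1" });

server.tool(
  "list-documentation-sections",
  "List the available Svelte documentation sections",
  async () => ({
    content: [{ type: "text", text: Object.keys(sections).join("\n") }],
  })
);

server.tool(
  "get-documentation",
  { section: z.string().describe("A section name from list-documentation-sections") },
  async ({ section }) => ({
    content: [{ type: "text", text: sections[section] ?? "Unknown section" }],
  })
);

await server.connect(new StdioServerTransport());
```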
00:44:08 – 00:45:04  
We started talking in the ambassador’s chat about promoting this to an official thing. Paolo and I worked on an official implementation. We also added autofixing—if you submit code that the LLM wrote, you can get suggestions through static analysis of what it did wrong.
00:45:05 – 00:46:07  
For example, if the LLM writes on:click instead of onclick, the autofixer will return "change on:click to onclick." That's actionable for the LLM. Paolo used Acorn to build the AST and extended the compiler tooling so it could surface those messages programmatically.
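A heavily simplified sketch of that kind of check; the real autofixer works on the AST via Acorn and the Svelte compiler, whereas this just pattern-matches legacy event directives to show the shape of the suggestions an LLM gets back.

```ts
// Simplified autofix check: flag Svelte 4 style event directives and suggest the Svelte 5 form.
// The real implementation analyses the AST; a regex is only good enough for an illustration.
function suggestEventFixes(source: string): string[] {
  const suggestions: string[] = [];
  const legacyEvent = /\bon:([a-zA-Z]+)/g;
  for (const match of source.matchAll(legacyEvent)) {
    suggestions.push(`Change on:${match[1]} to on${match[1]} (Svelte 5 event attribute syntax).`);
  }
  return suggestions;
}

const generated = `<button on:click={() => count++}>Add</button>`;
console.log(suggestEventFixes(generated));
// -> [ "Change on:click to onclick (Svelte 5 event attribute syntax)." ]
```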
00:46:08 – 00:47:07  
We also added a playground feature: ask the LLM to create a Svelte playground, it generates a link, you click it, and you see the component directly in your browser. Claude Code will hopefully one day support this, so you could call an MCP tool and see a preview directly.
00:47:08 – 00:48:07  
We’re also working on version 2 with MCP support to test distilled docs versus full docs. We’ll compare performance—if the difference is small, we can confidently use distilled docs.
00:48:08 – 00:49:02  
So, time for the picks. I'll pick Tech Ingredients as my pick; it's a channel about tech. And I also have a personal pick: a Japanese-style hibachi grill I bought recently. 1200 °C, insane. I'll test it next week.
00:49:03 – 00:50:04  
Where can people find you online? Stanislav?
00:50:05 – 00:51:02  
I made a new website with a pronounceable URL—stanislav.garden. You’ll find my digital garden with all my cool links. I’m on Bluesky, Twitter, GitHub, LinkedIn.
00:51:03 – 00:52:04  
Thank you for joining us. Thank you, Stanislav, for coming on and talking about SvelteBench and MCPs. It was a lot of fun. Thanks to everyone listening. We’ll talk next week. Goodbye!