Microsoft Completes the Agent Evaluation Stack, NVIDIA Turns Korean Personas into Sovereign Data
This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts
Microsoft completed a public benchmark stack across web actions, webpage generation, and long-horizon delegation [P0]; NVIDIA turned Korean personas into a compliant sovereign data layer for localized Agents [P0]; and Google DeepMind plus NVIDIA kept pushing real-world data toward multi-view physical environments [P1]. Strongest demand signal this week: Agent evaluation and verifier data.
Key Findings
This week's 4 findings with high commercial value
This week Microsoft effectively put three layers of Agent evaluation on the table at once. `microsoft/WebTailBench` evaluates computer-using agents with 609 hand-verified real web tasks plus 111 refusal tasks across 11 main categories and 7 safety categories. `microsoft/MM-WebGen-Bench` extends evaluation into multimodal webpage generation with 120 curated prompts spanning 11 scenes, 11 visual styles, and mixed video / image / chart compositions. `microsoft/delegate52` pushes the benchmark boundary further into delegated workflows, with a public release that still covers 234 work environments across 48 professional document domains. At the same time, `surgeai/GDP.pdf` turns PDF parsing into a benchmark of 50 real PDFs with up to 30 rubric criteria per example. Combined with Hugging Face’s April 16 `Ecom-RLVE` write-up, these releases show the public ecosystem clearly moving beyond “can the model answer?” toward “can the agent complete the task inside a verifiable environment?”
NVIDIA released `nvidia/Nemotron-Personas-Korea` and followed with an April 21 Hugging Face article explaining why it matters. This is not a generic persona prompt pack. It is a synthetic persona layer grounded in official Korean statistics and seed sources such as KOSIS, the Supreme Court of Korea, the National Health Insurance Service, and the Korea Rural Economic Institute. The accompanying article says the dataset covers all 17 Korean provinces, 25 districts, roughly 209K unique names, and 2K+ occupation categories, while being designed around PIPA-style governance without exposing real PII. In practice, that means persona data is starting to encode region, profession, honorific norms, institutional context, and legal boundaries — not just tone of voice.
`google/RSRCC`, released on April 15, exposes 126k rows of remote-sensing change understanding data as before/after imagery paired with natural-language questions and answers. In parallel, Google DeepMind introduced `Gemini Robotics-ER 1.6` and explicitly highlighted stronger spatial reasoning and multi-view understanding in its official post. NVIDIA’s `PhysicalAI-Robotics-NuRec` adds simulator-ready 3DGUT USD assets, meshes, and occupancy maps that can be used directly in Isaac Sim. Together, these signals show that physical-world data is moving away from static image understanding and toward multi-view, actionable, simulation-ready, and verifiable assets.
`google/WaxalNLP` still sits above 10k downloads and packages ASR/TTS coverage for multiple African languages, while Meta’s `facebook/bouquet` turns translation quality into a benchmark that spans 266 languoids. The combined message is that global Agents need more than translated UI strings: they need speech I/O, culture-sensitive wording, and evaluation frameworks that reflect local quality standards. Public data is moving from cheap parallel corpora toward deployable localization stacks.
Demand Signals
Training data demand inferred from model releases
Download Movers
Datasets with the largest download changes this week
| Dataset | Downloads | Weekly Growth |
|---|---|---|
| allenai/WildDet3D-Data | 3,621 | +1460.8% |
| microsoft/AVGen-Bench | 2,843 | +65.7% |
| Anthropic/EconomicIndex | 15,786 | +20.3% |
| google/WaxalNLP | 10,582 | -10.6% |
| nvidia/PhysicalAI-Robotics-Open-H-Embodiment | 43,989 | -39.7% |
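To put the growth column in context, here is a minimal Python sketch of how a week-over-week percentage could be computed and inverted. It assumes that "Weekly Growth" means `(current - previous) / previous * 100`; the report does not state its formula, so this definition is an assumption. The helper names `weekly_growth_pct` and `implied_previous` are hypothetical, and the implied prior-week counts are back-calculated estimates, not reported figures.

```python
def weekly_growth_pct(current: int, previous: int) -> float:
    """Percent change from last week's downloads to this week's.

    Assumed definition; the report does not document its formula.
    """
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100.0


def implied_previous(current: int, growth_pct: float) -> float:
    """Invert the assumed formula to estimate last week's downloads."""
    if growth_pct <= -100.0:
        raise ValueError("growth_pct must be greater than -100")
    return current / (1.0 + growth_pct / 100.0)


# Rows copied from the table above: (dataset, downloads, weekly growth in %).
movers = [
    ("allenai/WildDet3D-Data", 3621, 1460.8),
    ("microsoft/AVGen-Bench", 2843, 65.7),
    ("Anthropic/EconomicIndex", 15786, 20.3),
    ("google/WaxalNLP", 10582, -10.6),
    ("nvidia/PhysicalAI-Robotics-Open-H-Embodiment", 43989, -39.7),
]

for name, downloads, growth in movers:
    prev = implied_previous(downloads, growth)
    print(f"{name}: ~{prev:,.0f} implied prior-week downloads ({growth:+.1f}%)")
```

Under that assumption, the largest percentage mover started from a base in the low hundreds, so percentage growth is best read alongside the absolute download count.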
Auto-generated by AI Dataset Radar · Updated weekly