Radar Brief Week 20, 2026 · 2026-04-14 — 2026-04-21

Microsoft Completes the Agent Evaluation Stack, NVIDIA Turns Korean Personas into Sovereign Data

This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

One-line Summary

Microsoft completed a public benchmark stack across web actions, webpage generation, and long-horizon delegation [P0]; NVIDIA turned Korean personas into a compliant sovereign data layer for localized Agents [P0]; and Google DeepMind plus NVIDIA kept pushing real-world data toward multi-view physical environments [P1]. Strongest demand signal this week: Agent evaluation and verifier data.

Key Findings

This week's 4 findings with high commercial value

P0 Microsoft assembled a three-part Agent benchmark stack across web actions, webpage generation, and long-horizon delegated editing between April 14 and April 20, 2026 [P0]

This week Microsoft effectively put three layers of Agent evaluation on the table at once. `microsoft/WebTailBench` evaluates computer-using agents with 609 hand-verified real web tasks plus 111 refusal tasks across 11 main categories and 7 safety categories. `microsoft/MM-WebGen-Bench` extends evaluation into multimodal webpage generation with 120 curated prompts spanning 11 scenes, 11 visual styles, and mixed video, image, and chart compositions. `microsoft/delegate52` pushes the benchmark boundary further into delegated workflows, with a public release that still covers 234 work environments across 48 professional document domains. At the same time, `surgeai/GDP.pdf` turns PDF parsing into a benchmark of 50 real PDFs with up to 30 rubric criteria per example. Combined with Hugging Face’s April 16 `Ecom-RLVE` write-up, the public ecosystem is clearly moving beyond “can the model answer?” toward “can the agent complete the task inside a verifiable environment?”

Business impact → The scarce data asset is no longer plain SFT text. It is now verifiers, rubrics, refusal boundaries, environment construction, and long-horizon delegation trajectories. The highest-value layer is becoming the full evaluation substrate behind Agent deployment.

P0 NVIDIA’s Nemotron-Personas-Korea shows that Agent data is shifting from generic prompts toward identity, jurisdiction, and context [P0]

NVIDIA released `nvidia/Nemotron-Personas-Korea` and followed up with an April 21 Hugging Face article explaining why it matters. This is not a generic persona prompt pack. It is a synthetic persona layer grounded in official Korean statistics and seed sources such as KOSIS, the Supreme Court of Korea, the National Health Insurance Service, and the Korea Rural Economic Institute. The accompanying article says the dataset covers all 17 Korean provinces, 25 districts, roughly 209K unique names, and 2K+ occupation categories, while being designed around PIPA-style governance without exposing real PII. In practice, that means persona data is starting to encode region, profession, honorific norms, institutional context, and legal boundaries, not just tone of voice.

Business impact → Localized Agents now need to behave like locals, not just speak the local language. High-value data will increasingly be sovereign persona layers with demographic grounding, compliance boundaries, and institution-aware scenario design.

P1 Google DeepMind and NVIDIA keep pushing real-world data toward multi-view physical environments: RSRCC, Gemini Robotics-ER 1.6, and NuRec all landed in the same week [P1]

`google/RSRCC`, released on April 15, exposes 126k rows of remote-sensing change understanding data as before/after imagery paired with natural-language questions and answers. In parallel, Google DeepMind introduced `Gemini Robotics-ER 1.6` and explicitly highlighted stronger spatial reasoning and multi-view understanding in its official post. NVIDIA’s `PhysicalAI-Robotics-NuRec` adds simulator-ready 3DGUT USD assets, meshes, and occupancy maps that can be used directly in Isaac Sim. Together, these signals show that physical-world data is moving away from static image understanding and toward multi-view, actionable, simulation-ready, and verifiable assets.

Business impact → Real-world data services will keep getting more expensive and more strategic, especially in multi-view perception, spatial reasoning, simulator-ready asset production, temporal change labeling, and robotics validation.

P2 Localization data keeps moving up the stack: WaxalNLP and bouquet extend globalization from translation into speech and quality evaluation [P2]

`google/WaxalNLP` still sits above 10k downloads and packages ASR/TTS coverage for multiple African languages, while Meta’s `facebook/bouquet` turns translation quality into a benchmark that spans 266 languoids. The combined message is that global Agents do not just need translated UI strings. They need speech I/O, culture-sensitive wording, and evaluation frameworks that reflect local quality standards. Public data is moving from cheap parallel corpora toward deployable localization stacks.

Business impact → The center of gravity in multilingual data is shifting from low-cost translation pairs toward higher-value speech data, localized evaluation, and culture-aware human judgment.

Demand Signals

Training data demand inferred from model releases

Data Type | Intensity | Trend | Related Signals
Agent evaluation and verifier datasets | Very High | ↑ New | WebTailBench, MM-WebGen-Bench, DELEGATE52, GDP.pdf, and Ecom-RLVE all point to verifiable Agent evaluation infrastructure
Long-horizon delegated document-edit trajectories | Very High | ↑ New | DELEGATE52 publishes 234 public work environments for delegated workflows
Sovereign persona and demographic grounding data | Very High | ↑ New | Nemotron-Personas-Korea turns official demographic structure into an Agent-ready persona layer
Multi-view robotics and simulator-ready 3D assets | High | ↑ New | Gemini Robotics-ER 1.6 emphasizes multi-view understanding while NuRec ships Isaac Sim-ready assets
Remote-sensing temporal change understanding data | High | ↑ New | RSRCC packages 126k rows of before/after geospatial reasoning data
Verifiable environments for commerce and support Agents | High | ↑ New | Ecom-RLVE provides 8 verifiable environments with a 12-axis difficulty curriculum
Multilingual ASR / TTS corpora | High | ↑ New | WaxalNLP keeps showing that speech localization is still a major gap
Translation quality and culture-aware evaluation data | Medium | ↑ New | bouquet expands translation-quality benchmarking to 266 languoids
In-the-wild 3D detection and stereo depth data | - | ↓ Dropped | Present in previous issue, absent this issue
Multi-dimensional reward model training data | - | ↓ Dropped | Present in previous issue, absent this issue
Controllable synthetic data recipes | - | ↓ Dropped | Present in previous issue, absent this issue
Economic-index style real usage traces | - | ↓ Dropped | Present in previous issue, absent this issue
Domain RLHF for medical and finance | - | ↓ Dropped | Present in previous issue, absent this issue

Download Movers

Datasets with the largest download changes this week

Dataset | Downloads | Weekly Growth
allenai/WildDet3D-Data | 3,621 | +1460.8%
microsoft/AVGen-Bench | 2,843 | +65.7%
Anthropic/EconomicIndex | 15,786 | +20.3%
google/WaxalNLP | 10,582 | -10.6%
nvidia/PhysicalAI-Robotics-Open-H-Embodiment | 43,989 | -39.7%
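
The growth percentages above are easier to compare once they are converted back into absolute download counts. The following is a minimal sketch, assuming Weekly Growth is defined as (current - previous) / previous, which this brief does not state. The function name `implied_previous` is hypothetical, and the back-computed values are rounded estimates, not reported figures.

```python
# Back out last week's download counts from the Download Movers table.
# ASSUMPTION (not stated in the brief):
#   growth % = (current - previous) / previous * 100

# (current downloads, weekly growth in percent), copied from the table
movers = {
    "allenai/WildDet3D-Data": (3621, 1460.8),
    "microsoft/AVGen-Bench": (2843, 65.7),
    "Anthropic/EconomicIndex": (15786, 20.3),
    "google/WaxalNLP": (10582, -10.6),
    "nvidia/PhysicalAI-Robotics-Open-H-Embodiment": (43989, -39.7),
}


def implied_previous(current: int, growth_pct: float) -> float:
    """Estimate the previous week's count under the assumed growth definition."""
    if growth_pct <= -100.0:
        raise ValueError("growth of -100% or less has no finite previous value")
    return current / (1.0 + growth_pct / 100.0)


for name, (current, growth_pct) in movers.items():
    previous = implied_previous(current, growth_pct)
    delta = current - previous
    print(f"{name}: ~{previous:,.0f} last week, change {delta:+,.0f}")
```

Read this way, a large percentage on a small base (WildDet3D-Data) can represent fewer added downloads than a modest percentage swing on a large base, so the two columns are best interpreted together.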

Want to discuss this issue?

Kai, Founder & CEO
苏文, AI Documentation & Release Engineer
陆明哲, AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly
