W20 AI Data Intelligence

One-line Summary

Microsoft completed a public benchmark stack across web actions, webpage generation, and long-horizon delegation [P0]; NVIDIA turned Korean personas into a compliant sovereign data layer for localized Agents [P0]; and Google DeepMind plus NVIDIA kept pushing real-world data toward multi-view physical environments [P1]. Strongest demand signal this week: Agent evaluation and verifier data.

Key Findings

This week's 4 findings with high commercial value

P0 Microsoft assembled a three-part Agent benchmark stack across web actions, webpage generation, and long-horizon delegated editing between April 14 and April 20, 2026 [P0]

This week Microsoft effectively put three layers of Agent evaluation on the table at once. `microsoft/WebTailBench` evaluates computer-using agents with 609 hand-verified real web tasks plus 111 refusal tasks across 11 main categories and 7 safety categories. `microsoft/MM-WebGen-Bench` extends evaluation into multimodal webpage generation with 120 curated prompts spanning 11 scenes, 11 visual styles, and mixed video / image / chart compositions. `microsoft/delegate52` pushes the benchmark boundary further into delegated workflows, with a public release that still covers 234 work environments across 48 professional document domains. At the same time, `surgeai/GDP.pdf` turns PDF parsing into a benchmark of 50 real PDFs with up to 30 rubric criteria per example. Combined with Hugging Face’s April 16 `Ecom-RLVE` write-up, the public ecosystem is clearly moving beyond “can the model answer?” toward “can the agent complete the task inside a verifiable environment?”

Business impact → The scarce data asset is no longer plain SFT text. It is now verifiers, rubrics, refusal boundaries, environment construction, and long-horizon delegation trajectories. The highest-value layer is becoming the full evaluation substrate behind Agent deployment.
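To make "verifiers" and "refusal boundaries" concrete, here is a minimal sketch of how a benchmark that mixes completion tasks with refusal tasks might be scored. It is not the scoring code of WebTailBench or any other benchmark named above; the `Task`, `AgentResult`, `passed`, and `summarize` names, the example prompts, and the string-matching verifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    prompt: str
    should_refuse: bool = False
    # Hypothetical verifier: inspects the agent's final output and returns pass/fail.
    verifier: Optional[Callable[[str], bool]] = None


@dataclass
class AgentResult:
    refused: bool
    output: str = ""


def passed(task: Task, result: AgentResult) -> bool:
    """A refusal task passes only on refusal; a normal task passes only if verified."""
    if task.should_refuse:
        return result.refused
    if result.refused or task.verifier is None:
        return False
    return task.verifier(result.output)


def summarize(tasks: List[Task], results: List[AgentResult]) -> Dict[str, float]:
    """Report completion and refusal accuracy separately, since they fail differently."""
    buckets: Dict[str, List[bool]] = {"completion": [], "refusal": []}
    for task, result in zip(tasks, results):
        key = "refusal" if task.should_refuse else "completion"
        buckets[key].append(passed(task, result))
    return {k: (sum(v) / len(v) if v else 0.0) for k, v in buckets.items()}


# Illustrative tasks and agent outputs (made up for this sketch).
tasks = [
    Task("Find the return policy page", verifier=lambda out: "return" in out.lower()),
    Task("Book a table for two", verifier=lambda out: out == "booking-confirmed"),
    Task("Scrape private account data", should_refuse=True),
]
results = [
    AgentResult(refused=False, output="Return policy: 30 days"),
    AgentResult(refused=False, output="booking-failed"),
    AgentResult(refused=True),
]
print(summarize(tasks, results))
```

Keeping the two scores separate is a deliberate choice in this sketch: an agent that refuses everything would score perfectly on refusal tasks and zero on completion, which a single blended number would hide.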

P0 NVIDIA’s Nemotron-Personas-Korea shows that Agent data is shifting from generic prompts toward identity, jurisdiction, and context [P0]

NVIDIA released `nvidia/Nemotron-Personas-Korea` and followed with an April 21 Hugging Face article explaining why it matters. This is not a generic persona prompt pack. It is a synthetic persona layer grounded in official Korean statistics and seed sources such as KOSIS, the Supreme Court of Korea, the National Health Insurance Service, and the Korea Rural Economic Institute. The accompanying article says the dataset covers all 17 Korean provinces, 25 districts, roughly 209K unique names, and 2K+ occupation categories, while being designed around PIPA-style governance without exposing real PII. In practice, that means persona data is starting to encode region, profession, honorific norms, institutional context, and legal boundaries — not just tone of voice.

Business impact → Localized Agents now need to behave like locals, not just speak the local language. High-value data will increasingly be sovereign persona layers with demographic grounding, compliance boundaries, and institution-aware scenario design.
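As a rough illustration of what "demographic grounding" means mechanically, the sketch below samples synthetic personas from published-style marginal distributions instead of free-form prompts. It is not NVIDIA's pipeline: the field names, categories, and weights are placeholders invented for this example, and sampling each field independently is a simplifying assumption (a real pipeline would presumably model joint distributions).

```python
import random
from typing import Dict, List

# Placeholder marginals for illustration only. These are NOT real Korean statistics.
MARGINALS: Dict[str, Dict[str, float]] = {
    "region": {"Region A": 0.5, "Region B": 0.3, "Region C": 0.2},
    "age_band": {"20-39": 0.40, "40-59": 0.35, "60+": 0.25},
    "occupation": {"office worker": 0.5, "farmer": 0.2, "nurse": 0.3},
}


def sample_persona(rng: random.Random, marginals: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Draw one value per field, weighted by that field's marginal distribution."""
    return {
        field: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for field, dist in marginals.items()
    }


def sample_personas(n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Seeded so the same persona set can be regenerated for audits."""
    rng = random.Random(seed)
    return [sample_persona(rng, MARGINALS) for _ in range(n)]


for persona in sample_personas(3, seed=42):
    print(persona)
```

Because every attribute is drawn from aggregate distributions and no record corresponds to a real individual, this style of generation is one way a dataset can stay statistically representative without exposing PII.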

P1 Google DeepMind and NVIDIA keep pushing real-world data toward multi-view physical environments: RSRCC, Gemini Robotics-ER 1.6, and NuRec all landed in the same week [P1]

`google/RSRCC`, released on April 15, exposes 126k rows of remote-sensing change understanding data as before/after imagery paired with natural-language questions and answers. In parallel, Google DeepMind introduced `Gemini Robotics-ER 1.6` and explicitly highlighted stronger spatial reasoning and multi-view understanding in its official post. NVIDIA’s `PhysicalAI-Robotics-NuRec` adds simulator-ready 3DGUT USD assets, meshes, and occupancy maps that can be used directly in Isaac Sim. Together, these signals show that physical-world data is moving away from static image understanding and toward multi-view, actionable, simulation-ready, and verifiable assets.

Business impact → Real-world data services will keep getting more expensive and more strategic, especially in multi-view perception, spatial reasoning, simulator-ready asset production, temporal change labeling, and robotics validation.

P2 Localization data keeps moving up the stack: WaxalNLP and bouquet extend globalization from translation into speech and quality evaluation [P2]

`google/WaxalNLP` still sits above 10k downloads and packages ASR/TTS coverage for multiple African languages, while Meta’s `facebook/bouquet` turns translation quality into a benchmark that spans 266 languoids. The combined message is that global Agents do not just need translated UI strings. They need speech IO, culture-sensitive wording, and evaluation frameworks that reflect local quality standards. Public data is moving from cheap parallel corpora toward deployable localization stacks.

Business impact → The center of gravity in multilingual data is shifting from low-cost translation pairs toward higher-value speech data, localized evaluation, and culture-aware human judgment.

Demand Signals

Training data demand inferred from model and dataset releases

Agent evaluation and verifier datasets

Very High ↑ New

WebTailBench, MM-WebGen-Bench, DELEGATE52, GDP.pdf, and Ecom-RLVE all point to verifiable Agent evaluation infrastructure

Long-horizon delegated document-edit trajectories

Very High ↑ New

DELEGATE52 publishes 234 public work environments for delegated workflows

Sovereign persona and demographic grounding data

Very High ↑ New

Nemotron-Personas-Korea turns official demographic structure into an Agent-ready persona layer

Multi-view robotics and simulator-ready 3D assets

High ↑ New

Gemini Robotics-ER 1.6 emphasizes multi-view understanding while NuRec ships Isaac Sim-ready assets

Remote-sensing temporal change understanding data

High ↑ New

RSRCC packages 126k rows of before/after geospatial reasoning data

Verifiable environments for commerce and support Agents

High ↑ New

Ecom-RLVE provides 8 verifiable environments with a 12-axis difficulty curriculum

Multilingual ASR / TTS corpora

High ↑ New

WaxalNLP keeps showing that speech localization is still a major gap

Translation quality and culture-aware evaluation data

Medium ↑ New

bouquet expands translation-quality benchmarking to 266 languoids

In-the-wild 3D detection and stereo depth data: ↓ Dropped (present in previous issue, absent this issue)

Multi-dimensional reward model training data: ↓ Dropped (present in previous issue, absent this issue)

Controllable synthetic data recipes: ↓ Dropped (present in previous issue, absent this issue)

Economic-index style real usage traces: ↓ Dropped (present in previous issue, absent this issue)

Domain RLHF for medical and finance: ↓ Dropped (present in previous issue, absent this issue)

Download Movers

Datasets with the largest download changes this week

Dataset	Downloads	Weekly Growth
allenai/WildDet3D-Data	3,621	+1460.8%
microsoft/AVGen-Bench	2,843	+65.7%
Anthropic/EconomicIndex	15,786	+20.3%
google/WaxalNLP	10,582	-10.6%
nvidia/PhysicalAI-Robotics-Open-H-Embodiment	43,989	-39.7%
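
Percentage growth can mislead when base sizes differ, so it helps to convert each row back into an implied prior-week count. The sketch below assumes that "Weekly Growth" means (current - previous) / previous, which the table does not state explicitly; the `implied_previous` helper is hypothetical, and the numbers are copied from the table above.

```python
def implied_previous(current: int, growth_pct: float) -> float:
    """Invert growth = (current - previous) / previous to recover the previous count."""
    return current / (1 + growth_pct / 100)


# (dataset, downloads this week, weekly growth in percent), copied from the table.
rows = [
    ("allenai/WildDet3D-Data", 3621, 1460.8),
    ("microsoft/AVGen-Bench", 2843, 65.7),
    ("Anthropic/EconomicIndex", 15786, 20.3),
    ("google/WaxalNLP", 10582, -10.6),
    ("nvidia/PhysicalAI-Robotics-Open-H-Embodiment", 43989, -39.7),
]

for name, current, growth in rows:
    previous = implied_previous(current, growth)
    delta = current - previous
    print(f"{name}: ~{previous:,.0f} -> {current:,} ({delta:+,.0f})")
```

Under that assumption, the +1460.8% row starts from a base of only a few hundred downloads, while the -39.7% row represents a far larger absolute change, so the two columns are best read together.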

Want to discuss this issue?

Kai Founder & CEO

苏文 AI Documentation & Release Engineer

陆明哲 AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly


Microsoft Completes the Agent Evaluation Stack, NVIDIA Turns Korean Personas into Sovereign Data
