W21 AI Data Intelligence

One-line Summary

Allen AI published the MolmoAct2 robotics data cluster in bulk on 2026-05-04, lifting the number of robotics datasets in the scan from 7 to 99 [P0]; NVIDIA released a run of Physical AI and coding Agent datasets between 2026-05-01 and 2026-05-06, with activity rising in both video anomaly and software trajectory data [P0]; downloads of Google's non-English speech data accelerated, with google/WaxalNLP up 83.8% over the comparison period [P1]. Strongest data demand signal this week: robot operation trajectories and language instructions.

Key Findings

This week's 5 findings with high commercial value

P0 Allen AI published the MolmoAct2 robotics data cluster in bulk on 2026-05-04, lifting the number of robotics datasets in the scan from 7 to 99

Allen AI had 100 datasets enter the scan this week. Many are MolmoAct2-BimanualYAM subsets collected between 2025-11-24 and 2026-01-27 and published together, alongside the model, on 2026-05-04. Representative datasets: allenai/MolmoAct2-SO100_101-Dataset (119 downloads, 3 likes, dated 2026-05-04); allenai/24112025-yam-01 (1,495 downloads, dated 2025-11-24); allenai/31122025-tablebuss-12 (472 downloads, dated 2025-12-31); and allenai/16012026-scan-13 (459 downloads, 1 like, dated 2026-01-16). In the change data, the robotics category increased from 7 to 99, a net weekly gain of 92.
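The net weekly gain quoted above is plain arithmetic on two scan snapshots. A minimal sketch, assuming category counts are held as simple dictionaries; the function name category_deltas and the snapshot layout are illustrative and not AI Dataset Radar's actual schema:

```python
def category_deltas(previous, current):
    """Return {category: current - previous} for every category in either snapshot."""
    categories = set(previous) | set(current)
    return {c: current.get(c, 0) - previous.get(c, 0) for c in categories}


# Counts taken from the change data reported in this issue.
previous_scan = {"robotics": 7}
current_scan = {"robotics": 99}

deltas = category_deltas(previous_scan, current_scan)
print(deltas["robotics"])  # 92
```

A category missing from one snapshot is treated as zero, so newly appearing and disappearing categories both get a delta.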

Business implication → This is not scattered open-source activity, but a signal that “robot action trajectories + language instructions + video/temporal sequences” are becoming systematic training assets for frontier labs. For the data industry, the scarcest resource is not collection hardware, but the human judgment workflow that turns bimanual manipulation, task decomposition, failure retries, and language intent into trainable samples. Knowlyr can prioritize the high-judgment-density steps in embodied data pipelines, such as task decomposition, operation intent description, failure cause attribution, and video segment quality inspection, and make the case that “enabling people to earn income by contributing judgment” is irreplaceable in robotics data.

P0 NVIDIA released a run of Physical AI and coding Agent datasets between 2026-05-01 and 2026-05-06, with activity rising in both video anomaly and software trajectory data

nvidia/PhysicalAI-Traffic-Anomaly-Reasoning was released on 2026-05-01 and has 316 downloads and 6 likes; it contains 44,040 pseudo-labeled multi-task annotations across 3,670 CCTV traffic video clips, about 26.1 hours of video. nvidia/PhysicalAI-VANTAGE-Bench was released on 2026-05-04 and has 19 downloads; its subset, nvidia/PhysicalAI-VANTAGE-Bench-Subset, was released on 2026-05-05 and has 6 downloads. Both target video understanding for fixed-infrastructure cameras. On the coding side, nvidia/SWE-Zero-openhands-trajectories, released earlier on 2026-04-17, has 483 downloads and 3 likes and contains 318k agent trajectories; nvidia/SWE-Hero-openhands-trajectories, released the same day, has 133 downloads and 3 likes and contains 34k agent trajectories.
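The Traffic-Anomaly-Reasoning figures above imply a rough per-clip density. A back-of-envelope sketch, assuming annotations and duration are spread evenly across clips (the source does not state this) and treating the 26.1 hours as approximate:

```python
# Figures as reported for nvidia/PhysicalAI-Traffic-Anomaly-Reasoning.
annotations = 44_040
clips = 3_670
video_hours = 26.1  # approximate, per the dataset description

# Averages only; the real per-clip distribution is not given in this issue.
annotations_per_clip = annotations / clips
avg_clip_seconds = video_hours * 3600 / clips

print(annotations_per_clip)        # 12.0
print(round(avg_clip_seconds, 1))  # 25.6
```

On these averages, each clip of roughly 26 seconds carries about 12 annotations.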

Business implication → NVIDIA is simultaneously betting on “physical-world video reasoning” and “software engineering Agent trajectories,” indicating that high-value training data is shifting from static samples to process data. Whether it is traffic anomaly judgment or code-fix trajectories, humans are needed to define what counts as an anomaly, a valid step, and successful completion. Knowlyr can shift its service focus from one-off sample production to judgment-centric data products such as trajectory review, event segmentation, success criteria design, and trigger conditions for agent help-seeking.

P1 Downloads of Google's non-English speech data accelerated, with google/WaxalNLP up 83.8% over the comparison period

google/WaxalNLP was released on 2026-01-19 and currently has 19,454 downloads and 224 likes. Downloads grew from 10,582 in the previous period to 19,454, a net increase of 8,872 or 83.8%, making it the only clearly captured Download Mover this period. The dataset covers African languages, with tasks including automatic-speech-recognition and text-to-speech, and sources listed as UGSpeechData, DigitalUmuganda/AfriVoice, and original.
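The growth figure can be reproduced from the two download counts. A minimal sketch, assuming growth is measured against the previous period's count; the helper name download_growth is hypothetical:

```python
def download_growth(previous, current):
    """Return (net_increase, percent_growth) relative to the previous count."""
    if previous <= 0:
        raise ValueError("previous count must be positive to compute a growth rate")
    net = current - previous
    return net, net / previous * 100


# Download counts for google/WaxalNLP as reported in this issue.
net, pct = download_growth(10_582, 19_454)
print(net, round(pct, 1))  # 8872 83.8
```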

Business implication → Multilingual speech, especially low-resource speech, remains scarce, and the download surge indicates the market is again looking for “high-quality, non-English, deployable” speech training sets. The core barrier in speech data is not recording itself, but human judgment in transcription consistency, accent coverage, code-switching, and noisy-scene classification. Knowlyr can use this momentum to enter high-value services such as dialect/low-resource language speech quality inspection, transcription arbitration, and speaker attribute classification.

P1 Meta released new social reasoning and AI-generated content detection evaluation sets on 2026-05-06, as evaluation data continues shifting toward open-ended subjective judgment

facebook/SCRuB-dataset was released on 2026-05-06 with 16 downloads and 0 likes, targeting rubric-based evaluation for socially sensitive, open-ended essay prompts. facebook/beyond_the_lab_neurips_paper was released the same day with 0 downloads and 0 likes; its tags explicitly include AI-generated visual content detection, human-labeled dataset, and multi-signal evaluation. In the same period, internlm/WildClawBench, released on 2026-03-24, reached 7,683 downloads and 54 likes, also pointing to real-world Agent evaluation.

Business implication → Evaluation data is shifting from “single-answer benchmarks” to “open tasks with scoring rubrics,” and this type of data naturally depends on consistency design in human judgment. Whoever can turn subjective tasks into stable, auditable, trainable scoring systems will control the next generation of alignment and evaluation infrastructure. Knowlyr should strengthen capabilities in rubric design, multi-annotator arbitration, sensitive content grading, and cross-cultural evaluation to serve post-training and safety evaluation needs.

P2 LAION added 32 synthetic datasets this period, bringing the category to 35, with code and RL environment data beginning to appear as integrated sets

Change data shows the synthetic category increased from 3 to 35, a net gain of 32. New additions include laion/BVD-AV-55M (15 downloads, no date shown in the main table); laion/openswe-tasks-patched-v5 (31 downloads); laion/swegym-tasks-patched-validated-v2 (21 downloads); laion/exp_rpt_softwareheritage-large-v2 (99 downloads); and the code datasets laion/exp_rpt_codenet-python-v2 (14 downloads), laion/exp_rpt_exercism-python-v2 (13 downloads), and laion/exp_flat25_pseudocode-v2 (17 downloads). On the paper side, four titles appeared in a concentrated burst from 2026-05-06 to 2026-05-07: OpenSearch-VL; A^2TGPO; “Think, then Score”; and XL-SafetyBench.

Business implication → Scaled synthetic data is no longer new; the real new signal is that “synthetic task environments + validation sets + trajectory/reward methods” are starting to appear as complete packages. This trend will lower the price of simple data supply, but increase the value of human judgment in validation, error screening, bias detection, and reward modeling. Rather than competing head-on with pure synthetic data at scale, Knowlyr should move into more defensible segments such as synthetic data verification, hard-sample filtering, and reward signal calibration.

Demand Signals

Training data demand inferred from model releases

Robot operation trajectories and language instructions

Very strong ↑ New

Allen AI had 100 datasets this period · the robotics category rose from 7 to 99; MolmoAct2-SO100_101-Dataset was released on 2026-05-04

Video anomaly understanding / infrastructure camera data

Very strong ↑ New

NVIDIA released PhysicalAI-Traffic-Anomaly-Reasoning, PhysicalAI-VANTAGE-Bench, and PhysicalAI-VANTAGE-Bench-Subset in succession from 2026-05-01 to 2026-05-05

Coding Agent trajectories

Strong ↑ New

nvidia/SWE-Zero-openhands-trajectories contains 318k trajectories · downloads 483; SWE-Hero contains 34k trajectories

Open-ended social evaluation and safety rubrics

Strong ↑ New

Meta SCRuB and beyond_the_lab_neurips_paper were released on 2026-05-06; the XL-SafetyBench paper was released on 2026-05-07

Multilingual / low-resource speech

Strong ↑ New

google/WaxalNLP downloads 19,454, up 8,872 from the previous period · growth rate 83.8%

Synthetic code tasks and validation sets

Strong ↑ New

LAION's synthetic category rose from 3 to 35, with integrated releases such as openswe-tasks, swegym, and softwareheritage

Multimodal search and web Agent data

Medium ↑ New

The OpenSearch-VL paper was released on 2026-05-06, and WildClawBench has 7,683 downloads

Video reward modeling / preference learning data

Medium ↑ New

Think, then Score was released on 2026-05-07, emphasizing the decoupling of reasoning and scoring in video reward modeling

Agent evaluation and verifier data · ↓ Dropped · Present in previous issue, absent this issue

Long-horizon delegated document editing trajectories · ↓ Dropped · Present in previous issue, absent this issue

Sovereign Persona / population-distribution grounding data · ↓ Dropped · Present in previous issue, absent this issue

Multi-view robotics and simulator-ready 3D assets · ↓ Dropped · Present in previous issue, absent this issue

Remote sensing temporal change understanding data · ↓ Dropped · Present in previous issue, absent this issue

Verifiable environment data for e-commerce / customer service Agents · ↓ Dropped · Present in previous issue, absent this issue

Multilingual speech ASR / TTS data · ↓ Dropped · Present in previous issue, absent this issue

Translation quality and cultural adaptation evaluation data · ↓ Dropped · Present in previous issue, absent this issue

Download Movers

Datasets with the largest download changes this week

Dataset	Downloads	Weekly Growth
google/WaxalNLP	19,454	+83.8%

Want to discuss this issue?

Kai Founder & CEO

苏文 AI Documentation & Release Engineer

陆明哲 AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly


92 additional robotics datasets in a single scan
Human judgment is becoming the bottleneck in embodied training
