92 additional robotics datasets in a single scan
Human judgment is becoming the bottleneck in embodied training
This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts
Allen AI released the MolmoAct2 robotics data cluster in bulk on 2026-05-04, lifting the number of robotics datasets from 7 to 99 during the scan period [P0]; NVIDIA released Physical AI and coding-agent data steadily from 2026-05-01 to 2026-05-06, with activity rising in both video anomaly and software trajectory data [P0]; downloads of Google's non-English speech data accelerated, with google/WaxalNLP growing 83.8% over the comparison period [P1]. Strongest data demand signal this week: robot operation trajectories and language instructions.
Key Findings
This week's 5 findings with high commercial value
Allen AI had a total of 100 datasets enter the scan this week. Many are MolmoAct2-BimanualYAM subsets collected from 2025-11-24 to 2026-01-27 and published together alongside the model on 2026-05-04. Representative datasets include allenai/MolmoAct2-SO100_101-Dataset (119 downloads, 3 likes, dated 2026-05-04); allenai/24112025-yam-01 (1,495 downloads, dated 2025-11-24); allenai/31122025-tablebuss-12 (472 downloads, dated 2025-12-31); and allenai/16012026-scan-13 (459 downloads, 1 like, dated 2026-01-16). In the change data, the robotics category increased from 7 to 99, a net weekly gain of 92.
nvidia/PhysicalAI-Traffic-Anomaly-Reasoning was released on 2026-05-01, with 316 downloads and 6 likes, containing 44,040 pseudo-labeled multi-task annotations, 3,670 CCTV traffic video clips, and about 26.1 hours of video. nvidia/PhysicalAI-VANTAGE-Bench was released on 2026-05-04 with 19 downloads; its subset, nvidia/PhysicalAI-VANTAGE-Bench-Subset, was released on 2026-05-05 with 6 downloads. Both target fixed-infrastructure camera video understanding. Meanwhile, nvidia/SWE-Zero-openhands-trajectories was released on 2026-04-17 with 483 downloads and 3 likes, containing 318k agent trajectories; nvidia/SWE-Hero-openhands-trajectories was released the same day with 133 downloads and 3 likes, containing 34k agent trajectories.
google/WaxalNLP was released on 2026-01-19 and now has 19,454 downloads and 224 likes. Its downloads grew from 10,582 in the previous period to 19,454, a net increase of 8,872 (83.8%), making it the only clearly captured Download Mover this period. The dataset covers African languages; its task tags include automatic-speech-recognition and text-to-speech, and its listed sources include UGSpeechData, DigitalUmuganda/AfriVoice, and "original".
facebook/SCRuB-dataset was released on 2026-05-06 with 16 downloads and 0 likes, targeting rubric-based evaluation for socially sensitive, open-ended essay prompts. facebook/beyond_the_lab_neurips_paper was released the same day with 0 downloads and 0 likes; its tags explicitly include AI-generated visual content detection, human-labeled dataset, and multi-signal evaluation. In the same period, internlm/WildClawBench, released on 2026-03-24, reached 7,683 downloads and 54 likes, also pointing to real-world Agent evaluation.
Change data shows the synthetic category increased from 3 to 35, a net gain of 32. New additions include laion/BVD-AV-55M (15 downloads, no date shown in the main table); laion/openswe-tasks-patched-v5 (31 downloads); laion/swegym-tasks-patched-validated-v2 (21 downloads); laion/exp_rpt_softwareheritage-large-v2 (99 downloads); and the code datasets laion/exp_rpt_codenet-python-v2 (14 downloads), laion/exp_rpt_exercism-python-v2 (13 downloads), and laion/exp_flat25_pseudocode-v2 (17 downloads). On the paper side, four titles appeared in a concentrated burst from 2026-05-06 to 2026-05-07: OpenSearch-VL; A^2TGPO; Think, then Score; and XL-SafetyBench.
Demand Signals
Infer training data demands from model releases
Download Movers
Datasets with the largest download changes this week
| Dataset | Downloads | Weekly Growth |
|---|---|---|
| google/WaxalNLP | 19,454 | +83.8% |
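The weekly growth figure in the table can be reproduced from the two download counts reported for google/WaxalNLP (10,582 in the previous period, 19,454 now). The snippet below is a minimal sketch of that arithmetic; the function name `download_growth` is hypothetical and is not part of any AI Dataset Radar tooling.

```python
def download_growth(previous: int, current: int) -> tuple[int, float]:
    """Return (net increase, percent growth) between two download counts.

    Hypothetical helper for illustration; assumes `previous` is the
    count from the prior scan and `current` the count from this scan.
    """
    if previous <= 0:
        raise ValueError("previous must be positive to compute a growth rate")
    net = current - previous
    pct = net / previous * 100
    return net, pct


# Counts for google/WaxalNLP as reported in this issue.
net, pct = download_growth(10_582, 19_454)
print(f"net increase: {net:,}, weekly growth: {pct:+.1f}%")
```

Run as written, this should yield a net increase of 8,872 and a growth rate that rounds to +83.8%, matching the table row.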
Auto-generated by AI Dataset Radar · Updated weekly