Allen AI Turns Web Agent Data into a Product Line
LAION Open-Sources coderforge and r2egym in Bulk
This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts
From 2026-03-21 to 2026-03-24, Allen AI expanded MolmoWeb into a full data stack spanning web trajectories, grounding, QA, ASR, and Native models [P0]. On 2026-03-25, LAION open-sourced coderforge, r2egym, swesmith, and a terminal corpus in bulk, shifting open-source post-training from "single datasets" toward "composable corpus warehouses" [P0]. NVIDIA's long-video and Physical AI data continued to lead, and the strongest public demand remains for data that is long-horizon, executable, and reviewable [P1]. The strongest data demand signal this week: web action trajectory data.
Key Findings
This week's five high-commercial-value findings
Allen AI continued rapidly expanding the MolmoWeb series this week. On 2026-03-21, it released `allenai/MolmoWeb-SyntheticQA`; the Hugging Face page shows about 2.11M rows, with 563 downloads and 6 likes so far. On 2026-03-22, it released `allenai/MolmoWeb-SyntheticGround`, currently at 327 downloads and 5 likes. On 2026-03-23, it released `allenai/OLMoASR-Mix`; its dataset card describes it as a large-scale public internet audio-text pool with about 1M hours of audio, currently at 339 downloads. During the same week, it also released two Native models, `allenai/MolmoWeb-4B-Native` and `allenai/MolmoWeb-8B-Native`, which together with the earlier `MolmoWeb-SyntheticTrajs / HumanSkills / HumanTrajs` form a complete recipe spanning web QA, element grounding, action trajectories, speech recognition, and execution models.
On 2026-03-25, LAION released a concentrated batch of datasets for code agents and RL post-training: `laion/coderforge-preview-unified` currently has 673 downloads, and its Hugging Face page shows 413k rows; `laion/nemotron-terminal-corpus-unified` has 610 downloads; `laion/r2egym-unified`, `laion/swesmith-unified`, `laion/allenai-sera-unified`, and their corresponding 316 / 1k / 10k / 100k shards all went live the same day. From 2026-03-25 to 2026-03-28, LAION then followed up with several batches of `Qwen3-8B` derivative models trained on these datasets, such as `swesmith-*__Qwen3-8B`, `coderforge-*__Qwen3-8B`, `r2egym-*__Qwen3-8B`, and `sft__Kimi-2-5-swesmith-oracle-maxeps-32k__Qwen3-8B`, forming a public pipeline of “dataset → shards → training mixtures → derivative models.”
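The point of shipping the same corpus at 316 / 1k / 10k / 100k row sizes is that shards can be recombined into training mixtures. A minimal sketch of one plausible composition step, size-proportional sampling weights, where the row counts and the weighting scheme are illustrative assumptions rather than LAION's published recipe:

```python
# Hypothetical sketch: turn shard row counts into sampling weights for a
# training mixture. Dataset names mirror the LAION release; the row counts
# and the proportional-weighting scheme are assumptions for illustration.
def mixture_weights(shard_sizes: dict[str, int]) -> dict[str, float]:
    """Return per-source sampling probabilities proportional to row counts."""
    total = sum(shard_sizes.values())
    return {name: rows / total for name, rows in shard_sizes.items()}

shards = {
    "laion/r2egym-unified": 10_000,       # illustrative shard sizes
    "laion/swesmith-unified": 100_000,
    "laion/allenai-sera-unified": 1_000,
}
weights = mixture_weights(shards)
print(weights["laion/swesmith-unified"])  # ~0.90 of samples drawn here
```

Uniform or temperature-scaled weighting would work the same way; the only requirement is that the per-source probabilities sum to 1.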
Among newly added datasets this week, `nvidia/LongGroundedThoughts-video-datagen` was released on 2026-03-23 and currently has 647 downloads and 5 likes, with tags directly pointing to video understanding; `nvidia/ffs_stereo4d` currently has 2,052 downloads; `nvidia/MMOU` rose from 504 to 1,389 within a week, a growth rate of 175.6%. More importantly, robotics data continues scaling up: `nvidia/PhysicalAI-Robotics-Open-H-Embodiment` currently has 51,101 downloads, up 36.5% from the previous period; `nvidia/PhysicalAI-Robotics-Manipulation-Kitchen-Demos` currently has 33,045 downloads, up 58.5% from the previous period.
The `Anthropic/EconomicIndex` dataset was updated on 2026-03-24. Its dataset card explicitly states that it added new analysis and learning curves based on Opus 4.5/4.6; it currently has 11,726 downloads and 494 likes. Meanwhile, Together released `togethercomputer/aurora` on 2026-03-27; its dataset card shows 619,177 samples covering five domains: code, math, reasoning, chat, and finance, and it currently has 5 downloads and 1 like. The former shows “real-world task adoption and user behavior trajectories” being productized into analyzable data assets, while the latter shows that “multi-domain mixed instruction corpora” are still expanding rapidly.
The paper “CUA-Suite,” submitted on 2026-03-25, proposes a large-scale ecosystem for Computer-Use Agents: about 10,000 human demonstration tasks covering 87 applications, about 55 hours of video, and 6 million frames, with additional UI-Vision and GroundCUA resources. This week’s paper list also includes “ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment,” “Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Learning,” “Improving Safety Alignment via Balanced Direct Preference Optimization,” and “UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience.” These works all emphasize video demonstrations, failed experiences, implicit feedback, and safety preferences.
Demand Signals
Inferring training-data demand from this week's model releases
Download Movers
Datasets with the largest download changes this week
| Dataset | Downloads | Weekly Change |
|---|---|---|
| nvidia/MMOU | 1,389 | +175.6% |
| nvidia/PhysicalAI-Robotics-Manipulation-Kitchen-Demos | 33,045 | +58.5% |
| nvidia/PhysicalAI-Robotics-Open-H-Embodiment | 51,101 | +36.5% |
| allenai/molmospaces | 7,684 | -11.2% |
| nvidia/HiLiftAeroML | 975 | -18.8% |
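The percentages above follow the standard relative-change formula between two weekly snapshots. For example, `nvidia/MMOU` rising from 504 to 1,389 downloads works out as:

```python
def weekly_change(prev: int, curr: int) -> float:
    """Relative download change between two weekly snapshots, in percent."""
    return (curr - prev) / prev * 100

# nvidia/MMOU: 504 -> 1,389 downloads (figures from this report)
print(round(weekly_change(504, 1_389), 1))  # 175.6
```

The same formula yields the negative entries, e.g. a dataset halving its downloads week-over-week would show -50.0%.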
Auto-generated by AI Dataset Radar · Updated weekly