Radar Brief Week 18, 2026 · 2026-03-21 — 2026-03-28

Allen AI Turns Web Agent Data into a Product Line
LAION Open-Sources coderforge and r2egym in Bulk

This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

One-line Summary

From 2026-03-21 to 2026-03-24, Allen AI turned MolmoWeb into a full data stack of “web trajectories + grounding + QA + ASR + Native models” [P0]. On 2026-03-25, LAION open-sourced coderforge, r2egym, swesmith, and a terminal corpus in bulk, shifting open-source post-training from “single datasets” to “composable corpus warehouses” [P0]. NVIDIA’s long-video and Physical AI data continued to lead, and the strongest public demand remains for data that is “long-horizon + executable + reviewable” [P1]. The strongest data-demand signal this week: web action trajectory data.

Key Findings

This week's five high-commercial-value findings

P0 From 2026-03-21 to 2026-03-24, Allen AI turned MolmoWeb into a full data stack of “web trajectories + grounding + QA + ASR + Native models” [P0]

Allen AI continued rapidly expanding the MolmoWeb series this week. On 2026-03-21, it released `allenai/MolmoWeb-SyntheticQA`; the Hugging Face page shows about 2.11M rows, with 563 downloads and 6 likes so far. On 2026-03-22, it released `allenai/MolmoWeb-SyntheticGround`, currently at 327 downloads and 5 likes. On 2026-03-23, it released `allenai/OLMoASR-Mix`; its dataset card describes it as a large-scale public internet audio-text pool with about 1M hours of audio, currently at 339 downloads. During the same week, it also consecutively released two Native models, `allenai/MolmoWeb-4B-Native` and `allenai/MolmoWeb-8B-Native`, which together with the earlier `MolmoWeb-SyntheticTrajs / HumanSkills / HumanTrajs` form a complete recipe spanning web QA, element grounding, action trajectories, speech recognition, and execution models.
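What makes such a stack trainable is the record layout that ties grounding, actions, and human acceptance together. As a minimal sketch (all field names here are hypothetical, not the actual MolmoWeb schema), one step of a web action trajectory pairs a grounded element with an action and a human-judged success flag:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record layout for one step of a web action trajectory.
# Field names are illustrative; the real MolmoWeb format may differ.
@dataclass
class WebActionStep:
    screenshot: str                  # reference to the page screenshot
    target_element: str              # grounded element, e.g. a CSS selector
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) of the grounded element
    action: str                      # "click", "type", "scroll", ...
    value: str = ""                  # typed text, if any
    success: bool = False            # human-judged step acceptance

@dataclass
class WebTrajectory:
    task: str                        # natural-language goal
    steps: List[WebActionStep] = field(default_factory=list)

    def completed(self) -> bool:
        # A trajectory counts as successful only if every step passed review.
        return bool(self.steps) and all(s.success for s in self.steps)
```

Failed-trajectory correction then amounts to relabeling a step's `success` flag and re-checking `completed()`.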

Business implication → Web Agent training is upgrading from “screenshots + single-step clicks” to a data product line built on “continuous web states + element grounding + action trajectories + voice input + native execution models.” What is truly scarce is not raw web pages, but human judgment signals on target elements, operational intent, and action success or failure. For Knowlyr, this means web grounding, step acceptance, failed-trajectory correction, and voice-to-web-operation tasks can be packaged into standardized data services.
P0 On 2026-03-25, LAION open-sourced coderforge, r2egym, swesmith, and a terminal corpus in bulk, shifting open-source post-training from “single datasets” to “composable corpus warehouses” [P0]

On 2026-03-25, LAION launched a concentrated batch of datasets for code agents and RL post-training: `laion/coderforge-preview-unified` currently has 673 downloads, and its Hugging Face page shows 413k rows; `laion/nemotron-terminal-corpus-unified` has 610 downloads; `laion/r2egym-unified`, `laion/swesmith-unified`, `laion/allenai-sera-unified`, and their corresponding 316 / 1k / 10k / 100k shards all went live the same day. Then from 2026-03-25 to 2026-03-28, LAION successively released multiple batches of `Qwen3-8B` derivative models trained on these datasets, such as `swesmith-*__Qwen3-8B`, `coderforge-*__Qwen3-8B`, `r2egym-*__Qwen3-8B`, and `sft__Kimi-2-5-swesmith-oracle-maxeps-32k__Qwen3-8B`, forming a public pipeline of “dataset → shards → training mixtures → derivative models.”
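The “dataset → shards → training mixtures” step can be sketched as weighted sampling over named corpora. This is a minimal illustration under invented names and weights, not LAION's actual mixing code:

```python
import random

def mix_corpora(corpora, weights, n, seed=0):
    """Draw a reproducible n-example training mixture from named corpora.

    corpora: dict mapping corpus name -> list of examples
    weights: dict mapping corpus name -> relative sampling weight
    Illustrative only; the corpus names and weights below are invented.
    """
    rng = random.Random(seed)
    names = list(corpora)
    picks = rng.choices(names, weights=[weights[k] for k in names], k=n)
    return [(name, rng.choice(corpora[name])) for name in picks]

# Invented example: oversample SWE-fix data 3:1 over RL-environment data.
mixture = mix_corpora(
    {"swe_fixes": ["patch-1", "patch-2"], "rl_envs": ["env-1"]},
    {"swe_fixes": 3, "rl_envs": 1},
    n=8,
    seed=42,
)
```

Fixing the seed makes a mixture recipe reproducible, which is what lets a "corpus warehouse" publish shards and derivative models against the same slice.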

Business implication → Competition in the open-source community is shifting from “who has a breakout benchmark” to “who can continuously produce reusable, sliceable, and mixable post-training corpus warehouses.” Most of these corpora target code execution, terminal operations, SWE fixes, RL environments, and agent rollouts, and naturally require human acceptance, error attribution, and hard-example filtering. The high-value entry points for Knowlyr are to provide quality gates, slice segmentation, human review, and failed-case relabeling for such corpora.
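A quality gate of the kind described above can be sketched as a list of named predicates, where a record is accepted only if every check passes and failures are retained for relabeling. Purely illustrative; the check names are invented:

```python
def quality_gate(records, checks):
    """Split records into accepted / rejected, keeping failure reasons.

    checks: list of (name, predicate) pairs; a record is accepted only
    if every predicate returns True. Check names here are invented.
    """
    accepted, rejected = [], []
    for rec in records:
        failed = [name for name, ok in checks if not ok(rec)]
        (rejected if failed else accepted).append((rec, failed))
    return accepted, rejected
```

The rejected pile, with its per-record failure reasons, is exactly the queue that failed-case relabeling would work through.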
P1 NVIDIA’s long-video and Physical AI data continue to lead, and the strongest public demand remains for data that is “long-horizon + executable + reviewable” [P1]

Among newly added datasets this week, `nvidia/LongGroundedThoughts-video-datagen` was released on 2026-03-23 and currently has 647 downloads and 5 likes, with tags directly pointing to video understanding; `nvidia/ffs_stereo4d` currently has 2,052 downloads; `nvidia/MMOU` rose from 504 to 1,389 within a week, a growth rate of 175.6%. More importantly, robotics data continues scaling up: `nvidia/PhysicalAI-Robotics-Open-H-Embodiment` currently has 51,101 downloads, up 36.5% from the previous period; `nvidia/PhysicalAI-Robotics-Manipulation-Kitchen-Demos` currently has 33,045 downloads, up 58.5% from the previous period.

Business implication → Physical AI has not reduced its dependence on real demonstrations and long-video supervision despite advances in simulation and synthetic data. Instead, it places even more emphasis on judgment-intensive signals such as “continuous observation,” “action boundaries,” and “whether a task was truly completed.” For Knowlyr, robot teleoperation, long-video alignment, action success/failure review, and temporal reward modeling remain the highest-margin data directions worth prioritizing.
P1 Anthropic updates Economic Index while Together releases Aurora; real usage trajectories and multi-domain mixed SFT are becoming the two ends of post-training supply [P1]

The `Anthropic/EconomicIndex` dataset was updated on 2026-03-24. Its dataset card explicitly states that it added new analysis and learning curves based on Opus 4.5/4.6; it currently has 11,726 downloads and 494 likes. On the other side, Together released `togethercomputer/aurora` on 2026-03-27; its dataset card shows 619,177 samples covering five major domains: code, math, reasoning, chat, and finance, and it currently has 5 downloads and 1 like. The former represents the productization of “real-world task adoption and user behavior trajectories” into analyzable data assets, while the latter shows that “multi-domain mixed instruction corpora” are still expanding rapidly.

Business implication → Post-training data supply is splitting into two extremes: on one end, behavioral trajectories, task exposure, and implicit feedback from real product usage; on the other, high-coverage, multi-domain instruction mixtures that can be fed directly into training. Knowlyr can bridge these two ends: converting real business behavior into trainable, evaluable, auditable data structures, while adding quality review and domain boundaries to general mixed corpora.
P2 This week’s papers point to one conclusion: the bottleneck for Computer Use, GUI Agents, and preference learning has returned to “continuous human trajectories” and “implicit feedback” [P2]

The paper “CUA-Suite,” submitted on 2026-03-25, proposes a large-scale ecosystem for Computer-Use Agents: about 10,000 human demonstration tasks covering 87 applications, about 55 hours of video, and 6 million frames, with additional UI-Vision and GroundCUA resources. This week’s paper list also includes “ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment,” “Privacy-Preserving Reinforcement Learning from Human Feedback via Decoupled Reward Learning,” “Improving Safety Alignment via Balanced Direct Preference Optimization,” and “UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience.” These works all emphasize video demonstrations, failed experiences, implicit feedback, and safety preferences.

Business implication → Preference learning is no longer just traditional pairwise labeling, and GUI/Computer Use is no longer about isolated coordinate clicks. The industry is increasingly treating “how humans operate, why they fail, and what users’ implicit preferences are” as the next training bottleneck. Whoever can continuously produce this kind of high-information-density judgment data will be closer to the training entry point for the next generation of Agents.

Demand Signals

Infer training data demands from model releases

Data Type | Intensity / Trend | Related Signals
Web action trajectory data | Very strong ↑ New | From 2026-03-21 to 2026-03-24, Allen AI continuously expanded MolmoWeb QA · Ground · Native models and supporting trajectory data
GUI grounding / screen parsing data | Very strong ↑ New | `MolmoWeb-SyntheticGround` and CUA-Suite both emphasize UI element grounding and screenshot understanding
Computer Use continuous video demonstrations | Very strong ↑ New | CUA-Suite proposes 10,000 human demonstration tasks · 87 applications · 55 hours of video · 6 million frames
Code Agent / terminal post-training corpora | Very strong ↑ New | On 2026-03-25, LAION launched coderforge · swesmith · nemotron-terminal-corpus · r2egym in bulk
Robot teleoperation and embodied demonstration data | Very strong ↑ New | Open-H-Embodiment has 51,101 downloads; Kitchen-Demos has 33,045, with weekly growth still high
Long-video reasoning and long-horizon multimodal data | Strong ↑ New | Growth of `LongGroundedThoughts-video-datagen` and `MMOU` jointly points to demand for long-video supervision
Implicit preference and real usage feedback data | Strong ↑ New | `EconomicIndex` adds learning curves; the ImplicitRM and Privacy-Preserving RLHF papers appear the same week
Multi-domain mixed SFT data | Strong ↑ New | Together `aurora` uses 619,177 samples to cover code · math · reasoning · chat · finance
Vision Agent benchmarks | Strong ↑ New | `WildClawBench` · `AVGen-Bench` · `olmOCR-bench-1.5-preview` continue raising the evaluation bar
Speech-to-execution pipeline data | Medium ↑ New | `OLMoASR-Mix` shows speech recognition being folded into the Agent data stack rather than remaining a standalone ASR track
Video understanding/tracking data | ↓ Dropped | Present in previous issue, absent this issue
GUI grounding and mobile operation data | ↓ Dropped | Present in previous issue, absent this issue
Post-training RL/preference data | ↓ Dropped | Present in previous issue, absent this issue
General SFT and code reasoning data | ↓ Dropped | Present in previous issue, absent this issue
Robot teleoperation demonstration data | ↓ Dropped | Present in previous issue, absent this issue
Visual reward and verifiable evaluation sets | ↓ Dropped | Present in previous issue, absent this issue
Long-video audio-visual evaluation benchmarks | ↓ Dropped | Present in previous issue, absent this issue
High-quality multilingual translation evaluation | ↓ Dropped | Present in previous issue, absent this issue
Persona and social distribution simulation data | ↓ Dropped | Present in previous issue, absent this issue
Observational user feedback data | ↓ Dropped | Present in previous issue, absent this issue

Download Movers

Datasets with the largest download changes this week

Dataset | Downloads | Weekly Growth
nvidia/MMOU | 1,389 | +175.6%
nvidia/PhysicalAI-Robotics-Manipulation-Kitchen-Demos | 33,045 | +58.5%
nvidia/PhysicalAI-Robotics-Open-H-Embodiment | 51,101 | +36.5%
allenai/molmospaces | 7,684 | -11.2%
nvidia/HiLiftAeroML | 975 | -18.8%
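The Weekly Growth column is a simple percent change over last week's download count. As a sanity check under that assumption, MMOU's +175.6% follows from its rise from 504 to 1,389 downloads noted above:

```python
def weekly_growth(prev, curr):
    """Percent change from last week's downloads to this week's."""
    return round((curr - prev) / prev * 100, 1)

# nvidia/MMOU rose from 504 to 1,389 downloads this week:
print(weekly_growth(504, 1389))  # → 175.6
```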

Want to discuss this issue?

Kai · Founder & CEO
苏文 · AI Documentation & Release Engineer
陆明哲 · AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly
