W07 AI Data Intelligence

One-line Summary

NVIDIA goes all-in on its embodied AI data pipeline; Allen AI releases the Molmo2 video understanding dataset cluster; Reward Model / RLHF papers surge. Strongest data demand signal this week: Robotic Manipulation Data.

Key Findings

This week's 5 findings with high commercial value

P0 NVIDIA Goes All-In on Embodied AI Data Pipeline (2026-02-10)

NVIDIA released or updated 7 datasets and 26 models this week, making it the most active organization. The datasets focus on two directions:

Robotics simulation: `nvidia/PhysicalAI-Robotics-Kitchen-Sim-Demos` (2/10), `nvidia/RoboCasa-Cosmos-Policy`, `nvidia/LIBERO-Cosmos-Policy`. All three serve the Cosmos Policy project, building a closed loop from simulation to policy learning.

Speech TN/ITN (text normalization / inverse text normalization): `nvidia/Numb3rs` (2/6), a speech numeral normalization benchmark.

Business implications → NVIDIA is systematically building the data infrastructure for Physical AI. On the model side, `personaplex-7b-v1` (228K downloads, 1,731 likes) points to strong demand for speech-to-speech capabilities. Data service companies should focus on two growth areas: robotic manipulation data (kitchen scenarios in particular) and speech data.

P0 Allen AI Molmo2 Video Understanding Dataset Cluster Released (2025-12-07~12-16, Still Updating This Week)

Allen AI released 4 video-related datasets: `Molmo2-VideoPoint`, `Molmo2-VideoPointEval`, `Molmo2-VideoCountEval`, `Molmo2-CapEval`, forming a complete video grounding + counting + captioning evaluation framework. Additionally, `pointer-retrieval` (new on 2/10) and `asta-summary-citation-counts` serve as utility datasets.

Business implications → Video understanding data is a hot area in 2026. Allen AI is staking out its position with open-source data and evaluation benchmarks, which is likely to drive demand for more video VLM training data.

P1 Reward Model / RLHF Papers Surge (2026-02-06~02-09)

Eight RLHF/preference learning papers appeared this week. Key trends:

`compar:IA` (2/6): a French government-level LLM arena collecting French preference data; demand for multilingual RLHF data has reached the national level.

`WildReward` (2/9): mines implicit reward signals from online interactions to reduce human labeling costs.

`Fairness Aware Reward Optimization` (2/8): demographic biases propagate through reward models, creating demand for fairness labeling.

`Joint Reward Modeling` (2/7): visual reward models for image editing, expanding demand for multimodal RLHF data.

Business implications → RLHF data is expanding from English monolingual to multilingual, from text to vision, and from manual labeling to semi-automated. Data service companies need to build multilingual preference data collection capabilities as soon as possible.

P1 StepFun Releases Step-3.5-Flash + Dual Evaluation Benchmarks (2026-02-01~02-09)

StepFun released the `Step-3.5-Flash` model (249K downloads, 560 likes), along with two evaluation benchmarks:

`stepfun-ai/GEBench` (2/9): a GUI interaction generation evaluation benchmark.

`stepfun-ai/CF-Div2-Stepfun` (2/9): a competitive programming evaluation benchmark.

Business implications → Chinese AI labs are proactively building evaluation ecosystems rather than relying solely on overseas benchmarks. GUI interaction data is a critical bottleneck for agent deployment.

P2 OpenAI Launches GPT-5.3-Codex + Tests ChatGPT Ads (2026-02-05~02-10)

GPT-5.3-Codex, focused on code generation, went live on 2/5. OpenAI's blog announced that it is testing ads in ChatGPT (2/10). The `openai/gdpval` dataset, which evaluates AI performance across 44 professions and 220 real-world tasks, remains active (28,361 downloads).

Business implications → OpenAI is simultaneously advancing monetization (ads) and capability boundary evaluation (gdpval). The latter suggests they are systematically assessing AI's impact on the labor market, which could affect the data labeling industry itself.

Demand Signals

Infer training data demands from model releases

Robotic Manipulation Data

Rising High ↑ New

NVIDIA 3 robotics datasets · Meta JEPA-WMS · lerobot/piper-collect · BAAI/ToucHD-Sim

Multimodal Preference Data

Rising High ↑ New

7 RLHF papers · Qwen RationaleRM · visual reward model papers

Speech / ASR Data

Rising ↑ New

Mistral Voxtral real-time ASR · NVIDIA Numb3rs · Google WaxalNLP

Code Data

Rising ↑ New

OpenAI GPT-5.3-Codex · StepFun CF-Div2 programming benchmark · Together Aurora-Spec-Coder

Video Understanding Data

Rising ↑ New

Allen AI 4 Molmo2 video datasets · Meta EgoAVU

GUI / Agent Data

Rising ↑ New

StepFun GEBench GUI evaluation · Databricks Agent Bricks GA

Multilingual Data

🟢 Stable ↑ New

Google WaxalNLP African languages · compar:IA French preference data

Code Agent Data ↓ Dropped (present in previous issue, absent this issue)

Robotics / Embodied AI Data ↓ Dropped (present in previous issue, absent this issue)

Document OCR Data ↓ Dropped (present in previous issue, absent this issue)

RLHF Preference Data ↓ Dropped (present in previous issue, absent this issue)

Multilingual Speech Data ↓ Dropped (present in previous issue, absent this issue)

Safety / Content Moderation Data ↓ Dropped (present in previous issue, absent this issue)

Synthetic Visual Data ↓ Dropped (present in previous issue, absent this issue)
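The New and Dropped labels above are consistent with a plain name-by-name comparison between two issues. The following is a minimal sketch of that comparison, not the actual AI Dataset Radar logic. It assumes the previous issue contained exactly the seven dropped signals, and the helper `diff_signals` is a hypothetical name introduced for illustration.

```python
def diff_signals(previous, current):
    """Return (new, dropped) signal names using exact name matching."""
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)


# Assumption: the previous issue listed exactly the signals marked "Dropped".
previous_issue = [
    "Code Agent Data",
    "Robotics / Embodied AI Data",
    "Document OCR Data",
    "RLHF Preference Data",
    "Multilingual Speech Data",
    "Safety / Content Moderation Data",
    "Synthetic Visual Data",
]

# Signals listed in this issue.
this_issue = [
    "Robotic Manipulation Data",
    "Multimodal Preference Data",
    "Speech / ASR Data",
    "Code Data",
    "Video Understanding Data",
    "GUI / Agent Data",
    "Multilingual Data",
]

new, dropped = diff_signals(previous_issue, this_issue)
print("New:", new)
print("Dropped:", dropped)
```

Under exact matching, a renamed category shows up as one dropped entry plus one new entry. Pairs such as "Robotics / Embodied AI Data" and "Robotic Manipulation Data" may be cases of this, so a Dropped label does not necessarily mean the underlying demand disappeared.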

Download Movers

Datasets with the largest download changes this week

Dataset	Downloads	Weekly Growth
nvidia/RoboCasa-Cosmos-Policy	1,332	+39.6%
Qwen/RationaleRM	881	+16.8%
nvidia/HiLiftAeroML	992	+16.2%
google/WaxalNLP	7,465	+2.6%
nvidia/LIBERO-Cosmos-Policy	2,221	+2.2%
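Weekly growth is a relative figure, so a large percentage on a small base can represent few absolute downloads. The sketch below backs out the implied prior-week count and the absolute gain for each row. It assumes growth is computed as (current - previous) / previous and that the Downloads column is the current count; the report states neither, and `implied_previous` is a hypothetical helper.

```python
# (current downloads, weekly growth in percent), copied from the table above.
movers = {
    "nvidia/RoboCasa-Cosmos-Policy": (1332, 39.6),
    "Qwen/RationaleRM": (881, 16.8),
    "nvidia/HiLiftAeroML": (992, 16.2),
    "google/WaxalNLP": (7465, 2.6),
    "nvidia/LIBERO-Cosmos-Policy": (2221, 2.2),
}


def implied_previous(current, growth_pct):
    """Invert growth = (current - previous) / previous to recover previous."""
    return current / (1 + growth_pct / 100)


for name, (current, pct) in movers.items():
    previous = implied_previous(current, pct)
    gain = current - previous
    print(f"{name}: ~{previous:.0f} -> {current} (about +{gain:.0f})")
```

Under these assumptions, `google/WaxalNLP` at +2.6% gained roughly as many absolute downloads as the datasets growing at about 16%, which is worth keeping in mind when ranking by percentage alone.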

Deep Dive — DataRecipe

Three high-value datasets reverse-analyzed this week (auto-generated by DataRecipe)

Qwen/RationaleRM

300 samples · 14 fields · Hard

6.0/10

Data Structure

Risk Assessment

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality gates

Low Risk: Data may become outdated over time → Establish continuous update mechanisms

microsoft/CancerGUIDE

165 samples · 3 fields · Hard

6.0/10

Data Structure

Risk Assessment

Medium Risk: Requires domain experts; talent acquisition may be difficult → Build talent pipeline early or consider outsourcing partnerships

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality gates

Low Risk: Data may become outdated over time → Establish continuous update mechanisms

amazon/doc_split

300 samples · 3 fields · Hard

6.0/10

Data Structure

Risk Assessment

Medium Risk: Requires domain experts; talent acquisition may be difficult → Build talent pipeline early or consider outsourcing partnerships

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality gates

Low Risk: Data may become outdated over time → Establish continuous update mechanisms

Analyzed 3 datasets this week · 83.9% human effort · all Hard difficulty

Want to discuss this issue?

Kai Founder & CEO

苏文 AI Documentation & Release Engineer

陆明哲 AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly
