Radar Brief Week 8, 2026 · 2026-02-04 — 2026-02-11

Code Agent Data Explosion
Embodied AI Data Standards Elevate

Sources scanned this week: 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

0 Valuable Datasets · 0 Related Papers · 0 Blog Posts · 0 Active Repos

One-line Summary

Three themes stood out this week: NVIDIA's full-stack embodied AI data pipeline, Allen AI's release of the Molmo2 video understanding dataset cluster, and a surge in Reward Model / RLHF papers. The top data demand signal this week is Robotics Manipulation Data.

Key Findings

This week's 5 findings with high commercial value

P0 NVIDIA's Full-Stack Embodied AI Data Pipeline (2026-02-10)

NVIDIA released or updated 7 datasets and 26 models in a single week, making it the most active organization. The datasets cluster in two directions. Robotics simulation: `nvidia/PhysicalAI-Robotics-Kitchen-Sim-Demos` (2/10), `nvidia/RoboCasa-Cosmos-Policy`, and `nvidia/LIBERO-Cosmos-Policy`, all serving the Cosmos Policy project and closing the loop from simulation to policy learning. Speech TN/ITN (text normalization and inverse text normalization): `nvidia/Numb3rs` (2/6), a benchmark for normalizing spoken numbers.

Business implications: NVIDIA is systematically building Physical AI data infrastructure. On the model side, `personaplex-7b-v1` (228K downloads, 1,731 likes) points to strong demand for speech-to-speech models. Data service companies should focus on two growth directions: robotics manipulation data (kitchen and manipulation scenarios) and speech data.

P0 Allen AI Molmo2 Video Understanding Dataset Cluster Release (2025-12-07~12-16, Still Updating This Week)

Allen AI released 4 video-related datasets (`Molmo2-VideoPoint`, `Molmo2-VideoPointEval`, `Molmo2-VideoCountEval`, `Molmo2-CapEval`), which together form a complete evaluation system for video grounding, counting, and captioning. It also released two utility datasets: `pointer-retrieval` (2/10, new) and `asta-summary-citation-counts`.

Business implications: Video understanding data is a hot track in 2026. Allen AI is staking out its position with open-source data and evaluation benchmarks, which is likely to increase demand for training data as more video VLMs are built against them.

P1 Reward Model / RLHF Paper Surge (2026-02-06~02-09)

8 RLHF/preference learning papers appeared this week. Key trends: `compar:IA` (2/6), a French government-level LLM arena collecting French preference data, shows that multilingual RLHF data demand has reached the national level; `WildReward` (2/9) mines implicit reward signals from online interactions, reducing human labeling costs; `Fairness Aware Reward Optimization` (2/8) shows that demographic biases propagate through reward models, creating demand for fairness labeling; `Joint Reward Modeling` (2/7) covers visual reward models for image editing, expanding multimodal RLHF data demand.

Business implications: RLHF data is expanding from English monolingual to multilingual, from text to visual, from human labeling to semi-automated. Data service companies need to rapidly build multilingual preference data collection capabilities.

P1 StepFun (阶跃星辰) Releases Step-3.5-Flash + Dual Evaluation Benchmarks (2026-02-01~02-09)

StepFun released the `Step-3.5-Flash` model (249K downloads, 560 likes) alongside two benchmarks: `stepfun-ai/GEBench` (2/9), a GUI interaction generation evaluation benchmark, and `stepfun-ai/CF-Div2-Stepfun` (2/9), a competitive programming evaluation benchmark.

Business implications: Chinese AI labs are proactively building their own evaluation ecosystems and no longer rely solely on overseas benchmarks. GUI interaction data is a critical bottleneck for Agent deployment.

P2 OpenAI Launches GPT-5.3-Codex + Tests ChatGPT Ads (2026-02-05~02-10)

GPT-5.3-Codex launched (2/5), focused on code generation; the OpenAI blog announced tests of ChatGPT advertising (2/10); the `openai/gdpval` dataset remained active (28,361 downloads), evaluating AI performance across 44 occupations and 220 real-world tasks.

Business implications: OpenAI is simultaneously advancing monetization (advertising) and capability boundary evaluation (gdpval). The latter suggests systematic assessment of AI's impact on the labor market, which may affect the data labeling industry itself.

Demand Signals

Training data demand inferred from model releases

| Data Type | Intensity | Trend | Related Signals |
| --- | --- | --- | --- |
| Robotics Manipulation Data | Strong Rising | → Continuing | NVIDIA 3 robotics datasets · Meta JEPA-WMS · lerobot/piper-collect · BAAI/ToucHD-Sim |
| Multimodal Preference Data | Strong Rising | → Continuing | 7 RLHF papers · Qwen RationaleRM · Visual reward model papers |
| Speech/ASR Data | Rising | → Continuing | Mistral Voxtral real-time ASR · NVIDIA Numb3rs · Google WaxalNLP |
| Code Data | Rising | → Continuing | OpenAI GPT-5.3-Codex · StepFun CF-Div2 programming benchmark · Together Aurora-Spec-Coder |
| Video Understanding Data | Rising | → Continuing | Allen AI 4 Molmo2 video datasets · Meta EgoAVU |
| GUI/Agent Data | Rising | → Continuing | StepFun GEBench GUI evaluation · Databricks Agent Bricks GA |
| Multilingual Data | Stable | → Continuing | Google WaxalNLP African languages · compar:IA French preference data |
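
For readers who want to filter or rank the signals listed above, here is a minimal sketch that holds them as records. The `DemandSignal` class, the helper `at_least`, and the numeric ordering of intensity labels (Strong Rising > Rising > Stable) are illustrative assumptions, not part of the Radar tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemandSignal:
    data_type: str
    intensity: str
    trend: str


# Assumed ordering of the intensity labels used in the table (higher = stronger).
INTENSITY_RANK = {"Strong Rising": 2, "Rising": 1, "Stable": 0}

# Rows copied from the Demand Signals table; the trend is "Continuing" for all of them.
SIGNALS = [
    DemandSignal("Robotics Manipulation Data", "Strong Rising", "Continuing"),
    DemandSignal("Multimodal Preference Data", "Strong Rising", "Continuing"),
    DemandSignal("Speech/ASR Data", "Rising", "Continuing"),
    DemandSignal("Code Data", "Rising", "Continuing"),
    DemandSignal("Video Understanding Data", "Rising", "Continuing"),
    DemandSignal("GUI/Agent Data", "Rising", "Continuing"),
    DemandSignal("Multilingual Data", "Stable", "Continuing"),
]


def at_least(signals, level):
    """Return the signals whose intensity is at or above `level`, keeping table order."""
    floor = INTENSITY_RANK[level]
    return [s for s in signals if INTENSITY_RANK[s.intensity] >= floor]


if __name__ == "__main__":
    for s in at_least(SIGNALS, "Strong Rising"):
        print(s.data_type)
```

Calling `at_least(SIGNALS, "Rising")` would return every row except Multilingual Data, under the assumed ordering.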

Download Movers

Datasets with the largest download changes this week

| Dataset | Downloads | Weekly Growth |
| --- | --- | --- |
| nvidia/RoboCasa-Cosmos-Policy | 1,332 | +39.6% |
| Qwen/RationaleRM | 881 | +16.8% |
| nvidia/HiLiftAeroML | 992 | +16.2% |
| google/WaxalNLP | 7,465 | +2.6% |
| nvidia/LIBERO-Cosmos-Policy | 2,221 | +2.2% |
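
Percentage growth can hide very different absolute changes, so it may help to back out the implied prior-week counts. The sketch below assumes that Weekly Growth is computed as (current − previous) / previous and that the Downloads column holds the current count; the brief does not state either, and the function names are illustrative.

```python
# Rows copied from the Download Movers table: (dataset, downloads, weekly growth %).
MOVERS = [
    ("nvidia/RoboCasa-Cosmos-Policy", 1332, 39.6),
    ("Qwen/RationaleRM", 881, 16.8),
    ("nvidia/HiLiftAeroML", 992, 16.2),
    ("google/WaxalNLP", 7465, 2.6),
    ("nvidia/LIBERO-Cosmos-Policy", 2221, 2.2),
]


def weekly_growth_pct(current: float, previous: float) -> float:
    """Percent change from the previous count to the current one (assumed definition)."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100.0


def implied_previous(current: float, growth_pct: float) -> float:
    """Invert the growth formula to estimate the prior-week count."""
    return current / (1.0 + growth_pct / 100.0)


if __name__ == "__main__":
    for name, downloads, growth in MOVERS:
        prev = implied_previous(downloads, growth)
        print(f"{name}: ~{prev:.0f} -> {downloads} ({downloads - prev:+.0f})")
```

Under these assumptions, the +2.6% for `google/WaxalNLP` corresponds to roughly 189 additional downloads, about half of the roughly 378 behind the +39.6% for `nvidia/RoboCasa-Cosmos-Policy`.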

Deep Dive — DataRecipe

Reverse analysis of this week's 3 high-value datasets (auto-generated by DataRecipe)

Qwen/RationaleRM
300 samples · 14 fields · Hard
6.0/10
🟢 Recommended to Replicate

Data Structure

`domain`, `language`, `context`, `response1`, `response2`, `overall_preference`, `individual_preference`, `human-checklist`, `model-low_deceptive_alignment-checklist`

Risk Assessment

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds
Low Risk: Data may become outdated over time → Establish continuous update mechanisms
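
As one illustration of what a QA process with a quality threshold might look like for a replication of this dataset, the sketch below checks records for completeness. It assumes records are plain dicts keyed by the nine field names shown under Data Structure (the card reports 14 fields, so the rest are not checked); the helper names and the 0.95 threshold are hypothetical.

```python
# Field names as listed under Data Structure; the remaining fields are not shown in the brief.
REQUIRED_FIELDS = (
    "domain",
    "language",
    "context",
    "response1",
    "response2",
    "overall_preference",
    "individual_preference",
    "human-checklist",
    "model-low_deceptive_alignment-checklist",
)


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent, None, or empty strings."""
    return [
        f for f in REQUIRED_FIELDS
        if record.get(f) is None or record.get(f) == ""
    ]


def pass_rate(records: list) -> float:
    """Fraction of records with every required field present and non-empty."""
    if not records:
        return 0.0
    complete = sum(1 for r in records if not missing_fields(r))
    return complete / len(records)


def meets_threshold(records: list, threshold: float = 0.95) -> bool:
    """Hypothetical acceptance gate: a batch passes if enough records are complete."""
    return pass_rate(records) >= threshold


if __name__ == "__main__":
    sample = {f: "placeholder" for f in REQUIRED_FIELDS}  # hypothetical record
    print(missing_fields(sample), meets_threshold([sample]))
```

A completeness check like this only catches structural gaps; judging whether the preference labels themselves are correct would still need human review.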

microsoft/CancerGUIDE
165 samples · 3 fields · Hard
6.0/10
🟢 Recommended to Replicate

Data Structure

`patient_id`, `patient_note`, `label`

Risk Assessment

Medium Risk: Requires domain experts; talent acquisition may be challenging → Build talent pipeline in advance, or consider outsourcing partnerships
Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds
Low Risk: Data may become outdated over time → Establish continuous update mechanisms

amazon/doc_split
300 samples · 3 fields · Hard
6.0/10
🟢 Recommended to Replicate

Data Structure

`doc_id`, `total_pages`, `subdocuments`

Risk Assessment

Medium Risk: Requires domain experts; talent acquisition may be challenging → Build talent pipeline in advance, or consider outsourcing partnerships
Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds
Low Risk: Data may become outdated over time → Establish continuous update mechanisms

3 datasets analyzed this week · 83.9% human labor share · All Hard difficulty

Want to discuss this issue?

Kai, Founder & CEO
苏文, AI Documentation and Release Engineer
陆明哲, AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly
