Radar Brief Week 7, 2026 · 2026-02-04 — 2026-02-11

Video Understanding Data Surges
RLHF Enters the Multimodal Era

This week scanned 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

One-line Summary

NVIDIA goes all-in on its embodied AI data pipeline; Allen AI releases the Molmo2 video understanding dataset cluster; Reward Model / RLHF papers surge. Strongest data demand signal this week: Robotic Manipulation Data.

Key Findings

This week's 5 findings with high commercial value

P0 NVIDIA Goes All-In on Embodied AI Data Pipeline (2026-02-10)

NVIDIA released or updated 7 datasets and 26 models this week, making it the most active organization. The datasets focus on two directions:
Robotics simulation: `nvidia/PhysicalAI-Robotics-Kitchen-Sim-Demos` (2/10), `nvidia/RoboCasa-Cosmos-Policy`, and `nvidia/LIBERO-Cosmos-Policy`, all serving the Cosmos Policy project and closing the loop from simulation to policy learning.
Speech TN/ITN: `nvidia/Numb3rs` (2/6), a benchmark for normalizing spoken numerals (text normalization and inverse text normalization).

Business implications → NVIDIA is systematically building the data infrastructure for Physical AI. On the model side, `personaplex-7b-v1` (228K downloads, 1,731 likes) demonstrates strong demand for speech-to-speech capabilities. Data service companies should focus on two growth areas: robotic manipulation data (kitchen scenarios in particular) and speech data.

P0 Allen AI Molmo2 Video Understanding Dataset Cluster Released (2025-12-07~12-16, Still Updating This Week)

Allen AI released 4 video-related datasets: `Molmo2-VideoPoint`, `Molmo2-VideoPointEval`, `Molmo2-VideoCountEval`, `Molmo2-CapEval`, forming a complete video grounding + counting + captioning evaluation framework. Additionally, `pointer-retrieval` (new on 2/10) and `asta-summary-citation-counts` serve as utility datasets.

Business implications → Video understanding data is a hot track in 2026. Allen AI is staking out its position with open-source data and evaluation benchmarks, which is likely to drive demand for more video VLM training data.

P1 Reward Model / RLHF Papers Surge (2026-02-06~02-09)

Eight RLHF / preference-learning papers appeared this week. Key trends:
`compar:IA` (2/6): a French government-level LLM arena that collects French preference data, a sign that demand for multilingual RLHF data has reached the national level.
`WildReward` (2/9): mines implicit reward signals from online interactions to reduce human labeling costs.
`Fairness Aware Reward Optimization` (2/8): shows that demographic biases propagate through reward models, creating demand for fairness labeling.
`Joint Reward Modeling` (2/7): visual reward models for image editing, expanding demand for multimodal RLHF data.

Business implications → RLHF data is expanding from English-only to multilingual, from text to vision, and from manual labeling to semi-automated labeling. Data service companies need to build multilingual preference data collection capabilities as soon as possible.

P1 StepFun Releases Step-3.5-Flash + Dual Evaluation Benchmarks (2026-02-01~02-09)

StepFun released the `Step-3.5-Flash` model (249K downloads, 560 likes), along with two benchmarks:
`stepfun-ai/GEBench` (2/9): GUI interaction generation evaluation benchmark.
`stepfun-ai/CF-Div2-Stepfun` (2/9): competitive programming evaluation benchmark.

Business implications → Chinese AI labs are proactively building evaluation ecosystems rather than relying solely on overseas benchmarks. GUI interaction data is a critical bottleneck for agent deployment.

P2 OpenAI Launches GPT-5.3-Codex + Tests ChatGPT Ads (2026-02-05~02-10)

GPT-5.3-Codex went live (2/5), focused on code generation.
OpenAI's blog announced that it is testing ChatGPT ads (2/10).
The `openai/gdpval` dataset is active (28,361 downloads); it evaluates AI performance on 220 real-world tasks across 44 professions.

Business implications → OpenAI is simultaneously advancing monetization (ads) and capability boundary evaluation (gdpval). The latter suggests they are systematically assessing AI's impact on the labor market, which could affect the data labeling industry itself.

Demand Signals

Infer training data demands from model releases

| Data Type | Intensity | Trend | Related Signals |
| --- | --- | --- | --- |
| Robotic Manipulation Data | Rising High | ↑ New | NVIDIA 3 robotics datasets · Meta JEPA-WMS · lerobot/piper-collect · BAAI/ToucHD-Sim |
| Multimodal Preference Data | Rising High | ↑ New | 7 RLHF papers · Qwen RationaleRM · visual reward model papers |
| Speech / ASR Data | Rising | ↑ New | Mistral Voxtral real-time ASR · NVIDIA Numb3rs · Google WaxalNLP |
| Code Data | Rising | ↑ New | OpenAI GPT-5.3-Codex · StepFun CF-Div2 programming benchmark · Together Aurora-Spec-Coder |
| Video Understanding Data | Rising | ↑ New | Allen AI 4 Molmo2 video datasets · Meta EgoAVU |
| GUI / Agent Data | Rising | ↑ New | StepFun GEBench GUI evaluation · Databricks Agent Bricks GA |
| Multilingual Data | 🟢 Stable | ↑ New | Google WaxalNLP African languages · compar:IA French preference data |
| Code Agent Data | | ↓ Dropped | Present in previous issue, absent this issue |
| Robotics / Embodied AI Data | | ↓ Dropped | Present in previous issue, absent this issue |
| Document OCR Data | | ↓ Dropped | Present in previous issue, absent this issue |
| RLHF Preference Data | | ↓ Dropped | Present in previous issue, absent this issue |
| Multilingual Speech Data | | ↓ Dropped | Present in previous issue, absent this issue |
| Safety / Content Moderation Data | | ↓ Dropped | Present in previous issue, absent this issue |
| Synthetic Visual Data | | ↓ Dropped | Present in previous issue, absent this issue |
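
The New and Dropped labels above read as a comparison of data-type names between two issues. The brief does not say how the labels are derived, so the following is only a minimal sketch that assumes an exact name match between issues. `label_trends` and the `Continuing` label are hypothetical names introduced here for illustration.

```python
def label_trends(previous_issue, current_issue):
    """Label data types by comparing two issues by exact name (assumed rule)."""
    previous, current = set(previous_issue), set(current_issue)
    labels = {}
    for name in current_issue:
        # "Continuing" is a hypothetical label; the brief only shows New and Dropped.
        labels[name] = "Continuing" if name in previous else "New"
    for name in previous_issue:
        if name not in current:
            labels[name] = "Dropped"
    return labels


# Two names from each issue, taken from the table above.
previous_issue = ["Robotics / Embodied AI Data", "RLHF Preference Data"]
current_issue = ["Robotic Manipulation Data", "Multimodal Preference Data"]

labels = label_trends(previous_issue, current_issue)
for name, label in labels.items():
    print(f"{name}: {label}")
```

Under this assumption a renamed category appears as one Dropped row plus one New row, which would be consistent with "Robotics / Embodied AI Data" dropping in the same week that "Robotic Manipulation Data" appears as new.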

Download Movers

Datasets with the largest download changes this week

| Dataset | Downloads | Weekly Growth |
| --- | --- | --- |
| nvidia/RoboCasa-Cosmos-Policy | 1,332 | +39.6% |
| Qwen/RationaleRM | 881 | +16.8% |
| nvidia/HiLiftAeroML | 992 | +16.2% |
| google/WaxalNLP | 7,465 | +2.6% |
| nvidia/LIBERO-Cosmos-Policy | 2,221 | +2.2% |
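
To put the percentages in absolute terms, the sketch below backs out an approximate previous-week download count for each dataset. It assumes weekly growth is defined as (current - previous) / previous, which the brief does not state. `implied_previous_downloads` is a hypothetical helper, and its results are estimates, not reported figures.

```python
def implied_previous_downloads(current, weekly_growth_pct):
    """Estimate last week's downloads, assuming growth = (current - previous) / previous."""
    return round(current / (1 + weekly_growth_pct / 100))


# (current downloads, weekly growth in percent), copied from the table above.
movers = {
    "nvidia/RoboCasa-Cosmos-Policy": (1332, 39.6),
    "Qwen/RationaleRM": (881, 16.8),
    "nvidia/HiLiftAeroML": (992, 16.2),
    "google/WaxalNLP": (7465, 2.6),
    "nvidia/LIBERO-Cosmos-Policy": (2221, 2.2),
}

for name, (current, growth) in movers.items():
    previous = implied_previous_downloads(current, growth)
    print(f"{name}: ~{previous} -> {current} (about +{current - previous} downloads)")
```

Under that assumption, the largest percentage mover is not the largest absolute mover: a small percentage on a large base can outweigh a large percentage on a small one.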

Deep Dive — DataRecipe

This week's 3 high-value datasets reverse-analyzed (auto-generated by DataRecipe)

Qwen/RationaleRM
300 samples · 14 fields · Hard
6.0/10
🟢 Recommended for Replication

Data Structure

`domain` · `language` · `context` · `response1` · `response2` · `overall_preference` · `individual_preference` · `human-checklist` · `model-low_deceptive_alignment-checklist`
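
When replicating a dataset with this shape, a basic completeness check on each record is a natural first quality gate. The sketch below covers only the nine field names listed above; the brief reports 14 fields in total and the rest are not shown. `missing_fields` and the sample record are hypothetical, and the placeholder values are not real data.

```python
# The nine field names shown in the brief (the remaining fields are not listed).
LISTED_FIELDS = (
    "domain",
    "language",
    "context",
    "response1",
    "response2",
    "overall_preference",
    "individual_preference",
    "human-checklist",
    "model-low_deceptive_alignment-checklist",
)


def missing_fields(record):
    """Return the listed field names that are absent from a record."""
    return [field for field in LISTED_FIELDS if field not in record]


# Hypothetical record with placeholder values, missing the two checklist fields.
sample = {
    "domain": "<placeholder>",
    "language": "<placeholder>",
    "context": "<placeholder>",
    "response1": "<placeholder>",
    "response2": "<placeholder>",
    "overall_preference": "<placeholder>",
    "individual_preference": "<placeholder>",
}

print(missing_fields(sample))
```

The check covers field presence only. Value types and allowed ranges would need the actual dataset card, which is not reproduced in the brief.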

Risk Assessment

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality gates
Low Risk: Data may become outdated over time → Establish continuous update mechanisms

microsoft/CancerGUIDE
165 samples · 3 fields · Hard
6.0/10
🟢 Recommended for Replication

Data Structure

`patient_id` · `patient_note` · `label`

Risk Assessment

Medium Risk: Requires domain experts; talent acquisition may be difficult → Build talent pipeline early or consider outsourcing partnerships
Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality gates
Low Risk: Data may become outdated over time → Establish continuous update mechanisms

amazon/doc_split
300 samples · 3 fields · Hard
6.0/10
🟢 Recommended for Replication

Data Structure

`doc_id` · `total_pages` · `subdocuments`

Risk Assessment

Medium Risk: Requires domain experts; talent acquisition may be difficult → Build talent pipeline early or consider outsourcing partnerships
Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality gates
Low Risk: Data may become outdated over time → Establish continuous update mechanisms

Analyzed 3 datasets this week · 83.9% human effort · all Hard difficulty

Want to discuss this issue?

Kai · Founder & CEO
苏文 · AI Documentation & Release Engineer
陆明哲 · AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly
