Radar Brief Week 10, 2026 · 2026-02-05 — 2026-02-12

GPT-5.2 Enters Scientific Discovery
Data Recipe Engineering Accelerates

This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

One-line Summary

Allen AI releases the Sera code-agent trajectory datasets, advancing the open-source code-agent training ecosystem; NVIDIA open-sources PhysicalAI, a kitchen-robotics demonstration dataset with 600 hours of real manipulation data; Meta releases EgoAVU, a first-person audio-video understanding dataset that opens a new data track. Top data demand signal this week: code-agent trajectory data.

Key Findings

This week's 5 findings with high commercial value

P0 Allen AI Releases Sera Code Agent Trajectory Datasets, Advancing the Open-Source Code Agent Training Ecosystem (2026-02-10/11)

Details: On February 10-11, 2026, Allen AI released six Sera-series datasets (Sera-4.5A-Django-T1/T2, Sera-4.5A-Sympy-T1/T2, Sera-4.5A-Sphinx-T1/T2) covering three major open-source projects, Django, Sympy, and Sphinx, with over 136K code-modification trajectories. The datasets were generated with GLM-4.5-Air as the teacher model using the SVG (Synthetic Verification-Guided) method, and contain complete function-level code-modification trajectories, patches, and verification results. Quality control uses a two-round verification mechanism: round one (T1) with unconstrained recall, round two (T2) with recall fixed at 0.5.

Business implications:
1. Labeling paradigm innovation: The SVG method breaks through the traditional human-labeling bottleneck by using automated verification to ensure code-modification correctness, providing a replicable technical path for large-scale production of code-agent training data.
2. Open-source competition intensifying: Allen AI freely releasing 136K high-quality code trajectories directly impacts the commercial code-data service market. Data service companies need to establish differentiated advantages in data scale, domain coverage, or labeling quality.
3. Synthetic data mainstreaming: The successful use of GLM-4.5-Air (not a top-tier model) to generate training data validates the "mid-tier model + verification mechanism" approach to synthetic data, lowering the cost threshold for data production.
4. Vertical domain opportunity: The Sera datasets focus on three specific open-source projects, suggesting that enterprise code-agent training requires large amounts of fine-tuning data for specific codebases, a commercial opportunity for customized enterprise code datasets.
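The verification-guided idea behind SVG can be sketched as a filter over teacher-generated candidates: only trajectories whose patches pass the repository's own tests survive. A minimal illustration, not Allen AI's actual pipeline; the `Trajectory` fields and the `verify` hook are hypothetical stand-ins:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    """A candidate code-modification trajectory (issue -> patch -> test result)."""
    issue: str
    patch: str
    passed: bool  # result of running the project's test suite on the patch

def svg_filter(candidates: List[Trajectory],
               verify: Callable[[Trajectory], bool]) -> List[Trajectory]:
    """Keep only trajectories whose patch passes automated verification.

    This is what replaces human review: correctness is certified by
    executing the repository's own tests against each generated patch.
    """
    return [t for t in candidates if verify(t)]

# Toy verifier: in practice this would apply the patch and run the test suite.
verify = lambda t: t.passed

pool = [
    Trajectory("fix Django ORM bug", "patch-a", passed=True),
    Trajectory("fix Sympy simplify", "patch-b", passed=False),
    Trajectory("fix Sphinx builder", "patch-c", passed=True),
]
kept = svg_filter(pool, verify)
print(len(kept))  # 2 of the 3 candidates survive verification
```

The two-round T1/T2 setup would then correspond to running this filter with different recall thresholds on the verifier.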
P0 NVIDIA Releases PhysicalAI Kitchen Robotics Demo Dataset, 600 Hours of Real Manipulation Data Open-Sourced (2026-02-10)

Details: On February 10, 2026, NVIDIA released PhysicalAI-Robotics-Kitchen-Sim-Demos, containing 600 hours of human teleoperation demonstrations across 316 different tasks and 55K trajectories. Data was collected with a Franka Panda arm on an Omron mobile base and follows the LeRobot format, providing complete action, state, and sensor data. NVIDIA also released the SAGE-10k dataset (2025-12-31), with 10K interactive indoor scenes covering 50 room types.

Business implications:
1. Embodied-AI data bar raised: The release of 600 hours of real robotics manipulation data significantly raises the baseline for robotics datasets. Commercial data suppliers still operating at the scale of a few hundred trajectories will rapidly lose competitiveness.
2. Hardware-data binding trend: NVIDIA is building a closed-loop "hardware-data-algorithm" ecosystem by pairing standardized hardware (Franka Panda + Omron) with matching datasets. Data service companies should consider partnerships with mainstream robotics hardware manufacturers.
3. Scene standardization demand: SAGE-10k's 50 room types indicate that robotics training requires large-scale, diverse scene data, creating a "3D scene generation + robotics action annotation" service opportunity.
4. Format standardization trend: LeRobot is becoming the de facto format for robotics datasets. Data service companies must ensure their output data is compatible with it.
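For suppliers targeting LeRobot compatibility, the unit of delivery is a per-timestep frame. The sketch below uses key names common in LeRobot-style datasets (`observation.state`, `action`, `episode_index`, etc.), but NVIDIA's exact schema may differ; the values and the `validate_frame` check are illustrative, not part of the release:

```python
# Illustrative sketch of a LeRobot-style per-timestep record.
# Key names follow the common LeRobot convention; the actual schema of
# NVIDIA's kitchen dataset may differ, so treat this as an assumption.
frame = {
    "observation.state": [0.12, -0.48, 0.33, 0.0, 0.91, -0.05, 0.27],  # joint positions
    "action": [0.10, -0.50, 0.35, 0.0, 0.90, -0.05, 0.30],            # commanded targets
    "timestamp": 4.333,        # seconds since episode start
    "episode_index": 17,       # which trajectory this frame belongs to
    "frame_index": 130,        # timestep within the episode
    "task_index": 212,         # which task is being demonstrated
}

def validate_frame(f: dict) -> bool:
    """Minimal compatibility check a data supplier might run before delivery."""
    required = {"observation.state", "action", "timestamp",
                "episode_index", "frame_index", "task_index"}
    return required <= f.keys() and len(f["observation.state"]) == len(f["action"])

print(validate_frame(frame))  # True
```

Running a check like this over every frame before delivery is a cheap way to catch schema drift against the target format.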
P1 Meta Releases EgoAVU First-Person Audio-Video Understanding Dataset, Opening a New Data Track (2026-01-09)

Details: On January 9, 2026, Meta released facebook/EgoAVU_data, focused on joint first-person audio-video understanding. The dataset was built with a scalable automated data engine and contains QA pairs plus audio and video multimodal annotations, designed for training AI models that understand human daily activities.

Business implications:
1. Emerging data type: First-person audio-video data is a critical training resource for AR/VR and embodied AI, but market supply is scarce, offering data service companies a new track with light competition.
2. Collection device opportunity: First-person data requires specialized wearable devices (such as Meta's smart glasses) for collection. Data service companies can partner with hardware manufacturers to build collection infrastructure.
3. Automated data engine: Meta's emphasis on a "scalable automated data engine" implies that large-scale data production must rely on automated toolchains; the efficiency disadvantage of traditional human labeling will be further amplified.
4. Scenario diversity demand: Understanding daily activities requires covering numerous life scenarios (cooking, repairs, socializing, etc.), providing new business directions for crowdsourced labeling platforms.
P1 DataChef Paper Proposes RL-Optimized Data Recipe Method, Data Mixing Ratios Become New Focus (2026-02-11)

Details: The DataChef paper, published on February 11, 2026, proposes using reinforcement learning to optimize LLM training-data recipes: an RL algorithm automatically searches for optimal mixing ratios across data sources, significantly improving model performance. Another paper published the same day, "Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning," found that repeating high-quality data is more effective than simply scaling data volume in long chain-of-thought fine-tuning.

Business implications:
1. Data quality assessment demand: The DataChef method presupposes accurate evaluation of the quality and characteristics of different data sources, creating new market demand for data evaluation services and quality-scoring tools.
2. Small-scale premium data approach: The finding that data repetition beats data scaling points small and mid-sized data service companies in the right direction: rather than pursuing massive volumes of low-quality data, focus on producing small volumes of high-quality, reusable premium datasets.
3. Data recipe consulting services: Enterprise clients need professional help determining the optimal data mix for their specific tasks. Data service companies can offer data-recipe optimization consulting beyond selling raw data.
4. Synthetic data granularity control: These studies suggest future data production needs finer-grained control (difficulty distribution, style consistency) rather than simple volume scaling.
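DataChef's actual RL algorithm is not detailed here, but the underlying search problem is easy to state: choose per-source weights that sum to 1 so as to maximize a downstream score. Below is a naive random-search sketch under that framing; `proxy_score` is an invented stand-in for the expensive train-and-evaluate signal, and an RL policy would replace the blind sampling loop:

```python
import random

SOURCES = ["web", "code", "math", "dialogue"]

def proxy_score(weights: dict) -> float:
    """Stand-in for an expensive reward signal (train a small model, eval it).
    Here: a fixed quadratic preferring a code/math-heavy mix (invented)."""
    target = {"web": 0.3, "code": 0.35, "math": 0.25, "dialogue": 0.1}
    return -sum((weights[s] - target[s]) ** 2 for s in SOURCES)

def random_mixture(rng: random.Random) -> dict:
    """Sample one candidate recipe: positive weights normalized to sum to 1."""
    raw = [rng.random() for _ in SOURCES]
    total = sum(raw)
    return {s: r / total for s, r in zip(SOURCES, raw)}

def search_recipe(trials: int = 2000, seed: int = 0) -> dict:
    """Naive random search over mixing ratios; an RL approach would instead
    learn a policy that proposes recipes and updates from the reward."""
    rng = random.Random(seed)
    return max((random_mixture(rng) for _ in range(trials)), key=proxy_score)

best = search_recipe()
print({s: round(w, 2) for s, w in best.items()})
```

Even this crude loop illustrates the consulting angle: the valuable asset is not the search code but a trustworthy `proxy_score` for the client's task.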
P2 GLM-5 744B Model Released, Chinese LLMs Enter 700B-Parameter Era (February 2026)

Details: Zhipu AI (Z.ai) released GLM-5 in February 2026 with 744B parameters, making it China's largest open-source LLM. Discussion of GLM-5 on X and other social media surged, with multiple tech communities sharing technical details.

Business implications:
1. Chinese data demand explosion: Pre-training a 700B-parameter model requires tens of terabytes of high-quality Chinese data, creating enormous commercial opportunities for Chinese data suppliers, especially for vertical-domain, high-quality conversation, code, and multimodal Chinese data.
2. RLHF/alignment data gap: Alignment difficulty for ultra-large models grows sharply with scale, requiring massive amounts of high-quality preference data and red-team testing data, a high-value market for RLHF data-labeling service providers.
3. Domestic substitution accelerating: GLM-5's release reduces Chinese enterprises' dependence on overseas models, but also means Chinese data demand will primarily be consumed by domestic models. Data service companies need to strengthen partnerships with domestic model vendors.
4. Evaluation dataset demand: Rapid capability gains cause existing benchmarks to saturate quickly, creating new demand for saturation-resistant, high-difficulty evaluation datasets.

Demand Signals

Infer training data demands from model releases

Data Type | Intensity | Trend | Related Signals
Code Agent Trajectory Data | Very Strong | → Continuing | Allen AI releases 136K Sera trajectories; Meta JiTTesting blog hints at large-scale code-agent deployment; GitHub repos openai/codex, anthropics/skills, etc. extremely active
Robotics Demonstration Data | Very Strong | → Continuing | NVIDIA 600-hour kitchen task data; Allen AI molmospaces embodied-AI ecosystem; Boston Dynamics CEO transition hints at commercialization acceleration; Datatang partners with Lingxinqiaoshou on embodied AI
Multimodal Video Data | Strong | → Continuing | Allen AI Molmo2 series of 6 video datasets; Meta EgoAVU first-person audio-video; 11 multimodal datasets, 30.6% of the weekly total
RLHF/Preference Data | Strong | ↑ New | GLM-5 744B and other ultra-large models drive alignment demand; 6 RLHF-related papers; Reddit discusses RLHF safety training
Synthetic Data | Strong | → Continuing | 8 synthetic datasets; Allen AI publishes the SVG method; NVIDIA Data Designer integrates Hugging Face; Argilla, distilabel, and other synthetic-data tools active on GitHub
Math Reasoning Data | Medium | ↑ New | NVIDIA Nemotron-Math-v2 long-context math data; Stepfun CF-Div2 competitive-programming data; Gemini Deep Think focuses on math/scientific discovery
Evaluation Benchmark Data | Medium | → Continuing | OpenAI gdpval economic-value evaluation; Stepfun GEBench GUI-generation evaluation; 3 evaluation datasets; Stanford HAI symposium discusses "better AI testing"
Multilingual Speech Data | Medium | → Continuing | Google WaxalNLP African languages; Datatang Dolphin 40 languages + 22 dialects; Microsoft Paza low-resource speech benchmark; NVIDIA Numb3rs TN/ITN speech data
3D Scene/Asset Data | Medium | → Continuing | NVIDIA SAGE-10k indoor scenes; Meta ShapeR 3D reconstruction; Project Genie interactive world generation; Allen AI molmospaces scene library
Long-Context Data | Medium | ↑ New | NVIDIA Nemotron-Math-v2 long context; Together AI cache-aware inference optimization hints at long-context application growth; paper "When to Memorize and When to Stop" discusses long-context reasoning
RLHF/Safety Alignment Data | - | ↓ Dropped | Present in previous issue, absent this issue
Scientific Reasoning Data | - | ↓ Dropped | Present in previous issue, absent this issue
GUI/Agent Interaction Data | - | ↓ Dropped | Present in previous issue, absent this issue

Download Movers

Datasets with the largest download changes this week

Dataset | Downloads | Weekly Growth
stepfun-ai/GEBench | 225 | +2712.5%
nvidia/earth2studio-assets | 417 | +2352.9%
microsoft/VITRA-TeleData | 650 | +1020.7%
google/WaxalNLP | 8,203 | +9.9%
openai/gdpval | 29,190 | +2.9%

Deep Dive — DataRecipe

This week's 3 high-value datasets reverse-analyzed (auto-generated by DataRecipe)

Qwen/RationaleRM
300 samples · 14 fields · Hard
6.0/10
🟢 Recommended to Replicate

Data Structure

domain · language · context · response1 · response2 · overall_preference · individual_preference · human-checklist · model-low_deceptive_alignment-checklist

Risk Assessment

Medium Risk Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds
Low Risk Data may become outdated over time → Establish continuous update mechanisms
microsoft/CancerGUIDE
165 samples · 3 fields · Hard
6.0/10
🟢 Recommended to Replicate

Data Structure

patient_id · patient_note · label

Risk Assessment

Medium Risk Requires domain experts; talent acquisition may be challenging → Build talent pipeline in advance, or consider outsourcing partnerships
Medium Risk Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds
Low Risk Data may become outdated over time → Establish continuous update mechanisms
amazon/doc_split
300 samples · 3 fields · Hard
6.0/10
🟢 Recommended to Replicate

Data Structure

doc_id · total_pages · subdocuments

Risk Assessment

Medium Risk Requires domain experts; talent acquisition may be challenging → Build talent pipeline in advance, or consider outsourcing partnerships
Medium Risk Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds
Low Risk Data may become outdated over time → Establish continuous update mechanisms

3 datasets analyzed this week · 83.9% human labor share · All Hard difficulty

Want to discuss this issue?

Kai · Founder & CEO
苏文 (Su Wen) · AI Documentation & Release Engineer
陆明哲 (Lu Mingzhe) · AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly

AI Dataset Radar →