GPT-5.2 Enters Scientific Discovery
Data Recipe Engineering Accelerates
This week scanned 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts
Allen AI releases the Sera code-agent trajectory datasets, advancing the open-source code-agent training ecosystem; NVIDIA releases the PhysicalAI kitchen-robotics demo dataset, open-sourcing 600 hours of real manipulation data; Meta releases EgoAVU, a first-person audio-video understanding dataset, opening a new data track. Top data demand signal this week: Code Agent Trajectory Data.
Key Findings
This week's 5 high-commercial-value findings
Details: Allen AI released six Sera-series datasets (Sera-4.5A-Django-T1/T2, Sera-4.5A-Sympy-T1/T2, Sera-4.5A-Sphinx-T1/T2) on February 10-11, 2026, covering three major open-source projects — Django, Sympy, and Sphinx — with over 136K code-modification trajectories. The datasets were generated with GLM-4.5-Air as the teacher model using the SVG (Synthetic Verification-Guided) method, and contain complete function-level code-modification trajectories, patches, and verification results. Quality control uses a two-round verification mechanism: round one (T1) with unconstrained recall, round two (T2) with recall fixed at 0.5.
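The two-round mechanism can be sketched as a filter followed by a subsample. This is an illustrative reading of the T1/T2 split, not Allen AI's actual pipeline code: T1 keeps every trajectory whose patch passed verification (recall unconstrained), while T2 retains only a fixed fraction of the verified set.

```python
import random

def filter_trajectories(trajectories, target_recall=None, seed=0):
    """Two-round verification filter (illustrative sketch).

    Round 1 (T1): keep every trajectory that passed verification,
    leaving recall over verified positives unconstrained.
    Round 2 (T2): subsample the verified set so only a fixed
    fraction (the target recall, e.g. 0.5) is retained.
    """
    verified = [t for t in trajectories if t["verified"]]
    if target_recall is None:                  # T1: no recall constraint
        return verified
    rng = random.Random(seed)
    k = int(len(verified) * target_recall)     # T2: recall fixed at target
    return rng.sample(verified, k)

# Toy pool: half the candidate trajectories pass verification.
trajs = [{"id": i, "verified": i % 2 == 0} for i in range(100)]
t1 = filter_trajectories(trajs)                     # all 50 verified kept
t2 = filter_trajectories(trajs, target_recall=0.5)  # 25 of 50 retained
```

The field name `verified` and the subsampling strategy are assumptions made for the sketch; the released T1/T2 splits encode the result of this kind of process, not the process itself.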
Details: NVIDIA released PhysicalAI-Robotics-Kitchen-Sim-Demos on February 10, 2026, containing 600 hours of human teleoperation demonstrations across 316 tasks and 55K trajectories. The data was collected with a Franka Panda arm on an Omron mobile base, follows the LeRobot format, and provides complete action, state, and sensor streams. NVIDIA also released the SAGE-10k dataset (2025-12-31) with 10K interactive indoor scenes covering 50 room types.
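A frame in a LeRobot-style dataset is roughly a flat per-timestep record of action, state, and timing fields. The sketch below assumes that layout; the field names mirror the flat `observation.*` / `action` columns commonly used by the format, and NVIDIA's exact schema may differ:

```python
from dataclasses import dataclass, asdict

@dataclass
class Frame:
    """One timestep of a teleoperation episode (illustrative,
    LeRobot-style; not NVIDIA's actual schema)."""
    episode_index: int        # which demonstration this frame belongs to
    frame_index: int          # position within the episode
    timestamp: float          # seconds since episode start
    action: list              # commanded joint targets
    observation_state: list   # measured joint positions

frame = Frame(
    episode_index=0,
    frame_index=12,
    timestamp=12 / 30,            # assuming e.g. 30 FPS capture
    action=[0.1] * 7,             # 7-DoF Franka Panda arm
    observation_state=[0.09] * 7,
)
record = asdict(frame)            # flat dict, ready to serialize
```

In the real format these records are stored columnar (one row per frame, one file per chunk of episodes), which is what makes 55K trajectories cheap to stream during training.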
Details: Meta released facebook/EgoAVU_data on January 9, 2026, focusing on joint audio-video understanding from a first-person perspective. The dataset is built with a scalable automated data engine and contains QA pairs plus multimodal audio and video annotations, designed for training AI models that understand everyday human activities.
Details: The DataChef paper, published on February 11, 2026, proposes using reinforcement learning to optimize LLM training data recipes: an RL search automatically finds mixing ratios across data sources that improve downstream model performance. A second paper published the same day, "Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning," found that repeating high-quality data is more effective than simply scaling data volume in long chain-of-thought fine-tuning.
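The object an RL recipe search would tune is just a vector of per-source mixing weights. A minimal sketch of how such a recipe turns into a training batch, assuming a simple weighted-sampling scheme (the function and source names are illustrative, not from the paper):

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw a training batch according to a data recipe's mixing ratios.

    `sources` maps a source name to its examples; `weights` gives the
    recipe's per-source ratios.  An RL-style recipe search would tune
    these weights against validation performance; here they are fixed.
    """
    rng = random.Random(seed)
    names = list(sources)
    # Pick a source per slot, proportionally to its recipe weight...
    picks = rng.choices(names, weights=[weights[s] for s in names], k=n)
    # ...then draw one example from the chosen source.
    return [rng.choice(sources[name]) for name in picks]

sources = {"web": ["w1", "w2"], "code": ["c1", "c2"], "math": ["m1"]}
recipe = {"web": 0.5, "code": 0.3, "math": 0.2}   # hypothetical ratios
batch = sample_mixture(sources, recipe, n=10)
```

The repetition-vs-scaling finding fits the same frame: with a tiny `math` pool, high-weight sampling necessarily repeats examples, which the second paper suggests can beat adding lower-quality volume.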
Details: Zhipu AI (Z.ai) released GLM-5 in February 2026 with 744B parameters, making it China's largest open-source LLM. Discussion of GLM-5 on X and other social media surged, with multiple tech communities sharing technical details.
Demand Signals
Training data demand inferred from this week's model releases
Download Movers
Datasets with the largest download changes this week
| Dataset | Downloads | Weekly Growth |
|---|---|---|
| stepfun-ai/GEBench | 225 | +2712.5% |
| nvidia/earth2studio-assets | 417 | +2352.9% |
| microsoft/VITRA-TeleData | 650 | +1020.7% |
| google/WaxalNLP | 8,203 | +9.9% |
| openai/gdpval | 29,190 | +2.9% |
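The growth column is a plain week-over-week percentage. As a sanity check, the prior-week count below is back-solved from the table (about 8 downloads for GEBench), not separately reported:

```python
def weekly_growth(prev, curr):
    """Percent change in downloads, week over week."""
    return (curr - prev) / prev * 100

# stepfun-ai/GEBench: ~8 downloads last week -> 225 this week
growth = weekly_growth(8, 225)  # 2712.5, matching the table's +2712.5%
```

The pattern in the table is typical: the biggest percentage movers are small, newly noticed datasets, while established ones (gdpval at ~29K downloads) grow single-digit percentages.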
Deep Dive — DataRecipe
This week's 3 high-value datasets, reverse-engineered (auto-generated by DataRecipe)
Data Structure
Risk Assessment
3 datasets analyzed this week · 83.9% human labor share · All Hard difficulty
Want to discuss this issue?
Auto-generated by AI Dataset Radar · Updated weekly