Details: Allen AI released six Sera-series datasets (Sera-4.5A-Django-T1/T2, Sera-4.5A-Sympy-T1/T2, Sera-4.5A-Sphinx-T1/T2) on February 10-11, 2026, covering three major open-source projects (Django, SymPy, and Sphinx) with over 136K code-modification trajectories in total. The datasets were generated with GLM-4.5-Air as the teacher model using the SVG (Synthetic Verification-Guided) method, and contain complete function-level code-modification trajectories, patches, and verification results. Quality control relies on a two-round verification mechanism: round one (T1) uses unconstrained recall, while round two (T2) fixes recall at 0.5.
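As a rough sketch of how such a release might be consumed, the snippet below loads one Sera split with the Hugging Face datasets library and keeps only trajectories whose verification passed. The repository id and the field names ("trajectory", "patch", "verified") are assumptions for illustration, not the documented schema of the release.

```python
# Hypothetical sketch: load one Sera split and keep only verified trajectories.
# The repo id and the field names below are assumptions, not the published schema.
from datasets import load_dataset

ds = load_dataset("allenai/Sera-4.5A-Django-T1", split="train")  # repo id assumed

# Keep examples whose automated verification passed (field name assumed).
verified = ds.filter(lambda ex: ex.get("verified", False))

print(f"{len(verified)}/{len(ds)} trajectories passed verification")
print(verified[0].keys())  # e.g. trajectory, patch, verification result (assumed)
```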
Business implications: 1. Labeling paradigm innovation: The SVG method breaks through traditional human labeling bottlenecks by using automated verification to ensure code modification correctness, providing a replicable technical path for large-scale code agent training data production. 2. Open-source competition intensifying: Allen AI freely releasing 136K high-quality code trajectories directly impacts the commercial code data service market. Data service companies need to establish differentiated advantages in data scale, domain coverage, or labeling quality. 3. Synthetic data mainstreaming: The successful use of GLM-4.5-Air (not a top-tier model) to generate training data validates the "mid-tier model + verification mechanism" synthetic data approach, lowering the cost threshold for data production. 4. Vertical domain opportunity: Sera datasets focus on three specific open-source projects, suggesting enterprise code agent training requires large amounts of fine-tuning data for specific codebases — a commercial opportunity for "customized enterprise code datasets."
Details: NVIDIA released PhysicalAI-Robotics-Kitchen-Sim-Demos on February 10, 2026, containing 600 hours of human teleoperation demonstrations across 316 tasks, totaling 55K trajectories. The data were collected with a Franka Panda arm on an Omron mobile base, follow the LeRobot format, and provide complete action, state, and sensor data. NVIDIA also released the SAGE-10k dataset (December 31, 2025), with 10K interactive indoor scenes covering 50 room types.
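To make "complete action, state, and sensor data" concrete, here is a minimal sketch of what a LeRobot-style per-frame record typically looks like in Python. The feature keys follow common LeRobot conventions (observation.state, observation.images.*, action) but are assumptions here, not the documented schema of this particular release.

```python
# Minimal sketch of a single LeRobot-style frame. Field names follow common
# LeRobot conventions and are assumptions, not this dataset's documented schema.
import numpy as np

frame = {
    "observation.state": np.zeros(7, dtype=np.float32),                 # proprioceptive joint state
    "observation.images.wrist_cam": np.zeros((480, 640, 3), np.uint8),  # one camera view
    "action": np.zeros(7, dtype=np.float32),                            # commanded action
    "timestamp": 3.27,                                                   # seconds into the episode
    "episode_index": 41,                                                 # which of the ~55K trajectories
    "task_index": 7,                                                     # which of the 316 tasks
}

# A behavior-cloning data loader would typically yield (observation, action) pairs:
obs = {k: v for k, v in frame.items() if k.startswith("observation.")}
act = frame["action"]
print(sorted(obs.keys()), act.shape)
```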
Business implications: 1. Embodied AI data bar raised: The release of 600 hours of real robotics manipulation data significantly raises the "baseline" for robotics datasets. Commercial data suppliers still at "a few hundred trajectories" scale will rapidly lose competitiveness. 2. Hardware-data binding trend: NVIDIA is building a "hardware-data-algorithm" closed-loop ecosystem by providing standardized hardware solutions (Franka Panda + Omron) with matching datasets. Data service companies need to consider partnerships with mainstream robotics hardware manufacturers. 3. Scene standardization demand: SAGE-10k's 50 room types indicate that robotics training requires large-scale diverse scene data, creating a "3D scene generation + robotics action labeling" service opportunity. 4. Format standardization trend: LeRobot format is becoming the de facto standard for robotics datasets. Data service companies must ensure output data is compatible with this format.
Details: Meta released facebook/EgoAVU_data on January 9, 2026, focused on joint audio-video understanding from a first-person (egocentric) perspective. The dataset was built with a scalable automated data engine and contains QA pairs plus multimodal audio and video annotations, designed for training AI models that understand everyday human activities.
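For orientation only, the sketch below shows one plausible shape for an egocentric audio-video QA record of this kind; the field names and the idea that QA pairs reference time-aligned clips are illustrative assumptions, not Meta's published schema.

```python
# Hypothetical shape of an egocentric audio-video QA record. Field names are
# illustrative assumptions, not the published EgoAVU_data schema.
record = {
    "video_path": "clips/kitchen_0132.mp4",   # first-person video clip
    "audio_path": "clips/kitchen_0132.wav",   # synchronized audio track
    "start_s": 12.0,                          # clip boundaries within the recording
    "end_s": 18.5,
    "question": "What is the person doing with their left hand?",
    "answer": "Stirring a pot on the stove.",
}

# A training pipeline would typically decode the clip, extract audio features,
# and pair them with the tokenized question/answer for multimodal supervision.
print(record["question"], "->", record["answer"])
```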
Business implications: 1. Emerging data type: First-person audio-video data is a critical training resource for AR/VR and embodied AI, but market supply is scarce, making it a new and relatively uncontested track for data service companies. 2. Collection device opportunity: First-person data requires specialized wearable devices (such as Meta's smart glasses) to collect. Data service companies can partner with hardware manufacturers to build collection infrastructure. 3. Automated data engine: Meta's emphasis on a "scalable automated data engine" implies that large-scale data production must rely on automated toolchains; the efficiency disadvantage of traditional human labeling will be further amplified. 4. Scenario diversity demand: Understanding daily activities requires coverage of many life scenarios (cooking, repairs, socializing, etc.), providing new business directions for crowdsourced labeling platforms.
Details: The DataChef paper, published on February 11, 2026, proposes using reinforcement learning to optimize LLM training data recipes: an RL algorithm automatically searches for the optimal mixing ratios across different data sources, significantly improving model performance. A second paper published the same day, "Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning," found that repeating high-quality data is more effective than simply scaling data volume in long chain-of-thought fine-tuning.
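The core idea, stripped of the paper's specifics, is to treat the mixture weights over data sources as the object being optimized against a reward signal. The toy sketch below is not the DataChef algorithm: the source names, the proxy reward, and the random-search loop are invented stand-ins for "train on a candidate mixture, evaluate it, and update the recipe."

```python
# Toy sketch of recipe search (NOT the DataChef algorithm): propose candidate
# mixing ratios over data sources and keep the one with the best proxy reward.
import numpy as np

rng = np.random.default_rng(0)
sources = ["web", "code", "math", "dialogue"]  # invented source names

def proxy_reward(weights: np.ndarray) -> float:
    """Stand-in for 'train a small model on this mixture and evaluate it'."""
    target = np.array([0.4, 0.3, 0.2, 0.1])   # pretend optimum, invented
    return -float(np.sum((weights - target) ** 2)) + rng.normal(0, 0.01)

best_w, best_r = None, -np.inf
for _ in range(200):
    w = rng.dirichlet(np.ones(len(sources)))  # candidate mixing ratio (sums to 1)
    r = proxy_reward(w)
    if r > best_r:
        best_w, best_r = w, r

print({s: round(float(p), 2) for s, p in zip(sources, best_w)})
```

In the paper's setting, the reward would come from actual training-and-evaluation runs and the search would be driven by an RL policy rather than random sampling; the sketch only illustrates the shape of the problem.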
Business implications: 1. Data quality assessment demand: The DataChef method presupposes accurate evaluation of the quality and characteristics of each data source, creating new market demand for data evaluation services and quality-scoring tools. 2. Small-scale premium data approach: The conclusion that data repetition beats data scaling points small and mid-sized data service companies in the right direction: rather than pursuing massive volumes of low-quality data, focus on producing small, high-quality, reusable premium datasets. 3. Data recipe consulting services: Enterprise clients need professional help determining the optimal data mix for their specific tasks; data service companies can offer "data recipe optimization consulting" beyond just selling raw data. 4. Synthetic data granularity control: These studies suggest future data production needs finer-grained control (difficulty distribution, style consistency) rather than simple volume scaling.
Details: Zhipu AI (Z.ai) released GLM-5 in February 2026 with 744B parameters, making it China's largest open-source LLM to date. Discussion of GLM-5 on Twitter/X and other social media surged, with multiple tech communities sharing technical details.
Business implications: 1. Chinese data demand explosion: Pre-training a model at this scale requires tens of terabytes of high-quality Chinese data, creating enormous commercial opportunities for Chinese data suppliers, especially in vertical-domain, high-quality conversation, code, and multimodal Chinese data. 2. RLHF/alignment data gap: Alignment difficulty grows steeply with model scale, requiring massive amounts of high-quality preference data and red-team testing data, a high-value market for RLHF data labeling providers. 3. Domestic substitution accelerating: GLM-5's release reduces Chinese enterprises' dependence on overseas models, but it also means Chinese-language data demand will primarily be consumed by domestic models; data service companies need to strengthen partnerships with domestic model vendors. 4. Evaluation dataset demand: Rapid improvements in model capability cause existing benchmarks to saturate quickly, creating new demand for saturation-resistant, high-difficulty evaluation datasets.