W14 AI Data Intelligence

One-line Summary

Allen AI ships 29 datasets in one week as video multimodal data enters systematic supply [P0]; a core Qwen departure coincides with the small-model rollout, so talent turbulence clashes with commercial expansion [P0]; OpenAI's commercial expansion and safety controversies escalate in parallel [P1]. Top data demand signal this week: Video Understanding / Tracking Data.

Key Findings

This week's 5 findings with high commercial value

Allen AI Molmo2 Video Understanding Dataset Cluster Erupts: 29 Datasets in One Week, Video Multimodal Data Enters Systematic Supply [P0]

Allen AI released 29 datasets under the Molmo2 brand this week, nearly all covering the video understanding task pipeline: molmo2-single-object-track (single object tracking, 2/24), molmo2-reasonvos (reasoning video object segmentation, 2/27), molmo2-burst (burst detection, 2/23), molmo2-mevis/mevis-valid (motion expression video segmentation), molmo2-ref-davis17/ref-yt-vos (reference-guided tracking), molmo2-revos/vicas/moca/lv-vis (multi-scenario video object segmentation), molmo2-hardcodes (hard-coded samples, 2/25), molmo2-academic-video-points (academic video tracking point labeling, 2/17), Molmo2-VideoPoint (video localization data, 360 downloads), and Molmo2-VideoLocalizedNarratives/CaptionHf/VideoMME/TGIF/TVQA/NewsVideoQA (video narrative and QA series).

Allen AI also released Dolci-Think-SFT-32B (reasoning SFT data, 1,464 downloads), Dolci-Instruct-SFT-Tool-Use-SA (tool use SFT data), code_fresh_0825_1225 (25M tokens of code data, 42 languages), SimpleToM (theory of mind evaluation), and asta-user-interactions (scientific tool user interaction data).

On GitHub, the molmo2 repository (197 stars) and the molmospaces robotics ecosystem (152 stars, +15) continue to grow.

Business implications: This is the largest single-week release of video understanding training data in the past six months. Allen AI is systematically building a complete data pipeline from "video object tracking → video segmentation → video localized narratives → video QA," which means video multimodal data has moved from a scattered, scarce state to industrial-scale supply. For data service companies, Allen AI's open strategy (ODC-BY / Apache-2.0 licenses) lowers pricing expectations in the video data market and also creates new room to differentiate on video data quality: a significant value gap remains between synthetic tracking labels and human-annotated precision labels.

Qwen Core Member Junyang Lin Departs Amid Small Model Rollout: Talent Turbulence Clashes with Commercial Expansion [P0]

The hottest post on Reddit's r/LocalLLaMA this week was "Junyang Lin has left Qwen" (799 votes, 3/3); the departure of a core Qwen R&D member sparked widespread community discussion. Meanwhile, the Qwen 3.5 Small series (0.8B-9B) launched on Product Hunt (3/3). Qwen3.5-35B-A3B downloads surged from 21K last week to 680K, the FP8 version hit 330K, 122B-A10B reached 150K, and 27B-FP8 reached 159K. The Qwen ecosystem continued to expand with Qwen3Guard (real-time safety filtering), Qwen-Image-Edit (image editing), Qwen-MT (multilingual translation), and GSPO (scalable RL training). Reddit posts on Qwen3.5-9B abliterated (108 votes) and Qwen3.5-9B Uncensored (30 votes) show that the community has begun systematically modifying Qwen small models. The Tianchi IEEE AICAS 2026 edge VLM deployment challenge also continued.

Business implications: The impact of the core departure on Qwen's R&D cadence remains to be seen, but the download data shows that the "mass rollout" strategy has landed: 680K downloads for 35B-A3B indicate massive market demand for small MoE vision models. Community-built abliterated/uncensored versions indicate that Qwen small models have entered an "ecosystem self-modification" stage, and demand for customized fine-tuning data will spread from official teams to the community. For the data industry, the explosion of Qwen small models makes "high-SNR visual reasoning data suited for 9B parameter scale" a high-certainty growth category.

OpenAI Strategic Triple Play + GPT-5.3 Instant: Commercial Expansion and Safety Controversies Escalate in Parallel [P1]

OpenAI announced three strategic partnerships this week: a strategic cooperation with Amazon (Frontier platform on AWS), a partnership renewal statement with Microsoft, and a signed Department of Defense contract. GPT-5.3 Instant and its system card were released together (3/3), positioned as "smoother everyday conversation." The DoD contract triggered an intense community reaction: LessWrong's "A Tale of Three Contracts" analyzed in depth how Anthropic was flagged as a supply chain risk, "Mass Surveillance w/ LLMs is the Default Outcome" examined the implications of the DoW contract, and the Reddit post "DoW vs Anthropic saga proves closed-source safety is a fraud" (64 votes) demanded open safety evaluations. Anthropic's response to Defense Secretary Pete Hegseth's statement also drew attention. On GitHub, codex reached 61,868 stars (+670) and openai-agents-python 19,132 stars.

Business implications: OpenAI's government contracts will drive two data demand directions: first, safety red-line evaluation data for government/military scenarios (contracts explicitly define safety red lines); second, AI deployment evaluation data in classified environments. Community calls for open safety evaluations mean independent safety evaluation benchmark data will become essential — both to assess model capabilities and to verify safety commitments. For Knowlyr, the irreplaceability of "human judgment" in safety evaluation is further reinforced by this political contest.

Together AI CoderForge-Preview Sets New Open-Source Coding Agent Dataset SOTA [P1]

Together AI released CoderForge-Preview (2/20, 8,413 downloads, 118 likes), currently the largest open-source, test-verified coding Agent dataset. Fine-tuning Qwen-3 32B on it raised SWE-Bench Verified pass@1 from 23.0% to 59.4%, which ranks first among open-data models and second among open-weight models at ≤32B. A concurrent Reddit post, "Benchmarked 94 LLM endpoints for jan 2026" (54 votes), shows open-source models have closed to within 5 points of closed-source models on quality. Mistral released Devstral 2 and Vibe CLI, strengthening coding Agent toolchains. SWE-rebench V2 (HF Papers) proposed a scalable method for collecting cross-language SWE tasks.
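To put the reported SWE-Bench Verified jump in perspective, the short sketch below separates the absolute gain (percentage points) from the relative gain (multiple of the baseline). The function name `score_gain` is hypothetical and used only for illustration; the two scores are the ones quoted above.

```python
def score_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, multiple of the baseline)."""
    absolute_points = round(after_pct - before_pct, 1)
    relative_multiple = round(after_pct / before_pct, 2)
    return absolute_points, relative_multiple


# pass@1 before and after fine-tuning, as reported in this issue
points, multiple = score_gain(23.0, 59.4)
print(f"+{points} percentage points, {multiple}x the baseline")
```

Reporting both figures avoids ambiguity: a gain of 36.4 percentage points is not a 36.4% relative improvement.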

Business implications: CoderForge-Preview proves open-source coding data can achieve near-closed-source results, which will accelerate decentralized production of coding Agent data. Key differentiation directions: real enterprise codebase Agent behavioral trajectories (rather than synthetic environments), and cross-language SWE task data (the direction of SWE-rebench V2). For data service providers, "real human developer debugging and fixing processes" are more valuable than synthetic code tasks.

Apple 'Intelligence Cannot Be Separated from Judgment' Paper + Google Gemini 3.1 Flash-Lite: Alignment Theory and Efficiency Models Advance on Dual Tracks [P2]

Apple Machine Learning Research published "On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment," which argues from computational complexity theory that AI alignment filtering is theoretically inseparable from intelligence itself; in other words, harmful outputs cannot be filtered perfectly without affecting model intelligence. Apple also released work on Hallucination Span Detection reasoning, EMBridge gesture EMG cross-modal transfer, UI component variant instantiation, and App Store search LLM enhancement. Google released Gemini 3.1 Flash-Lite (the fastest, lowest-cost model in the Gemini 3 series) and the Nano Banana 2 image generation model. The HN post "Open-Source Article 12 Logging for EU AI Act" (35 votes) shows that AI compliance tooling is going open-source.

Business implications: Apple's paper provides rigorous theoretical backing for "human judgment is irreplaceable in AI systems" — if filtering and alignment are computationally inseparable from intelligence, then "having humans make judgments" is not a temporary stopgap but a long-term structural necessity. Gemini 3.1 Flash-Lite and GPT-5.3 Instant both pushing "low-cost efficient inference" means lightweight model evaluation data demand is growing rapidly. The open-sourcing of EU AI Act compliance tools signals that compliance evaluation data will emerge as a new category.

Demand Signals

Training data demand inferred from model releases

Video Understanding / Tracking Data

Critical ↑ New

Allen AI Molmo2 29 video datasets in one week · Full pipeline coverage: video object tracking/segmentation/localization

Multimodal Visual Reasoning Data

Critical → Continuing

Qwen3.5-35B-A3B downloads hit 680K · 122B-A10B 150K · Community abliterating small models · InternLM Spatial-SSRL

Coding Agent Data

Critical ↑ New

CoderForge-Preview SWE-Bench 23%→59.4% · Devstral 2 · SWE-rebench V2 cross-language tasks

Safety Evaluation / Alignment Data

High ↑ New

OpenAI DoD contract safety red lines · Apple 'Intelligence Cannot Be Separated from Judgment' paper · PrivMedChat differential privacy RLHF

RLHF / Preference Alignment Data

High → Continuing

Robometer trajectory contrastive reward model · RubricBench evaluation alignment · GRM breadth-depth synergy

Agent Tool / Planning Data

High ↑ New

Qwen DeepPlanning long-horizon Agent planning · LOGIGEN verifiable Agent task generation · DigiData mobile control

Robotics / Tactile Data

High ↑ New

BAAI ToucHD tactile dataset · NVIDIA NuRec robotics · Arena-GR1 manipulation

Synthetic Data Methodology

Moderate ↑ New

CHIMERA compact synthetic reasoning data · CharacterFlywheel 15-generation iterative production optimization · VisNec visual necessity filtering

EU Compliance Evaluation Data

Moderate ↑ New

HN: Open-source Article 12 logging infrastructure · AI safety review tools going open-source

↓ Dropped (present in previous issue, absent this issue): Safety Adversarial / Evaluation Data · Agent Terminal / Tool Data · Coding / Code Reasoning Data · Model Compression Evaluation Data · Spatial Understanding / Embodied AI Data · Speech / Multi-Speaker Understanding Data · Synthetic Data Quality Evaluation · Multilingual Data

Download Movers

Datasets with the largest download changes this week

Dataset	Downloads	Weekly Growth
nvidia/Nemotron-Terminal-Corpus	744	+18500.0%
nvidia/HiLiftAeroML	1,011	+73.7%
google/WaxalNLP	13,506	+36.7%
allenai/asta-summary-citation-counts	439	+13.7%
microsoft/SYNUR	122	+0.8%
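The growth column is easier to read next to the implied previous-week baseline. The sketch below back-computes that baseline under the assumption that weekly growth is (current - previous) / previous; the source does not state the formula, and the helper name `implied_previous` is hypothetical.

```python
def implied_previous(current: int, growth_pct: float) -> int:
    """Back-compute last week's downloads, assuming
    growth_pct = (current - previous) / previous * 100."""
    return round(current / (1 + growth_pct / 100))


# (downloads, weekly growth in percent), copied from the table above
movers = {
    "nvidia/Nemotron-Terminal-Corpus": (744, 18500.0),
    "nvidia/HiLiftAeroML": (1011, 73.7),
    "google/WaxalNLP": (13506, 36.7),
    "allenai/asta-summary-citation-counts": (439, 13.7),
    "microsoft/SYNUR": (122, 0.8),
}

for name, (downloads, growth) in movers.items():
    previous = implied_previous(downloads, growth)
    print(f"{name}: {previous} -> {downloads} ({growth:+.1f}%)")
```

Under that assumption, the +18500.0% figure for nvidia/Nemotron-Terminal-Corpus corresponds to a rise from about 4 downloads to 744, so very large percentages on a small base are best read alongside absolute counts.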

Deep Dive — DataRecipe

This week's 3 high-value datasets reverse-analyzed (auto-generated by DataRecipe)

togethercomputer/CoderForge-Preview

300 samples · 7 fields · Hard

6.0/10

Data Structure

Risk Assessment

Medium Risk Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds

Low Risk Data may become outdated over time → Establish continuous update mechanisms

allenai/Dolci-Think-SFT-32B

300 samples · 3 fields · Hard

6.0/10

Data Structure

Risk Assessment

Medium Risk Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds

Low Risk Data may become outdated over time → Establish continuous update mechanisms

google/MapTrace

300 samples · 3 fields · Medium

6.5/10

Data Structure

Risk Assessment

Medium Risk Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds

Low Risk Data may become outdated over time → Establish continuous update mechanisms

Analyzed 3 datasets this week · 99.6% human effort

Want to discuss this issue?

Kai Founder & CEO

苏文 AI Documentation & Release Engineer

陆明哲 AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly

AI Dataset Radar →

Video Understanding Data Enters Industrial-Scale Supply
Apple Proves Human Judgment Irreplaceable
