Radar Brief Week 13, 2026 · 2026-02-19 — 2026-02-26

Qwen 3.5 Full-Size Coverage
Safety Adversarial Data Demand Emerges

This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

One-line Summary

Qwen 3.5 family ships 3 models on 2/24, Chinese open-source VLM enters full-size coverage phase [P0]; Anthropic RSP v3.0 + distillation attack detection + claude-code-security [P0]; NVIDIA Nemotron-Terminal-Corpus opens new terminal Agent SFT dataset category (2/19) [P1]. Top data demand signal this week: Multimodal Visual Reasoning Data.

Key Findings

This week's five high-commercial-value findings

P0 Qwen 3.5 Family Ships 3 Models on 2/24: Chinese Open-Source VLM Enters Full-Size Coverage Phase

Following the Qwen3.5-397B-A17B flagship (2/16, 483K downloads, 1,052 likes) and its FP8 quantized version (2/18, 93K downloads), Alibaba released three mid-to-small variants on 2/24: Qwen3.5-35B-A3B (21K downloads, 365 likes, Azure deployment supported), Qwen3.5-27B (6,875 downloads, 254 likes), and Qwen3.5-122B-A10B (3,320 downloads, 225 likes). All three new models feature an image-text-to-text multimodal architecture, with 35B-A3B and 122B-A10B using MoE sparse activation and 27B being dense. Reddit community reaction was extremely enthusiastic: the r/LocalLLaMA posts "Qwen3-30B-A3B vs Qwen3.5-35B-A3B on RTX 5090" (135 votes), "Qwen 3.5 craters on hard coding tasks" (128 votes), "Qwen 3.5 benchmark comparison" (89 votes), and "Vision language benchmarks of qwen3.5" (40 votes) appeared in rapid succession. Meanwhile, the Qwen blog kept up its ecosystem output: Qwen3Guard real-time safety filtering, Qwen-Image-Edit image editing, Qwen-Image native text rendering, GSPO scalable RL training, and Qwen-MT multilingual translation.

Business implications: Compared to last week's single flagship 397B, Qwen 3.5 now spans 5 sizes (27B/35B-A3B/122B-A10B/397B-A17B/397B-FP8), covering from consumer GPUs (RTX 5090 can run 35B-A3B) to data center scenarios. This means visual reasoning training data demand shifts from "preparing data for one large model" to "preparing data of varying complexity for an entire product line" — small models need refined, high-SNR data, large models need complex reasoning chain data, MoE models may need specialized domain routing data. For Knowlyr, the commercial value of Chinese visual reasoning data has upgraded from last week's "clear demand" to "highly certain scaled demand."
P0 Anthropic Safety Infrastructure Triple Release: RSP v3.0 + Distillation Attack Detection + claude-code-security

Anthropic released three safety-related results this week: (1) Responsible Scaling Policy v3.0, an updated responsible scaling framework defining stricter model deployment safety thresholds; (2) Detecting and preventing distillation attacks, proposing methods for detecting model distillation attacks, with Nathan Lambert providing an in-depth analysis on the Interconnects blog ("How much does distillation really matter for Chinese LLMs?") and the Zhiyuan community reposting it as "One Anthropic blog post, IBM drops 13%"; (3) claude-code-security-review (GitHub, 3,318 stars), an AI-powered code security review GitHub Action. Anthropic also released Persona Selection Model interpretability research (simultaneously published on the Alignment Forum) and an AI Fluency Index education report, and updated its research team page to showcase the Economic Research, Interpretability, and Societal Impacts directions.

Business implications: Anthropic's distillation attack detection directly names the distillation issue in Chinese LLMs, triggering industry shockwaves (IBM down 13%). This signals two data needs: (1) Adversarial safety data — model distillation detection requires large-scale "original output vs distilled output" paired datasets for training detectors; (2) Safety audit data — RSP v3.0's stricter thresholds mean more models must pass safety evaluations before deployment, making safety evaluation datasets essential. For Knowlyr, the irreplaceability of "human judgment" in safety evaluation is further strengthened — whether a model can be safely deployed ultimately requires expert human judgment.
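The paired-dataset idea behind point (1) can be sketched with a toy detector: given matched prompts, compare a candidate model's outputs against a reference model's outputs and flag suspiciously high n-gram overlap. This is purely an illustrative assumption, not Anthropic's published method; every function name and threshold below is hypothetical.

```python
# Toy sketch of distillation screening over "original output vs distilled
# output" pairs. Real detectors would use far richer signals; this only
# illustrates the paired-data shape. All thresholds are assumptions.

def ngram_set(text: str, n: int = 3) -> set:
    """Collect word n-grams from a response."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(reference: str, candidate: str, n: int = 3) -> float:
    """Jaccard overlap of word n-grams between two responses."""
    a, b = ngram_set(reference, n), ngram_set(candidate, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_pairs(pairs: list, threshold: float = 0.3) -> list:
    """Mark (reference, candidate) pairs whose overlap exceeds a threshold."""
    return [overlap_score(ref, cand) >= threshold for ref, cand in pairs]
```

In practice the dataset work is exactly what the paragraph above describes: collecting large numbers of such prompt-matched response pairs, labeled by whether the candidate was actually trained on the reference model's outputs, so a learned detector can replace the crude overlap heuristic.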
P1 NVIDIA Nemotron-Terminal-Corpus Opens New Terminal Agent SFT Dataset Category (2/19)

NVIDIA released Nemotron-Terminal-Corpus (2/19, cc-by-4.0) and the companion Nemotron-Terminal-Synthetic-Tasks. The former is a large-scale terminal-interaction SFT dataset, synthesized through the Terminal-Task-Gen pipeline, designed to train LLMs in Linux terminal operation; the latter provides skill-based synthetic task structures for evaluating and training autonomous terminal Agents. A companion paper, "On Data Engineering for Scaling LLM Terminal Capabilities" (authors Renjie Pi, Grace Lam, M. Shoeybi), was published simultaneously. Concurrent NVIDIA ecosystem signals: the Nemotron-3-Nano-30B-A3B series is maintaining high downloads (BF16 version 852K, FP8 version 1.159M), alongside the nemotron-colembed-vl-4b-v2 visual document retrieval model (54K downloads) and the Isaac-GR00T robotics foundation model (6,248 stars).

Business implications: Terminal-Corpus marks Agent SFT data expansion from Web/API scenarios to system administration — a critical scenario for enterprise AI Agent deployment (ops automation, DevOps). The cc-by-4.0 license enables commercial use, but synthetic data limitations (Terminal-Task-Gen generated) mean human-annotated data from real operational scenarios still holds value. Combined with Reddit hot post "Your coding agent sessions are sitting on your machine right now" (46 votes), the industry is recognizing the value of Agent behavioral trajectory data — whoever can systematically collect real terminal operation data builds a differentiated moat.
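To make the "terminal interaction SFT" category concrete, here is a hypothetical record shape for one sample: a task description plus the shell trajectory that solves it. The field names are assumptions for illustration, not NVIDIA's actual schema.

```python
# Hypothetical terminal-interaction SFT record: one task paired with the
# command/output trajectory that accomplishes it. Field names are assumed.
import json

def make_terminal_record(task, steps):
    """Bundle a task description and its shell trajectory into one SFT record."""
    return {
        "task": task,
        "trajectory": [
            {"command": cmd, "stdout": out, "exit_code": code}
            for cmd, out, code in steps
        ],
        "num_steps": len(steps),
    }

record = make_terminal_record(
    "Find the largest log file under /var/log",
    [
        ("du -a /var/log 2>/dev/null | sort -nr | head -1",
         "10240 /var/log/syslog", 0),
    ],
)
print(json.dumps(record, indent=2))
```

The same shape works for human-collected data: capturing real operator sessions (commands, outputs, exit codes) yields exactly the trajectory records that synthetic pipelines like Terminal-Task-Gen approximate, which is where the differentiated-moat argument above comes from.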
P1 Cerebras REAP Compresses Step 3.5 Flash + InternLM Spatial Self-Supervision: Efficient Inference and Spatial Understanding on Dual Tracks (2/25)

Cerebras released Step-3.5-Flash-REAP-121B-A11B and Step-3.5-Flash-REAP-149B-A11B on 2/25, continuing its "large model slimming" technical approach (last week it compressed MiniMax-M2.5). The same day, InternLM released Spatial-SSRL-3B, a multimodal model specializing in spatial understanding and self-supervised learning. On Reddit, Unsloth's Q3 quantization benchmarks surpassing Q4 and MXFP4 (63 votes) and a discussion of Mercury 2 diffusion-model inference speed (16 votes) show that model compression and efficient inference remain active topics.

Business implications: With Cerebras applying REAP compression to a different vendor's model two weeks in a row (MiniMax last week, Step this week), REAP is emerging as a de facto "model compression as a service" standard, and quality evaluation of compressed models requires systematic comparative datasets. Spatial-SSRL-3B opens a new spatial self-supervision direction, echoing last week's MolmoSpaces embodied AI spatial resources: spatial understanding data demand is expanding from the visual domain into self-supervised pre-training. Quantization evaluation data ("how much quality is lost after compression") is evolving from a niche need to an essential one.
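In its simplest form, the "how much quality is lost after compression" question reduces to running the same eval prompts through the original and the compressed model and measuring agreement. A minimal sketch, with model calls stubbed out as answer lists (real pipelines would call inference endpoints and use graded metrics, not just exact match):

```python
# Minimal compressed-vs-original quality check: exact-match agreement on a
# shared eval set. Answer lists stand in for real model inference calls.

def agreement_rate(original_answers, compressed_answers):
    """Fraction of prompts where the compressed model matches the original."""
    assert len(original_answers) == len(compressed_answers), "eval sets must align"
    matches = sum(a == b for a, b in zip(original_answers, compressed_answers))
    return matches / len(original_answers)
```

The comparative datasets mentioned above are the fixed eval sets this loop runs over; their value comes from being held constant across every compressed variant so regressions are attributable to the compression step.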
P2 Apple ML Intensive Research Output + Embodied AI Capital Heats Up: Wayve $1.2B + AI2 Robotics Series B

Apple Machine Learning published 5 research papers in one week: (1) CoT reasoning dynamic analysis — revealing trace dynamics of chain-of-thought reasoning; (2) Speech understanding gap — LLM performance on speech input far below text, pointing to directions for bridging the gap; (3) HTML text extraction — re-examining HTML-to-Text extraction methods for LLM pre-training, discovering limitations of existing approaches; (4) AMUSE — audio-visual multi-speaker understanding benchmark and alignment framework, highlighting shortcomings of multimodal models (GPT-4o, Qwen3-Omni) in multi-speaker conversation scenarios; (5) depyf — PyTorch compiler debugging tool. In embodied AI, Wayve raised $1.2 billion Series D (planning supervised autonomous driving robotaxi trial operations in London in 2026, entering consumer market in 2027), AI2 Robotics completed Series B (valued over $1 billion, developing AlphaBot VLA model for semi-humanoid robots).

Business implications: Apple's 5 papers signal data needs: speech-text alignment data (bridging the speech understanding gap), multi-speaker conversation annotation data (gaps identified by AMUSE), high-quality HTML-to-Text training data (pre-training infrastructure improvement). The funding scale of Wayve and AI2 Robotics confirms embodied AI has entered the industrialization phase — $1.2 billion is for building products, not research. Implication for the data industry: high-quality annotation data demand for autonomous driving and robotic manipulation scenarios will scale from research-grade to product-grade volumes.

Demand Signals

Infer training data demands from model releases

Data Type | Intensity | Trend | Related Signals
Multimodal Visual Reasoning Data | Extreme | → Continuing | Qwen 3.5 family expands to 5 sizes (full coverage), GLM-4.6V open-sourced, InternLM Spatial-SSRL-3B spatial understanding
Safety Adversarial / Evaluation Data | Extreme | ↑ New | Anthropic RSP v3.0 + distillation attack detection + claude-code-security, CAMEL confidence-gated reward modeling, IR3 reward hacking detection
RLHF / Preference Alignment Data | Extreme | → Continuing | MARS self-refinement (ongoing), CAMEL confidence-gated reflection, IR3 contrastive inverse RL reward hacking detection, gradient regularization against reward hacking
Agent Terminal / Tool Data | Strong | ↑ New | NVIDIA Nemotron-Terminal-Corpus SFT, MagicAgent general Agent planning, Reddit "coding agent sessions" discussion
Coding / Code Reasoning Data | Strong | → Continuing | Devstral-2-123B (15K downloads), Devstral-Small-2-24B (416K downloads), Reddit "Qwen 3.5 craters on hard coding tasks"
Model Compression Evaluation Data | Strong | ↑ New | Cerebras REAP Step 3.5 Flash (two compressed versions), Unsloth Q3 beats Q4/MXFP4 (Reddit 63 votes), Mercury 2 diffusion model inference speed
Spatial Understanding / Embodied AI Data | Strong | ↑ New | InternLM Spatial-SSRL-3B, Wayve $1.2B funding, AI2 Robotics Series B, GEBench GUI interaction evaluation
Speech / Multi-Speaker Understanding Data | Medium | ↑ New | Apple "Closing the Gap Between Text and Speech", AMUSE multi-speaker benchmark, TinyTTS 9M parameter TTS (Reddit 21 votes)
Synthetic Data Quality Evaluation | Medium | ↑ New | "When Pretty Isn't Useful" (synthetic image training degradation study), ReSyn autonomous synthetic environment scaling
Multilingual Data | Medium | → Continuing | WaxalNLP African languages (9,883 downloads), BURMESE-SAN Myanmar benchmark, Qwen-MT multilingual translation
Agent Behavioral / Trajectory Data | n/a | ↓ Dropped | Present in previous issue, absent this issue
Complex Reasoning Evaluation Data | n/a | ↓ Dropped | Present in previous issue, absent this issue
Robotics / Embodied AI Data | n/a | ↓ Dropped | Present in previous issue, absent this issue
Document OCR Data | n/a | ↓ Dropped | Present in previous issue, absent this issue
Quantization / Compression Evaluation Data | n/a | ↓ Dropped | Present in previous issue, absent this issue
Safety / Alignment Audit Data | n/a | ↓ Dropped | Present in previous issue, absent this issue

Download Movers

Datasets with the largest download changes this week

Dataset | Downloads | Weekly Change
allenai/olmix | 272 | +14.3%
google/WaxalNLP | 9,883 | -6.6%
nvidia/HiLiftAeroML | 582 | -14.9%

Want to discuss this issue?

Kai · Founder & CEO
苏文 (Su Wen) · AI Documentation & Release Engineer
陆明哲 (Lu Mingzhe) · AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly

AI Dataset Radar →