Multimodal Alignment Data Arms Race
Allen AI Defines Pre-training Data Methodology
This week we scanned 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts
Allen AI releases 5 datasets plus the Olmix data-mixing framework, systematically defining pre-training data methodology; Meta open-sources a 200K+ multilingual, multi-turn preference dataset, upgrading the public supply of RLHF data; RLHF/alignment research enters its 4th consecutive week of high-density output, with methodology moving toward personalization and decoupling. Top data-demand signal this week: multimodal visual reasoning data.
Key Findings
This week's 5 findings with high commercial value
Allen AI released 5 datasets and 8 models this week, the highest single-week output among research institutions. Key highlights:

- allenai/olmix (2026-02-11, 238 downloads, 18 likes): proxy-run swarm data for OLMo pre-training, systematically addressing the core pre-training question of what ratio of domain data to mix for optimal results.
- allenai/Dolci-Instruct-DPO (2,498 downloads): 260K preference pairs for OLMo 3 Instruct 7B alignment training, ODC-BY license.
- allenai/olmOCR-bench (2,745 downloads, 58 likes): 1,403 PDFs + 7,010 unit tests, establishing an evaluation benchmark for PDF-to-Markdown OCR systems.
- allenai/Molmo2-MultiImageQA (194 downloads): multi-image visual QA instruction fine-tuning dataset.
- allenai/molmospaces (204 downloads, +39.7% week-over-week): embodied-AI 3DGUT/USD resources updated to an Isaac Sim-compatible format.

Companion blog posts published alongside the releases: Olmix data-mixing framework details, AutoDiscovery automated scientific discovery, a MolmoSpaces ecosystem introduction, and How2Everything real-world procedure evaluation.
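The "what ratio to mix" question that Olmix targets can be made concrete with a toy sampler. This is an illustrative sketch only, not Olmix's actual API; the domain names, weights, and document lists below are invented.

```python
import random

# Toy per-domain corpora (invented; not OLMo's actual pre-training domains).
domain_docs = {
    "web":  ["web_doc_%d" % i for i in range(100)],
    "code": ["code_doc_%d" % i for i in range(100)],
    "math": ["math_doc_%d" % i for i in range(100)],
}

def sample_mixture(domain_docs, weights, n, seed=0):
    """Draw n training documents, choosing a domain for each draw
    in proportion to the given mixing weights."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]  # random.choices normalizes these
    return [rng.choice(domain_docs[d])
            for d in rng.choices(domains, weights=probs, k=n)]

batch = sample_mixture(domain_docs, {"web": 0.6, "code": 0.3, "math": 0.1}, n=1000)
```

A framework in the Olmix mold would then evaluate many such candidate ratios via cheap proxy runs and keep the mixture that minimizes downstream loss.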
facebook/community-alignment-dataset (194 downloads, 39 likes, cc-by-4.0): 200K+ LLM response comparisons from 3,000+ global annotators, covering multilingual and multi-turn conversation scenarios. This is Meta's largest multilingual preference dataset to date. Meta also released facebook/actionbench (2026-02-19, 2 downloads): 128 video-animation point-cloud paired samples for evaluating video-to-animated 3D mesh generation. Together, the two datasets stake out Meta's positioning on both the "text alignment" and "video-3D multimodal" data fronts.
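What a multilingual, multi-turn preference record looks like can be sketched as a plain data structure. The field names below are hypothetical; the actual schema of facebook/community-alignment-dataset may differ.

```python
# Hypothetical record shape for a multilingual, multi-turn preference dataset.
# Field names are illustrative, not the dataset's actual schema.
record = {
    "language": "es",
    "turns": [
        {"role": "user", "content": "¿Qué es el aprendizaje por refuerzo?"},
    ],
    "response_a": "El aprendizaje por refuerzo es...",
    "response_b": "Es un tipo de optimización...",
    "preference": "a",  # the annotator's choice between the two responses
}

def is_multiturn(rec):
    """True if the conversation context contains more than one prior turn."""
    return len(rec["turns"]) > 1
```

Filtering on fields like `language` and `is_multiturn` is how a consumer would slice such a corpus into the multilingual and multi-turn subsets the release highlights.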
5 RLHF/alignment papers this week:

- MARS (2026-02-19): margin-aware reward modeling plus self-refining data augmentation, addressing the high cost of preference data.
- Learning Personalized Agents from Human Feedback (2026-02-18): introduces the PersonaliZe framework for agents that adapt to shifting personal preferences.
- Multi-Objective Alignment for Personalized Psychotherapy (2026-02-17): multi-objective alignment for psychotherapy, balancing patient preferences with clinical safety.
- Interactionless IRL (2026-02-16): proposes "interaction-free inverse reinforcement learning," decoupling safety objectives from the policy to avoid "alignment waste."
- Latency-aware HITL-RL (2026-02-17): embeds human feedback and latency constraints in semantic communication.

The common trend across all five: moving from one-size-fits-all alignment toward personalized, decoupled, multi-objective, scenario-specific alignment.
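The margin-aware reward modeling idea can be illustrated with the standard Bradley-Terry preference loss extended by an explicit margin. This is a generic sketch of the technique, not MARS's exact objective.

```python
import math

def margin_bt_loss(r_chosen, r_rejected, margin=0.5):
    """Bradley-Terry preference loss with a margin: the chosen response's
    reward must exceed the rejected one's by at least `margin` before the
    loss approaches zero."""
    z = r_chosen - r_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-z)))  # -log(sigmoid(z))
```

With `margin=0` and equal rewards the loss is exactly log(2); enlarging the margin raises the loss for the same reward gap, which is what pushes the reward model to separate preference pairs more decisively.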
Google releases Gemini 3.1 Pro (2026-02-19, DeepMind blog: "A smarter model for your most complex tasks"), emphasizing complex-task reasoning; Anthropic releases Claude Sonnet 4.6 (2026-02-19, "frontier performance across coding, agents, and professional work at scale"); Qwen 3.5-397B-A17B (2026-02-16, 105K downloads, 754 likes) is an MoE-architecture vision-language model. Meanwhile MiniMax-M2.5 (123K downloads, 814 likes) becomes a community favorite, and Cerebras releases REAP-compressed versions (172B-A10B and 139B-A10B). The Reddit post "Qwen3.5 Plus, GLM 5, Gemini 3.1 Pro, Sonnet 4.6, three new open source agents" (57 upvotes) confirms the community's sense of how dense model releases have become.
Hugging Face's blog announces "GGML and llama.cpp join HF to ensure the long-term progress of Local AI." GGML is the tensor library behind the most widely used quantization formats (GGUF) for local model inference; llama.cpp is the community's most active local inference engine. Concurrent signals: the Reddit post "Free ASIC Llama 3.1 8B inference at 16,000 tok/s" (318 upvotes, the week's highest) suggests dedicated hardware-accelerated local inference has crossed the usability threshold; "Kimi K2.5 better than Opus 4.6 on hallucination benchmark" (46 upvotes) shows local/open-source models challenging closed-source frontier models in specific domains; and Snorkel AI demonstrates a 4B model outperforming a 235B model through tool discipline.
Demand Signals
Inferring training data demands from model releases
Download Movers
Datasets with the largest download changes this week
| Dataset | Downloads | Weekly Growth |
|---|---|---|
| allenai/molmospaces | 204 | +39.7% |
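The weekly-growth column is a plain week-over-week percentage. The previous-week figure of roughly 146 downloads used below is back-computed from the reported +39.7% and is an assumption, not a reported number.

```python
def wow_growth(current, previous):
    """Week-over-week growth, in percent."""
    return 100.0 * (current - previous) / previous

# allenai/molmospaces: 204 downloads this week vs ~146 last week (back-computed).
print(round(wow_growth(204, 146), 1))  # prints 39.7
```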
Deep Dive — DataRecipe
Reverse analysis of this week's 2 high-value datasets (auto-generated by DataRecipe)
Data Structure
Risk Assessment
2 datasets analyzed this week · 83.9% human labor share · All Medium difficulty
Want to discuss this issue?
Auto-generated by AI Dataset Radar · Updated weekly
AI Dataset Radar →