W08 AI Data Intelligence

One-line Summary

Allen AI releases five datasets plus the Olmix data mixing framework, systematically defining a pre-training data methodology; Meta open-sources a 200K+ multilingual multi-turn preference dataset, expanding the public supply of RLHF data; RLHF/alignment research sustains a high volume of output for the fourth consecutive week, with methodology moving toward personalization and decoupling. Top data demand signal this week: Multimodal Visual Reasoning Data.

Key Findings

This week's 5 findings with high commercial value

P0 Allen AI Releases Five Datasets + Olmix Data Mixing Framework, Systematically Defining Pre-Training Data Methodology (2026-02-11 to 2026-02-17)

Allen AI released 5 datasets and 8 models this week, making it the research institution with the highest single-week output. Key highlights: allenai/olmix (2026-02-11, 238 downloads, 18 likes): proxy-run swarm data for OLMo pre-training, systematically addressing the core pre-training question of "what ratio of data from different domains produces optimal results"; allenai/Dolci-Instruct-DPO (2,498 downloads): 260K preference pairs for OLMo 3 Instruct 7B alignment training, under the ODC-BY license; allenai/olmOCR-bench (2,745 downloads, 58 likes): 1,403 PDFs plus 7,010 unit tests, establishing an evaluation standard for PDF-to-Markdown OCR systems; allenai/Molmo2-MultiImageQA (194 downloads): a multi-image visual question answering dataset for instruction fine-tuning; allenai/molmospaces (204 downloads, +39.7% weekly growth): an embodied AI 3DGUT/USD resource update in an Isaac Sim-compatible format. Companion blog posts were published at the same time: an Olmix data mixing framework deep dive, AutoDiscovery (automated scientific discovery), a MolmoSpaces ecosystem introduction, and How2Everything (real-world procedure evaluation).

Business implications: Allen AI has moved from "releasing individual datasets" to "outputting data methodology," and Olmix's swarm data mixing method will transform the engineering practice of pre-training data proportioning. Data service providers should watch for: 1) Data mixing optimization as a service, helping clients find optimal training ratios; 2) OCR evaluation benchmark standardization: olmOCR-bench may become the de facto standard in document AI, and data suppliers should calibrate document labeling quality accordingly; 3) Public supply of DPO preference data: 260K open-source DPO preference pairs compress the commercial space for low-quality preference data, so differentiated competition must focus on vertical domains.
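To make the "data mixing optimization" idea concrete, here is a minimal sketch of a grid search over domain ratios. It is not Olmix's actual method. The domain names, the 10% grid step, and the `proxy_loss` function (a stand-in for training and evaluating a small proxy model on a given mixture) are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical domain names; a real pre-training corpus would have its own.
DOMAINS = ["web", "code", "math"]


def candidate_mixtures(domains, steps=10):
    """Yield every mixture whose weights are multiples of 1/steps and sum to 1."""
    for counts in product(range(steps + 1), repeat=len(domains)):
        if sum(counts) == steps:
            yield {d: c / steps for d, c in zip(domains, counts)}


def proxy_loss(mixture):
    """Toy stand-in for a proxy run (assumption, not a real measurement).

    In practice this would train a small model on the mixture and return
    its validation loss. Here the loss is simply the squared distance to
    an arbitrary, made-up "ideal" mixture.
    """
    ideal = {"web": 0.6, "code": 0.3, "math": 0.1}
    return sum((mixture[d] - ideal[d]) ** 2 for d in mixture)


def best_mixture(domains, score_fn, steps=10):
    """Return the candidate mixture with the lowest proxy loss."""
    return min(candidate_mixtures(domains, steps), key=score_fn)


if __name__ == "__main__":
    print(best_mixture(DOMAINS, proxy_loss))
```

Exhaustive search is only workable for a handful of domains and a coarse grid. With more domains, the number of candidate mixtures grows quickly, which is one reason a service in this space would likely fit a predictive model over a limited set of proxy runs instead.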

P0 Meta Open-Sources 200K+ Multilingual Multi-Turn Preference Dataset, Upgrading RLHF Public Data Supply (2025-05-13 Initial Release, Entered Monitoring Scope This Week)

facebook/community-alignment-dataset (194 downloads, 39 likes, cc-by-4.0) contains 200K+ LLM response comparisons from 3,000+ annotators worldwide, covering multilingual and multi-turn conversation scenarios. It is Meta's largest open-source multilingual preference dataset. Meta also released facebook/actionbench (2026-02-19, 2 downloads): 128 paired samples of video and animated point clouds for evaluating the ability to generate animated 3D meshes from video. Together, the two datasets reflect Meta's strategic positioning on two data fronts: "text alignment" and "video-3D multimodal."

Business implications: The cc-by-4.0 license of community-alignment-dataset permits commercial use, including training, as long as attribution is given. This is a boon for small and mid-sized model companies, but it directly undercuts preference data suppliers. Differentiation directions: 1) Vertical industry preference data (medical, legal, financial, and other professional scenarios not covered by Meta's dataset); 2) Chinese preference data: although the dataset is multilingual, the depth of its Chinese coverage is limited; 3) Continuous update services: open-source datasets are static, while clients need preference data that keeps updating as models iterate.

P1 RLHF/Alignment Research Sustains High Output for Fourth Consecutive Week, Methodology Moving Toward Personalization and Decoupling (2026-02-16 to 2026-02-19)

Five RLHF/alignment-related papers appeared this week: MARS (2026-02-19): margin-aware reward modeling plus self-refined data augmentation, addressing the high cost of preference data; Learning Personalized Agents from Human Feedback (2026-02-18): introduces the PersonaliZe framework, enabling agents to adapt to dynamic changes in individual preferences; Multi-Objective Alignment for Personalized Psychotherapy (2026-02-17): multi-objective alignment in psychotherapy scenarios, balancing patient preferences with clinical safety; Interactionless IRL (2026-02-16): proposes "interaction-free inverse reinforcement learning," decoupling safety objectives from policy to avoid "alignment waste"; Latency-aware HITL-RL (2026-02-17): embeds human feedback and latency constraints in semantic communication. The common trend across all five papers is a move from "one-size-fits-all alignment" toward "personalized + decoupled + multi-objective + scenario-specific."

Business implications: The refinement of alignment methodology directly changes data requirements: 1) Personalized preference data — no longer "humanity's preferences" but "preferences of specific user groups/individuals," requiring data collection to cover population diversity; 2) Multi-objective labeling — the same sample needs preference labeling across multiple dimensions (safety, helpfulness, personalization, etc.), increasing labeling costs but raising the value per data point; 3) Dynamic preference data — the PersonaliZe framework emphasizes that preferences change over time, meaning preference data needs periodic refreshing, and the "one-time labeling" model will be replaced by "continuous labeling services."
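The three requirements above (group-specific, multi-dimensional, and time-sensitive preferences) can be illustrated with a small record type. This is a sketch only. The schema, the field names, the dimension weights, and the 90-day refresh window are all hypothetical and do not come from PersonaliZe or any of the papers cited.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class PreferenceLabel:
    """One annotator's comparison of responses A and B (hypothetical schema)."""
    prompt: str
    annotator_group: str   # e.g. a user segment, for personalized alignment
    labeled_on: date       # needed to detect stale preferences
    # Per-dimension verdicts: +1 means A preferred, -1 means B, 0 means tie.
    verdicts: dict = field(default_factory=dict)


def overall_preference(label, weights):
    """Combine per-dimension verdicts into one weighted score in [-1, 1]."""
    total = sum(weights.values())
    return sum(weights[dim] * label.verdicts.get(dim, 0) for dim in weights) / total


def is_stale(label, today, max_age_days=90):
    """Flag labels older than the refresh window (90 days is an arbitrary choice)."""
    return today - label.labeled_on > timedelta(days=max_age_days)
```

Keeping the per-dimension verdicts separate, instead of storing only a single winner, lets the same labeled sample be re-weighted later for clients with different priorities. This is where the higher value per data point would come from.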

P1 Three Frontier Models Debut in the Same Week: Gemini 3.1 Pro, Sonnet 4.6, Qwen 3.5-397B — Multimodal Arms Race Reaches White Heat (2026-02-16 to 2026-02-19)

Google released Gemini 3.1 Pro (2026-02-19, DeepMind blog: "A smarter model for your most complex tasks"), emphasizing complex task reasoning capabilities; Anthropic released Claude Sonnet 4.6 (2026-02-19, "frontier performance across coding, agents, and professional work at scale"); and Qwen 3.5-397B-A17B (2026-02-16, 105K downloads, 754 likes) arrived as a vision-language model with an MoE architecture. Concurrently, MiniMax-M2.5 became a community favorite with 123K downloads and 814 likes, and Cerebras released REAP compressed versions (172B-A10B and 139B-A10B). The Reddit hot post "Qwen3.5 Plus, GLM 5, Gemini 3.1 Pro, Sonnet 4.6, three new open source agents" (57 votes) reflects the community's sense of how dense model releases have become.

Business implications: Three frontier models releasing in the same week signals a synchronized explosion in alignment and evaluation data demand. Key areas to watch: 1) Complex task reasoning data — Gemini 3.1 Pro targets "complex tasks," requiring multi-step reasoning and long-chain thinking evaluation and training data; 2) Coding/Agent data — Sonnet 4.6 emphasizes coding and agents, driving up demand for agent behavior trajectories and code reasoning data; 3) Vision-language multimodal data — Qwen 3.5 is a vision-language model, and at 397B scale, its consumption of visual reasoning data is enormous.

P2 GGML/llama.cpp Joins Hugging Face, Local AI Infrastructure Consolidation Accelerates (2026-02-19)

The Hugging Face blog announced "GGML and llama.cpp join HF to ensure the long-term progress of Local AI." GGML is the tensor library behind llama.cpp, whose quantized model format is the most widely used for local model inference, and llama.cpp is the most active local inference engine in the community. Concurrent signals: the Reddit post "Free ASIC Llama 3.1 8B inference at 16,000 tok/s" (318 votes, highest this week) suggests that dedicated hardware-accelerated local inference has crossed the usability threshold; "Kimi K2.5 better than Opus 4.6 on hallucination benchmark" (46 votes) shows local/open-source models challenging closed-source frontier models in specific domains; and Snorkel AI demonstrated a 4B model surpassing a 235B model through tool discipline.

Business implications: The consolidation of local AI infrastructure means: 1) Quantized model evaluation data demand: quality loss from quantization needs systematic evaluation, creating a new category of "pre/post-quantization comparison evaluation datasets"; 2) End-to-end on-device fine-tuning data: 16K tok/s ASIC inference and GGML/HF integration together move edge deployment from technical validation to production readiness, and demand for edge-specialized data will scale up; 3) Small model alignment data: Snorkel AI's 4B model case shows that small models can outperform large ones through precise fine-tuning, but the prerequisite is high-quality vertical domain alignment data.
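A "pre/post-quantization comparison" can be as simple as running both model variants over the same evaluation set and reporting accuracy and agreement. The sketch below assumes a hypothetical interface in which each model is a callable that maps a prompt to an answer string, and it scores by exact match. A real evaluation would wrap an actual inference engine and likely use task-specific metrics.

```python
def compare_quantized(prompts, full_model, quant_model, references):
    """Compare a full-precision model with its quantized version on one eval set.

    `full_model` and `quant_model` are any callables mapping a prompt to an
    answer string (hypothetical interface; wrap a real inference engine here).
    Scoring is exact match against `references`.
    """
    n = len(prompts)
    full_out = [full_model(p) for p in prompts]
    quant_out = [quant_model(p) for p in prompts]

    full_acc = sum(o == r for o, r in zip(full_out, references)) / n
    quant_acc = sum(o == r for o, r in zip(quant_out, references)) / n
    # Agreement counts identical outputs even when both models are wrong.
    agreement = sum(a == b for a, b in zip(full_out, quant_out)) / n

    return {
        "full_accuracy": full_acc,
        "quant_accuracy": quant_acc,
        "accuracy_drop": full_acc - quant_acc,
        "agreement": agreement,
    }
```

Reporting agreement alongside the accuracy drop matters because two models can reach similar accuracy while failing on different prompts, and that difference is exactly what a comparison dataset is meant to expose.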

Demand Signals

Training data demand inferred from model releases

Signal	Priority	Trend	Evidence
Multimodal Visual Reasoning Data	Critical	↑ New	Qwen 3.5-397B VLM · GLM-4.6V visual reasoning · Molmo2-MultiImageQA multi-image VQA
RLHF/Preference Alignment Data	Critical	↑ New	Meta 200K+ preference pairs open-sourced · Allen AI 260K DPO pairs · MARS reward modeling self-refinement · PersonaliZe personalized alignment
Agent Behavior/Trajectory Data	High	↑ New	Sonnet 4.6 Agent performance · Snowflake AgentWorldModel-1K · Mistral Vibe CLI/Devstral 2 · OpenAI Codex 61K⭐
Complex Reasoning Evaluation Data	High	↑ New	Gemini 3.1 Pro "complex tasks" · HLE-Verified (corrected Humanity's Last Exam) · MATEO temporal reasoning benchmark
Coding/Code Reasoning Data	High	↑ New	Sonnet 4.6 coding performance · Qwen3 Coder Next · Reddit "surge in LLM coding capabilities" · TAROT code generation RL
Multilingual Data	High	→ Continuing	ÜberWeb 20T multilingual curation · WaxalNLP African language speech · ParlaCAP 28 European parliaments · Crowdsourcing Piedmontese
Robotics/Embodied AI Data	Moderate	↑ New	NVIDIA NuRec · MolmoSpaces +39.7% growth · Humanoid End-Effector Control · Isaac-GR00T 6.2K⭐
Document OCR Data	Moderate	↑ New	olmOCR-bench · Mistral OCR 3 · PaddleOCR-VL in llama.cpp · amazon/doc_split
Quantization/Compression Evaluation Data	Moderate	↑ New	Cerebras REAP compression of MiniMax · ASIC 16K tok/s inference · INT8 cross-chip precision variance
Safety/Alignment Audit Data	Moderate	↑ New	EleutherAI misalignment-control-sft · Qwen3Guard real-time safety · OpenAI $7.5M alignment research grant

↓ Dropped (present in previous issue, absent this issue): Robotics Manipulation Data · Multimodal Preference Data · Speech/ASR Data · Code Data · Video Understanding Data · GUI/Agent Data

Download Movers

Datasets with the largest download changes this week

Dataset	Downloads	Weekly Growth
allenai/molmospaces	204	+39.7%
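For reference, a weekly growth figure like the one in this table is a percentage change between two weekly download counts. The sketch below shows the arithmetic. The prior-week count is not given in this issue, so the value 146 used in the usage note is an assumption, chosen only because it reproduces the reported figure.

```python
def weekly_growth(previous, current):
    """Percentage change in downloads from the previous week to the current week."""
    if previous <= 0:
        raise ValueError("previous-week downloads must be positive")
    return (current - previous) / previous * 100


def format_growth(previous, current):
    """Format growth with an explicit sign and one decimal place."""
    return f"{weekly_growth(previous, current):+.1f}%"
```

With an assumed prior-week count of 146, `format_growth(146, 204)` yields `+39.7%`. At download counts this low, a few dozen extra downloads produce a large percentage swing, so the growth figure is best read together with the absolute number.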

Deep Dive — DataRecipe

Reverse analysis of this week's 2 high-value datasets (auto-generated by DataRecipe)

facebook/EgoAVU_data

300 samples · 6 fields · Medium difficulty

6.0/10

Data Structure

Risk Assessment

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds

Low Risk: Data may become outdated over time → Establish continuous update mechanisms

allenai/olmix

300 samples · 113 fields · Medium difficulty

6.5/10

Data Structure

Risk Assessment

Medium Risk: Labeling quality may fluctuate → Establish rigorous QA processes with quality thresholds

Low Risk: Data may become outdated over time → Establish continuous update mechanisms

Analyzed 2 datasets this week · 83.9% human effort · all Medium difficulty

Want to discuss this issue?

Kai, Founder & CEO

苏文, AI Documentation & Release Engineer

陆明哲, AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly

