AI Dataset Radar
Scanned this week: 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts
Allen AI releases five datasets plus the Olmix data-mixing framework, putting pre-training data methodology on a systematic footing; Meta open-sources a 200K+ multilingual, multi-turn preference dataset, upgrading the public RLHF data supply; RLHF/alignment research sustains high-density output for a fourth consecutive week, with methodology moving toward personalization and decoupling. Top data-demand signal this week: multimodal visual reasoning data.
Key Findings
This week's 5 high-commercial-value findings
Allen AI released 5 datasets and 8 models this week, the highest single-week output of any research institution. Key highlights:

- allenai/olmix (2026-02-11, 238 downloads, 18 likes): proxy-run swarm data for OLMo pre-training, systematically attacking the core pre-training question of what ratio of domain data trains best (see the mixing sketch after this list)
- allenai/Dolci-Instruct-DPO (2,498 downloads): 260K preference pairs used to align OLMo 3 Instruct 7B, ODC-BY license
- allenai/olmOCR-bench (2,745 downloads, 58 likes): 1,403 PDFs + 7,010 unit tests, establishing an evaluation standard for PDF-to-Markdown OCR systems
- allenai/Molmo2-MultiImageQA (194 downloads): a multi-image visual question answering instruction fine-tuning dataset
- allenai/molmospaces (204 downloads, +39.7% weekly growth): an embodied-AI 3DGUT/USD resource update in an Isaac Sim compatible format

Companion blog posts landed alongside: an Olmix data-mixing framework deep dive, AutoDiscovery automated scientific discovery, a MolmoSpaces ecosystem introduction, and How2Everything real-world procedure evaluation.
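To make the Olmix question concrete, here is a minimal sketch of the underlying operation: interleaving domain corpora at a candidate mixture ratio. The corpora and the 70/30 weighting below are illustrative stand-ins, not Olmix's actual recipe; only the `datasets` library's interleaving API is assumed.

```python
# A minimal data-mixing sketch with illustrative corpora and weights;
# Olmix's proxy-run swarm sweeps many such weightings and compares
# proxy-model losses to pick a ratio.
from datasets import load_dataset, interleave_datasets

web = load_dataset("allenai/c4", "en", split="train", streaming=True)
wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

# Align schemas so the two streams interleave on "text" alone.
web = web.remove_columns(["timestamp", "url"])

# One candidate mixture: 70% web, 30% wiki, sampled per example.
mixed = interleave_datasets([web, wiki], probabilities=[0.7, 0.3], seed=0)

for example in mixed.take(3):
    print(example["text"][:80])
```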
facebook/community-alignment-dataset (194 downloads, 39 likes, cc-by-4.0): 200K+ LLM response comparisons from 3,000+ global annotators, covering multilingual and multi-turn conversation scenarios. It is Meta's largest open-source multilingual preference dataset to date. Meta also released facebook/actionbench (2026-02-19, 2 downloads): 128 video-to-animated-point-cloud paired samples for evaluating the generation of animated 3D meshes from video. Together the two datasets mark Meta's strategic positioning on two data fronts: text alignment and video-3D multimodal.
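As a sketch of how such comparison data typically feeds DPO-style training, the snippet below shapes one record into a (prompt, chosen, rejected) triple. The record layout is invented for illustration; check the facebook/community-alignment-dataset card for the real schema.

```python
# Hedged sketch: shape one response-comparison record into the
# (prompt, chosen, rejected) triple that DPO-style trainers expect.
# Field names here are invented, not the dataset's documented schema.
record = {
    "conversation": [{"role": "user", "content": "How do I bake bread?"}],
    "response_a": "Mix flour, water, salt, and yeast; knead, proof, bake.",
    "response_b": "I don't know.",
    "preference": "a",  # annotator's pick
}

def to_preference_triple(rec):
    # Flatten the multi-turn context into a single prompt string.
    prompt = "\n".join(turn["content"] for turn in rec["conversation"])
    a, b = rec["response_a"], rec["response_b"]
    chosen, rejected = (a, b) if rec["preference"] == "a" else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

print(to_preference_triple(record))
```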
Five RLHF/alignment papers landed this week:

- MARS (2026-02-19): margin-aware reward modeling plus self-refined data augmentation, attacking the high cost of preference data (see the loss sketch after this list)
- Learning Personalized Agents from Human Feedback (2026-02-18): introduces the PersonaliZe framework, letting agents adapt to dynamic shifts in individual preferences
- Multi-Objective Alignment for Personalized Psychotherapy (2026-02-17): multi-objective alignment for psychotherapy, balancing patient preferences against clinical safety
- Interactionless IRL (2026-02-16): proposes interaction-free inverse reinforcement learning, decoupling the safety objective from the policy to avoid "alignment waste"
- Latency-aware HITL-RL (2026-02-17): embeds human feedback and latency constraints in semantic communication

The common thread across all five: a move from one-size-fits-all alignment toward personalized, decoupled, multi-objective, scenario-specific alignment.
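To ground the MARS headline, here is a minimal sketch of a margin-aware Bradley-Terry reward loss, in the spirit the paper's title suggests; the paper's actual objective and its self-refinement loop may differ.

```python
# Margin-aware Bradley-Terry reward-model loss: beyond asking
# r_chosen > r_rejected, each pair must clear a per-pair margin,
# concentrating gradient on hard or low-confidence comparisons.
# Sketch only; the MARS paper's exact objective may differ.
import torch
import torch.nn.functional as F

def margin_aware_rm_loss(r_chosen, r_rejected, margin):
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

r_c = torch.tensor([1.2, 0.4])  # reward scores for chosen responses
r_r = torch.tensor([0.9, 0.6])  # reward scores for rejected responses
m = torch.tensor([0.5, 0.1])    # e.g. wider margin for clearer preferences
print(margin_aware_rm_loss(r_c, r_r, m))  # grows when margins are violated
```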
Google released Gemini 3.1 Pro (2026-02-19, DeepMind blog: "A smarter model for your most complex tasks"), emphasizing complex-task reasoning; Anthropic released Claude Sonnet 4.6 (2026-02-19, "frontier performance across coding, agents, and professional work at scale"); and Qwen 3.5-397B-A17B (2026-02-16, 105K downloads, 754 likes) arrived as an MoE vision-language model. Meanwhile, MiniMax-M2.5 became a community favorite at 123K downloads and 814 likes, and Cerebras released REAP-compressed versions (172B-A10B and 139B-A10B). The Reddit hot post "Qwen3.5 Plus, GLM 5, Gemini 3.1 Pro, Sonnet 4.6, three new open source agents" (57 votes) confirms the community's sense of release density.
The Hugging Face blog announced that "GGML and llama.cpp join HF to ensure the long-term progress of Local AI." GGML is the most widely used quantization format for local model inference, and llama.cpp is the community's most active local inference engine. Concurrent signals: the Reddit post "Free ASIC Llama 3.1 8B inference at 16,000 tok/s" (318 votes, this week's highest) suggests dedicated-hardware local inference has crossed the usability threshold; "Kimi K2.5 better than Opus 4.6 on hallucination benchmark" (46 votes) shows local/open-source models challenging closed-source frontiers in specific domains; and Snorkel AI demonstrated a 4B model surpassing a 235B model through tool discipline.
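For readers new to the stack: llama.cpp runs GGUF-quantized models locally, and the snippet below shows the shape of a call through the llama-cpp-python bindings. The model path is a placeholder for whatever GGUF file you have downloaded; the parameters shown are just the common ones.

```python
# A minimal local-inference sketch via llama-cpp-python (bindings over
# llama.cpp). The model path is a placeholder for any locally
# downloaded GGUF quantization.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder
    n_ctx=4096,  # context window
)
out = llm("Summarize this week's dataset releases in one sentence.",
          max_tokens=64)
print(out["choices"][0]["text"])
```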
Demand Signals
Inferring training-data demand from model releases
Download Movers
Datasets with the largest download changes this week
| Dataset | Downloads | Weekly Growth |
|---|---|---|
| allenai/molmospaces | 204 | +39.7% |
Deep Dive — DataRecipe
This week's 2 high-value datasets reverse-analyzed (auto-generated by DataRecipe)
Analyzed 2 datasets this week · 83.9% human effort · all Medium difficulty
Want to discuss this issue?
Auto-generated by AI Dataset Radar · Updated weekly
AI Dataset Radar →