Radar Brief Week 22, 2026 · 2026-05-11 — 2026-05-18

NVIDIA video benchmark surges to 2,479 downloads in two weeks
Video scene judgment becomes a new data frontier

This week's scan covered 86 HF orgs · 50 GitHub orgs · 71 blogs · 125 X accounts

One-line Summary

NVIDIA's PhysicalAI-VANTAGE-Bench reached 2,479 downloads within 14 days of its release on 2026-05-04, and the Subset version reached 1,284 downloads after its release on 2026-05-05 [P0]. LAION's rl_environment datasets grew from 1 to 16 this period (15 new), alongside 4 new reward_model datasets and 1 new rlhf_preference dataset, forming a systematic alignment data stack [P0]. Meta and Google both strengthened multilingual quality datasets, with facebook/bouquet and google/fleurs reaching 1,435 and 57,173 downloads respectively [P1]. The strongest data demand signal this week: fixed-camera video understanding / cross-camera tracking data.

Key Findings

This week's 5 findings with high commercial value

P0 NVIDIA's PhysicalAI-VANTAGE-Bench reached 2,479 downloads within 14 days of its release on 2026-05-04, while the Subset version reached 1,284 downloads after its release on 2026-05-05

nvidia/PhysicalAI-VANTAGE-Bench was released on 2026-05-04 and currently has 2,479 downloads and 9 likes; nvidia/PhysicalAI-VANTAGE-Bench-Subset was released on 2026-05-05 and currently has 1,284 downloads and 1 like. Change tracking shows VANTAGE-Bench increased from 19 in the previous period to 2,479, up by 2,460 downloads or 12,947.4%; the Subset version rose from 6 to 1,284, up by 1,278 downloads or 21,300.0%. Both focus on video understanding tasks from fixed infrastructure cameras, covering real-world scenarios such as warehouses and smart cities.
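
The growth figures above follow from a plain period-over-period calculation. The sketch below reproduces them from the counts quoted in this brief; the function name `growth` is illustrative and is not taken from the Radar tooling, whose actual implementation is not described here.

```python
def growth(previous: int, current: int) -> tuple[int, float]:
    """Return (absolute change, percentage change) between two download counts."""
    if previous <= 0:
        # A dataset with no baseline has no meaningful percentage growth.
        raise ValueError("percentage growth is undefined for a non-positive baseline")
    delta = current - previous
    return delta, round(delta / previous * 100, 1)


# Counts quoted in the finding above: previous period -> current period.
vantage = growth(19, 2479)
subset = growth(6, 1284)

print(f"VANTAGE-Bench: +{vantage[0]:,} downloads, +{vantage[1]:,.1f}%")
print(f"Subset:        +{subset[0]:,} downloads, +{subset[1]:,.1f}%")
```

Because both datasets start from very small baselines (19 and 6 downloads), the percentages are large even though the absolute gains are a few thousand downloads at most, so the absolute change is the more stable figure to compare.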

Business significance → This indicates that "fixed-camera video understanding / cross-scene tracking evaluation" is becoming a high-demand data category in Physical AI. The value of this type of data lies not in massive volumes of raw video, but in high-judgment-density information such as cross-camera object consistency, event boundaries, and failed scene transfer cases. Synthetic generation can help with scale, but it cannot easily replace judgment about "whether this counts as the same event / the same entity." For Knowlyr, this is a clear opportunity for human judgment data services: video event segmentation, cross-camera identity consistency review, hard-case feedback loops, and evaluation set construction.

P0 LAION's rl_environment datasets grew from 1 to 16 this period (15 new), alongside 4 new reward_model datasets and 1 new rlhf_preference dataset, forming a systematic alignment data stack

In the change data, rl_environment increased from 1 to 16 datasets, adding 15 new ones; reward_model rose from 0 to 4; rlhf_preference rose from 0 to 1. Representative datasets include laion/nemotron-gym-safety, laion/nemotron-gym-agent-workplace, laion/nemotron-gym-agent-calendar, laion/nemotron-gym-competitive-coding, laion/scaling-laws-for-comparison-full, as well as laion/mix_h10_reward_binary-v2, laion/mix_h10_reward_proportional-v2, laion/mix_h10_reward_staged-v2, and laion/mix_baseline_uniform-v2, all of which appeared for the first time this period.
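
The per-category changes above are differences between two snapshots of category counts. A minimal sketch, assuming the counts are available as simple dictionaries (the variable and function names are hypothetical, not the Radar pipeline's own):

```python
# Category counts for LAION as quoted above: previous period vs. this period.
previous = {"rl_environment": 1, "reward_model": 0, "rlhf_preference": 0}
current = {"rl_environment": 16, "reward_model": 4, "rlhf_preference": 1}


def category_deltas(prev: dict[str, int], curr: dict[str, int]) -> dict[str, int]:
    """Return the change in dataset count per category, treating missing keys as 0."""
    categories = sorted(set(prev) | set(curr))
    return {cat: curr.get(cat, 0) - prev.get(cat, 0) for cat in categories}


deltas = category_deltas(previous, current)
total_new = sum(deltas.values())

for category, delta in deltas.items():
    print(f"{category}: {previous.get(category, 0)} -> {current[category]} ({delta:+d})")
print(f"total new datasets across the three categories: {total_new}")
```

Keeping the total (16) separate from the delta (15) matters for rl_environment, where one dataset already existed in the previous period.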

Business significance → The industry is shifting from "single-turn preference data" to systematic post-training built around "environment + trajectory + reward + comparison." What is hardest to replace here is not task generation itself, but reward definition, failure attribution, adversarial sample design, and multi-objective trade-off criteria, all of which require human judgment. For Knowlyr, service positioning should evolve from "labeling" to "alignment data production where people earn income by contributing judgment," with a focus on preference comparison, trajectory review, and reward rubric design for safety, office Agents, and coding Agents.

P1 Meta and Google are both strengthening multilingual quality datasets, with facebook/bouquet and google/fleurs reaching 1,435 and 57,173 downloads respectively

facebook/bouquet was released on 2025-06-10 and currently has 1,435 downloads and 36 likes. It is a many-to-many parallel translation quality evaluation set across 8 languages, with underlying text manually created by linguists. google/fleurs was released on 2022-04-19 and currently has 57,173 downloads and 402 likes, covering speech recognition in 102 languages, with labels drawn from expert-generated, crowdsourced, and machine-generated sources. Both point to multilingual speech/translation quality evaluation rather than simple corpus expansion.

Business significance → Competition in multilingual data has shifted from "whether data exists" to "whether quality judgment is reliable." In particular, issues such as translation quality, accent intelligibility, and cross-lingual consistency still depend on fine-grained judgment from native speakers and experts. For Knowlyr, this is a high-value supply direction: multilingual subjective quality evaluation, bilingual sentence-pair preference selection, and cultural-context consistency review, rather than low-cost generic corpus curation.

P1 Agent evaluation datasets continue heating up, with internlm/WildClawBench reaching 8,250 downloads, while Microsoft adds Orchard and WebTailBench

internlm/WildClawBench was released on 2026-03-24 and currently has 8,250 downloads and 59 likes, up by 567 from 7,683 in the previous period. Change data also shows two new entries from Microsoft, both categorized as agent_tool: microsoft/Orchard with 166 downloads and 8 likes, and microsoft/WebTailBench with 366 downloads and 16 likes. Databricks' databricks/officeqa was released on 2025-12-15 and currently has 131 downloads, focusing on end-to-end reasoning over real documents.

Business significance → The focus of Agent training is shifting from general QA to "getting work done in real environments." What determines the model ceiling is not the number of web snapshots, but process-oriented judgment such as whether task decomposition is correct, whether tool use is compliant, and whether failures can be analyzed retrospectively. For Knowlyr, the opportunity lies in building agent task trajectory quality inspection, manual rubric evaluation, and real enterprise document task sets, areas where human judgment is harder to replace than in generic instruction data.

P2 Scientific and industrial document datasets are entering high-value reasoning evaluation, with allenai/olmoearth-paper-embeddings and databricks/officeqa appearing simultaneously

allenai/olmoearth-paper-embeddings was released on 2026-05-15 and currently has 2,876 downloads and 2 likes, providing paper embeddings for 26 Earth observation foundation models across 24 downstream tasks. databricks/officeqa was released on 2025-12-15 and currently has 131 downloads, centered on grounded reasoning over U.S. Treasury bulletin documents dating back to the 1930s. Meanwhile, Microsoft Research published a blog post related to SocialReasoning-Bench on 2026-05-14, emphasizing that agents may complete tasks without necessarily improving users' situations.

Business significance → The barrier for high-value document AI lies in "whether the evidence chain holds," not merely OCR or extraction. Scientific tables, historical archives, financial bulletins, and industrial rule documents all require people to make judgments about evidence citation, conclusion robustness, and task completion standards. For Knowlyr, document understanding services should focus on high-value steps such as evidence alignment evaluation, citation correctness review, and long-document task decomposition judgment.

Demand Signals

Infer training data demands from model releases

Data Type | Intensity | Trend | Related Signals
Fixed-camera video understanding / cross-camera tracking data | Very strong | ↑ New | nvidia/PhysicalAI-VANTAGE-Bench reached 2,479 downloads after its release on 2026-05-04; the Subset version reached 1,284 downloads after its release on 2026-05-05
RL environment and task episode data | Very strong | ↑ New | LAION's rl_environment datasets grew from 1 to 16 (15 new), covering scenarios such as safety · calendar · workplace · competitive-coding
Reward model and preference comparison data | Very strong | ↑ New | LAION added 4 reward_model and 1 rlhf_preference datasets, forming a reward + comparison data stack
Agent tool-use trajectories | Strong | ↑ New | The agent_tool category grew from 0 to 9, adding microsoft/Orchard · microsoft/WebTailBench · databricks/officeqa · allenai/asta-summary-citation-counts
Multilingual quality evaluation data | Strong | ↑ New | facebook/bouquet reached 1,435 downloads, and google/fleurs reached 57,173 downloads; both emphasize expert/crowdsourced quality sources
Scientific and enterprise document reasoning data | Strong | ↑ New | databricks/officeqa · allenai/asta-summary-citation-counts · allenai/olmoearth-paper-embeddings point to reasoning over real documents and evidence chains
3D world models / robot spatial labeling | Strong | ↑ New | nvidia/PointWorld-DROID reached 571 downloads, advancing alongside Lyra-2.0 and the Physical AI ecosystem
Safety and reliability evaluation data | Strong | ↑ New | laion/nemotron-gym-safety was newly added; Microsoft published blogs on SocialReasoning-Bench and delegation reliability; the community is actively discussing benchmark hacking and arXiv hallucination penalties
Scientific computing and industrial simulation data | Medium | ↑ New | nvidia/HiLiftAeroML reached 11,330 downloads, Linear-Radiation-Transport was newly added, and GridSFM_US_power_grid added 432 downloads
Video generation alignment data | Medium | ↑ New | Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization was published on 2026-05-15
Robot manipulation trajectories and language instructions | - | ↓ Dropped | Present in previous issue, absent this issue
Video anomaly understanding / infrastructure camera data | - | ↓ Dropped | Present in previous issue, absent this issue
Coding Agent trajectories | - | ↓ Dropped | Present in previous issue, absent this issue
Open-ended social evaluation and safety rubrics | - | ↓ Dropped | Present in previous issue, absent this issue
Multilingual / low-resource speech | - | ↓ Dropped | Present in previous issue, absent this issue
Synthetic coding tasks and validation sets | - | ↓ Dropped | Present in previous issue, absent this issue
Multimodal search and web Agent data | - | ↓ Dropped | Present in previous issue, absent this issue
Video reward modeling / preference learning data | - | ↓ Dropped | Present in previous issue, absent this issue
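
The "New" and "Dropped" labels in the list above describe which signal names appear in one issue but not the other. The brief does not say how the labels are computed; the sketch below assumes a plain set comparison on exact signal names, and `classify_signals` is a hypothetical helper. Only a few signal names are included for brevity.

```python
def classify_signals(previous: set[str], current: set[str]) -> dict[str, list[str]]:
    """Label signals by comparing the previous issue's names with this issue's names."""
    return {
        "new": sorted(current - previous),         # present now, absent before
        "dropped": sorted(previous - current),     # present before, absent now
        "continuing": sorted(current & previous),  # present in both issues
    }


# Abbreviated example using signal names that appear in this issue's list.
previous_issue = {
    "Coding Agent trajectories",
    "Multilingual / low-resource speech",
}
current_issue = {
    "Agent tool-use trajectories",
    "Multilingual quality evaluation data",
}

labels = classify_signals(previous_issue, current_issue)
for label, names in labels.items():
    print(label, names)
```

Under this assumption, a signal that is renamed between issues would show up once as dropped and once as new, so closely related entries on both sides of the list may describe the same underlying demand.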

Download Movers

Datasets with the largest download changes this week

Dataset | Downloads | Weekly Growth
nvidia/PhysicalAI-VANTAGE-Bench-Subset | 1,284 | +21,300.0%
nvidia/PhysicalAI-VANTAGE-Bench | 2,479 | +12,947.4%
laion/Scientific-Summaries | 34,214 | +1,241.7%
microsoft/delulu-fim-benchmark | 659 | +112.6%
internlm/WildClawBench | 8,250 | +7.4%
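
The table lists only current downloads and weekly growth, but the previous-period count can be estimated by inverting the growth formula. This is a rough sketch: `estimate_previous` is an illustrative name, and because the published percentages are rounded to one decimal place, the results are approximations rather than the tool's stored values.

```python
def estimate_previous(current: int, growth_pct: float) -> float:
    """Invert growth_pct = (current - previous) / previous * 100 to recover previous."""
    if growth_pct <= -100:
        raise ValueError("growth of -100% or less leaves no recoverable baseline")
    return current / (1 + growth_pct / 100)


# (dataset, current downloads, weekly growth in percent) as listed in the table above.
movers = [
    ("nvidia/PhysicalAI-VANTAGE-Bench-Subset", 1284, 21300.0),
    ("nvidia/PhysicalAI-VANTAGE-Bench", 2479, 12947.4),
    ("laion/Scientific-Summaries", 34214, 1241.7),
    ("microsoft/delulu-fim-benchmark", 659, 112.6),
    ("internlm/WildClawBench", 8250, 7.4),
]

estimates = {name: round(estimate_previous(cur, pct)) for name, cur, pct in movers}
for name, previous in estimates.items():
    print(f"{name}: about {previous:,} downloads in the previous period")
```

For the two NVIDIA datasets the estimates match the previous-period counts stated earlier in this brief (6 and 19). For small growth rates such as WildClawBench's, rounding in the percentage shifts the estimate by a few downloads.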

Want to discuss this issue?

Kai Founder & CEO
苏文 AI Documentation & Release Engineer
陆明哲 AI Product Manager

Auto-generated by AI Dataset Radar · Updated weekly
