Multi-source competitive intelligence fusion engine — automatically scans 339 signal sources (93 HF orgs + 50 GitHub orgs + 71 blogs + 125 X accounts), using watermark-driven incremental scanning and anomaly scoring models to generate structured weekly reports and trend analyses.

339 Signal Sources Incremental Scanning Anomaly Scoring

Quick Start

Install

pip install ai-dataset-radar

Usage

from radar.scanner import Scanner

scanner = Scanner()
report = scanner.scan(days=7)

MCP Tools

19 callable endpoints

+

radar_scan Run AI dataset competitive intelligence scan, monitoring latest updates on HuggingFace, GitHub, arXiv, and blogs

radar_summary Get summary statistics from the latest scan report

radar_datasets Get the list of newly discovered datasets

radar_github Get latest activity from GitHub organizations

radar_papers Get latest related papers

radar_config View current monitoring configuration (tracked organizations, keywords, etc.)

radar_blogs Get latest blog posts (from 62+ blog sources)

radar_reddit Get Reddit AI/ML community posts (r/MachineLearning, r/LocalLLaMA, etc.)

radar_search Full-text search across all data sources (datasets, GitHub, papers, blogs, X/Twitter, Reddit), supports keywords and regex

radar_trend Query dataset growth trends: fastest rising, breakthrough growth, historical curves for specific datasets

radar_history View historical scan report timeline with statistical summaries and change trends

radar_diff Compare two scan reports, auto-detect added/removed datasets, repos, papers, etc.

radar_trends View historical trend data: quantity changes over time for each data source, supports chart data output

radar_matrix Get competitor matrix: cross-analysis of datasets/repos/papers/blogs by organization and data type

radar_lineage Get dataset lineage analysis: derivation relationships, version chains, fork trees, and root datasets

radar_org_graph Get organization relationship graph: collaboration edges, clustering, and centrality rankings

radar_alerts Get recent alert records: zero-data, threshold, trend breakouts, change detection, etc.

radar_export Export latest report in specified format (CSV / Markdown table / JSON compact)

radar_subscribe Manage watchlist — add/view/remove tracked datasets or organizations, future scans highlight matches

Documentation

AI Dataset Radar

Name: AI Dataset Radar
Author: Knowlyr

Multi-Source Competitive Intelligence Engine
for AI Training Data Ecosystems

Async multi-source intelligence — watermark-driven incremental scanning, anomaly detection, cross-dimensional analysis, agent-native

GitHub · PyPI · knowlyr.com · 中文版

Abstract

Competitive intelligence for AI training data has long been constrained by three fundamental bottlenecks: information asymmetry, source fragmentation, and reactive monitoring. AI Dataset Radar proposes a multi-source asynchronous competitive intelligence engine that achieves full-pipeline concurrent crawling via aiohttp across 7 data sources and 337+ monitored targets (86 HuggingFace orgs / 50 GitHub orgs / 71 blogs / 125 X accounts / 5 Reddit communities / Papers with Code), reduces API call volume from $O(N)$ to $O(\Delta N)$ through org-level watermark incremental scanning, and closes the loop from passive observation to proactive alerting via 7 anomaly detection rules across 4 categories.

AI Dataset Radar implements a multi-source async competitive intelligence engine covering 86 HuggingFace orgs, 50 GitHub orgs, 71 blogs, 125 X accounts, 5 Reddit communities, and Papers with Code. The system features org-level watermark incremental scanning that reduces API calls from $O(N)$ to $O(\Delta N)$, anomaly detection with 7 rules across 4 categories, and three-dimensional cross-analysis (competitive matrix, dataset lineage, org relationship graph). It exposes 19 MCP tools, 19 REST endpoints, and 7 Claude Code Skills for agent-native integration.

Architecture

flowchart TD
    subgraph S[" 7 Data Sources · 337+ Targets"]
        direction LR
        S1["HuggingFace<br/>86 orgs"] ~~~ S2["GitHub<br/>50 orgs"] ~~~ S3["Blogs<br/>71 sources"]
        S4["Papers<br/>arXiv + HF"] ~~~ S5["X / Twitter<br/>125 accounts"] ~~~ S6["Reddit<br/>5 communities"]
        S7["Papers with Code"]
    end

    S --> T["Trackers<br/>aiohttp async · org-level watermark"]
    T --> A["Analyzers<br/>classification · trends · matrix · lineage · org graph"]
    A --> D["Anomaly Detection<br/>7 rules × 4 categories · fingerprint dedup"]

    subgraph O[" Output Layer"]
        direction LR
        O1["JSON structured"] ~~~ O2["Markdown reports"] ~~~ O3["AI Insights"]
    end

    D --> O

    subgraph I[" Agent Interface Layer"]
        direction LR
        I1["REST API<br/>19 endpoints"] ~~~ I2["MCP Server<br/>19 tools"] ~~~ I3["Skills<br/>7 commands"] ~~~ I4["Dashboard<br/>12 tabs"]
    end

    O --> I

    style S fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style T fill:#0969da,color:#fff,stroke:#0969da
    style A fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style D fill:#e5534b,color:#fff,stroke:#e5534b
    style O fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style I fill:#2da44e,color:#fff,stroke:#2da44e

Key Features

Feature	Description
Multi-Source Async Crawling	7 sources, 337+ targets via aiohttp full-pipeline concurrency; 500+ concurrent requests per scan
Watermark Incremental Scanning	Org-level watermark per source; API calls from $O(N)$ to $O(\Delta N)$
Three-Dimensional Cross-Analysis	Competitive matrix + dataset lineage + org relationship graph
Anomaly Detection & Alerting	7 rules across 4 categories; fingerprint dedup; Email + Webhook distribution
Time-Series Persistence	SQLite daily snapshots; bulk upsert; long-cycle trend analysis
Agent-Native Interfaces	19 MCP tools + 19 REST endpoints + 7 Claude Code Skills
AI-Powered Insights	LLM-generated analytical reports; multi-provider (Anthropic / Kimi / DeepSeek)
Real-Time Dashboard	12-tab web dashboard with panoramic intelligence view

Quick Start

git clone https://github.com/liuxiaotong/ai-dataset-radar.git
cd ai-dataset-radar
pip install -r requirements.txt && playwright install chromium
cp .env.example .env  # Edit to fill in tokens

# Basic scan (auto-generates AI analytical report)
python src/main_intel.py --days 7

# Scan + DataRecipe deep analysis
python src/main_intel.py --days 7 --recipe

# Docker
docker compose run scan

Data Sources

Source	Count	Coverage
HuggingFace	86 orgs	67 labs + 27 vendors (incl. robotics, Europe, Asia-Pacific)
Blogs	71 sources	Labs + researchers + independent blogs + data vendors
GitHub	50 orgs	AI labs + Chinese open-source + robotics + data vendors
Papers	2 sources	arXiv (cs.CL/AI/LG/CV/RO) + HF Papers
Papers with Code	API	Dataset/benchmark tracking; paper citation relationships
X/Twitter	125 accounts	13 categories: CEOs/Leaders + researchers + robotics
Reddit	5 communities	MachineLearning, LocalLLaMA, dataset, deeplearning, LanguageTechnology

Ecosystem

Layer	Project	PyPI	Description	Repo
Discovery	Radar	knowlyr-radar	Multi-source competitive intelligence · incremental scanning · anomaly alerting	You are here
Analysis	DataRecipe	knowlyr-datarecipe	Reverse analysis, schema extraction, cost estimation	GitHub
Production	DataSynth	knowlyr-datasynth	LLM batch synthesis	GitHub
Production	DataLabel	knowlyr-datalabel	Lightweight annotation	GitHub
Quality	DataCheck	knowlyr-datacheck	Rule validation, dedup detection, distribution analysis	GitHub
Audit	ModelAudit	knowlyr-modelaudit	Distillation detection, model fingerprinting	GitHub
Deliberation	Crew	knowlyr-crew	Adversarial multi-agent deliberation · persistent memory evolution · MCP-native	GitHub
Identity	knowlyr-id	--	Identity system + AI employee runtime	GitHub
Agent Training	knowlyr-gym	sandbox/recorder/reward/hub	Gymnasium-style RL framework · process reward model · SFT/DPO/GRPO	GitHub