Quick Start
pip install ai-dataset-radar
from radar.scanner import Scanner
scanner = Scanner()
report = scanner.scan(days=7)
| Tool | Description |
|---|---|
| radar_scan | Run AI dataset competitive intelligence scan, monitoring latest updates on HuggingFace, GitHub, arXiv, and blogs |
| radar_summary | Get summary statistics from the latest scan report |
| radar_datasets | Get the list of newly discovered datasets |
| radar_github | Get latest activity from GitHub organizations |
| radar_papers | Get latest related papers |
| radar_config | View current monitoring configuration (tracked organizations, keywords, etc.) |
| radar_blogs | Get latest blog posts (from 62+ blog sources) |
| radar_reddit | Get Reddit AI/ML community posts (r/MachineLearning, r/LocalLLaMA, etc.) |
| radar_search | Full-text search across all data sources (datasets, GitHub, papers, blogs, X/Twitter, Reddit), supports keywords and regex |
| radar_trend | Query dataset growth trends: fastest rising, breakthrough growth, historical curves for specific datasets |
| radar_history | View historical scan report timeline with statistical summaries and change trends |
| radar_diff | Compare two scan reports, auto-detect added/removed datasets, repos, papers, etc. |
| radar_trends | View historical trend data: quantity changes over time for each data source, supports chart data output |
| radar_matrix | Get competitor matrix: cross-analysis of datasets/repos/papers/blogs by organization and data type |
| radar_lineage | Get dataset lineage analysis: derivation relationships, version chains, fork trees, and root datasets |
| radar_org_graph | Get organization relationship graph: collaboration edges, clustering, and centrality rankings |
| radar_alerts | Get recent alert records: zero-data, threshold, trend breakouts, change detection, etc. |
| radar_export | Export latest report in specified format (CSV / Markdown table / JSON compact) |
| radar_subscribe | Manage watchlist: add/view/remove tracked datasets or organizations, future scans highlight matches |
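To make the `radar_diff` description concrete (comparing two scan reports and detecting added and removed items), here is a minimal sketch. The report layout (a dict mapping a category name to a list of item IDs) and the function name `diff_reports` are assumptions made for illustration; the tool's actual report schema is not shown in this document.

```python
from typing import Dict, List


def diff_reports(
    old: Dict[str, List[str]], new: Dict[str, List[str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Return added/removed item IDs per category (hypothetical report layout)."""
    result: Dict[str, Dict[str, List[str]]] = {}
    # Cover categories that appear in either report.
    for category in sorted(set(old) | set(new)):
        before = set(old.get(category, []))
        after = set(new.get(category, []))
        result[category] = {
            "added": sorted(after - before),
            "removed": sorted(before - after),
        }
    return result


# Made-up example data, not real scan output.
old_report = {"datasets": ["org-a/ds1", "org-a/ds2"], "repos": ["org-a/tool"]}
new_report = {"datasets": ["org-a/ds2", "org-b/ds3"], "repos": ["org-a/tool"]}
changes = diff_reports(old_report, new_report)
print(changes["datasets"])
```

Set difference keeps the comparison independent of item order, which is convenient if two scans list the same items in different sequences.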
Documentation
AI Dataset Radar
Multi-Source Competitive Intelligence Engine
for AI Training Data Ecosystems
Async multi-source intelligence — watermark-driven incremental scanning, anomaly detection, cross-dimensional analysis, agent-native
GitHub · PyPI · knowlyr.com · Chinese version
Abstract
Competitive intelligence for AI training data has long been constrained by three bottlenecks: information asymmetry, source fragmentation, and reactive monitoring. AI Dataset Radar is a multi-source asynchronous competitive intelligence engine built to address them. It crawls 7 data sources and 337+ monitored targets (86 HuggingFace orgs / 50 GitHub orgs / 71 blogs / 125 X accounts / 5 Reddit communities / Papers with Code) with full-pipeline concurrency via aiohttp, reduces API call volume from $O(N)$ to $O(\Delta N)$ through org-level watermark incremental scanning, and moves from passive observation to proactive alerting via 7 anomaly detection rules across 4 categories.
On top of the collected data, the system provides three-dimensional cross-analysis (competitive matrix, dataset lineage, org relationship graph). It exposes 19 MCP tools, 19 REST endpoints, and 7 Claude Code Skills for agent-native integration.
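The idea behind org-level watermark incremental scanning can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes each org's listing can be read newest first, and the names `scan_new_items`, `watermarks`, and the `(item_id, modified)` tuple shape are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple

Item = Tuple[str, datetime]  # hypothetical shape: (item_id, last_modified)


def scan_new_items(
    org: str, listing: List[Item], watermarks: Dict[str, datetime]
) -> List[Item]:
    """Walk a newest-first listing and stop at the first item already covered."""
    watermark = watermarks.get(org)
    fresh: List[Item] = []
    for item_id, modified in listing:  # assumed sorted newest first
        if watermark is not None and modified <= watermark:
            break  # everything from here on was seen by an earlier scan
        fresh.append((item_id, modified))
    if fresh:
        # The newest timestamp becomes the org's new watermark.
        watermarks[org] = fresh[0][1]
    return fresh


def day(n: int) -> datetime:
    return datetime(2024, 1, n, tzinfo=timezone.utc)


watermarks: Dict[str, datetime] = {}
listing = [("ds3", day(3)), ("ds2", day(2)), ("ds1", day(1))]
first = scan_new_items("org-a", listing, watermarks)   # first scan reads all items
listing.insert(0, ("ds4", day(4)))                      # one new item appears
second = scan_new_items("org-a", listing, watermarks)  # later scan reads only the delta
print(len(first), len(second))
```

Because the loop stops at the watermark, the work per scan grows with the number of new items ($\Delta N$) rather than with the full listing ($N$), which is the reduction the abstract describes.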
Architecture
flowchart TD
subgraph S[" 7 Data Sources · 337+ Targets"]
direction LR
S1["HuggingFace<br/>86 orgs"] ~~~ S2["GitHub<br/>50 orgs"] ~~~ S3["Blogs<br/>71 sources"]
S4["Papers<br/>arXiv + HF"] ~~~ S5["X / Twitter<br/>125 accounts"] ~~~ S6["Reddit<br/>5 communities"]
S7["Papers with Code"]
end
S --> T["Trackers<br/>aiohttp async · org-level watermark"]
T --> A["Analyzers<br/>classification · trends · matrix · lineage · org graph"]
A --> D["Anomaly Detection<br/>7 rules × 4 categories · fingerprint dedup"]
subgraph O[" Output Layer"]
direction LR
O1["JSON structured"] ~~~ O2["Markdown reports"] ~~~ O3["AI Insights"]
end
D --> O
subgraph I[" Agent Interface Layer"]
direction LR
I1["REST API<br/>19 endpoints"] ~~~ I2["MCP Server<br/>19 tools"] ~~~ I3["Skills<br/>7 commands"] ~~~ I4["Dashboard<br/>12 tabs"]
end
O --> I
style S fill:#1a1a2e,color:#e0e0e0,stroke:#444
style T fill:#0969da,color:#fff,stroke:#0969da
style A fill:#8b5cf6,color:#fff,stroke:#8b5cf6
style D fill:#e5534b,color:#fff,stroke:#e5534b
style O fill:#1a1a2e,color:#e0e0e0,stroke:#444
style I fill:#2da44e,color:#fff,stroke:#2da44e
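The tracker stage in the diagram fans requests out concurrently. A minimal sketch of that pattern, using only the standard library, is shown below. The network call is simulated with `asyncio.sleep`; per the diagram the real trackers use aiohttp, and the function names, the result shape, and the `max_concurrency` parameter here are assumptions for illustration.

```python
import asyncio
from typing import Dict, List


async def fetch_target(name: str, limit: asyncio.Semaphore) -> Dict[str, str]:
    """Stand-in for one network call; a real tracker would issue an HTTP request here."""
    async with limit:  # cap the number of in-flight requests
        await asyncio.sleep(0.01)  # simulated I/O latency
        return {"target": name, "status": "ok"}


async def scan_all(targets: List[str], max_concurrency: int = 50) -> List[Dict[str, str]]:
    """Fetch every target concurrently, bounded by a semaphore."""
    limit = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_target(t, limit) for t in targets]
    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)


targets = [f"org-{i}" for i in range(20)]  # hypothetical target names
results = asyncio.run(scan_all(targets, max_concurrency=5))
print(len(results))
```

The semaphore is a design choice worth keeping in any real version: it lets the scan stay concurrent while respecting upstream rate limits.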
Key Features
| Feature | Description |
|---|---|
| Multi-Source Async Crawling | 7 sources, 337+ targets via aiohttp full-pipeline concurrency; 500+ concurrent requests per scan |
| Watermark Incremental Scanning | Org-level watermark per source; API calls from $O(N)$ to $O(\Delta N)$ |
| Three-Dimensional Cross-Analysis | Competitive matrix + dataset lineage + org relationship graph |
| Anomaly Detection & Alerting | 7 rules across 4 categories; fingerprint dedup; Email + Webhook distribution |
| Time-Series Persistence | SQLite daily snapshots; bulk upsert; long-cycle trend analysis |
| Agent-Native Interfaces | 19 MCP tools + 19 REST endpoints + 7 Claude Code Skills |
| AI-Powered Insights | LLM-generated analytical reports; multi-provider (Anthropic / Kimi / DeepSeek) |
| Real-Time Dashboard | 12-tab web dashboard with panoramic intelligence view |
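The fingerprint dedup mentioned for alerting can be pictured with a short sketch. The alert fields (`rule`, `source`, `target`, `message`), the choice of SHA-256, and the in-memory `seen` set are all assumptions for illustration; the project's actual fingerprint definition and storage are not described in this document.

```python
import hashlib
from typing import Dict, List, Set


def fingerprint(alert: Dict[str, str]) -> str:
    """Hash the fields that identify an alert (hypothetical field names)."""
    key = "|".join([alert["rule"], alert["source"], alert["target"]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def dedup_alerts(alerts: List[Dict[str, str]], seen: Set[str]) -> List[Dict[str, str]]:
    """Keep only alerts whose fingerprint has not been emitted before."""
    fresh = []
    for alert in alerts:
        fp = fingerprint(alert)
        if fp not in seen:
            seen.add(fp)
            fresh.append(alert)
    return fresh


seen: Set[str] = set()
batch = [
    {"rule": "zero_data", "source": "github", "target": "org-a", "message": "no results"},
    {"rule": "zero_data", "source": "github", "target": "org-a", "message": "no results (retry)"},
    {"rule": "threshold", "source": "huggingface", "target": "org-b", "message": "spike"},
]
first_pass = dedup_alerts(batch, seen)   # duplicates within the batch are dropped
second_pass = dedup_alerts(batch, seen)  # a repeated batch produces no new alerts
print(len(first_pass), len(second_pass))
```

Leaving the free-text message out of the fingerprint means a reworded alert about the same rule and target is still treated as a repeat, so Email and Webhook recipients are not notified twice.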
Quick Start
git clone https://github.com/liuxiaotong/ai-dataset-radar.git
cd ai-dataset-radar
pip install -r requirements.txt && playwright install chromium
cp .env.example .env # Edit to fill in tokens
# Basic scan (auto-generates AI analytical report)
python src/main_intel.py --days 7
# Scan + DataRecipe deep analysis
python src/main_intel.py --days 7 --recipe
# Docker
docker compose run scan
Data Sources
| Source | Count | Coverage |
|---|---|---|
| HuggingFace | 86 orgs | 67 labs + 27 vendors (incl. robotics, Europe, Asia-Pacific) |
| Blogs | 71 sources | Labs + researchers + independent blogs + data vendors |
| GitHub | 50 orgs | AI labs + Chinese open-source + robotics + data vendors |
| Papers | 2 sources | arXiv (cs.CL/AI/LG/CV/RO) + HF Papers |
| Papers with Code | API | Dataset/benchmark tracking; paper citation relationships |
| X/Twitter | 125 accounts | 13 categories: CEOs/Leaders + researchers + robotics |
| Reddit | 5 communities | MachineLearning, LocalLLaMA, dataset, deeplearning, LanguageTechnology |
Ecosystem
| Layer | Project | PyPI | Description | Repo |
|---|---|---|---|---|
| Discovery | Radar | knowlyr-radar | Multi-source competitive intelligence · incremental scanning · anomaly alerting | You are here |
| Analysis | DataRecipe | knowlyr-datarecipe | Reverse analysis, schema extraction, cost estimation | GitHub |
| Production | DataSynth | knowlyr-datasynth | LLM batch synthesis | GitHub |
| Production | DataLabel | knowlyr-datalabel | Lightweight annotation | GitHub |
| Quality | DataCheck | knowlyr-datacheck | Rule validation, dedup detection, distribution analysis | GitHub |
| Audit | ModelAudit | knowlyr-modelaudit | Distillation detection, model fingerprinting | GitHub |
| Deliberation | Crew | knowlyr-crew | Adversarial multi-agent deliberation · persistent memory evolution · MCP-native | GitHub |
| Identity | knowlyr-id | -- | Identity system + AI employee runtime | GitHub |
| Agent Training | knowlyr-gym | sandbox/recorder/reward/hub | Gymnasium-style RL framework · process reward model · SFT/DPO/GRPO | GitHub |