Open Source Python MIT

DataRecipe

Updated 2026-02-25
AI dataset reverse engineering framework — a 6-stage deep analysis pipeline that automatically extracts labeling specs, cost models, and replication strategies. LLM-enhanced layer generates 23+ multi-role production documents, supporting dual-format output (human-readable + machine-parsable).
6-Stage Pipeline 23+ Document Generation LLM-Enhanced

Quick Start

Install
pip install knowlyr-datarecipe
Usage
# CLI
knowlyr-datarecipe deep-analyze tencent/CL-bench
MCP Tools
parse_spec_document: Parse a specification document (PDF, Word, image, text) and extract text content. Returns the document text and a prompt for LLM analysis.
generate_spec_output: Generate project artifacts (annotation spec, executive summary, milestone plan, cost breakdown) from analysis JSON.
analyze_huggingface_dataset: Run deep analysis on a HuggingFace dataset and generate a reproduction guide.
get_extraction_prompt: Get the LLM extraction prompt template for analyzing a specification document. Use this when you want to analyze a document yourself instead of using an external API.
extract_rubrics: Extract scoring rubrics and evaluation patterns from a HuggingFace dataset. Returns structured templates for annotation guidelines.
extract_prompts: Extract system prompt templates from a HuggingFace dataset. Returns unique prompts categorized by domain.
compare_datasets: Compare multiple HuggingFace datasets side by side. Returns comparison metrics and recommendations.
profile_dataset: Generate an annotator profile and cost estimate for a dataset. Returns required skills, team size, and budget.
get_agent_context: Get the AI Agent context file from a previous analysis. Returns structured data for AI Agent consumption.
recipe_template: Generate an annotation template from analysis results (feeds into data-label). Reads DATA_SCHEMA.json and ANNOTATION_SPEC.md and generates a data-label-compatible HTML annotation template.
recipe_diff: Compare the differences between two analysis results, including schema fields, statistics, and scoring rubrics.
enhance_analysis_reports: Apply LLM-enhanced context to regenerate analysis reports with rich,
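
To illustrate the kind of comparison recipe_diff describes, here is a minimal sketch that diffs the schema fields of two analyses. It is not the tool's implementation: the function name `diff_schema_fields` and the flat `{field: type}` layout are assumptions made for this example, and the real DATA_SCHEMA.json structure may differ.

```python
from typing import Any


def diff_schema_fields(old: dict[str, Any], new: dict[str, Any]) -> dict[str, list[str]]:
    """Compare two flat {field_name: type_name} mappings (hypothetical layout)."""
    old_keys, new_keys = set(old), set(new)
    return {
        # Fields present only in the newer analysis.
        "added": sorted(new_keys - old_keys),
        # Fields present only in the older analysis.
        "removed": sorted(old_keys - new_keys),
        # Fields present in both whose declared type differs.
        "changed": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }


previous = {"id": "string", "prompt": "string", "score": "integer"}
current = {"id": "string", "prompt": "string", "score": "number", "rubric": "string"}
report = diff_schema_fields(previous, current)
print(report)
```

The same three-way split (added, removed, changed) could plausibly be applied to statistics or rubric entries, keyed by name.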

Documentation

DataRecipe

Automated Dataset Reverse Engineering and Reproduction Cost Estimation

Reverse-engineer any AI dataset: extract schemas, estimate costs, and generate production-ready documentation from samples or requirement docs

GitHub · PyPI · knowlyr.com

Why DataRecipe?

Reproducing an AI dataset requires answering three questions: What does the data look like (Schema), How much will it cost (Cost), and How to build it (Methodology). Today these answers come from manually reading papers, inspecting samples, and writing specs — a process that takes days and cannot be reused across datasets.

DataRecipe automates the entire reverse engineering process. Give it a HuggingFace dataset or a requirement document (PDF/Word/Image), and it will:

  • Infer Schema — field types, constraints, distributions
  • Extract Rubrics & Prompts — scoring criteria, annotation dimensions, prompt templates
  • Model Costs — token-level analysis, phased cost breakdown, human-machine split ratios
  • Generate 23+ Production Documents — for 6 stakeholder roles (executive, PM, annotators, engineers, finance, AI agents)
  • Enhance with LLM — a single LLM call produces EnhancedContext, upgrading template outputs to domain-specific professional analyses
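
As a rough illustration of the schema inference step in the list above, the sketch below derives a field's dominant type, nullability, and presence from a handful of sample records. It is a simplified stand-in, not DataRecipe's inference logic; `infer_schema` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Infer per-field type, nullability, and presence from sample records."""
    type_counts: dict[str, Counter] = defaultdict(Counter)
    present: Counter = Counter()
    for record in records:
        for field, value in record.items():
            present[field] += 1
            type_counts[field][type(value).__name__] += 1

    schema: dict[str, dict[str, Any]] = {}
    for field, counts in type_counts.items():
        # Ignore nulls when picking the dominant type.
        non_null = {name: n for name, n in counts.items() if name != "NoneType"}
        dominant = max(non_null, key=non_null.get) if non_null else "NoneType"
        schema[field] = {
            "type": dominant,
            "nullable": "NoneType" in counts,
            # A field is required only if every sample contains it.
            "required": present[field] == len(records),
        }
    return schema


# Hypothetical sample records for demonstration only.
samples = [
    {"id": 1, "prompt": "Summarize the text.", "score": 4},
    {"id": 2, "prompt": "Translate to French.", "score": None},
    {"id": 3, "prompt": "Classify sentiment."},
]
schema = infer_schema(samples)
for name, spec in schema.items():
    print(name, spec)
```

A fuller version would also record constraints and value distributions, which the list above names but this sketch omits.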

Quick Start

pip install knowlyr-datarecipe

# Analyze a HuggingFace dataset (local, no API key needed)
knowlyr-datarecipe deep-analyze tencent/CL-bench

# Enable LLM enhancement for richer output
knowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm

# Analyze a requirement document
knowlyr-datarecipe analyze-spec requirements.pdf

Optional extras: pip install knowlyr-datarecipe[llm] (Anthropic/OpenAI), [pdf], [mcp], or [all].
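
If you want to drive the CLI from a script, one option is to assemble the command in Python and hand it to `subprocess`. The sketch below only builds and prints the command shown above; `build_deep_analyze_command` is a hypothetical helper, not part of the package, and it uses only the subcommand and `--use-llm` flag that appear in this Quick Start.

```python
import shlex


def build_deep_analyze_command(dataset_id: str, use_llm: bool = False) -> list[str]:
    """Build the argv list for `knowlyr-datarecipe deep-analyze` (hypothetical helper)."""
    command = ["knowlyr-datarecipe", "deep-analyze", dataset_id]
    if use_llm:
        command.append("--use-llm")
    return command


command = build_deep_analyze_command("tencent/CL-bench", use_llm=True)
print(shlex.join(command))

# To actually run it (requires knowlyr-datarecipe to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list rather than a shell string avoids quoting problems when a dataset ID or path contains unusual characters.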

Six-Stage Analysis Pipeline

graph LR
    I["Input<br/>HF Dataset / PDF / Word"] --> A1["Schema<br/>Inference"]
    A1 --> A2["Rubric<br/>Extraction"]
    A2 --> A3["Prompt<br/>Extraction"]
    A3 --> A4["Cost<br/>Modeling"]
    A4 --> A5["Human-Machine<br/>Split"]
    A5 --> A6["Benchmark<br/>Comparison"]
    A6 --> E["LLM Enhancer<br/>EnhancedContext"]
    E --> G["Generators<br/>23+ Documents"]

    style A1 fill:#0969da,color:#fff,stroke:#0969da
    style E fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style G fill:#2da44e,color:#fff,stroke:#2da44e
    style I fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A2 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A3 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A4 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A5 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A6 fill:#1a1a2e,color:#e0e0e0,stroke:#444

Each stage outputs both human-readable (Markdown) and machine-parseable (JSON/YAML) formats. The LLM Enhancement Layer runs in three modes: auto (detect environment), interactive (host LLM handles it), or api (standalone Anthropic/OpenAI call).
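
The three enhancement modes can be pictured as a small resolution step that turns auto into one of the two concrete modes. The sketch below is an assumption about how such detection might work, not DataRecipe's actual logic: it treats the presence of an `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` environment variable as the signal for api mode, and `EnhanceMode` and `resolve_mode` are hypothetical names.

```python
import os
from enum import Enum
from typing import Mapping


class EnhanceMode(str, Enum):
    AUTO = "auto"
    INTERACTIVE = "interactive"
    API = "api"


def resolve_mode(requested: EnhanceMode, env: Mapping[str, str] = os.environ) -> EnhanceMode:
    """Resolve 'auto' to a concrete mode.

    Assumption: an API key in the environment means a standalone API call is
    possible; otherwise the host LLM is expected to handle enhancement.
    """
    if requested is not EnhanceMode.AUTO:
        # Explicit choices are always respected.
        return requested
    has_key = any(env.get(name) for name in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY"))
    return EnhanceMode.API if has_key else EnhanceMode.INTERACTIVE


print(resolve_mode(EnhanceMode.AUTO, {}).value)
print(resolve_mode(EnhanceMode.AUTO, {"OPENAI_API_KEY": "sk-test"}).value)
```

Taking the environment as a parameter keeps the function easy to test without touching real credentials.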

Core Features

| Feature | Description |
| --- | --- |
| Multi-Source Input | HuggingFace datasets, PDF, Word, images, plain text |
| Token-Level Cost Analysis | Phased cost model with human-machine split and industry benchmarks |
| Stakeholder Documents | 23+ docs for executives, PMs, annotators, engineers, finance, AI agents |
| Agent-Ready Output | Structured context, workflow state, reasoning traces, executable pipeline |
| Radar Integration | Batch-analyze datasets discovered by AI Dataset Radar |
| 12 MCP Tools | Seamless AI IDE integration for analysis, enhancement, and comparison |
| 3572 Tests, 97% Coverage | Production-grade reliability |
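
To make the token-level cost analysis with a human-machine split more concrete, here is a deliberately simplified, single-phase sketch. It is not DataRecipe's cost model: the `CostInputs` fields, the formula, and every number in the example are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    num_samples: int
    tokens_per_sample: int           # prompt + completion tokens, averaged
    usd_per_1k_tokens: float         # hypothetical blended LLM price
    human_minutes_per_sample: float  # annotation or review time per sample
    usd_per_human_hour: float
    machine_ratio: float             # fraction of samples handled by the LLM


def estimate_cost(inputs: CostInputs) -> dict[str, float]:
    """Split samples between machine and human work and price each share."""
    machine_samples = inputs.num_samples * inputs.machine_ratio
    human_samples = inputs.num_samples - machine_samples
    machine_cost = machine_samples * inputs.tokens_per_sample / 1000 * inputs.usd_per_1k_tokens
    human_cost = human_samples * inputs.human_minutes_per_sample / 60 * inputs.usd_per_human_hour
    return {
        "machine_usd": round(machine_cost, 2),
        "human_usd": round(human_cost, 2),
        "total_usd": round(machine_cost + human_cost, 2),
    }


# Placeholder figures, not benchmarks.
estimate = estimate_cost(
    CostInputs(
        num_samples=1000,
        tokens_per_sample=2000,
        usd_per_1k_tokens=0.01,
        human_minutes_per_sample=6,
        usd_per_human_hour=30.0,
        machine_ratio=0.7,
    )
)
print(estimate)
```

A phased model would repeat this calculation per project phase (for example pilot, production, and QA) with different ratios and rates, then sum the results.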

Ecosystem

DataRecipe is part of the knowlyr data infrastructure:

| Layer | Project | Role |
| --- | --- | --- |
| Discovery | AI Dataset Radar | Dataset intelligence and trend analysis |
| Analysis | DataRecipe | Reverse engineering, schema inference, cost modeling |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |

# End-to-end workflow
knowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm      # Analyze
knowlyr-datalabel generate ./projects/tencent_CL-bench/          # Annotate
knowlyr-datasynth generate ./projects/tencent_CL-bench/ -n 1000  # Synthesize
knowlyr-datacheck validate ./projects/tencent_CL-bench/          # Validate

knowlyr — automated dataset reverse engineering and reproduction cost estimation

Want to discuss this project? Reach out to Kai (Founder & CEO) or 陆明哲 (AI Product Manager).