AI dataset reverse engineering framework: a 6-stage deep analysis pipeline that automatically extracts labeling specs, cost models, and replication strategies. An LLM-enhanced layer generates 23+ multi-role production documents in dual-format output (human-readable and machine-parseable).

6-Stage Pipeline · 23+ Document Generation · LLM-Enhanced

Quick Start

Install

pip install knowlyr-datarecipe

Usage

# CLI
knowlyr-datarecipe deep-analyze tencent/CL-bench

MCP Tools

12 callable endpoints

parse_spec_document Parse a specification document (PDF, Word, image, text) and extract text content. Returns the document text and a prompt for LLM analysis.

generate_spec_output Generate project artifacts (annotation spec, executive summary, milestone plan, cost breakdown) from analysis JSON.

analyze_huggingface_dataset Run deep analysis on a HuggingFace dataset and generate reproduction guide.

get_extraction_prompt Get the LLM extraction prompt template for analyzing a specification document. Use this when you want to analyze a document yourself instead of using an external API.

extract_rubrics Extract scoring rubrics and evaluation patterns from a HuggingFace dataset. Returns structured templates for annotation guidelines.

extract_prompts Extract system prompt templates from a HuggingFace dataset. Returns unique prompts categorized by domain.

compare_datasets Compare multiple HuggingFace datasets side by side. Returns comparison metrics and recommendations.

profile_dataset Generate annotator profile and cost estimation for a dataset. Returns required skills, team size, and budget.

get_agent_context Get the AI Agent context file from a previous analysis. Returns structured data for AI Agent consumption.

recipe_template Generate annotation template from analysis results (for data-label). Reads DATA_SCHEMA.json and ANNOTATION_SPEC.md to produce data-label compatible HTML template.

recipe_diff Compare differences between two analysis results — schema fields, statistics, scoring rubrics, etc.

enhance_analysis_reports Apply LLM-enhanced context to regenerate analysis reports with rich,

Documentation

DataRecipe

Name: DataRecipe
Author: Knowlyr

Automated Dataset Reverse Engineering and Reproduction Cost Estimation

Reverse-engineer any AI dataset: extract schemas, estimate costs, and generate production-ready documentation from samples or requirement docs

GitHub · PyPI · knowlyr.com

Why DataRecipe?

Reproducing an AI dataset requires answering three questions: what the data looks like (schema), how much it will cost (cost), and how to build it (methodology). Today these answers come from manually reading papers, inspecting samples, and writing specs, a process that takes days and cannot be reused across datasets.

DataRecipe automates the entire reverse engineering process. Give it a HuggingFace dataset or a requirement document (PDF/Word/Image), and it will:

Infer Schema — field types, constraints, distributions
Extract Rubrics & Prompts — scoring criteria, annotation dimensions, prompt templates
Model Costs — token-level analysis, phased cost breakdown, human-machine split ratios
Generate 23+ Production Documents — for 6 stakeholder roles (executive, PM, annotators, engineers, finance, AI agents)
Enhance with LLM — a single LLM call produces EnhancedContext, upgrading template outputs to domain-specific professional analyses
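
To make the first step in the list above concrete, here is a minimal sketch of schema inference from sample rows. It is an illustration only, not DataRecipe's implementation: the function name infer_schema and the output keys (types, required, distinct_values) are hypothetical, and the sample rows are made up.

```python
from collections import Counter
from typing import Any


def infer_schema(samples: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Summarize each field: observed types, whether it is always present,
    and how many distinct values appear. Hypothetical helper for illustration."""
    type_counts: dict[str, Counter] = {}
    present: Counter = Counter()
    distinct: dict[str, set] = {}

    for row in samples:
        for key, value in row.items():
            type_counts.setdefault(key, Counter())[type(value).__name__] += 1
            present[key] += 1
            # repr() keeps unhashable values (lists, dicts) countable.
            distinct.setdefault(key, set()).add(repr(value))

    total = len(samples)
    return {
        key: {
            "types": sorted(type_counts[key]),
            "required": present[key] == total,
            "distinct_values": len(distinct[key]),
        }
        for key in type_counts
    }


# Made-up sample rows, not taken from any real dataset.
samples = [
    {"prompt": "Sum 2 and 3", "score": 5, "tags": ["math"]},
    {"prompt": "Name a prime", "score": 4},
    {"prompt": "Sum 2 and 3", "score": 5.0, "tags": []},
]
schema = infer_schema(samples)
for field, summary in schema.items():
    print(field, summary)
```

A field that mixes types (score here is both int and float) or is missing from some rows (tags) is the kind of constraint an annotation spec would need to call out explicitly.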

Quick Start

pip install knowlyr-datarecipe

# Analyze a HuggingFace dataset (local, no API key needed)
knowlyr-datarecipe deep-analyze tencent/CL-bench

# Enable LLM enhancement for richer output
knowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm

# Analyze a requirement document
knowlyr-datarecipe analyze-spec requirements.pdf

Optional extras: pip install knowlyr-datarecipe[llm] (Anthropic/OpenAI), [pdf], [mcp], or [all].

Six-Stage Analysis Pipeline

graph LR
    I["Input<br/>HF Dataset / PDF / Word"] --> A1["Schema<br/>Inference"]
    A1 --> A2["Rubric<br/>Extraction"]
    A2 --> A3["Prompt<br/>Extraction"]
    A3 --> A4["Cost<br/>Modeling"]
    A4 --> A5["Human-Machine<br/>Split"]
    A5 --> A6["Benchmark<br/>Comparison"]
    A6 --> E["LLM Enhancer<br/>EnhancedContext"]
    E --> G["Generators<br/>23+ Documents"]

    style A1 fill:#0969da,color:#fff,stroke:#0969da
    style E fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style G fill:#2da44e,color:#fff,stroke:#2da44e
    style I fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A2 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A3 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A4 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A5 fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style A6 fill:#1a1a2e,color:#e0e0e0,stroke:#444

Each stage outputs both human-readable (Markdown) and machine-parseable (JSON/YAML) formats. The LLM Enhancement Layer runs in three modes: auto (detect environment), interactive (host LLM handles it), or api (standalone Anthropic/OpenAI call).
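
The dual-format idea can be sketched as rendering one stage result twice: once as Markdown for people and once as JSON for tools. This is a rough sketch under stated assumptions, not the project's generator code; render_stage and the shape of the result dictionary are hypothetical.

```python
import json


def render_stage(name: str, result: dict) -> tuple[str, str]:
    """Return (markdown, json_text) views of the same stage result.
    Hypothetical helper; the real output layout may differ."""
    lines = [f"## {name}", ""]
    for key, value in result.items():
        lines.append(f"- **{key}**: {value}")
    markdown = "\n".join(lines) + "\n"
    # sort_keys keeps the JSON stable across runs, which helps diffing.
    json_text = json.dumps({"stage": name, "result": result}, indent=2, sort_keys=True)
    return markdown, json_text


# Made-up stage result for illustration.
md, js = render_stage("Schema Inference", {"fields": 3, "required": 2})
print(md)
print(js)
```

Keeping both views derived from a single dictionary means the human-readable report and the machine-parseable file cannot drift apart.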

Core Features

Feature	Description
Multi-Source Input	HuggingFace datasets, PDF, Word, images, plain text
Token-Level Cost Analysis	Phased cost model with human-machine split and industry benchmarks
Stakeholder Documents	23+ docs for executives, PMs, annotators, engineers, finance, AI agents
Agent-Ready Output	Structured context, workflow state, reasoning traces, executable pipeline
Radar Integration	Batch-analyze datasets discovered by AI Dataset Radar
12 MCP Tools	Seamless AI IDE integration for analysis, enhancement, and comparison
3572 Tests, 97% Coverage	Production-grade reliability
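
As an illustration of the token-level, phased cost model with a human-machine split listed in the table above, the arithmetic might look like the sketch below. Everything here is an assumption made for the example: the Phase fields, the phase_cost function, and all prices, rates, and volumes are hypothetical and are not DataRecipe's actual model or benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    items: int                     # records handled in this phase
    tokens_per_item: int           # prompt + completion tokens per record
    human_minutes_per_item: float  # manual effort per record
    machine_share: float           # fraction of records handled by the LLM (0..1)


def phase_cost(phase: Phase, usd_per_1k_tokens: float, usd_per_hour: float) -> dict[str, float]:
    """Split one phase into machine (token) cost and human (hourly) cost."""
    machine_items = phase.items * phase.machine_share
    human_items = phase.items - machine_items
    machine = machine_items * phase.tokens_per_item / 1000 * usd_per_1k_tokens
    human = human_items * phase.human_minutes_per_item / 60 * usd_per_hour
    return {
        "machine": round(machine, 2),
        "human": round(human, 2),
        "total": round(machine + human, 2),
    }


# Hypothetical phases and prices, chosen only to make the arithmetic easy to follow.
phases = [
    Phase("drafting", items=1000, tokens_per_item=2000, human_minutes_per_item=6.0, machine_share=0.8),
    Phase("review", items=1000, tokens_per_item=500, human_minutes_per_item=3.0, machine_share=0.0),
]
breakdown = {p.name: phase_cost(p, usd_per_1k_tokens=0.01, usd_per_hour=30.0) for p in phases}
grand_total = round(sum(b["total"] for b in breakdown.values()), 2)
print(breakdown)
print("grand total (USD):", grand_total)
```

With these made-up inputs, drafting costs 16.00 USD in tokens for the 800 machine-handled records and 600.00 USD in labor for the remaining 200, which shows why the split ratio tends to dominate the budget.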

Ecosystem

DataRecipe is part of the knowlyr data infrastructure:

Layer	Project	Role
Discovery	AI Dataset Radar	Dataset intelligence and trend analysis
Analysis	DataRecipe	Reverse engineering, schema inference, cost modeling
Production	DataSynth / DataLabel	LLM batch synthesis / lightweight annotation
Quality	DataCheck	Rule validation, anomaly detection, auto-fix
Audit	ModelAudit	Distillation detection, model fingerprinting

# End-to-end workflow
knowlyr-datarecipe deep-analyze tencent/CL-bench --use-llm      # Analyze
knowlyr-datalabel generate ./projects/tencent_CL-bench/          # Annotate
knowlyr-datasynth generate ./projects/tencent_CL-bench/ -n 1000  # Synthesize
knowlyr-datacheck validate ./projects/tencent_CL-bench/          # Validate
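
For scripted runs, the four commands above could be chained from Python so that a failing step stops the rest. This is a sketch under assumptions: build_workflow and run_workflow are hypothetical helpers, and the project directory naming (replacing "/" with "_" in the dataset ID) is inferred from the example paths above rather than documented behavior. The snippet only prints the commands; call run_workflow(commands) to execute them.

```python
import shlex
import subprocess


def build_workflow(dataset_id: str, synth_count: int = 1000) -> list[list[str]]:
    """Build the argv lists for the four-step workflow shown above."""
    # Assumption: "tencent/CL-bench" maps to "./projects/tencent_CL-bench/",
    # as the example paths suggest.
    project_dir = f"./projects/{dataset_id.replace('/', '_')}/"
    return [
        ["knowlyr-datarecipe", "deep-analyze", dataset_id, "--use-llm"],
        ["knowlyr-datalabel", "generate", project_dir],
        ["knowlyr-datasynth", "generate", project_dir, "-n", str(synth_count)],
        ["knowlyr-datacheck", "validate", project_dir],
    ]


def run_workflow(commands: list[list[str]]) -> None:
    """Run each step in order; requires the knowlyr CLIs to be installed."""
    for argv in commands:
        print("$", shlex.join(argv))
        # check=True raises CalledProcessError, so later steps are skipped on failure.
        subprocess.run(argv, check=True)


commands = build_workflow("tencent/CL-bench")
for argv in commands:
    print(shlex.join(argv))
```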

knowlyr: automated dataset reverse engineering and reproduction cost estimation
