Open Source Python MIT

DataSynth

Data Synthesis

Updated 2026-02-25
Multi-strategy LLM data synthesis engine — seed evolution, template synthesis, and batch generation cover scenarios from cold start to scale. Built-in quality-diversity tradeoff mechanisms, semantic deduplication, and phased cost modeling.

Quick Start

Install
pip install knowlyr-datasynth
Usage
from datasynth import DataSynthesizer, SynthesisConfig

config = SynthesisConfig(target_count=100)
synth = DataSynthesizer(config)
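
prepare_synthesis -- Prepare the data synthesis prompt (interactive mode; does not call the LLM directly)
parse_synthesis_result -- Parse LLM-generated synthetic data and save it
synthesize_data -- Call the LLM directly to generate synthetic data (requires an API key)
validate_data -- Validate that a data file conforms to the schema
synth_augment -- Augment existing data with variants (paraphrasing, back-translation, perturbation, style transfer)
synth_batch -- Synthesize data in batches (supports progress tracking and resuming from a checkpoint)
synth_evaluate -- Run quick multi-dimensional checks on synthetic data (diversity, faithfulness, quality distribution)
estimate_synthesis_cost -- Estimate synthesis cost
synth_translate -- Translate synthetic data into a target language (preserving format and tag structure)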

Documentation


LLM-Powered Synthetic Dataset Generation with Quality-Diversity Optimization

Seed-to-scale synthetic data engine with auto-detected templates, concurrent generation, schema validation, and precise cost estimation

GitHub · PyPI · knowlyr.com

Why DataSynth?

High-quality training data is the key bottleneck for LLM performance. Manual annotation is expensive ($0.1--$10 per sample), slow (100 samples/day), and inconsistent across annotators. Naive LLM batch calls lack quality guarantees -- duplicate samples, schema violations, and distribution skew go undetected.

DataSynth bridges this gap: starting from ~50 seed samples, it auto-detects data types, selects specialized prompt templates, generates data via concurrent LLM calls, validates against schema constraints, and deduplicates across batches -- all at $0.001--$0.01 per sample.
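
The auto-detection step can be pictured as a simple field-based classifier. The following is a minimal sketch under stated assumptions, not DataSynth's actual detector: the function names and the field names (`chosen`, `rejected`, `messages`, `instruction`, `response`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Any, Mapping, Sequence


def detect_data_type(sample: Mapping[str, Any]) -> str:
    """Classify one seed sample by the fields it carries.

    The field names checked here are assumptions, not DataSynth's schema.
    """
    keys = set(sample)
    # Preference pairs (DPO/RLHF style) carry a chosen and a rejected answer.
    if {"chosen", "rejected"} <= keys:
        return "preference"
    # Multi-turn dialogue: a message list with more than one exchange.
    messages = sample.get("messages")
    if isinstance(messages, list) and len(messages) > 2:
        return "multi_turn"
    if {"instruction", "response"} <= keys:
        return "instruction_response"
    return "unknown"


def detect_dataset_type(seeds: Sequence[Mapping[str, Any]]) -> str:
    """Pick the most common per-sample type across the seed set."""
    if not seeds:
        return "unknown"
    counts = Counter(detect_data_type(s) for s in seeds)
    return counts.most_common(1)[0][0]
```

A majority vote over the seeds keeps a few malformed samples from flipping the detected type; the detected label would then select which prompt template to use.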

Core Features

  • Auto-Detected Templates -- Automatically identifies instruction-response, preference pairs (DPO/RLHF), or multi-turn dialogue and applies specialized prompts
  • Concurrent Generation -- Multi-batch parallel LLM calls with thread-safe deduplication and incremental resume (--resume)
  • Schema Validation -- Type checking, range/enum/length constraints; non-compliant samples are filtered automatically
  • Precise Cost Estimation -- Per-model pricing with --dry-run to estimate before generating
  • Post-Generation Hooks -- Auto-trigger downstream quality checks after generation completes
  • Distribution Statistics -- Field-level distribution reports for generated datasets
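
To make the schema validation feature concrete, here is a rough sketch of type, range, enum, and length checks followed by filtering. The schema format, the `SCHEMA` example, and the `violations` / `filter_valid` helpers are all hypothetical; DataSynth's real schema format may differ.

```python
from typing import Any

# Hypothetical schema format: field name -> constraint dict.
SCHEMA = {
    "instruction": {"type": str, "min_length": 5, "max_length": 500},
    "difficulty": {"type": str, "enum": {"easy", "medium", "hard"}},
    "score": {"type": (int, float), "min": 0, "max": 10},
}


def violations(sample: dict[str, Any], schema: dict[str, dict]) -> list[str]:
    """Return a list of human-readable constraint violations for one sample."""
    errors = []
    for field, rule in schema.items():
        if field not in sample:
            errors.append(f"{field}: missing")
            continue
        value = sample[field]
        # bool is a subclass of int, so this sketch rejects it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{field}: wrong type")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: not an allowed value")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below minimum")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above maximum")
        if "min_length" in rule and len(value) < rule["min_length"]:
            errors.append(f"{field}: too short")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{field}: too long")
    return errors


def filter_valid(samples: list[dict[str, Any]], schema: dict[str, dict]) -> list[dict[str, Any]]:
    """Keep only samples with no violations."""
    return [s for s in samples if not violations(s, schema)]
```

Returning the list of violations rather than a bare boolean makes it easy to log why a generated sample was dropped.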

Quick Start

pip install knowlyr-datasynth
export ANTHROPIC_API_KEY=your_key

# Generate 100 samples from DataRecipe analysis output
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 100

# Estimate the cost of a concurrent 1000-sample run without generating (--dry-run)
knowlyr-datasynth generate ./output/ -n 1000 --concurrency 3 --dry-run

# Resume after interruption
knowlyr-datasynth generate ./output/ -n 1000 --resume

# Interactive mode (no API key needed)
knowlyr-datasynth prepare ./analysis_output/my_dataset/ -n 10

# Python API
from datasynth import SynthEngine

engine = SynthEngine(model="claude-sonnet-4-20250514")
result = engine.generate(
    analysis_dir="./analysis_output/my_dataset/",
    target_count=100,
    concurrency=3,
)
print(f"Generated: {result.generated_count}, Cost: ${result.cost_usd:.4f}")
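
A dry-run cost estimate of the kind `--dry-run` reports comes down to token arithmetic. The sketch below is a rough approximation, not DataSynth's cost model: the `Pricing` class, the `estimate_cost` function, the batch size, the token counts, and the prices are all hypothetical placeholders; check your provider's current price list.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens (placeholder values, not real quotes)."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(
    target_count: int,
    batch_size: int,
    prompt_tokens_per_batch: int,
    output_tokens_per_sample: int,
    pricing: Pricing,
) -> float:
    """Estimate total USD cost before making any LLM call."""
    # Each batch resends the prompt (template plus seed examples) once.
    batches = math.ceil(target_count / batch_size)
    input_tokens = batches * prompt_tokens_per_batch
    output_tokens = target_count * output_tokens_per_sample
    total = (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


# Illustrative numbers only: 1000 samples, 10 per batch.
total = estimate_cost(1000, 10, 2000, 200, Pricing(3.0, 15.0))
print(f"Estimated: ${total:.2f} total, ${total / 1000:.4f} per sample")
```

With these placeholder inputs the estimate works out to $3.60 in total, or $0.0036 per sample. Larger batches amortize the prompt over more samples, which is why batch size matters for the estimate.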

Pipeline

graph LR
    Seed["Seed Data<br/>(~50 samples)"] --> Detect["Type Detector<br/>Auto-detect"]
    Detect --> Template["Template<br/>Specialized Prompt"]
    Template --> Gen["Generator<br/>Concurrent Batches"]
    Gen --> Val["Validator<br/>Schema Constraints"]
    Val --> Dedup["Deduplicator<br/>Seed + Cross-batch"]
    Dedup --> Stats["Statistics<br/>Distribution Report"]

    style Gen fill:#0969da,color:#fff,stroke:#0969da
    style Val fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style Dedup fill:#2da44e,color:#fff,stroke:#2da44e
    style Seed fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Detect fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Template fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Stats fill:#1a1a2e,color:#e0e0e0,stroke:#444
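
The Generator and Deduplicator stages interact when batches run concurrently: every worker must check new samples against both the seed set and everything other batches have already kept. The sketch below illustrates that with a lock-protected set. It is an assumption-laden illustration, not DataSynth's implementation: `Deduplicator`, `fake_generate_batch`, and `generate_deduplicated` are hypothetical names, the LLM call is replaced by a stub, and exact matching on a normalized fingerprint stands in for whatever similarity measure the engine actually uses (it is not semantic deduplication).

```python
import hashlib
import json
import threading
from concurrent.futures import ThreadPoolExecutor


class Deduplicator:
    """Thread-safe exact-match deduplication, pre-loaded with the seed set."""

    def __init__(self, seeds):
        self._lock = threading.Lock()
        self._seen = {self._fingerprint(s) for s in seeds}

    @staticmethod
    def _fingerprint(sample) -> str:
        # Canonical JSON, lowercased, whitespace collapsed, then hashed.
        canonical = json.dumps(sample, sort_keys=True, ensure_ascii=False)
        normalized = " ".join(canonical.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add_if_new(self, sample) -> bool:
        """Record the sample and return True only if it was not seen before."""
        fp = self._fingerprint(sample)
        with self._lock:  # check-and-add must be atomic across threads
            if fp in self._seen:
                return False
            self._seen.add(fp)
            return True


def fake_generate_batch(batch_id: int):
    """Stub standing in for one LLM call; batches overlap on purpose."""
    return [{"instruction": f"Question {i}"} for i in range(batch_id, batch_id + 3)]


def generate_deduplicated(seeds, n_batches: int, concurrency: int = 3):
    dedup = Deduplicator(seeds)

    def worker(batch_id: int):
        # Each worker filters its own batch against the shared seen-set.
        return [s for s in fake_generate_batch(batch_id) if dedup.add_if_new(s)]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        batches = list(pool.map(worker, range(n_batches)))
    return [sample for batch in batches for sample in batch]
```

Which batch ends up keeping a contested sample depends on thread timing, but the set of kept samples does not: each distinct fingerprint is accepted exactly once.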

Ecosystem

DataSynth is part of the knowlyr data infrastructure:

| Layer | Project | Role |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset intelligence and trend analysis |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation |
| Production | DataSynth | LLM synthesis, auto templates, schema validation, cost estimation |
| Production | DataLabel | Zero-server annotation, LLM pre-labeling, IAA analysis |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |


knowlyr -- LLM-powered synthetic dataset generation with quality-diversity optimization
