Open Source · Python · MIT
DataSynth

Data Synthesis

Updated 2026-03-15
Multi-strategy LLM data synthesis engine: seed evolution, template synthesis, and batch generation cover scenarios from cold start to scale. It ships with built-in quality-diversity tradeoff mechanisms, semantic deduplication, and phased cost modeling.
Multi-Strategy Synthesis · Quality-Diversity Tradeoff · Semantic Deduplication

Quick Start

Install
pip install knowlyr-datasynth
Usage
from datasynth import DataSynthesizer, SynthesisConfig

config = SynthesisConfig(target_count=100)
synth = DataSynthesizer(config)

  • prepare_synthesis -- Prepare a data synthesis prompt (interactive mode; does not call the LLM directly)
  • parse_synthesis_result -- Parse LLM-generated synthetic data and save it to a file
  • synthesize_data -- Call the LLM directly to generate synthetic data (requires an API key)
  • validate_data -- Validate a data file against a schema
  • synth_augment -- Augment existing data with variants (rewriting, back-translation, perturbation, style transfer)
  • synth_batch -- Synthesize data in batches (supports progress tracking and checkpoint resume)
  • synth_evaluate -- Quick-check synthetic data across multiple dimensions (diversity, fidelity, quality distribution)
  • estimate_synthesis_cost -- Estimate synthesis cost
  • synth_translate -- Translate synthetic data to a target language (preserving format and label structure)
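
The checkpoint resume behaviour listed for synth_batch can be pictured with a small sketch. It assumes output is appended to a JSONL file and that a resumed run simply counts what is already there; the file layout and the `load_checkpoint` and `generate_with_resume` helpers are hypothetical, not DataSynth's actual implementation, and the LLM call is replaced by a stand-in.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> list[dict]:
    """Read samples already written by a previous, possibly interrupted, run."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def generate_with_resume(path: Path, target_count: int, batch_size: int) -> int:
    """Append batches to a JSONL file until target_count is reached.

    Returns the number of samples generated in this run.
    """
    done = len(load_checkpoint(path))
    generated = 0
    while done < target_count:
        n = min(batch_size, target_count - done)
        # Stand-in for an LLM call; a real run would parse model output here.
        batch = [{"id": done + i, "text": f"sample {done + i}"} for i in range(n)]
        with path.open("a", encoding="utf-8") as f:
            for sample in batch:
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
        done += n
        generated += n
    return generated


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "synthetic.jsonl"
    first = generate_with_resume(out, target_count=25, batch_size=10)   # fresh run
    second = generate_with_resume(out, target_count=40, batch_size=10)  # resumed run
    total = len(load_checkpoint(out))

print(first, second, total)
```

Writing each batch to disk as soon as it completes means an interruption loses at most one batch, and the resumed run only pays for the samples still missing.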

Documentation

DataSynth

LLM-Powered Synthetic Dataset Generation with Quality-Diversity Optimization

Seed-to-scale synthetic data engine with auto-detected templates, concurrent generation, schema validation, and precise cost estimation

GitHub · PyPI · knowlyr.com

Why DataSynth?

High-quality training data is the key bottleneck for LLM performance. Manual annotation is expensive ($0.1--$10 per sample), slow (100 samples/day), and inconsistent across annotators. Naive LLM batch calls lack quality guarantees -- duplicate samples, schema violations, and distribution skew go undetected.

DataSynth bridges this gap: starting from ~50 seed samples, it auto-detects data types, selects specialized prompt templates, generates data via concurrent LLM calls, validates against schema constraints, and deduplicates across batches -- all at $0.001--$0.01 per sample.
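
As an illustration of the auto-detection step, the sketch below guesses a dataset's type from the field names of its seed samples. The field names (`chosen`, `rejected`, `messages`, `instruction`, `response`) and both helper functions are assumptions made for this example; DataSynth's real detection rules may differ.

```python
from collections import Counter


# Hypothetical field-name heuristics; DataSynth's real rules may differ.
def detect_sample_type(sample: dict) -> str:
    """Guess the data type of one seed sample from its keys."""
    keys = set(sample)
    if {"chosen", "rejected"} <= keys:
        return "preference_pair"
    if isinstance(sample.get("messages"), list):
        return "multi_turn_dialogue"
    if {"instruction", "response"} <= keys:
        return "instruction_response"
    return "unknown"


def detect_data_type(seeds: list[dict]) -> str:
    """Return the most common type across all seed samples."""
    if not seeds:
        return "unknown"
    counts = Counter(detect_sample_type(s) for s in seeds)
    return counts.most_common(1)[0][0]


seeds = [
    {"instruction": "Summarize the text.", "response": "A short summary."},
    {"instruction": "Translate to French.", "response": "Bonjour."},
    {"prompt": "Pick one.", "chosen": "A", "rejected": "B"},
]
print(detect_data_type(seeds))  # instruction_response
```

Taking a majority vote over all seeds, rather than inspecting only the first one, keeps a single malformed sample from selecting the wrong prompt template.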

Core Features

  • Auto-Detected Templates -- Automatically identifies instruction-response, preference pairs (DPO/RLHF), or multi-turn dialogue and applies specialized prompts
  • Concurrent Generation -- Multi-batch parallel LLM calls with thread-safe deduplication and incremental resume (--resume)
  • Schema Validation -- Type checking, range/enum/length constraints; non-compliant samples are filtered automatically
  • Precise Cost Estimation -- Per-model pricing with --dry-run to estimate before generating
  • Post-Generation Hooks -- Auto-trigger downstream quality checks after generation completes
  • Distribution Statistics -- Field-level distribution reports for generated datasets
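
The schema validation feature above (type, range, enum, and length constraints, with non-compliant samples filtered out) could look roughly like the following. The schema format, the constraint keys, and the `validate_sample` and `filter_valid` helpers are all hypothetical and chosen only for illustration.

```python
# Hypothetical schema format: field -> constraint dict. Not DataSynth's real schema.
SCHEMA = {
    "instruction": {"type": str, "min_length": 5, "max_length": 500},
    "difficulty": {"type": int, "min": 1, "max": 5},
    "category": {"type": str, "enum": {"math", "code", "writing"}},
}


def validate_sample(sample: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the sample is valid."""
    errors = []
    for field, rules in schema.items():
        if field not in sample:
            errors.append(f"{field}: missing")
            continue
        value = sample[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: {value!r} not in allowed values")
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: above maximum {rules['max']}")
        if "min_length" in rules and len(value) < rules["min_length"]:
            errors.append(f"{field}: shorter than {rules['min_length']}")
        if "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"{field}: longer than {rules['max_length']}")
    return errors


def filter_valid(samples: list[dict], schema: dict) -> list[dict]:
    """Keep only samples with no violations."""
    return [s for s in samples if not validate_sample(s, schema)]


samples = [
    {"instruction": "Solve 2x + 3 = 7.", "difficulty": 2, "category": "math"},
    {"instruction": "Hi", "difficulty": 9, "category": "poetry"},
]
valid = filter_valid(samples, SCHEMA)
print(len(valid))  # 1
```

Returning the list of violations instead of a bare boolean makes it possible to report why samples were dropped, which helps when tuning prompts.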

Quick Start

pip install knowlyr-datasynth
export ANTHROPIC_API_KEY=your_key

# Generate 100 samples from DataRecipe analysis output
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 100

# Concurrent generation with cost estimation
knowlyr-datasynth generate ./output/ -n 1000 --concurrency 3 --dry-run

# Resume after interruption
knowlyr-datasynth generate ./output/ -n 1000 --resume

# Interactive mode (no API key needed)
knowlyr-datasynth prepare ./analysis_output/my_dataset/ -n 10

from datasynth import SynthEngine

engine = SynthEngine(model="claude-sonnet-4-20250514")
result = engine.generate(
    analysis_dir="./analysis_output/my_dataset/",
    target_count=100,
    concurrency=3,
)
print(f"Generated: {result.generated_count}, Cost: ${result.cost_usd:.4f}")
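
For intuition about where a cost figure like the one printed above comes from, here is a back-of-the-envelope estimate of the kind a dry run might produce. The model name, the per-million-token prices, and the token counts are placeholder assumptions, not real pricing and not DataSynth's internal cost model.

```python
import math

# Hypothetical per-million-token prices in USD; check your provider's current pricing.
PRICING = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def estimate_cost(
    model: str,
    target_count: int,
    samples_per_call: int,
    prompt_tokens_per_call: int,
    output_tokens_per_sample: int,
) -> dict:
    """Estimate total and per-sample cost before making any LLM call."""
    price = PRICING[model]
    calls = math.ceil(target_count / samples_per_call)
    input_tokens = calls * prompt_tokens_per_call
    output_tokens = target_count * output_tokens_per_sample
    total = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
    return {"calls": calls, "total_usd": total, "per_sample_usd": total / target_count}


estimate = estimate_cost(
    model="example-model",
    target_count=1000,
    samples_per_call=10,
    prompt_tokens_per_call=2000,
    output_tokens_per_sample=200,
)
print(
    f"{estimate['calls']} calls, ${estimate['total_usd']:.2f} total, "
    f"${estimate['per_sample_usd']:.4f} per sample"
)
```

With these placeholder numbers the estimate is 100 calls and $3.60 in total, or $0.0036 per sample. Packing more samples into each call spreads the prompt tokens over more outputs and lowers the per-sample figure.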

Pipeline

graph LR
    Seed["Seed Data<br/>(~50 samples)"] --> Detect["Type Detector<br/>Auto-detect"]
    Detect --> Template["Template<br/>Specialized Prompt"]
    Template --> Gen["Generator<br/>Concurrent Batches"]
    Gen --> Val["Validator<br/>Schema Constraints"]
    Val --> Dedup["Deduplicator<br/>Seed + Cross-batch"]
    Dedup --> Stats["Statistics<br/>Distribution Report"]

    style Gen fill:#0969da,color:#fff,stroke:#0969da
    style Val fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style Dedup fill:#2da44e,color:#fff,stroke:#2da44e
    style Seed fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Detect fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Template fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Stats fill:#1a1a2e,color:#e0e0e0,stroke:#444
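
The Generator and Deduplicator stages can be sketched as concurrent workers sharing one lock-protected set of fingerprints, pre-loaded with the seeds. This is a simplified stand-in: it uses exact matching on normalized text rather than the semantic deduplication the project describes, and the `Deduplicator` class, `fake_generate_batch`, and `run_batch` are hypothetical names.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class Deduplicator:
    """Thread-safe exact-match deduplicator keyed on normalized text."""

    def __init__(self, seeds: list[str]):
        self._lock = threading.Lock()
        # Pre-load seed fingerprints so generated copies of seeds are rejected.
        self._seen = {self._fingerprint(s) for s in seeds}

    @staticmethod
    def _fingerprint(text: str) -> str:
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add_if_new(self, text: str) -> bool:
        """Record the text and return True only if it was not seen before."""
        key = self._fingerprint(text)
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


def fake_generate_batch(batch_id: int) -> list[str]:
    """Stand-in for an LLM call; returns overlapping outputs on purpose."""
    return [f"Sample {batch_id}", "Shared sample", "seed question"]


dedup = Deduplicator(seeds=["Seed question"])


def run_batch(batch_id: int) -> list[str]:
    """Generate one batch and keep only samples no other batch has produced."""
    return [s for s in fake_generate_batch(batch_id) if dedup.add_if_new(s)]


with ThreadPoolExecutor(max_workers=3) as pool:
    kept = [s for batch in pool.map(run_batch, range(3)) for s in batch]

print(sorted(kept))  # ['Sample 0', 'Sample 1', 'Sample 2', 'Shared sample']
```

The check and the insert happen under the same lock, so two batches that produce the same sample at the same moment cannot both keep it.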

Ecosystem

DataSynth is part of the knowlyr data infrastructure:

Layer      | Project          | Role
Discovery  | AI Dataset Radar | Dataset intelligence and trend analysis
Analysis   | DataRecipe       | Reverse analysis, schema extraction, cost estimation
Production | DataSynth        | LLM synthesis, auto templates, schema validation, cost estimation
Production | DataLabel        | Zero-server annotation, LLM pre-labeling, IAA analysis
Quality    | DataCheck        | Rule validation, anomaly detection, auto-fix
Audit      | ModelAudit       | Distillation detection, model fingerprinting

knowlyr -- LLM-powered synthetic dataset generation with quality-diversity optimization

Want to discuss this project? Reach out to:

  • Kai, Founder & CEO
  • 罗清河, AI Data Engineer