Multi-strategy LLM data synthesis engine — seed evolution, template synthesis, and batch generation cover scenarios from cold start to scale. Built-in quality-diversity tradeoff mechanisms, semantic deduplication, and phased cost modeling.

Multi-Strategy Synthesis · Quality-Diversity Tradeoff · Semantic Deduplication

Quick Start

Install

pip install knowlyr-datasynth

Usage

from datasynth import DataSynthesizer, SynthesisConfig

config = SynthesisConfig(target_count=100)
synth = DataSynthesizer(config)

MCP Tools

9 callable endpoints

prepare_synthesis -- Prepare data synthesis prompt (interactive mode, does not directly call LLM)
parse_synthesis_result -- Parse LLM-generated synthetic data and save to file
synthesize_data -- Directly call LLM to generate synthetic data (requires API key)
validate_data -- Validate data file against Schema
synth_augment -- Augment existing data with variants (rewriting/back-translation/perturbation/style transfer)
synth_batch -- Batch synthesize data (supports progress tracking and checkpoint resume)
synth_evaluate -- Quick-check synthetic data across multiple dimensions (diversity/fidelity/quality distribution)
estimate_synthesis_cost -- Estimate synthesis cost
synth_translate -- Translate synthetic data to target language (preserving format and label structure)
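
The diversity dimension reported by synth_evaluate is not specified here. As a rough illustration of what such a check can measure, the sketch below computes distinct-n, a common lexical diversity score (unique n-grams divided by total n-grams). This is a stand-in metric chosen for the example, not necessarily the one DataSynth uses, and `distinct_n` is a hypothetical helper name.

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all texts.

    Values near 1.0 mean little repeated phrasing; values near 0.0 mean
    the samples reuse the same word sequences.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # zip over shifted token lists yields consecutive n-grams
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


samples = ["the cat sat", "the cat ran"]
score = distinct_n(samples, n=2)  # 3 unique bigrams out of 4
print(f"distinct-2: {score:.2f}")
```

A score that drops as more batches are generated is one sign that the generator is collapsing onto a few phrasings.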

Documentation

DataSynth

Name: DataSynth
Author: Knowlyr

LLM-Powered Synthetic Dataset Generation with Quality-Diversity Optimization

Seed-to-scale synthetic data engine with auto-detected templates, concurrent generation, schema validation, and precise cost estimation

GitHub · PyPI · knowlyr.com

Why DataSynth?

High-quality training data is the key bottleneck for LLM performance. Manual annotation is expensive ($0.1--$10 per sample), slow (100 samples/day), and inconsistent across annotators. Naive LLM batch calls lack quality guarantees -- duplicate samples, schema violations, and distribution skew go undetected.

DataSynth bridges this gap: starting from ~50 seed samples, it auto-detects data types, selects specialized prompt templates, generates data via concurrent LLM calls, validates against schema constraints, and deduplicates across batches -- all at $0.001--$0.01 per sample.
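
A minimal sketch of how data-type detection of this kind might work, assuming each seed sample is a dictionary. The field names (`messages`, `chosen`, `rejected`, `instruction`, `response`), the type labels, and the majority-vote rule are all illustrative assumptions, not DataSynth's documented behavior.

```python
from collections import Counter


def detect_sample_type(sample: dict) -> str:
    """Classify one seed sample by the fields it carries (hypothetical field names)."""
    if isinstance(sample.get("messages"), list):
        return "multi_turn_dialogue"
    if "chosen" in sample and "rejected" in sample:
        return "preference_pair"
    if "instruction" in sample and "response" in sample:
        return "instruction_response"
    return "unknown"


def detect_dataset_type(seeds: list[dict]) -> str:
    """Pick the most common sample type so a few odd seeds do not flip the result."""
    if not seeds:
        return "unknown"
    votes = Counter(detect_sample_type(s) for s in seeds)
    return votes.most_common(1)[0][0]


seeds = [
    {"prompt": "Summarize the text.", "chosen": "Short summary.", "rejected": "Off-topic reply."},
    {"prompt": "Fix the bug.", "chosen": "Patched code.", "rejected": "Unchanged code."},
    {"instruction": "Translate to French.", "response": "Bonjour."},
]
print(detect_dataset_type(seeds))
```

The detected label would then select which prompt template to use for generation.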

Core Features

Auto-Detected Templates -- Automatically identifies instruction-response pairs, preference pairs (DPO/RLHF), or multi-turn dialogues and applies a specialized prompt for each
Concurrent Generation -- Multi-batch parallel LLM calls with thread-safe deduplication and incremental resume (--resume)
Schema Validation -- Type checking, range/enum/length constraints; non-compliant samples are filtered automatically
Precise Cost Estimation -- Per-model pricing; use --dry-run to estimate cost before generating
Post-Generation Hooks -- Auto-trigger downstream quality checks after generation completes
Distribution Statistics -- Field-level distribution reports for generated datasets
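
To make the schema validation feature concrete, here is a minimal sketch of type, range, enum, and length checks followed by filtering. The schema layout (`type`, `min`, `max`, `enum`, `max_length` keys) and the `validate_sample` helper are assumptions made for this example; DataSynth's actual schema format may differ.

```python
def validate_sample(sample: dict, schema: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means the sample is valid."""
    errors = []
    for field, rule in schema.items():
        if field not in sample:
            errors.append(f"{field}: missing")
            continue
        value = sample[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue  # remaining checks assume the right type
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above maximum {rule['max']}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: not one of {sorted(rule['enum'])}")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{field}: longer than {rule['max_length']}")
    return errors


# Hypothetical schema for an instruction dataset
SCHEMA = {
    "instruction": {"type": str, "max_length": 200},
    "difficulty": {"type": int, "min": 1, "max": 5},
    "category": {"type": str, "enum": {"math", "code", "writing"}},
}

samples = [
    {"instruction": "Add 2 and 3.", "difficulty": 1, "category": "math"},
    {"instruction": "Sort a list.", "difficulty": 9, "category": "code"},      # range violation
    {"instruction": "Write a poem.", "difficulty": 2, "category": "poetry"},   # enum violation
]

# Keep only samples with no violations
valid = [s for s in samples if not validate_sample(s, SCHEMA)]
print(f"kept {len(valid)} of {len(samples)} samples")
```

Returning the list of violations rather than a boolean makes it easy to log why a sample was dropped.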

Quick Start

pip install knowlyr-datasynth
export ANTHROPIC_API_KEY=your_key

# Generate 100 samples from DataRecipe analysis output
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 100

# Estimate the cost of a concurrent 1000-sample run without generating (--dry-run)
knowlyr-datasynth generate ./output/ -n 1000 --concurrency 3 --dry-run

# Resume after interruption
knowlyr-datasynth generate ./output/ -n 1000 --resume

# Interactive mode (no API key needed)
knowlyr-datasynth prepare ./analysis_output/my_dataset/ -n 10

from datasynth import SynthEngine

engine = SynthEngine(model="claude-sonnet-4-20250514")
result = engine.generate(
    analysis_dir="./analysis_output/my_dataset/",
    target_count=100,
    concurrency=3,
)
print(f"Generated: {result.generated_count}, Cost: ${result.cost_usd:.4f}")
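
A cost estimate of the kind --dry-run reports can be reasoned about as token counts multiplied by per-model prices. The sketch below shows that arithmetic only; the model name, the prices, the batch size, and the token counts are placeholder assumptions, and `estimate_cost` is a hypothetical helper rather than part of the DataSynth API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Placeholder prices for illustration only; check your provider's price list
PRICES = {"example-model": ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0)}


def estimate_cost(
    model: str,
    target_count: int,
    batch_size: int,
    prompt_tokens_per_batch: int,
    output_tokens_per_sample: int,
) -> float:
    """Estimated USD cost: one prompt per batch, one completion per sample."""
    price = PRICES[model]
    batches = math.ceil(target_count / batch_size)
    input_tokens = batches * prompt_tokens_per_batch
    output_tokens = target_count * output_tokens_per_sample
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


cost = estimate_cost(
    model="example-model",
    target_count=1000,
    batch_size=10,
    prompt_tokens_per_batch=2000,
    output_tokens_per_sample=300,
)
print(f"Estimated: ${cost:.2f} total, ${cost / 1000:.4f} per sample")
```

Under these assumed numbers the estimate is 100 batches, 200,000 input tokens, and 300,000 output tokens. A real estimate would also need to budget for samples that are later rejected by validation or deduplication.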

Pipeline

graph LR
    Seed["Seed Data<br/>(~50 samples)"] --> Detect["Type Detector<br/>Auto-detect"]
    Detect --> Template["Template<br/>Specialized Prompt"]
    Template --> Gen["Generator<br/>Concurrent Batches"]
    Gen --> Val["Validator<br/>Schema Constraints"]
    Val --> Dedup["Deduplicator<br/>Seed + Cross-batch"]
    Dedup --> Stats["Statistics<br/>Distribution Report"]

    style Gen fill:#0969da,color:#fff,stroke:#0969da
    style Val fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style Dedup fill:#2da44e,color:#fff,stroke:#2da44e
    style Seed fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Detect fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Template fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Stats fill:#1a1a2e,color:#e0e0e0,stroke:#444
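
The Generator and Deduplicator stages above can be sketched as a thread pool whose workers share one lock-protected set of seen samples, pre-loaded with the seeds. This is an illustration under stated assumptions: it uses exact matching on normalized text (not semantic similarity), `fake_generate_batch` stands in for a real LLM call, and the `Deduplicator` class here is hypothetical rather than DataSynth's own.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class Deduplicator:
    """Thread-safe exact-match deduplication over normalized text."""

    def __init__(self, seeds: list[str]):
        self._lock = threading.Lock()
        # Pre-load seeds so generated copies of seed samples are rejected
        self._seen = {self._key(s) for s in seeds}

    @staticmethod
    def _key(text: str) -> str:
        normalized = " ".join(text.lower().split())  # case and whitespace insensitive
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add_if_new(self, text: str) -> bool:
        key = self._key(text)
        with self._lock:  # check-and-insert must be atomic across threads
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


def fake_generate_batch(batch_id: int) -> list[str]:
    """Stand-in for one LLM call; returns a fresh sample plus two repeats."""
    return [f"Explain topic {batch_id}", "Explain topic 0", "what is  2 + 2?"]


def generate(seeds: list[str], n_batches: int, concurrency: int = 3) -> list[str]:
    dedup = Deduplicator(seeds)

    def worker(batch_id: int) -> list[str]:
        return [s for s in fake_generate_batch(batch_id) if dedup.add_if_new(s)]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return [s for batch in pool.map(worker, range(n_batches)) for s in batch]


kept = generate(seeds=["What is 2 + 2?"], n_batches=3)
print(sorted(kept))
```

Holding the lock across both the membership test and the insert is what prevents two batches from accepting the same sample at once.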

Ecosystem

DataSynth is part of the knowlyr data infrastructure:

Layer	Project	Role
Discovery	AI Dataset Radar	Dataset intelligence and trend analysis
Analysis	DataRecipe	Reverse analysis, schema extraction, cost estimation
Production	DataSynth	LLM synthesis, auto templates, schema validation, cost estimation
Production	DataLabel	Zero-server annotation, LLM pre-labeling, IAA analysis
Quality	DataCheck	Rule validation, anomaly detection, auto-fix
Audit	ModelAudit	Distillation detection, model fingerprinting