Quick Start
pip install knowlyr-datasynth
from datasynth import DataSynthesizer, SynthesisConfig
config = SynthesisConfig(target_count=100)
synth = DataSynthesizer(config)
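- prepare_synthesis -- Prepare the data synthesis prompt (interactive mode; does not call the LLM directly)
- parse_synthesis_result -- Parse LLM-generated synthetic data and save it
- synthesize_data -- Call the LLM directly to generate synthetic data (requires an API key)
- validate_data -- Validate whether a data file conforms to the schema
- synth_augment -- Augment existing data with variants (paraphrasing, back-translation, perturbation, style transfer)
- synth_batch -- Synthesize data in batches (with progress tracking and resumable runs)
- synth_evaluate -- Run a quick multi-dimensional check on synthetic data (diversity, fidelity, quality distribution)
- estimate_synthesis_cost -- Estimate the synthesis cost
- synth_translate -- Translate synthetic data into a target language (preserving format and label structure)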
Documentation
DataSynth
LLM-Powered Synthetic Dataset Generation with Quality-Diversity Optimization
Seed-to-scale synthetic data engine with auto-detected templates, concurrent generation, schema validation, and precise cost estimation
GitHub · PyPI · knowlyr.com
Why DataSynth?
High-quality training data is the key bottleneck for LLM performance. Manual annotation is expensive ($0.1--$10 per sample), slow (100 samples/day), and inconsistent across annotators. Naive LLM batch calls lack quality guarantees -- duplicate samples, schema violations, and distribution skew go undetected.
DataSynth bridges this gap: starting from ~50 seed samples, it auto-detects data types, selects specialized prompt templates, generates data via concurrent LLM calls, validates against schema constraints, and deduplicates across batches -- all at $0.001--$0.01 per sample.
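As a rough illustration of the auto-detection step, the sketch below classifies seed samples by the keys they contain and picks the most common type. The field names (`instruction`, `response`, `chosen`, `rejected`, `messages`) and both helper functions are assumptions made for this example; DataSynth's actual detector may use different signals.

```python
from collections import Counter


def detect_sample_type(sample: dict) -> str:
    """Classify one seed sample by its keys (hypothetical field names)."""
    keys = set(sample)
    if "messages" in keys and isinstance(sample["messages"], list):
        return "multi_turn_dialogue"
    if {"chosen", "rejected"} <= keys:
        return "preference_pair"
    if {"instruction", "response"} <= keys:
        return "instruction_response"
    return "unknown"


def detect_dataset_type(seeds: list[dict]) -> str:
    """Return the most common per-sample type across the seed set."""
    counts = Counter(detect_sample_type(s) for s in seeds)
    return counts.most_common(1)[0][0] if counts else "unknown"


seeds = [
    {"instruction": "Translate 'hello' to French.", "response": "bonjour"},
    {"instruction": "Add 2 and 3.", "response": "5"},
    {"prompt": "Pick one.", "chosen": "A", "rejected": "B"},
]
print(detect_dataset_type(seeds))  # instruction_response
```

A majority vote keeps a few malformed seeds from flipping the detected type, which is one plausible reason to detect at the dataset level rather than per sample.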
Core Features
- Auto-Detected Templates -- Automatically identifies instruction-response, preference pairs (DPO/RLHF), or multi-turn dialogue and applies specialized prompts
- Concurrent Generation -- Multi-batch parallel LLM calls with thread-safe deduplication and incremental resume (--resume)
- Schema Validation -- Type checking, range/enum/length constraints; non-compliant samples are filtered automatically
- Precise Cost Estimation -- Per-model pricing with --dry-run to estimate before generating
- Post-Generation Hooks -- Auto-trigger downstream quality checks after generation completes
- Distribution Statistics -- Field-level distribution reports for generated datasets
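To make the schema validation feature concrete, here is a minimal sketch of type, range, enum, and length checks that filters out non-compliant samples. The schema layout (`type`, `enum`, `min`, `max`, `max_length`) and the `validate_sample` helper are hypothetical and only illustrate the idea; they are not DataSynth's schema format.

```python
def validate_sample(sample: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the sample complies."""
    errors = []
    for field, rule in schema.items():
        if field not in sample:
            errors.append(f"{field}: missing")
            continue
        value = sample[field]
        # Type check first, so the later comparisons are safe to run.
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{field}: not in enum")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below min")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{field}: above max")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{field}: too long")
    return errors


# Hypothetical schema for a small sentiment dataset.
SCHEMA = {
    "text": {"type": str, "max_length": 20},
    "label": {"type": str, "enum": ["pos", "neg"]},
    "score": {"type": int, "min": 1, "max": 5},
}

samples = [
    {"text": "great product", "label": "pos", "score": 5},
    {"text": "bad", "label": "meh", "score": 2},    # label outside the enum
    {"text": "okay", "label": "neg", "score": 9},   # score above the range
]
valid = [s for s in samples if not validate_sample(s, SCHEMA)]
print(len(valid))  # 1
```

Returning the list of violations instead of a boolean makes it easy to log why a generated sample was dropped.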
Quick Start
pip install knowlyr-datasynth
export ANTHROPIC_API_KEY=your_key
# Generate 100 samples from DataRecipe analysis output
knowlyr-datasynth generate ./analysis_output/my_dataset/ -n 100
# Concurrent generation with cost estimation
knowlyr-datasynth generate ./output/ -n 1000 --concurrency 3 --dry-run
# Resume after interruption
knowlyr-datasynth generate ./output/ -n 1000 --resume
# Interactive mode (no API key needed)
knowlyr-datasynth prepare ./analysis_output/my_dataset/ -n 10
from datasynth import SynthEngine
engine = SynthEngine(model="claude-sonnet-4-20250514")
result = engine.generate(
    analysis_dir="./analysis_output/my_dataset/",
    target_count=100,
    concurrency=3,
)
print(f"Generated: {result.generated_count}, Cost: ${result.cost_usd:.4f}")
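For intuition about what a dry-run estimate involves, the sketch below multiplies expected token counts by per-model prices. The model name, the prices, the token counts, and the `estimate_cost` helper are all placeholder assumptions for illustration; they are neither real vendor pricing nor part of the DataSynth API.

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Placeholder prices for illustration only; not real vendor pricing.
PRICES = {"example-model": ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0)}


def estimate_cost(model: str, target_count: int, batch_size: int,
                  prompt_tokens_per_batch: int,
                  output_tokens_per_sample: int) -> float:
    """Estimate total USD cost without calling any LLM."""
    price = PRICES[model]
    batches = -(-target_count // batch_size)  # ceiling division
    input_tokens = batches * prompt_tokens_per_batch
    output_tokens = target_count * output_tokens_per_sample
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


cost = estimate_cost("example-model", target_count=1000, batch_size=10,
                     prompt_tokens_per_batch=2000, output_tokens_per_sample=200)
print(f"Total: ${cost:.2f}, per sample: ${cost / 1000:.4f}")
```

With these assumed numbers the estimate comes to $3.60 for 1,000 samples, about $0.0036 per sample.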
Pipeline
graph LR
Seed["Seed Data<br/>(~50 samples)"] --> Detect["Type Detector<br/>Auto-detect"]
Detect --> Template["Template<br/>Specialized Prompt"]
Template --> Gen["Generator<br/>Concurrent Batches"]
Gen --> Val["Validator<br/>Schema Constraints"]
Val --> Dedup["Deduplicator<br/>Seed + Cross-batch"]
Dedup --> Stats["Statistics<br/>Distribution Report"]
style Gen fill:#0969da,color:#fff,stroke:#0969da
style Val fill:#8b5cf6,color:#fff,stroke:#8b5cf6
style Dedup fill:#2da44e,color:#fff,stroke:#2da44e
style Seed fill:#1a1a2e,color:#e0e0e0,stroke:#444
style Detect fill:#1a1a2e,color:#e0e0e0,stroke:#444
style Template fill:#1a1a2e,color:#e0e0e0,stroke:#444
style Stats fill:#1a1a2e,color:#e0e0e0,stroke:#444
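The Generator, Validator, and Deduplicator stages in the diagram can be sketched with the standard library alone. In the example below, `fake_generate_batch` stands in for an LLM call, a non-empty check stands in for schema validation, and the `Deduplicator` class is a hypothetical helper; none of these names come from DataSynth itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Deduplicator:
    """Thread-safe set of normalized texts, pre-loaded with the seed samples."""

    def __init__(self, seeds):
        self._seen = {self._key(s) for s in seeds}
        self._lock = threading.Lock()

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so near-identical strings collide.
        return " ".join(text.lower().split())

    def add_if_new(self, text: str) -> bool:
        key = self._key(text)
        with self._lock:  # check-and-insert must be atomic across threads
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


def fake_generate_batch(batch_id: int) -> list[str]:
    """Stand-in for an LLM call; returns overlapping outputs on purpose."""
    return [f"Sample {batch_id}", f"sample  {batch_id}", "Shared sample", "seed one"]


def run(seeds, n_batches, concurrency):
    dedup = Deduplicator(seeds)
    kept = []
    kept_lock = threading.Lock()

    def work(batch_id):
        for text in fake_generate_batch(batch_id):
            # Non-empty check stands in for schema validation, then dedup.
            if text.strip() and dedup.add_if_new(text):
                with kept_lock:
                    kept.append(text)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(work, range(n_batches)))
    return kept


result = run(seeds=["Seed one"], n_batches=4, concurrency=3)
print(len(result))  # 5
```

Seeding the deduplicator with the original samples covers the "Seed + Cross-batch" case: generated items that merely repeat a seed are dropped alongside duplicates produced by other batches.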
Ecosystem
DataSynth is part of the knowlyr data infrastructure:
| Layer | Project | Role |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset intelligence and trend analysis |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation |
| Production | DataSynth | LLM synthesis, auto templates, schema validation, cost estimation |
| Production | DataLabel | Zero-server annotation, LLM pre-labeling, IAA analysis |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |
knowlyr -- LLM-powered synthetic dataset generation with quality-diversity optimization