Quick Start
pip install knowlyr-datacheck
from datacheck import DataChecker
checker = DataChecker()
report = checker.check_file("training_data.json")
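check_data_quality
Check the quality of a data file (supports JSON/JSONL/CSV)
validate_from_datarecipe
Validate data against DataRecipe analysis results
compare_distributions
Compare distributions across multiple data files (supports JSON/JSONL/CSV)
list_quality_rules
List all available quality check rules
infer_schema
Infer a schema from a data file (field types, constraints, required fields)
fix_data
Fix common quality issues in a data file (deduplication, whitespace stripping, PII redaction)
batch_check_directory
Batch-check the quality of all data files in a directory (recursively scans JSON/JSONL/CSV)
check_drift
Detect distribution drift between two data files (differences in numeric statistics, changes in category distribution, text feature comparison)
check_leakage
Detect data leakage between training and test sets (exact duplicates plus near-duplicates by token Jaccard similarity)
check_bias
Detect dataset bias (class imbalance, skew in text length distribution, skew in language distribution)
check_coverage
Measure dataset coverage: field completeness, missing-value ratio, and unique-value distribution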
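The near-duplicate check that check_leakage describes relies on token Jaccard similarity. The following is a minimal sketch of that idea, not DataCheck's actual implementation: the function names, the whitespace tokenization, the lowercasing, and the 0.8 threshold are all assumptions made for illustration.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def find_leaks(train: list[str], test: list[str], threshold: float = 0.8):
    """Return (test_index, train_index, kind) for each suspected leak.

    kind is "exact" for identical strings and "near" when the token
    Jaccard similarity reaches the (hypothetical) threshold.
    """
    train_exact = {text: i for i, text in enumerate(train)}
    leaks = []
    for j, sample in enumerate(test):
        if sample in train_exact:
            leaks.append((j, train_exact[sample], "exact"))
            continue
        # Brute-force scan; fine for a sketch, quadratic on large sets.
        for i, candidate in enumerate(train):
            if token_jaccard(sample, candidate) >= threshold:
                leaks.append((j, i, "near"))
                break
    return leaks
```

The pairwise scan is quadratic in the number of samples, so a real tool would likely need an index or hashing scheme for large datasets.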
Documentation
DataCheck
Multi-Dimensional Data Quality Validation with Statistical Anomaly Detection
Automated quality validation for LLM training data — composable rules, IQR/Z-score anomaly detection, and auto-fix pipeline
Why DataCheck?
Training data quality is the hidden bottleneck of model performance. Overlooked format errors, hidden PII leaks, undetected duplicate samples — any single issue can amplify into systematic bias downstream.
Existing quality solutions are either one-off scripts (not reusable) or heavyweight platforms (expensive to deploy), and generally lack statistical anomaly detection and auto-fix capabilities.
DataCheck solves this with a composable rule engine that provides end-to-end data quality validation:
- 9 Built-in Rules covering completeness, validity, privacy, and consistency
- IQR / Z-score Dual-Method anomaly detection for numeric and text length outliers
- LLM-Assisted Evaluation for instruction clarity and response relevance
- Auto-Fix Pipeline — dedup, strip whitespace, PII redaction
- Report Diff — quantify quality improvements before vs. after fixes
Get Started in 30 Seconds
pip install knowlyr-datacheck
# Check your data
knowlyr-datacheck check data.json
# Auto-fix issues
knowlyr-datacheck fix data.jsonl -o fixed.jsonl --strip-pii
# Compare before/after
knowlyr-datacheck diff report_v1.json report_v2.json
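To give a feel for what a fix pass involves conceptually (dedup, whitespace trimming, PII redaction), here is a small Python sketch. It is an illustration only and does not reflect how knowlyr-datacheck implements `fix`: the `fix_records` function, the email-only regex, and the `[EMAIL]` placeholder are all hypothetical.

```python
import json
import re

# Hypothetical, deliberately simple email pattern; real PII detection covers far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def fix_records(records, strip_pii=False):
    """Trim string fields, optionally redact emails, then drop exact duplicates."""
    seen = set()
    fixed = []
    for record in records:
        cleaned = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip()
                if strip_pii:
                    value = EMAIL_RE.sub("[EMAIL]", value)
            cleaned[key] = value
        # Canonical JSON serves as a fingerprint for duplicate detection.
        fingerprint = json.dumps(cleaned, sort_keys=True, ensure_ascii=False)
        if fingerprint not in seen:
            seen.add(fingerprint)
            fixed.append(cleaned)
    return fixed
```

Deduplicating after cleaning means that records differing only in surrounding whitespace collapse into one, which is usually the desired behavior for training data.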
Quality Pipeline
graph LR
D["Data Files<br/>JSON / JSONL / CSV"] --> R["Rule Engine<br/>9 Rules + YAML Custom"]
R --> A["Anomaly Detector<br/>IQR / Z-score"]
A --> Rep["Quality Report<br/>MD / JSON / HTML"]
Rep --> Fix["Auto Fix<br/>Dedup · PII · Trim"]
Fix --> Diff["Report Diff<br/>Before vs After"]
style R fill:#0969da,color:#fff,stroke:#0969da
style A fill:#8b5cf6,color:#fff,stroke:#8b5cf6
style Rep fill:#2da44e,color:#fff,stroke:#2da44e
style Fix fill:#e5534b,color:#fff,stroke:#e5534b
style D fill:#1a1a2e,color:#e0e0e0,stroke:#444
style Diff fill:#1a1a2e,color:#e0e0e0,stroke:#444
Core Features
Composable Rule Engine
9 built-in rules with 4 preset rulesets (default, sft, preference, llm). Extend with YAML — no Python code needed:
rules:
- field: instruction
check: min_length
value: 10
severity: error
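To make the semantics of such a rule concrete, the sketch below evaluates it against individual records in plain Python. This is an assumption about how a `min_length` check could behave, not DataCheck's engine: `RULES` is the dict form a YAML loader would produce for the snippet above, and `CHECKS`, `check_min_length`, and `apply_rules` are hypothetical names.

```python
# In-memory form of the YAML rule (what a YAML loader would return).
RULES = [
    {"field": "instruction", "check": "min_length", "value": 10, "severity": "error"},
]


def check_min_length(value, limit):
    """Pass only if the field is a string with at least `limit` characters."""
    return isinstance(value, str) and len(value) >= limit


# Hypothetical registry mapping check names to functions.
CHECKS = {"min_length": check_min_length}


def apply_rules(record, rules):
    """Return one finding per rule that the record violates."""
    findings = []
    for rule in rules:
        check = CHECKS[rule["check"]]
        if not check(record.get(rule["field"]), rule["value"]):
            findings.append(
                {
                    "field": rule["field"],
                    "check": rule["check"],
                    "severity": rule["severity"],
                }
            )
    return findings
```

For example, `apply_rules({"instruction": "Hi"}, RULES)` yields a single finding with severity `error`, while a record whose instruction has ten or more characters yields none.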
Statistical Anomaly Detection
Pure Python, zero external dependencies. Automatically enabled when sample size $\geq 10$:
- IQR Method: $\text{outlier}(x) \iff x < Q_1 - 1.5 \cdot \text{IQR} \;\lor\; x > Q_3 + 1.5 \cdot \text{IQR}$
- Z-score Method: $\text{outlier}(x) \iff |z(x)| > 3$, where $z(x) = (x - \mu) / \sigma$ for sample mean $\mu$ and standard deviation $\sigma$
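Both criteria can be expressed with the Python standard library alone. The sketch below is one possible reading of the formulas above, not DataCheck's source: the function names are hypothetical, the quartiles use the default method of `statistics.quantiles`, and the Z-score uses the population standard deviation, both of which are assumptions.

```python
import statistics

MIN_SAMPLES = 10  # the text above states detection is enabled at sample size >= 10


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [x for x in values if abs((x - mean) / std) > threshold]


def detect_outliers(values):
    """Run both methods, skipping samples that are too small."""
    if len(values) < MIN_SAMPLES:
        return {"iqr": [], "zscore": []}
    return {"iqr": iqr_outliers(values), "zscore": zscore_outliers(values)}
```

The two methods complement each other: the IQR rule is robust to the outliers it is looking for, whereas the Z-score depends on a mean and standard deviation that extreme values themselves inflate, so it tends to be more conservative on small samples.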
LLM-Assisted Quality Evaluation
Semantic-level quality checks beyond rule-based validation:
knowlyr-datacheck check data.json --ruleset llm
MCP Integration
11 MCP tools for seamless AI IDE integration — check, fix, diff, infer schema, and more, all from your editor.
Python SDK
from datacheck import DataChecker, QualityReport
checker = DataChecker()
result = checker.check_file("data.json")
report = QualityReport(result)
report.print_summary()
Ecosystem
DataCheck is part of the knowlyr data infrastructure:
| Layer | Project | Role |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset intelligence & trend analysis |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |
Want to discuss this project? Reach out to