Open Source · Python · MIT

DataCheck


Updated 2026-02-25
A multi-dimensional data quality verification framework covering four quality dimensions: completeness, uniqueness, validity, and anomaly detection. Built-in IQR/Z-score anomaly detection and n-gram Jaccard near-duplicate detection act as a quality gate before data enters training.
Four-Dimensional Quality Model · Anomaly Detection · Approximate Deduplication

Quick Start

Install
pip install knowlyr-datacheck
Usage
from datacheck import DataChecker

checker = DataChecker()
report = checker.check_file("training_data.json")
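check_data_quality: Check the quality of a data file (supports JSON/JSONL/CSV)
validate_from_datarecipe: Validate data against DataRecipe analysis results
compare_distributions: Compare the distributions of multiple data files (supports JSON/JSONL/CSV)
list_quality_rules: List all available quality check rules
infer_schema: Infer a schema from a data file (field types, constraints, required fields)
fix_data: Fix common quality issues in a data file (deduplication, whitespace stripping, PII redaction)
batch_check_directory: Batch-check the quality of all data files in a directory (recursively scans JSON/JSONL/CSV)
check_drift: Detect distribution drift between two data files (numeric statistics differences, category distribution changes, text feature comparison)
check_leakage: Detect data leakage between training and test sets (exact duplicates plus token Jaccard near-duplicates)
check_bias: Detect dataset bias (class imbalance, text length distribution bias, language distribution bias)
check_coverage: Measure dataset coverage (field completeness, missing-value ratio, unique-value distribution)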
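The check_leakage entry above combines exact-duplicate matching with token Jaccard similarity. The following is a minimal sketch of that idea, not DataCheck's actual implementation: the function names, the regex tokenizer, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def token_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_leakage(train, test, threshold=0.8):
    """Return (test_index, kind) pairs for test texts that overlap with train.

    kind is "exact" for identical strings and "near" when the token Jaccard
    similarity to any training text reaches the (assumed) threshold.
    """
    train_exact = set(train)
    train_tokens = [token_set(t) for t in train]
    hits = []
    for index, text in enumerate(test):
        if text in train_exact:
            hits.append((index, "exact"))
            continue
        tokens = token_set(text)
        if any(jaccard(tokens, t) >= threshold for t in train_tokens):
            hits.append((index, "near"))
    return hits


train = ["Translate the sentence into French.", "What is the capital of Japan?"]
test = [
    "What is the capital of Japan?",
    "translate the sentence into french now.",
    "Explain how photosynthesis works.",
]
print(find_leakage(train, test))
```

This pairwise comparison costs one similarity computation per train/test pair, so it only suits small files; a production tool would likely need an index or hashing scheme to scale.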

Documentation

DataCheck

Multi-Dimensional Data Quality Validation with Statistical Anomaly Detection

Automated quality validation for LLM training data — composable rules, IQR/Z-score anomaly detection, and auto-fix pipeline

Why DataCheck?

Training data quality is the hidden bottleneck of model performance. Overlooked format errors, hidden PII leaks, undetected duplicate samples — any single issue can amplify into systematic bias downstream.

Existing quality solutions are either one-off scripts (not reusable) or heavyweight platforms (expensive to deploy), and generally lack statistical anomaly detection and auto-fix capabilities.

DataCheck solves this with a composable rule engine that provides end-to-end data quality validation:

  • 9 Built-in Rules covering completeness, validity, privacy, and consistency
  • IQR / Z-score Dual-Method anomaly detection for numeric and text length outliers
  • LLM-Assisted Evaluation for instruction clarity and response relevance
  • Auto-Fix Pipeline — dedup, strip whitespace, PII redaction
  • Report Diff — quantify quality improvements before vs. after fixes

Get Started in 30 Seconds

pip install knowlyr-datacheck

# Check your data
knowlyr-datacheck check data.json

# Auto-fix issues
knowlyr-datacheck fix data.jsonl -o fixed.jsonl --strip-pii

# Compare before/after
knowlyr-datacheck diff report_v1.json report_v2.json

Quality Pipeline

graph LR
    D["Data Files<br/>JSON / JSONL / CSV"] --> R["Rule Engine<br/>9 Rules + YAML Custom"]
    R --> A["Anomaly Detector<br/>IQR / Z-score"]
    A --> Rep["Quality Report<br/>MD / JSON / HTML"]
    Rep --> Fix["Auto Fix<br/>Dedup · PII · Trim"]
    Fix --> Diff["Report Diff<br/>Before vs After"]

    style R fill:#0969da,color:#fff,stroke:#0969da
    style A fill:#8b5cf6,color:#fff,stroke:#8b5cf6
    style Rep fill:#2da44e,color:#fff,stroke:#2da44e
    style Fix fill:#e5534b,color:#fff,stroke:#e5534b
    style D fill:#1a1a2e,color:#e0e0e0,stroke:#444
    style Diff fill:#1a1a2e,color:#e0e0e0,stroke:#444

Core Features

Composable Rule Engine

9 built-in rules with 4 preset rulesets (default, sft, preference, llm). Extend with YAML — no Python code needed:

rules:
  - field: instruction
    check: min_length
    value: 10
    severity: error
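To make the rule semantics concrete, here is a minimal sketch of how a rule with these four fields might be evaluated against records. It is an illustration only: `CHECKS`, `apply_rule`, and the shape of the findings are hypothetical and are not part of the DataCheck API, and the rule is written as a plain dict rather than parsed from YAML.

```python
# Hypothetical in-memory form of the YAML rule (field / check / value / severity).
rule = {"field": "instruction", "check": "min_length", "value": 10, "severity": "error"}

# Illustrative check registry; only min_length is sketched here.
CHECKS = {
    "min_length": lambda text, limit: isinstance(text, str) and len(text) >= limit,
}


def apply_rule(rule, records):
    """Return one finding per record that fails the rule.

    A missing field is treated as a failure (an assumption of this sketch).
    """
    check = CHECKS[rule["check"]]
    findings = []
    for index, record in enumerate(records):
        if not check(record.get(rule["field"]), rule["value"]):
            findings.append({
                "index": index,
                "field": rule["field"],
                "check": rule["check"],
                "severity": rule["severity"],
            })
    return findings


records = [
    {"instruction": "Summarize the following article in two sentences."},
    {"instruction": "Hi"},
    {"response": "missing instruction field"},
]
for finding in apply_rule(rule, records):
    print(finding)
```

Keeping checks in a name-to-function registry is one way a declarative rule file can stay free of Python code: the rule names a check, and the engine looks it up.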

Statistical Anomaly Detection

Pure Python, zero external dependencies. Automatically enabled when sample size $\geq 10$:

  • IQR Method: $\text{outlier}(x) \iff x < Q_1 - 1.5 \cdot \text{IQR} \;\lor\; x > Q_3 + 1.5 \cdot \text{IQR}$
  • Z-score Method: $\text{outlier}(x) \iff |z(x)| > 3$, where $z(x) = (x - \mu) / \sigma$
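Both criteria above can be sketched with the standard library alone. This is an illustrative version, not DataCheck's own code: the function names, the quartile method (the default of `statistics.quantiles`), and the use of the population standard deviation are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # assumed quartile method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        return []  # all values identical, so nothing can be an outlier
    return [x for x in values if abs((x - mean) / std) > threshold]


# Hypothetical text lengths with one obviously long sample.
lengths = [42, 45, 47, 50, 51, 52, 53, 55, 58, 60, 400]
print(iqr_outliers(lengths))
print(zscore_outliers(lengths))
```

The two methods complement each other: the IQR fences depend only on quartiles and are robust to the outlier itself, while the z-score uses the mean and standard deviation, which a single extreme value can inflate.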

LLM-Assisted Quality Evaluation

Semantic-level quality checks beyond rule-based validation:

knowlyr-datacheck check data.json --ruleset llm

MCP Integration

11 MCP tools for seamless AI IDE integration — check, fix, diff, infer schema, and more, all from your editor.

Python SDK

from datacheck import DataChecker, QualityReport

checker = DataChecker()
result = checker.check_file("data.json")
report = QualityReport(result)
report.print_summary()

Ecosystem

DataCheck is part of the knowlyr data infrastructure:

| Layer | Project | Role |
|---|---|---|
| Discovery | AI Dataset Radar | Dataset intelligence & trend analysis |
| Analysis | DataRecipe | Reverse analysis, schema extraction, cost estimation |
| Production | DataSynth / DataLabel | LLM batch synthesis / lightweight annotation |
| Quality | DataCheck | Rule validation, anomaly detection, auto-fix |
| Audit | ModelAudit | Distillation detection, model fingerprinting |

GitHub · PyPI

knowlyr — multi-dimensional data quality validation with statistical anomaly detection

Want to discuss this project? Reach out to

Kai Founder & CEO
林晓桐 AI Data Quality Expert