Reinforcement Learning Loop Solutions
Judgment infrastructure deployed in model training
- General data labeling (text / image / video / audio)
- Multilingual and cross-cultural content
- Domain knowledge data production
- Domain data cleaning and structuring
- RLHF — Reinforcement Learning from Human Feedback
- Complex reasoning data production and labeling
- Preference alignment and iterative optimization
- Code / math / logic reasoning scenarios
- Hallucination detection and correction
- Extreme challenge questions beyond current strongest models
- Abstract reasoning and complex scenario construction
- Frontier evaluation dataset production (HLE / ARC-AGI)
- Humanity's Last Exam (HLE) and in-context learning (ICL)
- Agent benchmark evaluation and simulation environments
- Multi-dimensional capability metrics
- Automated evaluation workflows
- Human-AI collaborative evaluation standard calibration
Core Products
From training loops to authoritative evaluation — covering the entire AI data pipeline
Training Loop
Pairwise comparison, multi-dimensional scoring, continuous iteration — capturing the subtlest differences in human preferences to train Reward Models to distinguish 'good' from 'better'.
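For illustration, a single pairwise comparison record might look like the sketch below. This is a minimal, hypothetical schema; the field names are our assumptions, not Knowlyr's actual data format.

```python
# Hypothetical preference-comparison record (illustrative schema only).
preference_record = {
    "prompt": "Refactor this function to improve readability.",
    "response_a": "...",  # model output A (elided)
    "response_b": "...",  # model output B (elided)
    "preference": "a",    # which response the expert preferred overall
    "scores": {           # multi-dimensional scoring, e.g. a 1-5 scale
        "correctness": {"a": 5, "b": 5},
        "readability": {"a": 4, "b": 2},
        "conciseness": {"a": 3, "b": 4},
    },
    "rationale": "A extracts a helper function and names variables clearly.",
    "annotator_id": "expert-0042",
}
```

Records in this shape are what a Reward Model is later fit to (see the loss sketch in the FAQ below).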
Expert teams with programming, math, and logic backgrounds producing code refactoring reviews, mathematical proof chains, and multi-step inference data — verifiable chains of thought, not just correct answers.
Multi-dimensional hallucination taxonomy — fabrication, date confusion, numerical misattribution, logical inference errors — with root cause analysis and evidence chains for every hallucination.
Evaluation & Expert
From ARC-AGI 2 to Humanity's Last Exam, producing high-difficulty evaluation datasets for leading institutions. Agent evaluation, crowdsourced evaluation, expert review.
A cross-disciplinary expert network covering high-barrier domains such as medicine, law, finance, math, and physics — real practitioners, not generalist annotators.
Multimodal labeling, multilingual localization, novel task types — we define the problem, design the workflow, and deliver results together with you.
Engagement Process
From requirements to scale delivery — every step is measurable
Deep understanding of business scenarios and model objectives, clarifying data types, quality standards, and delivery timelines.
Defining labeling specifications, quality metrics, and acceptance criteria; designing task workflows and expert team configuration.
Small-batch pilot labeling to align on standards and quality expectations. Once confirmed, scale production begins.
Expert teams working in parallel with multi-layer QA and real-time monitoring, ensuring delivery speed and data consistency.
Continuously optimizing data strategy based on model training feedback, forming a data → training → evaluation closed loop.
Customer Cases
From leading tech giants to world-class AI research institutions — we let our delivery speak for itself
The client was rapidly iterating complex Agent applications (data analysis, writing, presentations), benchmarking against top products, and needed an evaluation-grade data team capable of fine-grained, logic-heavy assessment.
- Deeply involved in defining evaluation standards
- Full coverage from visual layout to deep logical reasoning
- Fine-grained evaluation insights driving product leadership
- Sole data vendor handling all high-priority core evaluations
- Accumulated thousands of high-value alignment records
- Effectively supported new strategy iteration and launch
As the foundation model pushed into uncharted territory, the client urgently needed high-quality data for HLE (Humanity's Last Exam), complex multi-attachment processing, and ICL (In-Context Learning).
- HLE evaluation data construction
- Complex multi-attachment scenario processing
- ICL high-difficulty logic chain data orchestration
- Filled gaps in the client's frontier evaluation sets
- Advanced long-context understanding and complex instruction following
- Demonstrated exceptional business understanding
For primate behavioral research, the lab needed 3D skeleton keypoint annotation of complex videos showing primates grasping objects: research-grade 3D spatial annotation with extremely low tolerance for error.
- Primate 3D motion skeleton keypoint annotation
- High-precision 3D spatial annotation
- Strict adherence to research-grade precision standards
- Sole designated vendor for a national-level research project
- Consistently delivered high-precision data
- Directly contributed to major research publications
Facing a sudden surge in overseas business volume, the client needed to rapidly assemble a large-scale professional English annotation team with stringent language requirements (TEM-4/TEM-8 certified).
- 100+ TEM-8 certified annotators onboarded in 3 days
- Mobilized 200+ full-time annotators and a 1,000+ crowdsourced reserve pool
- Response speed far exceeded client expectations
- Exceptionally high delivery consistency
- Data rework rate strictly controlled under 5%
- Significantly reduced client's secondary QA effort
The client needed human expert review to score AI-generated code refactoring proposals on preference dimensions, improving model code readability and refactoring quality.
- Assembled expert teams with programming backgrounds
- Designed multi-dimensional preference evaluation framework
- Established quantitative code readability scoring criteria
- Experts performed pairwise preference comparison and scoring
- Continuous iterative training to optimize the Reward Model
- Significant improvement in code readability scores
Addressing common hallucination issues in large language models: generated content is cross-validated against reference materials along multiple dimensions to identify hallucinations and trace their root causes (see the record sketch after this list).
- Fabrication
- Date confusion
- Numerical misattribution
- Factual misattribution
- Logical inference errors
- Labeling hallucinations with reasoning evidence
- Labeling contradictions between reference materials
- Labeling consistency between real content and references
- Producing data on hard-to-distinguish hallucination causes
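For illustration, one labeled hallucination might be recorded as below. This is a minimal sketch with hypothetical field names, not the actual production schema.

```python
# Hypothetical hallucination annotation record (field names are assumptions).
hallucination_label = {
    "claim": "The treaty was signed in 1952.",
    "reference_span": "signed on 8 September 1951",
    "verdict": "hallucination",
    "category": "date_confusion",  # fabrication | date_confusion |
                                   # numerical_misattribution |
                                   # factual_misattribution |
                                   # logical_inference_error
    "root_cause": "Model conflated the signing date with the entry-into-force date.",
    "evidence_chain": [
        "Reference states the signing date as 1951-09-08.",
        "Generated text asserts 1952 with no supporting source.",
    ],
}
```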
Producing abstract reasoning evaluation datasets that measure AI systems' general intelligence, one of the benchmarks considered closest to AGI (format sketch after this list).
- Designing visual and logical reasoning tasks
- Constructing multi-level abstract reasoning problems
- Ensuring problems pose genuine challenges for AI
- Human expert cross-validation
- Ensuring logical consistency and unambiguity
- Multi-round iterative selection of high-quality samples
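For context, public ARC-AGI tasks are distributed as JSON: small integer grids (color codes 0-9) split into train and test pairs. The structure below follows that public format; the toy task itself ("mirror each row") is our own illustration.

```python
# Public ARC-AGI task structure; the task ("mirror each row") is a toy example.
arc_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 0, 0]],      "output": [[0, 0, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [0, 5]], "output": [[4, 0], [5, 0]]},
    ],
}
```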
Contributing to the 'Humanity's Last Exam' dataset — questions created by world-class experts, specifically designed to test the upper limits of LLM capabilities.
- Organizing cross-disciplinary domain experts
- Covering high-barrier fields: mathematics, physics, law, and more
- Ensuring problems exceed current strongest model capabilities
- Standardized answers and scoring rubrics
- Multi-round expert review to eliminate disputes
- Producing high-quality reasoning process data
Building automated Agent evaluation pipelines for clients, systematically assessing task completion and tool-calling accuracy in simulation environments (a minimal harness sketch follows this list).
- Building automated evaluation workflows
- Designing multi-dimensional Agent capability metrics
- Constructing reproducible simulation test environments
- Feeding evaluation results back into model training
- Continuously expanding evaluation scenario coverage
- Human-AI collaborative calibration of evaluation standards
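A minimal sketch of what such a harness can look like is below; the agent interface, case schema, and metrics are our assumptions, not a specific client pipeline.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentEvalCase:
    """One reproducible test case in a simulated environment."""
    task_prompt: str
    expected_tool_calls: list[str]        # tool names the agent should invoke, in order
    success_check: Callable[[str], bool]  # validates the agent's final answer

def evaluate(run_agent: Callable[[str], tuple[str, list[str]]],
             cases: list[AgentEvalCase]) -> dict[str, float]:
    """Score task completion and tool-calling accuracy across all cases."""
    completed = tool_correct = 0
    for case in cases:
        answer, tool_calls = run_agent(case.task_prompt)
        completed += case.success_check(answer)
        tool_correct += tool_calls == case.expected_tool_calls
    n = len(cases)
    return {"task_completion": completed / n, "tool_accuracy": tool_correct / n}
```

Scores in this form can be logged per model checkpoint and fed back into training, closing the evaluation loop.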
Talk to the Right Person
Dedicated contacts for every area — human + AI employees responding together
FAQ
How are you different from regular data labeling companies?
We don't just label data — we help clients train their models. Through RLHF preference alignment, chain-of-thought labeling, and RL loops, we directly participate in model training iteration, not just data production.
What does the RLHF data labeling process look like?
Our expert teams perform pairwise comparisons and multi-dimensional scoring of model outputs, generating preference data to train Reward Models. Through continuous iteration, we progressively optimize model performance.
What languages and domains do you support?
We support multilingual labeling including Chinese, English, Japanese, Korean, and more, covering 40+ vertical domains such as code, math, law, medicine, and finance. Our AntGather Community includes 10,000+ labeling experts with professional backgrounds.
How do you ensure data quality?
Multi-layer quality control: expert cross-validation, consistency checks, automated anomaly detection, and continuous iterative training. All data undergoes at least two rounds of human review.
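To make "consistency checks" concrete: one standard measure is inter-annotator agreement, such as Cohen's kappa between two reviewers labeling the same items. A minimal sketch (our illustration, not the exact production pipeline):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators on the same items, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# A batch whose kappa falls below a threshold (say 0.8) gets flagged for re-review.
```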
What types of projects have you delivered?
Code refactoring RLHF, hallucination detection, HLE extreme evaluation, ARC-AGI abstract reasoning, Agent evaluation, 3D skeleton annotation, and more. We have real delivery cases from basic labeling to frontier evaluation.
What is the typical data delivery timeline?
Pilot validation takes 3-5 days. Scale production depends on data volume. We have onboarded 100+ full-time annotators for clients within 3 days — rapid response is one of our core strengths.
How is your expert team assembled?
Our AntGather Community has 10,000+ judgment nodes covering 40+ professional domains. 85% hold bachelor's degrees or higher, with an average age of 29. We match experts to task requirements, typically completing the match within 3 days.
Can I purchase only part of your services?
Yes. Our four-tier judgment services can be purchased individually or combined. From basic data production to complete RLHF loops, we configure flexibly based on your needs.
Are your open-source tools free to use?
Yes. All 8 projects and their 130 MCP endpoints are fully open source, supporting both CLI and MCP modes. You can integrate them directly into Claude, VS Code, or custom Agents.
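For example, registering an MCP server with Claude Desktop is a small `mcpServers` config entry; the server name and package below are placeholders, not one of the actual projects:

```json
{
  "mcpServers": {
    "knowlyr-example": {
      "command": "npx",
      "args": ["-y", "example-mcp-server"]
    }
  }
}
```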
How do I get started?
Contact us to book a demo. We respond within 1 business day. After understanding your needs, we provide a solution design within 2-3 days, then move to pilot validation.
What is Knowlyr?
Knowlyr is an AI data infrastructure company headquartered in Shanghai, founded in 2025. We provide RLHF training data, expert evaluation, and human feedback services for frontier AI models. Knowlyr operates an expert network of 10,000+ professionals across 40+ domains and offers 8 open-source tools with 130 MCP endpoints.
How is Knowlyr different from Scale AI or Surge AI?
While Scale AI and Surge AI focus primarily on data labeling at scale, Knowlyr specializes in human judgment infrastructure — the harder problems that require deep domain expertise. We provide end-to-end RLHF training loops, independent third-party AI evaluation, and a fully open-source MCP-native toolchain. The core difference: we participate in model training iteration, not just data production.
What is RLHF and how does Knowlyr support it?
RLHF (Reinforcement Learning from Human Feedback) trains AI models using human preference data. Knowlyr provides the complete RLHF loop: expert teams perform pairwise comparisons and multi-dimensional scoring of model outputs, generating preference data to train Reward Models. This iterative process aligns model behavior with human values. We cover code, math, reasoning, and alignment scenarios.
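Concretely, a Reward Model is typically fit to those preference pairs with a Bradley-Terry style objective. A minimal PyTorch sketch of the standard pairwise loss (illustrative; not Knowlyr's actual training code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    preferred response above that of the rejected response."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example: a batch of 3 preference pairs scored by the reward model.
chosen = torch.tensor([2.1, 0.3, 1.7])
rejected = torch.tensor([1.4, 0.9, -0.2])
loss = reward_model_loss(chosen, rejected)  # shrinks as chosen pulls ahead
```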