Join Us
Looking for people who believe human judgment is irreplaceable
Open Positions
The Role
You will answer one question: What is good AI, and how do you prove it?
It's not enough to define standards; you also have to convince clients that the standards are right.
It's not enough to build datasets; you also have to prove that the data actually improves models.
Your work will directly influence how leading AI companies train and improve their models.
What You'll Do
Discover Problems
- Identify patterns from AI error cases collected from the community
- Uncover common pain points from client feedback
- Systematically mine AI model weaknesses and edge cases
- Track the latest AI evaluation research from North America and identify directions worth pursuing
Define Problems
- Turn vague notions of "AI isn't good" into quantifiable evaluation dimensions
- Design evaluation schemas that clearly define what is good and what is bad
- Design behavioral evaluation criteria for new AI paradigms like Agents
- Write labeling guidelines that a 10,000-person labeling network can execute consistently
Build Datasets
- Design data collection strategies (crowdsourcing, synthetic, scraping)
- Establish quality control processes to ensure consistency in large-scale labeling
- Build reusable benchmarks and datasets
- Drive automation and engineering of evaluation pipelines
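Quality control at labeling scale usually begins with measuring inter-annotator agreement. As one illustration (not a prescribed part of this role's toolchain), Cohen's kappa corrects raw agreement for chance; the two annotator label lists below are made up for the example:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at random
    # according to their own label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same eight items
a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # -> kappa = 0.50
```

A kappa near 1.0 suggests the labeling guidelines are executable; a low kappa is a signal to tighten the guidelines before scaling out to the full labeling network.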
Prove the Data is Good
- Design experiments to demonstrate data effectiveness for model improvement
- Write methodology documentation so clients understand our evaluation logic
- Communicate with clients and answer "Why are your standards correct?"
- Build client trust through data and experiments
What We're Looking For
Required Skills
- Problem definition: Can identify structure and patterns in messy data
- Standard design: Knows how to turn "good vs. bad" into executable rules
- Value demonstration: Uses data and experiments to persuade — not afraid of client pushback
- Communication: Can explain methodology to technical staff and articulate value to business stakeholders
Required Mindset
- Genuinely curious about AI evaluation — wants to figure out how to measure AI quality
- Wants to define standards, not just follow them — has the drive to proactively frame problems
- Not "I think it's good" but "I'll prove it to you"
Nice to Have
- Experience building datasets or benchmarks (academic or industry)
- Published papers in AI, NLP, or software engineering
- Hands-on experience with labeling or crowdsourcing platforms
- Familiarity with LLM evaluation methods (RLHF, LLM-as-Judge, Constitutional AI, etc.)
- Familiarity with Agent evaluation, red teaming, or adversarial evaluation
- Consulting or client-facing work experience
Not a Fit If You
- Only write papers and don't care about practical application
- Only execute and can't independently define problems
- Get defensive when clients challenge your work
- Don't resonate with "the value of human judgment in the AI era"
The Role
You will answer one question: Does our data actually work?
We build evaluation datasets, but data quality can't simply be asserted.
You need to prove through experiments that models trained on our data are measurably better.
What You'll Do
Design Validation Experiments
- Design controlled experiments to verify the impact of different datasets on model performance
- Define evaluation metrics to measure "how much the data improved the model"
- Control variables to ensure credible experimental conclusions
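One common way to make "how much the data improved the model" credible is a paired bootstrap confidence interval over per-prompt evaluation scores. The sketch below is illustrative only: the function name and the score arrays are invented for the example, and the per-prompt scores would in practice come from evaluating two models (trained on dataset variants A and B) on the same held-out prompts:

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Paired bootstrap CI for the mean score difference between two models
    evaluated on the same prompts (scores_a[i] and scores_b[i] share prompt i)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)  # fixed seed so the experiment is reproducible
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample prompts with replacement, keeping the A/B pairing intact
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi  # if the interval excludes 0, the difference is significant

# Made-up per-prompt scores: model trained with dataset A vs. dataset B
a = [0.82, 0.91, 0.75, 0.88, 0.79, 0.93, 0.85, 0.80, 0.90, 0.77]
b = [0.70, 0.85, 0.72, 0.80, 0.74, 0.88, 0.78, 0.75, 0.83, 0.71]
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```

Pairing the resamples by prompt controls for prompt difficulty, which is exactly the kind of variable control that makes the experimental conclusion defensible to a skeptical client.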
Run Model Training
- Run rapid validation experiments on small models (3B/7B)
- Apply fine-tuning methods such as SFT and DPO
- Work with training frameworks such as LLaMA-Factory and Axolotl
Analyze Experimental Results
- Interpret training results and assess data effectiveness
- Identify data issues (which data is useful, which is noise)
- Produce visual reports that non-technical stakeholders can understand
Feed Back into Data Iteration
- Guide data collection and labeling improvements based on experimental results
- Collaborate with Evaluation Scientists to form a "data -> validation -> improvement" loop
What We're Looking For
Required Skills
- Model training: Familiar with LLM fine-tuning workflows (SFT, DPO, RLHF concepts)
- Experiment design: Understands variable control, control groups, and statistical significance
- Result analysis: Can draw reliable conclusions from experimental data
- Tool proficiency: PyTorch, Transformers, and common training frameworks
Required Mindset
- Genuinely curious about how data affects models
- Pursues experimental rigor — not satisfied with "it ran, so it's done"
- Willing to communicate with non-technical teams and explain experimental findings
Nice to Have
- Experience with RLHF or human feedback
- Published relevant papers
- Experience in data quality assessment
- Familiarity with reward model training
Not a Fit If You
- Only tune hyperparameters without understanding the "why"
- Only care about models and not about data
- Cannot explain results to non-technical people
Online Skills Assessment
We don't screen by resumes. Complete this assessment to showcase your thinking.
13 open-ended questions · Approx. 2 hours · Answers auto-saved
Want to chat before applying?
Come fight the battles others can't
We don't screen by resumes — we look at how you think
Contact Us →