Evaluation Scientist
AI Evaluation Scientist · Define what good AI looks like, and prove it
The Role
You will answer one question: What is good AI, and how do you prove it?
It's not just about defining standards — you also need to convince clients the standards are right.
It's not just about building datasets — you need to prove the data actually improves models.
Your work will directly influence how leading AI companies train and improve their models.
What You'll Do
Discover Problems
- Identify patterns from AI error cases collected from the community
- Uncover common pain points from client feedback
- Systematically mine AI model weaknesses and edge cases
- Track the latest AI evaluation research from North America and identify directions worth pursuing
Define Problems
- Turn vague notions of "AI isn't good" into quantifiable evaluation dimensions
- Design evaluation schemas that clearly define what is good and what is bad
- Design behavioral evaluation criteria for new AI paradigms like Agents
- Write labeling guidelines that a 10,000-person labeling network can execute consistently
Build Datasets
- Design data collection strategies (crowdsourcing, synthetic generation, scraping)
- Establish quality control processes to ensure consistency in large-scale labeling
- Build reusable benchmarks and datasets
- Drive automation and engineering of evaluation pipelines
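To give a flavor of the quality-control work above: labeling consistency is often measured with chance-corrected agreement metrics such as Cohen's kappa. A minimal sketch (function name and data are illustrative, not part of any internal tooling):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance from each annotator's label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two annotators
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators rating the same six AI responses (toy data)
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 2))  # → 0.33
```

In practice a large labeling network is audited pairwise or against gold labels, and guidelines are revised until agreement clears a target threshold.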
Prove the Data is Good
- Design experiments to demonstrate data effectiveness for model improvement
- Write methodology documentation so clients understand our evaluation logic
- Communicate with clients and answer "Why are your standards correct?"
- Build client trust through data and experiments
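Experiments like those above often reduce to comparing per-prompt evaluation scores before and after training on the new data, with a significance check. A toy sketch using a paired bootstrap (all names and numbers are illustrative assumptions, not a real pipeline):

```python
import random

def bootstrap_p_value(baseline, treated, iters=10_000, seed=0):
    """Paired bootstrap: estimate the one-sided p-value that the treated
    model's mean score does not exceed the baseline's."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    worse = 0
    for _ in range(iters):
        # Resample per-prompt score differences with replacement
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            worse += 1
    return worse / iters

# Illustrative per-prompt eval scores before/after training on the new dataset
baseline = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.52, 0.63]
treated  = [0.68, 0.60, 0.71, 0.55, 0.70, 0.64, 0.58, 0.66]
print(f"estimated p-value: {bootstrap_p_value(baseline, treated):.3f}")
```

A write-up of this kind of experiment, with the methodology spelled out, is what turns "trust our standards" into an evidence-backed claim.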
What We're Looking For
Required Skills
- Problem definition: Can identify structure and patterns in messy data
- Standard design: Knows how to turn "good vs. bad" into executable rules
- Value demonstration: Uses data and experiments to persuade — not afraid of client pushback
- Communication: Can explain methodology to technical staff and articulate value to business stakeholders
Required Mindset
- Genuinely curious about AI evaluation — wants to figure out how to measure AI quality
- Wants to define standards, not just follow them — has the drive to proactively frame problems
- Not "I think it's good" but "I'll prove it to you"
Nice to Have
- Experience building datasets or benchmarks (academic or industry)
- Published papers in AI, NLP, or software engineering
- Hands-on experience with labeling or crowdsourcing platforms
- Familiarity with LLM evaluation methods (RLHF, LLM-as-Judge, Constitutional AI, etc.)
- Familiarity with Agent evaluation, red teaming, or adversarial evaluation
- Consulting or client-facing work experience
Not a Fit If You
- Only write papers and don't care about practical application
- Only execute and can't independently define problems
- Get defensive when clients challenge your work
- Don't resonate with "the value of human judgment in the AI era"
Want to chat before applying?