Evaluation Scientist
AI Evaluation Scientist · Define what good AI looks like, and prove it
The Role
You will answer one question: What is good AI, and how do you prove it?
It's not just about defining standards — you also need to convince clients the standards are right.
It's not just about building datasets — you need to prove the data actually improves models.
Your work will directly influence how leading AI companies train and improve their models.
What You'll Do
Discover Problems
- Identify patterns from AI error cases collected from the community
- Uncover common pain points from client feedback
- Systematically mine AI model weaknesses and edge cases
- Track the latest AI evaluation research from North America and identify directions worth pursuing
Define Problems
- Turn vague notions of "AI isn't good" into quantifiable evaluation dimensions
- Design evaluation schemas that clearly define what is good and what is bad
- Design behavioral evaluation criteria for new AI paradigms like Agents
- Write labeling guidelines that a 10,000-person labeling network can execute consistently
Build Datasets
- Design data collection strategies (crowdsourcing, synthetic generation, scraping)
- Establish quality control processes to ensure consistency in large-scale labeling
- Build reusable benchmarks and datasets
- Drive automation and engineering of evaluation pipelines
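To give a flavor of the quality-control work above: labeling consistency is often measured with chance-corrected agreement metrics such as Cohen's kappa. A minimal sketch (function name and data are illustrative, not part of any internal tooling):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance from each annotator's label frequencies."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence of the two annotators
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators rating the same six AI responses (toy data)
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 2))  # → 0.33
```

In practice a large labeling network is audited pairwise or against gold labels, and guidelines are revised until agreement clears a target threshold.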
Prove the Data is Good
- Design experiments to demonstrate data effectiveness for model improvement
- Write methodology documentation so clients understand our evaluation logic
- Communicate with clients and answer "Why are your standards correct?"
- Build client trust through data and experiments
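Experiments like those above often reduce to comparing per-prompt evaluation scores before and after training on the new data, with a significance check. A toy sketch using a paired bootstrap (all names and numbers are illustrative assumptions, not a real pipeline):

```python
import random

def bootstrap_p_value(baseline, treated, iters=10_000, seed=0):
    """Paired bootstrap: estimate the one-sided p-value that the treated
    model's mean score does not exceed the baseline's."""
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    worse = 0
    for _ in range(iters):
        # Resample per-prompt score differences with replacement
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            worse += 1
    return worse / iters

# Illustrative per-prompt eval scores before/after training on the new dataset
baseline = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.52, 0.63]
treated  = [0.68, 0.60, 0.71, 0.55, 0.70, 0.64, 0.58, 0.66]
print(f"estimated p-value: {bootstrap_p_value(baseline, treated):.3f}")
```

A write-up of this kind of experiment, with the methodology spelled out, is what turns "trust our standards" into an evidence-backed claim.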
What We're Looking For
Required Skills
- Problem definition: Can identify structure and patterns in messy data
- Standard design: Knows how to turn "good vs. bad" into executable rules
- Value demonstration: Uses data and experiments to persuade — not afraid of client pushback
- Communication: Can explain methodology to technical staff and articulate value to business stakeholders
Required Mindset
- Genuinely curious about AI evaluation — wants to figure out how to measure AI quality
- Wants to define standards, not just follow them — has the drive to proactively frame problems
- Not "I think it's good" but "I'll prove it to you"
Nice to Have
- Experience building datasets or benchmarks (academic or industry)
- Published papers in AI, NLP, or software engineering
- Hands-on experience with labeling or crowdsourcing platforms
- Familiarity with LLM evaluation methods (RLHF, LLM-as-Judge, Constitutional AI, etc.)
- Familiarity with Agent evaluation, red teaming, or adversarial evaluation
- Consulting or client-facing work experience
Not a Fit If You
- Only write papers and don't care about practical application
- Only execute and can't independently define problems
- Get defensive when clients challenge your work
- Don't resonate with "the value of human judgment in the AI era"
Want to chat before applying?