Hiring: Shanghai (remote negotiable)

Evaluation Scientist

AI Evaluation Scientist · Define what good AI looks like, and prove it

The Role

You will answer one question: What is good AI, and how do you prove it?

It's not just about defining standards — you also need to convince clients the standards are right.
It's not just about building datasets — you need to prove the data actually improves models.

Your work will directly influence how leading AI companies train and improve their models.

What You'll Do

Discover Problems

  • Identify patterns from AI error cases collected from the community
  • Uncover common pain points from client feedback
  • Systematically mine AI model weaknesses and edge cases
  • Track the latest AI evaluation research from North America and identify directions worth pursuing

Define Problems

  • Turn vague notions of "AI isn't good" into quantifiable evaluation dimensions
  • Design evaluation schemas that clearly define what is good and what is bad
  • Design behavioral evaluation criteria for new AI paradigms like Agents
  • Write labeling guidelines that a 10,000-person labeling network can execute consistently

Build Datasets

  • Design data collection strategies (crowdsourcing, synthetic, scraping)
  • Establish quality control processes to ensure consistency in large-scale labeling
  • Build reusable benchmarks and datasets
  • Drive automation and engineering of evaluation pipelines

Prove the Data is Good

  • Design experiments to demonstrate data effectiveness for model improvement
  • Write methodology documentation so clients understand our evaluation logic
  • Communicate with clients and answer the question "Why are your standards correct?"
  • Build client trust through data and experiments

What We're Looking For

Required Skills

  • Problem definition: Can identify structure and patterns in messy data
  • Standard design: Knows how to turn "good vs. bad" into executable rules
  • Value demonstration: Uses data and experiments to persuade — not afraid of client pushback
  • Communication: Can explain methodology to technical staff and articulate value to business stakeholders

Required Mindset

  • Genuinely curious about AI evaluation — wants to figure out how to measure AI quality
  • Wants to define standards, not just follow them — has the drive to proactively frame problems
  • Not "I think it's good" but "I'll prove it to you"

Nice to Have

  • Experience building datasets or benchmarks (academic or industry)
  • Published papers in AI, NLP, or software engineering
  • Hands-on experience with labeling or crowdsourcing platforms
  • Familiarity with LLM evaluation methods (RLHF, LLM-as-Judge, Constitutional AI, etc.)
  • Familiarity with Agent evaluation, red teaming, or adversarial evaluation
  • Consulting or client-facing work experience

Not a Fit If You

  • Only write papers and don't care about practical application
  • Only execute and can't independently define problems
  • Get defensive when clients challenge your work
  • Don't resonate with "the value of human judgment in the AI era"

Want to chat before applying?

王瑶 VP, People & Culture
赵建军 HR
叶心蕾 AI HR