Join Us

Looking for people who believe human judgment is irreplaceable

Why Join Us

01 · Define Standards, Not Just Execute
You'll be defining AI evaluation methodologies, not just writing code to spec.

02 · Human + AI Collaborative Team
33 AI employees handle execution while humans focus on judgment and decisions.

03 · Impact Leading Clients
Your work directly influences how leading AI research labs train their models.

Open Positions

The Role

You will answer one question: What is good AI, and how do you prove it?

It's not just about defining standards — you also need to convince clients the standards are right.
It's not just about building datasets — you need to prove the data actually improves models.

Your work will directly influence how leading AI companies train and improve their models.

What You'll Do

Discover Problems

  • Identify patterns from AI error cases collected from the community
  • Uncover common pain points from client feedback
  • Systematically mine AI model weaknesses and edge cases
  • Track the latest AI evaluation research from North America and identify directions worth pursuing

Define Problems

  • Turn vague notions of "AI isn't good" into quantifiable evaluation dimensions
  • Design evaluation schemas that clearly define what is good and what is bad
  • Design behavioral evaluation criteria for new AI paradigms like Agents
  • Write labeling guidelines that a 10,000-person labeling network can execute consistently

Build Datasets

  • Design data collection strategies (crowdsourcing, synthetic, scraping)
  • Establish quality control processes to ensure consistency in large-scale labeling
  • Build reusable benchmarks and datasets
  • Drive automation and engineering of evaluation pipelines

Prove the Data is Good

  • Design experiments to demonstrate data effectiveness for model improvement
  • Write methodology documentation so clients understand our evaluation logic
  • Communicate with clients and answer "Why are your standards correct?"
  • Build client trust through data and experiments

What We're Looking For

Required Skills

  • Problem definition: Can identify structure and patterns in messy data
  • Standard design: Knows how to turn "good vs. bad" into executable rules
  • Value demonstration: Uses data and experiments to persuade — not afraid of client pushback
  • Communication: Can explain methodology to technical staff and articulate value to business stakeholders

Required Mindset

  • Genuinely curious about AI evaluation — wants to figure out how to measure AI quality
  • Wants to define standards, not just follow them — has the drive to proactively frame problems
  • Not "I think it's good" but "I'll prove it to you"

Nice to Have

  • Experience building datasets or benchmarks (academic or industry)
  • Published papers in AI, NLP, or software engineering
  • Hands-on experience with labeling or crowdsourcing platforms
  • Familiarity with LLM evaluation methods (RLHF, LLM-as-Judge, Constitutional AI, etc.)
  • Familiarity with Agent evaluation, red teaming, or adversarial evaluation
  • Consulting or client-facing work experience

Not a Fit If You

  • Only write papers and don't care about practical application
  • Only execute and can't independently define problems
  • Get defensive when clients challenge your work
  • Don't resonate with "the value of human judgment in the AI era"

The Role

You will answer one question: Does our data actually work?

We build evaluation datasets, but data quality can't rest on our own say-so.
You need to prove through experiments that models trained with our data are genuinely better.

What You'll Do

Design Validation Experiments

  • Design controlled experiments to verify the impact of different datasets on model performance
  • Define evaluation metrics to measure "how much the data improved the model"
  • Control variables to ensure credible experimental conclusions
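The statistical-significance bullet above can be illustrated with one of the simplest paired comparisons: a two-sided sign test over head-to-head wins between a model trained on the candidate dataset and a baseline. This is a minimal stdlib sketch of one possible method, not a prescribed procedure, and the win/loss counts are invented for illustration:

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided sign test on paired comparisons (ties dropped
    beforehand): probability of a win/loss split at least this
    lopsided if both models were actually equally good."""
    n = wins + losses
    k = max(wins, losses)
    # One-sided binomial tail P(X >= k) under Binomial(n, 0.5),
    # then doubled for a two-sided test (capped at 1.0)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative numbers: out of 100 paired prompts, the model trained
# on the candidate dataset wins 63 comparisons and loses 37
p = sign_test_p_value(63, 37)
print(f"p = {p:.4f}")  # well below 0.05, so unlikely to be chance
```

A 50/50 split yields p = 1.0, i.e. no evidence either way; the point of the controlled setup is that only the training data differs between the two models.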

Run Model Training

  • Use small 3B/7B models for rapid validation experiments
  • Apply fine-tuning methods such as SFT and DPO
  • Work with training frameworks such as LLaMA-Factory and Axolotl

Analyze Experimental Results

  • Interpret training results and assess data effectiveness
  • Identify data issues (which data is useful, which is noise)
  • Produce visual reports that non-technical stakeholders can understand

Feed Back into Data Iteration

  • Guide data collection and labeling improvements based on experimental results
  • Collaborate with Evaluation Scientists to form a "data → validation → improvement" loop

What We're Looking For

Required Skills

  • Model training: Familiar with LLM fine-tuning workflows (SFT, DPO, RLHF concepts)
  • Experiment design: Understands variable control, control groups, and statistical significance
  • Result analysis: Can draw reliable conclusions from experimental data
  • Tool proficiency: PyTorch, Transformers, and common training frameworks

Required Mindset

  • Genuinely curious about how data affects models
  • Pursues experimental rigor — not satisfied with "it ran, so it's done"
  • Willing to communicate with non-technical teams and explain experimental findings

Nice to Have

  • Experience with RLHF or human feedback
  • Published relevant papers
  • Experience in data quality assessment
  • Familiarity with reward model training

Not a Fit If You

  • Only tune hyperparameters without understanding the "why"
  • Only care about models and not about data
  • Cannot explain results to non-technical people

Online Skills Assessment

We don't screen by resumes. Complete this assessment to showcase your thinking.

13 open-ended questions · Approx. 2 hours · Answers auto-saved

Want to chat before applying?

王瑶 VP, People & Culture
赵建军 HR
叶心蕾 AI HR

Come fight the battles others can't

We don't screen by resumes — we look at how you think

Contact Us →