Join Us
Looking for people who believe human judgment is irreplaceable
Open Positions
The Role
You will answer one question: What is good AI, and how do you prove it?
It's not enough to define standards; you also have to convince clients that the standards are right.
It's not enough to build datasets; you also have to prove that the data actually improves models.
Your work will directly influence how leading AI companies train and improve their models.
What You'll Do
Discover Problems
- Identify patterns from AI error cases collected from the community
- Uncover common pain points from client feedback
- Systematically mine AI model weaknesses and edge cases
- Track the latest AI evaluation research from North America and identify directions worth pursuing
Define Problems
- Turn vague notions of "AI isn't good" into quantifiable evaluation dimensions
- Design evaluation schemas that clearly define what is good and what is bad
- Design behavioral evaluation criteria for new AI paradigms like Agents
- Write labeling guidelines that a 10,000-person labeling network can execute consistently
Build Datasets
- Design data collection strategies (crowdsourcing, synthetic, scraping)
- Establish quality control processes to ensure consistency in large-scale labeling
- Build reusable benchmarks and datasets
- Drive automation and engineering of evaluation pipelines
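Quality control at labeling scale usually begins with measuring inter-annotator agreement. As one illustration (not a prescribed part of this role's toolchain), Cohen's kappa corrects raw agreement for chance; the two annotator label lists below are made up for the example:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at random
    # according to their own label frequencies
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same eight items
a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # -> kappa = 0.50
```

A kappa near 1.0 suggests the labeling guidelines are executable; a low kappa is a signal to tighten the guidelines before scaling out to the full labeling network.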
Prove the Data is Good
- Design experiments to demonstrate data effectiveness for model improvement
- Write methodology documentation so clients understand our evaluation logic
- Communicate with clients and answer "Why are your standards correct?"
- Build client trust through data and experiments
What We're Looking For
Required Skills
- Problem definition: Can identify structure and patterns in messy data
- Standard design: Knows how to turn "good vs. bad" into executable rules
- Value demonstration: Uses data and experiments to persuade — not afraid of client pushback
- Communication: Can explain methodology to technical staff and articulate value to business stakeholders
Required Mindset
- Genuinely curious about AI evaluation — wants to figure out how to measure AI quality
- Wants to define standards, not just follow them — has the drive to proactively frame problems
- Not "I think it's good" but "I'll prove it to you"
Nice to Have
- Experience building datasets or benchmarks (academic or industry)
- Published papers in AI, NLP, or software engineering
- Hands-on experience with labeling or crowdsourcing platforms
- Familiarity with LLM evaluation methods (RLHF, LLM-as-Judge, Constitutional AI, etc.)
- Familiarity with Agent evaluation, red teaming, or adversarial evaluation
- Consulting or client-facing work experience
Not a Fit If You
- Only write papers and don't care about practical application
- Only execute and can't independently define problems
- Get defensive when clients challenge your work
- Don't resonate with "the value of human judgment in the AI era"
The Role
You will answer one question: Does our data actually work?
We build evaluation datasets, but data quality can't simply be asserted.
You need to prove through experiments that models trained on our data are measurably better.
What You'll Do
Design Validation Experiments
- Design controlled experiments to verify the impact of different datasets on model performance
- Define evaluation metrics to measure "how much the data improved the model"
- Control variables to ensure credible experimental conclusions
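One common way to make "how much the data improved the model" credible is a paired bootstrap confidence interval over per-prompt evaluation scores. The sketch below is illustrative only: the function name and the score arrays are invented for the example, and the per-prompt scores would in practice come from evaluating two models (trained on dataset variants A and B) on the same held-out prompts:

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Paired bootstrap CI for the mean score difference between two models
    evaluated on the same prompts (scores_a[i] and scores_b[i] share prompt i)."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)  # fixed seed so the experiment is reproducible
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample prompts with replacement, keeping the A/B pairing intact
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi  # if the interval excludes 0, the difference is significant

# Made-up per-prompt scores: model trained with dataset A vs. dataset B
a = [0.82, 0.91, 0.75, 0.88, 0.79, 0.93, 0.85, 0.80, 0.90, 0.77]
b = [0.70, 0.85, 0.72, 0.80, 0.74, 0.88, 0.78, 0.75, 0.83, 0.71]
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for mean improvement: [{lo:.3f}, {hi:.3f}]")
```

Pairing the resamples by prompt controls for prompt difficulty, which is exactly the kind of variable control that makes the experimental conclusion defensible to a skeptical client.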
Run Model Training
- Run rapid validation experiments on small models (3B/7B)
- Apply fine-tuning methods such as SFT and DPO
- Work with training frameworks such as LLaMA-Factory and Axolotl
Analyze Experimental Results
- Interpret training results and assess data effectiveness
- Identify data issues (which data is useful, which is noise)
- Produce visual reports that non-technical stakeholders can understand
Feed Back into Data Iteration
- Guide data collection and labeling improvements based on experimental results
- Collaborate with Evaluation Scientists to form a "data -> validation -> improvement" loop
What We're Looking For
Required Skills
- Model training: Familiar with LLM fine-tuning workflows (SFT, DPO, RLHF concepts)
- Experiment design: Understands variable control, control groups, and statistical significance
- Result analysis: Can draw reliable conclusions from experimental data
- Tool proficiency: PyTorch, Transformers, and common training frameworks
Required Mindset
- Genuinely curious about how data affects models
- Pursues experimental rigor — not satisfied with "it ran, so it's done"
- Willing to communicate with non-technical teams and explain experimental findings
Nice to Have
- Experience with RLHF or human feedback
- Published relevant papers
- Experience in data quality assessment
- Familiarity with reward model training
Not a Fit If You
- Only tune hyperparameters without understanding the "why"
- Only care about models and not about data
- Cannot explain results to non-technical people
Online Skills Assessment
We don't screen by resumes. Complete this assessment to showcase your thinking.
13 open-ended questions · Approx. 2 hours · Answers auto-saved
Want to chat before applying?
Come fight the battles others can't
We don't screen by resumes — we look at how you think
Contact Us →