- G2i
- Miami, FL
- Contract
- 6 days ago
- $50-200/hr
Senior Software Engineer – AI Interaction Evaluator (Codex / Claude Code, up to $200/hr): our view in 3 lines...
- The Role: An experienced software engineer will evaluate and judge the quality of interactions produced by AI coding agents like OpenAI Codex and Claude Code.
- The Person: The person will evaluate AI-generated coding interactions end-to-end, judge their usefulness and high-level correctness, assess explanations and reasoning, provide opinionated feedback, and help define quality standards.
- Requirements: Staff- or Principal-level engineers with TypeScript, JavaScript, or Python experience and hands-on use of OpenAI Codex, Claude Code, and Cursor are required.
Job Description
Senior AI Interaction Evaluator (Codex / Claude Code)
These roles are currently filled but we hire on a rolling basis as new projects open up. Apply now to join our talent bench — qualified candidates will be contacted directly when roles become available.
Contract | $50-200/hr | 10–20 hrs/week | Start ASAP (through early May)
Check out this Loom video for more details!
We’re looking for a highly experienced software engineer (Senior+) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.
This is not a traditional engineering role.
You won’t be writing production code.
You’ll be evaluating something harder: whether the model thinks like a great engineer.
What This Role Actually Is
You will assess how AI coding agents behave in real-world scenarios — focusing on:
- Whether the response makes sense
- Whether the preamble and reasoning are useful
- Whether the output reflects strong engineering judgment
- Whether the interaction feels right to an experienced developer
This role is about engineering taste — not syntax correctness.
What You’ll Be Doing
- Evaluate AI-generated coding interactions end-to-end
- Judge whether outputs are:
  - Useful
  - Correct (at a high level)
  - Aligned with how a strong engineer would think
- Assess the quality of explanations and reasoning, not just code
- Distinguish between different levels of response quality (e.g. what makes something a 2 vs a 4)
- Provide clear, opinionated feedback on:
  - What worked
  - What didn’t
  - What felt “off” or misleading
- Help define what great looks like when interacting with tools like Cursor
What We Mean by “Taste”
We’re specifically looking for engineers who can answer questions like:
- Does this feel like something a strong engineer would actually say?
- Is this explanation helpful, or just technically correct?
- Is the model guiding the user well, or just dumping output?
- Would this interaction build or erode trust?
You should be comfortable making subjective but rigorous judgments.
Who You Are
- Staff / Principal-level engineer (or equivalent experience)
- Strong background in at least one of the following:
  - TypeScript / JavaScript
  - Python
- Hands-on experience using:
  - OpenAI Codex
  - Claude Code
  - Cursor
- Deep familiarity with modern AI-assisted dev workflows
- Able to evaluate code without needing to fully execute or deeply review every line
- Comfortable giving direct, opinionated feedback
- High bar for what “good engineering” looks like
Nice to Have
- Experience with tools like Cursor or similar AI-first IDEs
- Prior exposure to prompt design or evaluation workflows
- Experience mentoring senior engineers or defining engineering standards
Engagement Details
- Rate (by location):
  - US and Canada: up to $200/hr
  - EU and Latam: up to $150/hr
  - Other locations: up to $100/hr
- Hours: ~10–20 hours/week
- Duration: Through early May (with possible extension)
- Start: ASAP
- Process:
  - Take-home evaluation exercise
  - One behavioral interview