Full Stack AI/ML Engineer

Key
On-site
Palo Alto, California, United States

About Us

This role is with one of our clients, one of the world's fastest-growing AI companies, pushing the boundaries of AI-assisted software development. Their mission is to empower the next generation of AI systems to reason about and work with real-world software repositories. You'll be working at the intersection of software engineering, open-source ecosystems, and frontier AI.

Project Overview

Our client is building high-quality evaluation and training datasets to improve how Large Language Models (LLMs) interact with realistic software engineering tasks. A key focus of this project is curating verifiable software engineering challenges from public GitHub repository histories using a human-in-the-loop process.

Why This Role Is Unique

  • Collaborate directly with AI researchers shaping the future of AI-powered software development. 
  • Work with high-impact open-source projects and evaluate how LLMs perform on real bugs, issues, and developer tasks.
  • Influence dataset design that will train and benchmark next-gen LLMs.

Role Overview: What Does a Typical Day Look Like?

  • Review and compare 3-4 model-generated code responses per task using a structured ranking system.
  • Evaluate code diffs for correctness, code quality, style, and efficiency.
  • Provide clear, detailed rationales explaining the reasoning behind each ranking decision.
  • Maintain high consistency and objectivity across evaluations.
  • Collaborate with the team to identify edge cases and ambiguities in model behavior.

Required Skills & Experience

  • 5+ years of software engineering experience, including 2+ continuous years at a top-tier product company (e.g., Stripe, Netflix, Datadog, Dropbox, Shopify, PayPal, IBM Research).
  • Strong expertise in building full-stack applications and deploying scalable, production-grade software using modern languages and tools.
  • Deep understanding of software architecture, design, development, debugging, and code quality/review assessment.
  • Proven ability to review code diffs and evaluate correctness, maintainability, and efficiency.
  • Excellent oral and written communication skills for clear, structured evaluation rationales.

Bonus Points

  • Experience in LLM research, developer agents, or AI evaluation projects.
  • Background in building or scaling developer tools or automation systems.

Engagement Details

  • Commitment: ~10-20 hours/week (partial PST overlap required)
  • Type: Contractor (no medical or paid leave benefits)
  • Duration: 1 month, starting next week; potential extensions based on performance and fit.