Reflection-driven continuous evaluation harness for Slate - BC-967

Project type: Innovation
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical Sciences, Mathematics
Company: Farpoint Technologies Inc.
Project Length: 6 months to 1 year
Preferred start date: 09/01/2025
Language requirement: English
Location(s): Vancouver, BC, Canada
No. of positions: 2
Desired education level: Master's, PhD
Open to applicants registered at an institution outside of Canada: No

About the company: 

Farpoint Technologies is a leading AI digital transformation consulting company that empowers top-tier organizations to build the AI-assisted workforce of the future. We specialize in providing consulting services to large public, private, and government entities, helping them create AI-accelerated workflows and innovative solutions. Our expertise includes leveraging LLMs, diffusion models, multimodal models, and executing special projects.

Describe the project.: 

This project aims to build a robust, reflection-driven benchmarking and evaluation harness integrated directly into the Slate AI code editor. Unlike existing evaluation methodologies, this innovation incorporates self-diagnosing capabilities inspired by "reflection" research, enabling the system to automatically track, analyze, and correct errors introduced during code edits. Slate will autonomously evaluate improvements, reflect on code-change failures, and propose self-corrections, enabling measurable, continuous improvement in software reliability and maintainability.
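
As a rough illustration, the minimal sketch below shows the reflect-evaluate-retry loop described above, in the spirit of Reflexion. The propose_patch, run_tests, and reflect callables are hypothetical stand-ins for Slate's LLM calls and test harness; only the control flow is illustrated.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Attempt:
    patch: str
    passed: bool
    failures: list[str] = field(default_factory=list)

def reflexion_loop(
    task: str,
    propose_patch: Callable[[str, list[str]], str],      # LLM: task + reflections -> patch
    run_tests: Callable[[str], tuple[bool, list[str]]],  # harness: patch -> (passed, failures)
    reflect: Callable[[str, str, list[str]], str],       # LLM: diagnose failures in prose
    max_attempts: int = 5,
) -> Attempt | None:
    """Generate a patch, run tests, and feed a reflection on any
    failures back into the next generation attempt (Reflexion-style)."""
    memory: list[str] = []  # accumulated self-reflections ("episodic memory")
    for _ in range(max_attempts):
        patch = propose_patch(task, memory)
        passed, failures = run_tests(patch)
        if passed:
            return Attempt(patch, True)                # success: stop early
        memory.append(reflect(task, patch, failures))  # feed diagnosis forward
    return None                                        # retry budget exhausted
```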

Main tasks for the candidate:
• Develop instrumentation for capturing detailed metrics (code diffs, complexity) via Git hooks (see the post-commit hook sketch after this list).
• Integrate automated tests from a subset of SWE-bench, executed after each code change.
• Prompt-engineer a system that generates unit and integration tests for arbitrary codebases.
• Design reflective prompts that enable Slate to analyze test failures and suggest corrective actions, traversing a graph of candidate edits in search of the best code-generation path (see the best-first search sketch after this list).
• Create a streamlined data pipeline for efficient metrics storage and analysis.
• Develop a visual dashboard providing insights into ongoing performance and reliability improvements.
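
A minimal sketch of the first task, assuming the lizard package for cyclomatic complexity: a post-commit hook (installed as .git/hooks/post-commit) that records diff size and worst complexity per changed file. The .slate/metrics.jsonl output path is arbitrary, and a real harness would capture more detail.

```python
#!/usr/bin/env python3
import json, os, subprocess, time

import lizard  # pip install lizard

def changed_files() -> list[tuple[int, int, str]]:
    """Return (insertions, deletions, path) for files in the latest commit.
    Note: HEAD~1 does not exist for the very first commit in a repo."""
    out = subprocess.run(
        ["git", "diff", "--numstat", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = []
    for line in out.splitlines():
        added, deleted, path = line.split("\t")
        if added != "-":  # "-" marks binary files; skip them
            rows.append((int(added), int(deleted), path))
    return rows

def main() -> None:
    os.makedirs(".slate", exist_ok=True)
    with open(".slate/metrics.jsonl", "a") as fh:
        for added, deleted, path in changed_files():
            try:
                info = lizard.analyze_file(path)  # per-function complexity
                ccn = max((f.cyclomatic_complexity for f in info.function_list),
                          default=0)
            except Exception:
                ccn = None  # non-source files (docs, configs) have no complexity
            fh.write(json.dumps({"ts": time.time(), "file": path,
                                 "added": added, "deleted": deleted,
                                 "max_ccn": ccn}) + "\n")

if __name__ == "__main__":
    main()
```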
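And a minimal sketch of the graph traversal over code-generation paths: a best-first search in which each node is a candidate patch, children are LLM revisions of it, and a node's score is the fraction of tests it passes. The revise and score callables are hypothetical stand-ins for Slate's LLM and test harness.

```python
import heapq
import itertools
from typing import Callable

def best_first_patch_search(
    root_patch: str,
    score: Callable[[str], float],       # fraction of tests passing; 1.0 = all green
    revise: Callable[[str], list[str]],  # LLM: propose child revisions of a patch
    max_nodes: int = 50,
) -> str:
    counter = itertools.count()          # tie-breaker so heapq never compares patches
    frontier = [(-score(root_patch), next(counter), root_patch)]
    best_score, best_patch = -frontier[0][0], root_patch
    expanded = 0
    while frontier and expanded < max_nodes:
        neg, _, patch = heapq.heappop(frontier)  # most promising node first
        if -neg >= 1.0:
            return patch                         # all tests pass: done
        expanded += 1
        for child in revise(patch):              # expand node with LLM revisions
            s = score(child)
            if s > best_score:
                best_score, best_patch = s, child
            heapq.heappush(frontier, (-s, next(counter), child))
    return best_patch                            # best patch found within budget
```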

Methodology / techniques:
• Execution of SWE-bench tests and capture of test outcomes (see the subset-selection sketch after this list).
• Implementation of the Reflexion approach (https://arxiv.org/abs/2303.11366) and of more recent techniques that build on it.
• LLM-driven reflection prompts for test creation and for code edits informed by test results.
• Data management using SQLite, facilitating efficient queries and analyses (see the storage sketch after this list).
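
A minimal sketch of selecting a small SWE-bench subset to run after each code change, assuming the datasets package and the published princeton-nlp/SWE-bench_Lite split. Executing the selected instances would go through SWE-bench's own evaluation harness, which is omitted here.

```python
from datasets import load_dataset

def select_subset(n_per_repo: int = 2) -> list[dict]:
    """Pick a few instances per repository so the recurring suite stays fast."""
    ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
    picked, seen = [], {}
    for inst in ds:
        repo = inst["repo"]
        if seen.get(repo, 0) < n_per_repo:
            seen[repo] = seen.get(repo, 0) + 1
            picked.append({"instance_id": inst["instance_id"], "repo": repo})
    return picked

if __name__ == "__main__":
    for item in select_subset():
        print(item["instance_id"])  # IDs to hand to the evaluation harness
```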
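And a minimal sketch of the SQLite storage layer, using only the standard library's sqlite3 module; the schema and file name are illustrative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id          INTEGER PRIMARY KEY,
    ts          REAL NOT NULL,    -- unix timestamp of the code change
    commit_sha  TEXT NOT NULL,
    tests_run   INTEGER NOT NULL,
    tests_pass  INTEGER NOT NULL,
    max_ccn     INTEGER           -- worst cyclomatic complexity observed
);
CREATE INDEX IF NOT EXISTS idx_runs_ts ON runs (ts);
"""

def open_db(path: str = "slate_metrics.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record_run(conn, ts, sha, tests_run, tests_pass, max_ccn=None):
    conn.execute(
        "INSERT INTO runs (ts, commit_sha, tests_run, tests_pass, max_ccn) "
        "VALUES (?, ?, ?, ?, ?)",
        (ts, sha, tests_run, tests_pass, max_ccn),
    )
    conn.commit()

def pass_rate_trend(conn, limit: int = 30):
    """Most recent pass rates, newest first, e.g. to feed the dashboard."""
    return conn.execute(
        "SELECT ts, CAST(tests_pass AS REAL) / tests_run FROM runs "
        "ORDER BY ts DESC LIMIT ?", (limit,),
    ).fetchall()
```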

Required expertise/skills:
• Strong TypeScript and Python skills.
• Practical experience with Git.
• Familiarity with automated testing methodologies (unit testing frameworks, SWE-bench, or similar benchmarks).
• Knowledge of AST analysis using tree-sitter or similar parsing libraries.
• Competency in prompt engineering and designing reflection prompts for large language models.
• Skills in database usage and basic data management (SQLite).

Assets (optional):
• Experience with complexity metrics analysis (cyclomatic complexity tools like Lizard).
• Familiarity with statistical analysis and with monitoring/visualization tools (Prometheus/Grafana).