Reflection-driven continuous evaluation harness for Slate - BC-967

Project type: Innovation
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical sciences, Mathematics
Company: Farpoint Technologies Inc.
Project duration: 6 months to 1 year
Desired start date: As soon as possible
Required language: English
Location(s): Vancouver, BC, Canada
Number of positions: 2
Desired level of education: Master's or Doctorate
Open to applications from people enrolled at an institution outside Canada: No

About the company:

Farpoint Technologies is a leading AI digital-transformation consulting company that empowers top-tier organizations to build the AI-assisted workforce of the future. We specialize in consulting for large public, private, and government entities, helping them create AI-accelerated workflows and innovative solutions. Our expertise spans LLMs, diffusion models, multimodal models, and special projects.

Please describe the project:

This project aims to build a robust, reflection-driven benchmarking and evaluation harness integrated directly into the Slate AI code editor. Unlike existing evaluation methodologies, this innovation incorporates self-diagnosing capabilities inspired by "reflection" research, enabling the system to automatically track, analyze, and correct errors generated during code edits. Slate will autonomously evaluate improvements, reflect on code-change failures, and propose self-corrections, ensuring measurable, continuous enhancement in software reliability and maintainability.

Main tasks for the candidate:
• Develop instrumentation for capturing detailed metrics (code diffs, complexity) via Git hooks (a hook sketch follows this list).
• Integrate automated tests from a subset of SWE-bench, executed after each code change.
• Prompt-engineer a system that generates unit and integration tests for any arbitrary codebase.
• Design reflective prompts that enable Slate to analyze test failures and suggest corrective actions, traversing a graph of candidate edits in search of the optimal code-generation path (see the search sketch below).
• Create a streamlined data pipeline for efficient metrics storage and analysis.
• Develop a visual dashboard providing insights into ongoing performance and reliability improvements.
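
A minimal sketch of the metrics-capture task above, assuming a Python helper script invoked from a Git post-commit hook; the metrics.db path, table schema, and choice of the lizard complexity analyzer are illustrative assumptions, not a prescribed design.

    #!/usr/bin/env python3
    # Illustrative post-commit hook helper: record diff size and cyclomatic
    # complexity per commit. Paths, schema, and metric choices are assumptions.
    import sqlite3
    import subprocess

    import lizard  # cyclomatic complexity analyzer (pip install lizard)

    def changed_python_files():
        # Files touched by the latest commit (HEAD vs. its parent).
        out = subprocess.run(
            ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [f for f in out.splitlines() if f.endswith(".py")]

    def record_metrics(db_path="metrics.db"):
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS commit_metrics "
            "(sha TEXT, file TEXT, avg_ccn REAL, nloc INTEGER)"
        )
        sha = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        for path in changed_python_files():
            info = lizard.analyze_file(path)
            funcs = info.function_list
            avg_ccn = sum(f.cyclomatic_complexity for f in funcs) / max(len(funcs), 1)
            conn.execute(
                "INSERT INTO commit_metrics VALUES (?, ?, ?, ?)",
                (sha, path, avg_ccn, info.nloc),
            )
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        record_metrics()

The hook itself would then be a one-line .git/hooks/post-commit script that calls this helper.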
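One way to read the "optimal code-generation path" task is as best-first search over a graph of candidate patches scored by failing-test count; the run_tests, reflect, and propose_fixes callables below are hypothetical placeholders for the test harness, the LLM critique step, and the patch generator.

    import heapq
    from dataclasses import dataclass, field

    @dataclass(order=True)
    class Candidate:
        score: int                        # failing-test count; lower is better
        patch: str = field(compare=False)
        failures: list = field(compare=False, default_factory=list)

    def best_first_repair(seed_patch, run_tests, reflect, propose_fixes, budget=20):
        # run_tests(patch) -> list of failure messages (hypothetical harness call)
        # reflect(patch, failures) -> LLM self-critique of the failing patch
        # propose_fixes(patch, diagnosis) -> revised candidate patches
        failures = run_tests(seed_patch)
        frontier = [Candidate(len(failures), seed_patch, failures)]
        while frontier and budget > 0:
            node = heapq.heappop(frontier)
            if node.score == 0:
                return node.patch              # all tests pass: goal node reached
            diagnosis = reflect(node.patch, node.failures)
            for patch in propose_fixes(node.patch, diagnosis):
                if budget <= 0:
                    break
                budget -= 1
                fails = run_tests(patch)
                heapq.heappush(frontier, Candidate(len(fails), patch, fails))
        return None                            # budget exhausted, no passing patch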

Methodology / techniques:
• Execution of SWE-bench tests and capture of test outcomes.
• Implementation of the Reflexion approach (https://arxiv.org/abs/2303.11366) and more recent techniques that build on it (a minimal loop sketch follows this list).
• LLM-driven reflection prompts for test creation and code editing based on test results.
• Data management using SQLite, facilitating efficient queries and analyses.
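
A minimal sketch of the Reflexion-style evaluate/reflect/retry loop named above, with each attempt persisted to SQLite for later analysis; llm and run_swe_bench_task are placeholder callables standing in for Slate's editor backend and a SWE-bench runner.

    import sqlite3

    def reflexion_loop(task, llm, run_swe_bench_task, max_attempts=3, db="metrics.db"):
        # llm(prompt) -> model completion; run_swe_bench_task(task, patch)
        # -> (passed, log). Both are hypothetical stand-ins for the real harness.
        conn = sqlite3.connect(db)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS attempts "
            "(task TEXT, attempt INTEGER, passed INTEGER, reflection TEXT)"
        )
        memory = []  # accumulated self-reflections: Reflexion's episodic memory
        for attempt in range(1, max_attempts + 1):
            prompt = f"Task: {task}\nPrior reflections: {memory}\nPropose a patch."
            patch = llm(prompt)
            passed, log = run_swe_bench_task(task, patch)
            reflection = "" if passed else llm(
                f"The patch failed with:\n{log}\nExplain the mistake and how to avoid it."
            )
            memory.append(reflection)
            conn.execute(
                "INSERT INTO attempts VALUES (?, ?, ?, ?)",
                (task, attempt, int(passed), reflection),
            )
            conn.commit()
            if passed:
                break
        conn.close()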

Required expertise or skills:
• Strong TypeScript and Python skills.
• Practical experience with Git.
• Familiarity with automated testing methodologies (unit testing frameworks, SWE-bench, or similar benchmarks).
• Knowledge of AST analysis using tree-sitter or similar parsing libraries.
• Competency in prompt engineering and designing reflection prompts for large language models.
• Skills in database usage and basic data management (SQLite).

Assets (optional):
• Experience with complexity metrics analysis (cyclomatic complexity tools like Lizard).
• Familiarity with statistical analysis and data visualization tools (Prometheus/Grafana).