On-prem / sovereign AI coding assistant for privacy-constrained tasks - BC-966

Project type: Innovation
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical Sciences, Mathematics
Company: Farpoint Technologies Inc.
Project Length: 6 months to 1 year
Preferred start date: 09/01/2025
Language requirement: English
Location(s): Vancouver, BC, Canada
No. of positions: 2
Desired education level: Master's, PhD
Open to applicants registered at an institution outside of Canada: No

About the company: 

Farpoint Technologies is a leading AI digital transformation consulting company that empowers top-tier organizations to build the AI-assisted workforce of the future. We specialize in providing consulting services to large public, private, and government entities, helping them create AI-accelerated workflows and innovative solutions. Our expertise includes leveraging LLMs, diffusion models, multimodal models, and executing special projects.

Describe the project: 

This project aims to develop a system that downloads and serves the quantized coding-assistant model best suited to the arbitrary hardware the software has been installed on.

We have developed a coding assistant tool that can use online or local LLMs. The goal of this research project is to create a system that serves LLMs locally on both Windows and macOS machines and performs a profiling / setup phase to determine the best local coding model that will run in a reasonably responsive manner (>20 tokens/s) on the user’s machine.
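As a rough illustration of what the profiling / setup phase could look like, the TypeScript sketch below probes system memory with Node's os module and picks the largest model that fits a memory budget. The model catalogue and the detectVramMb helper are hypothetical placeholders, not an existing API, and a real setup phase would also confirm the >20 tokens/s target with a short benchmark run:

```typescript
import * as os from "node:os";

// Illustrative catalogue of quantized coding models; the names and
// memory footprints are placeholders, not a shipping model list.
interface ModelSpec {
  name: string;
  minMemMb: number; // approximate memory for weights + KV cache
}

const CANDIDATES: ModelSpec[] = [
  { name: "coder-32b-q4", minMemMb: 24_000 },
  { name: "coder-14b-q4", minMemMb: 12_000 },
  { name: "coder-7b-q4", minMemMb: 6_000 },
  { name: "coder-3b-q4", minMemMb: 3_000 },
];

// Hypothetical helper: a real implementation would query the GPU via a
// native module or vendor tooling; stubbed here as CPU-only.
async function detectVramMb(): Promise<number> {
  return 0;
}

async function pickModel(): Promise<ModelSpec | undefined> {
  const systemRamMb = os.totalmem() / (1024 * 1024);
  const vramMb = await detectVramMb();
  // Prefer VRAM when a GPU is present; otherwise budget half of system
  // RAM so the OS and IDE keep headroom.
  const budgetMb = vramMb > 0 ? vramMb : systemRamMb * 0.5;
  // CANDIDATES is sorted largest-first, so this picks the biggest fit.
  return CANDIDATES.find((m) => m.minMemMb <= budgetMb);
}

pickModel().then((m) => console.log("selected:", m?.name ?? "none fits"));
```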

If the model is a reasoning model, the system should set up prompts or parameters when querying it so that thinking mode is used for the “large model” tasks and non-thinking mode for the quick “small model” tasks.
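One possible shape for that switching logic is sketched below against a local llama.cpp llama-server's OpenAI-compatible /v1/chat/completions endpoint. The /no_think soft switch follows the Qwen3 convention; other model families toggle thinking differently, so treat the toggle mechanism and the parameter values as assumptions:

```typescript
type TaskSize = "large" | "small";

// Query a local OpenAI-compatible server (e.g. llama.cpp llama-server),
// enabling thinking mode only for the heavyweight "large model" tasks.
async function query(prompt: string, task: TaskSize): Promise<string> {
  const useThinking = task === "large";
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [
        {
          role: "user",
          // Qwen3-style soft switch: append /no_think for quick tasks.
          content: useThinking ? prompt : `${prompt} /no_think`,
        },
      ],
      // Thinking traces are long, so large tasks get more output budget.
      max_tokens: useThinking ? 4096 : 512,
      temperature: useThinking ? 0.6 : 0.2,
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}
```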

Main tasks for the candidate:
• Use open-source software such as vLLM or llama.cpp to develop a user-friendly model-serving system that runs locally on the user’s machine
• Create a system that can determine the best model to serve given any arbitrary system configuration (CPU, GPU, VRAM, system RAM, etc.)
• Test and validate the system’s ability to auto-download and load local models for offline AI code assistance
• Validate advanced quantization techniques (e.g., mixed precision for different parts of the model) for coding performance and for prompt-processing / token-generation speed (see the throughput sketch after this list)
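A minimal client-side probe for the speed measurements in the last task might look like the following: it streams a completion from a local OpenAI-compatible server and derives time-to-first-token (dominated by prompt processing) and generation throughput. Counting SSE chunks only approximates token counts; server-side timings, where available, would be more precise:

```typescript
// Rough throughput probe against a local OpenAI-compatible server
// (e.g. llama.cpp llama-server). Endpoint and port are assumptions.
async function measureThroughput(prompt: string): Promise<void> {
  const start = Date.now();
  let firstTokenAt = 0;
  let tokenCount = 0;

  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: prompt }],
      max_tokens: 256,
      stream: true,
    }),
  });

  // Each server-sent-event "data:" line carries roughly one token.
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const line of decoder.decode(value).split("\n")) {
      if (!line.startsWith("data:") || line.includes("[DONE]")) continue;
      if (firstTokenAt === 0) firstTokenAt = Date.now();
      tokenCount++;
    }
  }
  if (tokenCount === 0) {
    console.log("no tokens received");
    return;
  }

  const genSeconds = (Date.now() - firstTokenAt) / 1000;
  console.log(`time to first token: ${(firstTokenAt - start) / 1000}s`);
  console.log(`generation: ${(tokenCount / genSeconds).toFixed(1)} tok/s`);
}
```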

Methodology / techniques:
• Model optimization: LoRA/QLoRA, quantization, and inference tuning.
• Local LLM serving: llama.cpp, vLLM, or other open-source libraries with an MIT or Apache license
• IDE integration: integrate the ability to download and serve local models into the existing Node.js/React Electron application
• Benchmarking: HumanEval-PHP tests, pass@k metrics (see the pass@k sketch after this list), latency measurements, quantization profiling.
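For reference, the pass@k numbers can be computed with the standard unbiased estimator introduced with HumanEval (Chen et al., 2021), where n is the number of samples generated per problem and c the number that pass the unit tests:

```typescript
// Unbiased pass@k estimator:
//   pass@k = 1 - C(n - c, k) / C(n, k)
// computed as a numerically stable product.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // fewer than k failures: every draw passes
  let allFail = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i; // telescoping form of the binomial ratio
  }
  return 1 - allFail;
}

// e.g. 200 samples per problem, 37 passing, estimate pass@10:
console.log(passAtK(200, 37, 10).toFixed(3));
```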

Required expertise/skills: 

• Practical knowledge of model quantization and fine-tuning techniques (e.g., LoRA, QLoRA, 4-/8-bit quantization).
• Proficiency in containerization and Docker deployment, including GPU (NVIDIA Container Toolkit) and CPU setups.
• Familiarity with LLM APIs for local model serving.
• Competence in TypeScript.
• Basic understanding of benchmarking methodologies, including accuracy (HumanEval), latency, and resource profiling.

Assets (optional):
• Experience with air-gapped infrastructure environments.
• Familiarity with GPU resource scheduling and optimization.