Multi-modal weakly supervised detection of exploratory episodes in software projects - ON-1167

Project type: Research
Preferred discipline(s): Engineering - computer / electrical, Engineering, Engineering - other, Computer science, Mathematical sciences
Company: Anonymous
Project duration: 4 to 6 months
Preferred start date: As soon as possible
Language requirement: English
Location(s): Toronto, ON, Canada
Number of positions: 1
Desired education level: Master's, PhD, Postdoctoral fellow
Open to applicants registered at an institution outside of Canada: No

About the company:

We are a Canadian AI startup focused on building machine learning systems that operate on real-world software development data. Our work sits at the intersection of machine learning, software engineering, and human decision-making, with an emphasis on systems that must function under real operational constraints rather than idealized benchmarks.
The founding team has experience in machine learning, natural language processing, and software engineering, and has previously worked on applying ML techniques in real-world software systems. Team members have also led and collaborated with engineering teams on complex, data-driven projects across multiple industries.
The company works with permissioned, longitudinal data from software teams, including artifacts generated through everyday development and collaboration processes. This data reflects how engineering work actually unfolds over time: it is often noisy, heterogeneous, and imperfectly documented, which creates both technical and methodological challenges that are not well addressed by existing datasets or tools.
Our approach emphasizes rigorous problem formulation, close collaboration between researchers and practitioners, and the development of methods whose outputs can be meaningfully interpreted and evaluated by humans. We are particularly interested in collaborations that bridge academic research with practical deployment considerations, and that contribute insights relevant to empirical software engineering and applied machine learning communities.

Project description:

This research project addresses a problem at the intersection of machine learning and mining software repositories (MSR): how to identify and characterize exploratory or experimental engineering activity within real-world software development projects.
Modern software repositories record rich, longitudinal traces of development activity, yet the task of distinguishing exploratory engineering work from routine implementation or maintenance remains poorly defined in existing research. There is limited consensus on appropriate problem formulations, representations, or evaluation approaches for this setting, particularly when working with noisy, heterogeneous, and weakly labeled data derived from real projects.
The company aims to advance methods and understanding in this area by formalizing the problem, examining suitable learning paradigms, and empirically evaluating alternative approaches using permissioned software development data. Rather than focusing on end-to-end system construction or incremental performance gains, the project emphasizes research questions around representation, supervision, interpretability, and robustness under real-world constraints. The outcomes are intended to generate transferable insights relevant to the MSR, applied machine learning, and empirical software engineering communities.
The candidate will lead a research-focused project that includes:
• Formal definition and scoping of exploratory engineering “episodes” within longitudinal software project histories
• Analysis of heterogeneous and weakly labeled development artifacts to understand signal characteristics and limitations
• Design and evaluation of learning approaches appropriate for limited or indirect supervision (e.g., weakly supervised, semi-supervised, or interpretable models)
• Investigation of temporal and structural representations suited to software development data
• Analysis of explanation techniques that support human interpretation and critical review of model outputs
• Documentation of assumptions, limitations, and empirical findings in a form suitable for academic dissemination
This project prioritizes research rigor, clarity of formulation, and human interpretability over production deployment. The project is expected to produce well-documented research findings and empirical insights that may be suitable for academic dissemination, subject to alignment between the student, academic supervisor, and company.

Required expertise or skills:

Academic Background
• PhD student (or advanced MSc) in Computer Science, Machine Learning, or Software Engineering
• Particularly strong fit for candidates with research interests in one or more of the following areas:
o Mining Software Repositories (MSR)
o Learning under weak or noisy supervision
o Multi-modal or heterogeneous data analysis
o Interpretable or explainable machine learning
o Natural language processing in technical domains
Technical Skills
• Proficiency in Python and common ML frameworks (e.g., PyTorch or equivalent)
• Experience working with real-world, imperfect datasets beyond curated benchmarks
• Familiarity with software development artifacts and workflows (e.g., version control systems, issue trackers)
• Ability to clearly document methods, assumptions, and results in written form
Research Mindset
• Interest in problem formulation and methodological clarity, particularly in settings without established benchmarks
• Preference for robustness, interpretability, and practical validity over optimizing for state-of-the-art metrics alone
• Ability to connect machine learning methods with domain-specific constraints and evaluation