Fine-tuning an LLM (Large Language Model) to generate patent applications - ON-904

Project type: Innovation
Desired discipline(s): Engineering - computer / electrical, Engineering, Computer science, Mathematical Sciences
Company: XLSCOUT LTD.
Project Length: 6 months to 1 year
Preferred start date: As soon as possible.
Language requirement: English
Location(s): ON, Canada
No. of positions: 5
Desired education level: Undergraduate/Bachelor
Open to applicants registered at an institution outside of Canada: Yes

About the company: 

XLSCOUT is the World’s Largest AI-enabled Technology Database & IP Analytics Platform with 150+ Million Patents & 200+ Million Research Publications. We use the best patent, NLP, and other data sources available on the market and further improve, standardise, and enrich the data using a combination of machine-learning algorithms and manual validation. We are one of the only platforms that provides both an advanced search option for IP professionals and an NLP-based interface that is intuitive and easy to use regardless of IP skill level.

Describe the project.: 

The project seeks to optimize the performance of a Large Language Model (LLM) tailored for patent drafting. Through meticulous fine-tuning, our objective is to elevate the LLM's capabilities, increasing the efficiency, accuracy, and precision of patent-document generation.

This project will follow an agile project-management approach and focus on continuous small releases guided by the industrial supervisor’s feedback.

    1. The initial step will be to understand XLScout’s existing algorithms for text clustering, mapping, and document ranking.
    2. We will gather authentic data from diverse sources, especially XLScout’s in-house field experts. We will also conduct comprehensive research on the available open-source LLMs that guarantee the security of the model.
    3. The pre-processing phase is a crucial and time-consuming part of developing an NLP approach, and the accuracy of the overall technique depends heavily on it. This activity consists of two stages: initial pre-processing and advanced pre-processing.
    4. After the pre-processing phase, the next step toward the aforementioned goal is building a machine-learning method. Machine-learning algorithms for these tasks, based on commonly used existing ML models, will be developed to achieve high accuracy and scalability. In this step we mainly focus on fine-tuning the available LLMs to obtain high-accuracy models. Past work used a Recurrent RAG (retrieval-augmented generation) method; in this research we will apply Multi-Model RAG to increase the efficiency of the LLMs.
    5. It is difficult to evaluate the results of unsupervised learning-based models using pre-defined metrics such as accuracy, recall, precision, etc. Therefore, the qualitative analysis will be conducted by XLScout’s R&D team to evaluate the output generated at different stages.
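The two-stage pre-processing described in step 3 could be sketched as below. The split into an initial cleanup pass and an advanced tokenisation pass, the function names, and the toy stopword list are illustrative assumptions, not XLScout's actual pipeline.

```python
import re

def initial_preprocess(text: str) -> str:
    """Initial pre-processing (assumed): case folding and whitespace cleanup."""
    text = text.lower()               # normalise case
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()

# Toy stopword list for illustration; a real pipeline would use a fuller set.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in"}

def advanced_preprocess(text: str) -> list[str]:
    """Advanced pre-processing (assumed): tokenise and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]

raw = "  A METHOD and   Apparatus for the Detection of Anomalies  "
tokens = advanced_preprocess(initial_preprocess(raw))
print(tokens)  # → ['method', 'apparatus', 'for', 'detection', 'anomalies']
```

In practice the advanced stage for patent text might also handle claim numbering, chemical formulas, and domain-specific stemming, which this sketch omits.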
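Step 4 relies on retrieval-augmented generation (RAG), in which relevant reference documents are retrieved and passed to the LLM as context. A minimal sketch of the retrieval half, using bag-of-words cosine similarity over a toy corpus of prior-art snippets, is shown below; a production system would use dense embeddings and a vector index, and every name here is an assumption for illustration.

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words count vector for a snippet (toy whitespace tokeniser)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k corpus snippets most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]

corpus = [
    "battery cell thermal management system",
    "neural network image classifier",
    "wireless battery charging circuit",
]
print(retrieve("battery charging method", corpus))
# → ['wireless battery charging circuit']
```

The retrieved snippets would then be concatenated into the prompt of the fine-tuned LLM, so the generated patent text is grounded in the most relevant prior documents.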

Required expertise/skills: 

The ideal candidate for this Mitacs internship should possess a strong background in Computer Science, Artificial Intelligence, or a related field, preferably at the graduate level.

Key required skills include: 

  1. Proficiency in NLP and ML, with a specific focus on LLMs. 
  2. Experience with AI and ML algorithms, particularly those relevant to text analysis and generation. 
  3. Strong programming skills, particularly in Python and relevant AI/ML frameworks. 
  4. Ability to work with large datasets and conduct data-driven research. 
  5. Excellent analytical and problem-solving skills. Experience working with big data, a proven track record in AI-related research, and publications in relevant fields are assets.
  6. The intern should be capable of working independently as well as collaboratively in a team, demonstrating initiative and innovation in their approach.