Synthetic Data Generation With A Proof Assistant For Training Large Language Models

Master’s thesis carried out at JP Morgan and ENS Paris-Saclay. The report is available here (in French).

The context of this work is the use of large language models to solve formal mathematical reasoning tasks in a proof assistant (Lean). Specifically, we focused on generating synthetic data with Lean in a forward-reasoning manner, i.e., deriving new facts from known ones rather than working backward from a goal, a novel approach that seems to be catching on in the math/AI community. The main challenge was leveraging Lean’s metaprogramming capabilities, which are powerful but complex. A follow-up project would focus on evaluating the performance of models trained on this data.
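
To give a flavor of what forward-reasoning data generation can look like, here is a minimal Lean 4 metaprogramming sketch. It is not taken from the report: the command name `derive_trans` and the toy facts are made up for illustration. The idea is to apply a lemma to existing proofs to derive a new fact, whose statement and proof term form a (statement, proof) pair that could be serialized as a synthetic training example:

```lean
import Lean

open Lean Elab Meta

/-- Hypothetical command: combine two known facts into a new one by
    forward reasoning. It builds the term `Nat.le_trans h₁ h₂`, infers
    its statement, and logs the resulting (statement, proof) pair. -/
elab "derive_trans " f:ident g:ident : command =>
  Command.runTermElabM fun _ => do
    let h₁ ← mkConstWithFreshMVarLevels f.getId
    let h₂ ← mkConstWithFreshMVarLevels g.getId
    -- Forward step: apply a lemma to existing proofs to get a new proof.
    let prf ← mkAppM ``Nat.le_trans #[h₁, h₂]
    -- The type of the proof term is the statement of the new fact.
    let stmt ← inferType prf
    logInfo m!"new fact: {stmt}\nproof term: {prf}"

theorem fact₁ : (2 : Nat) ≤ 3 := by decide
theorem fact₂ : (3 : Nat) ≤ 5 := by decide

-- Logs `new fact: 2 ≤ 5` together with its proof term; such pairs can
-- then be exported as training examples.
derive_trans fact₁ fact₂
```

In a real pipeline one would iterate steps like this over many lemmas and many starting facts, which is where Lean's metaprogramming framework does the heavy lifting.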

Abstract:

This internship report is part of the Master’s program in Mathematics, Vision, and Learning (MVA) at ENS Paris-Saclay; the internship was carried out in collaboration with JP Morgan, within its Artificial Intelligence (AI) research department. The main objective of this work was to use large language models (LLMs) for mathematical reasoning tasks. Specifically, we focused on integrating proof assistants with LLMs to enhance their ability to reason formally.

The report first explores the theoretical foundations of proof assistants, their role in the formalization of mathematical proofs, and their integration with AI. The project involved generating synthetic data using the Lean proof assistant to train LLMs to reason more rigorously and formally. The data generation process, technical implementation, and the tools used are described in detail, with particular emphasis on metaprogramming in Lean.

The results are promising: an innovative synthetic data generation system was developed, and several avenues for improvement have been identified to fully exploit the potential of this approach. The internship also provided the opportunity to develop advanced skills in functional programming and metaprogramming while exploring synergies between AI and proof assistants.