Molecule Retrieval with Natural Language Queries
A data challenge combining natural language processing and graph neural networks, completed during my master’s degree.
The goal of the challenge was to retrieve molecules from a database given natural language queries. The challenge handout can be found here, and the challenge was hosted privately on Kaggle.
The main idea is to use contrastive learning to embed the text queries and the molecular graphs in the same vector space. A similarity function then ranks the molecules against a query based on these embeddings. Our approach can be summarized in the following steps:
- Design a strong graph neural network (GNN); in our case we used the Graph Attention Network (GAT) architecture.
- Use this GNN inside the DiffPool architecture to pool the nodes hierarchically into a single graph-level embedding.
- Ensemble several such models to improve the final score.
The code and the project report are both available on GitHub.
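To make the contrastive idea more concrete, here is a minimal, dependency-free sketch of the three pieces involved: a contrastive loss over matched (text, molecule) pairs, similarity-based ranking, and ensembling. It is not the code from the repository. Several choices are assumptions made for illustration only: the symmetric cross-entropy form of the loss, cosine similarity as the similarity function, the `temperature` value, averaging similarity matrices as the ensembling rule, and all function names. In practice the embeddings would come from the text encoder and the GNN, and the loss would be computed with a tensor library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(text_embs, graph_embs):
    """Rows index text queries, columns index molecules."""
    return [[cosine_similarity(t, g) for g in graph_embs] for t in text_embs]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, graph_embs, temperature=0.1):
    """Symmetric contrastive loss over a batch of matched (text, graph) pairs.

    Pair i is the positive for row i and column i; every other entry
    in the batch acts as a negative. The temperature is a hypothetical value.
    """
    sims = similarity_matrix(text_embs, graph_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(transposed))


def rank_molecules(query_emb, graph_embs):
    """Molecule indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_emb, g) for g in graph_embs]
    return sorted(range(len(graph_embs)), key=lambda j: scores[j], reverse=True)


def ensemble_scores(matrices):
    """Element-wise mean of the similarity matrices produced by several models."""
    n_models = len(matrices)
    n_rows = len(matrices[0])
    n_cols = len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / n_models for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy embeddings: text i is meant to match molecule i.
texts = [[1.0, 0.0], [0.0, 1.0]]
graphs = [[0.9, 0.1], [0.2, 0.8]]

aligned_loss = contrastive_loss(texts, graphs)
shuffled_loss = contrastive_loss(texts, graphs[::-1])
ranking = rank_molecules(texts[0], graphs)
```

On this toy batch the loss is lower when each text is paired with its own molecule than when the molecules are swapped, which is the behaviour the training objective rewards. At inference time, the per-model similarity matrices can be combined with `ensemble_scores` before ranking.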