
Useful Links:

1. Querying Land Matrix via Python

Querying the Land Matrix database via Python scripts.

Open In Colab
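The querying step can be sketched as follows. This is a minimal stand-alone example, assuming the Land Matrix GraphQL endpoint lives at `https://landmatrix.org/graphql/` and that the schema exposes `deals` with `id` and `country` fields; the exact endpoint and field names used by the notebook may differ.

```python
# Minimal sketch of querying the Land Matrix GraphQL API from Python.
# Endpoint URL and field names are assumptions; adjust to the real schema.
import json
import urllib.request

LMI_ENDPOINT = "https://landmatrix.org/graphql/"  # assumed endpoint

def build_payload(query, variables=None):
    """Encode a GraphQL query as the JSON body the endpoint expects."""
    return json.dumps({"query": query, "variables": variables or {}}).encode()

def run_query(query, variables=None):
    """POST the query and return the decoded JSON response."""
    req = urllib.request.Request(
        LMI_ENDPOINT,
        data=build_payload(query, variables),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example query: a handful of deals with their target country (field names assumed).
DEALS_QUERY = """
{
  deals(limit: 3) {
    id
    country { name }
  }
}
"""

if __name__ == "__main__":
    print(run_query(DEALS_QUERY))
```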

2. Correction and alignment of country names with those in the Land Matrix

This program calculates the similarity between a given country name and the country names in an existing list, even when the input contains typos or is incomplete. It uses word embeddings to represent each country name as a numeric vector, then compares the cosine similarity between the embedding of the given name and those of the list to find the closest match, associating the given country with its most similar counterpart in the list.

Open In Colab
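The matching idea can be sketched in a few lines. The notebook uses word embeddings; in this self-contained example, character-bigram count vectors stand in for the embedding model, and cosine similarity picks the closest list entry just as described above.

```python
# Cosine-similarity matching of a (possibly misspelled) country name against a
# reference list. Character-bigram vectors stand in for word embeddings here.
from collections import Counter
from math import sqrt

def bigram_vector(name):
    s = name.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def closest_country(query, countries):
    """Return the list entry whose vector is most similar to the query's."""
    qv = bigram_vector(query)
    return max(countries, key=lambda c: cosine(qv, bigram_vector(c)))

print(closest_country("Germny", ["Germany", "France", "Ghana"]))  # typo still matches
```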

3. Adapting large language models (LLMs) using In-Context Learning

To adapt large language models (LLMs) such as Llama3 and Mixtral, we use in-context learning to generate correct GraphQL queries for querying the Land Matrix database.

Open In Colab
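The in-context-learning setup can be sketched as assembling a few-shot prompt: example (question, GraphQL query) pairs are prepended to the user's question, and the resulting prompt is sent to the chosen LLM (Llama3 or Mixtral). The example pairs and field names below are illustrative, not taken from the actual corpus.

```python
# Few-shot prompt construction for NL-to-GraphQL generation.
# EXAMPLES is a hypothetical stand-in for the real demonstration pairs.
EXAMPLES = [
    ("How many deals target Senegal?",
     '{ deals(filter: {country: "Senegal"}) { id } }'),
    ("List deals larger than 1000 ha.",
     '{ deals(filter: {size_gt: 1000}) { id size } }'),
]

def build_prompt(question):
    parts = ["Translate each question into a GraphQL query for the Land Matrix API.\n"]
    for q, gql in EXAMPLES:
        parts.append(f"Question: {q}\nQuery: {gql}\n")
    parts.append(f"Question: {question}\nQuery:")
    return "\n".join(parts)

prompt = build_prompt("Which deals target Mali?")
# `prompt` would then be passed to the chosen LLM's completion endpoint.
```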

4. Adapting large language models (LLMs) for REST query generation from natural language questions, using In-Context Learning, RAG, and LLM agents

--- LLM Agents ---

The first step used a function to retrieve external information from the LMI database, namely the values that its fields can take. Because of the database's size and complexity, this function extracted only the field values relevant to each question.

For each question, the desired fields were extracted using an LLM, which takes the user's question as input and returns a list of relevant fields. The function then sends this list to the LMI API to retrieve the values for these fields. These values are then placed in the prompt context of a second LLM that generates the REST queries. Using separate LLMs to extract relevant fields and to generate REST queries is an example of a multi-agent approach.
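The two-agent pipeline above can be sketched as follows, with both LLM calls and the LMI lookup replaced by stubs (the real notebook calls an actual model and the live LMI API; the field names and cue words here are hypothetical).

```python
# Stubbed two-agent pipeline: agent 1 extracts relevant fields, a lookup
# fetches their allowed values, agent 2 emits a REST query using those values.
def field_extractor_llm(question):
    """Agent 1 (stub): return database fields relevant to the question."""
    vocab = {"country": ["country", "where"], "intention": ["purpose", "intention"]}
    q = question.lower()
    return [f for f, cues in vocab.items() if any(c in q for c in cues)]

def fetch_field_values(fields):
    """Stub for the LMI API call that returns the allowed values per field."""
    catalogue = {"country": ["Mali", "Senegal"], "intention": ["BIOFUELS", "FOOD_CROPS"]}
    return {f: catalogue.get(f, []) for f in fields}

def query_generator_llm(question, field_values):
    """Agent 2 (stub): emit a REST query using the retrieved field values."""
    filters = "&".join(f"{f}={vals[0]}" for f, vals in field_values.items() if vals)
    return f"/api/deals/?{filters}"

question = "Which deals are in a given country?"
fields = field_extractor_llm(question)
rest_query = query_generator_llm(question, fetch_field_values(fields))
```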

--- RAG ---

To improve the quality of the REST queries generated by the model, we used the RAG approach to enrich the prompt context with natural language questions similar to the input question, together with their corresponding queries. The process begins by computing word embeddings for the input question and for the natural language questions in our question-query corpus using the model all-mpnet-base-v2. This sentence-transformer model maps sentences and paragraphs into a 768-dimensional dense vector space, useful for tasks such as clustering and semantic search. We then used Facebook AI Similarity Search (Faiss), a library for similarity search and clustering of dense vectors. Faiss provides algorithms that can search vector sets of any size, including those that do not fit in RAM, along with support code for evaluation and parameter tuning.

We calculate the similarity between the input question and all the questions stored, via their embeddings, in the vector database, and return the k most similar questions together with their corresponding REST queries for inclusion in the context. Rather than filling the context with arbitrary example queries, this method supplies REST queries similar to the input question, which helps the model generate queries that are both syntactically correct and contextually relevant.
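The retrieval step can be sketched as below. The notebook encodes questions with all-mpnet-base-v2 (768-dimensional vectors) and searches them with Faiss; here, random unit vectors and a NumPy inner-product search stand in for both, since Faiss's `IndexFlatIP` over normalized vectors computes exactly this. The corpus entries are illustrative.

```python
# NumPy stand-in for the Faiss top-k retrieval over question embeddings.
import numpy as np

rng = np.random.default_rng(0)
corpus_questions = ["How many deals target Mali?",
                    "Which deals exceed 1000 ha?",
                    "List deals by intention of investment."]
corpus_queries = ["/api/deals/?country=Mali",
                  "/api/deals/?size_gt=1000",
                  "/api/deals/?group_by=intention"]

def normalize(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Stand-in embeddings; the real pipeline would call model.encode(...) instead.
corpus_emb = normalize(rng.standard_normal((3, 768)))

def retrieve(query_emb, k=2):
    """Return the k most similar (question, REST query) pairs."""
    scores = corpus_emb @ query_emb          # cosine similarity on unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [(corpus_questions[i], corpus_queries[i]) for i in top]

hits = retrieve(corpus_emb[0])               # querying with the first question itself
```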

Open In Colab

--- Explanation of the REST Code ---

The REST_Methods folder contains:

The methods used to generate REST queries from natural language questions. We worked on In-Context Learning: Open In Colab, RAG: Open In Colab, and Agent: Open In Colab.

The file rag__DATA.xlsx contains examples of natural language questions and their equivalent REST queries to enrich the context in the RAG and Agent methods.

Context_REST: Open In Colab is a file that contains the context to pass to the prompt for each method.

Queries_Rest.xlsx represents the set of natural language questions used for testing.

To run the RAG method, for example, first execute the Context_REST file to obtain the context, then run the RAG file. (In the method scripts, you must choose which LLM to use: three options are listed at the beginning of the script, and you must select one before running the code.) Running it produces an Excel file containing the input natural language questions and the corresponding queries generated by the model. To evaluate these queries, load this file into the list of models at the beginning of the REST evaluation script: Open In Colab, which then returns the evaluation metrics.
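The core of the evaluation step can be sketched as comparing generated queries against gold queries; a simple exact-match accuracy is shown here as one plausible metric (the actual metrics reported by the evaluation notebook may differ, and the Excel loading via pandas is omitted).

```python
# Exact-match accuracy between generated and gold REST queries.
def exact_match_accuracy(generated, gold):
    assert len(generated) == len(gold)
    hits = sum(g.strip() == r.strip() for g, r in zip(generated, gold))
    return hits / len(gold)

acc = exact_match_accuracy(
    ["/api/deals/?country=Mali", "/api/deals/?size_gt=1000"],  # model output
    ["/api/deals/?country=Mali", "/api/deals/?size_gt=500"],   # gold queries
)
```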