---
layout: base
title: "Exploring the AI brain using graphs"
keywords: "LLMs, latent space, Retrieval Augmented Generation"
contact_person: "Benjamin Ricaud"
---

## 📝 Description
The main goal of this project is to better understand how large language models structure their inner representation of data, how their "AI brain" is organized. We will collect the embeddings of an AI model for a dataset of documents, and then create graphs of the embeddings and of the topics they contain. By encoding chunks of text that follow each other in documents, we want to see how abstract concepts and ideas from the documents are connected inside the AI model.
We will explore the representation of sentences and texts in (small) text encoders such as BERT and its recent update, [ModernBERT](https://arxiv.org/abs/2412.13663). To map the latent space, we will use an approach similar to Retrieval-Augmented Generation [RAG](https://arxiv.org/abs/2005.11401) and build a graph of nearest neighbors in the latent space to structure it.
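The nearest-neighbor graph construction can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the random vectors stand in for real chunk embeddings (which would come from a pretrained encoder such as BERT or ModernBERT), and the `knn_graph` helper is a name chosen here for illustration.

```python
import numpy as np

def knn_graph(embeddings, k=5):
    """Build a directed k-nearest-neighbor graph (as an adjacency
    dict) from row-wise embedding vectors, using cosine similarity."""
    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops
    graph = {}
    for i in range(sim.shape[0]):
        # Indices of the k chunks most similar to chunk i.
        neighbors = np.argsort(sim[i])[::-1][:k]
        graph[i] = [int(j) for j in neighbors]
    return graph

# Stand-in for real sentence/chunk embeddings from an encoder model.
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
g = knn_graph(emb, k=3)
print(len(g), len(g[0]))  # 20 nodes, 3 neighbors each
```

For large corpora, the brute-force similarity matrix above would be replaced by an approximate nearest-neighbor index, as in a typical RAG vector database.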
We will first explore the structure of the graph to see whether there are clusters and central concepts or ideas inside the model's representation. We will then use the graph to evaluate how the AI model chains or connects ideas within documents and as it produces text.
In this project we will make use of simple and advanced tools from network analysis as well as machine learning and generative AI (LLMs). We will use pre-trained models from Hugging Face.
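The kind of exploration mentioned above (clusters, central concepts) can be sketched on a toy nearest-neighbor graph; the helper names and the toy graph are illustrative assumptions, and in practice a library such as NetworkX would provide these measures directly.

```python
from collections import deque

def in_degree(graph):
    """Count how often each node is chosen as a neighbor: a rough
    'central concept' score on a directed k-NN graph."""
    deg = {n: 0 for n in graph}
    for neighbors in graph.values():
        for j in neighbors:
            deg[j] += 1
    return deg

def components(graph):
    """Connected components of the undirected version of the graph,
    a first look at clusters in the latent space."""
    undirected = {n: set() for n in graph}
    for i, neighbors in graph.items():
        for j in neighbors:
            undirected[i].add(j)
            undirected[j].add(i)
    seen, comps = set(), []
    for start in undirected:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(undirected[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Toy k-NN graph with two clusters, {0, 1, 2} and {3, 4}.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
print(in_degree(g))        # {0: 2, 1: 2, 2: 2, 3: 1, 4: 1}
print(len(components(g)))  # 2
```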

**Data:** Open datasets of documents, Wikipedia data

## 📨 Contact:
Benjamin Ricaud <[email protected]>