shrutikakapade/RAG-From-Scratch-Complete-RAG-Pipeline-Langchain

Retrieval-Augmented Generation (RAG) From Scratch

This repository provides a structured and practical guide to understanding and implementing Retrieval-Augmented Generation (RAG) using modern AI tools such as LangChain.

The goal of this project is to help students, developers, and AI practitioners learn how to build a complete RAG pipeline step-by-step, starting from the fundamental components and progressing toward a fully functional system.


What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances the capabilities of Large Language Models (LLMs) by integrating external knowledge retrieval mechanisms.

Instead of relying only on the information stored in a pre-trained model, RAG retrieves relevant information from external data sources (documents, databases, or web pages) and provides that context to the model during response generation.

This significantly improves the accuracy, reliability, and relevance of generated responses.


Why RAG is Important

  • Enables LLMs to access external knowledge sources
  • Reduces hallucinations in generated responses
  • Supports domain-specific knowledge systems
  • Allows AI systems to work with private datasets
  • Improves the accuracy and trustworthiness of AI applications

Core Components of a RAG Pipeline

A typical RAG system consists of several modular components that work together to retrieve relevant information and generate accurate responses.

  1. Document Loader
  2. Text Splitter
  3. Embedding Model
  4. Vector Database
  5. Retriever
  6. Large Language Model (LLM)
  7. Response Generation
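
One way to picture how these components fit together is as a set of small interfaces. The sketch below is plain Python for illustration only; the class and method names are hypothetical, not the repository's code or LangChain's API.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable

@dataclass
class Document:
    """A piece of text plus metadata (e.g. its source file)."""
    text: str
    metadata: dict = field(default_factory=dict)

@runtime_checkable
class DocumentLoader(Protocol):
    def load(self) -> list[Document]: ...

@runtime_checkable
class TextSplitter(Protocol):
    def split(self, docs: list[Document]) -> list[Document]: ...

@runtime_checkable
class EmbeddingModel(Protocol):
    def embed(self, text: str) -> list[float]: ...

@runtime_checkable
class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[Document]: ...

class InMemoryLoader:
    """Trivial loader, used here only to show the shape of the interface."""
    def __init__(self, texts: list[str]):
        self.texts = texts

    def load(self) -> list[Document]:
        return [Document(text=t) for t in self.texts]
```

Each notebook in the repository can then be read as supplying one concrete implementation of one of these interfaces.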

RAG System Workflow

The complete workflow of a Retrieval-Augmented Generation system typically follows these steps:

  1. Load raw data from different sources such as PDFs, websites, CSV, or JSON files.
  2. Split large documents into smaller manageable chunks.
  3. Convert text chunks into vector embeddings using embedding models.
  4. Store embeddings in a vector database.
  5. Retrieve the most relevant chunks based on a user query.
  6. Provide retrieved context to the LLM.
  7. Generate an accurate response using both retrieved knowledge and model reasoning.
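
The seven steps above can be condensed into a toy pipeline. The sketch below is an illustration only: the hashing-based `embed` stands in for a real embedding model, and a plain Python list stands in for a vector database.

```python
import math
import zlib
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    # Step 3 (toy version): hash each word into a fixed-size count vector.
    # A real pipeline would call a learned embedding model here.
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % dim] += count
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, list[float]]]:
    # Step 4 (toy version): the "vector database" is a list of
    # (chunk, vector) pairs.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query: str, k: int = 2) -> list[str]:
    # Step 5: rank stored chunks by similarity to the query embedding.
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, context: list[str]) -> str:
    # Step 6: hand the retrieved chunks to the LLM as grounding context.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
```

Swapping `embed` for a real embedding model and the list for FAISS or Chroma turns this sketch into the pipelines built in the notebooks below.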

Repository Structure

This repository is designed as a progressive learning resource where each module focuses on one component of the RAG pipeline.

  • Document Loaders
  • Text Splitters
  • Embeddings
  • Vector Databases
  • Retrievers
  • Complete End-to-End RAG Pipeline

📚 Notebook Overview – RAG Pipeline Implementation

This section provides a concise overview of each notebook, covering the complete Retrieval-Augmented Generation (RAG) pipeline from data ingestion to vector storage and retrieval.


  • langchain_document_loaders_unstructured_practical_examples.ipynb
    • Implements multiple LangChain document loaders (Web, CSV, JSON, Unstructured)
    • Demonstrates multi-source data ingestion
    • Prepares raw data for RAG pipeline processing

  • 01_rag_pdf_document_loaders_langchain_examples.ipynb
    • Focuses on PDF data extraction using different loaders
    • Compares PyPDFLoader, PyMuPDFLoader, and UnstructuredPDFLoader
    • Handles complex formats like images, Word, and PowerPoint files

  • Text_splitters_in_RAG_Pipeline.ipynb
    • Explains text splitting in RAG systems
    • Breaks large documents into smaller chunks
    • Improves LLM performance and retrieval accuracy

  • RAG_Retriever_Search.ipynb
    • Implements loading, splitting, and embedding steps
    • Converts text into vector embeddings
    • Enables semantic similarity-based retrieval

  • rag_multi_model_embeddings_faiss_ipynb.ipynb
    • Builds an end-to-end RAG pipeline
    • Uses OpenAI, Gemini, and Hugging Face embeddings
    • Integrates FAISS for fast vector search
    • Explores EUCLIDEAN and COSINE similarity strategies

  • huggingface_embeddings_to_chroma_vector_db.ipynb
    • Uses HuggingFaceEmbeddings (MiniLM) for semantic search
    • Creates structured Document objects
    • Builds Chroma vector database using from_documents()
    • Performs CRUD operations on vector data
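
The chunking idea behind the text-splitters notebook can be sketched in a few lines. This character-window version is an illustration only; real splitters such as LangChain's RecursiveCharacterTextSplitter also try to break on natural boundaries like paragraphs and sentences.

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    # Slide a window of `chunk_size` characters over the text, stepping
    # back `overlap` characters so neighbouring chunks share context.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap keeps a sentence that straddles a chunk boundary visible in both chunks, which helps retrieval accuracy.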
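
The EUCLIDEAN and COSINE strategies compared in the FAISS notebook measure different things. In toy form (illustration only; FAISS computes these over its own optimized index structures):

```python
import math

def euclidean_distance(a: list[float], b: list[float]) -> float:
    # L2 distance: smaller means more similar; sensitive to vector magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine: larger means more similar; compares direction only, not length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For example, [1, 0] and [3, 0] point the same way, so their cosine similarity is 1.0, while their Euclidean distance is 2.0: magnitude affects only the latter, which is why the two strategies can rank results differently.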

Who Should Use This Repository?

  • Students learning Generative AI
  • Developers building AI applications
  • Engineers exploring LangChain
  • Anyone interested in learning how RAG systems work

Project Objective

The objective of this repository is to provide a clear, structured, and practical learning path for building Retrieval-Augmented Generation systems from scratch using modern AI frameworks.

By following this repository, readers will gain both conceptual understanding and practical implementation experience required to develop real-world RAG-based AI applications.
