RAG Flashcard Generator

A Retrieval-Augmented Generation (RAG) powered flashcard generator that turns study materials (e.g., PDFs, notes, webpages) into structured flashcards using a pipeline built on LangChain, embeddings, and LLM generation.

Completed Personal Project

The Challenge

Students and learners often struggle to generate effective study flashcards manually from large amounts of unstructured text (PDFs, notes, lecture material). This project aims to automate that process by building a tool that converts documents into structured, high-quality flashcards using a combination of retrieval and LLM-based generation.

The Solution

This project implements a RAG (Retrieval-Augmented Generation) pipeline:

  1. Document loading and chunking: Reads PDFs or text and splits them into chunks for processing.
  2. Vector embedding and storage: Converts chunks into embeddings (semantic vectors) and stores them in a vector database (ChromaDB).
  3. Retrieval: Retrieves the most relevant text chunks for each query.
  4. LLM flashcard generation: Uses an LLM (Google Gemini via LangChain and LCEL) to create flashcards based on the retrieved context.
  5. Output formatting: Structures the generated flashcards in Markdown with clear Q/A pairs.
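Step 1 of the pipeline can be sketched in plain Python. This is a toy stand-in for a character-based splitter such as LangChain's `CharacterTextSplitter`; the chunk size and overlap values below are illustrative assumptions, not the project's actual settings.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so sentences spanning a chunk boundary are not lost entirely."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

# A 1200-character document yields three overlapping chunks.
chunks = chunk_text("A" * 1200)
```

The overlap is the key design choice: without it, a definition split across two chunks could be unretrievable from either.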

Key Features

  • Automatic document ingestion (PDFs and text).
  • Vector database integration for semantic retrieval (Chroma).
  • Flashcard generation using a modern LLM (Google Gemini) with LangChain/LCEL pipelines.
  • Outputs flashcards in readable Markdown format.
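The Markdown output feature can be illustrated with a small formatter. The Q/A layout below is an assumed format for the sake of the example; the project's actual template may differ.

```python
def to_markdown(cards: list[dict]) -> str:
    """Render question/answer pairs as a Markdown flashcard deck."""
    lines = ["# Flashcards", ""]
    for i, card in enumerate(cards, start=1):
        lines.append(f"## Card {i}")
        lines.append(f"**Q:** {card['question']}")
        lines.append(f"**A:** {card['answer']}")
        lines.append("")  # blank line between cards
    return "\n".join(lines)

deck = to_markdown([
    {"question": "What does RAG stand for?",
     "answer": "Retrieval-Augmented Generation."},
])
```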

Architecture & Implementation

This project’s high-level architecture looks like:

  1. Input Layer: Document loader (PDF, text).
  2. Preprocessing: Text chunking with character-based splitting.
  3. Embedding Store: Create and store embeddings in a vector database (Chroma).
  4. Retriever: Semantic search against the vector store to pick relevant chunks.
  5. RAG Pipeline: Use LangChain/LCEL with the retriever and Google Gemini to generate flashcards.
  6. Output Formatter: Convert raw generation into structured, Markdown flashcards.
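The embed-then-retrieve steps (3 and 4 above) can be sketched without an external vector store: embed every chunk, then rank chunks by cosine similarity to the query. The `embed` function here is a toy word-count vectorizer, purely illustrative; the real pipeline uses a learned embedding model and Chroma.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lowercase word counts. A real pipeline would call
    # a learned embedding model here instead.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most semantically similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Mitochondria are the powerhouse of the cell.",
    "The French Revolution began in 1789.",
    "Cells contain organelles such as mitochondria.",
]
top = retrieve("What are mitochondria?", chunks)
```

Only the top-k chunks are passed to the LLM, which is what keeps the generation grounded in the source document.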

Technologies Used

LangChain · LCEL · Python · Google Gemini · ChromaDB · Vector Embeddings · RAG Architecture · PyPDF · Notion API · FastAPI

Challenges & Learnings

Building this project meant handling:

  • Document processing complexity: PDFs and large documents vary in structure and require reliable chunking.
  • Efficient semantic retrieval: Building an effective vector database pipeline for different document types.
  • LLM prompt design: Engineering prompts and integration with Google Gemini via LangChain to produce structured flashcards.
  • Balancing context vs. cost: Retrieving enough relevant text while minimizing API usage.
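The prompt-design challenge can be illustrated with an assumed template; the actual prompt sent to Gemini is not shown in the project description, so the wording below is hypothetical.

```python
# Hypothetical prompt template: instructs the model to stay within the
# retrieved context and to emit a parseable Markdown Q/A structure.
PROMPT_TEMPLATE = """You are a study assistant. Using ONLY the context below,
write {n} flashcards in Markdown, each with a '**Q:**' line and an '**A:**' line.

Context:
{context}
"""

def build_prompt(context_chunks: list[str], n: int = 3) -> str:
    # Joining the retrieved chunks keeps the prompt grounded in source
    # text, which limits hallucinated answers.
    return PROMPT_TEMPLATE.format(n=n, context="\n\n".join(context_chunks))

prompt = build_prompt(["RAG combines retrieval with generation."], n=2)
```

Capping `n` and the number of chunks is also one simple lever for the context-vs-cost trade-off mentioned above.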

Results & Impact

  • Enables automatic generation of flashcards from diverse sources, saving significant time for learners.
  • Produces structured summaries that are useful for exam prep, revision, or knowledge retention.
  • Demonstrates an effective application of RAG pipelines for educational tooling.

Interested in working together?

Let's discuss your project and see how I can help.

Get In Touch