Building TalkWithDoc: A Conversational AI System with RAG and LLMs
In an era where data drives decisions, wouldn’t it be powerful to converse with web content as naturally as you would with a person? TalkWithDoc achieves exactly that by combining Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to create a conversational interface for web content.
This tech blog explores the architecture, implementation, and challenges behind TalkWithDoc.
Overview: What is TalkWithDoc?
TalkWithDoc is a backend system designed to process web content, vectorize it, and respond to user queries using conversational AI. Its features include:
Web Crawling: Scrapes textual data from user-specified URLs.
Document Storage and Vectorization: Prepares the data for LLM input.
Conversational AI: Employs the Mixtral LLM, served via Groq, for intelligent responses.
Real-Time Chat: Facilitates dynamic interactions via WebSocket.
With a focus on simplicity, scalability, and performance, TalkWithDoc bridges the gap between static web data and dynamic user interactions.
Tech Stack
The project leverages cutting-edge tools and libraries:
Backend Framework: FastAPI for API and WebSocket handling.
Web Scraping: BeautifulSoup and requests for HTML extraction.
LLM Integration: the Mixtral 8x7B model, served through the Groq API, for AI-driven responses.
Embeddings: HuggingFace embeddings for data vectorization.
Environment Management: dotenv for secure API key and host management (see the sketch below).
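As a minimal sketch of that last point, assuming keys live in a local .env file (the variable names here are illustrative, not the project’s actual ones):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read variables from a local .env file into the environment

GROQ_API_KEY = os.getenv("GROQ_API_KEY")  # assumed key name
HOST = os.getenv("HOST", "127.0.0.1")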
Project Structure
The project is neatly divided into two Python files:
1. crawl.py: Web Crawling Module
This module handles the extraction of data from user-provided URLs. Here's a breakdown:
import os

import requests
from bs4 import BeautifulSoup

def crawl_url(url):
    """Fetch a page, parse its HTML, and save it for the indexing step."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    html = soup.prettify()
    # Save the prettified HTML to a text file for further processing
    os.makedirs("data", exist_ok=True)
    with open("data/text.txt", "w", encoding="utf-8") as file:
        file.write(html)
    print("Data successfully saved to text.txt")
    return html
Key Functionality:
Fetches the page’s HTML content with requests.
Parses and prettifies it with BeautifulSoup.
Saves the result to data/text.txt for further processing.
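To try the module on its own, a quick smoke test might look like this (the URL is just an example):

if __name__ == "__main__":
    crawl_url("https://example.com")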
2. main.py: Backend and LLM Integration
This file ties everything together, from data crawling to user interaction via WebSocket. Key highlights:
FastAPI for defining endpoints.
Groq’s hosted Mixtral LLM for natural language understanding (configured roughly as sketched below).
WebSocket for real-time conversations.
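The service_context used by the endpoint below bundles the LLM and the embedding model. Here is a plausible setup, assuming the llama-index Groq and HuggingFace integration packages (the model identifiers are assumptions, not taken from the project):

import os
from dotenv import load_dotenv
from llama_index.core import ServiceContext
from llama_index.llms.groq import Groq  # pip install llama-index-llms-groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # pip install llama-index-embeddings-huggingface

load_dotenv()

# Groq-hosted Mixtral 8x7B; the model id follows Groq's public naming
llm = Groq(model="mixtral-8x7b-32768", api_key=os.getenv("GROQ_API_KEY"))

# Any sentence-embedding model works here; bge-small is a common lightweight choice
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)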
Core WebSocket Endpoint:
@app.websocket("/chat")
async def websocket_endpoint(websocket: WebSocket):
await websocket.accept()
# Load data and create an index
documents = SimpleDirectoryReader("data/").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
# Handle client interaction
try:
while True:
data = await websocket.receive_text()
response = query_engine.query(data)
await websocket.send_text(f"Response: {response}")
except WebSocketDisconnect as e:
print(f"Client disconnected: {e}")
Key Challenges
Efficient Crawling: Extracting clean data from diverse HTML structures was tricky, requiring extensive testing with BeautifulSoup.
LLM Configuration: Setting up Groq’s Mixtral model and integrating it with vectorized data demanded precise configuration.
Real-Time Responsiveness: Ensuring smooth, lag-free WebSocket communication involved optimizing message handling.
Future Enhancements
Advanced Scraping Features: Extract structured data like tables and images for richer insights.
Improved Query Context: Enhance vectorization and LLM prompts for more accurate responses.
User-Friendly Frontend: Build a React-based interface for seamless user interactions.
Conclusion
TalkWithDoc transforms static web data into interactive knowledge, powered by Retrieval-Augmented Generation and LLMs. Whether you’re a researcher, developer, or curious learner, this project showcases how to bridge backend technology and conversational AI.