Building a Modular GenAI Framework for Document Retrieval and Interaction
Without giving too much away: professionally, we’re working on a project that has us knee-deep in legal documents—thousands upon thousands of pages of contracts, case law, and regulatory filings. Our team was tasked with building a system that could help attorneys quickly find relevant information across this massive corpus and generate accurate, properly cited responses to complex legal questions.
The challenge isn’t just the volume of documents. It’s the need for precision, attribution, and the ability to synthesize information across multiple sources. Traditional search approaches fell short, and off-the-shelf GenAI solutions couldn’t provide the reliability or citation tracking we needed.
The proposed approach is a modular framework that combines the best of retrieval augmented generation (RAG), document summarization techniques like LexRank, and agentic models, all designed to be quickly repurposed for different document-heavy domains.
Once deployed, this system should reduce research time for complex legal questions while maintaining a high accuracy rate compared to expert-generated answers. The key isn’t any single technology; it’s how multiple approaches are integrated into a cohesive, modular system.
Let’s dive into the framework so you can adapt it to your own document-intensive use cases.
The Technology Stack: Beyond Basic RAG
Before we get into implementation details, let’s understand the core technologies that make this framework possible:
RAG: The Foundation, Not the Destination
Retrieval Augmented Generation has become the standard approach for grounding LLMs in external knowledge. But basic RAG implementations—which simply retrieve chunks of text based on vector similarity and feed them to an LLM—have significant limitations when dealing with complex document sets:
- They struggle with multi-hop reasoning across documents
- They often miss important context that’s spread across multiple sections
- They provide limited citation capabilities
- They can’t effectively prioritize authoritative sources
Our approach uses RAG as a foundation but extends it significantly with additional components.
LexRank: The Unsung Hero of Document Summarization
While everyone talks about RAG, fewer people leverage LexRank—an algorithm that identifies the most representative sentences in a document or collection of documents.
LexRank works by:
- Creating a graph where sentences are nodes
- Connecting sentences based on similarity measures
- Applying a PageRank-like algorithm to identify the most central sentences
What makes LexRank powerful for our use case is that it excels at extractive summarization—pulling out the most important sentences verbatim rather than generating new text. This is crucial for legal and technical domains where precise wording matters.
Vector Databases: The Retrieval Engine
The choice of vector database significantly impacts both performance and capabilities. We evaluated several options:
- Pinecone: Excellent performance but limited filtering capabilities
- Weaviate: Strong hybrid search but more complex to set up
- Chroma: Simple to use but less scalable for very large collections
- Azure AI Search: Well-integrated with Azure ecosystem with strong vector search capabilities
For our implementation, we chose Weaviate for its hybrid search capabilities, which allow us to combine semantic search with metadata filtering, crucial for narrowing results by date, document type, jurisdiction, and other attributes.
Chunking Strategies: The Art of Document Segmentation
How you divide documents into chunks dramatically affects retrieval quality. Hierarchical chunking approaches typically work best:
- Document-level embeddings: For initial corpus filtering
- Section-level chunks: For main retrieval (typically 500-1000 tokens)
- Paragraph-level chunks: For targeted information extraction
- Sentence-level analysis: For LexRank summarization
This multi-level approach allows the system to zoom in and out as needed, maintaining both broad context and detailed information.
Agentic Models: The Orchestrators
The final piece is an agentic layer that orchestrates the entire process. This layer:
- Interprets user queries
- Determines the optimal retrieval strategy
- Coordinates between retrieval and summarization components
- Formulates responses with proper citations
- Handles follow-up questions with context awareness
We can implement this using a combination of prompt engineering and tool-calling capabilities in models like GPT-4 and Claude 3.
Building the Document Processing Pipeline
With our technology stack defined, let’s look at how to build the document processing pipeline—the foundation of our framework.
Document Ingestion and Preprocessing
The quality of your document processing directly impacts everything downstream. Our pipeline includes:
- Format conversion: Standardizing all documents to text-based formats
- Structure preservation: Maintaining headings, lists, and paragraph breaks
- Metadata extraction: Pulling out document dates, authors, titles, and other attributes
- Entity recognition: Identifying key entities like organizations, people, and legal citations
- Cross-reference detection: Finding and normalizing references between documents
Implementation tip: use a combination of Apache Tika for initial conversion, custom Python scripts for structure preservation, and spaCy for entity recognition. The key is creating a consistent intermediate format that preserves both content and structure.
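As a rough sketch of that ingestion step, the snippet below uses the tika and spacy Python packages (with the en_core_web_sm model); the intermediate dict format and its field names are illustrative assumptions, not part of any fixed framework API, and a production system would swap in a legal-domain NER model.
import spacy
from tika import parser  # requires a reachable Tika server or the auto-started JVM

nlp = spacy.load("en_core_web_sm")  # swap in a legal NER model in practice

def ingest_document(path):
    # Format conversion: Tika extracts text plus basic metadata from PDF/DOCX/etc.
    parsed = parser.from_file(path)
    text = parsed.get("content") or ""
    raw_metadata = parsed.get("metadata") or {}

    # Entity recognition: collect organizations, people, dates, and law references
    doc = nlp(text)
    entities = [
        {"text": ent.text, "label": ent.label_}
        for ent in doc.ents
        if ent.label_ in {"ORG", "PERSON", "DATE", "LAW"}
    ]

    # Consistent intermediate format handed to chunking and embedding
    return {
        "source_path": path,
        "text": text,
        "metadata": {
            "title": raw_metadata.get("title"),
            "date": raw_metadata.get("Creation-Date"),
        },
        "entities": entities,
    }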
Implementing Hierarchical Chunking
Rather than using a one-size-fits-all chunking strategy, the hierarchical approach requires multiple chunking passes:
def create_hierarchical_chunks(document):
    # Document-level embedding
    doc_embedding = embed_text(document.full_text)

    # Section-level chunks (based on headings)
    sections = split_by_headings(document.full_text)
    section_chunks = [
        {
            "text": section,
            "embedding": embed_text(section),
            "metadata": document.metadata
        }
        for section in sections
    ]

    # Paragraph-level chunks
    paragraph_chunks = []
    for section in sections:
        paragraphs = split_by_paragraphs(section)
        for paragraph in paragraphs:
            paragraph_chunks.append({
                "text": paragraph,
                "embedding": embed_text(paragraph),
                "metadata": document.metadata,
                "parent_section": section[:100]  # Reference to parent
            })

    # Store all levels
    store_chunks(document.id, {
        "document": {"text": document.full_text, "embedding": doc_embedding},
        "sections": section_chunks,
        "paragraphs": paragraph_chunks
    })
This approach creates a hierarchical relationship between chunks, allowing the retrieval system to navigate up and down the hierarchy as needed.
Data Vetting and Quality Control
Not all documents are created equal. The proposed implementation includes several data vetting mechanisms:
- Duplicate detection: Identifying and merging duplicate or near-duplicate content
- Contradiction detection: Flagging conflicting information across documents
- Recency analysis: Prioritizing more recent documents when information conflicts
- Authority scoring: Weighting documents based on source authority
- Quality filtering: Removing low-quality or irrelevant documents
Implementation tip: Implement a quality scoring system that assigns each document a composite score based on these factors. This score is used both for filtering during ingestion and for ranking during retrieval.
def calculate_quality_score(document):
    scores = {
        "completeness": assess_completeness(document),
        "consistency": check_internal_consistency(document),
        "authority": authority_score(document.metadata["source"]),
        "recency": recency_score(document.metadata["date"]),
        "relevance": domain_relevance_score(document)
    }
    # Weighted average
    weights = {"completeness": 0.2, "consistency": 0.2,
               "authority": 0.3, "recency": 0.2, "relevance": 0.1}
    return sum(scores[k] * weights[k] for k in scores)
Vector Embedding Approaches
The choice of embedding model significantly impacts retrieval quality:
- Domain-specific models outperform general models: We fine-tuned embeddings on legal text
- Larger embedding dimensions improve precision: We used 1536-dimensional embeddings
- Hybrid models combining multiple embeddings perform best: We ensemble multiple models
For legal documents specifically, the combination of a fine-tuned legal embedding model with a general-purpose model like OpenAI’s text-embedding-3-large provided the best results.
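One simple way to ensemble a domain-tuned model with a general-purpose one is to L2-normalize each vector and concatenate them with a weight, as in the sketch below; the embed() interface, the weighting, and the 0.6 split are illustrative assumptions rather than the exact scheme used in production.
import numpy as np

def ensemble_embedding(text, legal_model, general_model, legal_weight=0.6):
    # Each model returns a 1-D vector; normalize so neither dominates by scale
    legal_vec = np.asarray(legal_model.embed(text), dtype=np.float32)
    general_vec = np.asarray(general_model.embed(text), dtype=np.float32)
    legal_vec /= np.linalg.norm(legal_vec) + 1e-9
    general_vec /= np.linalg.norm(general_vec) + 1e-9

    # Weighted concatenation keeps both signals available to the vector index
    combined = np.concatenate([legal_weight * legal_vec,
                               (1.0 - legal_weight) * general_vec])
    return combined / (np.linalg.norm(combined) + 1e-9)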
The Retrieval Engine: Finding the Needle in the Haystack
The next component is the retrieval engine—the system that finds relevant information based on user queries.
Vector Database Configuration
A possible Weaviate configuration includes:
- Multiple collections for different chunk sizes: Document, section, and paragraph levels
- Hybrid search configuration: Combining vector search with BM25 text search
- Metadata filtering: Enabling filtering by document attributes
- Cross-references: Maintaining relationships between chunks
class_obj = {
    "class": "Paragraph",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {
        "text2vec-openai": {
            "model": "text-embedding-3-large",
            "modelVersion": "latest"
        }
    },
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "document_id", "dataType": ["string"]},
        {"name": "section_id", "dataType": ["string"]},
        {"name": "metadata", "dataType": ["object"]},
        {"name": "quality_score", "dataType": ["number"]}
    ]
}
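With a schema like that in place, a hybrid query could look like the sketch below. It assumes the v3 weaviate-client Python API, an OpenAI key configured for the text2vec-openai module, and that jurisdiction has been promoted to a top-level property so it can be used in a where filter; the alpha value and endpoint are placeholders.
import weaviate

client = weaviate.Client("http://localhost:8080")  # adjust to your deployment

def hybrid_search(query, jurisdiction, top_k=25):
    # Hybrid search blends BM25 keyword scoring with vector similarity
    where_filter = {
        "path": ["jurisdiction"],
        "operator": "Equal",
        "valueString": jurisdiction,
    }
    result = (
        client.query
        .get("Paragraph", ["content", "document_id", "section_id", "quality_score"])
        .with_hybrid(query=query, alpha=0.5)
        .with_where(where_filter)
        .with_limit(top_k)
        .do()
    )
    return result["data"]["Get"]["Paragraph"]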
Multi-Stage Retrieval Process
Rather than a single retrieval step, we can implement a multi-stage process:
- Query analysis: Understanding the type of information needed
- Corpus filtering: Using document-level embeddings to narrow the search space
- Primary retrieval: Finding the most relevant sections
- Contextual expansion: Adding related paragraphs for context
- Re-ranking: Applying additional relevance criteria
This multi-stage approach significantly improves precision while maintaining recall.
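The sketch below shows how those five stages might be wired together; every helper it calls (analyze_query, filter_corpus, and so on) is a placeholder for the components described in this section, not an existing API.
def multi_stage_retrieve(query, top_k=25):
    # Stage 1: query analysis decides what kind of evidence is needed
    analysis = analyze_query(query)

    # Stage 2: corpus filtering via document-level embeddings narrows the search space
    candidate_docs = filter_corpus(query, analysis)

    # Stage 3: primary retrieval finds the most relevant section-level chunks
    sections = retrieve_sections(query, candidate_docs, limit=top_k)

    # Stage 4: contextual expansion pulls in neighboring paragraphs for context
    expanded = expand_with_context(sections)

    # Stage 5: re-ranking applies authority, recency, and contextual relevance
    return rerank(expanded, query)[:top_k]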
Relevance Scoring Beyond Vector Similarity
Vector similarity alone isn’t enough for complex queries, so the scoring function combines several signals:
- Vector similarity: Semantic relevance to the query
- Term overlap: Keyword matching for precision
- Authority score: Weighting by source reliability
- Recency: Prioritizing newer information when appropriate
- Contextual relevance: How well the chunk fits with other retrieved chunks
def score_chunk(chunk, query, retrieved_chunks):
    # Embed the query (a caller could compute this once and reuse it)
    query_embedding = embed_text(query)

    # Base vector similarity
    vector_score = vector_similarity(chunk.embedding, query_embedding)
    # Term overlap
    term_score = calculate_term_overlap(chunk.text, query)
    # Authority from metadata
    authority_score = get_authority_score(chunk.metadata)
    # Recency factor
    recency_score = calculate_recency(chunk.metadata.get("date"))
    # Contextual relevance to other retrieved chunks
    context_score = calculate_contextual_relevance(chunk, retrieved_chunks)

    # Weighted combination
    final_score = (vector_score * 0.4 +
                   term_score * 0.2 +
                   authority_score * 0.2 +
                   recency_score * 0.1 +
                   context_score * 0.1)
    return final_score
Citation Tracking and Source Attribution
For domains like legal research, proper citation is critical. The system tracks:
- Source documents for all retrieved information
- Specific locations within documents (page, section, paragraph)
- Citation formatting based on domain standards
- Confidence scores for each piece of information
This enables the system to generate properly cited responses and allows users to verify information directly from source documents.
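A lightweight way to carry this information through the pipeline is to attach a citation record to every retrieved span. The dataclass below is a sketch of what such a record could hold, with field names and rendering chosen purely for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Citation:
    document_id: str               # source document for the retrieved span
    page: Optional[int] = None     # specific location within the document
    section: Optional[str] = None
    paragraph: Optional[int] = None
    quote: str = ""                # verbatim text supporting the claim
    confidence: float = 0.0        # how strongly the span supports the claim

    def render(self) -> str:
        # Minimal rendering; real output would follow the configured
        # citation standard (e.g. Bluebook for legal documents)
        parts = [self.document_id]
        if self.section:
            parts.append(f"sec. {self.section}")
        if self.page is not None:
            parts.append(f"p. {self.page}")
        if self.paragraph is not None:
            parts.append(f"para. {self.paragraph}")
        return "[" + ", ".join(parts) + "]"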
The Summarization Layer: From Retrieval to Understanding
Retrieving relevant chunks is only half the battle. The summarization layer transforms these chunks into coherent, comprehensive answers.
LexRank Implementation for Extractive Summarization
The LexRank implementation works in several stages:
- Sentence segmentation: Breaking retrieved chunks into individual sentences
- Similarity matrix construction: Calculating pairwise similarities between sentences
- Graph creation: Building a graph where sentences are nodes and similarities are edges
- Centrality calculation: Applying a modified PageRank algorithm
- Sentence selection: Extracting the most central sentences
def lexrank_summarize(chunks, query, num_sentences=5):
    # Extract all sentences from chunks
    sentences = []
    for chunk in chunks:
        chunk_sentences = split_into_sentences(chunk.text)
        for sentence in chunk_sentences:
            sentences.append({
                "text": sentence,
                "source": chunk.metadata,
                "chunk_id": chunk.id
            })

    # Create similarity matrix
    embeddings = [embed_text(s["text"]) for s in sentences]
    similarity_matrix = cosine_similarity(embeddings)

    # Apply threshold to create graph
    threshold = 0.3
    for i in range(len(similarity_matrix)):
        for j in range(len(similarity_matrix)):
            if similarity_matrix[i][j] < threshold:
                similarity_matrix[i][j] = 0

    # Calculate LexRank scores
    scores = calculate_lexrank_scores(similarity_matrix)

    # Select top sentences
    ranked_sentences = [(sentences[i], scores[i])
                        for i in range(len(sentences))]
    ranked_sentences.sort(key=lambda x: x[1], reverse=True)

    # Return top sentences with their sources
    return [rs[0] for rs in ranked_sentences[:num_sentences]]
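The calculate_lexrank_scores helper above is left abstract. A common way to implement it is power iteration over the row-normalized similarity graph with a damping factor, as in the numpy-only sketch below; the damping, iteration, and tolerance parameters are illustrative defaults.
import numpy as np

def calculate_lexrank_scores(similarity_matrix, damping=0.85, max_iter=100, tol=1e-6):
    sim = np.asarray(similarity_matrix, dtype=np.float64)
    n = sim.shape[0]

    # Row-normalize so each row becomes a transition probability distribution
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for isolated sentences
    transition = sim / row_sums

    # Power iteration on the PageRank-style update
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - damping) / n + damping * transition.T.dot(scores)
        if np.abs(updated - scores).sum() < tol:
            scores = updated
            break
        scores = updated
    return scores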
The key advantage of LexRank is that it extracts actual sentences from the documents, preserving the original wording—critical for accuracy in technical and legal domains.
Combining RAG with LexRank
This framework combines RAG and LexRank in a complementary way:
- RAG retrieves relevant chunks based on semantic similarity
- LexRank identifies the most representative sentences within those chunks
- The LLM generates a coherent answer using both the retrieved chunks and the LexRank summary
- Citations are maintained throughout the process
This hybrid approach leverages the strengths of both techniques—RAG’s ability to find relevant information and LexRank’s ability to identify the most important points.
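Put together, the flow looks roughly like the sketch below. It reuses lexrank_summarize from above, assumes a retrieve() function wrapping the multi-stage retrieval, and uses the OpenAI chat completions API as one possible generation step; the prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_citations(query):
    # RAG step: retrieve the most relevant chunks
    chunks = retrieve(query)

    # LexRank step: extract the most representative sentences verbatim
    key_sentences = lexrank_summarize(chunks, query, num_sentences=5)

    # Build a prompt carrying the extractive summary with source identifiers
    # preserved so the model can cite them
    evidence = "\n".join(
        f"[{s['chunk_id']}] {s['text']}" for s in key_sentences
    )
    prompt = (
        "Answer the question using only the evidence below. "
        "Cite the bracketed source id after every claim.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content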
Maintaining Context Across Multiple Documents
One challenge with complex queries is maintaining context across multiple documents:
- Cross-document entity resolution: Identifying when different documents refer to the same entities
- Temporal alignment: Understanding the chronological relationship between documents
- Contradiction detection: Identifying and resolving conflicting information
- Context graphs: Building relationships between information across documents
This allows the system to synthesize information coherently even when it’s spread across multiple sources.
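A context graph can be as simple as a networkx graph whose nodes are documents and resolved entities and whose edges record mentions and temporal order. The sketch below assumes each chunk carries normalized entities from ingestion and ISO-formatted dates; it illustrates the idea rather than the framework’s actual data model.
import networkx as nx

def build_context_graph(chunks):
    graph = nx.MultiDiGraph()
    for chunk in chunks:
        doc_id = chunk["metadata"]["document_id"]
        graph.add_node(doc_id, kind="document", date=chunk["metadata"].get("date"))

        # Link each document to the entities it mentions (cross-document
        # entity resolution happens upstream, so names are already normalized)
        for entity in chunk.get("entities", []):
            graph.add_node(entity, kind="entity")
            graph.add_edge(doc_id, entity, relation="mentions")

    # Temporal alignment: add "precedes" edges between dated documents
    docs = [n for n, d in graph.nodes(data=True) if d.get("kind") == "document"]
    for a in docs:
        for b in docs:
            date_a, date_b = graph.nodes[a].get("date"), graph.nodes[b].get("date")
            if a != b and date_a and date_b and date_a < date_b:
                graph.add_edge(a, b, relation="precedes")
    return graph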
Ensuring Factual Accuracy
To maximize accuracy, this framework includes several verification mechanisms:
- Fact verification: Checking generated content against retrieved chunks
- Citation validation: Ensuring citations actually support the claims
- Confidence scoring: Indicating the system’s confidence in each part of the response
- Uncertainty highlighting: Explicitly marking areas where information is ambiguous or conflicting
def verify_response(generated_response, retrieved_chunks):
    # Extract claims from response
    claims = extract_claims(generated_response)

    # Verify each claim against retrieved chunks
    verification_results = []
    for claim in claims:
        support = find_supporting_evidence(claim, retrieved_chunks)
        contradictions = find_contradicting_evidence(claim, retrieved_chunks)
        verification_results.append({
            "claim": claim,
            "supported": len(support) > 0,
            "support_strength": calculate_support_strength(support),
            "contradicted": len(contradictions) > 0,
            "sources": [s.metadata for s in support]
        })

    # Annotate response with verification results
    annotated_response = annotate_response(generated_response, verification_results)
    return annotated_response
The Interaction Layer: Making It User-Friendly
The final component is the interaction layer—the interface between users and the underlying system.
Agentic Models for User Interaction
We implement an agentic approach in which the model:
- Interprets the user’s question to understand intent
- Formulates a retrieval strategy based on the question type
- Executes the appropriate retrieval and summarization steps
- Generates a response with appropriate citations
- Maintains conversation context for follow-up questions
This agent-based approach allows the system to adapt its strategy based on the specific question rather than using a one-size-fits-all approach.
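As one concrete pattern, the retrieval and summarization steps can be exposed to the model as tools via the OpenAI tool-calling API. The sketch below shows only the tool declarations and a single call; the dispatch loop is omitted, and the tool names simply mirror the placeholder functions used earlier in this post.
from openai import OpenAI

client = OpenAI()

# Tools the agent can call; schemas mirror the retrieval and summarization
# components described above (names are placeholders, not a published API)
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "multi_stage_retrieve",
            "description": "Retrieve relevant document chunks for a legal question",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "jurisdiction": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "lexrank_summarize",
            "description": "Extract the most representative sentences from retrieved chunks",
            "parameters": {
                "type": "object",
                "properties": {"num_sentences": {"type": "integer"}},
            },
        },
    },
]

def ask_agent(question, history):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=history + [{"role": "user", "content": question}],
        tools=TOOLS,
    )
    # The caller inspects response.choices[0].message.tool_calls and dispatches
    return response.choices[0].message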
Query Understanding and Refinement
Not all user queries are well-formed, so the system refines them before retrieval:
- Query classification: Identifying the type of question (factual, analytical, comparative)
- Query expansion: Adding relevant terms to improve retrieval
- Ambiguity resolution: Identifying and clarifying ambiguous terms
- Decomposition: Breaking complex questions into simpler sub-questions
def process_query(query, conversation_history=None):
    # Classify query type
    query_type = classify_query(query)

    # Expand query based on type
    expanded_query = expand_query(query, query_type)

    # Check for ambiguities
    ambiguities = detect_ambiguities(expanded_query)
    if ambiguities and not conversation_history:
        # If this is the first interaction, ask for clarification
        return generate_clarification_request(ambiguities)

    # For complex queries, decompose into sub-questions
    if query_type == "complex":
        sub_questions = decompose_query(expanded_query)
        sub_answers = [retrieve_and_generate(sq) for sq in sub_questions]
        return synthesize_answers(sub_answers, query)
    else:
        # Standard retrieval and generation
        return retrieve_and_generate(expanded_query)
Presenting Results with Citations
The presentation of results is critical for usability. The system:
- Structures responses logically with clear sections
- Includes inline citations linked to source documents
- Provides confidence indicators for different parts of the response
- Offers direct quotes from source documents for key points
- Includes a summary of sources used at the end
Handling Follow-up Questions
One of the most powerful features is the ability to handle follow-up questions with full context awareness:
- Maintaining conversation history across interactions
- Resolving references to previous questions and answers
- Reusing relevant retrievals from previous questions
- Tracking the user’s focus as it shifts during conversation
This creates a much more natural interaction pattern where users can drill down into topics of interest.
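A simple way to support this is a per-session state object that stores the running message history and caches retrievals keyed by query, as in the hypothetical sketch below; field names and the caching policy are illustrative.
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    messages: list = field(default_factory=list)            # running chat history
    cached_retrievals: dict = field(default_factory=dict)   # query -> chunks

    def add_turn(self, role, content):
        self.messages.append({"role": role, "content": content})

    def retrieve(self, query, retriever):
        # Reuse earlier retrievals when a follow-up touches the same query
        if query not in self.cached_retrievals:
            self.cached_retrievals[query] = retriever(query)
        return self.cached_retrievals[query]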
Making It Modular and Scalable
The key to making this framework reusable across different domains is its modular architecture.
Component-Based Architecture
This framework is built around clearly defined components:
- Document Processor: Handles ingestion and chunking
- Embedding Service: Manages vector embeddings
- Retrieval Engine: Finds relevant information
- Summarization Service: Applies LexRank and other techniques
- Response Generator: Creates the final output
- Interaction Manager: Handles user queries and conversation
Each component has well-defined interfaces, allowing them to be replaced or modified independently.
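In Python these interfaces can be expressed as Protocols, so any implementation (Weaviate-backed, Azure-backed, in-memory for tests) can be swapped in without touching the rest of the pipeline. The method signatures below are illustrative, not a fixed contract.
from typing import Protocol

class EmbeddingService(Protocol):
    def embed(self, text: str) -> list[float]: ...

class RetrievalEngine(Protocol):
    def retrieve(self, query: str, top_k: int = 25) -> list[dict]: ...

class SummarizationService(Protocol):
    def summarize(self, chunks: list[dict], query: str) -> list[dict]: ...

class ResponseGenerator(Protocol):
    def generate(self, query: str, chunks: list[dict], summary: list[dict]) -> str: ...

# A pipeline depends only on the interfaces, never on a concrete backend
def answer(query: str,
           retriever: RetrievalEngine,
           summarizer: SummarizationService,
           generator: ResponseGenerator) -> str:
    chunks = retriever.retrieve(query)
    summary = summarizer.summarize(chunks, query)
    return generator.generate(query, chunks, summary)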
Separation of Concerns
This modular approach enables clear separation of concerns:
- Domain-specific knowledge is contained in the document corpus
- Retrieval logic is independent of document content
- Summarization techniques can be swapped based on domain needs
- Interaction patterns can be customized for different user types
This makes the framework adaptable to different domains without rewriting core components.
Configuration-Driven Customization
Rather than hardcoding domain-specific logic, we can use configuration files to customize behavior:
domain_config:
  name: "legal_research"

  # Document processing
  chunking:
    strategy: "hierarchical"
    section_delimiters: ["##", "Section", "Article"]
    max_chunk_size: 1000

  # Embedding
  embedding:
    model: "text-embedding-3-large"
    dimensions: 1536

  # Retrieval
  retrieval:
    vector_db: "weaviate"
    top_k: 25
    reranking: true
    filters:
      - metadata_field: "jurisdiction"
        required: true
      - metadata_field: "document_type"
        required: false

  # Summarization
  summarization:
    primary_method: "lexrank"
    extractive_ratio: 0.7
    min_sentences: 3

  # Response generation
  response:
    citation_format: "bluebook"
    include_confidence: true
    max_length: 1000
This configuration-driven approach allows quick adaptation to new domains without code changes.
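Loading and applying such a file is straightforward. The sketch below uses PyYAML and assumes component factories keyed by the names that appear in the config; the registries and file path shown are illustrative.
import yaml

def load_domain_config(path):
    with open(path) as f:
        return yaml.safe_load(f)["domain_config"]

def build_pipeline(config):
    # Pick concrete components from registries keyed by the config values;
    # the registries themselves are populated elsewhere in the codebase
    retriever = VECTOR_DB_REGISTRY[config["retrieval"]["vector_db"]](
        top_k=config["retrieval"]["top_k"],
        filters=config["retrieval"]["filters"],
    )
    summarizer = SUMMARIZER_REGISTRY[config["summarization"]["primary_method"]](
        min_sentences=config["summarization"]["min_sentences"],
    )
    return retriever, summarizer

config = load_domain_config("configs/legal_research.yaml")
retriever, summarizer = build_pipeline(config)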
Deployment Patterns
The framework supports several deployment patterns:
- Standalone application: For self-contained deployments
- Microservices architecture: For scalable cloud deployments
- Hybrid deployment: With sensitive components on-premises
- Serverless functions: For cost-effective scaling
This framework can be deployed in both cloud and on-premises environments.
Implementation Case Study: Legal Document Analysis
To illustrate the framework in action, let’s look at our implementation for a legal research application.
The Challenge
A law firm needed to analyze thousands of contracts, case law documents, and regulatory filings to:
- Identify relevant precedents for specific legal questions
- Extract key clauses and provisions from contracts
- Track regulatory changes affecting client obligations
- Generate properly cited legal memos
Implementation Details
- Document Processing:
  - Specialized legal document parsers for different document types
  - Hierarchical chunking based on legal document structure
  - Entity recognition for legal entities, citations, and provisions
- Retrieval Strategy:
  - Primary focus on authoritative sources
  - Jurisdiction-based filtering
  - Temporal awareness for legal precedents
  - Citation graph analysis
- Summarization Approach:
  - Heavy emphasis on extractive summarization
  - Preservation of exact legal language
  - Structured output following legal memo formats
- User Interaction:
  - Query templates for common legal questions
  - Explicit confidence indicators
  - Full citation tracking
  - Interactive exploration of legal reasoning
Lessons Learned
- Domain-specific chunking is critical: Legal documents required specialized chunking based on their structure
- Authority matters more than recency: Unlike some domains, older but more authoritative sources often take precedence
- Citation precision is non-negotiable: Even small citation errors significantly reduced trust in the system
- Explanation is as important as answers: Attorneys valued the system’s ability to explain its reasoning
Final Thoughts: Building Your Own Framework
Here are the key takeaways for building your own implementation:
- Go beyond basic RAG: Combine multiple techniques like LexRank for better results
- Invest in document processing: The quality of your chunking and metadata extraction directly impacts everything downstream
- Make modularity a priority: Design for reusability across domains from the start
- Focus on attribution: Proper citation and source tracking builds trust and enables verification
- Adapt to your domain: Customize retrieval and summarization strategies based on domain-specific needs
The most exciting aspect of this approach is its adaptability. Variations of this framework can be deployed across legal, medical, financial, and technical documentation domains—each time adapting the components rather than starting from scratch.
If you’re dealing with large document collections and need more than just basic search or generic AI responses, this modular approach combining RAG, LexRank, and agentic models provides a powerful foundation that can be tailored to your specific needs.
The future of document intelligence isn’t just about finding information—it’s about understanding it, synthesizing it, and presenting it in ways that augment human expertise rather than just attempting to replace it.
This post draws on my experience building document intelligence systems across multiple domains. The approaches outlined here have been implemented in production environments handling millions of documents and thousands of daily queries.