Building Efficient RAG Servers for LLMs with Go

Large Language Models (LLMs) have gained tremendous popularity in recent years, transforming how we interact with technology, access information, and perform various tasks. RAG stands for Retrieval-Augmented Generation, and a RAG server is a system that enhances LLMs by integrating them with external knowledge sources. Go is a good fit for a RAG server: its efficiency, ease of use, and strong concurrency support align well with the needs of a modern web service and make it well suited to building high-performance, scalable servers. The architecture typically features a Go-based server that handles client requests and orchestrates the retrieval process, interfacing with a vector database and external APIs. This provides a solid foundation for AI-driven natural language applications that can serve multiple users with reliable, low-latency performance.

RAG Server:

A Retrieval-Augmented Generation (RAG) server is a system designed to enhance the capabilities of language models (LLMs) by combining two key processes: retrieval and generation. This architecture allows for more accurate and contextually relevant responses by leveraging external knowledge sources.

Key components:

  • Retriever: Finds relevant information from external sources

  • Generator (LLM): Produces responses based on input and retrieved info

  • Vector Database: Stores embeddings for efficient similarity search

  • External Knowledge Sources: Databases, APIs, document stores, etc.
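
In Go, these components map naturally onto a single server struct that the HTTP handlers share. The sketch below is only illustrative; the field names mirror the ones used in the code later in this post, and the concrete types come from the SDKs introduced in the next sections.

// ragServer ties the RAG components together (names are illustrative;
// the types come from the genai and Weaviate SDKs used below).
type ragServer struct {
    ctx             context.Context
    vectorDBClient  *weaviate.Client       // retriever + vector database
    generativeModel *genai.GenerativeModel // generator (LLM)
    embedModel      *genai.EmbeddingModel  // computes embeddings for documents and questions
    class           string                 // Weaviate class that stores the documents
}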

Architecture

Architecture of an LLM application using a RAG server.

Building a RAG Server in Go

In this example, we will build a RAG server that can add new data to the knowledge base and answer queries against the added context.

We will be using the following:

  1. gin for the HTTP server in Go

  2. Weaviate as the vector database

  3. Google's Gemini model as the LLM

  4. Google's text embedding model to calculate vector embeddings of the documents and of the question in each request
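
All of these are available as ordinary Go modules, so the imports for the server look roughly like this (module versions are up to you):

import (
    "github.com/gin-gonic/gin"                           // HTTP server and routing
    "github.com/google/generative-ai-go/genai"           // Gemini generative and embedding models
    "github.com/weaviate/weaviate-go-client/v4/weaviate" // Weaviate vector database client
    "google.golang.org/api/option"                       // passes the API key to the genai client
)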

First, we need an API key for Gemini. Visit this link to create one.

Click Get API Key, create an API key in a new project, and save the key; it can't be retrieved again.
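
The server needs the key at runtime. A common pattern is to read it from an environment variable; the variable name below is just a convention of this sketch, not something the SDK requires:

// Read the Gemini API key from the environment (GEMINI_API_KEY is our own convention).
geminiKey := os.Getenv("GEMINI_API_KEY")
if geminiKey == "" {
    log.Fatal("GEMINI_API_KEY environment variable is not set")
}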

We need to start the vector DB locally, which can be done with Docker, mapping host port 5555 to Weaviate's default HTTP port (8080):

docker run -p 5555:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.26.4

Google already provides an SDK, genai, that helps with initializing both the generative and embedding models. Something like this:

import (
    "github.com/google/generative-ai-go/genai"
    "google.golang.org/api/option"
)
// ....
genClient, err := genai.NewClient(ctx, option.WithAPIKey(geminiKey))
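
From this client we can derive the two models used in the rest of the post. The model names below are assumptions; use whichever Gemini generative and embedding models your key has access to:

// Generative model that will answer questions (model name is an assumption).
generativeModel := genClient.GenerativeModel("gemini-1.5-flash")

// Embedding model that turns documents and questions into vectors.
embedModel := genClient.EmbeddingModel("text-embedding-004")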

Similarly, Weaviate has a Go SDK for the integration. The client can be initialized something like:

client, err := weaviate.NewClient(weaviate.Config{
    Host:   "localhost:5555",
    Scheme: "http",
})
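
Weaviate also needs a class (collection) to store the documents in. Since we compute the embeddings ourselves with the Gemini embedding model, the class can be created with the built-in vectorizer disabled. This is a sketch, using the "Document" class name that appears later when we add documents:

// Create the "Document" class; "none" tells Weaviate we supply our own vectors.
cls := &models.Class{
    Class:      "Document",
    Vectorizer: "none",
}
err = client.Schema().ClassCreator().WithClass(cls).Do(ctx)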

Now we need two endpoints:

  1. POST documents to add context

  2. Query the posted documents

We will use gin for this, something like below:

// Run gin in release mode with only the recovery middleware.
gin.SetMode(gin.ReleaseMode)
router := gin.New()
router.Use(gin.Recovery())

// One endpoint to add documents, one to ask questions.
router.POST("/document", addDocument)
router.POST("/ask", askQuestion)

Adding Documents to the Context

We need to add documents to the context for the LLM. We will leverage the embedding model provided by genai to batch-embed the submitted documents. Something like:

// Batch-embed all submitted documents in a single call to the embedding model.
batch := rag.embedModel.NewBatch()
for _, doc := range documents {
    batch.AddContent(genai.Text(doc))
}
embedModelResp, err := rag.embedModel.BatchEmbedContents(ctx, batch)

Then we add the documents to our vector database. The embedding model gives us a vector for each document, which we attach to the corresponding object in the vector DB, something like:

// Build one Weaviate object per document, attaching its embedding vector.
weavObjects := make([]*models.Object, len(documents))
for i, doc := range documents {
    weavObjects[i] = &models.Object{
        Class: "Document",
        Properties: map[string]any{
            "text": doc,
        },
        Vector: embedModelResp.Embeddings[i].Values,
    }
}

// Store all objects in Weaviate in a single batch.
_, err = rag.vectorDBClient.Batch().ObjectsBatcher().WithObjects(weavObjects...).Do(ctx)

That completes the context-addition part.

Asking a Question

To answer a question, we first need the vector for the question text, so we use the embedding model to compute it, something like:

embedModelResp, err := rag.embedModel.EmbedContent(ctx, genai.Text(question))

We then use the vector from the response to fetch relevant context for the LLM. Something like:

// Fetch the 4 documents closest to the question's embedding.
graphQ := rag.vectorDBClient.GraphQL()
result, err := graphQ.Get().
    WithNearVector(graphQ.NearVectorArgBuilder().WithVector(embedModelResp.Embedding.Values)).
    WithClassName(rag.class).
    WithFields(graphql.Field{Name: "text"}).
    WithLimit(4).
    Do(ctx)
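
The GraphQL response comes back as nested generic maps, so the matched "text" fields have to be pulled out before they can be used as context. A rough sketch of that extraction, assuming the class is named "Document" as above:

// Collect the "text" field of every matched object into vectorContexts.
var vectorContexts []string
if get, ok := result.Data["Get"].(map[string]any); ok {
    if docs, ok := get["Document"].([]any); ok {
        for _, d := range docs {
            if obj, ok := d.(map[string]any); ok {
                if text, ok := obj["text"].(string); ok {
                    vectorContexts = append(vectorContexts, text)
                }
            }
        }
    }
}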

We combine the results from the above query, as context, with the question in a prompt template to pass to our LLM. The template could be something like:

### Question:
%s

### Context:
%s

### Instructions:
- Provide a clear and concise response based on the context provided.
- Stay focused on the context and avoid making assumptions beyond the given data.
- Use the context to guide your response and provide a well-reasoned answer.
- Ensure that your response is relevant and addresses the question asked.
- If the question does not relate to the context, answer it as normal.

### Expected Answer Format (Optional):
[Specify any preferred format, such as bullet points, paragraphs, or specific instructions if needed.]

And send the request to the LLM, something like:

// Fill the template with the question and the retrieved context, then call the LLM.
ragQuery := fmt.Sprintf(template, question, strings.Join(vectorContexts, "\n"))
llmResp, err := rag.generativeModel.GenerateContent(ctx, genai.Text(ragQuery))

The response can then be passed back to the user.
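
The genai response wraps the generated text in candidates and parts, so a little unwrapping is needed before sending it back; a sketch, assuming a gin handler with context c:

// Concatenate the text parts of the first candidate and return them as JSON.
var answer strings.Builder
if len(llmResp.Candidates) > 0 && llmResp.Candidates[0].Content != nil {
    for _, part := range llmResp.Candidates[0].Content.Parts {
        if text, ok := part.(genai.Text); ok {
            answer.WriteString(string(text))
        }
    }
}
c.JSON(http.StatusOK, gin.H{"answer": answer.String()})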

The full code is here.

Example

Adding Documents

Asking a Question

Asked "who are you"

Asked about how it was developed

Asked about Kings Beach from the context

Handling the Trade-offs

Retrieval Latency

One of the key challenges with RAG systems is ensuring the retrieval process is quick enough to meet the expectations set by Gemini Flash’s fast inference capabilities. A slow retrieval process can bottleneck the system. To manage this, we can optimize the vector database’s indexing mechanisms and search algorithms. Utilizing HNSW (Hierarchical Navigable Small World) graphs or Faiss can significantly reduce search time, especially with large datasets.
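
In Weaviate specifically, the HNSW index can be tuned when the class is created. The values below are illustrative starting points rather than recommendations; they trade index build time and memory for query latency and recall:

// Create the class with explicit HNSW settings (values are illustrative).
cls := &models.Class{
    Class:           "Document",
    Vectorizer:      "none",
    VectorIndexType: "hnsw",
    VectorIndexConfig: map[string]any{
        "efConstruction": 128, // build-time effort: higher = better recall, slower indexing
        "maxConnections": 32,  // graph degree: higher = better recall, more memory
        "ef":             64,  // query-time effort: higher = better recall, more latency
    },
}
err = client.Schema().ClassCreator().WithClass(cls).Do(ctx)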

Complexity of Nuanced Responses

While LLMs like Gemini Flash excel at delivering quick, high-level responses, generating more nuanced answers requires optimizing the prompt and retrieval quality. Techniques like prompt engineering could be crucial here. By crafting highly specific prompts, we can guide LLMs to make the most of the retrieved context and generate more complex responses without adding much latency.