Large Language Models (LLMs) have gained tremendous popularity in recent years, transforming how we interact with technology, access information, and perform everyday tasks. RAG stands for Retrieval-Augmented Generation: a RAG server is a system that enhances LLMs by integrating them with external knowledge sources. Go is a strong fit for such a server, combining high performance, ease of use, and a robust set of features that align well with the needs of modern web services; its efficiency and first-class concurrency support make it well suited to building scalable RAG servers. The architecture typically features a Go-based server that handles client requests and orchestrates the retrieval process, interfacing with a vector database and external APIs. This provides a solid foundation for AI-driven natural language applications that can serve many users concurrently with low latency.
RAG Server:
A Retrieval-Augmented Generation (RAG) server is a system designed to enhance the capabilities of large language models (LLMs) by combining two key processes: retrieval and generation. This architecture allows for more accurate and contextually relevant responses by leveraging external knowledge sources.
Key components:
Retriever: Finds relevant information from external sources
Generator (LLM): Produces responses based on input and retrieved info
Vector Database: Stores embeddings for efficient similarity search
External Knowledge Sources: Databases, APIs, document stores, etc.
Architecture
Building a RAG server in Go
In this example, we will build a RAG server that can add new data to its knowledge base and answer queries against the added context.
We will be using the following:
gin for the HTTP server in Go
weaviate as the vector database
Google Gemini model as the LLM
Google's text embedding model to calculate vector embeddings of documents and questions from the request
First, we need an API key for Gemini. Visit this link to create one.
Click on Get API Key, create an API key in a new project, and save the key; it can't be retrieved again later.
Next, we need to start the vector DB locally, which can be done with Docker:
docker run -p 5555:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.26.4
Weaviate listens on port 8080 inside the container; we map it to 5555 on the host to match the client configuration below.
Google already provides a Go SDK, genai, that helps with initializing both the generative and embedding models. Something like this:
import (
	"github.com/google/generative-ai-go/genai"
	"google.golang.org/api/option"
)

// ....
genClient, err := genai.NewClient(ctx, option.WithAPIKey(geminiKey))
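With the client in place, both models can be initialized from it; a minimal sketch, where the exact model names (gemini-1.5-flash, text-embedding-004) are assumptions for illustration:
// Generative model used to answer questions (model name is an assumption).
genModel := genClient.GenerativeModel("gemini-1.5-flash")

// Embedding model used to vectorize documents and questions (model name is an assumption).
embedModel := genClient.EmbeddingModel("text-embedding-004")
These correspond to the rag.generativeModel and rag.embedModel fields used in the snippets below.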
Similarly, weaviate also has a Go SDK for the integration. It can be set up something like this:
import "github.com/weaviate/weaviate-go-client/v4/weaviate"

client, err := weaviate.NewClient(weaviate.Config{
	Host:   "localhost:5555",
	Scheme: "http",
})
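The documents will later be stored under a "Document" class with vectors we compute ourselves, so the class can be created up front; a rough sketch, assuming the models package from "github.com/weaviate/weaviate/entities/models" and that no vectorizer module is wanted:
// Create the "Document" class; vectors are supplied by our code, so no vectorizer module is used.
err = client.Schema().ClassCreator().WithClass(&models.Class{
	Class:      "Document",
	Vectorizer: "none",
}).Do(ctx)
if err != nil {
	log.Fatal(err)
}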
Now we need two endpoints:
one to post documents to add context
one to query against the posted documents
We will use gin for this, something like below.
gin.SetMode(gin.ReleaseMode)
router := gin.New()
router.Use(gin.Recovery())
// Endpoint to add documents to the knowledge base.
router.POST("/document", addDocument)
// Endpoint to ask a question against the stored context.
router.POST("/ask", askQuestion)
Adding Documents to the Context
We need to add documents to the context for the LLM. We will leverage the embedding model provided by genai to batch-embed the provided documents, something like:
batch := rag.embedModel.NewBatch()
for _, doc := range documents {
	batch.AddContent(genai.Text(doc))
}
embedModelResp, err := rag.embedModel.BatchEmbedContents(ctx, batch)
Then we add the documents to our vector database. The embedding model returns a vector for each document, which we store alongside the document text in Weaviate, something like:
weavObjects := make([]*models.Object, len(documents))
for i, doc := range documents {
	weavObjects[i] = &models.Object{
		Class: "Document",
		Properties: map[string]any{
			"text": doc,
		},
		Vector: embedModelResp.Embeddings[i].Values,
	}
}

_, err = rag.vectorDBClient.Batch().ObjectsBatcher().WithObjects(weavObjects...).Do(ctx)
This will do the context addition part.
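Putting these two steps together, the handler registered for /document could look roughly like the sketch below; ragServer, addDocumentRequest, and embedAndStore are hypothetical names wrapping the embedding and Weaviate snippets shown above (net/http and gin imports assumed).
// Hypothetical handler: binds the request, then runs the embed-and-store steps shown above.
func (rag *ragServer) addDocument(c *gin.Context) {
	var req addDocumentRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	// embedAndStore is a hypothetical helper containing the batch-embed and
	// ObjectsBatcher code from the snippets above.
	if err := rag.embedAndStore(c.Request.Context(), req.Documents); err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}
	c.JSON(http.StatusOK, gin.H{"status": "documents added"})
}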
Asking a Question
To ask a question, we need the vector embedding of the question text. So first we use the embedding model to get the vector, something like:
embedModelResp, err := rag.embedModel.EmbedContent(ctx, genai.Text(question))
We then use the vector value from the response to fetch relevant context for the LLM, something like:
graphQ := rag.vectorDBClient.GraphQL()
result, err := graphQ.Get().
	WithNearVector(graphQ.NearVectorArgBuilder().WithVector(embedModelResp.Embedding.Values)).
	WithClassName(rag.class).
	WithFields(graphql.Field{Name: "text"}).
	WithLimit(4).
	Do(ctx)
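The GraphQL response comes back as generic nested maps, so the retrieved text fields need to be collected into a slice (vectorContexts, used below); a minimal sketch, assuming the Document class and text field from above:
// Walk result.Data["Get"]["Document"] and collect each object's "text" field.
var vectorContexts []string
getPart, _ := result.Data["Get"].(map[string]any)
docs, _ := getPart["Document"].([]any)
for _, d := range docs {
	if obj, ok := d.(map[string]any); ok {
		if text, ok := obj["text"].(string); ok {
			vectorContexts = append(vectorContexts, text)
		}
	}
}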
We then combine the results from the above query as context with the question in a template to pass to our LLM. The template could be something like:
### Question:
%s
### Context:
%s
### Instructions:
- Provide a clear and concise response based on the context provided.
- Stay focused on the context and avoid making assumptions beyond the given data.
- Use the context to guide your response and provide a well-reasoned answer.
- Ensure that your response is relevant and addresses the question asked.
- If the question does not relate to the context, answer it as normal.
### Expected Answer Format (Optional):
[Specify any preferred format, such as bullet points, paragraphs, or specific instructions if needed.]
Then we send the request to the LLM, something like:
ragQuery := fmt.Sprintf(template, question, strings.Join(vectorContexts, "\n"))
llmResp, err := rag.generativeModel.GenerateContent(ctx, genai.Text(ragQuery))
The response can then be passed back to the user.
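Inside the askQuestion handler, the generated text can be pulled out of the genai response and returned to the client; a minimal sketch, where the JSON field name is an assumption:
// Concatenate the text parts of the first candidate into the final answer.
var answer strings.Builder
if len(llmResp.Candidates) > 0 && llmResp.Candidates[0].Content != nil {
	for _, part := range llmResp.Candidates[0].Content.Parts {
		if text, ok := part.(genai.Text); ok {
			answer.WriteString(string(text))
		}
	}
}
c.JSON(http.StatusOK, gin.H{"answer": answer.String()})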
The full code is here.
Example
Adding Documents
Asking a Question
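As a rough illustration, a small client program like the one below could exercise both endpoints; the port, field names, and sample documents are assumptions matching the sketches above.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// post sends a JSON body to the given URL and prints the status and response.
func post(url, body string) {
	resp, err := http.Post(url, "application/json", bytes.NewBufferString(body))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}

func main() {
	// Add documents to the knowledge base.
	post("http://localhost:8080/document", `{"documents": ["Weaviate is an open-source vector database.", "Gin is a lightweight HTTP framework for Go."]}`)

	// Ask a question against the added context.
	post("http://localhost:8080/ask", `{"question": "What is Weaviate?"}`)
}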
Handling the Trade-offs
Retrieval Latency
One of the key challenges with RAG systems is ensuring the retrieval process is quick enough to meet the expectations set by Gemini Flash’s fast inference capabilities. A slow retrieval process can bottleneck the system. To manage this, we can optimize the vector database’s indexing mechanisms and search algorithms. Utilizing HNSW (Hierarchical Navigable Small World) graphs or Faiss can significantly reduce search time, especially with large datasets.
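With Weaviate, for example, the HNSW index parameters can be set when the class is created; the sketch below extends the earlier class creation, and the specific values are illustrative assumptions that would need benchmarking against real data.
// Hypothetical HNSW tuning on the "Document" class (values are illustrative only).
err = client.Schema().ClassCreator().WithClass(&models.Class{
	Class:           "Document",
	Vectorizer:      "none",
	VectorIndexType: "hnsw",
	VectorIndexConfig: map[string]any{
		"efConstruction": 128, // build-time accuracy vs. speed trade-off
		"ef":             64,  // query-time accuracy vs. speed trade-off
		"maxConnections": 32,  // graph connectivity per node
	},
}).Do(ctx)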
Complexity of Nuanced Responses
While LLMs like Gemini Flash excel at delivering quick, high-level responses, generating more nuanced answers requires optimizing the prompt and retrieval quality. Techniques like prompt engineering could be crucial here. By crafting highly specific prompts, we can guide LLMs to make the most of the retrieved context and generate more complex responses without adding much latency.