Hosting and scaling LLMs on OKE for production-grade GenAI solutions

This is a nice blog post from OCI on the best ways to run AI models on OCI.

In this post, we explore an efficient, cost-effective approach to hosting and scaling LLMs, specifically Meta’s Llama 3 models, on OCI Kubernetes Engine (OKE). We also examine the critical infrastructure components needed to deploy production-grade GenAI solutions, such as API rate limiting, autoscaling, performance testing, and observability through robust monitoring tools.

What do you think? Is your company talking about the costs of running LLMs?

This is a good starter setup for PoCs, capable of RAG embedding.

You’ll need to drop into the AIO-RAG branch though:

git clone https://github.com/AmpereComputingAI/ampere-ai-llama-chat.git
cd ampere-ai-llama-chat
git branch        # list local branches
git branch -r     # list remote branches; aio-rag should show up here
git checkout aio-rag

Then comment out the Milvus DB service in the compose.yaml file unless you already have it installed.

I had never come across RAG until recently; now it’s all anyone seems to be talking about!

What exactly is RAG? How are pre-trained LLMs able to use customer data to generate custom answers without retraining the model or using some kind of reinforcement learning?

I did a fun thing (well, at least fun to me) and asked ChatGPT to help me understand RAG. It was pretty good! Sharing here because I for one learned something!

I said:

How does someone deploying an LLM augment it using RAG? What is the process?

ChatGPT said:

To augment a large language model (LLM) with Retrieval-Augmented Generation (RAG), the process typically involves:

  1. Setting up a retrieval system: Integrate a database or index of relevant documents.
  2. Query generation: The LLM generates a query based on the user’s input to search for relevant information in the database.
  3. Retrieving information: The system fetches documents or data that match the query.
  4. Generating response: The LLM combines the retrieved data with its knowledge to generate a more accurate and contextually rich answer.

This enables the model to access external knowledge, improving response accuracy.
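
To make those four steps concrete, here is a minimal, self-contained sketch (mine, not from the blog post or ChatGPT). It uses a toy bag-of-words retriever so it runs on the Python standard library alone; a real deployment would swap in an embedding model plus a vector DB such as Milvus, and the final LLM call is left as a hypothetical placeholder.

from collections import Counter
from math import sqrt
import re

# Toy knowledge base standing in for an indexed document store.
documents = [
    "Llama 3 can be served on OKE behind an autoscaled inference deployment.",
    "Milvus is a vector database commonly used to index document embeddings.",
    "Solar energy reduces electricity costs and carbon emissions.",
]

def vectorize(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Steps 2-3: turn the question into a query and rank the corpus against it.
    q = vectorize(query)
    return sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # Step 4: prepend the retrieved context so generation is grounded in it.
    context = "\n".join(retrieve(query))
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"

print(build_prompt("How do I store embeddings for RAG?"))
# In practice the prompt goes to your hosted model: response = call_llm(prompt)  (call_llm is hypothetical)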

I said:

How is this different from retraining the model using reinforcement learning?

ChatGPT said:

RAG and reinforcement learning (RL) serve different purposes. RAG focuses on improving response accuracy by retrieving relevant information during inference, combining external data with the LLM’s knowledge to generate richer answers.

In contrast, RL, specifically reinforcement learning with human feedback (RLHF), involves retraining the model by adjusting its parameters based on feedback to optimize performance on tasks, like improving alignment with human preferences. RL is a model fine-tuning process, while RAG is an augmentation method that integrates real-time information retrieval without retraining.

I said:

But how does the LLM know how to augment its answers using authoritative data? Doesn’t it need context?

ChatGPT said:

Yes, you’re absolutely right! For an LLM to use authoritative data effectively in a retrieval-augmented generation (RAG) setup, it does need context. Here’s how the process works, step by step, to ensure the model knows how to augment its answers appropriately:

1. User Input (Query Understanding):

The process starts with the user’s input or query. The LLM interprets this input and decides what type of information is needed. The model understands the intent behind the query and formulates a strategy to gather relevant, authoritative information.

2. Contextual Query Generation:

The model then generates a contextual query or search terms. This could involve key phrases, entities, or concepts extracted from the original user prompt. The goal is to create a query that is likely to retrieve relevant documents from a knowledge base or external data source.
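
As a toy illustration of that step (not part of ChatGPT’s answer): real systems often just embed the full question, or ask the LLM itself to rewrite it as a search query, but even a simple stopword filter gets the idea across.

STOPWORDS = {"what", "are", "the", "of", "is", "how", "do", "a", "an", "to", "for", "in"}

def build_search_query(user_prompt: str) -> str:
    # Keep only the content-bearing terms from the user's question.
    terms = [w.strip("?.,!").lower() for w in user_prompt.split()]
    return " ".join(w for w in terms if w and w not in STOPWORDS)

print(build_search_query("What are the benefits of solar energy?"))
# -> benefits solar energy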

3. Information Retrieval:

Using the generated query, a retrieval system (e.g., a vector database, Elasticsearch, or a search engine) finds the most relevant documents or snippets of information. This database is often pre-built and indexed with authoritative data sources such as academic papers, web pages, knowledge bases, or any domain-specific corpus.

  • The model might search for documents that answer specific parts of the query, or it might retrieve general background information.
  • It can rank and filter results based on relevance or specificity.

4. Augmenting with Retrieved Information:

The key part of RAG is the integration of the retrieved information. Once the relevant documents are found, the LLM uses them as additional context to generate a response.

  • The model might extract specific data points (e.g., facts, dates, statistics) from the retrieved documents and integrate them into its answer.
  • It could also rephrase or synthesize the information in the context of the user’s question.
  • Sometimes, the retrieved text can even be directly quoted or referenced to ensure the response is authoritative.

5. Final Generation:

With both the user’s input and the augmented context (i.e., the relevant documents), the model generates a response. It combines its internal knowledge (based on training) with the external, authoritative data to provide a more accurate, informed, and context-aware answer.
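
A rough sketch of steps 4 and 5 (again mine, for illustration): the retrieved chunks are packed into the prompt as numbered sources so the model can quote or cite them. The source names and snippet texts are made-up placeholders, and call_llm stands in for whatever model you actually host.

# Hypothetical retrieval output: (source name, snippet) pairs.
retrieved_chunks = [
    ("government report", "Residential solar installations cut average electricity bills substantially."),
    ("research paper", "Solar generation produces no direct carbon emissions during operation."),
]

def build_cited_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # Number each chunk so the model can reference [1], [2], ... in its answer.
    sources = "\n".join(f"[{i}] ({src}) {text}" for i, (src, text) in enumerate(chunks, 1))
    return (
        "Answer the question using only the numbered sources, and cite them as [1], [2], ... where used.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_cited_prompt("What are the benefits of solar energy?", retrieved_chunks)
print(prompt)  # this string is what would be sent to the hosted model, e.g. call_llm(prompt)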

How the LLM Knows What’s Authoritative:

The LLM can use retrieval systems that rank the sources based on trustworthiness or relevance. For example:

  • Documents might be rated by a relevance score or based on the trustworthiness of the source (e.g., academic papers are often considered more authoritative than general web pages).
  • The retrieval system might also prioritize sources that are consistent with known high-quality datasets (e.g., structured databases like Wikidata, or specialized corpora in legal, medical, or technical domains).

The LLM doesn’t directly “know” the authority of data, but it relies on the design of the retrieval system to pull in the most relevant and high-quality data available for augmenting its generation.
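
One way a retrieval layer can fold that in (a sketch, not a standard scheme): blend each hit’s semantic relevance score with a per-source trust weight before deciding what goes into the prompt. The weights and scores below are invented for illustration.

# Rough trust weights per source type; tune these for your own corpus.
SOURCE_TRUST = {"peer_reviewed": 1.0, "gov_report": 0.9, "vendor_blog": 0.6, "forum_post": 0.3}

def rerank(hits: list[dict], trust_factor: float = 0.5) -> list[dict]:
    # Each hit: {"text": ..., "source_type": ..., "relevance": 0..1 from the vector search}.
    def score(hit: dict) -> float:
        trust = SOURCE_TRUST.get(hit["source_type"], 0.5)
        return (1 - trust_factor) * hit["relevance"] + trust_factor * trust
    return sorted(hits, key=score, reverse=True)

hits = [
    {"text": "Solar cuts my bills by 20%, trust me", "source_type": "forum_post", "relevance": 0.95},
    {"text": "Lifecycle analysis of PV cost and emissions", "source_type": "peer_reviewed", "relevance": 0.80},
]
for h in rerank(hits):
    print(h["source_type"], "->", h["text"])  # the peer-reviewed hit now ranks first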

Why Context is Key:

The LLM is aware of context in several ways:

  • The model uses its understanding of the user query to determine which documents are most relevant.
  • It then uses that context to filter the most authoritative data from the retrieval system.
  • Finally, it augments its generated response with that context, ensuring the answer is informed by both internal knowledge and authoritative external information.

Example Workflow:

Let’s say a user asks, “What are the benefits of solar energy?”

  1. The LLM recognizes the need to provide detailed and authoritative information on solar energy.
  2. It generates a query like “benefits of solar energy” and retrieves a set of documents that contain authoritative information on this topic (e.g., research papers, government reports, etc.).
  3. The model synthesizes the content from those documents (e.g., environmental benefits, cost savings, etc.) and generates a final response, ensuring it includes information from trusted sources.

Difference from Reinforcement Learning:

While RAG relies on external knowledge to enrich the model’s response in real time, reinforcement learning (RL) is a training process designed to optimize a model’s performance over time based on feedback. RL would involve adjusting model parameters to improve its behavior, but it doesn’t directly augment the model’s answers with external data.

  • RAG is more focused on improving real-time performance by allowing the model to search for and leverage external information when generating responses.
  • Reinforcement Learning optimizes the model’s internal behaviors or decisions over time, based on interaction with an environment (feedback loops), but doesn’t directly involve pulling in external authoritative data.

In Summary:

In RAG, the model augments its answers by retrieving authoritative data from external sources based on the context of the user’s query. It uses this external information to enhance its generated response, ensuring the information is both accurate and contextually relevant. The model doesn’t inherently “know” the authority of data; it depends on the retrieval system to pull in the most relevant and trusted sources, which the model then integrates into its response.

I said:

Would it be fair to describe RAG as basically improving an LLM’s search-fu by giving greater weight to an information source considered authoritative?

ChatGPT said:

Yes, that’s a fair and succinct description of what RAG (Retrieval-Augmented Generation) does at a high level! It essentially enhances an LLM’s search capabilities (its “search-fu”) by integrating external, authoritative information sources and using that data to generate more accurate and informed responses.

Here’s how that plays out:

1. Improved Search (Search-fu):

  • The LLM itself doesn’t inherently have access to up-to-date or external data. However, in a RAG setup, the model is paired with a retrieval system (like a search engine or database) that helps the model perform a much more targeted and relevant search based on the input query.
  • The retrieval system finds documents or data that are relevant to the user’s question.

2. Greater Weight to Authoritative Sources:

  • RAG ensures that the retrieved data comes from trusted, high-quality sources. For example, academic papers, domain-specific databases, or expert-curated resources can be prioritized to ensure the model’s responses are grounded in authoritative content.
  • The LLM uses the retrieved data to guide its final response, which can then be adjusted to place greater emphasis on more credible sources. So, if multiple sources provide conflicting answers, the model can weigh those documents based on authority, relevance, or trustworthiness.

In essence:

  • RAG improves an LLM’s search “skills” by allowing it to query an external database or knowledge base to find the most relevant information.
  • Then, it combines that retrieved data with its internal knowledge to generate a response that is both more accurate and contextually richer than what the model could provide alone.

So, it’s not just about making the LLM better at finding information; it’s about ensuring that the information it finds is from sources that are authoritative, trustworthy, and relevant to the question. This ultimately improves the overall quality and credibility of the LLM’s answers.

Well, I won’t repeat ChatGPT’s essay 🙂
Open WebUI has a Workspace for each user that can be used to store your own text, PDF docs and web URLs, which then get added to the vector DB. You can tag them into broad categories.
For example, to build an Ampere expert I would write a scraper that took all the Ampere blogs, white papers etc. and turned them into text, then upload those into my Open WebUI workspace with tags like Cloud, Containers, Virtualisation, Features, Inference and so on. Then I would start my prompts with a hash (#) so I can pull responses from all of the documents or just a tagged subset.
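
For what it’s worth, a rough sketch of that scraping step could be as simple as the following; the URLs are placeholders, and the output is plain .txt files you then upload into the workspace and tag by hand.

import re
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

urls = [
    "https://example.com/ampere-blog-post-1",   # placeholder, swap in real blog/white-paper links
    "https://example.com/ampere-whitepaper-1",  # placeholder
]

for url in urls:
    html = requests.get(url, timeout=30).text
    # Strip markup down to readable plain text.
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    fname = re.sub(r"\W+", "_", url.split("//", 1)[-1]) + ".txt"
    with open(fname, "w", encoding="utf-8") as f:
        f.write(text)
    print("saved", fname)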

This is basically why I’m looking into Ampere.
