
DIY LLM


Ever wondered how so many startups can have their own custom LLMs?

You’ve seen the headlines: GPT-this, DeepSeek-that, and endless announcements from small AI startups claiming they’ve built their own “custom model”. These models are advertised to give you financial advice or historical sports facts. There might even be services that advertise the ability to “chat to your own data!”, generate custom study notes, or assist specifically with your research.

But here’s the thing: training and serving these models costs millions and takes months. So how are tiny startups pulling this off?

The short answer is that, often, they don’t - they’re using stock ChatGPT, Claude, Gemini or some other established large model in the background. The key difference is an additional system that allows for Retrieval-Augmented Generation, or RAG for short.


What is a RAG?

A RAG is a large language model whose Generated response is Augmented by relevant information that is Retrieved from a custom database (hence the name Retrieval-Augmented Generation).

Say, for example, you ask an LLM what the weather will be like today - something an LLM wouldn’t know. What a RAG can do is inject the relevant information into the input for the LLM, which the LLM can then use to answer the question. It does this by searching a database based on your chat input and simply adding whatever it finds as additional text for the LLM to refer to. In this case, it might find today’s weather report on the internet, add that text to the bottom of your chat input, and (hopefully) help the LLM generate an accurate response.
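
If you’re curious what that looks like under the hood, here’s a rough Python sketch of the retrieve-augment-generate loop. The `retrieve_weather_report` and `call_llm` functions are stand-ins made up for illustration - a real service would swap in an actual search source and an actual model API.

```python
# A minimal sketch of the RAG injection step (illustrative only).
# retrieve_weather_report() and call_llm() are hypothetical placeholders
# standing in for a real retrieval source and a real LLM API.

def retrieve_weather_report(user_question: str) -> str:
    """Hypothetical: look up a relevant document (e.g. today's weather report)."""
    return "Weather report: 24 degrees, light rain expected this afternoon."

def call_llm(prompt: str) -> str:
    """Hypothetical: send the prompt to whichever LLM the service uses."""
    return "It looks like light rain this afternoon - take an umbrella!"

def rag_answer(user_question: str) -> str:
    # 1. Retrieve: find relevant text based on the user's input
    context = retrieve_weather_report(user_question)

    # 2. Augment: append the retrieved text to the user's question
    prompt = (
        f"{user_question}\n\n"
        f"Use the following information if it is relevant:\n{context}"
    )

    # 3. Generate: the LLM answers with the extra context in view
    return call_llm(prompt)

print(rag_answer("What will the weather be like today?"))
```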

Diagram of a RAG. Sourced from [snorkel.ai](https://snorkel.ai/blog/which-is-better-retrieval-augmentation-rag-or-fine-tuning-both/)

Generally, when small startups have their own RAG setup, they will put a lot of effort into curating a focused and well-designed text database for a retrieval function to search through - maybe it’s a comprehensive sports almanac, or maybe it’s a live-updating stock market RSS feed. Google, Microsoft and OpenAI, on the other hand, will connect their large language models to the entire internet through a RAG!
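
As a toy illustration of what a curated database plus a retrieval function might look like, here’s a sketch that ranks a handful of made-up “almanac” entries by simple word overlap with the question. Real systems typically use embeddings or a proper search index, but the idea is the same: find the most relevant text to inject.

```python
# Toy retrieval over a small curated "sports almanac" (entries are made up).
# Real deployments usually use embeddings or a search index rather than
# raw keyword overlap.

ALMANAC = [
    "The 1998 grand final was decided by a single point in extra time.",
    "The club moved to its current stadium in 2004.",
    "The all-time leading scorer retired at the end of the 2019 season.",
]

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

print(retrieve("what happened in the 1998 grand final?", ALMANAC, top_k=1))
```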


Why use a RAG?

RAGs were initially intended to “plug holes” in an LLM’s base knowledge, particularly given the knowledge cutoff of its training data [link to time capsule]. This lets an LLM answer questions like “who is the current premier of Queensland?” or “what’s the latest interest rate?” more accurately, given that the answer will likely change as time moves on.

But RAGs can do more than just fill knowledge gaps. They can also specialise or guide the response of an LLM.

If you’re building a chatbot for a specific audience and want it to respond with a certain tone, you could attach a style guide or service handbook to a RAG. The model then draws from that guidance to match your intended voice.
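
One simple way to do that, sketched below, is to always include the style guide text alongside whatever else gets retrieved, so every prompt the model sees carries the same guidance. The style guide wording and helper names here are made up purely for illustration.

```python
# Sketch: always inject a style guide alongside retrieved documents.
# STYLE_GUIDE and the example snippets are invented for this illustration.

STYLE_GUIDE = (
    "Tone: friendly and plain-spoken. Keep answers under three sentences. "
    "Offer to connect the customer to a human for anything billing-related."
)

def build_prompt(user_question: str, retrieved_docs: list[str]) -> str:
    """Combine the style guide, retrieved documents and the user's question."""
    docs = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        f"Style guide:\n{STYLE_GUIDE}\n\n"
        f"Relevant documents:\n{docs}\n\n"
        f"Customer question: {user_question}"
    )

print(build_prompt(
    "How do I reset my password?",
    ["Passwords can be reset from the account settings page."],
))
```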


Limitations

There are limitations to this approach, however. For one, the number of documents you can use with a RAG is inherently limited by the length of the context window, and the suitability/power of the search engine (think Bing vs Google) affects the quality of injected documents.
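
One practical consequence of the context window limit is that retrieved documents usually have to be cut off once a rough length budget is reached. The sketch below uses a crude word count as a stand-in for real token counting, purely to illustrate the idea.

```python
# Sketch: keep adding retrieved documents until a rough length budget is hit.
# Real systems count tokens with the model's own tokenizer; word count is a
# crude stand-in used here only for illustration.

def fit_to_budget(documents: list[str], max_words: int = 100) -> list[str]:
    """Keep adding documents (most relevant first) until the budget is used up."""
    selected, used = [], 0
    for doc in documents:
        words = len(doc.split())
        if used + words > max_words:
            break  # no room left in the context window budget
        selected.append(doc)
        used += words
    return selected

# e.g. fit_to_budget(search_results, max_words=500) before building the prompt
```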

Less obviously, and perhaps unintuitively, the way a RAG is built means the model can never glean “summary” information about the entire dataset - how many entries it contains, what topic areas it covers, et cetera. This is because the LLM never sees the whole dataset at once - only the pieces of information handed to it at the moment it is responding to the question.

Also consider how suited the LLM is to interpreting the injected texts - imagine handing a year 10 high school maths student a postgraduate textbook on advanced linear algebra! There is only so much an LLM (or a poor high school student) will genuinely understand - if it isn’t fine-tuned to be a specialist in that area, it might struggle to make sense of the provided data!

Below is a simple RAG widget which takes the first 5 search results from Google and injects them into each user input. The non-RAG LLM response is on the left, while the RAG response is on the right. Have a play and note the differences in the responses - where the RAG seems to be helping, or, more interestingly, where it seems to be hindering!

Note: the raw RAG input is purposefully exposed to you in this widget for transparency - in real deployments the model still receives this kind of augmented input, you just never get to see it!


Reflections