Running an LLM locally with Ollama + RAG
What is Ollama
Unlike hosted assistants such as GPT, Gemini, and DeepSeek, Ollama is a tool that lets us run LLMs as an assistant on our own machine. What most distinguishes Ollama is that the available models can be downloaded and run entirely on a local computer.
This is very useful if we want to learn how LLMs work, because we have full access to the model. Several types of models are available, including Embedding, Vision, Tools, and Thinking models, each with its own capabilities. You can check the Ollama website for more details.
What is RAG
RAG stands for Retrieval-Augmented Generation. As the name suggests, RAG lets us retrieve our own data and add it as context when prompting an LLM. With that extra context, the LLM can give us more specific answers.
To make it easier to understand: if you ask “Who is Linus Torvalds?”, the LLM can easily provide an answer because it already has plenty of data about Linus Torvalds. However, if you ask “Who was your principal when you were in school?”, the LLM will start to hallucinate because it has no data related to you.
With RAG, we can create our own dataset; all the data we have can be stored, and later the LLM will use that data as context when answering. Very powerful, isn’t it?
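To make the idea concrete, here is a minimal sketch of the retrieve-then-generate pattern in Python. The documents, the keyword-based retrieval, and the prompt template are all placeholders just to show how retrieved text ends up inside the prompt; later in this article we use real embeddings and a vector database instead.

```python
# Minimal sketch of the RAG pattern: retrieve relevant text, then
# prepend it to the question so the model answers from our own data.
# The retrieval step here is a naive placeholder, not a real embedding search.

def retrieve(question: str, documents: list[str]) -> list[str]:
    # Placeholder retrieval: keep documents that share a word with the question.
    words = set(question.lower().split())
    return [doc for doc in documents if words & set(doc.lower().split())]

def build_prompt(question: str, context_docs: list[str]) -> str:
    context = "\n".join(context_docs)
    return f"Use the following context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

documents = [
    "My principal in elementary school was Mrs. Sari.",
    "I graduated from high school in 2010.",
]
question = "Who was my principal in school?"
print(build_prompt(question, retrieve(question, documents)))
```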
In this article, I will explain the steps to get started with Ollama and RAG.
Setup Ollama
Install Ollama
Ollama provides a CLI whose commands can also be executed from a Python script. Download Ollama here.
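Because the CLI can be driven from Python, a small sketch like the one below can pull the two models used later in this article. This assumes the ollama binary is already on your PATH after installation.

```python
# Sketch: driving the Ollama CLI from a Python script
# (assumes the `ollama` binary is on PATH after installation).
import subprocess

# Pull the models used later in this article.
subprocess.run(["ollama", "pull", "mxbai-embed-large"], check=True)
subprocess.run(["ollama", "pull", "codellama"], check=True)
```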
Model for embedding
Embedding is the process of converting data such as text, code, or images into vector form. Data in vector format helps us perform queries and obtain more accurate results.
In this article we use the mxbai-embed-large model. It was trained with no overlap with the MTEB data, which indicates that it generalizes well across several domains, tasks, and text lengths.
For details, you can check it out here
mxbai-embed-large
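As a quick illustration, once the model is available you can request an embedding from Ollama's local REST API. This is only a sketch: it assumes Ollama is running on its default port 11434 and that the requests package is installed, and the sample sentence is arbitrary.

```python
# Sketch: generate an embedding with mxbai-embed-large through Ollama's
# local REST API (assumes Ollama is running on the default port 11434).
import requests

response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "mxbai-embed-large", "prompt": "Ollama lets us run LLMs locally."},
)
embedding = response.json()["embedding"]  # a list of floats, i.e. the vector
print(len(embedding), embedding[:5])
```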
Model for code
After we create embeddings with the mxbai-embed-large model, we need another model to answer queries. CodeLlama is a good choice because it is specifically designed for code.
Code Llama is a model for generating and discussing code, built on top of Llama 2. It’s designed to make workflows faster and more efficient for developers and make it easier for people to learn how to code. It can generate both code and natural language about code. Code Llama supports many of the most popular programming languages used today, including Python, C++, Java, PHP, Typescript (Javascript), C#, Bash and more.
For details, you can check it out here
codellama
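To see the model respond, here is a similarly minimal sketch that sends a prompt to codellama through the same local REST API; the prompt itself is just an example.

```python
# Sketch: ask codellama a question through Ollama's local REST API
# (assumes Ollama is running on the default port 11434).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "codellama",
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,  # return the whole answer in a single response
    },
)
print(response.json()["response"])
```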
Install dependencies
- Install onnxruntime through pip install onnxruntime.
- For MacOS users, a workaround is to first install the onnxruntime dependency for chromadb using:
conda install onnxruntime -c conda-forge
See this thread for additional help if needed.
- For Windows users, follow the guide here to install the Microsoft C++ Build Tools. Be sure to follow through to the last step to set the environment variable path.
- ChromaDB for the vector database:
pip install chromadb==0.5.0
- The LangChain library as the tooling:
pip install langchain-community==0.2.3
pip install langchain==0.2.2
After all the dependencies are installed, we can run the Python script to see how the magic works.
Run
python query_ollama.py
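For orientation, here is a rough outline of what a script like query_ollama.py could look like using the dependencies above. It is not the actual source code: the file loading, the vector store setup, and the prompt are my own assumptions, so treat it only as a sketch of the flow (embed the files from the data folder into ChromaDB, retrieve the most similar chunks, and let codellama answer with that context).

```python
# Rough outline of a RAG query script (not the actual query_ollama.py).
# Assumes Ollama is running locally with mxbai-embed-large and codellama pulled,
# and that the context files live in a local "data" folder as plain text.
from pathlib import Path

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# 1. Load raw text from the data folder.
texts = [p.read_text() for p in Path("data").glob("*.txt")]

# 2. Embed the texts with mxbai-embed-large and store the vectors in ChromaDB.
embeddings = OllamaEmbeddings(model="mxbai-embed-large")
store = Chroma.from_texts(texts, embedding=embeddings)

# 3. Retrieve the chunks most similar to the question.
question = "Who was my principal when I was in school?"
docs = store.similarity_search(question, k=3)
context = "\n".join(doc.page_content for doc in docs)

# 4. Ask codellama, passing the retrieved context in the prompt.
llm = Ollama(model="codellama")
answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
print(answer)
```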
To see the full source code and try it on your own, you can check it out here
That's it. Please make sure to put all the files related to your context into the data folder.
Happy Coding~~