Running LLM Models on a Local Machine: Ollama, LlamaIndex and LangChain
#ai #genai #llm #langchain #llamaindex #ollama #aimodels
Ollama is a free application for running generative AI Large Language Models locally. It's currently available for macOS and Linux, with Windows support in preview; on Windows you can also use it through Windows Subsystem for Linux (WSL) or inside Docker containers.
The Ollama application lets you pull the Large Language Models (LLMs) you want onto your machine for running and serving. You can interact with the models directly from a command-line interface (CLI) or access them via a simple REST API. It can leverage your GPU, delivering fast performance on machines such as a MacBook Pro or a PC with a capable GPU.
Some models, such as the larger Mistral variants, require substantial resources to run locally. Quantization plays a crucial role in compressing models and reducing their memory footprint.
Among the various quantization options, q4_0 is a commonly recommended starting point: it offers a good balance between memory savings and model quality.
Quantization keeps memory usage manageable while preserving most of the model's effectiveness, so understanding the available quantization levels can significantly improve your experience with Ollama models.
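As a rough sketch (assuming the Python requests library, an Ollama server on the default port 11434, and that a tag like mistral:7b-instruct-q4_0 exists in the library), pulling a specific quantized variant through the REST API can look like this:

# Pull a 4-bit quantized Mistral variant via the local Ollama REST API.
# The tag name is illustrative; check the Ollama library for the tags actually offered.
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "mistral:7b-instruct-q4_0"},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("status"))  # e.g. "pulling manifest", ..., "success"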
While exploring Ollama models, you may encounter tags beginning with "q" followed by a number (and sometimes a "K"), such as q4_0 or q5_K_M; these indicate the quantization level used for that variant.
Each model in the Ollama Library is accompanied by various tags that offer specific insights into its functionality. These tags are denoted by the text following the colon in the model's name. Here are some key points to understand about tags:
The primary tag is commonly known as the "latest" tag, although it may not always represent the most recent version of the model. Instead, it indicates the most popular variation.
When you don't specify a tag, Ollama automatically selects the model with the "latest" tag.
Within the latest tag, you can find details such as the model's size, the beginning of the sha256 digest, and the age of that particular model variation.
Next to each tag, the library page shows the command required to run that specific version.
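As a small sketch (again assuming the requests library and the default port), you can list the models and tags you have pulled locally through the /api/tags endpoint:

# List locally available models with their tags and sizes.
# Field names follow the documented /api/tags response and may vary between Ollama versions.
import requests

models = requests.get("http://localhost:11434/api/tags").json().get("models", [])
for m in models:
    print(m["name"], m.get("size"))  # e.g. "phi:latest" followed by its size in bytes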
Key features of Ollama:
Automatic Hardware Acceleration: Ollama automatically detects and utilizes the best available hardware resources on Windows systems, including NVIDIA GPUs or CPUs with modern instruction sets like AVX or AVX2. This feature optimizes performance, ensuring efficient execution of AI models without the need for manual configuration. It saves time and resources, making projects run swiftly.
No Need for Virtualization: Ollama eliminates the need for virtualization or complex environment setups typically required for running different models in AI development. Its seamless setup process allows developers to focus on their AI projects without worrying about setup intricacies. This simplicity lowers the entry barrier for individuals and organizations exploring AI technologies.
Access to the Full Ollama Model Library: The platform grants unrestricted access to a comprehensive library of AI models. Users can experiment with and deploy various models without the hassle of sourcing and configuring them independently. Whether the interest lies in text analysis, image processing, or any other AI-driven domain, Ollama's library meets diverse needs.
Always-On Ollama API: Ollama's always-on API seamlessly integrates with projects, running in the background and ready to connect to powerful AI capabilities without additional setup. This feature ensures that Ollama's AI resources are readily available, enhancing productivity and blending seamlessly into the development workflow.
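Since the API runs in the background, a quick reachability check is often the first step when wiring Ollama into a project. A minimal sketch, assuming the default endpoint http://localhost:11434 (the root path returns a short status string when the server is up):

# Check that the always-on Ollama API is reachable before sending real requests.
import requests

try:
    r = requests.get("http://localhost:11434/", timeout=2)
    print(r.status_code, r.text)  # expected: 200 and "Ollama is running"
except requests.ConnectionError:
    print("Ollama server is not running")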
CLI Commands:
ollama - show help and available commands
ollama run model - run a model interactively (pulling it first if needed)
ollama pull model - download a model from the library
ollama create - create a custom model from a Modelfile
ollama rm model - remove a local model
ollama cp model my-model - copy a model under a new name
ollama list - list locally available models
ollama serve - start the Ollama server
LlamaIndex - LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs).
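A minimal sketch of pointing LlamaIndex at a local Ollama model, assuming pip install llama-index llama-index-llms-ollama and that the mistral model has already been pulled with ollama pull mistral:

# Use a locally served Ollama model as the LLM inside LlamaIndex.
from llama_index.llms.ollama import Ollama

llm = Ollama(model="mistral", request_timeout=120.0)
response = llm.complete("Cite 20 famous people")
print(response.text)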
LangChain - LangChain is a library of abstractions for Python and JavaScript that captures the common steps and concepts needed to work with language models.
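The same local model can be called through LangChain. A minimal sketch, assuming pip install langchain langchain-community and a running Ollama server with mistral pulled (newer releases expose an equivalent OllamaLLM class in the langchain-ollama package):

# Call a local Ollama model through LangChain's community integration.
from langchain_community.llms import Ollama

llm = Ollama(model="mistral", temperature=0.75)
print(llm.invoke("Cite 20 famous people"))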
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama   # start the Ollama server in a container, persisting models in the "ollama" volume
docker start ollama   # start the existing container
docker stop ollama   # stop the container
docker exec -it ollama ollama   # show the Ollama CLI help inside the container
docker exec -it ollama ollama run model   # run a model interactively inside the container
Prompt: Cite 20 famous people
CURL:
-d '{"prompt": "Cite 20 famous people", "model": "phi", "options": {"temperature": 0.75, "num_ctx": 3900}, "stream": false}'