Full Transcript
00:00
Hello everybody, this is Jim from Augusto Digital. Today I'm going to talk about running LLMs locally and self-hosting your AI. To do this, I'm going to walk through a demo of a tool called Ollama. It's one of a few different tools that let you run LLMs locally
00:17
on your own hardware, and I'll be demoing this on a Mac. I'll showcase that, and I'll definitely show some different models. We'll talk through some of the benefits, but overall I want people to be able to see what this looks like
00:30
and have a vision, whether for your development team, your organization, or as a business owner, of what it looks like to run things locally. Before I start, let's quickly touch on why people run LLMs locally. One of the reasons we see as a company quite often
00:51
is privacy. Many businesses want to keep their data in-house. They don't want to share their queries or their data out in the cloud; they want to keep data away from third parties. I'll demonstrate how that works locally and show you that.
01:06
Second is control. They want control of their infrastructure, whether that's building out equipment with two GPUs, ten GPUs, or spreading it across multiple machines, in addition to multiple models. Third is speed. With the cloud you have the latency of going out to the cloud versus
01:26
staying in your own environment. That doesn't matter much for things that run in the background, but when you work with real, live data, that low latency is really nice to have. Fourth is cost: you can save money running locally, though you can also spend a lot of money on hardware.
01:40
There's definitely a balance to be struck, but for some companies it's a real cost savings. From a development perspective, it's really nice to be able to limit the number of AI tools you need access to, cut down on the tokens you're paying for, and take some of that work internally.
01:55
Things like embeddings and running queries for ETL jobs are pretty nice to have running on your own GPUs in the background. Fifth is flexibility: when you run your own local LLMs, you can choose how you augment them, whether that's customizing a model,
02:11
bringing a model down from a third party, or building in your own data. I won't do that in this video; there will be a part two on how you can augment your local LLMs with your data and keep everything secure and local.
02:24
And then there's offline use: the idea that you can still use your LLM when you're not connected to the internet, and your data never touches the internet in any way. And lastly, what I like to share with a lot of people is the ability to
02:37
build local prototypes fast and inexpensively; your developers can use this locally, and so can you. So let me jump into the demo. I'm on a MacBook here, just a MacBook Pro, and I want everyone to see it really quick.
02:56
There's just a single GPU in it; it looks like it's got 16 cores, and it's an M1 Pro. It's not a very new machine. I think we've had this for a while; it just happens to be the one I have access to here. And during this video, you're going to see something down here:
03:10
I have a window that shows the GPU usage, and you may occasionally see me switch over here, where I also have GPU usage and memory usage showing. I want to do this because I want people to see that different models have different sizes and different use cases, what that means, and
03:28
what benefits there are to running it locally. So I am running Ollama, which I talked about a minute ago; you can see it at the top. Ollama is very powerful. Actually, let me switch back. Ollama is available for macOS, Windows, and Linux. A lot of people run it in AWS on a Linux
03:44
machine. You can run it locally, and you can run it in Docker; there are many options. Docker can be challenging with Ollama because you typically don't have much access to your GPU, so running directly on the hardware is very common.
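For anyone curious what the Docker route looks like, a minimal sketch based on the published ollama/ollama image would be something like the line below, assuming a Linux host with the NVIDIA container toolkit installed (drop the --gpus flag and it falls back to CPU, which is exactly the access problem just mentioned):

  docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama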
03:58
Another thing to note is that there are quite a few models you can see and download straight from Ollama. OpenAI has its open-source GPT (gpt-oss), there's DeepSeek, and there are the Qwen models. I'll demo these as I go, and we'll talk about the size of each model and what the number of parameters means. So let me go on,
04:13
and I guess lastly, let me also talk about getting other models. Hugging Face is a very common place to find models. You can search for models tuned for a specific sector or industry; you name it, there are probably a few out there.
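As a rough sketch of how you'd grab one of these from the command line, pulling a model from the Ollama library, or running a GGUF model hosted on Hugging Face, looks something like the following; the Hugging Face path here is just a placeholder, not a real repo:

  ollama pull qwen2.5
  ollama run hf.co/<username>/<gguf-model>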
04:27
The nice part about Ollama is you can find a few of these that fit and run them locally as well. So, back to my demo here. I've shown you that I've got a MacBook M1 Pro. I'm only using it because it's really simple for me to demo with;
04:40
we have other, more powerful machines. You can definitely set up a Windows machine or a Linux machine with multiple GPUs to give you the horsepower that might be necessary. Before we go any further, let me quickly showcase what I'm running here.
04:58
Where is it? Let's do the version. I am running Ollama 0.11.0. Then let's take a minute to look at the list of models I have on my machine. Like I showed you on the website, I have the open-source GPT model, gpt-oss. It is a 13-gig model.
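For reference, the two commands I'm running here are roughly:

  ollama --version
  ollama list

The list output shows each model's name, ID, size on disk, and when it was last modified.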
05:16
It's a fairly large model, considering my GPU probably only has about 16 gigs of memory available to it, so it will take up a lot, if not all, of the memory I have on my GPU. What makes it large is that it's a 20-billion-parameter model, and there are different sizes.
05:34
If you have a larger GPU, you can run a larger model, meaning more parameters, which overall gives it better understanding. Bigger models have more general knowledge and, in theory, hallucinate less; but it takes the hardware to be able to run these larger models.
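As rough back-of-the-envelope math, and assuming the weights are quantized to around 4 to 5 bits per parameter, 20 billion parameters at roughly half a byte each works out to somewhere in the 10-13 gig range, which lines up with the 13-gig download size mentioned above.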
05:50
In addition, I have some smaller models. You see this one is only 274 MB; we use it to embed text for a RAG, which is very common. It's really good at taking text and putting it into a RAG. So different models have different uses.
06:07
You can see I have DeepSeek here, which has 1.5 billion parameters, way smaller than the 20 billion. Then I also have Qwen 2.5, and lastly I have this one called TinyLlama. I just want to demonstrate the different sizes here: you can see Qwen is at 4.7 gigs.
06:25
But again, compare that to GPT at 13, and then lastly this TinyLlama down here at 600-some-odd megs. So let's start by showing what it looks like. Again, I'm in a command line; I will move to a user interface in a few minutes, but I want people to see the
06:41
way that it runs and have some data behind it. So what I'm going to do here is run Ollama using TinyLlama, the smallest model I have, in verbose mode. I'm doing this so that it's loaded into memory, on my GPU, and then I'm going to ask it a question.
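Roughly, the command behind this step is:

  ollama run tinyllama --verbose

The --verbose flag is what prints the timing and tokens-per-second stats you'll see after each answer.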
06:55
I'm going to ask the same question of each one of our models: why is the sky blue? You can see this model is small, so there wasn't much processing needed, and it came back with a quick four-point answer. It says there are several factors in why the sky appears blue,
07:13
especially during sunset, and lists some of them, like color difference. It's a pretty succinct, quick, high-level answer, not much detail at all. What's important here is a couple of things that happened. The duration of the call itself, the total duration, was about 4.7 seconds.
07:28
And it shows that my GPU can run this at about 134 tokens per second; that token rate is the amount of work it can do inside a query. To expand on that, a token is really a chunk of text that the model processes. It's not always a whole word.
07:46
For example, "running locally" might be broken into "running" and then "locally." Tokens are the pieces the model reads and generates, so tokens per second equates to speed, and the speed of this model is pretty quick. Let's move to a larger
08:02
model. I've got DeepSeek, which has 1.5 billion parameters. Let's load that into memory and ask the same question: why is the sky blue? When I ran the first question, you saw my GPU spike; you're going to see it spike again. So down here is the GPU graph showing
08:25
the load, and it's also showing memory. Now, one thing I do want to show, and I think this is important: I'm going to run the command ollama ps. I have two models in memory: DeepSeek, which is two gigs, and TinyLlama, which filled up 1.4 gigs.
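For reference, the command is just:

  ollama ps

It lists each loaded model along with roughly how much memory it's using, whether it's running on the CPU or GPU, and how long until Ollama unloads it.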
08:43
This is filling up my GPU's memory. Now let's go back. You can see DeepSeek did a couple of different things: it did some more reasoning inside its answer. The blue color of the sky, it says,
08:57
is the result of a combination of factors. There's a lot more information inside its answer; it gave some key points and broke things down for me. Now, you'll notice that this doesn't run as fast. It runs at about 77 tokens per second, and the query took about seven seconds
09:13
against the larger model. Let's move up to Qwen 2.5 with the same question. Again, I'm loading this into memory; it might take a little longer since it's a larger model. Why is the sky blue? And while this is running, I want to go back and check which models are in
09:29
memory. Now I have three models. You can see here I've got Qwen, which is 5.6 gigs, plus the two-gig one, and then another, so I'm at roughly nine-plus gigs in memory. You can see the GPU here; as it ran, this question took a little longer. So Qwen's answer is
09:47
similar: why is the sky blue? It appears blue due to a phenomenon called Rayleigh scattering. It went into the reasoning of why it's blue and the process behind it, and it tells more of a story than our earlier answers, where the second one talked about
10:01
the differences and the first one gave just a quick summary. Now I'm gonna load up the gpt-oss model, the 20-billion-parameter one. Before I do that, let's take a look at what's running. This one has about a minute left. As soon as I run this one,
10:16
it's going to kick the other models out, because this is a larger model. We can actually watch this happen. As I did that, it started loading in the GPT model, and that's 13 gigs. Really, this shows that certain models, and you might have multiple models running in your local environment,
10:36
need more GPU. This is where you build out larger infrastructure with more powerful machines and more memory in the GPU space. One other thing to note: because my machine is a little low-powered, this model is using 81% of the GPU
10:51
and it's still using some of the CPU, so there will be a performance hit when I go to use it. If we quickly look back, TinyLlama took about 4.5 seconds at a speed of about 134 tokens per second. DeepSeek was about 77 tokens per second
11:09
with about a seven-second result, and Qwen was 26 tokens per second. Again, we're going down with each one, and each took a little longer to run. Now I've loaded this one in; let's go ahead and ask the question, why is the sky blue? I might need to step away as this runs, and we'll talk about the
11:25
duration. Why is the sky blue? While we're doing that, we'll go look at what's running, and again, only one model is in there. Effectively, my machine has slowed down dramatically; I'm now using almost all my RAM.
11:41
The GPU hasn't quite kicked in yet; there's a balance as it does the lookup and the query. In fact, you can see right here it's running. I'll come back when it starts to build its answer. Alright, you can see here that GPT, very similar to when you're using something like ChatGPT,
11:57
is actually doing some thinking before it gives me an answer. This model is built to do the processing, the thinking, in the background before it arrives at a full answer. It's still running, and you can see down here that the GPU is really getting
12:12
taxed. Part of the reason it's not fully maxed out is that the CPU still has to pitch in as it runs, because it's a larger model. Now, the reason I did this in the command-line window is so you can see the performance, take a look at why the different models are
12:29
different, and it helps me showcase the timing. While this is running in the background, I'm gonna go ahead and move to the GUI, the user interface of Ollama. So you can see up here, I'm gonna click on Open Ollama, and it's gonna bring up a window. Now again,
12:44
my machine is definitely running slow at the moment. Very similar to what you would see in ChatGPT, it's got a window where you can see my history. You can see I've asked "why is the sky blue" a few times, and it lets you start a new chat, in addition to asking it
12:58
a question here and seeing it in a threaded view, like a lot of the chat systems you've used, whether it's Copilot, Anthropic's Claude, or ChatGPT. It's a very similar experience against your own local LLM. I do want to point out that here you can see it shows you a few models that are available online.
13:16
So for example, I have the gpt-oss 20-billion-parameter model installed. I do not have this Gemma 3, or this other one. If I wanted those, I could click on one and start to ask it a question, like who is the president.
13:36
Before it runs, it's actually going to obtain the model; you can see it started to download it. Now I'm gonna cancel this one because I don't need to pull that model down right now, but it's a really easy way in the interface to go ahead and get a
13:50
model. You can do that in the command line as well, but right here it makes it really easy. Earlier we saw that I was using TinyLlama; that's in here because I've already downloaded it, and I could ask it the same question I did. We'll do that here in a minute. Oh, it looks like the GPT model, the larger one, is done running.
14:06
a larger version is done running. Alright, that GPT version has finished. Now let’s take a couple of things here to look at. One. Let’s look at this answer. this answer did some thinking. It came back with what happens to the sunlight based on the sky is
14:21
blue, and why we see blue. It gave me what changes, a quick demo to try at home, some fun facts, and a bottom line. You can see here that this took four minutes and 17 seconds to run; it did some thinking and some deeper reasoning.
14:37
While it was running, you can see my GPU ran for that whole duration, making the machine quite slow. Again, this is a desktop-level machine I'm running on. Typically, an organization would want to build out a dedicated GPU-based machine
14:53
to do this with larger models. You can see the token count here and, lastly, the tokens per second; again, tokens per second is that speed. However, I did get a much more thorough answer. If we go back, the Qwen one was roughly 26 tokens per second.
15:10
Again, that's performance and speed. DeepSeek was around 77, and TinyLlama was about 134. That's a good gauge to use as you build out more powerful machines. And if I go back to the user interface, I can ask the exact same question: why is the sky blue?
15:30
Before I do that, let's also go see which models are actually loaded and running, so I'll do an ollama ps. You can see that GPT is still loaded; it will unload itself in about three minutes. But what I'm gonna do is go ahead and run TinyLlama.
15:44
When I do that, notice that the TinyLlama answer comes back really quick, again because of its speed. But if I look, it unloaded the one above; it unloaded the 13 gigs to make room for that model. Ollama is controlling what's in there, balancing requests
16:00
and loading things into memory. Let's switch the model here again and go to Qwen, which I know I have, and ask why the sky is blue. It should be loading that model, and now I've got two models in memory. It's taking a minute to load, but then I should get a similar
16:18
or the same answer as I got in the command line. Then I'm gonna do one more. We're gonna do the same thing I did before by loading up all the different models. I'll do DeepSeek; I should have that one loaded. And then: why is the sky blue?
16:34
So I've done everything in the command line as well as in the user interface. You can see that DeepSeek is also doing that thinking and logic, the same thing the GPT version did. What's nice here is that DeepSeek is a little smaller of a
16:48
model, so it runs a little faster, and it gives a pretty darn good answer for being that small. And you can see all three are in memory, because of the size of my GPU; the limitation is the size of the models versus the size of your GPU. You can stack them up.
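As a side note, and as a sketch rather than a recipe: my understanding is that how long Ollama keeps a model resident before unloading it is governed by a keep-alive setting, which you can adjust with the OLLAMA_KEEP_ALIVE environment variable if you start the server yourself, or with a keep_alive field on an API request. For example, something like:

  OLLAMA_KEEP_ALIVE=30m ollama serve

would ask the server to hold models in memory for about 30 minutes instead of the default few minutes.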
17:04
The last thing I want to showcase is how you would use this on a local network. If this were a server running in the back room or in your data center, your team could set it up to be accessible by other team members. You can do that in a multitude of ways, with different tools;
17:17
we'll definitely dig into that in our next video. But really quick, I'll show the power of Ollama. Ollama lets you expose it to the local network, both the way we've been doing it here, through a direct connection, and through a web API. So I'm just going to use a command called curl,
17:35
which calls the machine by its IP address and the API endpoint. I'll be using TinyLlama, and I'm going to ask it the same question we've been asking: why is the sky blue? So what it's doing here is making a call for me over a web request,
17:50
in this case to an API. It tells me I'm using TinyLlama, and here's the response I got back; the duration was about 1.9 seconds. We can do the same thing again, but change the model we want to run against.
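The call I'm describing looks roughly like this. The IP address below is just a placeholder for whichever machine Ollama is running on; 11434 is the default port, and swapping the model field is how you target a different model, as I do next:

  # placeholder host; 11434 is Ollama's default API port
  curl http://192.168.1.50:11434/api/generate -d '{
    "model": "tinyllama",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'

With stream set to false, the response comes back as a single JSON object that includes the generated text and timing details.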
18:06
Again, since this is already in memory, I think I can do this quickly: the same thing with Qwen, running the same question. Notice that I'm using Qwen 2.5, and it should run against it and bring me back some data.
18:24
And it did. That one took about seven seconds and gave a longer answer. So the power here is that it's not just you running it on a local desktop; it's running on your local network. We see a lot of people asking how they could do this locally, and this video is just to show
18:38
a small sampling of what can be done. We'd love to talk with you about the options, or maybe about building out a strong and powerful AI system locally. There are definitely engines other than Ollama, and you can do this in the cloud as well. In our next video,
18:52
we're going to tie this local Ollama LLM together with a few other tools and workflows to make a very powerful local LLM setup. Got questions? Reach out. Thank you.