In this post, we will take a look at Microsoft Foundry Local. Frontier LLMs such as GPT-5, Claude Sonnet, Gemini, and Grok are all hosted in the cloud. They require massive compute resources, and when you access them you pay for the compute you use.
But there are smaller, open-source LLMs that can run on a desktop or even a laptop computer. These include Llama from Meta, Phi from Microsoft, and Qwen from Alibaba. And Hugging Face hosts a huge catalog of special-purpose models that can be run locally.
That’s just one piece of the puzzle. You also need a host running on your computer to serve those models. Popular examples include Ollama and LM Studio. Microsoft recently announced Foundry Local to let you run LLMs on your local computer. This has a number of advantages, including:
- Prototyping and development
- Control over data storage
- On device AI for edge and IoT solutions
- Low latency
- Reduced cloud costs
It gets better. Microsoft Foundry Local is free, runs cross-platform on Windows and macOS, and doesn’t require an Azure subscription. It also has no strong opinions about the development environment or tooling you are using. It works best with the AI Toolkit extension for Visual Studio Code, but you can integrate it with your preferred workflow.
Note that Microsoft Foundry Local is (as of this writing) in public preview. Some features may change or be removed at GA.
Installation
The recommended way to install Microsoft Foundry Local is at the command line. On Windows use WinGet, which is included with Windows:
winget install Microsoft.FoundryLocal
On macOS you’ll first need to install the Homebrew package manager from brew.sh. Then run these two commands:
brew tap microsoft/foundrylocal
brew install foundrylocal
You can also find binary installers on the GitHub releases page.
Now at the command line run the foundry command to see the usage information along with a list of commands it supports. If you got this far, you’re ready to host an LLM!
Working With Models
Next we need to download and run a model. After all, that’s the whole reason we are here.
With Foundry Local installed, return to the command line. Run this command to list the models that are “pre-optimized” to run with Foundry Local.
foundry model list
The output may look overwhelming, but that’s only because each model is listed once per optimization. There are really (as of this writing) only about two dozen of these “pre-optimized” models.
For example, take Qwen 2.5 with 0.5 billion parameters:
Alias         Device  Task         File Size  License     Model ID
qwen2.5-0.5b  GPU     chat         0.52 GB    MIT         qwen2.5-0.5b-instruct-trtrtx-gpu:2
              GPU     chat, tools  0.52 GB    apache-2.0  qwen2.5-0.5b-instruct-cuda-gpu:4
              GPU     chat, tools  0.68 GB    apache-2.0  qwen2.5-0.5b-instruct-generic-gpu:4
              CPU     chat, tools  0.80 GB    apache-2.0  qwen2.5-0.5b-instruct-generic-cpu:4
There are four different optimizations of this model, each with a different ID. The Device column shows the type of hardware: three of the optimizations target a GPU and one targets the CPU. This is an exciting feature of Foundry Local: models can be optimized for computers without a GPU. I’ve run it on as little as a 4-core Intel i5 with 16 GB of RAM. While it’s noticeably slower than a GPU, it’s usable, especially if you are prototyping.
The “generic-gpu” optimization is what macOS often uses. The “cuda-gpu” optimization is best for hardware that supports CUDA. And the “trtrtx” optimization works well with the TensorRT platform on NVIDIA RTX GPUs. Also, notice that all of the optimizations support chat completions and three of them support tool calling as well. Foundry Local also supports text-to-speech with some models.
There are also optimizations for an NPU, or “neural processing unit”: an ASIC (Application-Specific Integrated Circuit) designed to accelerate the kinds of numerical operations, such as matrix math, that AI applications rely on.
Let’s download a model. This is the only time a network connection is required to use Microsoft Foundry Local. If you run this command
foundry model download qwen2.5-0.5b
Foundry Local will download the optimization it thinks is best for your hardware. You can also explicitly download a specific optimization, for example:
foundry model download qwen2.5-0.5b-instruct-cuda-gpu:4
Now you can run the model:
foundry model run qwen2.5-0.5b-instruct-cuda-gpu:4
This makes the model available via the CLI and SDK, and starts a REST API that can be accessed from any programming language that supports HTTP requests and JSON. If the model had not already been downloaded, the run command will download it automatically.
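Because the REST API follows the OpenAI chat-completions format, you can call it from any language with an HTTP client. Here is a minimal Python sketch; the port shown (5273) is an assumption, since the service picks its own, so check the endpoint that foundry service status reports on your machine.

```python
# A minimal sketch of calling the Foundry Local REST API with plain HTTP.
# Assumption: the service is listening on port 5273 -- check
# `foundry service status` for the actual endpoint on your machine.
import requests

url = "http://localhost:5273/v1/chat/completions"

payload = {
    # The ID of the model you started with `foundry model run`
    "model": "qwen2.5-0.5b-instruct-cuda-gpu:4",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```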
Back in the terminal, after the model loads the CLI starts interactive mode and displays a prompt where you can chat with the model.
🕙 Loading model…
🟢 Model qwen2.5-0.5b-instruct-trtrtx-gpu:2 loaded successfully
Interactive Chat. Enter /? or /help for help.
Press Ctrl+C to cancel generation. Type /exit to leave the chat.
Interactive mode, please enter your prompt
>
We can ask it the “hello world” of LLMs, “What is the capital of France?”, and it will respond with something similar to:
🧠Thinking…
🤖 The capital of France is Paris.
At this point you can continue to chat with the model. In the CLI use the /new command to start a new chat. And to exit interactive mode use the /exit or /bye commands.
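You don’t have to drive everything from the CLI, either. There is also a Python SDK, the foundry-local-sdk package on PyPI, that can start the service, fetch a model, and hand you an OpenAI-compatible endpoint. The sketch below follows the public preview documentation, so treat the details as subject to change:

```python
# A sketch using the Foundry Local Python SDK together with the OpenAI client.
# Install with: pip install foundry-local-sdk openai
# Based on the public preview docs; the SDK surface may change at GA.
import openai
from foundry_local import FoundryLocalManager

alias = "qwen2.5-0.5b"

# The manager starts the Foundry Local service if needed, then downloads
# and loads the optimization of the model best suited to your hardware.
manager = FoundryLocalManager(alias)

# The service exposes an OpenAI-compatible API, so the standard client works.
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI API, existing OpenAI-based tooling can point at Foundry Local with little more than a base URL change.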
Managing Models
To list the models and optimizations you have downloaded, use the cache command.
foundry cache list
This will display the alias and ID for each model.
You can also use the cache command to remove a model:
foundry cache remove qwen2.5-0.5b-instruct-cuda-gpu:4
Summary
Microsoft Foundry Local gives you the ability to run LLMs on your local computer, freeing you from being tethered to the cloud. You can develop and prototype AI applications locally before scaling them in the cloud. You’ll save on cloud compute costs and don’t have to worry about network latency, since Foundry Local runs as fast as your hardware allows. It’s also optimized for different hardware, even CPUs! Microsoft provides about two dozen “pre-optimized” models that you can use right away. The CLI lets you download and run models, chat with them in interactive mode, and manage the models you have downloaded. Microsoft Foundry Local is in public preview. You can keep up to date with all of the updates at foundrylocal.ai.
