In this post, we will take a look at Microsoft Foundry Local. Frontier LLMs such as GPT-5, Claude Sonnet, Gemini, and Grok are all hosted in the cloud. They require massive compute resources, and when you access them you pay for the compute you use.
But there are smaller, open-source LLMs that can run on a desktop or even a laptop computer. These include Llama from Meta, Phi from Microsoft, and Qwen from Alibaba. And Hugging Face hosts a huge catalog of special-purpose models that can be run locally.
That’s just one piece of the puzzle. You also need a host running on your computer to serve those models. Popular examples include Ollama and LM Studio. Microsoft recently announced Foundry Local to let you run LLMs on your local computer. This has a number of advantages, including:
- Prototyping and development
- Control over data storage
- On device AI for edge and IoT solutions
- Low latency
- Reduced cloud costs
It gets better. Microsoft Foundry Local is free, runs cross-platform on Windows and macOS, and doesn’t require an Azure subscription. It also has no strong opinions about the development environment or tooling you are using. It works best with the AI Toolkit extension for Visual Studio Code, but you can integrate it with your preferred workflow.
Note that Microsoft Foundry Local is (as of this writing) in public preview. Some features may change or be removed at GA.
Installation
The recommended way to install Microsoft Foundry Local is at the command line. On Windows use WinGet, which is included with Windows:
winget install Microsoft.FoundryLocal
On macOS you’ll first need to install the Homebrew package manager from brew.sh. Then run these two commands:
brew tap microsoft/foundrylocal
brew install foundrylocal
You can also find binary installers on the GitHub releases page.
Now at the command line run the foundry command to see the usage information along with a list of commands it supports. If you got this far, you’re ready to host an LLM!
Working With Models
Next we need to download and run a model. After all, that’s the whole reason we are here.
With Foundry Local installed, return to the command line. Run this command to list the models that are “pre-optimized” to run with Foundry Local.
foundry model list
The output may look overwhelming, but that’s only because each model is listed once per optimization. There are really (as of this writing) only about two dozen of these “pre-optimized” models.
For example, take Qwen 2.5 with 0.5 billion parameters:
Alias         Device  Task         File Size  License     Model ID
qwen2.5-0.5b  GPU     chat         0.52 GB    MIT         qwen2.5-0.5b-instruct-trtrtx-gpu:2
              GPU     chat, tools  0.52 GB    apache-2.0  qwen2.5-0.5b-instruct-cuda-gpu:4
              GPU     chat, tools  0.68 GB    apache-2.0  qwen2.5-0.5b-instruct-generic-gpu:4
              CPU     chat, tools  0.80 GB    apache-2.0  qwen2.5-0.5b-instruct-generic-cpu:4
There are four different optimizations of this model, each with a different ID. The Device column shows the type of hardware: three of the optimizations target a GPU and one targets the CPU. This is an exciting feature of Foundry Local: models can be optimized for computers without a GPU. I’ve run it on as little as a 4-core Intel i5 with 16 GB of RAM. While it’s noticeably slower than a GPU, it’s usable, especially if you are prototyping.
The “generic-gpu” optimization is what macOS often uses. The “cuda-gpu” optimization is best for hardware that supports CUDA. And the “trtrtx” optimization works well with the TensorRT platform on NVIDIA RTX GPUs. Also, notice that all of the optimizations support chat completions and three of them support tool calling as well. Foundry Local also supports text-to-speech with some models.
There are also optimizations for an NPU, or “neural processing unit”: an ASIC (Application-Specific Integrated Circuit) designed to accelerate the kinds of numerical operations, such as matrix math, that AI applications rely on.
Let’s download a model. This is the only time a network connection is required to use Microsoft Foundry Local. If you run this command
foundry model download qwen2.5-0.5b
Foundry Local will download the optimization it thinks is best for your hardware. You can also explicitly download a specific optimization, for example:
foundry model download qwen2.5-0.5b-instruct-cuda-gpu:4
Now you can run the model:
foundry model run qwen2.5-0.5b-instruct-cuda-gpu:4
This makes the model available via the CLI and SDK, and starts a REST API that can be accessed from any programming language that supports HTTP requests and JSON. If the model had not already been downloaded, the run command will download it automatically.
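Because the REST API follows the OpenAI chat-completions format, you can call it from any language with an HTTP client. Here is a minimal Python sketch; the port shown (5273) is an assumption, since the service picks its own, so check the endpoint that foundry service status reports on your machine.

```python
# A minimal sketch of calling the Foundry Local REST API with plain HTTP.
# Assumption: the service is listening on port 5273 -- check
# `foundry service status` for the actual endpoint on your machine.
import requests

url = "http://localhost:5273/v1/chat/completions"

payload = {
    # The ID of the model you started with `foundry model run`
    "model": "qwen2.5-0.5b-instruct-cuda-gpu:4",
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
}

resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```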
Back in the terminal, after the model loads the CLI starts interactive mode and displays a prompt where you can chat with the model.
🕙 Loading model…
🟢 Model qwen2.5-0.5b-instruct-trtrtx-gpu:2 loaded successfully
Interactive Chat. Enter /? or /help for help.
Press Ctrl+C to cancel generation. Type /exit to leave the chat.
Interactive mode, please enter your prompt
>
We can ask it the “hello world” of LLMs, “What is the capital of France?”, and it will respond with something similar to:
🧠Thinking…
🤖 The capital of France is Paris.
At this point you can continue to chat with the model. In the CLI use the /new command to start a new chat. And to exit interactive mode use the /exit or /bye commands.
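You don’t have to drive everything from the CLI, either. There is also a Python SDK, the foundry-local-sdk package on PyPI, that can start the service, fetch a model, and hand you an OpenAI-compatible endpoint. The sketch below follows the public preview documentation, so treat the details as subject to change:

```python
# A sketch using the Foundry Local Python SDK together with the OpenAI client.
# Install with: pip install foundry-local-sdk openai
# Based on the public preview docs; the SDK surface may change at GA.
import openai
from foundry_local import FoundryLocalManager

alias = "qwen2.5-0.5b"

# The manager starts the Foundry Local service if needed, then downloads
# and loads the optimization of the model best suited to your hardware.
manager = FoundryLocalManager(alias)

# The service exposes an OpenAI-compatible API, so the standard client works.
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)

response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI API, existing OpenAI-based tooling can point at Foundry Local with little more than a base URL change.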
Managing Models
To list the models and optimizations you have downloaded, use the cache command.
foundry cache list
This will display the alias and ID for each model.
You can also use the cache command to remove a model:
foundry cache remove qwen2.5-0.5b-instruct-cuda-gpu:4
Summary
Microsoft Foundry Local gives you the ability to run LLMs on your local computer, freeing you from being tethered to the cloud. You can develop and prototype AI applications locally before scaling them in the cloud. You’ll save on cloud compute costs and don’t have to worry about network latency, since Foundry Local runs as fast as your hardware allows. It’s also optimized for different hardware, even CPUs! Microsoft provides about two dozen “pre-optimized” models that you can use right away. The CLI lets you download and run models, chat with them in interactive mode, and manage the models you have downloaded. Microsoft Foundry Local is in public preview. You can keep up to date with all of the updates at foundrylocal.ai.
