Integrating AI into your Windows application can be achieved through two primary methods: a local model or a cloud-based model. For the local model option, you have the ability to utilize a pre-existing model or train your own using platforms like TensorFlow or PyTorch, and then incorporate it into your application via OnnxRuntime. The Windows Copilot Runtime offers APIs for various functions, including OCR or utilizing the Phi Silica model. On the other hand, hosting your model on the cloud and accessing it through a REST API allows your application to remain streamlined by delegating resource-intensive tasks to the cloud. See Use Machine Learning models in your Windows app for more information.
You can use any programming language you prefer. For instance, C# is widely used for creating Windows client apps. If you require more control over low-level details, C++ is an excellent option. Alternatively, you might consider using Python. You can also use the Windows Subsystem for Linux (WSL) to run Linux-based AI tools on Windows.
We recommend using OnnxRuntime.
Respecting the privacy and security of user data is essential when developing AI-powered apps. You should follow best practices for data handling, such as encrypting sensitive data, using secure connections, and obtaining user consent before collecting data. You should also be transparent about how you are using data and give users control over their data. Make sure to read Developing Responsible Generative AI Applications and Features on Windows too.
System requirements for Windows apps that use AI depend on the complexity of the AI model and the hardware acceleration used. For simple models, a modern CPU may be sufficient, but for more complex models, a GPU or NPU may be required. You should also consider the memory and storage requirements of your app, as well as the network bandwidth required for cloud-based AI services.
To optimize AI performance in Windows apps, you should consider using hardware acceleration, such as GPUs or NPUs, to speed up model inference. Windows Copilot+ laptops are optimized for AI workloads and can provide a significant performance boost for AI tasks. See also AI Toolkit for Visual Studio Code overview.
Yes, you can use pre-trained AI models in your Windows app. You can download pre-trained models from the internet or use a cloud-based AI service to access pre-trained models. You can then integrate these models into your app using a framework like OnnxRuntime.
DirectML is a low-level API for machine learning that provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
Open Network Neural Exchange, or ONNX, is an open standard format for representing ML models. Popular ML model frameworks, such as PyTorch, TensorFlow, SciKit-Learn, Keras, Chainer, MATLAB, etc., can be exported or converted to the standard ONNX format. Once in ONNX format, the model can run on a variety of platforms and devices. ONNX is good for using an ML model in a different format than it was trained on.
OnnxRuntime, or ORT, is a unified runtime tool for executing models in different frameworks (PyTorch, TensorFlow, etc) that supports hardware accelerators (device CPUs, GPUs, or NPUs).
PyTorch and TensorFlow are used for developing, training, and running deep learning models used in AI applications. PyTorch is often used for research, TensorFlow is often used for industry deployment, and ONNX is a standardized model exchange format that bridges the gap, allowing you to switch between frameworks as needed and compatible across platforms.
A Neural Processing Unit, or NPU, is a dedicated AI chip designed specifically to perform AI tasks. The focus of an NPU differs from that of a CPU or GPU. A Central Processing Unit, or CPU, is the primary processor in a computer, responsible for executing instructions and general-purpose computations. A Graphics Processing Unit, or GPU, is a specialized processor designed for rendering graphics and optimized for parallel processing. It is capable of rendering complex imagery for video editing and gaming tasks.
NPUs are designed to accelerate deep learning algorithms and can remove some of the work from a computer's CPU or GPU, so the device can work more efficiently. NPUs are purpose-built for accelerating neural network tasks. They excel in processing large amounts of data in parallel, making them ideal for common AI tasks like image recognition or natural language processing. As an example, during an image recognition task, the NPU may be responsible for object detection or image acceleration, while the GPU takes responsibility for image rendering.
To check the type of CPU, GPU, or NPU on your Windows device and how it's performing, open Task Manager (Ctrl + Alt + Delete), then select the Performance tab and you will be able to see your machine's CPU, Memory, Wi-Fi, GPU, and/or NPU listed, along with information about it's speed, utilization rate, and other data.
WinML, or Windows Machine Learning, is a high-level API for deploying hardware-accelerated machine learning (ML) models on Windows devices that enables developers to utilize the capabilities of the device to perform model inference. The focus is on model loading, binding, and evaluation. WinML utilizes the ONNX model format.
An LLM is a type of Machine Learning (ML) model known for the ability to achieve general-purpose language generation and understanding. LLMs are artificial neural networks that acquire capabilities by learning statistical relationships from vast amounts of text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are often used for Text Generation, a form of generative AI that, given some input text, generates words (or “tokens”) that are most likely to create coherent and contextually relevant sentences in return. There are also Small Language Models (SLMs) that have fewer parameters and more limited capacity, but may be more efficient (requiring less computational resources), cost-effective, and ideal for specific domains.
In Machine Learning, model training involves feeding a dataset into a model (an LLM or SLM), allowing it to learn from the data so that the model can make predictions or decisions based on that data, recognizing patterns. It may also involve adjusting the model parameters iteratively to optimize its performance.
The process of using a trained machine learning model to make predictions or classifications on new, unseen data is called “Inferencing”. Once a language model has been trained on a dataset, learning its underlying patterns and relationships, it's ready to apply this knowledge to real-world scenarios. Inference is an AI model's moment of truth, a test of how well it can apply information learned during training to make a prediction or solve a task. The process of using an existing model for inference is different from the training phase, which requires the use of training and validation data to develop the model and fine-tune its parameters.
Fine-tuning is a crucial step in machine learning where a pre-trained model is adapted to perform a specific task. Instead of training a model from scratch, fine-tuning starts with an existing model (usually trained on a large dataset) and adjusts its parameters using a smaller, task-specific dataset. By fine-tuning, the model learns task-specific features while retaining the general knowledge acquired during pre-training, resulting in improved performance for specific applications.
Prompt engineering is a strategic approach used with generative AI to shape the behavior and responses of a language model. It involves thoughtfully crafting input prompts or queries to achieve the desired result from a language model (like GPT-3 or GPT-4). By designing an effective prompt, you can guide an ML model to produce the type of response you want. Techniques include adjusting the wording, specifying context, or using control codes to influence model output.
Hardware acceleration refers to the use of specialized computer hardware designed to speed up AI applications beyond what is achievable with general-purpose CPUs. Hardware acceleration enhances the speed, energy efficiency, and overall performance of machine learning tasks, such as training models, making predictions, or offloading computation to dedicated hardware components that excel at parallel processing for deep learning workloads. GPUs and NPUs are both examples of hardware accelerators.
What are the differences between a Data Scientist, ML Engineer, and App Developer who wants apply AI features in their app?
The process of creating and using ML models involves three main roles: Data Scientists: Responsible for defining the problem, collecting and analyzing the data, choosing and training the ML algorithm, and evaluating and interpreting the results. They use tools such as Python, R, Jupyter Notebook, TensorFlow, PyTorch, and scikit-learn to perform these tasks. ML Engineers: Responsible for deploying, monitoring, and maintaining the ML models in production environments. They use tools such as Docker, Kubernetes, Azure ML, AWS SageMaker, and Google Cloud AI Platform to ensure the scalability, reliability, and security of the ML models. App Developers: Responsible for integrating the ML models into the app logic, UI, and UX. They use tools such as Windows Copilot Runtime, OnnxRuntime, or REST APIs and process the user input and model output. Each role involves different responsibilities and skills, but collaboration and communication between these roles is required to achieve the best results. Depending on the size and complexity of the project, these roles can be performed by the same person or by different teams.