
2 posts tagged with "GPU"


Introduction to Multi-GPU Training 1: From First Principles to Production

· 26 min read
note

This tutorial assumes you already know the basics of PyTorch and how to train a model.

If you've been training deep learning models on a single GPU and wondering how to scale up, or if you've heard terms like "data parallelism" and "AllReduce" thrown around without really understanding what they mean, this series is for you. We're going to build your understanding from the ground up, starting with the absolute basics of how multiple GPUs work together and going all the way to implementing production-ready distributed training systems.
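To make "AllReduce" a little more concrete before the series digs in, here is a minimal sketch using PyTorch's `torch.distributed`. The two-process setup, the `gloo` (CPU) backend, the localhost rendezvous address, and the tensor values are illustrative assumptions, not something from the articles themselves: each process contributes one tensor, and after the call every process holds the element-wise sum.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def demo(rank: int, world_size: int):
    # Each process joins the same group, contributes one tensor,
    # and all_reduce leaves the element-wise sum on every rank.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    t = torch.tensor([float(rank + 1)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.item()}")  # both ranks print 3.0 (1 + 2)
    dist.destroy_process_group()

if __name__ == "__main__":
    # Rendezvous settings for the default env:// init method.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    mp.spawn(demo, args=(2,), nprocs=2)  # simulate 2 workers on one machine
```

This is the same collective that distributed data-parallel training uses to average gradients across GPUs; the series covers how and why in detail.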

This first article lays the foundation. We'll cover why we need multiple GPUs, how they actually communicate with each other, and what happens under the hood when you run distributed training. By the end, you'll have the mental model needed to understand everything that comes next in this series.

Something You Need to Know About GPUs

· 8 min read
note

When using the provided server, everything, including the driver and the CUDA toolkit, is already installed, so you might not need to worry about these details at first. However, I strongly encourage you to understand these concepts, because you may one day need to maintain your own server (though hopefully you won't have to).

Introduction

Back in the day, I always wondered why we could run PyTorch code on a local machine without a GPU, but when it came to compiling a library locally or training a model, we suddenly needed the CUDA toolkit. What's going on under the hood?
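A quick way to poke at this yourself is to ask PyTorch what it shipped with. This is a minimal sketch; the version strings in the comments are illustrative, and what you see depends on which wheel you installed:

```python
import torch

# The PyTorch pip wheel bundles its own CUDA runtime libraries (or none at all
# for CPU-only builds), so importing and running torch does not require a
# system-wide CUDA toolkit.
print(torch.__version__)          # e.g. "2.4.0+cpu" or "2.4.0+cu121"
print(torch.cuda.is_available())  # False without an NVIDIA GPU and driver
print(torch.version.cuda)         # CUDA version the wheel was built against, or None
```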

In this article, we’ll break down the mystery behind CUDA, cuDNN, and all the other buzzwords. By the end, you’ll have a clearer (and hopefully less intimidating) understanding of how they all fit together.