NVIDIA presents: Using Python with GPUs
Transcript
OK, thanks for coming, everybody. I am Jess Eaton. You may have received emails from me. My colleague Liz Quan is there in the back, and we're both here from CUIT Research Computing Services. We're a group at CUIT that works to empower researchers by providing software tools as well as trainings and workshops like this one.
This is our second session with NVIDIA this semester, and today's session is a direct result of the feedback that we received from folks in the first session on what they wanted to see. So to that end, please fill out the super short survey I'll send after this with what you would want to hear, either from NVIDIA or other folks, on other research computing topics. We listen to you.
We are a small office up on 132nd Street, so that's how you can get your voices to us. And yeah, so the other upside of filling out that survey is you will get the slides from today's presentations after you fill it out. So I think, oh, the other thing is, while this is an in-person workshop, we have a Zoom going so we can record it in case anyone wants to go back and reference it.
I'll get that to you guys. But just so you know, if you speak up, your voice may be recorded in a recording. So you have been informed.
And without further ado, we'll pass this over to Michael Keith, our NVIDIA expert. Thank you, Jess. Thank you, Liz.
Very happy to be here, and I appreciate all the coordination and planning that went into this. I'm a solutions architect on the higher ed and research team at NVIDIA. We work with students, researchers, professors, research computing and HPC teams, and university leadership, all with the goal of accelerating your work, whether that's in the classroom, on the research side especially, or on broader university needs.
So by a show of hands, who's gotten through at least everything to set up their account and enter the event code? All right, anyone still working through that? All right, we got a couple. All right, so who else is working through that? I'd love to do a check on you guys. Yeah, if you're still working through it and having any issues, especially with the authentication, recommended browser is Chrome.
Chrome incognito mode tends to sort out a lot of the login issues. Once you get to the course home page, you'll see a start arrow on the lower right-hand side. Click Start.
That'll start to spin up the Cloud GPU instance. OK, anybody else still getting the course set up? One more, OK. Anybody else? All right.
So this is the course page. Looks like I'm already launched. So if your other tab happens to close, once this starts up, you should see the launch arrow.
So click Launch. That should launch a Jupyter environment in another tab. You can see I was already working in here today.
And if you can start by at least executing that first cell, that will make sure that you're able to send and receive from this AWS-based cloud instance. While you're trying that, a show of hands: how many people would consider yourself a Python expert in the room? Wow, some people actually put their hands up.
Very good. How many people are pretty OK with Python? You know your way around a notebook, but you're not a super expert. OK, perfect.
And same question: anybody new to Python? All right, fantastic. How many people, same question, but expert at using GPUs? How many people are totally new to using GPUs? How many people tried it one time, and it kind of worked OK, but you want to learn how to do it better? OK, fantastic.
So please ask questions today. It's much more engaging if you do. So is everybody comfortable with executing a command in a Jupyter notebook? OK, perfect.
So continuing with learning a bit more about you, show of hands, who do we have: undergraduate students here? All right, graduate students, postdocs and professors, researchers, other staff. Great. So we have a mix of people.
What about your fields? Who comes from a science background, say chemistry, physics, biology, computer science? Or actually, I'll say computer science. Chemistry, biophysics. OK, computer science.
All right, a lot of CS folks in here, so definitely more knowledgeable and skilled at CS than I am. What about engineering disciplines? EE, mechanical engineering. All right, great.
A few. Humanities? I guess. OK.
Humanities-ish. I don't know what that is anymore. Business.
Any other fields that I haven't mentioned? All right, great. So now I know a little bit about your backgrounds. I'm a physicist by training, so studied condensed matter theory and then worked in applied quantum sensing and computing for a while.
Picked up some AI and data science along the way. And here we are to learn how to accelerate scientific workloads and data science-based workloads with GPUs and Python. So in terms of accessing the workshop materials, I think we've gotten everyone on there.
Just know that if your computer goes to sleep, you'll lose the connection. You may have to close the Jupyter environment tab and then just click Launch again. That should keep your session active.
In terms of the Jupyter Environment, these materials will be available to you for six months. So feel free to log back in, execute some of the notebooks. Again, feel free to download the notebooks, either individual files or as a tarball.
There will be a limit on your total GPU time. All right, I think it's about eight hours. And it's a two-hour course.
So you'll have some time to get back in there over the next six months. But at a certain point, you will run out of GPU hours. It is not meant as a general GPU computing platform.
So if you try to run your own workloads, you may get flagged and kicked off. I've gotten close. I haven't fully gotten kicked off yet.
There's some people smiling at the camera. Yes. So in terms of what we're gonna cover today, so we've covered how to access the workshop materials.
I'll go through a very brief and high-level overview of GPU computing and compare it with CPU computing, how they can work together. And then the bulk of our technical focus today will be on using Python to GPU accelerate workloads. And that's with numerical computing and data science.
Finally, we'll apply these tools that we've learned or techniques that we've learned to a case study that's representative of a geospatial dataset. Today, we'll be working on NVIDIA A10 GPUs. So you can see the specs here.
We'll show you how to access the specs on your machine and get a snapshot of the GPU memory, current utilization, power, temperature. One of the things you may encounter is having the right GPU for the right job. So when you're running workloads, you may want newer GPUs.
You might want many GPUs. You might want many nodes of many GPUs for exploratory purposes. This is an excellent GPU for what we're gonna be doing today.
All right, so everyone's gone through setting up their developer account, getting to learn.nvidia.com and entering the code. All right, and then once you've entered that code, you should only have to do it once. After that, if you go to the dropdown menu next to your name, the course will show up in progress under My Learning.
All right, and for this first part, we're gonna be working in the first section of that first notebook. So overall, how today's gonna go, I'll present a few slides, talk about what we'll be looking at in the notebooks, give you some time to interact with and read the notebooks on your own, work through them, execute the cells, change things around, and then we'll come back together, share any observations or questions that you've had. So for the first part, you'll notice that this environment has everything pre-configured.
So it has the GPU, CPU, all the software that's needed. This is done by using containers. Containers have all of the runtime components needed for an application.
You can think of them as lighter weight than a VM, which has its own operating system. A container uses the host operating system, lower-level drivers, and hardware, but packages all the runtime components an application needs. So it can include a different Linux distribution and then everything that's built on top of that.
The RAPIDS container from NVIDIA's NGC catalog has this same environment. So if you're interested in utilizing some of these tools and making sure that they all work with each other, you can utilize the RAPIDS container. And I would say, reach out to your CUIT points of contact to understand how you can run containerized workloads on HPC systems.
All right, I know there's excellent documentation on the website already. So where does GPU computing come into play? The answer is really across the spectrum of computing jobs. And we can break this down into three different categories.
So the first, think traditional HPC workloads, all right, where you're using GPUs to accelerate tasks, which you can parallelize. The second and really the earliest use of GPUs, they're graphical processing units. So they were used for computer graphics, right? So for generating, especially visualizations for video games, this field of computer graphics has now expanded well past video games.
So that still includes video games, but it's expanded to include simulating virtual worlds, right? For training autonomous agents. And finally, artificial intelligence. So around 2012, researchers realized that GPUs, because they're matrix-multiplying machines, are amenable to training deep neural networks. And the combination of large curated datasets, advances in models, and computing power drove this revolution of AI forward.
Since then, GPUs have been increasingly tailored towards AI workloads. So some GPUs still focus more on the computer vision side, some focus totally on the AI side and some have capabilities that do both. So you might wanna run an AI model that's generating graphics.
So you need both the AI backend, but also the ray tracing capability. The majority of our workloads, when we interact with our desktop computers and laptops, run on CPUs, and CPUs are optimized for serial tasks. They're very low latency.
So you do something, you get a response, and your computer is gonna carry out a lot of different tasks sequentially. GPUs, on the other hand, are optimized for parallel computing. So that helps if you have a large number of tasks to carry out and they can be parallelized.
So think of parallelizing a for loop where there's no dependence between the steps of the loop. Or think about training any kind of machine learning or AI model, where you're passing different batches of data through at the same time; you can parallelize all of that, both the forward and backward pass. So the challenge here is identifying the portions of your code that can be parallelized and then using the appropriate level of API to parallelize that part of your code.
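As a rough sketch of what "parallelizable" means in practice (this isn't from the course notebooks; the function and sizes are made up), the same loop with no dependence between iterations can be written as one array expression, which is the form NumPy runs well on a CPU and CuPy runs well on a GPU:

```python
# A minimal sketch: a loop whose iterations are independent, versus the
# equivalent single array expression.
import numpy as np

x = np.random.random(100_000)

# Loop version: each step only depends on its own element of x.
out_loop = np.empty_like(x)
for i in range(x.size):
    out_loop[i] = 3.0 * x[i] ** 2 + 2.0 * x[i] + 1.0

# Vectorized version: the same math as one array operation, which the
# library can execute in parallel (NumPy on CPU, CuPy on GPU).
out_vec = 3.0 * x**2 + 2.0 * x + 1.0

assert np.allclose(out_loop, out_vec)
```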
All right, we'll show you a couple of ways to do that today. These speedups kind of come at the time where Moore's law is ending. All right, so Moore's law, the density of transistors in a certain area on a chip, it was doubling at a certain rate, I think 18 months for a long time.
And as that became harder and harder because you're going down to smaller and smaller nanometer size scale features, it's harder and harder to realize those density gains. We measure computing power in terms of floating point operations per second. So the number of flops, which rose with that density of transistors on a chip also started to tail off.
But in terms of HPC workloads, by scaling machine learning workloads up to multiple GPUs and multiple nodes, and even using lower precision to carry out some of these computations, the trend was able to continue through GPU acceleration. So the main takeaway here is that this has powered a lot of advances within the sciences: whether that's using convolutional neural networks to discover gravitational waves, drug discovery, which is a hybrid workflow that requires some HPC aspects and some AI-accelerated aspects, fusion reactors, or real-time CFD. Computational fluid dynamics is a very mature and computationally intensive field.
One way that both the research and industrial communities are addressing that now is to train AI models as surrogate models. They may not be as good as a full simulation, but they allow you to iterate much faster, tune some parameters on the fly, get real-time visualizations, and then guide where you would wanna look with your full-scale simulation. There are a number of different ways to accelerate your code on GPUs.
So often a question we get from researchers is, how should I do this? And the answer is, there's no one way to do this. So one of the families of approaches is through accelerated standard languages. NVIDIA contributes to these language standards for C++ and Fortran.
And with every generation of standard releases, there's more and more parallel capability that's incorporated into these libraries. And then based on the compiler that's used and the hardware backends, you're able to run the same code on a CPU, multiple CPUs, a GPU, or multiple GPUs. Python doesn't have a formal standard, right? It's a community-based language, so there's no standard Python, but the de facto standards also include acceleration. So one of the great features about Python is the very diverse ecosystem of developers.
So a lot of different open source tools, and we'll be looking at cuNumeric as one of the examples. If you're working on the C++ side, two other ways that you can start to accelerate your code are by using OpenMP and OpenACC. So here, you're going to annotate or decorate your code with these pragmas.
And the pragmas provide hints to the compiler to indicate where the code can be parallelized. So these are examples of parallel for loops, which are then able to be accelerated on GPUs. Finally, at the lowest level, in terms of how close to the metal you're programming, there's the CUDA family of languages, right? So CUDA C, CUDA C++.
Here, you're programming down to individual blocks and threads within a GPU. So this is when you may want to eke out those last couple of percents of performance for your application. If you're already using deep learning libraries, fortunately, a lot of this is done for us, all right? So there's a lot of effort that goes in both from the PyTorch team at Meta, also from the NVIDIA team that works on PyTorch to optimize PyTorch for running on NVIDIA GPUs, all right? So a lot of this CUDA backend stuff for say PyTorch, TensorFlow, JAX is already taken care of.
But you may find that you'll need to understand how this works if it turns out to be a limiting factor in the new architecture you're developing. When we think about the different fields this spans: certainly numerical computing, all of your data analysis, any kind of data-driven solution, artificial intelligence, and increasingly quantum computing. So quantum computing is an active area of R&D, looked at as a possible next phase of computing.
It's one of the areas that I work on. So one of the ways that NVIDIA looks at quantum computing is that it's not its own distinct entity, but it's gonna be another capability within a high-performance computing system. So think CPU, GPU, QPU, tightly interconnected so that you can leverage the right tools and also the right combinations of tools for the right job.
So thinking about CUDA on the right-hand side, right? Or even C, Fortran, these are compiled high-performance languages. On the left-hand side, we have Python. And it is a very developer-friendly language.
The challenge for us is to bridge the gap between Python and the performance that's realized on some of these lower-level languages. Within NVIDIA, there is a lot of software development activity now in order to bridge this gap. So the goal is to make GPU acceleration available through Python APIs to give you the same, if not very close to the same performance you would get if you were programming in CUDA C, C++.
There are a few different ways this is done, right? We have a lot of Python experts in here. JIT compilation, right? So just-in-time compilation. Python by default is not a compiled language, but you can decorate certain functions or methods to be compiled at runtime.
Low-level language bindings. If you're familiar with using NumPy or SciPy, the back ends for all those libraries are C-based. So the numerics themselves run quickly.
And oftentimes if you profile a workload in Python, you'll see that the numerical compute runs very quickly, but when you have to go back up to the Python layer, it slows down and you go back down and it runs fast. There are a couple of ways that we can look at how to get the most out of GPU computing because the GPU doesn't act in isolation. So think about a CPU and a GPU and each have their own memory.
And there's an interconnect between the CPU and the GPU. Each of those links has its own bandwidth. So how fast can you load data from one aspect of the system to another? And you'll see that both the host memory to CPU and especially the interconnect between the CPU and the GPU are two of the slowest links within the system.
All right, so keep this link in mind because moving data from the CPU to the GPU and vice versa tends to be a bottleneck in a lot of applications. Fortunately, it's not a showstopper. So there are ways to, we call it like hide this latency.
All right, so you wanna be able to overlap your communication and computation so that you're not bottlenecked by a lower bandwidth here. Amdahl's law describes how much a program can be sped up, based on the fraction of it that can be parallelized. So there are certain algorithms and applications which are more serial in nature.
So think about maybe a Monte Carlo simulation where one step depends on the previous step. Those are harder to parallelize. So it may make sense to keep the sequential components of your program on the CPU because they're gonna run with lower latency, right? And they're better suited towards CPU based operations.
But if you have portions of your application that are parallelizable, so think about any kind of large matrix vector operations, offload those to the GPU, run them quickly, right? And then feed them back into the rest of the program. In terms of where to look, so developer.nvidia.com is our developer website. The best way I stay on top of what's going on within NVIDIA is through NVIDIA's blogs.
So we have blogs like general community blogs. We also have technical blogs. So a lot of times they're associated with each other.
So if you wanna learn at a high level what's going on within the entire NVIDIA ecosystem, whether that's NVIDIA itself, researchers who are using NVIDIA platforms, or partner companies, I highly recommend taking a look at some of the blogs. There are often links to NVIDIA's GitHub repos.
The vast majority of our code is open source, so those are great places to look for some pointers. Profiling tools, right? So when you're running an application and you want it to run faster, one of the best ways to first understand how your program is running and then optimize it is through profiling.
NVIDIA releases a family of profiling tools called Nsight. So Nsight Systems is your system-level profiling tool. It allows you to run an application and look to see how each part of your program is utilizing memory and compute, and how they're communicating back and forth with each other. I mentioned the containers.
So if you wanna check out what containers are available, ngc.nvidia.com is a good place to look. All right, so on the lower right, it shows a schematic of a four-way A100 system. All right, so this is a single node.
So it would be a single CPU associated with the system. The interconnect between the CPU and the GPU is by PCIe. All right, so that's your lower bandwidth interconnect.
Within that node, there are high-bandwidth interconnects between the four GPUs. So when you start to scale out to multi-GPU jobs, you wanna leverage these interconnects in ways that maximize the utility of the hardware. So any questions on the overview, on CPU and GPU computing in general, or any experiences people wanna share with getting up and running on GPUs, maybe some bottlenecks you observed? All right, yes, did you have something? When loading data, you know, you're in front of a lot of GPUs, but you get that pause and then...
Yes, what kind of applications were you working with? To be honest, just the library, I guess.
Okay, yeah, which is awesome. But the best way to learn is by playing. Yeah, so there are ways of kind of hiding that latency.
Right, so if you can have it prefetch some of that data before the GPU needs it, the cost is still there, right? But you can hide that latency. Yes. We've seen a lot of the images out there of combining multiple GPUs, where you get stacks of them in data centers.
It shows that all the individual GPUs are interconnected, as you've just shown in the previous slide, and there is one CPU outside of that stack that is connected through PCIe. Yeah, so that's a good point.
So that is one way to do it, right? In that case, all of your traffic in between nodes would have to traverse the PCIe bridges between CPUs. There are other networking architectures where there's a fabric that connects GPUs through switches. And in that way, it won't have to go through those PCIe interconnects.
So it all depends on how the system was designed. But in general, what you said is, it's a common architecture, especially for HPC systems. Systems that are more optimized for AI tend to have a different fabric architecture or network architecture.
So there are theoretically no limitations on how many GPUs, but the bottleneck would be the PCIe capability to enable communication between the CPU and the GPU stack? Yes. Yep, and just overall, yes, PCIe and overall networking capability.
Yep. Yeah, so we'll take a look at some examples of scaling later on in this workshop. And you'll see that ideally, so if you were say running a fixed job size and doubling your compute power, that your runtime would go down by a factor of two every time, right? In practice, it doesn't follow that linear scaling because there is some overhead with that communication.
So the first set of libraries we'll look at are NumPy and CuPy. So NumPy may be familiar to you by show of hands, who's used NumPy? All right, very large majority, if not everyone. It's the backbone of numerical computing in Python.
All right, so I've spent quite a bit of time working with NumPy, especially before I knew that GPU accelerated alternatives exist. CuPy is an open source project. NVIDIA contributes to and supports it, but does not own this project.
And it's designed to be the same API as NumPy, but it's amenable to GPU acceleration. All right, so the same underlying structures can be transferred from NumPy to CuPy and vice versa. NumPy is tuned for CPUs.
All right, so all of the typical parallelism that you get by NumPy indexing and by doing array-like operations instead of, you know, brute force for loops, they're very sophisticated and they run very well on CPUs. CuPy is designed to do the same thing on GPUs. As NumPy is the backend for many other data science libraries, so SciPy and Pandas, CuPy acts as a backend for GPU accelerated data science libraries.
So here's a code snippet that shows ideally what you'd be able to accomplish. With only the change of an import statement, you should be able to run the same code on CPUs as you would on GPUs. You may see some examples out there, in documentation or even on NVIDIA's blogs, showing that you can just import cupy as np and make that single one-line change to your code.
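Here's a minimal sketch of that pattern (array sizes are arbitrary); the explicit version keeps both imports, and the one-line swap mentioned above is shown in the comment:

```python
# Explicit version: NumPy on the CPU, CuPy on the GPU.
import numpy as np
import cupy as cp

x_cpu = np.random.random((4000, 4000))
y_cpu = x_cpu @ x_cpu                  # CPU matrix multiply

x_gpu = cp.random.random((4000, 4000))
y_gpu = x_gpu @ x_gpu                  # GPU matrix multiply
cp.cuda.Device().synchronize()         # wait for the GPU to finish

# The "one-line" version mentioned above would instead be:
#   import cupy as np
# after which the NumPy-style code below it runs on the GPU.
```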
There are pros and cons of doing that, right? So yes, as a developer, it would be nice to just change a single line of code and realize GPU acceleration, but you may want to be more thorough, just to make sure that you have one-to-one support between the two libraries. The main takeaway is that by using the same types of APIs, you can realize a 10x, 20x, 50x speedup by running these applications on GPUs. So at this point, we'll take a look at the notebooks and we'll work through an introduction to CuPy.
So if we shift over here, I'll take a minute to work through the introduction. So, hello world, make sure that your Jupyter notebook is live. Do you have a question or comment? I have a question.
So here, the example you showed made me think about JAX. Is it similar, or is that the goal of the library? I think so, yeah. So JAX is a numerical computing library that's GPU-accelerated.
I would say JAX offers additional features that CuPy does not. So JAX's native auto-diff capability, I think, is one of the unique aspects of it. But that's a great point.
So a lot of your general purpose computing, you could also do in JAX. Let's evaluate the second cell. So anytime you prepend a command with an exclamation point in a Jupyter notebook, it will run as if you're running it in the terminal.
So exclamation point, nvidia-smi. This is the NVIDIA System Management Interface. It gives you a snapshot of the GPU utilization at the instant that it's run.
So you can see here, it shows you the version of the drivers and CUDA that is installed, which GPU is being used, whether your fan is running or not, and the current temperature. If you want, you can launch a terminal in a separate tab and run either nvidia-smi or watch nvidia-smi, which will refresh on a two-second interval.
As you start to run jobs in the Jupyter notebook, you can see how the fan starts to kick on, the temperature starts to go up, memory starts to be utilized. The A10 has 24 gigs of GPU memory. You'll see 23 here, right? So the system and drivers take up a little bit of memory on there.
It's also good to know what CPU you're using. So lscpu will list the attributes of the CPU. So with that, I'll give you some time to work through this first section.
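Before you dive in, this is roughly what those checks look like as notebook cells; the exclamation point hands each command to the shell:

```python
# Run each of these in its own Jupyter cell:
!nvidia-smi   # GPU model, driver and CUDA versions, memory, utilization, temperature
!lscpu        # attributes of the host CPU
```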
The goal is to do a side-by-side comparison of the NumPy and CuPy APIs and to observe any differences in how well they perform. Sorry, I'm going to be a bit late. Do you know if you can show a slide where you have the links? Yes.
By a show of hands, who's made their way through that first section? All right. Who could use another five minutes or so? All right. Let's take a look.
So what were some of your observations? Just this introduction to CuPy, stopping at the kernel overhead section. All right. So if you've gotten through here, any timing differences you noticed between CPU and GPU? The same.
The same? Okay. Maybe we need new GPUs. Yeah.
So hopefully you saw that the GPU ran faster, especially on larger data. All right. So there's always overhead with compute.
There's overhead with parallelization. And on smaller data, you might find that the CPU runs comparably or even faster than the GPU. But as soon as you get to a certain critical data size, you'll find the GPU is much faster.
So one of the observations was that when you were performing the matrix multiply, it was performed once before the timing started. And the question was, why is that done? The answer is there's some overhead that is entailed with just-in-time compilation. So the first time that operation is created and run, it's going to be compiled at runtime and then executed.
So the first time it has to compile. Once the operation is compiled, if it's reused again or once it's compiled, it's cached. So then if it's reused again, it doesn't have to be recompiled.
So that's why typically, whether you're multiplying matrices using CuPy or training deep learning models, you'll find that the first time through typically takes longer. After that, some of those initial runtime overheads have been executed already, they're cached, and then everything else runs faster.
So that overhead is amortized the larger your datasets are and the more times you run these operations. This is a schematic of what happens.
So you call it the first time. There's a decision tree: if you haven't run it already, it's the first time calling it, so it's going to compile and cache it.
If it's not your first time, then it'll use that cached version and run the compiled machine code. If you change anything in there, then it'll have to recompile it. So it's checking there.
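As a sketch of that warm-up pattern (sizes are arbitrary, and exact numbers will vary), timing the same CuPy operation twice shows the compile-and-cache cost on the first call:

```python
import time
import cupy as cp

x = cp.random.random((2000, 2000))

def timed_matmul(a):
    start = time.perf_counter()
    b = a @ a
    cp.cuda.Device().synchronize()   # wait for the GPU work to finish before stopping the clock
    return time.perf_counter() - start

print("first call :", timed_matmul(x))   # includes compile/cache overhead
print("second call:", timed_matmul(x))   # reuses the cached, compiled kernel
```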
So the compilation is one type of overhead. There's also a kernel launch overhead. A kernel is a set of instructions that runs on a GPU.
You can think of a kernel on a GPU as analogous to a low-level function on a CPU. So there's some overhead in launching that kernel and getting it running on the GPU. And you're going to incur that launch overhead every time a kernel is launched, even once it's compiled.
So if you have multiple small functions, and each of those is compiled just in time and has its own kernel associated with it, there's kernel launch overhead for each of those function calls. In order to have less kernel overhead, you can fuse different functions into a single kernel. So this cp.fuse decorator allows programmers to fuse multiple operations, additions, subtractions, multiplications, into a single kernel.
And now you're only paying the overhead of a single kernel launch and then able to execute that instruction set in parallel. You'll be able to quantify this in the notebook. So if you work through the next section on kernel overhead, you'll see some of the time differences.
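Here's a small sketch of that decorator (the function itself is made up); the fused version pays for one kernel launch instead of one per elementwise operation:

```python
import cupy as cp

@cp.fuse()
def fused_expr(a, x, y):
    # Several elementwise operations fused into a single GPU kernel.
    return a * x + y - x * y

x = cp.random.random(10_000_000)
y = cp.random.random(10_000_000)

out_fused   = fused_expr(2.0, x, y)   # one kernel launch
out_unfused = 2.0 * x + y - x * y     # several separate kernel launches
cp.cuda.Device().synchronize()

assert cp.allclose(out_fused, out_unfused)
```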
You'll also notice this CUDA device synchronize. So the print function or the timing is going to be running on the CPU. So this just says that the CPU has to wait for the GPU to finish that job before it runs the final step of timing that entire operation.
Because they're asynchronous. So the CPU will launch the job on the GPU. But if you don't have this CUDA device synchronize step, then it won't wait for the GPU to finish that before it executes the next step.
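A short sketch of why that matters when you time things yourself (again, sizes are arbitrary):

```python
import time
import cupy as cp

x = cp.random.random((4000, 4000))

# Without synchronize: this mostly measures the time to *launch* the work,
# because the kernel runs asynchronously while the CPU moves on.
t0 = time.perf_counter()
y = x @ x
launch_only = time.perf_counter() - t0

# With synchronize: the CPU waits for the GPU to finish before stopping the clock.
t0 = time.perf_counter()
y = x @ x
cp.cuda.Device().synchronize()
full_time = time.perf_counter() - t0

print(f"without sync: {launch_only:.4f} s, with sync: {full_time:.4f} s")
```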
So work through this next section, check in in about five minutes, and then we'll look at data movement. How's everybody doing with that section? Show of hands, we could use a few more minutes. Ready to move on.
I'm not sure off the top of my head. But one way you'd be able to look at this. So there are a couple of notebooks we're not going to cover today, which look at Numba, which is a compiler for Python.
And Numba uses both CPU and GPU backends, depending upon which option you choose. So you'd be able to run a comparison and take a look at how you're using the same library if it takes longer to compile for CPU versus GPU. In terms of compiling, say, C or C++ code, I haven't seen a major difference using the same compiler, whether you're compiling for one or the other.
JIT compilation, in my understanding, is a technique that's introduced by using CuPy and is not actually in place in NumPy? Yes. Yeah, and JIT compilation in general is available through a number of different Python libraries.
Right, so I don't think. Let's see. Yeah, for NumPy, did you notice any difference? I don't think NumPy is compiled by default.
Right, but this JIT compiler pretty much allows you to compile portions of code at runtime. And there are different entry points to it. Right, so one is through CuPy, Numba is another one.
I think other libraries have other compilers. We're going to cover the next two sections. So, next two sections are data movement and memory management.
So, going back to this schematic of the CPU and the GPU, each with their own memory: you have a memory bandwidth associated with each device and then an interconnect between the two. We can see that if you're doing a primarily GPU-accelerated workload, once the data is in GPU memory, both the compute and the memory bandwidth are very fast. But the bandwidth between the CPU and the GPU is fairly slow.
Right, so think, you know, it's about an order of magnitude slower than the GPU memory bandwidth. And this compares what would have to happen in terms of time if the data is created on the GPU and processed on the GPU versus created on the CPU, moved over to the GPU, processed on the GPU. Right, so you can see that both the time it would take, especially if the process can be parallelized.
So, this block takes more time than this block, and then there's a massive overhead in the middle. Right, so the biggest thing is to keep in mind that this bottleneck exists. There are certain cases where you'll be able to generate or load your data directly to the GPU.
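Here's a small sketch of those two paths, assuming CuPy and arbitrary sizes: generating data directly on the GPU versus creating it on the host and paying for the copy across the interconnect:

```python
import numpy as np
import cupy as cp

shape = (8000, 8000)

# Path 1: create the data directly in GPU memory, no host-to-device copy.
a_gpu = cp.random.random(shape)

# Path 2: create on the CPU, then pay the PCIe transfer cost to move it over.
a_cpu = np.random.random(shape)
a_moved = cp.asarray(a_cpu)          # host -> device copy

result = (a_moved ** 2).sum()        # compute on the GPU
result_host = cp.asnumpy(result)     # device -> host copy (a single number here)
```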
There are other times when you won't be able to do this. So, keep it in mind if you're seeing slowdowns at a certain point, it could be because you're bottlenecked by this data copy. Right, finally, managing GPU memory.
So, similar to CPUs, GPU memories are finite, and especially in high-level languages like Python, there can be a lot of memory that's used and then not released once the job is done. So, you may want to manage the memory on your own. Right, and the goal here, you know, from the Python side, as I mentioned earlier, is to balance this productivity, so a very developer-friendly language, with performance.
And in order to realize that performance, one of the ways you can look at this is through a scaling study. There are two types of scaling: strong scaling, where you fix the problem size and add compute, and weak scaling.
In weak scaling, as you increase the compute you have access to, you increase the size of the problem. So in this case, you're increasing the size of the dataset that you're processing. Ideal weak scaling would be a horizontal line.
So the same amount of time that it takes you to perform a certain workload on a single GPU, as you double that workload a number of times and double the number of GPUs, you would like to see this go straight across. Legate is a family of libraries that enables scaling of these approaches to multiple GPUs and multiple nodes. You can see here that CuPy running on a single GPU is slightly faster than using Legate with cuNumeric.
Think of cuNumeric as a multi-node generalization of CuPy, here running on a single GPU. So again, parallelism incurs overhead. So you want to incur that overhead when it makes sense.
So if you know that your job is going to run on a single GPU, there's a marginal difference here, maybe 10%, but it could make a difference to you. But if you know that you're writing code that will run on a variety of system sizes, then it may make sense to pay this overhead. Because as you go to two and four and eight GPUs, you're able to realize the same amount of time to solution for larger problem sizes.
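If you want to try that, a minimal sketch looks like a NumPy script with the import changed; this assumes the cunumeric package and the Legate launcher are installed, and the exact launcher flags may differ by version:

```python
# saved as scale_test.py (hypothetical file name)
import cunumeric as np   # drop-in for the NumPy API, scalable via Legate

x = np.ones((20_000, 20_000))
y = np.sqrt(x) + x.mean()
print(float(y.sum()))

# Then run it through the Legate driver, for example (flags may vary by version):
#   legate --gpus 2 scale_test.py
```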
There's one thing I'm going to point out inside this notebook. So does anyone have experience programming GPUs and managing memory using malloc or cudaMalloc? Ever hear these terms before? All right. So some familiarity.
One of the things to look at here is how memory is allocated. And I think this is one of the more interesting aspects of especially current computing systems. So where the memory is allocated matters.
In this case, the example just shows that if you're allocating more memory than you have available on your GPU, and you're only using cudaMalloc, then you'll get an out-of-memory error, right? So you'll run out of memory. You can also create this mempool. And if you use cudaMallocManaged, you're able to utilize both the CPU and the GPU memory.
cudaMallocManaged allows the GPU to access CPU memory and vice versa. There are some details that go on under the hood in terms of where certain data is physically put into memory. And there are some more advanced techniques to work with heterogeneous systems where you're really trying to leverage both CPU and GPU memory and compute.
But this shows that you can allocate a memory pool that expands past the size of your GPU memory, right? So it's useful when you're working on bigger and bigger jobs. The CPU is still a valuable aspect of the computing infrastructure, and especially its memory, because it tends to be larger than the GPU's. So as you scale problem sizes up, you see that by using this managed allocation, you can extend your memory pool.
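Here's the CuPy side of that pattern as a sketch: swap in a managed-memory pool so allocations can spill past physical GPU memory (whether this performs well depends on your access patterns):

```python
import cupy as cp

# Route CuPy allocations through CUDA managed (unified) memory.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# Illustrative size: ~28.8 GB of float64, larger than the A10's 24 GB of GPU
# memory, so it only works because pages can be backed by host memory.
big = cp.zeros((60_000, 60_000), dtype=cp.float64)
big += 1.0
print(pool.used_bytes() / 1e9, "GB allocated from the managed pool")
```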
In terms of the next 15 minutes or so, I'll check in in 15 minutes, all right? That should give you enough time, hopefully, to work through some of this. Also take a five-minute break or so.
All right? So if you need to get up, walk around, grab a drink of water. We'll pick back up at 2:25.
Welcome back, everyone. So we have a little bit more than a half hour left. Any observations from working through the remainder of that notebook? Questions? All right, so Jeffrey brought up a good point about taking this to your own work, right? So one of the questions and maybe things I want to learn from you on is, where do you see some of these techniques being helpful in accelerating your work? Are there any that stand out? What do you got? Okay, great.
And do you code a lot of that stuff by hand? Yeah. Okay, awesome. So this is an opportunity to leverage that.
Anyone else? If nothing else, how about this: time.time, t1, t2, and print your computation time, right? This is before you get into profiling tools and looking at traces and memory management and things like that. This is a great way to look at elements of your code, how fast they run, and how much faster they could run. So don't take for granted how long it takes to run a certain thing now.
One of the paradigm shifts is asking how fast you can run this. And it's not just about running it faster; think about your iteration time through a research or learning cycle. The faster you can implement, run, observe, and then optimize, the faster that's going to drive your learning path or research path. And I come from a research background.
So running simulations of quantum systems is very compute and time intensive. And for a long time, they just took however long they took. I worked that into my workflow.
So I knew simulations were going to run overnight. So I'd work on them during the day, launch before I go home, come in, read the results. If that could be turned down to an hour or to 10 minutes, that massively accelerates your iteration cycle.
In the next notebook, we're going to take a look at RAPIDS. So RAPIDS contains GPU-accelerated versions of many of the libraries in the traditional Python data science stack. It covers tasks going from data loading and preprocessing through analytics and model training, as well as visualization, both exploratory and for results.
These different GPU-accelerated versions can work together with Dask scheduling. So if you're using Dask for larger-scale workloads, it can be combined with some of these GPU-accelerated libraries. And they're all built on the Apache Arrow memory format, which optimizes the way your data is stored and then loaded for compute.
So similar to what we saw in the early examples of where NVIDIA is accelerating these different domains, all these different domains can be sped up through Rapids. So this is an example of a possible workflow where you're doing some molecular dynamic simulations. You can run them on a CPU or there are GPU accelerated versions, which are going to run faster.
And then you're going to do some midstream analysis. There's a lot of either feature identification or kind of classical ML outside of deep learning that goes into some of the intermediate steps. You can accelerate that with RAPIDS, and then you're already on the GPU.
So you can run this entire pipeline, including training a neural network on a GPU. When we think about the traditional tools, Pandas is one of the most widely used libraries for data preprocessing, especially for tabular data. So think multivariate time series data where you have lots of different elements or devices that you're analyzing, number of different features or measurements that are coming in over long periods of time.
All the preprocessing is done in Pandas, and then the downstream modeling tasks in scikit-learn. Anybody work on graph analytics problems? All right, we've got a couple. So graphs are really interesting.
They represent relationships in a lot of real-world systems. Fundamentally, they're nodes and edges. So you can think of a road network as being well represented by a graph, where the nodes are cities and the edges are the roads between those cities.
If you're looking at distances, the weights of those edges could be the distance between the cities, or you could incorporate the speed on those roads if you're looking at throughput through a logistics chain. NetworkX is a common Python-based graph library, with things like finding the shortest path between two points.
And then down to the deep learning side of the stack. Deep learning has the advantage that it really came of age during the time when GPUs were being used for machine learning and data-driven tasks in general. So I would say, unlike some of these other libraries, deep learning libraries are natively GPU-accelerated.
And then finally, visualization. So matplotlib or other libraries that are built on top of that for visualization, these all have counterparts in the GPU-accelerated space. So instead of pandas, we have cuDF for preprocessing, cuML for machine learning, cuGraph for graph analytics, and then not a whole lot of change to the deep learning ecosystem.
So if there's a takeaway for how you might implement this in some of your own work, if you're using any pandas, scikit-learn, NetworkX, other libraries, take a look at the RAPIDS ecosystem, and it will show you what the drop-in replacements should be. Similar to what we saw with cuPy and numPy, the APIs are designed to be as close as possible, if not identical. So the goal is to make them identical where possible.
Especially with the data science side of the stack, I recommend that you make sure that the results you get out of one are the same as what you get out of the other. So one area where I've seen these two differ: if you're doing a lot of group-by / apply / combine operations in pandas versus cuDF, I've seen different behavior in the default sorting within groups. So just something to be aware of: you might need to make explicit in one library what's implicit in the other, where the default is slightly different.
So here's the one-liner drop-in replacement: import cudf as pd. Then all of your pandas DataFrame operations, like your read_csv, head, tail, and anything you do downstream from that, will be GPU-accelerated.
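A short sketch of that drop-in pattern (the file name and column names are placeholders), including making the group-by sorting explicit, since that's where the pandas and cuDF defaults can differ:

```python
import cudf as pd   # drop-in for: import pandas as pd

df = pd.read_csv("measurements.csv")   # hypothetical file
print(df.head())

# Group-by aggregation on the GPU. Passing sort explicitly avoids relying on
# defaults that differ between pandas and cuDF.
summary = df.groupby("device_id", sort=True).agg({"value": ["mean", "max"]})
print(summary.tail())
```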
It's interoperable with a large number of different GPU accelerated libraries, whether they're more on the numeric computing side or on the deep learning side. I think it makes sense to go through these next two sections in one block. So once you do your preprocessing and initial visualization and you get your data into a format that's ready to be modeled, your next task is to run a bunch of models and see how well they predict.
I think this picture makes me think of the drug screening pipeline. So when pharmaceutical developers are trying to identify candidate molecules for drugs, they go from a massive space. I think it's like 10 to the 40 or something like that, so really massive.
While there's certainly still a need for both the numerical modeling that's based on the chemistry of these systems and the benchtop work that goes into testing to make sure they're biocompatible and effective and not harmful, a lot of the work is done in the top parts of these cones. So to the extent that machine learning, and especially GPU-accelerated machine learning, can be applied to narrow that search space down and identify better candidate molecules for certain purposes, this is one area where these types of workflows, end to end, can be very helpful in advancing the field. Scikit-learn and cuML have a bunch of pre-built datasets, right? These are not necessarily representative of what you'll see in the wild, but this one is a good example of a non-linear classification problem, right? So this is something where a linear classifier will not effectively separate these two half moons.
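A sketch of that kind of problem: generate the half moons with scikit-learn, then fit a GPU-accelerated classifier from cuML (this assumes a RAPIDS environment; the sample count and model choice are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from cuml.neighbors import KNeighborsClassifier   # GPU-accelerated, scikit-learn-like API

# Two interleaving half moons: not separable by a straight line.
X, y = make_moons(n_samples=200_000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=15)
clf.fit(X_train, y_train)                 # NumPy inputs are copied to the GPU
print("accuracy:", clf.score(X_test, y_test))
```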
So as you make the dataset bigger, you might experiment with cuDF and cuML in order to generate, analyze, and then model these datasets. The algorithm ecosystem continues to grow. So at least when this was created, and the last time I checked, there wasn't full coverage in cuML of all the algorithms available in scikit-learn.
All right, this is changing over time. So the goal is always to map one for one. Scikit-learn is under active development.
It's a community project, so there are always new contributions to it. There are also certain algorithms which are more amenable to parallelization than others. So you'll see the speed-ups vary by what the algorithm is.
And if you're calling the same high-level algorithm, let's look at something like logistic regression or linear regression, the defaults might be different in how it's done, because it may be faster using one method than another, right? So a good example, I think, is linear regression. Linear regression has an analytical solution.
So you have your data, right? Your X can be multivariate, and you have your targets, y, a single value per sample. And by multiplying X transpose by X, inverting that, and applying it to X transpose times y, you can find your coefficients. So it involves a matrix inversion, right? And that's one way to do it.
Matrix inversions are accelerated on GPUs. Another way to do it is through gradient descent, right? So here you're just using your loss function, taking the gradient of that, and then taking steps. So those different approaches might realize different speedups going from CPUs to GPUs.
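As a worked sketch of that analytical route, the coefficients come from solving (X^T X) beta = X^T y, which on the GPU is just a couple of CuPy calls (the data here is synthetic):

```python
import cupy as cp

n_samples, n_features = 1_000_000, 50
X = cp.random.random((n_samples, n_features))
true_coef = cp.arange(1, n_features + 1, dtype=cp.float64)
y = X @ true_coef + 0.01 * cp.random.standard_normal(n_samples)

# Normal equations: solve (X^T X) beta = X^T y rather than forming an explicit inverse.
beta = cp.linalg.solve(X.T @ X, X.T @ y)
print(cp.allclose(beta, true_coef, atol=1e-2))   # recovered coefficients
```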
Taking a look at the notebooks. So when you got to the end of notebook one, if you ran this cell, you'll see kernel restarting. It shows you a message.
It'll restart automatically. Even with that, I find sometimes it's still utilizing memory. So if you right-click on the first notebook, shut down kernel, that will shut that kernel down.
You can start working through notebook two. You'll see some comparisons between pandas and cuDF. Again, notice what you're going to see or take note of what you observe when you go from smaller to larger datasets, and also when you go from simpler to more complex operations.
So definitely when we get down into some of the merging, and I think there's an example in here. Let's see. Aggregation, right? Where you're aggregating and summing.
See how fast this runs on pandas data frames versus cuDF. So NVIDIA, as an organization, is obsessed with performance. And when I say performance, we're usually talking about how well are we utilizing the compute resources, and also what's the time to results.
So one of the aspects or functionalities of cuML is this benchmark runner. So it has this built-in speedup comparison where you're able to run these algorithms both in scikit-learn and in cuML and look at the speedup. So it's also a pattern that you may want to apply to some of your own code.
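A minimal sketch of that runner idea for your own code; the estimator pairing in the comments is just one example you might plug in:

```python
import time

def run_and_time(name, fit_fn, *args):
    """Tiny benchmark runner: time one block of work and report it."""
    start = time.perf_counter()
    result = fit_fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.3f} s")
    return result, elapsed

# Example usage, assuming X, y already exist and both libraries are installed:
#   from sklearn.linear_model import LinearRegression as SkLR
#   from cuml.linear_model import LinearRegression as CuLR
#   _, t_cpu = run_and_time("scikit-learn", SkLR().fit, X, y)
#   _, t_gpu = run_and_time("cuML",         CuLR().fit, X, y)
#   print("speedup:", t_cpu / t_gpu)
```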
If you're doing a comparison, you have a runner that captures certain aspects of your computing workflow, whether you're trying to look at the time a single block takes or end-to-end application performance. And it's just one more way to make your code more modular and give yourself a way to assess how well it performs. So we'll take the next 10 minutes or so to work through this notebook.
I'll check in then. Any questions? All right, show of hands of people using scikit-learn. A lot of pandas.
All right, great. I hope this takes some of your time to results down. Is anybody, once you're done with this code, opening up one of your own projects and dropping something in, just to see? Have you been doing that? I think a couple of people have been. If you do have a project like that, and you run through the notebook quickly, you can open up one of your own things and give it a shot.
I guess you have to have the environment set up. You do. So you have the compute time on here.
Also, Google Colab. So Google Colab, you have T4 access. It's a great interactive environment.
So you can access a GPU by changing the runtime. If you load this extension, cudf.pandas, you won't have to make any changes to your code. It will just automatically run all of your pandas code using cuDF.
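In Colab, that sketch looks roughly like this; the extension has to be loaded before pandas is imported, and the file name here is a placeholder:

```python
# First cell: enable the accelerator before importing pandas.
%load_ext cudf.pandas

# Later cells: unmodified pandas code now runs on the GPU where supported,
# falling back to CPU pandas for anything not yet covered.
import pandas as pd
df = pd.read_csv("my_data.csv")          # hypothetical file
print(df.groupby("category")["value"].mean())
```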
So it's a good hack to experiment with a notebook, upload one of your own notebooks, and run it in that Colab environment. We've got a question next to you. Oh, yes.
Sorry. I think my case is quite different. If your data doesn't fit in GPU memory, is it still interesting to try this kind of mix of CPU and GPU memory, or does it complicate things too much? I mean, you have to keep track of which data lives where.
Do you have any good ways to handle this problem? Yeah, I would say, overall, time it and see what's best, right? So we've seen that once you scale your dataset to a certain size, the GPU is gonna run faster. If you start to exceed that memory and it starts to run slower than it does on the CPU, then you know you need to optimize even further, or just run it on a CPU, right? In general, I know that the most recent version of cuDF supports this unified memory by default. So if you have cuDF data frames that are larger than GPU memory, it will automatically split them up.
And then the RAPIDS memory manager, or the CuPy memory manager, where you create a mempool, is a good way to do that. It's not gonna be optimized for performance out of the box. And that's because there are some default ways that the data is gonna be split up and then accessed.
So there are some further fine-tunings that you can perform in order to get the best performance out of it. But I would say the golden rule is: try it, see if it's fast enough for you, or if it's faster than CPU only. And if it's not what you expect in terms of speed, look for ways to optimize it.
When we set up multiple GPUs, do we have to do anything in particular to have them served by the CPU? I mean, would you share the information across the GPUs or something like that? That is gonna depend on how your environment is set up and how you launch the job. All right, in general, when you go to multiple GPUs, there's more that you have to do by hand. So usually it's either a wrapper around what you'd normally be calling, or you can use Dask to launch jobs that are then split across multiple GPUs.
So it took us a little bit longer to get through these first two notebooks than I had planned. So let me at least ask you to pause. I'll point you towards this final notebook, the two additional notebooks, which we didn't cover, but you can play around with.
We do have the room until 3.30. So if you're willing and would like to stay, happy to stick around for a while. If not, then at least you know what you can work on at your own time. So looking at the second notebook, I think these merge and group by commands from the data pre-processing are two of the strongest examples.
All right, so in terms of iteration time, you're going from roughly five seconds down to 20 to 30 milliseconds. All right, so these are operations that are very parallelizable. I've worked on data-driven workflows where the group-by, apply-some-operation, and combine-the-outputs steps were the slowest point in the workflow.
And then everything downstream of that, all the machine learning and deep learning was much faster. So understanding your end-to-end workflow, where the bottlenecks are and how to speed up those bottlenecks, it gives you a good indication of where you might want to accelerate code first. So you don't always have to change everything all at once.
Get a high-level understanding of your entire workflow, understand what the slowest parts are, and then work to remove those bottlenecks and iterate based on that. So think about developer time versus where you're really gonna gain the most. This'll take a little while to run down at the bottom, but hopefully you saw, or will see, that the comparison between running machine learning models with scikit-learn on the CPU and using cuML on the GPU is quite different.
And there's gonna be some variation in there. If you're looking for the latest snapshot of where RAPIDS is, go to rapids.ai; it's a community-driven open-source project. And there are some comparisons on there.
So in terms of the course, let's take a quick look at the final notebook, which combines a variety of techniques that we've covered today. All right, so this case study is meant to simulate a geospatial dataset. One of the underlying calculations you may need to make when you're working with any kind of geospatial data is to calculate distances on a sphere.
So once you've done that, or maybe as part of your machine learning approach, say a K-means or a K-nearest neighbors algorithm, calculating those distances and being able to do that efficiently will drive the overall performance of that workflow. So in this case, calculating the distance between different points on the sphere will entail knowing their latitude and longitude, and then being able to calculate the distance along the surface of the sphere. So this gives you an overview of the problem, right? And it takes you through a comparison of a conventional for loop, which I think, especially in Python, is not maybe the best comparison.
NumPy broadcasting, I think, is a fair comparison. So brute force for loop, NumPy broadcasting, and then scikit-learn K-nearest neighbors clustering, and compare that to broadcasting in CuPy and CuML. All right, and you may or may not see similar times as are quoted here, all right? But interesting always to see what factor speedup you're getting versus what is quoted or advertised.
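As a sketch of the core calculation, here's an all-pairs haversine (great-circle) distance with CuPy broadcasting; swapping cp for np gives the NumPy version to compare against (sizes are arbitrary):

```python
import cupy as cp

R_EARTH_KM = 6371.0

def pairwise_haversine(lat, lon):
    """All-pairs great-circle distances (km) via broadcasting."""
    lat = cp.radians(lat)[:, None]   # column vectors
    lon = cp.radians(lon)[:, None]
    dlat = lat - lat.T               # (n, n) matrices of differences
    dlon = lon - lon.T
    a = cp.sin(dlat / 2) ** 2 + cp.cos(lat) * cp.cos(lat.T) * cp.sin(dlon / 2) ** 2
    return 2 * R_EARTH_KM * cp.arcsin(cp.sqrt(a))

n = 5000
lats = cp.random.uniform(-90, 90, n)
lons = cp.random.uniform(-180, 180, n)
dists = pairwise_haversine(lats, lons)
cp.cuda.Device().synchronize()
print(dists.shape)   # (5000, 5000)
```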
All right, here's one additional place where you'll see this mempool, right? So RAPIDS has its own memory manager. So here, you use the RAPIDS memory manager to initialize this pool, and then use the CUDA device, right? And you'll go through a series of different comparisons: first looping through every pair of elements in that set, then broadcasting, and then finally comparing that with broadcasting in CuPy. So understanding that some people have to take off, I do wanna share a couple of places you can go after this course.
All right, so first, as Jess mentioned, provide your feedback. So filling out the form helps provide feedback to CUIT, and also to us in terms of how we can improve this course in the future. Timing and budgeting more time is one that I have at the top of my list.
Second one, think about where you might wanna continue learning, right? So if there are courses within the NVIDIA Deep Learning Institute portfolio that you think would be interesting, they range from an hour or two to full day courses. Fortunately, we're able to offer these for free to academic institutions. So a great opportunity to learn more.
As Jess mentioned, this is our second event this semester. First one was focused on different ways to accelerate code on GPUs. And it was a presentation, but not hands-on, more focused on the HPC side of the stack.
This one is hands-on numerical computing and data science with Python. If you're interested in some of the deep learning oriented courses, whether that's at a fundamental level or you're starting to scale training jobs up to multiple GPUs, we have offerings in those spaces, and we also have an extended version of this course, GPU Accelerated Data Science, which is eight hours long. If you're looking for some interesting places to learn more about deep learning, the PyTorch tutorials are excellent. Oftentimes they have Jupyter notebooks, code repos, and YouTube videos that all focus on a single topic.
They're grouped by beginner, intermediate, and advanced. So it's one of the first places I go when I'm trying to learn new topics. And Andrej Karpathy, one of the pioneers of deep learning, has spent time at Tesla and OpenAI and is now creating an educational platform for deep learning.
I think it's one of the best places to learn if you really wanna get into the details of how neural networks train. In terms of checking out some of the latest GenAI resources, build.nvidia.com is a web-based interface where you can go to interact with GenAI models. So whether those are large language models, text-to-image generation models, even molecular modeling or RAG workflows, it's a cool place to check out what's available and mess around there.
And like I mentioned, developer.nvidia.com, checking out the blogs to see what people are doing with NVIDIA technology. We plan to be back on campus on December 3rd. So we're gonna have an NVIDIA day with Cambridge Computer.
One part will be focused towards research computing on HPC folks, one towards researchers. So stay tuned to multiple channels to hear more about that.
This is our second session with NVIDIA this semester, and today's session is a direct result of the feedback that we received from folks in the first session on what they wanted to see. So to that end, please fill out the super short survey I send after this with what you guys would want to hear either from NVIDIA or other folks, other research computing topics. We listen to you.
We are a small office up on 132nd Street, so that's how you can get your voices to us. And yeah, so the other upside of filling out that survey is you will get the slides from today's presentations after you fill it out. So I think, oh, the other thing is, while this is an in-person workshop, we have a Zoom going so we can record it in case anyone wants to go back and reference it.
I'll get that to you guys. But just so you know, if you speak up, your voice may be recorded in a recording. So you have been informed.
And without further ado, we'll pass this over to Michael Keith, our NVIDIA expert. Thank you, Jess. Thank you, Liz.
Very happy to be here, so appreciate all the coordination and planning that went into this. I'm a solutions architect on the higher ed and research team at NVIDIA. We worked with students, researchers, professors, research computing and HPC teams, university leadership, all with the goal of accelerating your work, whether that's in the classroom, especially on the research side, and on some university needs.
So by a show of hands, who's gotten through at least everything to set up their account and enter the event code? All right, anyone still working through that? All right, we got a couple. All right, so who else is working through that? I'd love to do a check on you guys. Yeah, if you're still working through it and having any issues, especially with the authentication, recommended browser is Chrome.
Chrome incognito mode tends to sort out a lot of the login issues. Once you get to the course home page, you'll see a start arrow on the lower right-hand side. Click Start.
That'll start to spin up the Cloud GPU instance. OK, anybody else still getting the course set up? One more, OK. Anybody else? All right.
So this is the course page. Looks like I'm already launched. So if your other tab happens to close, once this starts up, you should see the launch arrow.
So click Launch. That should launch a Jupyter environment in another tab. You can see I was already working in here today.
And if you can start by at least executing that first cell, that will make sure that you're able to send and receive from this AWS-based Cloud instance. While you guys are trying that, put your hands. How many people would consider yourself a Python expert in the room? Wow, some people actually put their hands up.
Very good. How many people are like pretty OK with Python? You know, you're way around a notebook, and you're not like super expert. OK, perfect.
And save space. Anybody either new to Python? All right, fantastic. How many people, same question, but expert at using GPUs? How many people are totally new to using GPUs? How many people tried it one time, and it kind of worked OK, but you want to learn how to do it better? OK, fantastic.
So please ask questions today. It's much more engaging if you do. So is everybody comfortable with executing a command in a computer notebook? OK, perfect.
So continuing with learning a bit more about you, show of hands, who do we have, undergraduate students here? All right, graduate students, postdocs and professors, researchers, other staff. Great. So we have a mix of people.
What about your fields? Who comes from a science background, say chemistry, physics, biology? Or actually, I'll separate out computer science. Chemistry, biophysics, OK. Computer science?
All right, a lot of CS folks in here, so definitely more knowledgeable and skilled at CS than I am. What about engineering disciplines? EE, mechanical engineering. All right, great.
A few. Humanities? I guess. OK.
Humanities-ish. I don't know what that is anymore. Business.
Any other fields that I haven't mentioned? All right, great. So now I know a little bit about your backgrounds. I'm a physicist by training, so studied condensed matter theory and then worked in applied quantum sensing and computing for a while.
Picked up some AI and data science along the way. And here we are to learn how to accelerate scientific workloads and data science-based workloads with GPUs and Python. So in terms of accessing the workshop materials, I think we've gotten everyone on there.
Just know that if your computer goes to sleep, you'll lose the connection. You may have to close the Jupyter Environment tab and then just click Launch again. It should keep your system active.
In terms of the Jupyter Environment, these materials will be available to you for six months. So feel free to log back in, execute some of the notebooks. Again, feel free to download the notebooks, either individual files or as a tarball.
There will be a limit on your total GPU time. All right, I think it's about eight hours. And it's a two-hour course.
So you'll have some time to get back in there over the next six months. But at a certain point, you will run out of GPU hours. It is not meant as a general GPU computing platform.
So if you try to run your own workloads, you may get flagged and kicked off. I've gotten close. I haven't fully gotten kicked off yet.
There's some people smiling at the camera. Yes. So in terms of what we're gonna cover today, so we've covered how to access the workshop materials.
I'll go through a very brief and high-level overview of GPU computing and compare it with CPU computing, how they can work together. And then the bulk of our technical focus today will be on using Python to GPU accelerate workloads. And that's with numerical computing and data science.
Finally, we'll apply these tools that we've learned or techniques that we've learned to a case study that's representative of a geospatial dataset. Today, we'll be working on NVIDIA A10 GPUs. So you can see the specs here.
We'll show you how to access the specs on your machine and get a snapshot of the GPU memory, current utilization, power, temperature. One of the things you may encounter is having the right GPU for the right job. So when you're running workloads, you may want newer GPUs.
You might want many GPUs, or even many nodes of many GPUs. For exploratory purposes like what we're gonna be doing today, the A10 is an excellent GPU.
All right, so everyone's gone through setting up their developer account, getting to learn.nvidia.com and entering the code. All right, and then once you've entered that code, you should only have to do it once. After that, if you go to the dropdown menu next to your name, the course will show up in progress under My Learning.
All right, and for this first part, we're gonna be working in the first section of that first notebook. So overall, how today's gonna go, I'll present a few slides, talk about what we'll be looking at in the notebooks, give you some time to interact with and read the notebooks on your own, work through them, execute the cells, change things around, and then we'll come back together, share any observations or questions that you've had. So for the first part, you'll notice that this environment has everything pre-configured.
So it has the GPU, CPU, all the software that's needed. This is done by using containers. Containers have all of the runtime components needed for an application.
So you can think of them as lighter weight than a VM, which brings its own operating system. A container uses the host operating system, low-level drivers, and hardware, but packages all of the runtime components on top of that, so possibly a different Linux userland and everything that's built on top of it.
The Rapids container from NVIDIA's GPU catalog has this same environment. So if you're interested in utilizing some of these tools and making sure that they all work with each other, you can utilize the Rapids container. And I would say, reach out to your CUIT points of contact to understand how you can run containerized workloads on HPC systems.
All right, I know there's excellent documentation on the website already. So where does GPU computing come into play? The answer is really across the spectrum of computing jobs. And we can break this down into three different categories.
So the first, think traditional HPC workloads, all right, where you're using GPUs to accelerate tasks that you can parallelize. The second, and really the earliest use of GPUs, is computer graphics; they're graphics processing units. So they were used for generating visualizations, especially for video games, and this field of computer graphics has now expanded well past video games.
It now includes simulating virtual worlds, right? For training autonomous agents, for example. And finally, artificial intelligence. So around 2012, researchers realized that GPUs, because they're matrix-multiplying machines, are amenable to training deep neural networks. And the combination of large curated datasets, advances in models, and computing power drove this revolution of AI forward.
Since then, GPUs have been increasingly tailored towards AI workloads. So some GPUs still focus more on the computer vision side, some focus totally on the AI side and some have capabilities that do both. So you might wanna run an AI model that's generating graphics.
So you need both the AI backend and also the ray tracing capability. The majority of our workloads, when we interact with our desktop computers and laptops, run on CPUs, and CPUs are optimized for serial tasks. They're very low latency.
So you do something, you get a response, and your computer is gonna carry out a lot of different tasks sequentially. GPUs, on the other hand, are optimized for parallel computing. So if you have a large number of tasks to carry out, they can be parallelized.
So think of parallelizing a for loop where there's no dependence between the steps of the loop. Or think about training any kind of machine learning or AI model where you're passing different batches of data through at the same time; you can parallelize all of that, both the forward and backward pass. So the challenge here is identifying the portions of your code that can be parallelized and then using the appropriate level of API to parallelize that part of your code.
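To make that concrete, here's a tiny sketch of my own (not from the notebooks) of a loop whose iterations are independent, and the same math expressed as a single array operation:

```python
import numpy as np

x = np.random.rand(1_000_000)

# Element-at-a-time loop: every iteration is independent of the others.
out = np.empty_like(x)
for i in range(x.shape[0]):
    out[i] = 3.0 * x[i] ** 2 + 1.0

# The same computation as one array expression, which NumPy runs in optimized C
# on the CPU, and which CuPy would run as parallel kernels on the GPU.
out_vec = 3.0 * x ** 2 + 1.0
assert np.allclose(out, out_vec)
```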
All right, we'll show you a couple of ways to do that today. These speedups come at a time when Moore's law is ending. All right, so Moore's law: the density of transistors in a given area on a chip was doubling at a steady rate, roughly every 18 months to two years, for a long time.
And as that became harder and harder, because you're going down to smaller and smaller nanometer-scale features, it's harder and harder to realize those density gains. We measure computing power in terms of floating point operations per second, and the number of FLOPS, which rose with that density of transistors on a chip, also started to tail off.
But through GPU acceleration, whether that's HPC workloads, scaling machine learning workloads up to multiple GPUs and multiple nodes, or even using lower precision to carry out some of these computations, the trend was able to continue. So the main takeaway here is that this has powered a lot of advances within the fields of science, whether that's using convolutional neural networks to discover gravitational waves, drug discovery, which is a hybrid workflow that requires some HPC aspects and some AI-accelerated aspects, fusion reactors, or real-time CFD, right? So computational fluid dynamics is a very mature and computationally intensive field.
One way that both the research and industrial communities are addressing that now is to train AI models as surrogate models. They may not be as good as a full simulation, but they allow you to iterate much faster, tune some parameters on the fly, get real-time visualizations, and then guide where you would wanna look for your full-scale simulation. There are a number of different ways to accelerate your code on GPUs.
So often a question we get from researchers is, how should I do this? And the answer is, there's no one way to do this. So one of the families of approaches is through accelerated standard languages. NVIDIA contributes to these language standards for C++ and Fortran.
And with every generation of standard releases, there's more and more parallel capability that's incorporated into these libraries. And then based on the compiler that's used and the hardware backends, you're able to run the same code on a CPU, multiple CPUs, a GPU, or multiple GPUs. Python doesn't have a formal standard, right? Because it's a community-based language, there are de facto standards instead, and those de facto standards also include acceleration. So one of the great features of Python is the very diverse ecosystem of developers.
So a lot of different open source tools, and we'll be looking at cuNumeric as one of the examples. If you're working on the C++ side, two other ways that you can start to accelerate your code are by using OpenMP and OpenACC. So here, you're going to annotate or decorate your code with pragmas.
And the pragmas provide hints to the compiler to indicate where the code can be parallelized. So these are examples of parallel for loops, which are then able to be accelerated on GPUs. Finally, at the lowest level, in terms of how close to the metal you're programming, there's the CUDA family of languages, right? So CUDA C, CUDA C++.
Here, you're programming down to individual blocks and threads within a GPU. So this is when you may want to eke out those last couple of percents of performance for your application. If you're already using deep learning libraries, fortunately, a lot of this is done for us, all right? So there's a lot of effort that goes in both from the PyTorch team at Meta, also from the NVIDIA team that works on PyTorch to optimize PyTorch for running on NVIDIA GPUs, all right? So a lot of this CUDA backend stuff for say PyTorch, TensorFlow, JAX is already taken care of.
But you may find that you'll need to understand how this works if it becomes a limiting factor in a new architecture you're developing. When we think about the different fields this spans: certainly numerical computing, all of your data analysis, any kind of data-driven solution, artificial intelligence, and increasingly quantum computing. So quantum computing is an active area of R&D, looked at as a possible next phase of computing.
It's one of the areas that I work on. So one of the ways that NVIDIA looks at quantum computing is that it's not its own distinct entity, but it's gonna be another capability within a high-performance computing system. So think CPU, GPU, QPU, tightly interconnected so that you can leverage the right tools and also the right combinations of tools for the right job.
So thinking about CUDA on the right-hand side, right? Or even C, Fortran, these are compiled high-performance languages. On the left-hand side, we have Python. And it is a very developer-friendly language.
The challenge for us is to bridge the gap between Python and the performance that's realized on some of these lower-level languages. Within NVIDIA, there is a lot of software development activity now in order to bridge this gap. So the goal is to make GPU acceleration available through Python APIs to give you the same, if not very close to the same performance you would get if you were programming in CUDA C, C++.
There are a few different ways this is done, right? We have a lot of Python experts in here. JIT compilation, right? So just-in-time compilation. Python by default is not a compiled language, but you can decorate certain functions or methods to be compiled at runtime.
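Here's a minimal sketch of JIT compilation in Python using Numba's @njit decorator, which is one of several entry points mentioned today (CuPy and others have their own). This is my illustration, not a cell from the course notebooks:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on the first call, then cached
def sum_of_squares(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

data = np.random.rand(1_000_000)
sum_of_squares(data)          # first call pays the compilation cost
print(sum_of_squares(data))   # later calls reuse the cached machine code
```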
Low-level language bindings. If you're familiar with using NumPy or SciPy, the back ends for all those libraries are C-based. So the numerics themselves run quickly.
And oftentimes if you profile a workload in Python, you'll see that the numerical compute runs very quickly, but when you have to go back up to the Python layer, it slows down and you go back down and it runs fast. There are a couple of ways that we can look at how to get the most out of GPU computing because the GPU doesn't act in isolation. So think about a CPU and a GPU and each have their own memory.
And there's an interconnect between the CPU and the GPU. Each of those links has its own bandwidth. So how fast can you load data from one aspect of the system to another? And you'll see that both the host memory to CPU and especially the interconnect between the CPU and the GPU are two of the slowest links within the system.
All right, so keep this link in mind, because moving data from the CPU to the GPU and vice versa tends to be a bottleneck in a lot of applications. Fortunately, it's not a showstopper. There are ways to, as we call it, hide this latency.
All right, so you wanna be able to overlap your communication and computation so that you're not bottlenecked by the lower bandwidth here. Amdahl's law describes how much speedup you can get from parallelizing a program, given the fraction of it that has to stay serial. So there are certain algorithms and applications which are more serial in nature.
So think about something like a Markov chain Monte Carlo simulation, where one step depends on the previous step. Those are harder to parallelize. So it may make sense to keep the sequential components of your program on the CPU, because they're gonna run with lower latency, right? They're better suited towards CPU-based operations.
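To put a number on that, here's a quick sketch of Amdahl's law (my own illustration): the overall speedup is 1 / ((1 - p) + p / s), where p is the parallelizable fraction of the program and s is the speedup of that parallel portion.

```python
# Amdahl's law: overall speedup = 1 / ((1 - p) + p / s).
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Even with a 100x faster parallel portion, the serial fraction caps the total gain.
for p in (0.50, 0.90, 0.99):
    print(f"parallel fraction {p:.2f}: overall speedup {amdahl_speedup(p, 100):.1f}x")
```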
But if you have portions of your application that are parallelizable, so think about any kind of large matrix vector operations, offload those to the GPU, run them quickly, right? And then feed them back into the rest of the program. In terms of where to look, so developer.nvidia.com is our developer website. The best way I stay on top of what's going on within NVIDIA is through NVIDIA's blogs.
So we have blogs like general community blogs. We also have technical blogs. So a lot of times they're associated with each other.
So if you wanna learn at a high level what's going on within the entire NVIDIA ecosystem (NVIDIA itself, researchers who are using NVIDIA platforms, partner companies), I highly recommend taking a look at some of the blogs. There are often links to NVIDIA's GitHub repos.
So vast majority of our code is open source. So great places to look for some pointers. Profiling tools, right? So when you're running an application and you want it to run faster, one of the best ways to first understand how your program is running and then optimize your program is through profiling.
NVIDIA releases a family of profiling tools called Nsight, all right? So Nsight Systems is your system-level profiling tool. It allows you to run an application and look at how each part of your program is utilizing memory and compute, and how they're communicating back and forth with each other. I mentioned the containers.
So if you wanna check out what containers are available, ngc.nvidia.com is a good place to look. All right, so on the lower right, it shows a schematic of a four-way A100 system. All right, so this is a single node.
So it would be a single CPU associated with the system. The interconnect between the CPU and the GPU is by PCIe. All right, so that's your lower bandwidth interconnect.
Within that node, there are high bandwidth interconnects between the four GPUs. So when you start to scale out to multi-GPU jobs, you wanna leverage these interconnects in ways that maximize the utility of the hardware. So any questions on the overview, CPU and GPU computing in general, and any experiences people wanna share with getting up and running on GPUs, maybe some bottlenecks you observed? All right, did you have something? Loading data, you know, when you're in front of a lot of GPUs, but you get that pause and then.
Yes, what kind of applications were you working with? I just. Yeah. Like, to be honest, the library, I guess.
Okay, yeah, which is awesome. But the best way to learn is by playing. Yeah, so there are ways of kind of hiding that latency.
Right, so if you can have it prefetch some of that data before the GPU needs it, the latency is still there, right? But you can hide it. Yes. We've seen a lot of the images out there of combining multiple GPUs, where you get stacks of them in data centers.
It shows that all the individual GPUs are interconnected, as you've just shown in the previous slide, and there is one CPU outside of that stack that is connected through PCIe. Yeah, so that's a good point.
So that is one way to do it, right? In that case, all of your traffic in between nodes would have to traverse the PCIe bridges between CPUs. There are other networking architectures where there's a fabric that connects GPUs through switches. And in that way, it won't have to go through those PCIe interconnects.
So it all depends on how the system was designed. But in general, what you said is, it's a common architecture, especially for HPC systems. Systems that are more optimized for AI tend to have a different fabric architecture or network architecture.
So there's theoretically no limit on how many GPUs, but the bottleneck would be the PCIe capability to enable communication between the CPU and the GPU stack. Yes. Yep, PCIe and overall networking capability.
Yep. Yeah, so we'll take a look at some examples of scaling later on in this workshop. And you'll see that ideally, so if you were say running a fixed job size and doubling your compute power, that your runtime would go down by a factor of two every time, right? In practice, it doesn't follow that linear scaling because there is some overhead with that communication.
So the first set of libraries we'll look at are NumPy and CuPy. So NumPy may be familiar to you by show of hands, who's used NumPy? All right, very large majority, if not everyone. It's the backbone of numerical computing in Python.
All right, so I've spent quite a bit of time working with NumPy, especially before I knew that GPU accelerated alternatives exist. CuPy is an open source project. NVIDIA contributes to and supports it, but does not own this project.
And it's designed to be the same API as NumPy, but it's amenable to GPU acceleration. All right, so the same underlying structures can be transferred from NumPy to CuPy and vice versa. NumPy is tuned for CPUs.
All right, so all of the typical parallelism that you get by NumPy indexing and by doing array-like operations instead of, you know, brute force for loops, they're very sophisticated and they run very well on CPUs. CuPy is designed to do the same thing on GPUs. As NumPy is the backend for many other data science libraries, so SciPy and Pandas, CuPy acts as a backend for GPU accelerated data science libraries.
So here's a code snippet that shows ideally what you'd be able to accomplish. With only the change of an import statement, you should be able to run the same code on CPUs as you would on GPUs. You may see some examples out here in the literature, or not even literature, like in documentation or even on NVIDIA's blogs, that you can just import CuPy as NP and make that single one-line change to your code.
There are pros and cons of doing that, right? So yes, as a developer, it would be nice to just change a single line of code and realize GPU acceleration, but you may want to be more thorough just to make sure that you have one-to-one support for both languages. The main takeaway is that by using the same types of APIs, you can realize a 10x, 20x, 50x speedup by running these applications on GPUs. So at this point, we'll take a look at the notebooks and we'll work through an introduction to CuPy.
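For reference, here's roughly the shape of that side-by-side comparison (my own sketch, not the notebook's exact cells):

```python
# The same array math on the CPU with NumPy and on the GPU with CuPy.
import numpy as np
import cupy as cp

x_cpu = np.random.rand(5_000, 5_000)
x_gpu = cp.asarray(x_cpu)            # host-to-device copy

y_cpu = np.sqrt(x_cpu).sum()         # runs on the CPU
y_gpu = cp.sqrt(x_gpu).sum()         # identical API, runs on the GPU

print(y_cpu, float(y_gpu))           # bring the GPU scalar back to the host to print
```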
So if we shift over here, I'll take a minute to work through the introduction. So, hello world, make sure that your Jupyter notebook is live. Do you have a question or comment? I have a question.
So here, the example you showed me made me think about JAX. Is it similar or the goal of the library? I think so, yeah. So JAX is a numerical computing library that's GPU-accelerated.
I would say JAX offers additional features that CuPy does not. So JAX's native auto-diff capability, I think, is one of the unique aspects of it. But that's a great point.
So a lot of your general purpose computing, you could also do in JAX. Let's evaluate the second cell. So anytime you prepend a command with an exclamation point in a Jupyter notebook, it will run as if you're running it in the terminal.
So exclamation point, nvidia-smi. This is the NVIDIA system management interface. It gives you a snapshot of the GPU utilization at the instant that it's run.
So you can see here, it shows you the version of the drivers and CUDA that is installed, which GPU is being used, whether your fan is running or not, and the current temperature. If you want, you can launch this in a separate tab and run either nvidia-smi or watch nvidia-smi, which will refresh on a two-second interval.
As you start to run jobs in the Jupyter notebook, you can see how the fan starts to kick on, the temperature starts to go up, memory starts to be utilized. The A10 has 24 gigs of GPU memory. You'll see 23 here, right? So the system and drivers take up a little bit of memory on there.
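As an aside, you can also query the device from Python itself. Here's a small sketch using CuPy (the notebook uses !nvidia-smi; this is just another view of the same information):

```python
import cupy as cp

dev = cp.cuda.Device(0)
free_b, total_b = dev.mem_info       # free and total GPU memory, in bytes
print("Compute capability:", dev.compute_capability)
print(f"GPU memory: {free_b / 1e9:.1f} GB free of {total_b / 1e9:.1f} GB")
```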
It's also good to know what CPU you're using. So lscpu will list the attributes of the CPU. So with that, I'll give you some time to work through this first section.
The goal is to do a side-by-side comparison of the NumPy and CuPy APIs and to observe any differences in how well they perform. Sorry, I'm going to be a bit late. Do you know if you can show a slide where you have the links? Yes.
By a show of hands, who's made their way through that first section? All right. Who could use another five minutes or so? All right. Let's take a look.
So what were some of your observations? Just this introduction to CuPy, stopping at the kernel overhead. All right. So if you've gotten through here, any timing differences you noticed between CPU and GPU? The same.
The same? Okay. Maybe we need new GPUs. Yeah.
So hopefully you saw that the GPU ran faster, especially on larger data. All right. So there's always overhead with compute.
There's overhead with parallelization. And on smaller data, you might find that the CPU runs comparably or even faster than the GPU. But as soon as you get to a certain critical data size, you'll find the GPU is much faster.
So one of the observations was that when you were performing the matrix multiply, it was performed once before the timing started. And the question was, why is that done? The answer is there's some overhead that is entailed with just-in-time compilation. So the first time that operation is created and run, it's going to be compiled at runtime and then executed.
So the first time, it has to compile. Once the operation is compiled, it's cached, so if it's reused again, it doesn't have to be recompiled.
That's why typically, whether you're multiplying matrices using CuPy or training deep learning models, you'll find that the first time through takes longer. After that, those initial runtime overheads have already been paid, the results are cached, and everything else runs faster.
So that overhead is amortized over the more you run these and the larger your data sets are. This is a schematic of what happens.
So you call it the first time. There's a decision tree: if you haven't run it already, it's the first time calling it, so it's going to compile and cache it.
If it's not your first time, then it'll use that cached version and run the compiled machine code. If you change anything in there, then it'll have to recompile it. So it's checking there.
So the compilation is one type of overhead. There's also a kernel launch overhead. A kernel is a set of instructions that runs on a GPU.
You can think of a kernel on a GPU analogous to a low-level function on a CPU. So there's some overhead in launching that kernel and getting it running on the GPU. And then once that's done, you're going to incur that overhead even once it's compiled every time a kernel is launched.
So, you know, if you have multiple small functions and each of those is compiled just in time and has its own kernel associated with it, there's kernel overhead for each of those function launches. In order to have less kernel overhead, you can fuse different functions into a single kernel. So this cp.fuse decorator allows programmers to fuse multiple operations, additions, subtractions, multiplications, into a single kernel.
And now you're only paying the overhead of a single kernel launch and then able to execute that instruction set in parallel. You'll be able to quantify this in the notebook. So if you work through the next section on kernel overhead, you'll see some of the time differences.
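Here's a minimal sketch of that fusion pattern (my example, not the notebook's):

```python
# Three elementwise operations collapsed into a single kernel launch with cupy.fuse.
import cupy as cp

@cp.fuse()
def scale_shift_relu(a, x, y):
    return cp.maximum(a * x + y, 0.0)   # multiply, add, and max fused into one kernel

x = cp.random.rand(10_000_000)
y = cp.random.rand(10_000_000)
out = scale_shift_relu(2.0, x, y)       # one kernel launch instead of several
```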
You'll also notice this CUDA device synchronize. So the print function or the timing is going to be running on the CPU. So this just says that the CPU has to wait for the GPU to finish that job before it runs the final step of timing that entire operation.
Because they're asynchronous. So the CPU will launch the job on the GPU. But if you don't have this CUDA device synchronize step, then it won't wait for the GPU to finish that before it executes the next step.
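Putting the last two points together, the timing pattern looks roughly like this (my sketch, not the notebook's exact cell):

```python
import time
import cupy as cp

a = cp.random.rand(4_000, 4_000)
b = cp.random.rand(4_000, 4_000)

cp.matmul(a, b)                   # warm-up: pays the one-time compilation cost
cp.cuda.Device().synchronize()    # make sure the warm-up has actually finished

t0 = time.time()
c = cp.matmul(a, b)               # the launch is asynchronous...
cp.cuda.Device().synchronize()    # ...so wait for the GPU before stopping the clock
print(f"matmul took {time.time() - t0:.4f} s")
```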
So work through this next section, check in in about five minutes, and then we'll look at data movement. How's everybody doing with that section? Show of hands, we could use a few more minutes. Ready to move on.
I'm not sure off the top of my head. But one way you'd be able to look at this. So there are a couple of notebooks we're not going to cover today, which look at Numba, which is a compiler for Python.
And Numba uses both CPU and GPU backends, depending upon which option you choose. So you'd be able to run a comparison and take a look at how you're using the same library if it takes longer to compile for CPU versus GPU. In terms of compiling, say, C or C++ code, I haven't seen a major difference using the same compiler, whether you're compiling for one or the other.
JIT compilation, in my understanding, it's a technique that is introduced by using CuPy that is not actually in place in NumPy. Yes. Yeah, and JIT compilation in general is available through a number of different Python libraries.
Right, so I don't think. Let's see. Yeah, for NumPy, did you notice any difference? I don't think NumPy is compiled by default.
Right, but this JIT compilation pretty much allows you to compile portions of code at runtime. And there are different entry points to it. Right, so one is through CuPy, Numba is another one.
I think other libraries have other compilers. We're going to cover the next two sections. So, next two sections are data movement and memory management.
So, going back to this picture of, you know, schematic of CPU and the GPU with each with their own memory, right, you have a memory bandwidth associated with each device and then interconnect between the two. We can see that this, you know, if you're doing primarily a GPU accelerated workload, once it's in GPU memory, both the compute and the memory bandwidth are very fast. But the bandwidth between the CPU and the GPU is fairly slow.
Right, so think, you know, it's about an order of magnitude slower than the GPU memory bandwidth. And this compares what would have to happen in terms of time if the data is created on the GPU and processed on the GPU versus created on the CPU, moved over to the GPU, processed on the GPU. Right, so you can see that both the time it would take, especially if the process can be parallelized.
So, this block takes more time than this block, and then there's a massive overhead in the middle. Right, so the biggest thing is to keep in mind that this bottleneck exists. There are certain cases where you'll be able to generate or load your data directly to the GPU.
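As a sketch of the two paths being compared (mine, not the notebook's code):

```python
import numpy as np
import cupy as cp

# Path 1: generate the data directly on the GPU, so no host-to-device copy is needed.
x_gpu = cp.random.rand(20_000_000)
y1 = cp.sqrt(x_gpu).sum()

# Path 2: generate on the CPU, then pay the PCIe transfer before computing.
x_cpu = np.random.rand(20_000_000)
y2 = cp.sqrt(cp.asarray(x_cpu)).sum()   # cp.asarray() is the host-to-device copy
```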
There are other times when you won't be able to generate or load the data on the GPU directly. So keep it in mind: if you're seeing slowdowns at a certain point, it could be because you're bottlenecked by this data copy. Right, finally, managing GPU memory.
So, similar to CPUs, GPU memories are finite, and especially in high-level languages like Python, there can be a lot of memory that's used and then not released once the job is done. So, you may want to manage the memory on your own. Right, and the goal here, you know, from the Python side, as I mentioned earlier, is to balance this productivity, so a very developer-friendly language, with performance.
And in order to, you know, realize that performance, one of the ways you can look at this is through a scaling study. There are two types of scaling: strong scaling, where you keep the problem size fixed as you add compute, and weak scaling, where as you increase the compute you have access to, you also increase the size of the problem.
So in this case, you're increasing the size of the dataset that you're processing. Ideal weak scaling would be a horizontal line.
So the same amount of time that it takes you to perform a certain workload on a single GPU, as you double that workload a number of times and double the number of GPUs, you would like to see this go straight across. Legate is a family of libraries that enables scaling these approaches to multiple GPUs and multiple nodes. You can see here that CuPy running on a single GPU is slightly faster than using Legate with cuNumeric.
So think of cuNumeric as a multi-GPU, multi-node generalization of CuPy, run here on a single GPU. So again, parallelism incurs overhead. So you want to incur that overhead when it makes sense.
So if you know that your job is going to run on a single GPU, there's a marginal difference here, maybe 10%, but it could make a difference to you. But if you know that you're writing code that will run on a variety of system sizes, then it may make sense to pay this overhead. Because as you go to two and four and eight GPUs, you're able to realize the same amount of time to solution for larger problem sizes.
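As a sketch of that drop-in idea (assuming the cunumeric package and the Legate runtime are available in your environment; this is not one of today's notebook cells):

```python
import cunumeric as np   # instead of: import numpy as np

x = np.ones((10_000, 10_000))
y = (x @ x.T).sum()      # the same NumPy-style code, now distributable by Legate
print(y)
```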
There's one thing I'm going to point out inside this notebook. So does anyone have experience programming GPUs and using memory management using malloc or cuda malloc? Ever hear these terms before? All right. So some familiarity.
One of the things to look at here is how memory is allocated. And I think this is one of the more interesting aspects of especially current computing systems. So where the memory is allocated matters.
In this case, the example just shows that if you're allocating more memory than you have available on your GPU, and you're only using cudaMalloc, then you'll get an out-of-memory error, right? You can also create this mempool. And if you use cudaMallocManaged, you're able to utilize both the CPU and the GPU memory.
Managed memory allows the GPU to access CPU memory and vice versa. There are some details that go on under the hood in terms of where certain data is physically placed in memory. And there are some more advanced techniques for working with heterogeneous systems where you're really trying to leverage both CPU and GPU memory and compute.
But this shows that you can allocate a memory pool that expands past the size of your GPU memory, right? So useful when you're working on bigger and bigger jobs. The CPU is still a valuable aspect of the computing infrastructure, right? And especially the memory component of it, because it tends to be larger than the GPU. So as you scale problem sizes up, you see that by using this malloc managed, you can extend your memory pool.
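A minimal sketch of that managed-memory pool pattern (mine, not the notebook's exact cell):

```python
import cupy as cp

# Back CuPy allocations with cudaMallocManaged so the pool can spill into host memory.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# Roughly 28.8 GB of float64, more than the A10's 24 GB of GPU memory; pages
# migrate between host and device as they are touched.
big = cp.zeros((60_000, 60_000), dtype=cp.float64)
big += 1.0
print(big.sum())
```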
In terms of the next 15 minutes or so, I'll check in in 15 minutes. All right? That should give you enough time, hopefully, to work through some of this. Also take a five-minute break or so.
All right? So if you need to get up, walk around, grab a drink of water. We'll pick back up at 2:25.
Welcome back, everyone. So we have a little bit more than a half hour left. Any observations from working through the remainder of that notebook? Questions? All right, so Jeffrey brought up a good point about taking this to your own work, right? So one of the questions and maybe things I want to learn from you on is, where do you see some of these techniques being helpful in accelerating your work? Are there any that stand out? What do you got? Okay, great.
And do you code a lot of that stuff by hand? Yeah. Okay, awesome. So this is an opportunity to leverage that.
Anyone else? If nothing else, how about this time.time T1, T2 and print your computation time, right? This is before you get into profiling tools and looking at traces and memory management and things like that. This is a great way to look at elements of your code, how fast they run and how much faster they could run. So don't take for granted how long it takes to run a certain thing now.
One of the paradigm shifts is how fast can I run this? And it's not just about running it faster, but think about your iteration time through a research or learning cycle. The faster you can implement, run, observe and then optimize, right? That is going to drive your, whether it's learning path or research path faster. And I come from a research background.
So running simulations of quantum systems is very compute and time intensive. And for a long time, they just took however long they took. I worked that into my workflow.
So I knew simulations were going to run overnight. So I'd work on them during the day, launch before I went home, come in, read the results. If that could be turned down to an hour or to 10 minutes, that massively accelerates your iteration cycle.
The next notebook, we're going to take a look at Rapids. So Rapids contains GPU accelerated versions of many of the libraries in the traditional Python data science stack. It covers tasks going from your data loading and analytics through model training, and things like data visualization and results visualization.
These different GPU accelerated versions can work together with Dask scheduling. So if you're already using Dask for larger scale CPU workloads, it can be combined with these GPU accelerated libraries. And they're all built on the Apache Arrow memory format, which optimizes the way your data is stored and then loaded into compute.
So similar to what we saw in the early examples of where NVIDIA is accelerating these different domains, all these different domains can be sped up through Rapids. So this is an example of a possible workflow where you're doing some molecular dynamic simulations. You can run them on a CPU or there are GPU accelerated versions, which are going to run faster.
And then you're going to do some midstream analysis. There's a lot of either feature identification or kind of like classical ML outside of deep learning that go into some of the intermediate steps. You can accelerate that with Rapids and then you're already on the GPU.
So you can run this entire pipeline, including training a neural network on a GPU. When we think about the traditional tools, Pandas is one of the most widely used libraries for data preprocessing, especially for tabular data. So think multivariate time series data where you have lots of different elements or devices that you're analyzing, number of different features or measurements that are coming in over long periods of time.
All the preprocessing that's done in Pandas and then downstream modeling tasks and scikit-learn. Anybody work on graph analytics problems? All right, we've got a couple. So graphs are really interesting.
They show representations between a lot of real world systems. Fundamentally, they're nodes and edges. So you can think of a road network as being well represented by a graph where the nodes are cities and then the edges are the roads between those cities.
If you're looking at distances, the weights of those edges could be the distance between the cities, or you could incorporate the speed on those roads if you're looking at throughput through a logistics chain. NetworkX is a common Python-based graph library, with things like finding the shortest path between two points.
And then down to the deep learning side of the stack. Deep learning has the advantage that it really came of age during the time when GPUs were being used for machine learning and data-driven tasks in general. So I would say, unlike some of these other libraries, deep learning libraries are natively GPU-accelerated.
And then finally, visualization. So matplotlib or other libraries that are built on top of that for visualization, these all have counterparts in the GPU-accelerated space. So instead of pandas, we have cuDF for preprocessing, cuML for machine learning, cuGraph for graph analytics, and then not a whole lot of change to the deep learning ecosystem.
So if there's a takeaway for how you might implement this in some of your own work, if you're using any pandas, scikit-learn, NetworkX, other libraries, take a look at the RAPIDS ecosystem, and it will show you what the drop-in replacements should be. Similar to what we saw with cuPy and numPy, the APIs are designed to be as close as possible, if not identical. So the goal is to make them identical where possible.
Especially with the data science side of the stack, I recommend that you make sure that the results you get out of one are the same as what you get out of the other. So one area that I've seen these two differ, if you're doing a lot of group-by-apply combined operations, or if you're doing group-by-apply combined operation in pandas versus cuDF, I've seen different behavior on the default sorting within groups. So just something to be aware of that you might need to make explicit in one library that's implicit in the other, and the default is slightly different.
So here's this one-liner drop-in replacement. Import cuDF as pd. Then all of your pandas data frame operations, like your read CSV, head, tail, anything you can do downstream from that will be GPU accelerated.
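As a sketch of that drop-in, plus the explicit-sort tip from a moment ago (my example; the file and column names are made up):

```python
import cudf as pd   # instead of: import pandas as pd

df = pd.read_csv("measurements.csv")                            # hypothetical file
summary = df.groupby("device_id", sort=True)["value"].mean()    # make sorting explicit
print(summary.head())
```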
It's interoperable with a large number of different GPU accelerated libraries, whether they're more on the numeric computing side or on the deep learning side. I think it makes sense to go through these next two sections in one block. So once you do your preprocessing and initial visualization and you get your data into a format that's ready to be modeled, your next task is to run a bunch of models and see how well they predict.
I think this picture makes me think of the drug screening pipeline. So when pharmaceutical developers are trying to identify candidate molecules for drugs, they go from a massive space. I think it's like 10 to the 40 or something like that, so really massive.
While there's certainly still a need for both the numerical modeling that's based on the chemistry of these systems and the benchtop work that goes into, you know, testing to make sure they're biocompatible and effective and not harmful, a lot of the work is done in the top parts of these cones. So to the extent that machine learning, and especially GPU-accelerated machine learning, can be applied to narrow that search space down and identify better candidate molecules for certain purposes, this is one area where these types of end-to-end workflows can be very helpful to advancing the field. Scikit-learn and cuML have a bunch of pre-built data sets, right? These are not necessarily representative of what you'll see in the wild, but it's a good example of a non-linear classification problem, right? So this is something where a linear classifier will not effectively separate these two half moons.
So as you make the data set bigger, you might experiment with cuDF and cuML in order to generate, analyze, and then model these data sets. The algorithm ecosystem continues to grow. At least when this was created, and the last time I checked, there wasn't full coverage in cuML of all the algorithms available in scikit-learn.
All right, this is changing over time. So the goal is always to map one for one. Scikit-learn is under active development.
It's a community project, so there are always new contributions to it. There are also certain algorithms which are more amenable to parallelization than others. So you'll see the speed-ups vary by what the algorithm is.
And if you're calling the same high-level algorithm, let's look at something like logistic regression or linear regression. The defaults might be different in how it's done because it may be faster using one method than another, right? So a good example, I think, is linear regression. Linear regression has an analytical solution.
So you have your data, right? Your X can be multivariate, and you have your targets, y, a single value per sample. And by forming X transpose times X, inverting that, and applying it to X transpose times y, you can find your coefficients. So it involves a matrix inversion, right? And that's one way to do it.
Matrix inversions are accelerated on GPUs. Another way to do that is through gradient descent, right? So here you're just using your loss function, taking the gradient of that, and then taking steps. So those different approaches might realize different speed-ups going from CPUs to GPUs.
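Here's a quick sketch of those two solution paths (my own example, written with NumPy; CuPy or cuML would accelerate the same math):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_beta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.1, size=10_000)

# 1) Normal equation: beta = (X^T X)^(-1) X^T y, i.e. one matrix solve/inversion.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Gradient descent on the squared loss: many small matrix-vector steps instead.
beta_gd = np.zeros(5)
lr = 0.1 / len(y)
for _ in range(500):
    beta_gd -= lr * (X.T @ (X @ beta_gd - y))

print(beta_closed, beta_gd)   # both should land near the true coefficients
```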
Taking a look at the notebooks. So when you got to the end of notebook one, if you ran this cell, you'll see kernel restarting. It shows you a message.
It'll restart automatically. Even with that, I find sometimes it's still utilizing memory. So if you right-click on the first notebook, shut down kernel, that will shut that kernel down.
You can start working through notebook two. You'll see some comparisons between pandas and cuDF. Again, notice what you're going to see or take note of what you observe when you go from smaller to larger datasets, and also when you go from simpler to more complex operations.
So definitely when we get down into some of the merging, and I think there's an example in here. Let's see. Aggregation, right? Where you're aggregating and summing.
See how fast this runs on pandas data frames versus cuDF. So NVIDIA, as an organization, is obsessed with performance. And when I say performance, we're usually talking about how well are we utilizing the compute resources, and also what's the time to results.
So one of the aspects or functionalities of cuML is this benchmark runner. So it has this built-in speedup comparison where you're able to run these algorithms both in scikit-learn and in cuML and look at the speedup. So it's also a pattern that you may want to apply to some of your own code.
If you're doing a comparison, you have a runner, right? You have a runner that captures certain aspects of your computing workflow, whether you're trying to look at the time a single block takes or end-to-end application performance. And it's just one more way to make your code more modular and then give you a way to assess how well it performs. So we'll take the next 10 minutes or so to work through this notebook.
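For reference, the runner pattern by itself is small; a generic sketch of my own (not the cuML benchmark code) looks like this:

```python
import time

def run_timed(label, fn, *args, **kwargs):
    """Run any callable, print how long it took, and return (result, seconds)."""
    t0 = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - t0
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed

# Usage idea: time a scikit-learn fit and its cuML counterpart, then print the ratio.
# For GPU work, synchronize inside fn (or before reading the clock) for accurate timing.
# _, t_cpu = run_timed("scikit-learn KMeans", sklearn_model.fit, X_cpu)
# _, t_gpu = run_timed("cuML KMeans", cuml_model.fit, X_gpu)
# print(f"speedup: {t_cpu / t_gpu:.1f}x")
```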
I'll check in then. Any questions? All right, show of hands of people using scikit-learn. A lot of pandas.
All right, great. I hope this takes some of your time to results down. Is anybody, once you're done with this code, opening up one of your own projects and dropping something in and being like, let me just see? Have you been doing that? I think a couple of people have been. If you do have a live project and you run through the notebook quickly, open up one of your own things and give it a shot.
I guess you have to have the environment set up. You do. So you have the compute time on here.
Also, Google Colab. So Google Colab, you have T4 access. It's a great interactive environment.
So you can access GPU by changing the runtime. If you load this extension, cuDF.pandas, you won't have to make any changes to your code. It will just automatically run all of your pandas code using cuDF.
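In a notebook, that zero-code-change route looks roughly like this (my sketch; load the extension before importing pandas):

```python
%load_ext cudf.pandas

import pandas as pd   # unchanged pandas code, now accelerated by cuDF where possible

df = pd.DataFrame({"key": [1, 2, 3, 1, 2, 3], "value": [10, 20, 30, 40, 50, 60]})
print(df.groupby("key").sum())
```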
So it's a good hack to experiment with a notebook, upload one of your own notebooks, and run it in that Colab environment. We've got a question next to you. Oh, yes.
Sorry. I think it's quite different when the data doesn't fit in GPU memory. Is it still interesting to try to use this kind of mix of CPU and GPU memory, or does it complicate things too much? I mean, you have to keep track of which data is where.
Do you have any good ways to handle this problem? Yeah, I would say, overall, time it and see what's best, right? We've seen that once you scale your data set to a certain size, the GPU is gonna run faster. If you start to exceed GPU memory and it starts to run slower than it does on the CPU, then you know you need to optimize even further or just run it on a CPU, right? In general, I know that the most recent version of cuDF supports this unified memory by default. So if you have cuDF data frames that are larger than GPU memory, it will automatically split them up.
And then the Rapids memory manager, or the CuPy memory manager, where you create a mempool, is a good way to do that. It's not out of the box gonna be optimized for performance, because there are some default ways that the data is gonna be split up and then accessed.
So there are some kind of further fine-tunings that you can perform in order to get the best performance out of it. But I would say like the golden rule is try it, see if it's fast enough for you, or if it's faster than CPU only. And if it's not what you expect in terms of speed, look for ways to optimize it.
When we have multiple GPUs, do we have to do anything special to use them together? I mean, would you share the information across the GPUs or something like that? That is gonna depend on how your environment is set up and how you launch the job. All right, in general, when you go to multiple GPUs, there's more that you have to do by hand. So usually it's either a wrapper around what you'd normally be calling, or you can use Dask to launch jobs that are then split across multiple GPUs.
So it took us a little bit longer to get through these first two notebooks than I had planned. So let me at least ask you to pause. I'll point you towards this final notebook, the two additional notebooks, which we didn't cover, but you can play around with.
We do have the room until 3.30. So if you're willing and would like to stay, happy to stick around for a while. If not, then at least you know what you can work on at your own time. So looking at the second notebook, I think these merge and group by commands from the data pre-processing are two of the strongest examples.
All right, so in terms of iteration time, you're going from roughly five seconds down to 20 to 30 milliseconds. All right, so these are operations that are very parallelizable. I've worked on data-driven workflows where the group-by, apply-some-operation, combine-the-outputs step was the slowest point in the workflow.
And then everything downstream of that, all the machine learning and deep learning was much faster. So understanding your end-to-end workflow, where the bottlenecks are and how to speed up those bottlenecks, it gives you a good indication of where you might want to accelerate code first. So you don't always have to change everything all at once.
Get a high-level understanding of your entire workflow, understand what the slowest parts are, work to remove those bottlenecks, and then iterate based on that. So think about developer time versus where you're really gonna gain the most. So this'll take a little while to run down at the bottom, but hopefully you saw, or you will see, that the comparison between running machine learning models on scikit-learn on the CPU and using cuML on the GPU is quite different.
And there's gonna be some variations in there. If you're looking for the latest snapshots of where Rapids is, so rapids.ai, it's a community-driven open-source project. And there are some comparisons on here.
So in terms of the course, let's take a quick look at the final notebook, which combines a variety of techniques that we've covered today. All right, so this case study is meant to simulate a geospatial dataset. So one of the underlying calculations you may need to make when you're working with any kind of geospatial data is to calculate distances on a sphere.
So once you've done that, or maybe as part of your machine learning approach, say a K-means or a K-nearest neighbors algorithm, calculating those distances and being able to do that efficiently will drive the overall performance of that workflow. So in this case, calculating the distance between different points on the sphere will entail knowing their latitude and longitude, and then being able to calculate the distance along the surface of the sphere. So this gives you an overview of the problem, right? And it takes you through a comparison of a conventional for loop, which I think, especially in Python, is not maybe the best comparison.
NumPy broadcasting, I think, is a fair comparison. So brute force for loop, NumPy broadcasting, and then scikit-learn K-nearest neighbors clustering, and compare that to broadcasting in CuPy and CuML. All right, and you may or may not see similar times as are quoted here, all right? But interesting always to see what factor speedup you're getting versus what is quoted or advertised.
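For context, the pairwise great-circle (haversine) distance with NumPy broadcasting looks roughly like this (my sketch; CuPy is the same code with cp in place of np, and the notebook's own implementation may differ in details):

```python
import numpy as np

def haversine_matrix(lat_deg, lon_deg, radius_km=6371.0):
    """All pairwise great-circle distances (km) for points given in degrees."""
    phi = np.radians(lat_deg)[:, None]   # shape (n, 1) so differences broadcast to (n, n)
    lam = np.radians(lon_deg)[:, None]
    dphi = phi - phi.T
    dlam = lam - lam.T
    a = np.sin(dphi / 2) ** 2 + np.cos(phi) * np.cos(phi.T) * np.sin(dlam / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

lat = np.random.uniform(-90, 90, size=1_000)
lon = np.random.uniform(-180, 180, size=1_000)
d = haversine_matrix(lat, lon)           # 1000 x 1000 distance matrix
print(d.shape, d.max())
```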
All right, here's one additional place where you'll see this mempool, right? So Rapids has its own memory manager. So here, the Rapids memory manager is used to initialize this pool and set it up on the CUDA device, right? And you'll go through a series of different comparisons, right? So first, looping through every pair of elements, then NumPy broadcasting, and then finally comparing that with broadcasting in CuPy. So understanding that some people have to take off, I do wanna share a couple of places you can go after this course.
All right, so first, as Jess mentioned, provide your feedback. So filling out the form helps provide feedback to CUIT, and also to us in terms of how we can improve this course in the future. Timing and budgeting more time is one that I have at the top of my list.
Second one, think about where you might wanna continue learning, right? So if there are courses within the NVIDIA Deep Learning Institute portfolio that you think would be interesting, they range from an hour or two to full day courses. Fortunately, we're able to offer these for free to academic institutions. So a great opportunity to learn more.
As Jess mentioned, this is our second event this semester. First one was focused on different ways to accelerate code on GPUs. And it was a presentation, but not hands-on, more focused on the HPC side of the stack.
This one is hands-on numerical computing and data science with Python. If you're interested in some of the deep learning oriented courses, whether that's at a fundamental level or you're starting to scale training jobs up to multiple GPUs, we have offerings in those spaces, and we also have an extended version of this course, GPU Accelerated Data Science, which is eight hours long. If you're looking just for some interesting places to learn more about deep learning, the PyTorch tutorials are excellent. Oftentimes they have Jupyter notebooks, code repos, and YouTube videos that all focus on a single topic.
They're grouped by beginner, intermediate, and advanced. So it's one of the first places I go when I'm trying to learn new topics. And Andrej Karpathy, one of the pioneers of deep learning, has spent time at Tesla and OpenAI and is now creating an educational platform for deep learning.
It's one of the best places to learn if you really wanna get into the details of how neural networks train. In terms of checking out some of the latest GenAI resources, build.nvidia.com is a web-based interface where you can go to interact with GenAI models. So whether those are large language models, text-to-image generation models, even molecular modeling or RAG workflows, it's a cool place to check out what's available and mess around.
And like I mentioned, developer.nvidia.com, checking out the blogs to see what people are doing with NVIDIA technology. We plan to be back on campus on December 3rd. So we're gonna have an NVIDIA day with Cambridge Computer.
One part will be focused towards research computing and HPC folks, one towards researchers. So stay tuned to multiple channels to hear more about that.