NVIDIA presents: 5 Ways to Accelerate Your Computing with GPUs

Good afternoon and thank you for joining today's workshop, Five Ways to Accelerate Your Computing with GPUs. My name is Axinia Radeva, manager of the Research Computing Services Association, and our team, Research Computing Services, supports researchers by providing advanced computing solutions.

We work to ensure that our researchers get the right tools and high-performance systems to optimize their workflows, to drive their projects forward and achieve groundbreaking results. Today with me in person is Jessica Eaton, who is a research analyst on the RCS team. Most of you here know her by email.

We have Ali, who is the research system engineer on the RCS team, and Elizabeth, a research computing specialist on our team. We are excited to kick off today's workshop with a video where we'll explore five ways to use GPUs to accelerate your computationally intensive work.

Whether you are new to GPUs or already experienced, this session will offer valuable insights, from beginner tips to advanced techniques. It's a great opportunity to learn how GPUs can enhance your research and to engage directly with NVIDIA experts. We have Mike Keefe, and Mike is a senior solution architect.

He is an experienced applied scientist and marine officer with a proven track record in leading tech teams. He has a PhD in physics from SUNY Graduate Center. And Jeffrey Lancaster is a senior account manager for higher education and research at Tower College of Columbia.

He's a Columbia alumnus, and actually he completed his PhD in chemistry here in the Bunker Building, where we are right now. And he defines himself as a curious technologist and educator. It's really nice to have you back again.

I'll now hand it over to Jess. She'll make a few short announcements. Hey, folks.

Just so those of you on Zoom know, we're recording this. Everybody is informed that it's being recorded. And then the other thing is, after this, I'm going to send a quick survey.

I know folks hate surveys, but this one has a list of other NVIDIA workshops that potentially could come here. So if you want to weigh in so we can figure out what to bring next, that would be great. And then it's over to Mike for this session.

Great. So thank you to Jess and Axinia for coordinating this opportunity. I'm very excited to be here to speak with everyone today.

And we're going to be talking about five plus ways to accelerate your computing workloads on GPUs. This is meant as a fairly broad overview of how NVIDIA approaches computing for various research and industrial applications. And like Jess said, the more feedback we can get on what you'd like to see next from us, that'll help us tailor some of our presentations in the future.

And we have time for Q&A at the end, too. That is the goal. Okay.

Yes. But if you have a question in the middle, please interrupt. We want this to be interactive as well.

So to start off with, when we think about accelerating computational workloads, NVIDIA looks at this as an entire platform solution. This might look like some fancy graphics, but it's one of my favorite slides to present because it presents both the front-end applications that NVIDIA works across, but also the depth of the stack from very low-level hardware, so GPUs, CPUs, all the way up through different systems. And that includes a lot of GPU networking and systems integration through the lower-level acceleration libraries, right? A lot of your CUDA-level libraries and libraries that are built specifically on top of that.

And as we work our way up in this visualization, we get up to the application level. So Modulus, a platform for physics-informed machine learning. MONAI is for AI-enhanced medical imaging.

NeMo for neural models, so useful for generative AI models. Won't go through every single one, but you can see that they span a large range of research and industrial applications. So today, I'll talk a little bit about the GPU itself, how the architecture compares with and is different from a CPU.

And we'll go through five different ways you can accelerate your code on GPUs. I'll briefly touch on some best practices for performance. And then, you know, in this age of generative AI, I think it's important to look at how we can leverage some modern AI tools in various research workflows.

And then I'll point you to some resources that you can explore. So any accelerated computing system may have some combination of CPUs and GPUs, right? So historically, large clusters were built entirely of multi-threaded integrated CPU clusters. And GPUs were originally developed for graphical processing.

And since then, with certain code bases being made available for general-purpose computing, and especially with deep learning and AI leveraging GPUs for computation, GPU acceleration has become part of the modern high-performance computing stack, whether that's applied to scientific applications in an HPC context or machine intelligence. One of the main distinctions between a CPU and GPU is that GPUs are optimized for massive parallelism.

So if you have operations that, say, you want to perform the same operation on multiple data streams, a GPU is going to enable that parallelism. I'm just going to try to move this around. I only have one slide.

All right, so for GPU acceleration, there's a certain percentage of any application that is amenable to parallelization, right? So it really depends on how well you're able to parallelize that portion, right? So certain applications, especially deep learning training, are accomplished almost primarily or completely on GPUs, whereas others require a hybrid node of CPUs and GPUs, right? And these are typically connected by a PCI Express bus, right? And it's worth keeping in mind that passing data between the GPU and the CPU can be a bottleneck in a lot of applications. So GPU architecture in some ways is analogous to a CPU. There are processors and memory.

The relationship between the memory and the processors is constructed in such a way that GPU can rapidly access memory and carry out a lot of instructions in parallel. This provides one example, so the H100, named after Grace Hopper, so the H stands for Hopper here, 80 gigabytes of GPU memory. So when you compare that to previous generations, you'll see that it's quite a bit larger.

When you look at some of the more recent generations that have come out, like H200 and now Blackwell, you can see that it's somewhere in the middle, right? But GPU memory can be a key factor in how fast your application can be accelerated. And each GPU has sets of streaming multiprocessors, right? And each one has its own set of control units and registers. Within a streaming multiprocessor, there are multiple cores, right? So when you start to compound these numbers, it gives you an idea of the parallelism that can be achieved.

And within these cores, there are arithmetic units, right? So these carry out some of the base computations that GPUs are able to accelerate. A typical processing workflow will start on the CPU. So one of the first things that will happen is data will be copied from CPU memory into GPU memory.

The program is loaded onto the GPU, the GPU executes the program, and then data is transferred back from GPU memory to CPU memory. So CUDA is the general-purpose software that accelerates applications on NVIDIA GPUs. So it was initially an acronym for Compute Unified Device Architecture.

Now CUDA is just its own acronym. But that is the lower-level library that accelerates applications on NVIDIA GPUs. So to get into the heart of the talk, we're going to look at five different ways to accelerate your code on GPUs.

Maybe five plus one as we get into the last section. And there are a number of trade-offs between flexibility and accessibility. So think on the more accessible side, you may be leveraging higher level applications, frameworks, or APIs to accomplish acceleration on GPUs.

Going towards the right-hand side, you may be controlling and writing your own CUDA to really optimize performance for a specific GPU or NVIDIA GPUs. And there are options somewhere in between. And then standard language parallelism, especially in C, C++, and Fortran, is an emerging area where parallelism is incorporated into standard language features.

So with that, we'll get into the applications. And I think it's worth thinking about applications and frameworks interchangeably. So you can see that across fields, whether you're thinking about it from an industrial perspective or from a research perspective, there are a number of applications that are currently GPU accelerated.

And the performance of these applications, so think relative speedup on the GPUs, really depends on the application itself. So in the climate and weather space, there are numerical weather prediction models that incorporate physics and that have run on large-scale CPU systems for decades. So parts of those or entire libraries are now being ported to GPUs for acceleration.

In the data science and analytics field, there are a number of different tools which are being GPU accelerated. Certainly in the life sciences and computational chemistry realm, there are a number of different frameworks which have been GPU accelerated to some extent. I have a question.

Sorry. Is anybody using any of these? Does anybody see anything up there that you recognize or that you've got experience using? PyTorch and TensorFlow are two things I see that I recognize. Fantastic.

Thanks, Al. Great. So PyTorch and TensorFlow.

So we'll get to those in a second. What about the other tools? Yeah. Okay, great.

So to make this more interactive, what do you think makes this suite of artificial intelligence frameworks different from the rest of the applications on this slide? So open-source and fairly widely programmable. Great. So open-source is one of the key aspects of it.

A number of these other tools are open-source as well. But yes, the ability to have open-source frameworks inside integration is certainly one thing. What about chronologically? I guess kind of the AI tools, like what makes them different from the other ones? What makes them so well-suited to GPU acceleration? Well, they have like a sort of framework in a way for development.

I suppose most of the other ones are very application-specific, but this is more like platforms. Yeah. Yeah.

Yeah, that's absolutely true. I want to make sure I'm on time. So I think there are a couple of areas that stand out. One is that GPUs are matrix multiplication machines, right? This is the power of graphical processing, right? So think graphics generation or video games.

And that was the key link that really accelerated the field of artificial intelligence. The second one is just thinking chronologically, right? A lot of these fields were already computationally mature by the time general-purpose GPU computing started to become more widespread. Whereas these frameworks really co-developed with GPU technology, right, as both the hardware and software advanced.

So I would say in some ways, these, even though you do have an option to run on a CPU or GPU, they're somewhat more natively GPU. Whereas other applications say in the climate and weather space, it takes a lot more work to port them over to GPU and it's a large community effort. So one thing that might come across as you learn more about NVIDIA is that NVIDIA as a company and organization is very focused on application.

Right? So we're always looking to decrease time to results. Think about this in an educational context or a research context: the faster you can iterate, the better you'll be able to drive your fields forward. Some of the integrations are with ANSYS and, say, STAR-CCM+ for fluid dynamics.

These are relatively recent and they're ongoing, right? So these are very computationally intensive codes that are now starting to benefit more and more from GPU acceleration, and not just on a single GPU, but even at scale. So when we think about applications and frameworks, applications are there to solve a certain problem or a certain class of problems, and frameworks kind of give you the overall structure to develop your own tools and applications. Whereas a library is something that, you know, a developer will leverage within a larger project, right? You have more control over how you're leveraging those libraries.

One of the advantages of GPU acceleration libraries is that there are opportunities for drop-in acceleration. And when we say drop-in acceleration, the goal of a lot of the libraries that NVIDIA leads the development on or contributes to, and that are fully open source, is that they're meant to be drop-in replacements for commonly used programming libraries. So on the high-performance computing side, we have the NVIDIA HPC SDK, right? So software development kit.

This is completely open source and available for download, whether that's through a container solution or accessing it from the cloud. And there are a number of different programming models that this encompasses, right? From your standard language parallelism to OpenACC and OpenMP, which we'll talk about a little bit more later, and down to the lower-level CUDA. All right.

And this development ecosystem includes more than the language itself. So there are compilers that are specifically tailored towards parallelization in general, but also GPU acceleration, right? And then the underlying libraries that will generally be leveraged by application developers, right? These then build on math libraries and communication libraries. So the communication libraries become important when you're working with multiple processes or multiple GPUs and communication between those different processors or workers becomes a limiting factor for the application.

So I'll show these different tools a couple times today. And being able to profile and debug your code, right, is essential for development. I would say, you know, certainly coming to NVIDIA fairly recently, I was more familiar with debugging tools than I was with profiling.

So profiling tools give you an opportunity to look at how your code runs, how data is transferred from host to device and back, and even look at the computational efficiency of kernels that are running on GPUs. And it provides pointers on how to optimize your code. One of the examples that we'll see a few times today is SAXPY, right? So this is single-precision A times X plus Y. You can think of it as the hello world of an HPC application, right? So this is a single-precision vector multiplied by a scalar, plus a second vector.
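To make the operation concrete, here is a minimal sketch of SAXPY in plain Python. It is an illustration only, not the code shown on the slide: the function name saxpy and the use of Python lists are assumptions, and Python floats are double precision, so the "single precision" part is not modeled here.

```python
def saxpy(a, x, y):
    """Return a*x + y element by element (the SAXPY pattern).

    Hypothetical helper for illustration; a real HPC code would call an
    optimized BLAS routine instead of looping in Python.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Every output element depends only on the matching input elements, which is why this pattern maps so naturally onto many GPU threads.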

And in order to make this code accelerated, one way to do that at the library level would be to replace library calls with those in an accelerated library, right? So, you know, whatever you're using for single-precision A times X plus Y could be substituted by this CUDA-enabled linear algebra library. One of the other things we need to manage when we're working with these combined CPU and GPU systems is where the data resides at a given point, right? So managing where the data is stored and then how it's transferred is a key aspect of developing a CUDA-accelerated application. And then finally, especially if you're working on the compiled-language side of the stack, making sure that you're compiling and then linking the appropriate libraries so that this will run in a CUDA-accelerated environment.

So the cuBLAS SAXPY-like function, is it possible to write this in a way that detects if there's a GPU present and then defaults to the regular SAXPY if there's no GPU present? Is it possible to wrap it that way? That is a great question. Yes. So the short answer is yes.

And we'll see an example of that. Sorry. That's a great question.

Okay. All right. So there's a lot of development within the Python ecosystem.

So whereas C, C++ and Fortran have been high-performance computing languages, you know, with standards for decades, Python has gotten a lot of attention and is widely used, right? There's a lower barrier to entry and a shorter development time when you're using Python. One of the drawbacks of Python is that historically it hasn't been incredibly performant. And that's due both to some language design choices and to the maturity of the ecosystem.

So CuPy is an open-source library that is designed as a drop-in replacement for NumPy and SciPy. So the goal within CuPy, as an example, is to be able to, though I wouldn't recommend doing this per se, just change your "import numpy as np" to "import cupy as np" and see if everything works. Right? So there are times when you can realize 10x, 100x speedups just by substituting libraries.
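The import swap can also be written so that the same script still runs on a machine with no GPU. The following is a minimal sketch under stated assumptions, not something shown in the talk: the helper names select_backend and saxpy_any_backend are made up for illustration, CuPy and NumPy are both treated as optional, and the pure-Python branch exists only so the sketch runs when neither is installed.

```python
def select_backend():
    """Return (name, module); module is None for the pure-Python fallback."""
    try:
        import cupy
        # Only use CuPy when a CUDA device is actually visible.
        if cupy.cuda.runtime.getDeviceCount() > 0:
            return "cupy", cupy
    except Exception:
        # CuPy is not installed, or there is no usable CUDA driver/device.
        pass
    try:
        import numpy
        return "numpy", numpy
    except ImportError:
        return "python", None


def saxpy_any_backend(a, x, y):
    """Compute a*x + y with whichever array library is available."""
    name, xp = select_backend()
    if xp is None:
        return [a * xi + yi for xi, yi in zip(x, y)]
    # CuPy and NumPy share this part of the API, so one code path serves both.
    x_arr = xp.asarray(x, dtype=xp.float32)
    y_arr = xp.asarray(y, dtype=xp.float32)
    return (a * x_arr + y_arr).tolist()


print(select_backend()[0])
print(saxpy_any_backend(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Note that on the CuPy path the call to tolist() copies the result back to the host, which is exactly the kind of transfer that is worth keeping out of inner loops.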

Within the broader field of data science, the Python data science stack is fairly mature and well adopted. And this includes tools like pandas for pre-processing, scikit-learn for machine learning, and, I would say, outside of deep learning, graph analytics, right? And the list goes on. RAPIDS is a collection of libraries that GPU-accelerates these CPU-based Python data science stack libraries.

And it's designed so that they're drop-in replacements. Right? I will say designed as being a drop-in replacement and being a drop-in replacement are not always the same exact thing. So please do test and verify your code.
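One way to follow that advice is to run both libraries on a small input and compare the results, ignoring row order when order is not something the downstream code relies on. This is a minimal sketch in plain Python, where each result is represented as a list of (key, value) rows; the helper name same_rows and the sample values are hypothetical, and keys are assumed to be unique.

```python
import math


def same_rows(rows_a, rows_b, rel_tol=1e-6):
    """Compare two lists of (key, value) rows, ignoring row order.

    Values are compared with a relative tolerance because CPU and GPU
    floating-point results can differ in the last few digits.
    """
    if len(rows_a) != len(rows_b):
        return False
    for (key_a, val_a), (key_b, val_b) in zip(sorted(rows_a), sorted(rows_b)):
        if key_a != key_b or not math.isclose(val_a, val_b, rel_tol=rel_tol):
            return False
    return True


# Made-up results standing in for the output of a CPU and a GPU library.
cpu_result = [("Mon", 3.0), ("Tue", 2.0), ("Wed", 1.0)]
gpu_result = [("Wed", 1.0), ("Mon", 3.0000001), ("Tue", 2.0)]
print(same_rows(cpu_result, gpu_result))  # True
```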

One example that we've seen where they might differ is when you're performing data frame operations, your result might be sorted differently within a group by operation by default in one library than another. So it's just important to know that they're designed to be drop-in replacements, but you want to verify that every step is producing what you expect it to produce. Another quick question, is Dask natively GPU accelerated? That is a good question.

Yes, I believe it is. I think it was originally developed as a CPU parallelization technique, but it can also be GPU accelerated. RAPIDS enables the ability to work with Dask and these other libraries, right? Like at multi-node scale.

Just as a reminder, when other people in the room speak, it doesn't always come across to all of us on Zoom. So if you could recap questions in the room, it will be easier for us to understand the conversation. Yes, thank you for the reminder.

So the question was about Dask being natively GPU enabled, and the response was, yes, originally designed for CPU parallelism, but it now supports GPU parallelism and the RAPIDS stack is designed to work in concert with Dask. All right, so to show a couple of examples, cuDF is the drop-in replacement for pandas DataFrames, and it's designed to have a pandas-like API. So this includes how you would ingest data, how you might pre-process your data, whether it's for analysis or downstream modeling.

And cuML is similarly analogous to scikit-learn. So I spent some time as a data scientist, so I've used scikit-learn quite a bit. I wish I had known about cuML earlier in my career.

One of the things I always think about with cuML is that it changes some of the paradigms that scikit-learn even advertises, right? So you may be familiar with, I shouldn't have this chart in here, but scikit-learn has a great chart to give you a guide to which algorithms to use. And it's based on whether you're performing a regression or a classification task, and it's also determined by the size of your dataset. Not only are the speedups significant for certain algorithms, like a support vector classifier, but it opens up the opportunity to apply those algorithms to much larger datasets than would be feasible on a CPU. So, I'm about halfway through. I have a quick demo.

Before I roll out to the demo, are there any questions? If anyone on Zoom does not see... what's that? You look good. Good? Okay, thank you. Could you drop that URL into the chat? That way, we can copy and paste it, the ones on Zoom.

Yes. I think there's some of the other URLs. And are we... we're going to share the slides, right? Yes.

Okay. Thank you. That is a graphic of the... interesting.

Okay, let me try that one more time. Okay, I got it. You got it? Yeah, yeah.

That's good. Okay. All right.

So, this is an example of the technical blog. So, I think, you know, one of the best ways to experience this is something that you can do on your own in a web-based IDE. And not only is cuDF meant as a drop-in replacement for pandas, but there is this Jupyter extension.

Oh, thank you. Let me change my share. Stop.

Let me share my entire screen. All right. So, there is a Jupyter extension called cudf.pandas.

In this case, you don't have to change the code at all. It will automatically identify if there is a GPU available. And if so, it will run pandas as cuDF code.

And there's a link in here to a Google Colab notebook, which I have on this next tab. And it enables you to see the difference, right? So, starting with a command you may be familiar with, nvidia-smi. So, this is the NVIDIA System Management Interface.

It provides a snapshot into GPU utilization. So, in this case, for most unpaid Colab notebooks working with the T4 GPU, you can see it has about 16 gigabytes of GPU memory. If you were to run this cell again while you're running a job, it would show you the instantaneous GPU memory utilization, right? And power usage also.
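Outside a notebook, the same snapshot can be taken from a script. This is a minimal sketch, assuming nvidia-smi may or may not be installed on the machine; the helper name query_gpus is made up for illustration, and the function returns None rather than failing when no NVIDIA driver is present.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, memory.total, memory.used' line per GPU, or None.

    None means nvidia-smi is not on PATH or the call failed
    (for example on a machine without an NVIDIA driver).
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    cmd = [exe, "--query-gpu=name,memory.total,memory.used", "--format=csv,noheader"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
    except (subprocess.SubprocessError, OSError):
        return None
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus is not None else "no NVIDIA GPU visible")
```

Calling query_gpus() again while a job is running gives the kind of instantaneous memory reading described above.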

So, this goes through an example of New York City parking violations. So, if you spend any time in New York City, and especially if you have a car, you may have received parking violations. Growing up here, I certainly had some of them in this area in just uptown where I went to grad school.

So, here we'll import pandas, read some of the data. So, I'm sure it's here. Well, it looks like I'm not connected to the internet now.

So, fortunately, I've run this already today. It goes through a demonstration of how you use pandas to read in data, chain a number of operations to group by, aggregate, rename, sort values, and then look at the frequency of parking violations, say, across days. And of course, because we're focused on performance, we time it.

So, you can run this on your own and see that the total wall clock time is about seven seconds. So, up until now, this was all done using pandas. Skip to the part where I go back, but you'll then start using cudf.pandas. So, you'll have to restart the kernel in order to kick this off, load the extension, and you can see that even for reading and for running that same sequence of operations, we're now down in the 700-millisecond range.

So, it's 10x speed up out of the box. So, I know I've built data processing and machine learning pipelines where the preprocessing in pandas, especially for long time series data, was the choke point of that entire workflow. This helps those kinds of jobs.
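For readers who want the shape of that pipeline without opening the notebook, here is a minimal sketch of a timed group-by, aggregate, rename, and sort chain in pandas. It is not the demo's code: the DataFrame is a tiny made-up sample, and the column names issue_weekday, violation, and n_violations are hypothetical. In a notebook, the step the demo describes is running %load_ext cudf.pandas before importing pandas; it is left out here so the sketch also runs on a CPU-only machine.

```python
import time

import pandas as pd

# Tiny made-up sample; the real demo reads a much larger parking violations file.
df = pd.DataFrame({
    "issue_weekday": ["Mon", "Tue", "Mon", "Wed", "Mon", "Tue"],
    "violation": ["No Parking", "Meter", "Meter", "No Parking", "Hydrant", "Meter"],
})

start = time.perf_counter()
per_day = (
    df.groupby("issue_weekday")["violation"]
      .count()                     # aggregate: violations per weekday
      .rename("n_violations")      # rename the aggregated column
      .reset_index()
      # Sort explicitly, with a tie-breaker, so the row order does not
      # depend on any library's default group ordering.
      .sort_values(["n_violations", "issue_weekday"], ascending=[False, True])
      .reset_index(drop=True)
)
elapsed = time.perf_counter() - start

print(per_day)
print(f"wall-clock time: {elapsed:.4f} s")
```

On a sample this small the timing is meaningless; the point is only where the timer sits relative to the chained operations.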

So, feel free to play around with that and also to explore the technical blog. Does that seem like something people would be interested in trying out? Has anybody used that before? I use pandas a lot, and I checked it out. I'm not really invested in testing.

I see that I'm just getting things out. What kind of data are you looking at? I thought when it was like time series, it's like multi-channel recording, but now I work with something that's not as much. Right.

Okay, interesting. So, getting into what I would say is the middle point between this flexibility and accessibility spectrum are OpenACC directives. So, I think this will answer your question from before.

These are compiler hints. So, you're able to provide these compiler hints outside of the structured block of code, and then based on how you compile the code, the same code should run on either CPU or GPU. That's a different executable script.

This is work that's done by the OpenACC organization, right? OpenACC is the standard, but OpenACC is also an organization that promotes accelerated computing very generally. So, one of the opportunities I'd like to point out is the OpenACC hackathons and workshops. So, the OpenACC hackathons are a little bit different from traditional hackathons where you join the hackathon to work on a problem that the hackathon prescribes.

Here, you can bring your own code, and based on a research problem you're working on, and you'll be paired with experts from universities, industry, national labs to help you accelerate that code. And it spans HPC, AI, hybrid applications. I participated in my first one a couple of months ago that was hosted by Princeton.

It was an excellent experience just across the board, not only because we're able to see researchers accelerate how fast they're able to iterate through solutions and also on the AI side leverage more sophisticated models and larger datasets, but also because you get to learn what else is going on in the ecosystem. So, highly recommended. We have rolling admissions, and a large number of the hackathons are either hybrid or remote, so you're able to join with many.

So, revisiting this SAXPY example, you can see two examples, one in C and one in Fortran, where these pragmas are going to provide hints to the compiler for how to parallelize that code. There are a number of applications, which we covered in the broader slide, across many different domains that realize significant speedup based on this type of acceleration. So, one of the benefits of this is that you're able to run the same code on multiple different architectures.

So, especially for community-driven projects where tailoring it to a specific architecture is not part of the goals of the project, it's a great option. Now, we start to get closer to the performance side. I will say across the board, it's incredible how performant all of these options are.

If there are certain aspects of your code, so think of a certain function which is amenable to parallelization, and you want to eke out every little bit of performance, then you would dive into the CUDA programming realm. And this gives you the finest grain of control over how the code executes on the GPU. And it requires knowing the structure of the GPU itself, right? So, the number of threads, and you're able to tune your code for optimal performance.

So, you'll notice a comparison between serial code on the left and CUDA code on the right. So, the first is this __global__ specifier. So, __global__ indicates that it can be launched either by the host, the CPU, or by the device, the GPU.

But this is going to run on the GPU. These block and thread indices have multiple dimensions, right? And now you can essentially compute the integer index based on where you are in that GPU computation space and run this on the GPU. You'll notice the term kernel: a kernel is a piece of code or set of operations that runs on a GPU.
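The index calculation is easier to see when it is spelled out. The sketch below emulates a one-dimensional launch in plain Python: it is not CUDA and runs serially on the CPU, whereas on a real GPU the threads execute in parallel. The names saxpy_kernel and launch, and the block size of 4, are assumptions chosen for illustration.

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Body that one emulated GPU thread would execute."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < n:  # guard: the last block may have spare threads
        y[i] = a * x[i] + y[i]


def launch(n, a, x, y, block_dim=4):
    """Emulate a 1-D kernel launch by looping over blocks and threads."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y)
    return grid_dim


x = [float(i) for i in range(10)]
y = [1.0] * 10
blocks = launch(len(x), 2.0, x, y)
print(blocks)  # 3 blocks of 4 threads cover 10 elements
print(y)
```

With 10 elements and 4 threads per block, the third block has two threads whose index falls past the end of the data, which is what the bounds check guards against.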

So, the syntax for calling this kernel that you developed in CUDA C includes these triple angle brackets where you're specifying the number. You gave a definition of what global was. Could you repeat that? I'm not quite sure I understood what you said.

Yeah, thanks. So, __global__ is a specifier. So, it specifies that this is a kernel.

The fact that it says global says that it can be launched by either the CPU or the GPU to run on the GPU. Does that answer your question? Yeah. I think it makes it a little bit more clear.

I admit I'm not a C programmer. I just have general ideas. But I think that that makes it a little bit clearer.

So, there are a number of higher-level libraries that enable developers to leverage our GPUs without going down into that thread block level. So, Thrust is one for C++ development with CUDA. And a lot of the features of Thrust are now being incorporated into the standard C++ library.

Great question. Yes. Is Thrust for C++ development similar to Boost or CPU? Analogous in some way, but Boost and Thrust? That is a good question.

I would have to get back to you on that. Do you use Boost? I have used it before. Okay.

Yeah. Not correctly. Yeah.

I'll look into that. I'm curious myself. Thank you. What was the question again? Oh, sorry.

Thanks for the reminder. The question was, is Thrust for GPU accelerated parallel development in C++ analogous to Boost? Thank you. All right.

Here you can see that the blocks and threads are also exposed in CUDA Fortran. All right. So, finally, within the five ways, we'll get to standard language parallelism.

There are a number of languages, right? So, think C, C++, Fortran, Python. And now, parallelism is becoming part of those standards. Python doesn't have a standard.

So, it's more of a de facto standard based on what the Python developers and community accept, whereas C++ and Fortran do have language standardization. But the main takeaway is similar. This parallelism is now being incorporated into standard languages so that you're able to, say, import as NumPy and then just perform your matrix multiplication as you would in normal Python.

One of the examples here is a hydrodynamics simulation. And this leverages a number of different tools. Kokkos is another framework for leveraging standard language parallelism where you don't have to go down into the details, but you can accelerate code across platforms.

One of the benefits is when you think about having to develop a code base that would run on either CPUs or GPUs, you might have to have a lot of if statements, right, where you're essentially duplicating code blocks specific to a CPU or GPU. With standard language parallelism, right, using these standard parallel execution features, it enables you to develop the code once, and then at compile time, it compiles for the CPU or for GPU acceleration. And again, the performance is significant.

So, you're comparing how it would run on a CPU versus how it would run on a GPU using the standard language parallelism. This is driven by a couple of flags, which you might have to invoke in order to run it on the CPU. All right, so this stdpar flag, GPU or multicore.

All right, so those are the five ways, but I think there's a bit of a gap, and it's a lot of what's going on recently. So, before I get to that, we'll go over some best practices for performance. So, compute developer tools.

Nsight Systems is your system-wide application tuning framework. So, when you launch a job on the command line, instead of launching it with python or torchrun, or, you know, launching a compiled C++ application, you'll launch it with nsys, and then that will profile the runtime of your job. And this is not a great visualization, but you're able to see all the different processes and the host-to-device transfers, and it enables you as a developer to identify bottlenecks in your workflow.

All right, so maybe there are opportunities to invoke more processes in your data transfer, which will shrink some of those times down. Maybe there are opportunities to overlap some of the data transfer with computation, which will then, you know, accelerate the entire application and make your GPU not wait for data from the CPU. All right, so I highly recommend starting here with Nsight Systems.

It's also very useful for profiling deep learning applications. That graphic is a little small. I can't tell.

All that output is on the command line, or is it in a web-based window? That's a good question. So, it's not. Sorry.

Okay, yeah, so you would typically launch it by command line, but there's a graphical user interface that you can use to analyze the outputs. Could you just direct everything to a command line window instead and not use a web window? You could, and there are certain summary statistics which might be helpful via the command line. So, you might be able to look at which processes take up the most compute time, if you want the granularity to look at the time traces and how they overlap.

I think the GUI is probably the entry point that makes more sense for that. So, it really just depends on what you're looking at and what resolution you want to get into. In addition, you know, there are features of the graphical user interface that will provide hints to how you might optimize your code, so it can be useful in that sense.
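The launch-then-summarize workflow described here can be sketched as two commands. The helpers below only build the command lines and do not run anything; the names profile_command and stats_command, the script name train.py, and the report name are placeholders, and the .nsys-rep extension is an assumption about the report file written by recent Nsight Systems versions.

```python
import shutil


def profile_command(app_args, report="report"):
    """Build an 'nsys profile' command that wraps the given application."""
    return ["nsys", "profile", "-o", report] + list(app_args)


def stats_command(report_file):
    """Build an 'nsys stats' command that prints text summaries for a report."""
    return ["nsys", "stats", report_file]


# train.py is a placeholder script name, not a file from the talk.
print(" ".join(profile_command(["python", "train.py"])))
print(" ".join(stats_command("report.nsys-rep")))
if shutil.which("nsys") is None:
    print("nsys not found on PATH; the commands above are shown but not run")
```

The first command produces a report that can be opened in the graphical interface for the timeline view; the second prints summary tables to the terminal.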

I think, you know, if this is of interest, we could always dive deeper into Nsight Systems and profiling in general in the future. Thank you. So, the next potential area that you can use to accelerate your code and optimize performance is by using containers.

So, containers provide a stable compute environment that has either all or a large number of the dependencies that your code will use. NVIDIA makes a large number of containers publicly available through NGC, and I noticed that you're running Apptainer on systems here, right? So, you would have to squash those Docker images into SIF images, right, but the workflow is the same once you're done with that, and they're optimized for performance. So, if you're running PyTorch, the NVIDIA PyTorch container has PyTorch, a large number of the dependencies, a large number of libraries that are commonly used with PyTorch, and it also has the profilers inside the container already.

One of the advantages there in a shared HPC system is that it takes care of a lot of the privilege issues which occur when you're trying to leverage the system profiling tools. Getting some nods and smiles here. So, yeah, we've all worked through some pain points.

All right. So, finally, generative AI workflows. You know, this field is advancing so rapidly that there are different ways of harnessing the capability of generative AI.

The models are trained on massive scales, right? So, we're seeing a change in the scale of compute that industry has compared to academia. There are still ways to leverage a lot of these powerful models that are developed by the community, but it requires slightly different workflows, right? So, on the model side, you may be familiar with Hugging Face, with large language models. I would say, you know, really brought to the public awareness by OpenAI a couple of years ago, but now there are a large number of community-driven models, right? So, even Meta is releasing their series of Llama models, most recently 3.1, large language models, even multimodal models, and they're publicly available.

So, my question is, you know, how do I leverage these in my workflows? One of the tasks you might want to accomplish, and this is true for individual research groups, it could be true for departments, it could be true at the university level, is to be able to host your own large language models. And oftentimes, these large language models have broad capability, but users would like to fine-tune them towards their institution or enterprise, right, or use cases. So, there are a number of steps that are involved in, say, taking a model that's pre-trained, curating some internal data, being able to fine-tune that model, because these models are large, doing that in a distributed fashion, going through that customization process, and then thinking about deployment.

So, thinking about how anyone within this global university ecosystem is going to be able to use these models, right? How many requests are coming through? Are there retrieval-augmented generation applications, which might be of interest, right? So, RAG is a set of techniques where you can bring your own data and have a chat interface that will then search your data and provide answers back in the chat. So, a very common use case, think of whether it's for your coursework or for your research, right, being able to chat with papers just to get a better idea of fields. And then, certainly, guardrails.

So, we want these tools to be used responsibly and ethically, so making sure that both the user inputs, but also the outputs from these models, are within the standards of behavior that are consistent with university values. So, while we've talked a lot about running applications and, you know, maybe a little bit about training deep learning models, leveraging data science tools, inference, you can think of it as, you know, what you do with a model once it's trained, and optimizing a model for inference requires a different set of techniques. So, especially in that RAG example, you're taking a chat model, there's some kind of embedding, you're going to be ingesting your own data, you have to then take that data from, say, PDF format, right, and create a vector database that is in the same embedding space as your large language model, so that you can do vector search. So, you're stitching a lot of different aspects together.
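To make the retrieval half of that RAG example a bit more concrete, here is a small sketch in plain Python. Everything in it is an assumption for illustration: the `embed` function is a toy bag-of-words count standing in for a real embedding model, the "vector database" is just a list, and the three sample chunks are made up. The point is only the shape of the workflow: embed your chunks, embed the question in the same space, rank by similarity, and put the best chunks into the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): word counts instead of a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Stand-in for a vector database: store each chunk next to its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=1):
    """Vector search: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(index, question, k=1):
    """Stitch the retrieved context and the question into one prompt for a chat model."""
    context = "\n".join(retrieve(index, question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up sample chunks; in practice these would come from your own PDFs.
chunks = [
    "Nsight Systems profiles host-to-device transfers.",
    "Containers bundle PyTorch with its dependencies.",
    "Data parallelism copies the model to several GPUs.",
]
index = build_index(chunks)
```

Calling `build_prompt(index, "How do containers handle dependencies?")` would pull in the container chunk as context; the resulting prompt is what you would hand to the chat model.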

And it seems to be advancing on its own at this point, so maybe I'm running slightly short on time, but the main takeaway is that we provide you inference tools to be able to stitch together your own workflows, and if you're interested in experimenting with these, you can go to build.nvidia.com to start, you know, first experiencing models in a web UI, prototyping your own applications, and then moving towards deployment. And because these are, let's say, NRI solutions, the deployments are meant to be portable, so they should be able to run on a workstation, on-prem, or in the cloud. All right, finally, I will wrap up with a couple of resources that you may find interesting.

So, I mentioned build.nvidia.com, the NVIDIA developer ecosystem. I find it a great way to stay on top of all the work that NVIDIA is doing in collaboration with industry and universities, right, so how different libraries, SDKs, frameworks you may leverage for interesting applications to drive fields forward. The vast majority of NVIDIA software is open source, so I highly recommend going to github.com/NVIDIA.

It changes fast, so one thing to keep an eye on, especially if you join the, if you open a developer account, but even in the absence of that, there's a large number of on-demand videos, so whether you're looking to learn more about new fields, or just looking to see some talks from recent conferences, it's also a place I go to learn about new parts of our stack, parts of our stack that are new to me. When we think about learning, this is where I'll dive more into deep learning, at least conversationally, learn.nvidia.com is the landing page for our formal education offerings. So, on the deep learning side, you can go to browse paths here, and you can explore different learning paths based on your domain.

So, if you're interested in robotics and visualization, that will have a certain set of recommended courses. If you're interested in building large language model applications, that will have a set of recommended courses. One of the courses that we offer and that I teach, which is one of my favorites, is data parallelism.

So, when you're training deep learning models, and you're now reaching the point where training on a single GPU is taking longer than you would like it to, the first step is to then parallelize that process. So, you would have multiple copies of your model across different GPUs, and each one sees a different batch of data. And then, there's communication that has to occur at each step in order to make sure the model in each GPU is the same.
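Here is a tiny sketch of that idea in plain Python, with no GPUs involved. It is an illustration under stated assumptions, not how any framework implements it: the "model" is a single weight `w` in `y = w * x`, each list entry plays the role of one GPU's replica, and `all_reduce_mean` is a hypothetical stand-in for the communication step. In PyTorch, this pattern is what DistributedDataParallel handles for you.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for the all-reduce collective: every replica receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(weights, shards, lr):
    """One training step: local gradients, synchronize, identical update everywhere."""
    # Each replica computes a gradient on its own batch of data.
    local_grads = [gradient(w, shard) for w, shard in zip(weights, shards)]
    # Communication step: average the gradients across replicas.
    synced = all_reduce_mean(local_grads)
    # Every replica applies the same averaged gradient, so the copies stay equal.
    return [w - lr * g for w, g in zip(weights, synced)]


# Made-up data following y = 2x, split evenly across two "GPUs".
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]
weights = [0.0, 0.0]  # one copy of the model per replica, same starting point

for _ in range(200):
    weights = data_parallel_step(weights, shards, lr=0.01)
```

With equal-sized shards, averaging the per-replica gradients gives the same update as one device seeing the whole batch, which is why the replicas end up with the same weight.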

So, that could be, if you're already familiar with fundamentals of deep learning, that could be a common or an interesting next step to take in your learning journey. And so, please feel free to explore. The courses range from 30 minutes to an hour, self-paced, all the way up to full eight-hour long in-person classes.

I'll also say our developer program is free. So, signing up for that gets you access to a lot of that learning. So, if you ever encounter sort of something where you say, I don't want to pay for this, you know, anything like that, try the developer program first.

If not, you can sort of get in touch with me. I'll figure out a way to get you that content. Yeah, and this is something, because we're seeing a lot of AI development at scale, especially by domain experts now, whether there's a critical mass here in Columbia or even among New York universities in general, we can work with the relationship managers across the city to figure out what works best, but we can talk more after this based on the feedback that we get.

So, with that, I think we have five minutes left for questions. I would like to thank everyone for being here, both in person and online. And thanks for the great things to make sure that everyone online is hearing what's going on in the room.

So, with that, maybe I'll turn first to any questions from the group. Anything from the chat? We've asked any questions. So, yes.

Any questions in the room? I was going to ask how you guys came about joining NVIDIA, the individuals and what your career paths have been. You mentioned you've done data science for a while. Yeah, so my background is in physics.

I worked for a number of years doing applied physics research. What was the question? Oh, yes. Thank you.

Yeah. So, the question was, what were our paths to NVIDIA? All right. So, I joined NVIDIA earlier this year in January.

So, background in physics, my applied research was in quantum computing, quantum sensing and applications, and I was exposed to machine learning through solving problems in that realm. I spent a few years in applied data science and was just looking for something new. We are both part of the higher education research team.

So, I think having a background in that broader applied research field and also just being lucky that the team happens to be looking for somebody with a little bit of quantum background, a little bit of physics background, some broad data science skills at the time was just right to join the team. So, that's what brought me to NVIDIA. I was just super excited about joining and the experience so far has not disappointed.

It's a ton to learn. It just probably makes it so fun. I'd tell you afterwards.

We did get a question from online. How does GPU compare with CPU in terms of cost? Do they get outdated at some point as technology improves? Yes, that is a great question. So, one of the ways you can look at the cost of computing is through total cost of ownership.

So, the initial outlay is going to be the cost of a CPU system that can run a certain workload versus a GPU system that will run a certain workload. For workloads which are parallelizable and will run performantly on GPUs, the total cost of ownership for the right workloads can be lower for GPUs. But I think it's really workload specific, and not just a single workload, but the breadth of the workloads that, you know, a research group or research computing enterprise is facing, to figure out what the right way out is. Do you have anything to add to that? Well, no, I think just in terms of the question about do they get outdated at some point as technology improves, we still see people who are using many generations older GPUs to do their workloads because it's what they have.

And the typical refresh cycle on a lot of technology, you know, a lot of places will be four years, maybe stretched to five years. Some GPUs are still running that are 10 years old. And that's great.

I think the question is what is the tradeoff for how long you want to spend for that job running and what's the energy consumption to do it? The GPUs today are much more energy efficient per compute unit than older GPUs. Challenge is people want to do more compute with them. So you hear a lot about how much energy these things are actually taking.

It's because the problems that people are having and that they're using the GPUs for, they require more, you know, more flops to be able to actually launch. But you could not probably do some of these, you know, LLM training and things like that within your lifetime on older generations of GPUs. So the changes really are about how do you accelerate those workloads to make the most of your time to then be able to put them into practice, to be able to use them and iterate on them.
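One way to reason about that tradeoff is a back-of-the-envelope calculation like the sketch below. The formula is a simplification assumed for illustration (purchase price plus energy, with the machine busy all the time and nothing else counted), and the function names are hypothetical. Any numbers you plug in are your own; none come from this talk.

```python
def total_cost_of_ownership(purchase_price, power_kw, price_per_kwh,
                            hours_per_year, years):
    """Simplified TCO (assumption): purchase price plus energy over the refresh cycle."""
    energy_cost = power_kw * hours_per_year * years * price_per_kwh
    return purchase_price + energy_cost


def cost_per_job(tco, hours_per_year, years, hours_per_job):
    """Spread the TCO over every job the system can finish in its lifetime."""
    jobs = hours_per_year * years / hours_per_job
    return tco / jobs


def compare(system_a, system_b, price_per_kwh, hours_per_year, years):
    """Return cost per job for two systems described as dicts with
    'purchase_price', 'power_kw', and 'hours_per_job' keys."""
    costs = []
    for system in (system_a, system_b):
        tco = total_cost_of_ownership(system["purchase_price"], system["power_kw"],
                                      price_per_kwh, hours_per_year, years)
        costs.append(cost_per_job(tco, hours_per_year, years, system["hours_per_job"]))
    return tuple(costs)
```

For example, with purely hypothetical placeholder values, `compare({"purchase_price": 10000, "power_kw": 0.5, "hours_per_job": 10}, {"purchase_price": 30000, "power_kw": 1.0, "hours_per_job": 1}, price_per_kwh=0.10, hours_per_year=8000, years=4)` shows how a system that costs more up front and draws more power can still come out cheaper per job if the workload runs much faster on it. Whether that holds depends entirely on the workload.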

All right. Last question. This is from Al.

Al, can you suggest any resources that are more specifically focused on GPU use with Python and Jupyter notebooks and LLMs? So any resources targeting LLMs and Jupyter notebooks? See, so you mentioned some of the courses that are available in the developer program. So it's if you create a developer account and then go to learn.nvidia.com with that same login, you'll see a number of courses that are available. And I believe some of our one hour long introduction to LLMs and building apps with LLMs are freely available through the developer program.

And all of our courses are offered in a Jupyter environment. So for some of those courses, you're working purely in Jupyter notebooks. For other courses, it's a combination of Python modules that you will run through a Jupyter interface.

We also have a bunch of workshops, a lot of which, you know, the content is actually on GitHub. So if you just want access directly to the notebooks, you can sort of walk yourself through those if you want to. Many of those have videos associated with them, too.

So if you want to kind of walk through, you know, an instructor-led video with whatever the repo is, you can. Yeah, I'll pull up here also. I'm trying to think of a good example.

Deep learning examples. So NVIDIA's GitHub has a number of examples. Let's say you're interested in deep learning examples in PyTorch or in the field of drug discovery.

And you want to learn how transformer models can be applied here. This looks like not a great example. Let me find a Jupyter notebook.

Well, we're just about out of time, so we can do that. But I would say, poke around in NVIDIA's GitHub. Within each repository, you know, there's one I know well.

Yes. So Modulus is our framework for physics-informed neural networks and physics-informed machine learning generally. So if you go to Modulus and look at some of the examples, there should be examples right here.

Deep learning weather prediction. Yes, some have Jupyter notebooks, some don't. But I would say, poke around in the examples folders.

Sometimes there are notebooks folders within any of the repositories that align to the subject you're looking at. And there should be some examples there. Thanks, everybody, for coming.

Thank you so much. Thank you. Thanks.

September 26, 2024

NVIDIA led a session on how GPUs can super-power your research computing. Novices can find out when GPU-powered computing is appropriate and how to get started, and more advanced users can bring their discipline-specific questions to NVIDIA's GPU experts.