Optimizing HPC Workloads with Python | 2026 HPC Training Series #5

Welcome, everyone. I'm Sam Cho. This is part of the Research Computing Workshop Series at Columbia. Today I'm going to cover optimizing HPC workloads with Python. We're going to look at three tools and three strategies for getting more performance out of our cluster. A quick note on how today is set up: this is a demo-focused presentation rather than a hands-on one. The main reason is GPU access. The free-tier accounts you were all provisioned with don't include GPUs, and running jobs from everyone simultaneously would slow things down and impact other researchers actively using the cluster. So I'll be running the demos. But everything from today will be shared after the session, the slides, the scripts, and the full recording, so you can go back, review it, and try it at your own pace. Another thing to note: the way I use Python in my HPC work may differ from your research workflows. The goal isn't to be exhaustive; it's to give you concrete strategies that you can adapt to your own work.
A little bit about me. I'm a research systems engineer here on the HPC team. My background is in IT systems, IoT, and system administration. I started out building web dashboards in PHP, and then my first supervisor pushed me towards Python, and I never looked back. It's versatile, readable, and far more approachable than C or Perl. In this role, Python is my default tool for things like automating tasks, generating reports, and even supporting some research workflows. I can't say I'm a Python expert, but you don't need to be one to get real value out of it in HPC, and that's part of what I hope you take away from today. As a quick introduction, here's a project I worked on at my previous job that was recently published. It's NASA-funded plant research run at elevated CO2 levels. I built a Python control system that manages CO2 and lighting and logs temperature and CO2 history in real time.
The paper is linked here, so you can check it out after the session. But the thing I wanted to point out is that Python isn't just for scripting or data science. It's flexible enough to run real scientific control systems like this. Here's another example, based on my day-to-day work here on the HPC team. It's not a textbook example, but a research group needed weekly reports on their cluster usage: things like CPU hours, GPU hours, memory, and job counts. Before, somebody was pulling this data manually. I wrote a script that gathers it automatically and formats it into this clean report. So a manual task is now fully automated. That's the kind of practical win Python delivers, and it's a pattern you can apply to your own research workflows.
Before we dive into HPC-specific things, let's take a moment to ground ourselves: what is Python? Python was created by Guido van Rossum in 1991 and named after Monty Python's Flying Circus. That's not a coincidence. The design philosophy is the same: it's supposed to be readable, simple, and approachable. And today, it's arguably the most widely used programming language. Here are some real-world examples on the slide to give you a sense of its reach. You've got e-commerce recommendation engines, like Amazon and Netflix, game scripting at Ubisoft and EA, rendering pipelines at Pixar and Blender, and of course scientific research on HPC systems like ours. So why is Python the language of choice? The first reason is that you write dramatically less code to accomplish the same thing. In research, that means less time fighting the tools and more time focused on the actual science. The second is that you almost never have to build from scratch. Python has an ecosystem of pre-built libraries that covers virtually every scientific domain. Third, Python is really easy to learn. The syntax reads like plain English: a for loop looks like a for loop, and an if statement looks like an if statement.
You don't need to be a software engineer to get productive fast. Next, Python is the undisputed language of AI and machine learning. And last, Python has one of the largest and most active communities in the world. Someone has already solved your problem and posted the answer. It's not a shortcut, that's just how Python works. On the next slide, I'm going to cover some libraries commonly used here on the HPC. Here are five common third-party libraries used on an HPC system: NumPy for numerical computing, TensorFlow for deep learning, pandas for data analysis, mpi4py for parallel computing, and PyTorch, again, for deep learning.
Let's talk about CPU and GPU hardware before we dive into Python, because understanding what's under the hood helps everything else make sense. The CPU is the brain of the node. On our Insomnia cluster, each node has around 80 physical cores, about 160 with hyperthreading, each one designed for complex sequential work, one task at a time. Its superpower is control flow: decisions, branching, logic. The CPU handles anything you throw at it. But when you need the same math operation to run a billion times, a specialist isn't what you need. You need an army, and that army is the GPU. Where a CPU has a handful of powerful cores, a GPU has thousands of tiny identical cores. We have H100s on the Insomnia cluster, and they have over 14,000 cores, all executing the same instructions simultaneously, just on different pieces of data. So the GPU doesn't think, it executes. Hand it a million numbers and say "multiply each by 2," and it does them all in one shot. Raw volume over versatility: that's the throughput engine. You'll likely hear the term CUDA a lot when working with GPUs. CUDA is NVIDIA's platform that bridges your Python code and the GPU hardware. Without it, writing code for thousands of GPU cores would mean
speaking machine code directly, which is not practical. The one line worth remembering is tensor.to("cuda"). This hands the data off to the GPU. One line of code in Python, and CUDA handles the rest underneath. That's the simplicity of Python combined with the raw speed of the GPU hardware.
So, choosing between the two. Imagine you're running a factory that produces high-tech sensors, and you have to choose the tool based on the job. On one hand, you have the CPU, which is like a precision 3D printer. It handles the complex custom work: the one-off prototypes, unique steps, branching logic. It's brilliant when the task requires thinking and adapting. On the other hand, the GPU is like a stamping press. You set the mold, let's say a matrix multiplication, and every time it comes down, it produces thousands of identical results at once. It doesn't think, it just executes at scale. So the question isn't which is better, it's which one matches your job. In modern HPC, most of the heavy lifting is stamping-press work. Taking a step back, why has GPU computing exploded in the last 5 years? Three things converged at once. Number one, datasets got enormous. We're not analyzing hundreds of rows anymore; we're dealing with billions of data points that CPUs can't keep up with. Second, most complex problems, like how a brain works, how a storm moves, or how a protein folds, can be broken down into millions of simple matrix multiplications, and that's exactly what GPUs are built for. And lastly, it's more economical. For this kind of work, one GPU can do the work of hundreds of CPUs at a fraction of the cost. So the major breakthroughs of the last 5 years, like large language models, climate simulations, and AlphaFold, happened because of this shift. Even our own cluster data tells the same story, which I'll show you shortly. That brings us to the importance of Python. So where does Python fit in all of this?
Python is the control room. It sits on the CPU, organizes the data, makes decisions, and calls the right tools at the right time. The Python examples we'll go through today are telling the cluster or the GPU what to do: readable, flexible code on the surface, with thousands of cores underneath. Python didn't replace C or Fortran in HPC. It became the language that tells them what to do, and that's arguably a more powerful position.
Now that I've covered Python and the hardware fundamentals, let's bring it back to our HPC cluster. I'm going to briefly talk about the footprint of our clusters. But before I go on, if you haven't worked with an HPC system before, I'd encourage you to check out the Intro to HPC workshop from earlier in the series, which covers the foundations of HPC. Here you can see the GPU progression across our three clusters. Terremoto, our oldest cluster, has a GPU footprint under 9%. Our next cluster, Ginsburg, goes up to 13%. Insomnia, our newest cluster, is already at 35% and climbing. Each generation reflects how research computing priorities have shifted. And this isn't unique to Columbia. It mirrors what's happening at HPC centers around the world. Research computing is moving from CPU-dominated to GPU-accelerated. Simulations, machine learning, large-scale data processing: the GPU is increasingly where that work lives.
Let me briefly go over navigating our Insomnia cluster. If you attended the Linux and Bash workshops, this should all look familiar. This is how you access the cluster. You SSH using your UNI credentials at insomnia.rcs.columbia.edu. The system will prompt you for your password and Duo authentication, the standard two-factor login on most Columbia systems. On a Mac, the terminal works out of the box. On Windows, you can use PowerShell or PuTTY to SSH.
I'm already logged in here on the side, so we can keep moving, but when you do this yourself, this is how you'll get access to the cluster. One quick note on your accounts before we go further. Everyone who signed up has been provisioned with free-tier access. This comes with a few limitations worth knowing about. The two main ones are that there are no dedicated GPU nodes in the free tier, and there is no root access. That's standard for shared cluster environments; root access on a multi-user system would be a security risk. For GPU access, you're not completely locked out. You can still request a GPU through the short partition. The trade-off is lower priority in the queue, so you may have to wait a bit longer for your job to start. But the resources are there when you need them. So after the workshop, if you want to experiment with the GPU examples yourself, the short partition is the way to get in.
Before running any of our example scripts, we need to properly configure our Python environment. To do that, I first load a module called Anaconda. Anaconda is the recommended Python distribution for HPC. It's already installed on Insomnia, so you don't have to set up anything yourself. Just load it through our module system, and we'll go from there. The key thing about Anaconda I want to point out is that it gives you Conda. Conda is a package and environment manager that's aware of non-Python dependencies, which makes working with Python packages and dependencies a lot smoother. I use it on a day-to-day basis, and you'll likely use it on the cluster on a day-to-day basis too. The key feature is environment isolation. Each project gets its own self-contained environment: its own Python version, its own packages, its own dependencies. Nothing bleeds between projects.
A practical example: let's say one project needs PyTorch version 1 point something, and another needs PyTorch version 2 point something. Without Conda, those will conflict. With Conda, each lives in its own bubble. So you can create multiple conda environments with different package versions. Anything you install goes into the active environment only, not the system and not other projects. You simply activate a different conda environment, and you get a completely different set of packages.
So, let's walk through the actual setup commands. This is the sequence you'll run when you get your research environment ready. First, you load the Anaconda module. I'll do it here at the bottom (S20239), and then you can check that it's loaded by typing ml again. You can see it's loaded. Before doing anything else, type which python. This shows the system Python path currently in use, and then I'll show what happens after you activate your conda environment. Next, and I've already done this, is this export: CONDA_PKGS_DIRS equals $HOME. This sets the package directory to your home folder, which ensures packages are installed in your home directory. Again, it's already set for me. Then we move on to creating the conda environment. It's simply conda create --name, then the name of your conda environment, then python= and the version you'd like. I've already created it above here. Once you hit enter, it'll start creating your conda environment. Once it's done, it'll tell you to activate your conda environment. Then you simply type conda activate and the name you created, which in my case is MyConda environment. And here, you'll see that it's already activated. Once activated, you'll see the environment name in parentheses, and that means you're inside your isolated environment.
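The which python check can also be done from inside Python. What follows is a minimal sketch, not part of the workshop scripts: the function name describe_environment is hypothetical, and it assumes the environment was activated with conda activate, which is what sets the CONDA_DEFAULT_ENV variable.

```python
import os
import sys


def describe_environment():
    """Report which interpreter is running and which conda env is active.

    Hypothetical helper for illustration; not from the workshop scripts.
    """
    return {
        # Path of the Python binary executing this script. With an environment
        # active, this should point inside that environment's directory.
        "python": sys.executable,
        # Root directory of the environment the interpreter belongs to.
        "prefix": sys.prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If conda_env prints None, or the python path does not sit inside your environment's directory, the environment is probably not active in that shell.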
So, everything going forward, when you install various libraries, stays inside the environment and doesn't affect anything on the system. Now, if I type which python with my conda environment activated, you can see the Python path is now my conda environment. If I run cat .bashrc, which I'll do down here, look for the conda initialize block, from here to here. This confirms that Conda is properly set up in your home directory. And then if you run conda env list, as we wait for this to populate, it will show you the list of conda environments that you created or that the system has. This is useful for keeping track of what you have and switching between projects. So all three of these checks, which python, looking at your .bashrc file, and, once this populates, conda env list, are things we look at when we're troubleshooting environment issues, so this is a good habit to build. I'll wait a couple of seconds for this. Actually, while we wait, I'm going to move on to the next slide: installing Python packages.
Once you're in your environment, pip is how you install various Python packages. It's straightforward. You run pip install, followed by the package name. Okay, going back up here, these are the two conda environments I created, MyConda environment and my Python environment. These are the system conda environments here below. And this asterisk denotes the currently activated conda environment. So once it's activated, if you use pip install for whatever package name, it'll go into this conda environment. For this workshop in particular, we'll be using mpi4py, torch, and torchvision. I've already installed them on my end for timing purposes, but that's what you'll do when you're testing these scripts on your own. And again, always check before running pip install that your conda environment is activated.
If it's not, packages will go into the wrong location, and your scripts won't find them. So it's a quick check.
Next, I'm going to talk about Slurm. Slurm is the job scheduler that our cluster runs on. It's the system that decides who gets which compute resources and when. Every job you run on the cluster goes through Slurm. If you missed the Bash scripting workshop, which I believe was on Tuesday, I'd encourage you to go back and watch the recording. It covers Slurm directives in more depth than we'll go into today, and it's useful context for understanding how to write efficient job scripts. But I'm going to share some common Slurm commands, the first being squeue. I'll go down here. squeue shows all the pending and running jobs. If you drill down a little more with squeue --me, it shows only your pending or running jobs. In my case, I have an interactive session running here. Then there's scontrol show partition. I'm going to use the free partition in this example. This shows you the partition's available nodes, time limits, resource caps, and so on. I wanted to highlight that these are the nodes assigned to the free partition, and it shows things like CPU and memory. And then you can go into a node with scontrol show node; in this case I'll use one of the free partition nodes. This shows details about the specific node, including cores, the state of the node, memory, and so on. Here I wanted to highlight the state. The state tells you whether the node is idle, allocated, or mixed. Right now it's in an allocated state, which means no other jobs can land on this node, because it's full. Then there's the sacct command. I'll pass it my UNI. This shows the history of the jobs you ran, so you can look back at it later. And then, lastly, sinfo.
The sinfo command shows you which nodes are idle, allocated, or down, and what GPU resources are available. So here you can see different states. You've got mixed, draining, drained, and allocated, and then which GPUs are available. Again, allocated means no other jobs can land on these nodes. Mixed means there are a couple of jobs running, but there's still room for other jobs to run on the node. Drained and draining are more administrative: we drain nodes to troubleshoot something. That's typically why you see a drain state. Understanding Slurm helps you write more efficient job scripts and spend less time waiting in the queue, so again, I encourage you to watch the Bash scripting workshop if you haven't.
So, interactive jobs. An interactive job gives you a live shell directly on a compute node. No batch script, no waiting for an output file. It's your best tool for development and debugging. On this slide, there are two salloc commands. The first one is just a basic CPU-only request. The one that I've used today, which is in this terminal here at the top, requests 1 node, 16 tasks, 4 CPUs per task, 4GB of memory, a GPU, and 2 hours. That's all I needed to cover all three examples I'll be running today. Once the allocation is granted, you're on the compute node, so here you can see that I'm on INS86. You can run your scripts directly, and then simply type exit when you're done to release the resources. Use interactive jobs for development and testing, and save batch scripts for long production runs. I'm going to quickly run these two commands down here again, starting with squeue --me. This shows the interactive job that I'm running. And if you want to drill down more, run scontrol show job and then the job ID. This will show the request that I made.
So, for example, the 16 tasks. You can also see the start and end time here. I requested 2 hours, and it shows me when the job will end. It shows all the things you've requested, the working directory, everything about your job. It's a very useful tool for seeing what's happening with your job.
Next, you've landed on a GPU node, so you want to validate that you actually got a GPU. I like to do this especially when running GPU jobs. Before running any code, just run the command nvidia-smi. The output is very spread out here, but it prints a few things, and the ones worth checking are these. First, the GPU name: here I landed on an RTX A6000. Then GPU-Util on the right side tells you how much of the GPU is being used. I haven't run anything, and it seems like no one else is on this node or GPU, so it's at 0%. The PID and process section down here is where, for our examples, you'll see Python processes appear. There are no running processes right now, so you won't see anything. And on this side, you've got the GPU memory usage, which tells you how much VRAM you're consuming.
Okay, before we start the examples, here's the roadmap we're going through. I have three examples, three levels of parallelization, each solving a different bottleneck. The first is joblib. You have a loop, and you want it to run faster across cores. It takes one import and a wrapper, and you're done. Level 2 is MPI. Your problem is too big for one machine, so we do multi-process, multi-node parallelism. It's the standard for large-scale simulation. The third level is the GPU with PyTorch. The bottleneck is raw math throughput, so one line moves your data to thousands of GPU cores. Each level builds on the same intuition. A quick note before I go into each example: I'm going to walk through the scripts at a high level, what each does and what to watch for, and I'm going to run it live so you can see the output.
After the session, I'd encourage you to go back, look through the scripts on your own, read them line by line, try running them with different inputs, and see what breaks and what works. So now I've got my resources allocated, the conda environment is set up, and the GPU is validated, so let's go through the examples.
First, we're going to start with multi-core parallelism with joblib. Joblib turns a standard loop into a parallel one with minimal code change. It handles the worker processes, distributes tasks, and collects the results. It's one library and a few changed lines, and your loop runs across the cores that Slurm gave us. The key flag in the salloc command for this example is cpus-per-task equals 4. This is the value the script reads to know how many joblib workers to spin up. The other flags are there for the later examples. In this example, the goal isn't to show that parallel is faster. It's to find where it wins and where it doesn't. That's the real lesson here. We start the script with imports. There are four imports for the entire script. time is the stopwatch. os reads the Slurm environment to find out how many cores we have. math gives us the factorial as a stand-in for real compute work. And joblib is the one that matters. Parallel distributes work across the cores. delayed queues a function call instead of running it immediately. So it's just one external library, and that's all you need for multi-core parallelism in Python. Next, we move on to the worker function. This is a placeholder for whatever your research code does: a simulation step, a file transformation, an inference call. It takes two inputs. i is the task number, and wait is the number of seconds to sleep before computing. That sleep simulates I/O-bound work, like waiting on a file, a database, or a network response. While one worker waits, every other worker computes. That's where parallelism pays off.
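To make the worker function concrete, here is a minimal sketch of what such a placeholder might look like. The workshop script itself is not reproduced in this transcript, so the name worker, the factorial size, and the return value are assumptions for illustration.

```python
import math
import time


def worker(i, wait):
    """Stand-in for one unit of research work (hypothetical signature).

    i    -- task number, returned so results can be matched to tasks
    wait -- seconds to sleep first, simulating I/O-bound waiting
    """
    # I/O-bound side: the process is idle here, as it would be while
    # waiting on a file, a database, or a network response.
    time.sleep(wait)
    # CPU-bound side: a factorial as a stand-in for real computation.
    # The size 2000 is an arbitrary choice for this sketch.
    bits = math.factorial(2000).bit_length()
    return i, bits


if __name__ == "__main__":
    print(worker(0, 0.0))   # pure compute
    print(worker(1, 0.5))   # half a second of simulated waiting first
```

With wait set to 0 the task is essentially pure compute and finishes almost instantly; raising wait shifts it towards the I/O-bound case without touching the rest of the code.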
And then down at the bottom, math.factorial is the CPU-bound side. Adjust wait, and you shift between the two scenarios without changing anything else. When I run the code, I'm going to start with wait equals 0, then bump it up to wait equals 0.5, and we'll see whether parallel takes advantage.
Moving on in the script, we come to reading the environment and user inputs. Here, SLURM_CPUS_PER_TASK is set automatically by Slurm when your job launches. The script reads it and uses that number as the worker count. But if you run it outside of the HPC cluster, like locally on your laptop, it defaults to 1. The two input calls down here make it interactive. The script asks for the task wait and the task count at runtime. That's what makes this a live race rather than a hard-coded benchmark. It's good practice in your own scripts to let Slurm tell your code what resources it has, rather than hard-coding them. Your script then works on any allocation without touching the code.
The sequential loop is the baseline. Standard Python, one task at a time, in order. So, for example, if we give it 10 tasks at 0.5 seconds each, it always takes 5 seconds. It doesn't matter how many cores the node has, we're only using one. It's like a single cashier with 10 customers. No matter how fast that cashier works, the line still moves one person at a time. This is the baseline we're racing against, and it's the one that wins when tasks are too small. When the tasks are small, it's hard to justify parallelism.
And then we go into the parallel loop. Notice how similar it looks to the sequential loop. That's intentional. Joblib is a minimal change, not a rewrite. There are two new concepts here. The Parallel call opens the worker pool. Imagine all the checkout lanes are open at once instead of just one. The delayed call here queues the call instead of running it immediately.
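The environment lookup and the two loops described above can be sketched as follows. This is an illustration under stated assumptions rather than the workshop script: the function names worker and race are hypothetical, and the interactive input calls are replaced by function arguments so the sketch runs unattended. Parallel, delayed, and the SLURM_CPUS_PER_TASK variable are the real joblib and Slurm names.

```python
import math
import os
import time

from joblib import Parallel, delayed


def worker(i, wait):
    """Placeholder task: sleep (I/O-bound side), then compute (CPU-bound side)."""
    time.sleep(wait)
    return math.factorial(500).bit_length()


def race(wait, count, n_workers=None):
    """Time the same tasks sequentially and through joblib; return both timings."""
    if n_workers is None:
        # Let Slurm tell the script how many cores it has; fall back to 1
        # when the variable is absent (for example on a laptop).
        n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

    # Baseline: one task at a time, in order.
    start = time.perf_counter()
    seq = [worker(i, wait) for i in range(count)]
    seq_time = time.perf_counter() - start

    # Parallel version: delayed() records each call instead of running it,
    # and Parallel() hands the recorded calls to a pool of n_workers workers.
    start = time.perf_counter()
    par = Parallel(n_jobs=n_workers)(delayed(worker)(i, wait) for i in range(count))
    par_time = time.perf_counter() - start

    assert seq == par  # same function, same results, only the scheduling differs
    winner = "parallel" if par_time < seq_time else "sequential"
    return {"workers": n_workers, "sequential": seq_time,
            "parallel": par_time, "winner": winner}


if __name__ == "__main__":
    print(race(wait=0.0, count=10))   # tiny tasks: pool overhead tends to dominate
    print(race(wait=0.5, count=10))   # waiting tasks: extra workers can overlap the waits
```

Which side wins depends on the allocation and on the task size, so the sketch reports the timings rather than assuming an outcome.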
So, think of it as a supervisor at the cashier lines handing a sticky note to a worker rather than telling them to start now. Joblib then handles the scheduling automatically. You don't assign tasks to a specific core. With 4 cores and 10 tasks, roughly four tasks run at once, so the batch finishes in three rounds instead of ten, or roughly a third of the time. The same function, the same output, just distributed. Lastly, this is where the real lesson lands: not just when parallel wins, but when it doesn't. The else branch is the one worth paying attention to. Spinning up workers, packaging arguments, collecting results, all of that takes time, and the cost is paid whether your tasks are large or small. So if tasks are too quick or too few, the overhead outweighs the benefit and sequential wins. That's Amdahl's law in practice.
So let's see this live. I'm going to try it with wait equals 0 and 5 to 10 tasks. I'll run the script with Python, and it's going to ask me for the two input variables. First, I'm going to go with zero, and then 10 tasks, so 0 seconds per task and 10 tasks. You can see here that the winner is sequential, and parallel is slower due to overhead. I'm going to run the same code again, but give it 0.5 seconds per task, and 10 tasks again. The sequential run should be around 5 seconds, and yep, it is. Here you can see the tasks running across 4 cores at the same time, and you see that parallel wins. That crossover is the sweet spot. Finding it for your own research code is the whole point of this demo. The point is not to make your job more parallel; parallel isn't always the best solution. You have to identify what your job is doing and whether parallelism works for it.
Now we're going to step up to the next level, multi-node parallelism with MPI. MPI is the Message Passing Interface. The name tells you exactly how it works.
Unlike joblib, MPI launches completely independent processes. Each one has its own memory. The only way they exchange data is by explicitly sending messages. Nothing moves by accident. Every data transfer is written out in the code. The processes can be spread across hundreds of nodes with no memory conflicts. That's why MPI is the standard for large simulations. There's more to think about than with joblib, but the trade-off is practically unlimited scale. mpi4py is the Python wrapper for MPI. It gives you the full MPI standard without leaving Python. The practical benefit for scientific computing is NumPy integration. When you pass arrays between processes, mpi4py handles serialization automatically. No manual packing or unpacking, it just works. And if you're coming from MATLAB or R, this is your fastest path to cluster-scale parallelism. You write the Python you already know and add MPI communication calls where your data needs to move between processes, and that's it.
So, let's go into our example. One quick note before we run the MPI example. I'm using multiple tasks on a single node today, which is just a practical choice for the workshop. No multi-node reservation was needed for these examples. The important thing to note here is that MPI doesn't care where tasks land: 16 tasks on one server, or spread across 16 servers, the code runs identically. In my salloc command, I used nodes equals 1 and 16 tasks. As far as the code is concerned, that is essentially the same as nodes equals 16 and ntasks-per-node equals 1. Our first example is the hello world of MPI. It's so simple, it fits on this one slide. You have one import here and three setup calls down here. COMM_WORLD is the group containing every process. Get_size returns the total count. Get_rank returns the process's unique ID. Get_processor_name returns the hostname of the node it's running on. And then down here, the print line brings it together. Every process runs the same line, but rank differs per process, so each one prints something unique.
One script, multiple outputs: that's MPI. So, let's take a look. I'm going to run it with 4 processes, and this is what the output should look like. So: mpiexec, four processes, Python, and the hello script. What this does is launch 4 independent copies of the same script. Here, 4 processes, 4 unique messages. I'm going to run it again, just to show you that the order is not guaranteed. The first time I ran it, it came out in order, 0, 1, 2, 3, but if you run it again, you get 1, 2, 3, 0. MPI makes no guarantees about which process finishes first, so never assume rank order in your own programs. So in this example, one script ran everywhere once, and rank did all the differentiation. Each process just checked who it was and acted accordingly.
The hello world example gave us the blueprint; now let's use it on a real problem. In this example, we're going to calculate pi by dividing the area under a curve into 10 million rectangles and summing them up. Each rectangle is independent, and no process needs to know what another is doing, which makes it a perfect case for MPI. We're going to run the same script 4 times, with 1 process, 2, 4, and 16, and watch what happens to the time. So, let me jump into the script. First, as always, the imports. The one external library is mpi4py, which again gives you the full MPI standard in Python. The next is time. It's the standard Python stopwatch, like we used with joblib. It records start and stop timestamps to measure how the wall time changes as we add more processes. Now we move into the calculation function. This function is what every rank calls on its own slice of work. It's just standard math, nothing HPC-specific. The key detail here is h equals 1.0 over rectangles. This is the global step size, the same across every rank.
That's what keeps the math consistent, and it means the partial sums add up directly at the end. And notice there's no MPI in this function. The computation is completely separate from the communication. MPI divides the work before this runs and collects the results after. Next is the MPI setup, the same 3 lines from hello world. COMM_WORLD is the group. Rank is who this process is. Size is how many processes are running. Together, they tell each process where it fits and how much work to take on.
Then comes dividing the work. Each rank figures out its own slice of the 10 million rectangles independently, and no communication is needed. Base gives the equal share, right here. The remainder handles any leftover. Every rank ends up with a unique start_idx and end_idx. No overlaps, no gaps. The global problem stays fixed at 10 million across all runs. What changes is how many processes share it. That's how we keep the comparison fair.
Now the actual calculation. Every rank computes independently. This is where the speedup happens. comm.Barrier is the starting gun. Everyone waits until all ranks are ready. Then rank 0 starts the clock. Without the barrier, the timer could start before some ranks are initialized, which would be an unfair measurement. Then every rank calls the pi calculation on its own slice simultaneously. No communication, no waiting, just parallel computing. And then we collect the results. Once every rank finishes computing its slice, everyone calls comm.reduce. This is the collection step. Every rank sends its local sum to rank 0. Rank 0 adds them all together into one total. Every other rank gets None back. They send their number, and that's their only job. Only rank 0 has the final answer. It stops the clock and sets the final pi equal to the total. The combined total is already pi. No extra math is needed.
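The slicing and summing logic can be tried without an MPI installation. The sketch below is an illustration, not the workshop script: it simulates the ranks with an ordinary loop, and it assumes the curve is 4 / (1 + x^2) on [0, 1] integrated with the midpoint rule, a common choice whose area is pi; the transcript only says "the area under a curve". The function names and the smaller default rectangle count are also assumptions.

```python
import math


def partial_pi(start_idx, end_idx, rectangles):
    """Sum the rectangles in [start_idx, end_idx) under 4 / (1 + x^2).

    The integrand is an assumption; the talk does not name the curve.
    """
    h = 1.0 / rectangles              # global step size, identical on every rank
    total = 0.0
    for i in range(start_idx, end_idx):
        x = (i + 0.5) * h             # midpoint of rectangle i
        total += 4.0 / (1.0 + x * x)
    return total * h


def slice_bounds(rank, size, rectangles):
    """Give each rank a contiguous slice with no overlaps and no gaps."""
    base, remainder = divmod(rectangles, size)
    # The first `remainder` ranks each take one extra rectangle.
    start_idx = rank * base + min(rank, remainder)
    end_idx = start_idx + base + (1 if rank < remainder else 0)
    return start_idx, end_idx


def simulated_run(size, rectangles=100_000):
    """Stand-in for an MPI run: loop over ranks instead of launching processes.

    In an mpi4py script, rank and size would come from comm.Get_rank() and
    comm.Get_size(), and the final sum would be done by
    comm.reduce(local_sum, op=MPI.SUM, root=0).
    """
    partials = []
    for rank in range(size):
        start_idx, end_idx = slice_bounds(rank, size, rectangles)
        partials.append(partial_pi(start_idx, end_idx, rectangles))
    return sum(partials)


if __name__ == "__main__":
    for size in (1, 2, 4, 16):
        estimate = simulated_run(size)
        print(size, estimate, abs(estimate - math.pi))
```

Because the loop runs the slices one after another, this sketch shows only that the partial sums combine to the same answer for any number of ranks; it does not demonstrate any speedup.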
So you can think of this as four people, each counting a section of a crowd. They all call their number to one person, who adds them up. That's exactly what this comm.reduce is doing. One collector, and everyone else is just reporting in. At the last step, we're comparing against the baseline, and this is what turns the script into a proper scaling benchmark. When we run with n equals 1, calling for only one process, the script saves the baseline time to a text file. It just runs one process, no parallelism, just a clean reference point. And then when we run with 2, 4, and 16 processes, the script loads that file, computes the speedup, which is the baseline time over the parallel time, and prints a full report: the baseline time, the current time, the calculated pi, and how much faster it ran. When you're running it on your own, always start with n equals 1 to create that baseline file. So yeah, let's watch this scale. We're going to call for 1 process, 2 processes, 4 processes, and 16, each one independently handling its own slice of the 10 million rectangle calculations. No communication during the computation, just each process doing its own piece and sending it to rank 0 at the end. What we can expect is near linear scaling, so roughly 2 times, 4 times, and 16 times faster than the baseline. So let me first run the baseline. Again, this will create that baseline time. And then I'm going to run the next 3: with 2, then with 4, and then with 16. So, if you take a look at this: with 2 processes, it's almost 2 times faster than the baseline; with 4 processes, almost 4 times faster than the baseline; and then with 16, about 9 times faster. So this is near linear scaling on a real computation. That's the payoff of this whole demo. This calculating pi example scales perfectly because, again, there's no communication between processes during computation. Most real research code isn't always that clean.
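
To put the quoted speedups on a common scale, here is a small sketch of the arithmetic. Speedup is the baseline time over the parallel time, as described in the talk; parallel efficiency (speedup divided by the number of processes) is a standard companion metric that the talk does not mention and is added here as an assumption. The function names are hypothetical, and the speedup values are the rounded figures quoted in the demo ("almost 2", "almost 4", "about 9").

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup = single-process wall time / parallel wall time."""
    return baseline_seconds / parallel_seconds


def efficiency(speedup_value, processes):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup_value / processes


# Hypothetical timings, only to show the formula (not measured values).
print("example speedup:", speedup(8.0, 2.0))

# Rounded speedups as quoted in the demo, keyed by process count.
quoted = {2: 2.0, 4: 4.0, 16: 9.0}
for processes, s in quoted.items():
    print(f"{processes:>2} processes: speedup ~{s:.0f}x, "
          f"efficiency ~{efficiency(s, processes):.0%}")
```

On these rounded figures, efficiency stays close to ideal at 2 and 4 processes and drops at 16, which is a useful thing to check for when reading any scaling report.
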
Processes often need to share data mid-computation, and that cost can add up. So MPI is the right tool when the work is large and mostly independent. If you're doing MPI in the future, just keep this in mind. So now we've covered single-node, multi-core, and then multi-node with MPI. Now we're going to go into GPU computing with PyTorch. PyTorch is the go-to Python library for GPU computing. It's used in research in deep learning, climate science, simulations, anywhere that needs GPU-scale math. You never write CUDA directly. You write Python, set the device, and PyTorch handles the rest. The core data structure is the tensor, which is how PyTorch stores data. You move it to the GPU with one call, and all the math runs there automatically. You can develop this on a laptop and then deploy it on the cluster with no code changes. In this example, I'm going to be using the Modified National Institute of Standards and Technology dataset. I'm going to just call it MNIST. This is a dataset of 60,000 grayscale images of handwritten digits, each 28 by 28 pixels. Small enough to train quickly, real enough to be meaningful. It's nicely visual and intuitive: you can look at the images and immediately understand what the model is trying to do. No domain expertise required. So, this is what the dataset looks like. You can notice the variation. The handwriting styles differ dramatically from person to person. Some 7s may look like a 1, and some 9s could pass for a 4. That variation is what makes this a real machine learning problem. In this example, the model has to learn what makes a 7 a 7, regardless of who wrote it. This diagram shows what's happening behind the scenes as the network processes one of these images. So here, say it sees this 2. The image passes through layers that break it down into patterns and shapes, and then it predicts an outcome, let's say 2. It's not reading the digit the way we do; it's detecting signals in this model here.
I'm going to be covering exactly how we get this done in the code, but this is a nice diagram to keep in mind. So, let's get to the actual code. The hardware is set, the GPU that we called on, and the dataset, and then we've got PyTorch. Just to summarize: Joblib gave us more CPU cores, MPI gave us more nodes, and now what we're doing is offloading the math to the GPU entirely. We're going to train a CNN on the 60,000 handwritten digit images and compare CPU versus GPU timing. We're going to pick the hardware, CPU or GPU, and then the batch size. That's the whole point of this example. One thing to note down here, again, is in the salloc command. The --gres=gpu:1 is the most important part. This is what gives us access to the GPU. Without this gres, even if we were to land on a GPU node, it'll fall back to the CPU. So when you're requesting resources, make sure you have this --gres=gpu. So let's go into the script. We import 5 libraries. torch is the foundation. torch.nn is the building blocks: the layers and loss functions. torch.optim updates the model's weights. torchvision loads and preps the MNIST dataset. And then time is the same stopwatch we've used throughout all the other examples. One thing to note here is there's no import for CUDA. You never touch the GPU directly. PyTorch handles this once you set the device. The script starts off with an interactive setup. The first part is choosing your hardware. Here, the code specifically asks whether you want to use cpu or cuda. Also in this script, typos like "GPU" will default to CPU, and if you type cuda with no GPU detected, it'll fall back to CPU. Down here, torch.cuda.is_available checks whether the GPU is accessible to this job. This is important because it goes back to that salloc command.
If you did pass --gres=gpu:1, this returns True. If you didn't, this returns False. So again, without the flag, CUDA won't be available, even if you're on a GPU node. Once the hardware is chosen, the script asks for the batch size, and the suggestions change depending on the device. Batch size is how many images the model processes at once. Larger batches mean more parallel math per step, which is exactly what the GPU is built for. If you choose the CPU device, it'll suggest 64, 256, and 512. The CPU shares RAM with everything else on the node. If you push it too high, the OS kills the process. With the GPU device selected, it'll suggest 512, 2,048, and 10,000. On a GPU, dedicated VRAM is separate from the system RAM, which is why the GPU is built for this type of workload. Next, this goes into loading the data. With the device and batch size chosen, the script now loads and preps the MNIST dataset. transforms.Compose runs two steps on every image. ToTensor converts the pixels into a PyTorch tensor. Normalize rescales the values to help the model train more stably. This is standard preprocessing, nothing to do with GPU or CPU. The DataLoader over here feeds the training loop one batch at a time. num_workers equals 2 means 2 background worker processes preload the next batch while the GPU processes the current one, which helps keep things moving. The important thing to note here is that the data still lives in CPU RAM at this point. Nothing has moved to the GPU yet. That all happens in the training loop, which we'll get to in a couple of slides. Next is the convolutional neural network, the CNN model. SimpleCNN is the network that learns to classify the handwritten digits. Two parts work together here. We have the conv stack. It runs two rounds of Conv2d, which scans every pixel for patterns. ReLU filters out noise. MaxPool2d shrinks the output down.
And by the end of this, it detects edges, curves, and shapes across the whole image. The next part is the linear stack, which makes the prediction. Flatten unrolls everything into one vector. Two linear layers narrow it down to 10 scores, one per digit, and the highest score wins. model = SimpleCNN().to(device), this right here, creates the model and moves it to the chosen device in one line. So device is either cuda or cpu. This is where the GPU earns its place. Conv2d applies filters across every pixel of every image in the batch simultaneously, and a simple linear model can't show you that difference. A picture always helps me understand things better, so this is what the CNN model is actually doing. The image goes in, passes through the layers we just covered, and then a prediction comes out. In this case, 7 at 98% confidence. And then finally, the training loop. This is the same loop whether you're on a CPU or GPU. model.train() sets the training mode right here. The loop body runs once per batch. So, let's say we chose a batch of 512. That'll run about 117 times. And if we chose a batch of 10,000, that'll just run 6 times. So again, it's the 60,000 images. Down here, data.to(device) and target.to(device): these are the important lines. One call moves each batch to the right hardware. Everything that follows down here runs on that particular device, whether you chose CPU or GPU. The rest, down here, is standard: zero the gradients, forward pass, measure the loss, backpropagate, update the weights. Same for every batch and every model. Over here, we have torch.cuda.synchronize. It only runs when you choose the cuda device. GPU operations are asynchronous, so without it, the timer stops early. And if you chose CPU, it skips it entirely. So this training loop is what's actually teaching the model. With every pass through a batch, the network gets slightly better at recognizing that a 2 is a 2.
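
The batch counts mentioned above come from dividing the 60,000 training images by the batch size. A minimal sketch of that arithmetic follows; `batches_per_epoch` is a hypothetical helper, not part of the workshop script. It assumes PyTorch's `DataLoader` default of `drop_last=False`, under which a final partial batch still counts as an iteration, so the figure of 117 for a batch size of 512 corresponds to the number of full batches.

```python
import math

DATASET_SIZE = 60_000  # MNIST training images, as stated in the talk


def batches_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of loop iterations needed to see the whole dataset once."""
    if drop_last:
        # Only full batches are used; any leftover images are skipped.
        return dataset_size // batch_size
    # A final, smaller batch carries the leftover images.
    return math.ceil(dataset_size / batch_size)


for batch_size in (512, 10_000):
    full_batches, leftover = divmod(DATASET_SIZE, batch_size)
    print(
        f"batch size {batch_size}: {full_batches} full batches, "
        f"{leftover} leftover images, "
        f"{batches_per_epoch(DATASET_SIZE, batch_size)} iterations"
    )
```

Fewer, larger batches mean fewer trips through the loop, which is one reason the large-batch GPU run has less per-step overhead.
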
So finally, let's run the script. I'm going to run it in two runs. Again, the same script, the same model, the same data. The hardware and batch size are the only differences. So, python here. While this loads, PyTorch is setting up and the dataset is initializing, so that's why you're seeing this delay. For the first run, I'm going to choose CPU with a batch size of 512. This is just a suggestion. Again, the CPU shares RAM with everything on the node. After the session, you can try higher numbers and see where it breaks, or even try the lower suggestions. And when we run it on the CUDA device, I'm going to run that watch command, the nvidia-smi command, just to show you that the Python process is running on the GPU node itself. Sorry for the delay. It's taking longer than usual here. Any questions while we wait? Due to time, I'm going to, while we wait for this to load... oh, here we go. CPU. Run it against the CPU first, with a batch size of 512. So, this is running on a CPU device with a batch size of 512, with the batches coming from the 60,000 images. The times I've tested it, I think it took around 20 to 30 seconds. The thing I wanted to highlight with this example is, again, it's the same script, same model, same data. You'll see that not only did the GPU run the same job faster, but it ran it with a batch about 20 times bigger. Let me run it one more time. There must be other processes going on at the same time. We'll see if it goes any faster. Again, with this example, I just want to highlight that it just takes one line to switch to the GPU. So you can build it locally, take the same code to our cluster, and then it switches to a GPU and runs the job. So while that goes, I'm going to move on to the next slide. Just to close things up: again, 3 tools, 3 kinds of parallelism, and 1 message.
So, Joblib gave us more CPU cores. MPI gave us more processes across nodes, the standard for large-scale work. And then PyTorch, in this most recent example, gave us thousands of GPU cores. And this is the tool behind every AI and scientific breakthrough. I just wanted to highlight that GPUs are not optional anymore; they're the future of HPC. As you saw on our Insomnia cluster, it's already at a 35% footprint, and climbing. So if you're not thinking about GPU acceleration in your research, you're leaving performance on the table. But hardware only helps you if you know how to use it. So, ask yourself before every job: what's the bottleneck? CPU bound on one node? Then use Joblib. Too big for one machine? Use MPI. Is it matrix math or model training? Then use the GPU. So identify the bottleneck, match the tool, and request the right resources. That judgment is what makes the difference. Again, the scripts that I ran today are yours to keep. Start with your own research code, find the bottlenecks, and try applying one of these tools. Actually, I want to run this one more time, because it should actually be around 4 seconds. But, on ending the environment, here are some best practices to end the session. You can simply do conda deactivate, and you should see the environment name change to base. And then to end the session, you just type exit. That will release the compute resources back to the cluster, and this is best practice on an HPC system. Alright, let me just show this while I'm closing up. Here is where the scripts live. You can copy them to your home directory. Make sure you use the copy command, not the move command. This is the output showing what's inside the directory. You can modify the scripts however you want, break them, try different parameters. That's how you learn. And if you have any issues copying these scripts, just reach out to us and we'll sort it out.
And then, here are some Python resources, including two books that I recommend. LinkedIn Learning is free for all Columbia students and staff, so there are a bunch of Python courses there. Never underestimate Google. Most of the Python I learned on the job came from searching a specific problem and then landing on Stack Overflow. And then, for more research-specific Python content, Software Carpentry, linked on this slide, is built for that audience. Lastly, I know I went over. If there is a question, please do ask. If not, you can reach us at [email protected]. The next workshop, which I believe is on Tuesday, goes into Singularity. But yeah, any questions or remarks? If not, then thank you for joining, and I hope to see you at the next workshop.

April 2026

This training covers optimizing HPC workloads by detailing three parallelization strategies in Python: multi-core parallelization with JobLib, multi-node parallelism for large-scale work using MPI, and GPU-accelerated computing utilizing PyTorch for high math throughput. You will learn how to match the right tool to the job's bottleneck and why it is critical to leverage Python as research computing rapidly moves from CPU-dominated to GPU-accelerated systems.