Classic: Intro to Python for HPC 2025 | HPC Training Series #3

Okay, perfect. So welcome, everyone. Welcome to the third of our four-part series of our intro sessions.

On the screen right now, you'll see a roadmap of what we've accomplished over the last couple of weeks and what's to come. The first session was intro to Linux, which gave you an overview of how to navigate our systems. The second week was intro to Bash.

Bash is the scripting language that you use to actually harness the resources from our clusters, be it the RAM, the CPU cores, or the GPUs. And today, this is the third week, and we are now on to intro to Python. Today, we'll be overviewing the most popular tools and libraries in Python, such as PyTorch, NumPy, and TensorFlow.

So what you'll see is how you can optimize Python to leverage resources such as GPUs. In our HPC environment, we've actually seen workflows run 10-fold, even 30 times faster using GPUs over CPUs. And Sam will also give an overview of why GPUs are becoming so popular in HPC.

So with that, I'll hand the mic over to Sam. But before I do that, just a couple of house rules. Number one, please try not to interrupt Sam when he's in mid flight.

If you have any questions, copy them or write them in a Zoom chat window, and one of my admins will address them for you. Secondly, these sessions are being recorded. So if you want to just watch Sam execute now, you have the luxury of actually following this video later on at your own pace and running these jobs yourselves.

And lastly, we will be reserving 15 minutes at the end of this session. So if you have any questions you want to vocally raise at the end, you will have that 15 minutes assigned to you. So with that, Sam, it's over to you.

Perfect. Thank you, Max. Yeah, so just going over some expectations about this workshop.

This workshop will be run a little differently from the previous Linux and Bash workshops. Today, you won't be following along with the hands-on examples in real time. The main reason is GPU access.

The free tier accounts you all were provisioned with don't include GPU resources. And the GPUs I'll be using today belong to a shared resource pool used by multiple research groups. So to avoid overloading the system and slowing down this demonstration, I'll be the one running the jobs during the session.

And like Max said, don't worry, we'll share the slides, the scripts, and the recording after the workshop. So you can review it and try everything on your own. With that, just a little bit about me.

I'm a research systems engineer here on the HPC team. I come from a diverse background across academia and the corporate world with experience in IT systems, IoT systems, and system administration. Early on, I was coding things like web dashboards using PHP.

But that all changed when my supervisor out of college encouraged me to try out Python. It sparked my curiosity. And since then, I've made the effort to use Python wherever I can.

It's such a versatile and beginner-friendly language that it quickly became my go-to, especially compared to heavier languages like C, Perl, or even PHP. So yeah, now in this HPC role, I get to keep expanding my Python knowledge, whether it's optimizing our systems, running reports, automating tasks, or even supporting researchers' workflows. I can't call myself a Python expert, but I found it to be one of the easiest and most powerful tools to work with.

And so my hope today is that you walk away from this learning something new about Python and how it fits into the world of HPC. With that, I want to share my first Python project that went into a live production environment. As I mentioned earlier, my supervisor out of college challenged me to learn Python.

And it turned out he had a big project waiting for me. At the time, I was working at a plant and crop research facility where researchers used growth chambers to precisely simulate environmental conditions like temperature, humidity, and CO2. Oh, and light.

The problem was the controllers for these chambers were built in the 70s, so they were clunky and replacing them cost thousands of dollars. So that's where Python came in. I was tasked to develop a new in-house environmental control system.

It was challenging, but fun. The code grew to thousands of lines, and after a couple of iterations, we had a stable system running. And what you see here is a graphical interface built entirely in Python.

And behind the scenes, it managed the control logic, tracked all the sensor data in real time, and logged the entire historical record. The payoff was huge. We reduced the cost of a full controller replacement from thousands of dollars to under $300.

Plus, because this was so customizable, we were able to add even more control variables than the old system allowed. This project proved that Python isn't just for scripting or data science; it's flexible and powerful enough to drive something as hands-on and critical as industrial control systems. And moving on to the HPC realm, this is how I use Python in my everyday work.

I was approached by a research group. They needed an automated weekly report summarizing their resource usage, including CPU hours, GPU hours, memory hours, and how many jobs were submitted. I used a Python script to automatically gather all this data from the start of the current month right up to the minute it runs.
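As a rough sketch of what such a report could look like (this is not the script used for the research group), the example below assumes the data comes from Slurm's sacct command with pipe-delimited output. The function names, the user name, and the sample lines are hypothetical, and only CPU hours are computed here.

```python
from datetime import date

def month_start(today):
    """Return the first day of the current month as YYYY-MM-DD."""
    return today.replace(day=1).isoformat()

def sacct_command(user, today):
    """Build (but do not run) an sacct query from month start until now."""
    return ["sacct", "-u", user, "-S", month_start(today), "-X", "-n", "-P",
            "--format=JobID,Elapsed,AllocCPUS,State"]

def elapsed_hours(elapsed):
    """Convert Slurm's [DD-]HH:MM:SS elapsed format to hours."""
    days, _, clock = elapsed.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 24 + hours + minutes / 60 + seconds / 3600

def cpu_hours(sacct_lines):
    """Sum elapsed time x allocated CPUs over pipe-delimited sacct lines."""
    total = 0.0
    for line in sacct_lines:
        _job_id, elapsed, cpus, _state = line.split("|")
        total += elapsed_hours(elapsed) * int(cpus)
    return total

# Hypothetical sample lines, standing in for the output of a real sacct call.
sample = ["1001|02:00:00|4|COMPLETED", "1002|1-00:30:00|2|FAILED"]
print(sacct_command("your_uni", date(2025, 3, 14)))
print(f"CPU hours this month: {cpu_hours(sample):.1f}")
```

On a cluster, the command list could be passed to subprocess.run and its output fed to cpu_hours. Keeping the parsing separate from the call makes it easy to test without Slurm.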

So this turned a manual task into a reliable, zero-effort automation that supports our researchers. So with that being said, what is Python? Python was created by the Dutch programmer Guido van Rossum in 1991, and he named it after his favorite British comedy, Monty Python. It's arguably one of the most popular programming languages in the world.

Some real-world examples: Python powers engines on e-commerce and streaming platforms like Amazon and Netflix. It's used in game development at studios like Ubisoft and EA. It's used for rendering, animation scripting, and automation at companies like Pixar and in tools like Blender.

And here, in the HPC world, it's used in research to streamline scientific workflows by leveraging CPU parallelization and GPU acceleration. So why use Python? Because Python's an object-oriented language, it handles much of the organization and complexity for you. This allows you to write logic in significantly fewer lines of code than in traditional languages.

Python gives you access to thousands of pre-built libraries, which means you save immense development time by using professional ready-made code instead of building everything from scratch. Python is famous for its clean and readable syntax, which makes it incredibly approachable for beginners and allows you to quickly start writing functional code. Python has become the go-to language for AI and machine learning, meaning its popularity will only continue to grow as the field evolves.

Lastly, because Python is so popular and is open source, it benefits from a massive and active community that continuously supports and improves the language. This strong community is exactly why there's access to a vast ecosystem of pre-built libraries. So with that in mind, I wanted to highlight some of the popular pre-built libraries commonly used on our HPC system.

We have NumPy for numerical computing. We have TensorFlow for deep learning, Pandas for data analysis, mpi4py for parallel computing, and lastly, PyTorch for deep learning as well. These Python libraries are designed to get the most out of CPUs, and several of them, like TensorFlow and PyTorch, can use GPUs as well for enhanced performance.

This dual-use capability is what allows them to excel in large-scale simulations, data processing, and speed, and we'll dive into the reasons why in the next slide. So yeah, how does Python handle this? By default, your Python script runs on a CPU, which is great for sequential tasks like running your code one line at a time, but for massive, repetitive calculations in AI and machine learning, we need parallel processing, and that's where GPUs are the clear winner. Libraries like PyTorch and TensorFlow that we just saw are designed to leverage GPU acceleration, allowing them to perform massive tasks on our HPC system far faster than CPUs alone could.
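To make the CPU-by-default point concrete, here is a minimal sketch, assuming PyTorch is installed in your environment (the guarded import lets it still run if it is not). The helper name pick_device is hypothetical. It chooses the GPU when PyTorch can see one and falls back to the CPU otherwise.

```python
try:
    import torch  # assumed to be installed in your environment
except ImportError:
    torch = None

def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"

device = pick_device()
print("Running on:", device)

if torch is not None:
    # The tensors are created on the chosen device, so the same code runs on either.
    a = torch.rand(1000, 1000, device=device)
    b = torch.rand(1000, 1000, device=device)
    print("Result shape:", tuple((a @ b).shape))
```

The design point is that the placement is explicit: nothing moves to the GPU unless the code asks for it, which is why it is worth printing the device at the start of a job.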

So expanding on that, why GPU? The answer is massive parallelism. Think of a CPU as a few genius scientists working sequentially. They can do anything, but they work one task at a time.

A GPU, on the other hand, is thousands of efficient workers doing thousands of simple math operations simultaneously. So this design, combined with its high memory bandwidth, makes the GPU ideal for the massive floating point math which is at the core of AI and scientific computing. So for these highly parallel tasks, you often see performance gains, like Max mentioned, of 10 to 100 times over running on CPU alone.

So again, this is why researchers are increasingly turning to GPUs, not just for the raw speed, but for the efficiency needed to handle today's huge data-intensive workloads. Just to show how much it has changed on our HPC system, about a decade ago when AI was starting to gather steam, GPUs made up only 9% of resources on our Terremoto cluster. And then on our next oldest cluster, Ginsburg, that footprint grew slightly, to about 13% of our system's total capacity.

And that brings us to today, our newest cluster, Insomnia. The GPU share has exploded to approximately 34%.

This massive leap from 9% to 13%, and now to over one-third of the entire cluster, directly reflects the rise of AI and machine learning, proving GPUs are now critical. And this growing adoption highlights the rising importance of learning Python to harness all of that GPU power. So yeah, let's zoom in on the Insomnia cluster to see the hardware a little bit.

We have a variety of cards, including both Intel and NVIDIA GPUs. Our NVIDIA cards range from the older but still powerful A6000s all the way up to the newest H100s, which are the state of the art for AI.

Looking at this too, it's a good moment to talk about GPU RAM versus regular RAM. Researchers use GPU RAM because it's specifically designed for high-speed parallel processing. Think of it as a separate, super-fast memory dedicated only to the GPU, allowing the GPU to process the huge data sets needed for things like AI and machine learning much faster than it could with your traditional motherboard's RAM.

So you can see the different RAM specs on our NVIDIA GPUs. So now that we talked about Python and its ability to leverage the computational power, of course you might be wondering how do you access those GPUs. So next I'm going to talk about navigating through our HPC cluster.

We offer two main ways to connect. First is through the terminal, which provides direct, classic command-line access, which I'll be demonstrating throughout this workshop here on the right side of my screen. The second option that we offer is Open OnDemand, which gives you a super-friendly web-based option for accessing the same power.

So let me talk about Open OnDemand briefly. Open OnDemand is a web-based interface ideal for those who are new to Linux. It provides a user-friendly graphical desktop environment.

And it lets you perform all HPC tasks, including submitting jobs, without needing to touch the terminal. Again, I just wanted to briefly touch on this. If you want a deeper dive, you can reach out to us after this workshop and we can give you a deeper overview of what that is.

And here's what the web interface looks like. You can click on this interactive desktop icon, and it'll launch a fully interactive desktop environment, letting you manage and submit your HPC jobs without, again, touching the command line. So for those who were at the Linux and Bash workshops, this should look familiar.

This is how we access the Insomnia cluster. You'll use the ssh command with your UNI at insomnia.rcs.columbia.edu. Once you press enter, it'll ask for a Duo prompt. And then once you follow that, it'll ask you to put in your UNI password.

So let me demonstrate that here. Let's see, my UNI at insomnia.rcs.columbia.edu. I already had it set up so that it takes my credentials automatically, but now I'm logged into the Insomnia cluster. This is where I'm going to be working for the rest of the workshop here.

So now that I'm logged in, I wanted to cover some of the basics of the free tier account on our cluster. Everyone on this call should have access to the free tier account. But if you find that you don't, you can email us after this workshop.

There are some important limitations to keep in mind for the free tier. It's a CPU-only tier. Jobs run only on standard nodes, meaning there is no default GPU access.

You're limited to one compute node currently, but we will be expanding that to two nodes soon. There's a wall time limit of 12 hours. And for obvious reasons, there's no root access for security and system stability.

However, to access GPUs, you can submit jobs to the short partition. This does allow GPU jobs, but again, you have the same 12-hour limit. Now that I've covered some basics of the free tier account, I wanted to move on to another key part of working on the HPC cluster, which is modules.

Modules are a key part of our HPC environment because they let you easily load and manage different software packages and versions without affecting the system or other users. These modules are built and maintained by HPC admins so that researchers can quickly access the tools they need. So why use them? They make it simple to load specific software, handle complex dependencies, and keep the environment consistent for everyone.

So you, yourself, don't have to install any software. If there's something that you need that isn't currently available, you can always request that we build that module. How does it work? You simply type module load and the software name to load the module, and module unload and the software name to remove it.

This way, you're only loading what you need when you need it. I'll go over more of that in this next slide here. So some more basic module commands you use on the HPC cluster.

We'll start with module avail. So this command lists all the software modules available on the Insomnia cluster. You can think of it like browsing a menu of everything that's already installed and ready for you to use.

Next, I'm going to load the Anaconda module since I'll be using that throughout this workshop. And to check if my module was loaded, I'll just simply type module list. And as you can see, it's loaded Anaconda successfully.

If I wanted to stop using Anaconda, I can simply type module unload, and then Anaconda. But for this workshop, I'm going to keep it loaded. I'll clear my screen here.

And then finally, if you are ever unsure of what the module command does or what options it provides, you can just type module help, and it'll list all the different options for module. So now that I covered some of the basics of access, the free tier, and modules, let's shift and talk about using GPUs on the HPC system. To take advantage of this, there are a few components that work together, and I'll walk through that stack in the next few slides.

So there are five key components needed to successfully harness GPUs on our system. The first is Anaconda, whose module I've already loaded. You can think of Anaconda as a one-stop shop for scientific computing.

It's an open source platform that comes preloaded with almost everything you need. It includes both Python language and its most powerful tool, Conda, along with dozens of essential libraries, like I mentioned before, NumPy, Pandas, and PyTorch, which we'll be using. Instead of fighting with manual installations and confusing package dependencies, Anaconda makes it super easy to manage all of those tools in one place.

Next is Conda. So inside Anaconda is its most powerful tool, Conda. This is a game changer because it acts as an environment manager, letting you create isolated, separate project spaces. This means you can install specific packages for one project without ever creating version conflicts or dependency issues for another.

So it guarantees reproducibility. I'll go through an example in the next slide here. Scenarios.

So I have two scenarios that may help you understand Conda environments a little more. So in this first scenario here, this visual has person one and person two running a script called Sol Run, as you see here. Their computer environments are completely different.

You see there are some different R and Python versions, and person two is missing some libraries. And because person two's computer environment doesn't satisfy the Sol Run script, they run into errors. But when both people build and run the same Conda environment, it has everything needed to run the Sol Run package.

So even though the computer environments are different, because their Conda environments are the same and contain all the correct R and Python versions and libraries, both are able to run Sol Run. So Conda guarantees reproducibility across machines. The second scenario: say that you're working with two different departments.

One needs Python 3.8 and TensorFlow. The other needs Python 3.11 and PyTorch. With Conda, you can create two separate, isolated environments, meaning those workflows will never conflict.

This keeps your projects clean, organized, and running independently. So how do you set up a Conda environment? So I'll be showing that here. So let's go ahead and set it up.

Before we create it, I wanted to check where Python is running currently. So if I do a which python, you can see it's running from the Anaconda module that I loaded. Next, before we create the Conda environment, we want to point this environment variable, Conda packages, to my home directory.

This just makes sure that Conda installs its packages into my home directory. So let me do that here. Next, we can create the Conda environment.

And then you hit enter, but I already went ahead and created my Conda environment. So what would happen is, it'll just spin up this Conda environment with Python 3.9. And then after it's done, it'll say you can now activate your Conda environment. And you can simply type conda activate and the name of your Conda environment.

And then you'll notice that your terminal prompt has changed a little bit. And it shows the environment name in parentheses here. So it means you're working inside that environment.

So let's run the which python command again. Now you can see that it's working within that Conda environment. And now let's take a look at my .bashrc file.

You should see a block of text that starts and ends with conda initialize. This confirms that Conda has been properly initialized in your home directory. Both of these steps, checking which python and looking at the .bashrc file, are really helpful tools to keep in mind when you're troubleshooting your Conda setup.
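The same troubleshooting check can be done from inside Python. This is a small sketch using only the standard library. The function name describe_python is hypothetical, and it assumes that conda activate sets the CONDA_DEFAULT_ENV variable, as it does in a standard Conda setup.

```python
import os
import sys

def describe_python():
    """Collect the details we would check when troubleshooting an environment."""
    return {
        "executable": sys.executable,        # same idea as `which python`
        "prefix": sys.prefix,                # root folder of the active environment
        "version": sys.version.split()[0],
        # Set by `conda activate`; missing when no Conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(none)"),
    }

for key, value in describe_python().items():
    print(f"{key:>10}: {value}")
```

If the executable path does not point inside your Conda environment, the environment is probably not activated in that shell or job.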

And if you ever reach out to us, these are things that we look into to resolve any Conda environment issues. So now that my Conda environment is set up, you can start installing any Python packages using pip. So for example, I know I'll be needing PyTorch and TorchVision for this workshop.

I can simply type, oops, I can type pip install torch torchvision. But again, I went ahead and installed it.

But for example, if I run it now, it should say it's already satisfied. But again, this approach gives you the flexibility to customize your environment with exactly the tools your project requires without affecting other users or the system environment. So here you can see that the requirements are already satisfied because I already installed PyTorch in my Conda environment.
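A quick way to confirm from Python what is installed in the active environment is sketched below, using the standard library's importlib.metadata. The helper name installed_version is hypothetical, and the output depends on your own environment.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ["torch", "torchvision"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```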

Let me clear my screen here. The third component is Slurm. Slurm is the job scheduler that runs the entire cluster.

You can think of it as the traffic controller for all our resources. It decides how every CPU, GPU, and node is allocated based on what users are requesting. I want to quickly show you some of the basic commands you'll use every day to understand what your job is doing.

The first one is squeue. So this command is your basic status check. It shows the list of all jobs waiting or running on the cluster.

The key thing to watch for is this state column here. It's either going to be a PD or R. PD meaning your job is pending or waiting for resources. R meaning your job is running currently.
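As an illustration of reading that state column, the sketch below counts jobs per state from text in squeue's default column layout. The sample text, job names, user names, and node names are made up, and the parsing assumes no spaces inside job names.

```python
from collections import Counter

STATE_NAMES = {"PD": "pending", "R": "running"}

def count_states(squeue_text):
    """Count jobs per state from squeue's default column layout."""
    lines = squeue_text.strip().splitlines()
    state_column = lines[0].split().index("ST")  # find the state column by header
    counts = Counter()
    for line in lines[1:]:
        code = line.split()[state_column]
        counts[STATE_NAMES.get(code, code)] += 1
    return counts

# Hypothetical sample, standing in for real squeue output.
sample = """
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
101 free train ab1234 R 5:12 1 node001
102 free prep cd5678 PD 0:00 1 (Resources)
103 short test ab1234 R 1:03 1 node002
"""
counts = count_states(sample)
print(dict(counts))
```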

The next command is scontrol show partition. In this example here, I'm going to use the free partition.

This shows you details about a partition, like assigned nodes and time limits. So max time is 12 hours and the assigned node is ins022. And then let's look at the details about a node.

I'm going to run scontrol show node. It gives you detailed information about the node. I want to specifically highlight this state right here.

There's three different states that will typically be shown here. Right now, it's in a mixed state, meaning this node is partially being used. The other two states are idle, meaning that it's totally free and not being used.

The other one is allocated, which means it's fully in use and cannot take any more jobs. The next command is sacct -u with your UNI. Let's see, 5292.

So this is your personal job history. It lets you review all of your past jobs and quickly see if they finished successfully or were canceled. So you can see some failed and some completed.

This is just, again, to see how your submitted jobs did. And lastly, sinfo -o "%N %G %T". This will show all the nodes, their GPUs, and their current state. This is useful to see what's available.
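To show how that three-column output can be filtered down to GPU nodes, here is a small sketch. The sample text and node names are invented. It assumes the columns are node list, generic resources (GRES), and state, and that Slurm prints (null) where a node has no GRES.

```python
def gpu_nodes(sinfo_text):
    """Return (nodes, gres, state) rows for entries that list a GPU resource."""
    rows = []
    for line in sinfo_text.strip().splitlines()[1:]:  # skip the header row
        nodes, gres, state = line.split()
        if gres != "(null)":  # "(null)" marks nodes without GPUs
            rows.append((nodes, gres, state))
    return rows

# Hypothetical sample, standing in for real sinfo output.
sample = """
NODELIST GRES STATE
node[001-010] (null) idle
node011 gpu:a6000:8 mixed
node012 gpu:h100:2 draining
"""
for nodes, gres, state in gpu_nodes(sample):
    print(f"{nodes:<10} {gres:<14} {state}")
```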

The (null) means that it's not a GPU node. So you can see there are A6000s and L40S cards. And the H100s should be here too, right here.

So you can see the different states: mixed, and draining, meaning it's finishing up jobs. I'm assuming there's going to be some work done on these nodes here. And then next, I want to talk about interactive jobs.

Interactive jobs give you live, hands-on access to a compute node, almost like you're directly logged into the machine. This is especially useful for testing, debugging, or running quick tasks.

So in this workshop, I'm going to use the salloc command. It requests the resources you need. And once they're available, it opens a shell on the compute node.

So let me clear my screen here. So here's kind of a breakdown of a simple salloc command for a free user. The -p is the partition.

And free users will use the free partition. -A references the account. So the account for free users will be free. --mem is how much memory you're requesting.

And in this case it's 10 gigabytes of memory. --ntasks=1 asks for one process. Then --time=2:00:00.

This sets a maximum runtime of two hours. So say you wanted to request a GPU node, the syntax will change a little bit. So, one second.

So can you make your font slightly larger for the terminal and for the slides? Perfect. That's great. The slides are fine.

You can continue the slides. Okay. Thank you.

Sorry about that. So to request a GPU with the salloc command, there are a couple of syntax changes and additions to be made. So the partition will be changing from free to short.

And then we'll be adding this flag, --gres=gpu:1, requesting one GPU. So let me do that here. So a key tip for the salloc command: be clear and specific with your resource request.

This helps the scheduler allocate your jobs correctly. And it's especially important for the GPU and parallel scripts we'll run today. So in my case, I know with my examples, I'm going to run with the RCS partition and the RCS account, --mem equals 10 gigabytes.

And then for my processes, actually, I'm going to call for 32 processes, and the time I can keep at two hours for now, which gives plenty of time. So you can see the scheduler, Slurm, granted me this job ID and then this INS node here. So with this interactive job running, I can type squeue --me, and it'll show that this interactive session is running.

So you see the status R, and it gives you a job ID. So taking it a little further, I can take this job ID and do scontrol show job here. So within this job description, you can see things like the submit time, start time, and end time. It knows that I requested two hours, so it automatically sets an end time. You can also see the node that I landed on here, and then the tasks, the processes I requested, which was 32, and the memory I requested, which was 10 gigs.

And then lastly, like the working directory I'm working out of, which is my home directory. All this information is helpful when you're tracking, you know, your jobs or troubleshooting, if anything doesn't go as expected. So moving on.
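When tracking jobs in a script, the Key=Value layout of scontrol show job is easy to turn into a dictionary. The sketch below is a simplified illustration: the sample text is invented, only a handful of fields are shown, and values that contain spaces would need more careful handling.

```python
def parse_job(scontrol_text):
    """Turn scontrol's Key=Value tokens into a dictionary."""
    fields = {}
    for token in scontrol_text.split():
        key, sep, value = token.partition("=")  # split on the first "=" only
        if sep:
            fields[key] = value
    return fields

# Hypothetical sample, standing in for real `scontrol show job` output.
sample = """
JobId=123456 JobName=interactive
JobState=RUNNING TimeLimit=02:00:00
SubmitTime=2025-03-14T10:00:00 StartTime=2025-03-14T10:00:05 EndTime=2025-03-14T12:00:05
NumNodes=1 NumCPUs=32 NumTasks=32
WorkDir=/home/your_uni
"""
job = parse_job(sample)
print(job["JobState"], job["TimeLimit"], job["NumTasks"], job["WorkDir"])
```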

So before we dive into our CPU job, I wanted to just do a quick recap of what I just talked about: some of the resource rules, like partitions and wall times. First, partitions. The free partition is assigned to free tier users.

It doesn't have GPU access by default, but if you use the short partition, it allows you to access GPU nodes. Again, the salloc command allows you to launch an interactive job, and you can use the short partition to gain access to a GPU. Both these partitions have a wall time of 12 hours, so make sure your job can finish within that window.

And finally, I just wanted to talk about the --exclusive flag. Whenever you use this flag, no other jobs will run at the same time on the node that you landed on. So this is useful when you want consistent performance, or when you're debugging and want to avoid interference from other users.
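The salloc options recapped above can be assembled programmatically, which helps keep requests explicit. This is a sketch only: the function salloc_command and its defaults are hypothetical, it prints the command rather than running it, and it assumes the GPU request keeps the free account while switching to the short partition.

```python
import shlex

def salloc_command(partition, account, mem="10G", ntasks=1,
                   time="2:00:00", gpus=0, exclusive=False):
    """Assemble an salloc command line from the options covered above."""
    args = ["salloc", "-p", partition, "-A", account,
            f"--mem={mem}", f"--ntasks={ntasks}", f"--time={time}"]
    if gpus:
        args.append(f"--gres=gpu:{gpus}")   # request GPUs only when asked
    if exclusive:
        args.append("--exclusive")          # keep other jobs off the node
    return shlex.join(args)

cpu_request = salloc_command("free", "free")
gpu_request = salloc_command("short", "free", gpus=1)
print(cpu_request)
print(gpu_request)
```

Writing the request as named parameters makes it harder to forget a flag such as the GPU request, which is an easy mistake to make when typing the command by hand.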

Again, this is a quick overview of Slurm. This will be covered in more detail in the next HPC workshop. So the next component, again, is the GPU resource, as you saw.

I called for a GPU resource, and I landed on INS082. To confirm that I actually landed on a GPU, I can use the nvidia-smi command. Oh, maybe this is not the right node to work on.

I knew this node was, okay, this is a better node. So if I run nvidia-smi here, oh, I know exactly why. Because I didn't call for a GPU node.

So I forgot to add the --gres equals GPU flag. I definitely did not call for a GPU node. So now if I do an nvidia-smi here.

Okay. So this is a tool provided by NVIDIA giving you real-time monitoring of GPU usage. Super helpful for debugging and tracking performance on the GPU.

So it tells you the model here. I landed on an A6000 node, or GPU. And it gives you some of its information, the GPU utilization, some of the processes here.

Currently, since I'm not running anything, there's no information to be given right now. But once I run Python processes, you can see the different usage here. So before assuming your job is using a GPU, you know, like earlier, it said command not found because I was never on a GPU node.

So it's always good to run nvidia-smi to make sure you're actually on a GPU. So now we're ready to put it all together using Python as the key language to harness and control that raw GPU power on this HPC system. Let me clear my screen here.
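The same sanity check can be scripted at the top of a job. This sketch uses only the standard library. The function name list_gpus is hypothetical, and it assumes nvidia-smi is on the PATH only on GPU nodes, which matches the command-not-found behavior seen in the demo.

```python
import shutil
import subprocess

def list_gpus():
    """Return one line per visible GPU, or an empty list if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # typical on a CPU-only node: the command is not installed
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = list_gpus()
print(gpus if gpus else "No GPU visible; check that the job requested one.")
```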

Your first Python program. So I know we might have a real mix of experience in the room today, from beginners to more advanced programmers. I'm going to start off with some basic Python programs just to highlight core concepts and ensure we're all on the same page.

From there, we're going to build up the complexity into more advanced scripts to show how Python is used in the HPC world, specifically how it interfaces with GPUs and CPUs. If you attended the Bash scripting session last week, some of the same principles will show up in the next slides. I'm going to touch on the functions briefly.

But if you want a deeper dive into those, feel free to revisit those Bash presentations. So yeah, let's jump in. So first, actually, I'm going to jump into where I kept all my scripts.

First is a simple hello.py program. It contains one line, print hello world, and the added shebang at the top here.

Actually, let me cat hello.py. The added shebang at the top here just tells the shell to use Python when executing the file. Next, the print function.

This is one of Python's simplest and most commonly used tools. It makes it easy to display output and see what your code is doing. We'll use print often in this session.

It's great for showing the outputs, debugging, testing values, understanding your program's flow. So as you're starting out working with Python, print is an essential tool for figuring things out. So let me quickly run this.

So when I run this script, you can just see the output. It just outputs hello, world. Very simple.

Next, variables and data types in Python. Variables are like labeled containers that hold different kinds of data. And data types describe what kind of data is inside, like numbers, text, or true or false values.

This is all key to writing useful programs. So let's break it down deeper into data types.

We have strings, which are text. So anything inside quotes, like hello, world, is a string. Integers are whole numbers without decimals.

This is typically used for counting or doing some basic math. Floats are numbers with decimals. These are used when you need to do more precise calculations.

Booleans are either true or false. They're super helpful when your program needs to check for something or make decisions. Next, moving on.
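The four data types just described can be seen side by side in a short sketch. The variable names and values here are illustrative only.

```python
greeting = "hello, world"   # string: text inside quotes
job_count = 3               # integer: whole number
memory_gb = 10.5            # float: number with a decimal point
has_gpu = False             # Boolean: True or False

# type() reports which kind of data each variable holds.
for value in (greeting, job_count, memory_gb, has_gpu):
    print(value, "is a", type(value).__name__)
```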

So let's make our code a little more interactive by introducing the input function. The input function is how we get information from a user. When Python reaches this line, it pauses and waits for the user to type something like their name in our example.

Then we use the print function. Here we're combining hello and whatever the user types, printing out a personal greeting message. This is a simple interaction, but it's really powerful. So let me run that for you here.
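A minimal sketch of that input-then-greet pattern follows. The exact wording of the workshop script is not shown in the recording, so the prompt and message here are assumptions, and a fixed name stands in for input() so the sketch runs without a keyboard.

```python
def make_greeting(name):
    """Combine a fixed greeting with whatever the user typed."""
    return "Hello, " + name + "!"

# In the interactive script this line would be:
#   name = input("What is your name? ")
name = "Sam"  # stand-in value so the sketch runs unattended
print(make_greeting(name))
```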

So I'm going to type my name, and then it prints out that custom message with the input that I gave it. Now that we know how to take input and print output, let's talk about how we make decisions in our code. In real life we make decisions all the time.

If it's raining, I'll bring an umbrella. Otherwise, I won't. Python works in the same way, using the conditional statements if and else.

These let your program check if something is true and run different code depending on the result. To do this, Python uses comparison operators like equal to, not equal to, greater than, and less than to evaluate those conditions. So in this example, we're building on the last one with some basic decision making.

First, again, we use the input function to ask for the user's name. Then we check the input using an if statement. If the name equals Max, we print a custom message for Max.

If it's Sam, we use elif and print a custom message for Sam. For any other name, we use else to print a default message. So you might be wondering what's the difference between a single equals sign and a double equals sign.

The single equals sign is the assignment operator, so it stores a value, the name you entered, into a variable, which is name. The double equals sign is the equality operator. It compares two values to see if they're equal.
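The if, elif, else structure and the two operators can be sketched as follows. The message texts are made up for illustration, and a fixed name replaces the input() call so the sketch runs on its own.

```python
def message_for(name):
    """Pick a message based on the name, using if / elif / else."""
    if name == "Max":        # == compares two values
        return "Hi Max, thanks for hosting!"
    elif name == "Sam":
        return "Hi Sam, ready to present?"
    else:
        return "Hello, " + name + "!"

name = "Max"                 # = stores a value; the real script uses input() here
print(message_for(name))
```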

So again, if name, whatever the user inputs, is equal to Max, it'll print out this custom message for Max. So let's test out this script here. So I'm going to type Max, and it should print out that custom message for Max.

I'm going to run the code again and type Sam. It'll print out the custom message for Sam. I'll run it once more.

I'll type Bob. It'll print out a generic message for Bob. So now let's talk about loops.

Loops let your program repeat actions without having to write the same line of code over and over. This is useful when working with large datasets, running repeated calculations, or automating tasks. There are two types of loops.

One is for loops, which are great when you know how many times something needs to happen. For example, looping through a list of names or repeating a block of code 10 times. While loops, on the other hand, are for when you don't know ahead of time how many times something should run.

They keep going as long as the condition stays true. For example, while the user hasn't entered the correct password, you keep asking for the password. Using loops helps us write cleaner and more efficient code.

So let's jump into some examples. So here we're combining what we've learned so far: input, conditions, and now loops. In this for loop example, we wrap our code in a for loop with range(3).

This means that we repeat this block of code three times. And then, again, just like before, we check if the name is Sam or Max, and it prints a custom message. If it's somebody else, we use a generic message here.
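The for loop described here can be sketched as follows. This is an illustration under assumptions: the message texts are invented, and a list of names stands in for the three `input()` calls so the sketch runs unattended.

```python
typed_names = ["Max", "Sam", "Bob"]   # stands in for three input() calls
messages = []

for i in range(3):                    # range(3) yields 0, 1, 2: three passes
    name = typed_names[i]             # interactive version: name = input("Name: ")
    if name == "Max":
        messages.append("Welcome back, Max!")
    elif name == "Sam":
        messages.append("Hi Sam!")
    else:
        messages.append("Hello, " + name + "!")

for line in messages:
    print(line)
```

The loop body is written once and executed three times, which is why a single run of the script can exercise all three branches.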

So let's test this for loop out. So I'm going to type Max, and it should print the custom message for Max. Then Sam, then Bob, and then it finishes because it ran three times.

So as you can see, I don't have to run the code multiple times to see the different messages like I did earlier. Earlier, I ran the hello conditions Python code three times to see whether each of the conditions was met. In this case, I told my program to loop three times to see the three different conditions, which, again, makes it easier to test and see results. For loops are the right tool if you know how many times you need to loop something.

On the other hand, we have while loops. So this line, while True, creates an infinite loop, meaning the program will keep running until we tell it to stop. So it'll run all of these lines until we tell it to stop.

So, again, just like before, we check if it's Max or Sam. It'll print a custom message for them. Else, it'll print a generic message.

But now there's a new condition. If name equals stop, we print goodbye and break out of the loop. So let's test that out.

So, again, like Max, Sam, it prints those custom messages. If I use Bob again, it'll do a generic. I can do more names.

I can keep going and going until I type stop. So when I type stop, it'll say goodbye and close out this script. This just demonstrates how while loops let us repeat actions until a condition is met, in this case typing stop, giving the program more flexibility.
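A hedged sketch of the `while True` / `break` pattern follows. The function name `run_greeter`, the `read_name` parameter, and the message texts are assumptions for illustration; the parameter lets scripted names replace keyboard input so the sketch can run on its own.

```python
def run_greeter(read_name):
    """Keep greeting until the name 'stop' is entered."""
    lines = []
    while True:                       # loops forever unless we break
        name = read_name()
        if name == "stop":
            lines.append("Goodbye!")
            break                     # leave the loop immediately
        elif name == "Max":
            lines.append("Welcome back, Max!")
        elif name == "Sam":
            lines.append("Hi Sam!")
        else:
            lines.append("Hello, " + name + "!")
    return lines

# Scripted names stand in for typing at the keyboard.
scripted = iter(["Max", "Sam", "Bob", "stop"])
output = run_greeter(lambda: next(scripted))
for line in output:
    print(line)
```

For interactive use, the same function could be called as `run_greeter(lambda: input("Enter a name (or stop): "))`.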

So in the HPC world, for and while loops are incredibly useful for automation and resource management. For loops are great, again, when you know exactly how many tasks or jobs you want to run. For example, if you're running a series of simulations with different parameters, a for loop lets you submit each one automatically.

No need to write out every command manually. While loops give you more flexibility, they're perfect when you need to wait for something to happen, like if a job has finished or if a resource such as a GPU node becomes available. The loop can keep checking until the condition is met and then take the next step.
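To make the two HPC uses concrete, here is a rough sketch. Everything named in it is hypothetical: `simulate.py`, its `--temperature` option, and the fake job check are placeholders, and no real scheduler is contacted.

```python
import time

# for loop: one command per parameter value (commands are only built, not run).
temperatures = [280, 300, 320]
commands = []
for t in temperatures:
    commands.append(f"python simulate.py --temperature {t}")  # hypothetical script

def wait_until(check, poll_seconds=1.0, max_checks=100):
    """while loop: keep checking until check() is true or we give up."""
    checks = 0
    while not check():
        checks += 1
        if checks >= max_checks:
            return False
        time.sleep(poll_seconds)      # pause between checks
    return True

# A fake "job" that reports finished on the third check.
state = {"polls": 0}
def fake_job_finished():
    state["polls"] += 1
    return state["polls"] >= 3

finished = wait_until(fake_job_finished, poll_seconds=0.01)
print(commands)
print("job finished:", finished, "after", state["polls"], "checks")
```

In a real workflow the check function would query the scheduler; the cap on the number of checks keeps the loop from waiting forever.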

So, again, loops are an essential tool in your HPC toolkit. Now that I've covered some of the Python basics, let's take it a step further. I want to show how we can scale up these scripts to harness the power of the HPC system.

This is what makes it possible to run huge simulations, process big data, and manage thousands of jobs efficiently. I'm going to start with the concept of the Message Passing Interface, or MPI. MPI is a standard tool in HPC that lets us run Python programs across multiple processors to speed things up through parallel processing.

MPI uses a process-based model where each process runs independently and communicates by sending messages. This makes it ideal for breaking up large tasks. It's highly scalable, from a laptop to large supercomputers, and with libraries like mpi4py, Python users can easily access this power without switching to another language.

mpi4py, again, brings MPI to Python. It's a wrapper around the standard MPI library, so you can write parallel code in Python instead of using C. It works really well with NumPy, letting you efficiently share large data arrays between processes. That's perfect for simulations, data analysis, and scientific computing.

Best of all, mpi4py has a clean Pythonic interface, so if you know Python, it's much easier to get started with parallel computing. Before diving into the example, I want to talk about NumPy. Again, it's the foundation for numerical computing in Python.

It provides fast and easy-to-use tools for working with large arrays. It includes many built-in math functions for things like stats, linear algebra, and more. NumPy is also at the core of many other popular libraries like SciPy, Pandas, and Matplotlib.

In HPC, NumPy pairs well, again, with mpi4py, making it easy to share large data sets across processes. That's perfect for scaling up scientific computations. So, let's take a look at the MPI example.

So this script is a basic MPI hello world example using mpi4py. It runs in parallel across multiple processes, and each process prints its own message along with its process ID, called its rank. The prerequisite for running the command is to install mpi4py in the conda environment, which, again, I already have.

Actually, I need to activate my conda environment, which I had already done. And this command here, mpiexec, is the command you'll use to run the MPI program.

So, mpiexec -n. The -n refers to how many processes you want to run. So, in this example, I'm going to run four processes. And then python and the Python script.

Just wanted to explain this script here. Hello, MPI. It's the same thing here.

So, we first import the mpi4py library. This tells Python to use the MPI libraries. Here, size tells us how many total processes we're running.

Rank gives each process a unique ID. Name then tells us which machine it's running on. And then each process prints out a message showing its rank and where it's running from.

And then in here, the f before the opening quotes in the print statement just tells Python to evaluate anything inside the curly brackets and print the result. So, again, when I run this command, the output I should see is four different lines of output. So, here, hello, MPI.
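A minimal hello-world sketch along the lines described above is shown here. The message wording is illustrative, and the `except ImportError` fallback is an addition (not part of the workshop script) so the sketch still runs as a single process on a machine without mpi4py.

```python
import socket

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD               # the group of all launched processes
    rank = comm.Get_rank()              # this process's unique ID, 0..size-1
    size = comm.Get_size()              # total number of processes
    name = MPI.Get_processor_name()     # machine this process runs on
except ImportError:
    # Assumption for this sketch only: without mpi4py, act as one process.
    rank, size, name = 0, 1, socket.gethostname()

# The f prefix makes Python fill in the values inside the curly brackets.
print(f"Hello from rank {rank} of {size} on {name}")
```

Saved under a file name of your choice (say `hello_mpi.py`, a placeholder), it would be launched with `mpiexec -n 4 python hello_mpi.py`, and each of the four processes prints its own line.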

Oh, actually, MPI. So, you'll notice, again, so, this is parallelism in action. One thing I wanted to highlight here is you'll notice that the numbers here are not in sequential order.

Again, this is because the script is running in parallel. Otherwise, if it weren't running in parallel, you'd see it in sequential order. So, next, I wanted to dive into a more advanced example using mpi4py and NumPy to calculate an approximation of pi.

This example uses numerical integration, a method that's simple to parallelize. We can show how to speed up computations by distributing work across multiple processes. The approach that I'll be talking about is often paired with the Monte Carlo method, a statistical technique that uses random sampling to estimate values.

So, I'm going to be breaking it down into different sections here. So, first, imports and initialization, computation functions, main function, work distribution, parallel computation, and then combining the results, and then later we're going to speed it up with more processes. As you can see in this image, in our example we're approximating pi by summing the areas of these rectangles under the curve.

So, we'll be using this exact model to calculate pi by having the MPI processes work on different rectangles simultaneously. So, again, this is a fantastic benchmark for understanding parallelism and how MPI accelerates tasks on our HPC system. So, let me first jump into the script here.

So, first, imports and initialization. We start the script by importing the necessary libraries. Again, here, mpi4py gives us access to MPI functions.

NumPy helps us with the numerical operations, and then there's time. We're going to use that to measure the speedup later. The next few lines of code here initialize MPI and set up our communicator, which is the group of processes that are going to work together and talk to each other.

The rank tells us the unique ID. Size tells us the total number of processes running the script. So, again, here.

Next, the computation function. So, this is the heart of the calculation, the calculate pi part function. Mathematically, it's using the midpoint rule of numerical integration to calculate the area under the curve, which gives us pi.

So, every parallel process is going to call this function to compute the small chunk of the total sum it was assigned. So, that is this portion of the code. Moving into the main program, we first define how accurate we want our pi approximation to be by setting numstep to 1 billion. Each of these steps represents one of those tiny rectangles under the curve.
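The transcript does not name the curve being integrated. A common choice for this benchmark, assumed here, is 4 / (1 + x^2) on the interval from 0 to 1, whose area is exactly pi. Under that assumption, a serial midpoint-rule sketch looks like this. The function name echoes the one mentioned in the talk but is a guess, and plain Python is used instead of NumPy to keep the sketch dependency-free.

```python
import math

def calculate_pi_part(start, end, step):
    """Sum rectangle heights for steps start..end-1 using the midpoint rule."""
    total = 0.0
    for i in range(start, end):
        x = (i + 0.5) * step           # midpoint of the i-th rectangle
        total += 4.0 / (1.0 + x * x)   # height of the curve at that midpoint
    return total

num_steps = 100_000                    # far fewer than 1 billion, for a quick run
step = 1.0 / num_steps                 # width of each rectangle
pi_estimate = calculate_pi_part(0, num_steps, step) * step
print(pi_estimate, "error:", abs(pi_estimate - math.pi))
```

Because the function takes a start and an end index, each parallel process can later call it on its own slice of the steps.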

The step variable then calculates the width of each rectangle here. And then finally, only in the main process rank zero, we start the timer. It's crucial that the rank zero does this so that we can accurately measure how long the entire parallel computing task takes from start to finish.

So next, under the divide work among processes section, this is the part of the code where we're truly harnessing parallelism. This section divides the total work evenly among the available processes. So size is the total number of available processes.

And then each process uses its unique rank to figure out its specific start and end positions. In our case, the total being divided is 1 billion steps. And then this ensures two things.

First, that every process is working on a unique piece of the problem simultaneously. And second, that if the steps don't divide evenly, a process handles the remainder, guaranteeing all of the work gets done efficiently without overlap. Next, gathering the results.
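The work split can be sketched without MPI by computing each rank's range in an ordinary loop. Giving the leftover steps to the last rank is an assumption here; the transcript does not say which process takes the remainder, and `chunk_bounds` is a hypothetical helper name.

```python
def chunk_bounds(rank, size, num_steps):
    """Return the [start, end) range of steps owned by one rank."""
    chunk = num_steps // size                   # even share per process
    start = rank * chunk
    # Assumption: the last rank also takes any leftover steps.
    end = num_steps if rank == size - 1 else start + chunk
    return start, end

num_steps, size = 10, 4                         # small numbers to show the remainder
bounds = [chunk_bounds(r, size, num_steps) for r in range(size)]
print(bounds)                                   # [(0, 2), (2, 4), (4, 6), (6, 10)]
```

The ranges touch but never overlap, and together they cover every step exactly once.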

So once the work is divided, each process calculates its own partial pi here. This is where the magic happens next. We use the comm.reduce function to gather all of those partial results and sum them together into one final value.

But only on rank zero. On rank zero, we finish the calculation by multiplying the total by the rectangle width, and we print out the final accurate value of pi and the total time it took. And then finally, the if __name__ == "__main__" construct is essential in our Python script. It just tells Python to only run this main function here, let me highlight it right here, when the file is run as a script.
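Putting the pieces together, a compact sketch of the parallel version might look like the following. It is not the workshop's script: the integrand 4 / (1 + x^2), the remainder going to the last rank, the reduced step count, and the single-process fallback when mpi4py is missing are all assumptions made so the sketch is short and runs anywhere.

```python
import time

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption for this sketch: without mpi4py, behave as one process.
    MPI, comm, rank, size = None, None, 0, 1

def partial_sum(start, end, step):
    """Midpoint-rule heights for this process's share of the rectangles."""
    return sum(4.0 / (1.0 + ((i + 0.5) * step) ** 2) for i in range(start, end))

def main(num_steps=200_000):
    step = 1.0 / num_steps
    t0 = time.perf_counter() if rank == 0 else None   # only rank 0 keeps time
    chunk = num_steps // size
    start = rank * chunk
    end = num_steps if rank == size - 1 else start + chunk
    local = partial_sum(start, end, step)
    if comm is not None:
        # Sum every rank's partial result; only rank 0 receives the total.
        total = comm.reduce(local, op=MPI.SUM, root=0)
    else:
        total = local
    if rank == 0:
        pi = total * step                             # heights times width
        print(f"pi ~ {pi:.10f} in {time.perf_counter() - t0:.3f} s on {size} process(es)")
        return pi
    return None

if __name__ == "__main__":
    main()
```

On ranks other than zero, `comm.reduce` returns `None`, which is why the final multiplication and the print are guarded by `if rank == 0`.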

It just keeps your code organized. And then I wanted to talk about speedup. So again, since each process handles a different part of the work simultaneously, we expect computations to run faster.

In theory, running four parallel processes should give us up to a four times speedup over running it sequentially. And then we'll see an output looking kind of like this. So in this example, I'm going to compare four processes versus 16 processes versus 32 processes.

So here I run mpiexec -n with four processes, then python and my mpi4py pi script. So now it's running, and again, like the image that I showed before, it's calculating all those rectangles. And it's going to display how long it took when I gave it four processes to calculate.

And in theory, the higher number of processes we give it, it should get faster. It takes about 20 seconds. Okay, oh, this one took 45 seconds.

So yeah, so the speedup of four processes, this is theoretical again, is versus if it ran sequentially. So let me speed it up with 16 processes instead of four. We should see a dramatic difference. Okay, yeah. So you see, with 16 processes, it took 7.5 seconds to calculate pi.

And then let's move on to 32 processes. It should be somewhat faster than this time. But we'll see what happens.

So this time, it's a little greater, which, again, can happen. Which brings me to my point: we might not see a dramatic speedup in this simple example, because the example is so small.

But larger research examples, like AI or other simulations with bigger data sets, will clearly demonstrate how the more processes you use, the shorter the time it takes to execute these calculations. If you run this yourself while rewatching the recording, you might not see the same results that I got here. You might land on a different node that might be sharing resources.

So again, if you see different numbers, don't think that you're doing it wrong. It's because you might land on a different node than I did. Next, let's go into PyTorch.

PyTorch makes it easy to use GPUs with a few lines of code. It has built-in support for NVIDIA's CUDA, which handles memory and parallel tasks efficiently. At its core, PyTorch uses tensors, like multidimensional arrays, and you can run tensor operations directly on the GPU for much faster computations, speeding up training and computation significantly.

Best of all, switching between CPU and GPU is seamless with PyTorch. So yeah, it's a very powerful tool to use. Before jumping to the PyTorch example, I wanted to go over the dataset we'll be using.

It's from the Modified National Institute of Standards and Technology. I'll be calling it MNIST. It's a classic dataset of 60,000 handwritten digits, 0 through 9, each 28 by 28 pixels.

It's small, clean, and easy to work with. Because of its simplicity, MNIST is often used to test and compare models. It's kind of like the hello world of image classification.

So it's perfect for learning and experimenting with tools like PyTorch. So this is what the dataset looks like here. You see each image is 28 by 28, a picture of a handwritten digit, 0 through 9. Since the digits come from many different people, you'll see lots of variations in handwriting.

The goal of the example is to train a model to recognize and classify each image into one of 10 categories, basically answering what number is this. So again, we'll use this dataset in our PyTorch example. So at a high level of the example that I'll be showing, this is what's happening behind the scenes.

We'll take the dataset here. Let's say it grabs the 2. It passes it through a neural network built with Python and PyTorch. The model learns and recognizes which number it is.

The example trains a simple classifier and runs it on both CPU and GPU. So you can see how PyTorch handles both. So at the end of the output of the script, it'll print how long each training run took, highlighting the advantages of using a GPU.

So moving into our PyTorch example, again, like before, I'm going to go open up the PyTorch Python code here. So I broke it down into different sections. First, set up your tools.

So again, before we dive into the script, we import torch, torch.nn, torch.optim. These are the core of PyTorch. They let us build the model, train it, and optimize it. We also bring in time because we're running a race between CPU and GPU.

We import sys to make sure the script exits neatly if it errors. And then lastly, torchvision.

This is how we easily get and prepare the data set. The next section here is the interactive configuration. Remember how we talked about making your Python code interactive? That's exactly what we're applying here. We use that input function to ask for the batch size, which is the number of images we want to process at once.

Here, I'm suggesting to use 64, 512, or 2048. So this knob lets us test different workloads without having to stop or edit the script. So it makes this demo pretty quick and flexible.

In the next section here, this is the traffic cop of our script. The line torch.cuda.is_available() is right here. This is the decisive moment in which PyTorch checks if a fast GPU is ready to run.

It then stores the result, either CUDA or CPU, in a single variable called device here. So PyTorch is going to see if it's on a GPU. If there is a GPU, it labels it CUDA.

If there's not, then CPU. This single variable is our navigator. Everything we do from here on will refer to this device.
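The device check described here is a short, widely used PyTorch pattern. A minimal sketch, with the small tensor added only to show the variable in use:

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Anything created or moved with this variable follows the same choice.
x = torch.ones(2, 3).to(device)
print(x.device)
```

The rest of a script can then refer to `device` everywhere instead of hard-coding "cuda" or "cpu".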

Next, in the section two, the model definition. Now we're defining the brain itself, which is called simple CNN. And this isn't just a simple math problem.

So we use convolution layers here, which we create with nn.Conv2d. These are specialized layers designed to perform huge batches of identical calculations. This is the heavy lifting that lets us see the massive speed differences between GPU and CPU. So here is kind of the image of what convolution layering looks like in the background.
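The transcript does not show the layers of the workshop's model, so the following is only a plausible small CNN for 28 by 28 grayscale digits. The class name, channel counts, and kernel sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """A small convolutional classifier for 1x28x28 images and 10 classes."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # keeps 28x28
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # keeps 14x14
        self.pool = nn.MaxPool2d(2)                               # halves height and width
        self.fc = nn.Linear(32 * 7 * 7, 10)                       # one score per digit

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 16 x 14 x 14
        x = self.pool(F.relu(self.conv2(x)))   # -> 32 x 7 x 7
        x = x.flatten(start_dim=1)             # -> 1568 features per image
        return self.fc(x)

model = SimpleCNN()
fake_batch = torch.randn(64, 1, 28, 28)        # random stand-in for 64 images
logits = model(fake_batch)
print(logits.shape)                            # torch.Size([64, 10])
```

Each convolution applies the same small filter at every pixel position of every image in the batch, which is the kind of repeated, identical arithmetic GPUs handle well.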

So these are the calculations we're running in this code. Next, data loading. So next, this section is about getting the fuel to the brain.

So we're loading the MNIST dataset, again, the 60,000 images of handwritten digits. The data loader is our factory floor. It takes those images and bundles them into packages, which we call batches.

Batches here. So this is the number we'll be inputting when running the script. So once the user enters the batch size, it determines the size of each package we deliver to the training loop.
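To illustrate batching without downloading MNIST, this sketch wraps random tensors in a `TensorDataset` (a stand-in, not the workshop's torchvision loader) and also works out how many batches the 60,000 images would form at each suggested batch size.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data shaped like 1,000 MNIST images with labels 0-9.
images = torch.randn(1000, 1, 28, 28)
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(images, labels)

batch_size = 64        # interactive version: int(input("Batch size: "))
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

first_images, first_labels = next(iter(loader))
print(first_images.shape)          # torch.Size([64, 1, 28, 28])
print(len(loader))                 # 16 batches; the last one is partial

# Batches per pass over 60,000 images for the batch sizes suggested in the talk.
batches_per_epoch = {b: math.ceil(60_000 / b) for b in (64, 512, 2048)}
print(batches_per_epoch)           # {64: 938, 512: 118, 2048: 30}
```

Larger batches mean far fewer trips through the Python training loop, with more work handed to the device on each trip.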

So moving on to the training loop section here, or rather, first, the initialization section. So here, we set up the cognitive system for our training. So this is important before we do our training.

We initialize three key components. We have the model, the brain structure itself, and the loss function, which is the scoreboard. So what the loss function is doing is pretty much saying, hey, you were 80% wrong on that prediction.

And then finally, the optimizer is the teacher. So it takes the error from the loss function and adjusts the model so it learns from its mistakes. And then moving on to the training section now.

So this is the heart of the code, where the timing differences happen. So for every single batch of images, the first, most important line is right here, the .to(device) call. And this is the actual moment where we take images off the computer's main RAM and push them over to the GPU's dedicated high-speed memory.

So once the batch lands in the GPU's high-speed memory, the model runs the forward pass here to predict. And then in the backward pass, it learns from its mistakes, which is incredibly fast. And then next, execution and timing in section six.
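The model, loss function, optimizer, and training steps described above can be sketched in a few lines. A tiny linear model, random data, SGD, and the learning rate are all stand-ins chosen so the sketch runs quickly; the workshop's script uses its CNN and the MNIST loader instead.

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# The three components: model (brain), loss (scoreboard), optimizer (teacher).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# One random batch stands in for a batch of digit images and their labels.
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

losses = []
for step in range(5):
    x, y = images.to(device), labels.to(device)   # move the batch to the device
    optimizer.zero_grad()                         # clear old gradients
    loss = criterion(model(x), y)                 # forward pass and score
    loss.backward()                               # backward pass: compute gradients
    optimizer.step()                              # adjust the model's weights
    losses.append(loss.item())

print(losses)
```

The loss printed for each step should shrink as the optimizer nudges the weights, which is the "learning from mistakes" described in the talk.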

So here, this is the final comparison of the race. So we run the exact same training twice. For the GPU run, we use the synchronize function, which is actually used up here, to make sure Python waits for the GPU to completely finish first, and then it'll run on the CPU.
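GPU work in PyTorch is launched asynchronously, so a timer can stop before the GPU has actually finished unless the code waits with `torch.cuda.synchronize()`. The sketch below shows that timing pattern on a small matrix multiply; the helper names and the workload are illustrative, not the workshop's training run.

```python
import time
import torch

def timed(fn, device):
    """Run fn and return (result, seconds), waiting for the GPU if one is used."""
    if device.type == "cuda":
        torch.cuda.synchronize()       # make sure earlier GPU work is done
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()       # wait for this work to really finish
    return result, time.perf_counter() - start

def workload(device):
    a = torch.randn(512, 512, device=device)
    return (a @ a).sum().item()

devices = [torch.device("cpu")]
if torch.cuda.is_available():
    devices.append(torch.device("cuda"))

timings = {}
for device in devices:
    _, seconds = timed(lambda: workload(device), device)
    timings[device.type] = seconds

print(timings)
if "cuda" in timings:
    print("speedup:", timings["cpu"] / timings["cuda"])
```

On a machine without a GPU only the CPU timing is reported, and the first GPU call usually includes one-time startup cost, so a fair comparison would repeat the measurement.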

And then we'll time both to see the final speedup. So let's test this out. So I'm going to run my PyTorch script here.

So again, it's taking those data sets. Actually, it's going to ask me what my batch size is going to be here shortly. So I'm going to start off with 64 and we'll compare to the other batch size.

So again, what this is doing is I'm giving it the batch size of 64. So it's running batches of 64 to identify those 60,000 handwritten images. So as you can see, it first runs on a GPU and shows how long the training took, which was 11 seconds, and it shows you the loss as well.

And now it's moving on to a CPU for comparison. So again, this training time is how long it took to identify all the images in that 60,000-image handwritten data set. So again, as a result, when we use a batch size of 64, the total CPU time was 24 seconds, whereas on a GPU, it took about 11 or 12 seconds.

So GPU was two times faster than CPU. So let's see what happens when we increase the batch size. In theory, the bigger the batch size, the more efficient the GPU should run, because that's what GPUs are good at doing.

So again, here, instead of doing batch sizes of 64, we're doing batch sizes of 2048 simultaneously. So here, as you can see, the GPU finished in nine seconds with a total loss of 2.3. And now moving to the CPU model to see, yeah, 18 seconds. So again, it still takes longer.

The GPU was, again, about two times faster than the CPU. But again, I wanted to highlight that when we use a batch size of 64, the total GPU time was about 11 seconds. And then when we increase the batch size, so again, processing batches of 2048 images at once, the GPU is actually more efficient at processing that.

So it went from about 11 or 12 seconds to 9.6 seconds. I know we mentioned that with GPUs we should see a 10 times speedup in certain workloads. So why don't we see that in our example here? There are three main reasons why we don't see that significant 10 times speedup.

So, you know, in our example, it's mostly just two times faster. The first reason in our example is data transfer. So we're moving data between CPU and GPU, which is slow.

If the data just stayed on the GPU, you know, it should be significantly faster. The second is Python loop overhead: the outer Python loop is sequential. So it just goes line by line, and it adds small delays.

And then the third is saturation. Our small neural network is like a tricycle with a Ferrari engine. We need massive models to truly make the GPU sweat and hit its peak performance.

So that's the PyTorch example. So after seeing the code run, comparing the times, let's take a quick step back. Why did we even look at this? Why does CPU versus GPU difference matter for you as a Python developer? First, it's about speed and iteration.

As you start learning machine learning, you want to test ideas quickly. A GPU takes experiments that might tie up your regular laptop for an hour and runs them in minutes. This means you spend less time waiting and more time actually experimenting.

Second is simplicity of the code. The most important lesson here is that PyTorch lets you build one model and, with just that one .to(device) call, switch between a GPU and a CPU. You can instantly run it on the fastest hardware available, again, the GPU.

It shows you that writing modern Python code is powerful enough to be hardware agnostic. And third, this is the engine of modern AI. Every major system you hear about, whether it's DALL-E or even the AI running in self-driving cars, is fundamentally running the same massive calculations on thousands of GPUs simultaneously.

So you now have the basic insights of how all that advanced tech actually works. And the best part of it is with our HPC system, you have access to this power. You can take this exact code, scale it up, scale up the batch size, and start running your own large scale experiments.

And to close this workshop out, this is how you deactivate your Conda environment. Again, this is the proper way to close out isolated environments we created. So in our case, my Conda environment, this ensures everything is clean for your next session.

So I'm going to run conda deactivate here. And now it goes back to the default Insomnia cluster base environment. The scripts I ran today will be located at this path here.

So make sure to use the copy command rather than the move command. You will simply cd into your home directory and then use cp with the path that I provided and the asterisk sign, which copies all the files in that directory. And the period means copy them to your current location.

You can use the ls command to confirm that they're copied over. So yeah, these are the expected scripts that were copied over, the ones I went over today. If you run into any issues copying the scripts over, feel free to reach out to us.

So yeah, as you continue your Python journey, here are some free resources that I recommend. You can take a look at these books. Also, if you're looking specifically for ways to apply Python to research problems, I highly recommend looking through the Software Carpentry resources I have linked here.

So again, you'll have a copy of the slides and you can click on these links. And I think that's it. Sorry, I only spared three minutes.

But any questions? I know there was a lot of information to take in, so apologies for that. No need to apologize. That was great.

Sam, do we have any outro slides? Oh, actually, yes. Sorry. So if there are no other questions: next week, Al will be going through the intro to HPC.

Again, I went through some basic Slurm commands. Al will be covering that in more depth, along with our HPC system as a whole. I don't know if you want to add to that.

No, pretty much covers it all. We've been trying to run a layered approach to our HPC environment over the last few weeks. The first week was the operating system.

The last couple of weeks with Bash and Python were more about showing you the scripting languages which you use to interact with our clusters. Next week is going to be intro to HPC, which is going to bring it all together. We'll be touching on cutting-edge technology such as the parallel file system.

The scheduler we're going to go deeper into. We're going to be launching jobs. Sam touched upon how you can interact with the scheduler, listing the number of jobs, nodes, allocations, and so on.

And also, we'll be touching on the network layer as well. So, for example, your average home internet speed is about 1 gigabit per second. Our network layer is pushing 100 or 200 gigabits per second.

So, we are really using some cutting-edge technology. So, for anyone who's here right now who's been with us for the last two or three weeks, please attend. And again, it will help you understand when we send out notifications to the mailing list.

For example, when we're talking about the InfiniBand layer and storage, you'll know exactly what we're doing with our network. Aside from that, Sam, great job.

I've never seen that one before. So, I'll be using that in my everyday workflow now. Thanks for that.

Do you want to have the last word, Sam? No, no. Thank you for attending. I wish we could go more in depth, but I know we're tight on time.

But yeah, appreciate it. All right. If you have any questions when running Sam's scripts, or about access to the clusters, or if you're unable to get permission to create a folder, please contact us.

You have mine and Sam's information. You can email us directly or contact hpsupport at Columbia.edu and we'll get back to you. Aside from that, hopefully see you guys next week.

And again, Sam, thank you very much for a great presentation. Thank you. All right.

Take care, everyone.

Recorded November 6, 2025

In this workshop, we'll introduce Python, a versatile and easy-to-learn programming language widely used in many fields, including high-performance computing (HPC). We'll explore why Python is a popular choice due to its simplicity, readability, and extensive library support. In the context of HPC, Python is used for data analysis, automation, and scientific computing. We'll also cover some key Python modules used in HPC, such as NumPy, which help optimize performance and scale computations efficiently across clusters.