Classic: Intro to High Performance Computing 2025 | HPC Training #4

Transcript

OK, everyone, it is 2:05, so I am about to begin. And welcome to Research Computing Services Workshop Series Part 4, Introduction to High Performance Computing. My name is Al Tucker, and I am a system administrator here with the HPC group, and we are part of Research Computing Services.

And you may ask us questions or request cluster accounts by emailing us at hpc-support at columbia.edu. Also, let me apologize in advance. I feel a little bit of laryngitis coming on, so if my voice is a little bit too soft, I apologize. But I want to make it through the entire 90 minutes.

So topics that we will be going over in this session are reviewing access to the cluster, what is an HPC, HPC use cases, overview of the Columbia clusters, job types, including simple batch jobs, their output, and interactive jobs, what is a GPU, and then a few miscellaneous topics at the end. The system we'll use for this class is insomnia.rcs.columbia.edu, and your user ID and password are your UNI and password. So we're going to take a moment to go over access.

If everyone didn't complete this before class, please send a private message in the chat, and any of our assisting instructors can help you. Also, if you have any questions during the class, please send messages in the chat, and an instructor will assist you there, so that way it won't slow down the pace of the class for everyone else. We're going to go through a little overview, and this is mainly for the benefit of those who will be watching the slides alone afterward as this session is recorded.

So quickly, Windows users will be using PowerShell. All current versions of Windows should have it installed. If you do not have it installed, you can download it here.

Within PowerShell on Windows, you'll be using SSH. Again, current versions of Windows should have it installed. If you do not have it installed, it is something you can download at this URL here.

macOS has a Unix command line built in. It is similar to Linux, and you can access it through the terminal application, which is located in the Applications folder, Utilities folder, and the Terminal program. Simply double-click on it.

Either way, whether you are accessing this through Mac or Windows, you will SSH to your UNI at insomnia.rcs.columbia.edu. If you don't have access by now, don't worry. The next few slides will be lecture only, so there's plenty of time for you to receive help and not fall behind before we reach the hands-on exercises. Now, on to the meat.
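As a sketch of the login step just described, the command has the shape below. The UNI ab1234 is a made-up placeholder, not a real account; the hostname is the one given in the session.

```shell
# Hypothetical UNI; replace it with your own.
UNI="ab1234"
HOST="insomnia.rcs.columbia.edu"

# Build the login command and print it so it can be checked before running.
login_cmd="ssh ${UNI}@${HOST}"
echo "$login_cmd"

# To actually connect, type the printed command into PowerShell or Terminal.
```

The same command works from PowerShell on Windows and from Terminal on macOS.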

What is high-performance computing? A high-performance computing cluster, sometimes called a supercomputer, even though it is not one giant computer, is a grouping of many powerful compute servers working together in tandem. Each compute server is called a node. They are commonly reached through one or more front-end login nodes, which control access.

Work is not done on the login nodes. The nodes in each cluster work in parallel with each other, boosting processing speed to deliver high-performance computing. Whoops, sorry there.

All right? So on an HPC, parallelism is paramount. Again, with a high-performance computing system, a number of powerful computers work together in parallel to solve large, complex problems much faster than any single machine. Another term you may hear, high-throughput computing, is similar, but focuses on repeatedly doing large numbers of smaller independent tasks.

The focus is in solving a large volume of simpler things faster. Now, this isn't necessarily an either-or situation. A computing cluster can do both kinds of tasks, and Columbia's HPC system can do both.

It all depends on what type of problem you choose to solve. So why use an HPC? Well, one reason is to save time and speed up research. An example, a statistics student wants to cross-validate their model.

This involves running the model 1,000 times, but each run takes an hour. Doing this on their laptop could take over a month. Analyzing huge amounts of data is another HPC use.
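The arithmetic behind that estimate can be sketched as follows. The figure of 100 simultaneous runs is an assumption chosen only for illustration; how many runs actually execute at once depends on what the scheduler can give you.

```shell
runs=1000
hours_per_run=1

# One run after another, as on a laptop.
total_hours=$(( runs * hours_per_run ))
serial_days=$(( total_hours / 24 ))   # whole days, rounded down
echo "One run at a time: ${total_hours} hours, about ${serial_days} days"

# Assumption for illustration only: 100 runs execute at the same time.
concurrent=100
parallel_hours=$(( (total_hours + concurrent - 1) / concurrent ))   # rounded up
echo "With ${concurrent} runs at a time: about ${parallel_hours} hours"
```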

For example, a genomics researcher who's been using small datasets of sequence data, but soon will be receiving a new type of sequencing data that is 100 times as large. It's already challenging to open the datasets on their computer. Analyzing these larger datasets will probably crash it.

So, more HPC use cases. A variety of industries use HPCs for a variety of purposes. Research labs, for one.

HPC is used to help scientists find sources of renewable energy, understand the evolution of our universe, predict and track storms, and create new materials. Artificial intelligence and machine learning. HPC is used to train and run AI models, teach self-driving vehicles, and improve cancer screening techniques.

More use cases. Financial services use HPCs to track real-time stock trends and automate trading. Media and entertainment.

HPC is used to deliver mind-blowing special effects and stream live events around the world. Simulations. HPCs are used in designing new products every day, as well as in developing treatments for diseases like diabetes and cancer, and enabling faster, more accurate patient diagnoses.

Research on the HPC has been published in many hundreds of peer-reviewed publications, including top-tier science research journals such as Science News, Nature Geoscience, Science Advances, and Quaternary Science Reviews. So now we're going to give you an overview of what a cluster looks like.

This is simplified as it's focusing only on the main components that a user will see. There can be some other components that will only be seen by admins, but they're not relevant right here. So, typically, an HPC cluster will consist of a head node, or login nodes, and these control user access.

This is the public-facing front end that users will log into. Also, the job scheduler, such as Slurm, which will regulate which jobs go first and next, can also live here. Behind the head node are one or more execute nodes.

These are where the work is actually done. When you send a job to the cluster, this is actually where it's processed. Behind that are storage nodes, and how it works is this.

Instead of individual hard disks on these nodes hosting your home directory or scratch space for your lab, a storage node can be one or more highly redundant computers with large arrays of hard disks inside of them. So all of your data is stored on the storage node, and it's shared out over the network to all of these other nodes. This is the magic behind when you log into the head node, you'll have the same home directory here that you have here, that you have here, that you have here, and on every node, instead of having separate individual home directories on every single computer, the way you would if you were in a lab and you went from one computer to another.

The high-speed network makes it all possible, connecting all of these together. So the current HPC footprint consists of three clusters. Insomnia is our current cluster, so named because it's the cluster that never sleeps.

Insomnia is eternal. We will never be retiring Insomnia. Instead, as old nodes die, we will simply be buying new nodes, and as new customers come on, we will be adding more and more nodes to Insomnia.

Currently, Insomnia has 96 nodes with several thousand cores spread across them: 41 standard nodes, 27 high-memory nodes, and an array of nodes with GPU processors of varying vintages. Ginsburg is our next oldest cluster, and Ginsburg is part of our old model where a cluster would have a five-year lifespan and then be retired to make way for a new one. Ginsburg is in the middle of that lifespan, so no new nodes are being added to Ginsburg.

Users who have nodes on Ginsburg can add researchers into their partitions so they can use the cluster, but anyone buying a new node today would be put on Insomnia. Ginsburg has 286 nodes, again, with many thousands of cores spread across all of these nodes: 191 standard nodes, 56 high-memory nodes, and its own array of GPU-enabled nodes. We also have Terremoto, which is a legacy cluster.

It is older than Ginsburg, and it is officially beyond its support span, but some users who had nodes on Terremoto wanted to keep them running, and we have accommodated them. However, Terremoto is, like Ginsburg, not being added to, and right now, if a node on Terremoto died, it would simply be put out of service and not fixed. So across all three clusters, we have a large number of GPUs.

We have 120 NVIDIA A6000s, six L40s, 14 L40S GPUs, and six H100 GPUs, each having its own built-in memory. That's something to note about GPUs. A GPU will have its own memory separate from the system memory on the motherboard.

We'll get more into that later. Anyway, this is the array of GPUs that are enabled across all of our clusters. More nodes are added to Insomnia as it goes on.

Standard nodes on Insomnia have 512 gig of memory and 160 CPUs, which is broken down to two CPU sockets per node, each one having 40 cores and hyper-threading enabled, which doubles that. Insomnia also uses a GPFS file system. This is a parallel file system.

So now, what is that? A parallel file system is a type of file system that's designed to allow multiple processes or nodes to read and write data simultaneously. So some of the benefits of this is high input-output throughput because multiple processes can read and write files simultaneously. It scales very well.

As you add more compute and more storage to this, you get much more storage and even better performance. A parallel file system is also optimized for HPC and artificial intelligence workloads like simulations, large models, and big data analytics. It works well with job schedulers and MPI-based applications.

Also, a parallel file system is very fault-tolerant and redundant. So for example, in your computer, if your hard disk dies, your computer is out of service, but on a parallel file system linked to an array of disks, if any single hard disk dies, the system goes on without problem until the disk can be replaced. So what is the difference between a parallel file system and an NFS mount? One moment while I take a drink of water.

So some of you may already know NFS. NFS is a network protocol for sharing the hard disks of independent computers. So that way, each computer can see files on the hard disk of one of the others.

Typically, NFS is something you'd see in a small lab setup or a home computing setup. The takeaway, the big takeaway from this is if the network goes down in an NFS situation, each computer typically can still function, but it's only going to see the data on its own internal hard disk. GPFS, on the other hand, is more expansive.

With GPFS, all the hard disks typically live on one storage node, as we saw in the diagram before, and GPFS handles both the sharing of the data and the arrangement of the data on the hard disks, all to optimize the data so that all computers can read and write data simultaneously and quickly. The big takeaway from this is that GPFS without a network cannot exist. This slide also details some of the differences between GPFS, or rather any parallel file system like GPFS and NFS.

Basically, GPFS or other parallel file systems like Lustre, which is used on Ginsburg, and some others out in the world that we don't use, they're basically much more fault tolerant. They stripe data in different ways across different disks, and they're generally designed for high-performance situations, while NFS is more of a small lab situation, a small lab solution. Software on the HPC.

These are some of the available programs we have on the HPC. Labs that own nodes on the HPC can request installation of other programs that they may need. Free users on the HPC cannot request new programs be installed.

You need to work with the things that are on there already. However, we have a wide array of things installed, such as Python, Anaconda, which is a package management system for Python, Cell Ranger from 10x Genomics, Java, MATLAB from MathWorks, and an array of other programs. You'll see a little bit more of them in the slides coming along.

And also, we're going to talk about how we handle those programs. Programs on the HPC are mostly handled as environment modules. Modules are programs that you can load if and when you need them.

Each module corresponds to a particular software tool or library. And by using modules, we ensure that each application runs in a consistent environment that doesn't conflict with other programs on the cluster, since all of them do not exist together. They only exist for each user, depending on if a user loads them.

So now starting with the next slide, we'll be demonstrating some things that you can follow along with in your own terminal. So to view available modules on the HPC, you use module avail or ml avail. I'm going to show on my terminal.

I'm already logged into Insomnia. If you use ml avail, here you can see the vast number of programs we have installed on Insomnia. If you wanted to load one, for example, Anaconda, you would load it like this.

And I will demonstrate. Now notice that Anaconda only has one version. So by just typing ml anaconda, I automatically got this version.

If a program has more than one version, say Cell Ranger, if I typed ml cellranger, what I'm going to get is the latest version. And you can see that with module list or ml. These are the modules I have loaded.

If I wanted to use an older version of Cell Ranger, then I would have to type out the entire name in order to get the older version. So that's the behavior. If you just use the name itself, you'll get the latest version or only version. For older ones, type the entire name.
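The module commands demonstrated above can be sketched as follows. The lowercase module names anaconda and cellranger are assumptions based on how they were spoken; the exact names and version strings are whatever module avail prints on the cluster. The guard only keeps the snippet harmless on a machine without environment modules.

```shell
# Check whether the module command exists in this shell.
if type module >/dev/null 2>&1; then have_modules=yes; else have_modules=no; fi

if [ "$have_modules" = yes ]; then
    module avail              # list every module installed on the cluster
    module load anaconda      # bare name: loads the only (or latest) version
    module list               # show what is currently loaded
    module avail cellranger   # list the versions available for one program
    # To load an older version, give module load the full name exactly as
    # module avail prints it (the version string is not reproduced here).
else
    echo "module command not found; run this on the cluster"
fi
```

In the session, ml is used as the short form: ml avail, ml anaconda, and ml on its own to list loaded modules.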

And again, ml lets you see what you have loaded. Running HPC jobs. So typically to run jobs on the HPC, you'd use a piece of software called a job scheduler.

There are several popular job schedulers out in the world, PBS, Moab, Condor, but at Columbia, what we use is something called Slurm. And Slurm was created by a company called SchedMD. Now we'll continue with the hands-on exercises and show you how to submit, run and work with HPC jobs.

So first we're going to copy the workshop files into your own directory. And I will do this on my screen as well. So you can watch me clear this.

pwd will show me the home directory that I have. And we're going to copy the workshop files into our home directory. Then we will go into our copy of the workshop directory and ls will show us the files that we have in it.

These are things we'll be working with throughout the class. So first we're going to work with a file called hello.sh. And this is a submission file or sbatch file.

An sbatch file is nothing more than a text file with directives that tells Slurm what to do. We're going to edit this example file using nano which is an easy text editor for any beginner to use. So this is what it looks like when we use nano.

And again, I will also do this on my screen. So this is a small typical batch file that you can submit to run a Slurm job. The different parts of any batch file are going to be your account, which for the users here all of you are in the free account on Insomnia.

Some of you are also part of labs. So if you are part of a lab that exists on the cluster and you know your account, you can feel free to put your lab account in here as well. But we're all in free.

Also in a batch file, you're going to give your job a name. And this reason is because you can run multiple jobs at once. So if you're running 10, 20, even a few hundred jobs at once, if you don't give your jobs names it's going to be very hard to know which one is which when you go back to look at them.

In a typical sbatch file, you'll do things like tell the cluster how many processor cores on a node you want to use. Here we indicate that with -c 1. Also, you'll tell Slurm in the sbatch file to give you memory for each core.

So here we're asking for one gig of memory for every core that we asked for. So this is just one gig. We also give time to Slurm.

And this time is an estimate of how long we expect our job to run. Since this is simple, I'm only going to give in this sbatch file a time of one minute. And this is zero days, zero hours, one minute.

It's always good to give a time that your sbatch file is expected to run and you can start small and scale up because this way- Al, I think you're muted. I think you're muted here. Yes, I was.

Sorry, I don't know what happened there. I must've accidentally touched something. So what was the last thing I said that everyone heard? You were telling us about the time.

And you were saying like, you know, zero days, zero hours, and one minute. That's the last thing I recalled. Okay, wonderful, thank you.

We caught that quickly. So it's always good to submit a time with your batch file. If you estimate too little time and your job needs more, it will simply fail and you can just run it again. However, if you overestimate and give too much time, what can happen is your job might wait unnecessarily longer to be scheduled.

Say this is a one minute job. If I gave a time of say an hour, Slurm would be looking for free space in the schedule where an hour long job could fit. And it might wait much longer than it needs to when I only need a minute.

So it's always good to start small and then amp up as you create your own batch files. So these are directives to the scheduler telling it what to do. And in this file, the actual things we're doing is going to sleep for 45 seconds and then echo hello world.

Another thing to note is that in the sbatch file, the single hashtags are not comments. These precede SBATCH (#SBATCH) to tell Slurm that these are directives. If you wanted to make a comment, you use two hashtags in a row.

This line is a comment. These are not. So now that we're done with that, we can exit out of nano and move on.
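Putting the pieces just described together, a batch file along these lines would match the description. This is a reconstruction from the spoken walkthrough, not a copy of the workshop's hello.sh: the job name HelloWorld and the long-form option spellings are assumptions, and the sketch is written to hello_sketch.sh so the real file is left alone.

```shell
# Write the sketch to a separate file and syntax-check it with bash.
cat > hello_sketch.sh <<'EOF'
#!/bin/bash
## Two hash marks make a comment; "#SBATCH" marks a directive for Slurm.
## Account to run under (workshop attendees are in "free").
#SBATCH --account=free
## Job name (hypothetical; pick one you will recognize in the queue).
#SBATCH --job-name=HelloWorld
## One core, with 1 GB of memory for each core.
#SBATCH -c 1
#SBATCH --mem-per-cpu=1G
## Time limit: 0 days, 0 hours, 1 minute.
#SBATCH --time=0-00:01

sleep 45
echo "Hello World"
EOF

bash -n hello_sketch.sh && echo "hello_sketch.sh written"
```

To plain bash the directive lines are just comments, which is why the same file can be checked locally and then handed to sbatch on the cluster.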

So we're going to submit this to the job scheduler. So do sbatch hello.sh. This will send it off to Slurm to be processed. So we have a job ID.

That's what always happens when you send a job off. Your job number, of course, is going to be different from mine. Now to view information about this job, we can use the squeue --job command.

Oops, squeue --job and the job ID. And this shows us that this job right here is pending. And it is pending because free users have lower priority on the cluster.

So other users who own nodes, their jobs will run first before ours. We can tell it's pending. We can see it's under my account.

This is the job name. Again, that's useful because if we had done 10 or 20 or so jobs, we could tell them apart by name. Numbers would also differ on them, but human beings tend to work with names better.

Another way to see status of jobs is with squeue --me. We get the same output here. However, again, if we were running multiple jobs, this would show us all of the jobs we had running, whereas referencing the job number only shows the specific job that we were asking about.
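The submit-and-check sequence above can be sketched like this. The --parsable option was not shown in the session; it is used here only so the job ID can be captured in a variable instead of being retyped. The guard keeps the snippet harmless on a machine without Slurm.

```shell
# Check whether the Slurm client commands are available.
if command -v sbatch >/dev/null 2>&1; then have_slurm=yes; else have_slurm=no; fi

if [ "$have_slurm" = yes ]; then
    # --parsable prints just the job ID (followed by ";cluster" on
    # multi-cluster setups, which is stripped off here).
    job_id=$(sbatch --parsable hello.sh)
    job_id=${job_id%%;*}
    echo "Submitted job ${job_id}"

    squeue -j "$job_id"   # status of this one job
    squeue --me           # status of all of your own jobs
else
    echo "sbatch not found; run this from a cluster login node"
fi
```

Run it from the directory that holds hello.sh. A state of PD in the squeue output means the job is still pending.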

Now to speed things up, hold on a moment, let's see. All right, this is still pending. What I'm going to do, I'm going to jump ahead a little bit and use a command that I hadn't intended to show yet, but I'm going to cancel this job and I'm going to edit this. I'm going to use... Did you go over the srun command? No, I'm not running it on the login node because when I do sbatch that sends it off to a compute node.

So I launched it from the login node, but it's actually running on the compute node. And anyway, what I'm going to do, because I want to demonstrate something, I'm going to change this to my administrative group, which has higher priority. So the job will actually run immediately.

So I can demonstrate something, write this out. Okay, now I have a new job ID. And when I do this, I can see now the job is running.

And to see all jobs in the queue, you can use the squeue command. And this is going to show you, indeed, every job that is running on the cluster from all people in all groups. Since that's a lot of data, you might want to pipe it through less, which can easily let you see by scrolling up and down who's running what.

You can see here that there's a lot of jobs running on the free partition. So that's why they're all pending. You can also see here... Let's go down to the bottom.

Tickle Smith group seems to be running a lot of things. And the short partition, which we'll get into later, has several jobs pending. But basically, squeue by itself is going to show you every job that's running on the cluster.

Another useful command to see current and historical jobs is sacct. When you use this... Again, I'll clear. sacct is going to show you all the jobs that you've run since the previous midnight.

So you can see here that I've run quite a few while I was testing this class on the free partition. I ran several under the RCS partition. All of you, most of you, can probably see very little unless, of course, you are part of a lab and you've been doing some work on your own already.
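A short sketch of the history command described above. The column list passed to --format is an illustrative choice, not something shown in the session; the guard keeps the snippet harmless on a machine without Slurm.

```shell
# Check whether the Slurm accounting client is available.
if command -v sacct >/dev/null 2>&1; then have_slurm=yes; else have_slurm=no; fi

if [ "$have_slurm" = yes ]; then
    # With no options, sacct lists your jobs since midnight today.
    sacct

    # The same list with a hand-picked set of columns.
    sacct --format=JobID,JobName,Partition,Account,State,Elapsed
else
    echo "sacct not found; run this from a cluster login node"
fi
```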

So now we will submit hello.sh to Slurm again to demonstrate some more things. And if I look and see... Yes, my job is already finished. So I will submit this once again to the queue.

And now I can demonstrate a command, scontrol show job. If you use scontrol show job and give it the job ID, this is going to give you a lot of information about your job. It's showing you... Again, we're on the login node, but once we submitted the job to the cluster, it's running on node 95.

It also tells you a lot of other things. The state that it's running. We'll get back to that later.

It shows you the account. Again, I changed this to run under my RCS account. For you, if your job is running, it would show free.

And it shows you a whole host of other information about your job. This is another slide showing you all the various things you can see with scontrol show job. Another thing you can do with scontrol is scontrol show partition.

And a demonstration of that is when you use scontrol show partition, you see information about all of the accounts or partitions on the cluster. That's a lot of information. And generally, you're going to be most concerned with what you're running.

So, if you use scontrol show partition with the one that you're in, you see a much more limited amount of information that is also more useful. So, this shows different bits of information about the free partition that we're running under. Another useful thing is scontrol show node.

Similarly, scontrol show node is going to give you information about every node in the cluster. And with 96 nodes, that's a whole lot of information. Generally, you might be concerned with just the nodes that your particular group owns.

So, if you want to focus in on a particular node, and I believe the free partition, the only node that's primarily for it is INS 022. So, if we look at INS node 022, this is one node that's devoted to the free partition. And we see more information about this particular node.
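The narrowed-down scontrol queries can be sketched as follows. The node name ins022 is an assumption about how the spoken "INS 022" is actually spelled as a hostname; check the node list on the cluster for the real name. The guard keeps the snippet harmless on a machine without Slurm.

```shell
PARTITION="free"
NODE="ins022"   # assumption: spelling of the node spoken as "INS 022"

if command -v scontrol >/dev/null 2>&1; then have_slurm=yes; else have_slurm=no; fi

if [ "$have_slurm" = yes ]; then
    scontrol show partition "$PARTITION"   # one partition instead of all
    scontrol show node "$NODE"             # one node instead of all 96
else
    echo "scontrol not found; run this from a cluster login node"
fi
```

Leaving off the name, as in scontrol show partition or scontrol show node, prints every partition or every node.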

Now, node states. There are many different states a node can have. And again, I'm going to show that with scontrol show node.

For example, when we looked at node INS 022, we see that this node right now is idle. So, that means it's available and not being used by a job. Besides scontrol show node, there is also a command called sinfo, which shows node states in a different manner, where you see different groups, what nodes they're using, and what states they're in.

Nodes can have many states. You can see if a node is idle, if a node is allocated, meaning it's running a job that's consuming all of its resources. You can see if a node is down, meaning for some reason or other, it has failed or it's unavailable.

Draining means that we administrators have taken it offline for maintenance, and no new jobs are going to be accepted by it. Once the jobs run out, then we know that we can use it for whatever we need without interfering with people's work. And there are many other different states that nodes can have.

A common one is mixed, since the cluster is very busy. Mixed means that a node is both allocated and idle, meaning it's running jobs, but the jobs that it's running aren't taking up all the resources. Either they're not taking up all the memory, they're not taking up all the processors on the node, they're just not using the full node.

So Slurm knows that if a job that is waiting can fit into the number of processors that are available, or the memory that's available, it can fit that job onto that node. That's what helps jobs get scheduled sooner. And again, that's why it's important to start slow and scale up with, say, the number of cores you request or the amount of memory you request, because that way Slurm has a better chance of fitting your job in and making it run sooner.
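The node states just described (idle, allocated, mixed, down, draining) can be looked at with sinfo. A minimal sketch, using the free partition named in the session; the guard keeps the snippet harmless on a machine without Slurm.

```shell
if command -v sinfo >/dev/null 2>&1; then have_slurm=yes; else have_slurm=no; fi

if [ "$have_slurm" = yes ]; then
    sinfo           # one line per partition and node state
    sinfo -p free   # only the free partition
    sinfo -N -l     # one line per node, with extra detail
else
    echo "sinfo not found; run this from a cluster login node"
fi
```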

So where did our output go? In our batch file, we asked the script to echo hello world to the screen. Well, we never saw hello world pop back up onto the screen, so what happened to it? Batch files run in the background on the cluster, so you don't have to sit and wait for them to finish. However, since they run without needing a screen, all that output has to go somewhere, and what happens is output from your sbatch file is sent to a file that you can review later.

So when your job finishes, any messages it sent will be in a slurm-x.out file that's located in the same folder that you started your job from, and that x is going to be the job ID that you received when the job started. So if your jobs have finished, and if not, I'll demonstrate with mine. When I do ls, I see that I have a couple of slurm-x.out files, and by using that, we can see that the hello world that we asked it to output was put into that file.

Also with job output, with the scontrol show job command that we had demonstrated a few slides back, scontrol show job also has, at the bottom, information about the job which can tell you where your output file is and what its name is, where the error file is for errors. You can, if you want, make a separate file so that it doesn't have to put them all in one. It tells you what command started your job and also the directory that your job was run from.
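A small sketch of locating that output file. The job ID 123456 is made up; substitute the number sbatch printed for you. The two directives in the closing comment are one way to split output and errors into separate files, shown as an illustration rather than as part of the workshop files.

```shell
job_id=123456                     # hypothetical job ID
out_file="slurm-${job_id}.out"    # default name: slurm-<jobid>.out

# The file appears in the directory the job was submitted from.
if [ -f "$out_file" ]; then
    cat "$out_file"
else
    echo "No ${out_file} in $(pwd); check the job ID and the submit directory"
fi

# To send errors to their own file, directives of this form can be added
# to the batch file (%j expands to the job ID):
#   #SBATCH --output=hello-%j.out
#   #SBATCH --error=hello-%j.err
```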

So this is all part of the useful information from scontrol show job. Interactive jobs. One moment, please, while I take another sip.

Interactive jobs. Slurm interactive jobs allow users to run applications on compute nodes within the cluster as if they were working on a regular login node, but with more resources and control. This is achieved by requesting a certain amount of time and resources from Slurm and connecting to the allocated node with commands like srun or salloc.

A simplified rule of thumb, when you are running a long job, use sbatch. If you just want to work live on a node for a little while, use srun to get to a node, and we're going to get into that right now. So use srun to start an interactive job.

Now, if you notice, the same type of directives that we had in our sbatch file can be used on an srun line. So as before with the sbatch file, we were sending a job off into the background with one core and one gig of memory for that core. With srun, we're going to ask here for a bash command line with those same specifications.

So again, sbatch requests resources and sends the job off to the background. srun is similar. srun basically is just another job as far as Slurm is concerned, but typically we use srun to gain a command line in the foreground.

Also, srun can be run inside of an sbatch script to run tasks in parallel. That is a bit more advanced, though. So back to some hands-on.

If you use, and I'm going to make things simpler by clearing my screen, if we cat the file that we have called interactive.sh, what you have is you see a typical srun line asking for, actually, this is an earlier version of the file. So we'll work with it. This is a typical srun line, and you can copy and paste that onto the command line.

Or if your terminal isn't good at copying and pasting, you can just type it onto your command line. But here, srun is going to ask for a bash command line under the free account. So now while that's waiting, I'm going to talk.

Unfortunately, that's an earlier version of the file. I meant it to look like this, but I'm going to explain it as if it looked exactly like this, because the time is intended to be expressed differently to show you this. --time is another way to express time.

It's the same as -t, but it's just a different formulation. So --time in this formulation with only one colon is asking for time that is only one minute, zero seconds. That's with one colon.

There are no days and hours here. Also, now, I don't know if your jobs have gotten a command line. You can see mine hasn't.

So for demonstration purposes, I'm going to cancel this, because I want to show something. I'm going to use my RCS admin account. And that way, I got a command line instantly.

So now, I've gotten a bash command line that's going to last for one minute, and it's on node 96. I'm no longer on the login node. I'm on a compute node.

And on this compute node, now, if I'm going to do some work for a while, this is where I should be working, on a compute node, not on the login node. So for example, I could do anything. Let's say run the hostname command, which we already know we're on 96, but it's just an example.

We can do any command here, but we're running on the resources of a compute node. We do this so that we don't take up space on the login node, where everyone has to come through at the same time. And we prefer to keep larger things, larger, more work off of the login nodes and on the compute nodes.

Now, when you want to exit this shell, just simply type exit, you end the srun session, and you're back to the login node where we started, back in the same folder that we had started from. So with srun and sbatch directives, again, just to go over what we saw with --time, many directives in an sbatch file or in an srun line can be expressed alternately by either a long or short syntax. So these are some things that you may commonly see.

Each formulation here means the same thing, it's just expressed differently. There is no right, there is no wrong for which one you use. And as you create your own sbatch files, and as you see others' sbatch files, don't be surprised if you see users mixing and matching different formulations.
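As a sketch of the interactive request and of the short versus long syntax, the two srun lines below ask for the same thing: a bash shell under the free account with one core, 1 GB for that core, and one minute (1:00 means one minute, zero seconds). The --pty option, which attaches your terminal to the shell, is an assumption here since it was not spelled out in the session. The snippet only prints the lines; paste one at the login node prompt to run it.

```shell
# Short option names.
short_cmd="srun --pty -A free -c 1 --mem-per-cpu=1G -t 1:00 /bin/bash"

# Long option names; --mem-per-cpu has no short form.
long_cmd="srun --pty --account=free --cpus-per-task=1 --mem-per-cpu=1G --time=1:00 /bin/bash"

echo "Short form: ${short_cmd}"
echo "Long form:  ${long_cmd}"
```

Once the prompt shows a compute node, commands such as hostname run there; typing exit ends the session and returns you to the login node.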

salloc is an alternative to srun, another way to obtain resources on a node. We're not going to delve too much into it because on our clusters, salloc and srun basically have similar behavior. On other clusters, you may see big differences between how salloc and srun behave.

On our clusters, we recommend using srun to obtain an interactive job, at least until you become more advanced and understand more about what you're doing. scancel. So scancel again was something that I had to jump ahead and show you before.

However, just to go through that once again, if we do send a job off to the nodes, we get a job ID. If we want to cancel our job, just do scancel with the job ID. It cancels the job and you can see, yes, the job is no longer running.

It's canceled. If you do scancel -u and use your UNI, you can cancel all of the jobs that you're running. You as users can only cancel your own jobs.

We as administrators sometimes use this to cancel anybody's job that's running on the cluster, particularly in cases where someone may be running a job that's consuming too many resources and threatening resource allocation on the cluster. Core-based jobs versus node-based jobs.
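The two scancel forms can be sketched as follows. JOB_ID is a hypothetical environment variable used here so that nothing is cancelled unless you set it; the cancel-everything form is left commented out on purpose.

```shell
job_id="${JOB_ID:-}"   # set JOB_ID to the number sbatch printed

if command -v scancel >/dev/null 2>&1; then have_slurm=yes; else have_slurm=no; fi

if [ "$have_slurm" = yes ] && [ -n "$job_id" ]; then
    scancel "$job_id"   # cancel one job
    squeue --me         # confirm it is gone from your queue
else
    echo "Set JOB_ID and run this on the cluster to cancel a job"
fi

# Cancels ALL of your jobs; uncomment only if that is really what you want.
# scancel -u "$USER"
```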

And one moment again, please. Core-based jobs can use less than a full node of 160 cores. And again, that's because each node has 80 physical cores, and hyper-threading effectively doubles that to give each node 160.

So with the core-based jobs, you specify how many cores you need. One is always a good place to start. If it's not enough and your job isn't running well, scale up from there.

Specify a memory requirement for each core that you're using. It's important to specify a memory requirement because without a memory specification, your job likely will probably just fail. These values will take some trial and error to discover what's best for you.

There is no hard and fast rule. You're going to, as you do your research on the cluster, know better than we about what your job may need. And again, just take some trial and error and you'll scale up from something small to something bigger.
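To make the cores-plus-memory request concrete, here is a small sketch of a core-based batch script. The numbers (4 cores, 4G per core, 30 minutes) are hypothetical starting values, not a recommendation from the class. The script reads SLURM_CPUS_PER_TASK and SLURM_MEM_PER_CPU, which Slurm sets inside a job, and falls back to the same values when run outside Slurm so the arithmetic can still be checked.

```shell
#!/bin/sh
#SBATCH --account=free
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --time=0-00:30

# Inside a job Slurm exports these; the defaults mirror the request above
# so the script also runs outside Slurm. SLURM_MEM_PER_CPU is in megabytes.
cpus="${SLURM_CPUS_PER_TASK:-4}"
mem_per_cpu_mb="${SLURM_MEM_PER_CPU:-4096}"

# Total memory for the job is cores multiplied by memory per core.
total_mb=$(( $cpus * $mem_per_cpu_mb ))
echo "Cores: $cpus, memory per core: $mem_per_cpu_mb MB, total: $total_mb MB"
```

If the job runs out of memory or runs slowly, raise one number at a time and resubmit, which is the trial and error described above.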

Now, core-based jobs contrast with node-based jobs. That's when, instead of asking for one core of a node, you specify that you want an entire node. Again, you can use either of these formulations to ask for a node.

Here, I'm asking for two nodes. If I combine either of these with the --exclusive directive, then I'm saying not just that I want two nodes, but that I want two nodes exclusively. Meaning, I want two nodes that do not have any other jobs running on them at all.

I want them all for myself. As opposed to just asking for two nodes, where, knowing I won't use them fully, Slurm can put some other jobs on there. So, --exclusive is valid.

Anyone can use that flag. However, if you do use the exclusive flag when asking for a node, it will make your job wait longer. Because again, Slurm has to find a node that doesn't have any job running on it, or wait until every job on a node is entirely done, before it will allow yours to run.
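As a rough illustration of a node-based request, the sketch below asks for two whole nodes and adds --exclusive. The one-hour limit is a hypothetical value. SLURM_JOB_NUM_NODES is set by Slurm inside a job; outside a job the script just reports that it is not running under Slurm.

```shell
#!/bin/sh
#SBATCH --account=free
# Two nodes; the short form of the next line is "-N 2".
#SBATCH --nodes=2
# Ask that no other jobs share these nodes. Expect a longer wait.
#SBATCH --exclusive
#SBATCH --time=0-01:00

# Slurm exports the node count inside a job.
nodes="${SLURM_JOB_NUM_NODES:-not inside a Slurm job}"
echo "Nodes allocated: $nodes"
```

Dropping the --exclusive line keeps the two-node request but lets Slurm place other jobs on whatever those nodes have left over.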

So, yes, --exclusive, you may need it sometimes, depending on how big your job is, but it will make you wait longer. Multiple copies of a job can be submitted by using the job array mechanism. One moment, please, again, while I take a drink to spare my straining voice.

So, as with other Slurm directives, you can request an array job either in a batch file, which is non-interactive and goes into the background on the cluster, or on the command line for an interactive job. So, we're going to demonstrate with a sample array job, array.sh. And that is in your directories. So, hold on.

If we use nano on array.sh, what we're going to do is edit this so that it has your account. Again, you're all part of a free account. Those of you who may know you have another account may use it, but everyone is part of free.

So, we can put the account in here. And even though this is doing a specific thing just as demonstration, it's not really important what this is doing. It's just important that we have it so we can demonstrate an array job.

So, now that we have our accounts in here, we write this out and exit nano. And again, here is our array job. And we can do sbatch array.sh. This sends our array job off to the cluster.

We can take a look at it. Whoops. And we can see with an array job, we had five components to our array.

scontrol show job shows us we have an array job with five different tasks in it. And if we use squeue --me, we can see this is what an array job looks like. The tasks all have the same job ID with a different suffix on it to show that they are all part of the same job that's running.
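The contents of the class's array.sh are not shown in the transcript, so the following is only a sketch of what a five-task array script can look like, not the file used in the demonstration. The job name and the sample_N.txt input names are hypothetical. SLURM_ARRAY_TASK_ID is the variable Slurm sets to each task's index, and %A and %a in the output pattern stand for the array's job ID and the task index.

```shell
#!/bin/sh
#SBATCH --account=free
#SBATCH --job-name=array-demo
#SBATCH --array=1-5
#SBATCH --ntasks=1
#SBATCH --time=0-00:10
#SBATCH --output=slurm-%A_%a.out

# Each of the five tasks sees a different index; default to 1 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Use the index to pick this task's piece of work (hypothetical file names).
input="sample_${task_id}.txt"
echo "Array task $task_id would process $input"
```

After sbatch, squeue --me lists the tasks under one job ID with the task index as the suffix, and scontrol show job followed by that job ID prints the details of each task.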

Now, this again, because I'm a free user, is still pending. It's not getting immediate resources. So, actually, I don't need to cancel that because we've already seen what an out file looks like.

Eventually, when this does finish, what we'll have is a Slurm .out file that will have this job ID in its name. And we can see the output in that .out file. So, next.

CPUs versus GPUs. The central processing unit is the brain of any computer. It can perform general purpose complex tasks.

Graphics processing units were originally developed for high end graphics calculations. These calculations are very specific and can run in parallel. It was later realized that any operation that can benefit from parallelism can be run on a GPU.

Not just graphical problems. And training AI models is a specific example of what can benefit from using a GPU. Not every problem can benefit from a GPU.

You need to understand what you are trying to do. And as you do your research, you will know better than us whether you need a GPU or not. A GPU node will have both CPUs inside of it and it has GPUs.

If you run a program on a GPU node, that does not mean it will use the GPU. Your individual program must be GPU aware and specifically take advantage of the GPU. If the program you're running isn't directed to use a GPU, even if you're running on a GPU node, it will only use the CPUs that are present on the node and it will not run with added GPU performance.

And this is a very important thing to remember. We have gotten questions in the HPC just on this very topic, where people didn't understand why their program didn't use a GPU. It's because they were on a GPU node, but their program didn't know how to use it.

So GPU prices are high. The GPU nodes are much more expensive than regular CPU nodes, but the performance is unbeatable if you have the right programs and right problems for them. Insomnia primarily has NVIDIA GPUs of varying vintages.

NVIDIA CUDA. CUDA is a proprietary parallel computing API from NVIDIA, and CUDA on the HPC is implemented as a module. So you would module load CUDA in order to have it in your environment.

And again, you would only do that if you were on a GPU node. CUDA is what allows programmers to write code that will be executed on the NVIDIA GPU. CUDA is being supported by many applications.

Many Python packages and libraries support CUDA. MATLAB supports CUDA for GPU performance. And of course, all AI and machine learning applications support CUDA.

GPUs are in high demand, and support and demand continue to grow for GPUs. GPU jobs. So in order to get a GPU node, you use this, the --gres directive, to request a GPU node.

Any GPU node in the cluster has a maximum of two GPUs on a particular node. And for free-tier users, if you're using a GPU node and running a GPU job, the job can run for a maximum of four hours. So this is how you would do it in an sbatch script.

You would have this directive asking for one or two GPUs. If you were using an srun line to get the command line on a GPU node, this is what it would look like. It's very important to remember, for free users particularly, that you must specify the short partition in order to get a GPU node.
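The slide's sbatch and srun lines are not in the transcript, so this is a hedged sketch of a free-tier GPU request built from what was said: one GPU through --gres, the short partition, and the four-hour ceiling. The module name cuda is an assumption; check module avail on the cluster for the exact name. CUDA_VISIBLE_DEVICES is normally set by Slurm for jobs that were granted GPUs, which makes it a quick check that the job really received one.

```shell
#!/bin/sh
#SBATCH --account=free
#SBATCH --partition=short
#SBATCH --gres=gpu:1
#SBATCH --time=0-04:00
#SBATCH --ntasks=1

# Interactive equivalent, typed on a login node (not run here):
#   srun --pty -A free -p short --gres=gpu:1 -t 0-04:00 /bin/bash

# "module" exists only on the cluster; the module name is an assumption.
if command -v module >/dev/null 2>&1; then
    module load cuda || echo "could not load a module named cuda"
fi

# Slurm normally exports this for jobs that were given GPUs.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: $gpus"

# nvidia-smi lists the GPUs, if the NVIDIA tools are present on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi could not query a GPU"
fi
```

Seeing a GPU listed here only shows that the node has one available to the job. The program you launch afterward still has to be written or configured to use it.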

If you don't have this in and you're requesting a GPU node, you'll get an error and you won't get a command line shell on a GPU node. Next, we're going to talk about job priorities and what's called fair share. So Slurm allocates resources by default with a first-in, first-out principle.

But it's not just that. And we use something called fair share to help decide when jobs run. Fair share is a scheduling policy used to ensure that compute resources are shared fairly among users and groups over time.

Fair share prevents users who've used a lot of resources recently from monopolizing the system and gives priority to users who have used less. Every job is assigned a priority based on fair share, but also some other factors like job age and size. And we'll delve more into that in a moment because this is a complicated subject and we're just trying to simplify it for all of you.

Fair share calculation in Slurm requires that the Slurm accounting database provides the assigned shares and the consumed computing resources. It takes into consideration computing resources allocated and computing resources already consumed, meaning as Slurm calculates your fair share value, it's looking at how much you're using now and how much you have used in the past, as well as how much you're asking for, how much memory, how many nodes. All of this is taken into account in a complex calculation to get your fair share score.

The fair share factor is a floating point number between zero and one. So to delve a little more into this, if you and another user both submit jobs, but you haven't used the cluster in a while and they've been running many jobs, your job might run first, even if you submitted your job after them. Because again, Slurm is calculating your fair share factor.

So if you have a fair share factor of one, that means you haven't been doing anything. So you're going to get higher priority. If you have a fair share factor of zero, that means you've been very, very busy and you're going to start getting lower priority in order for other people to get some more time.

Again, most people are somewhere in between. And an example is the command sshare. This is a simplified view.

With sshare, here you can see two users on the cluster. AT3708 has not been doing a lot. So this user has a higher fair share, closer to one.

So their jobs will get higher priority in terms of who goes first, who goes second. User SB5160 has been using the cluster a lot lately. And they have a lower fair share score.

So they're going to start seeing their jobs a little bit more delayed, comparing the two users. But again, this all has to take into account how much each person is asking for. Is this user, who has a higher fair share, asking for a node exclusively or a lot of memory? Is this person asking for only one core? There are a lot of things that go on in the background, making fair share a bit complex.

But in general, this is trying to give you an idea of what your fair share score is, and why some people may see jobs not running immediately, while others, who you think shouldn't be running before you, may be running before you. It's because of fair share. So the sshare command can view current share status.

All accounts will be displayed along with one user, you. And I will demonstrate that. If I use sshare, here I can see the fair share scores for every group on the cluster.

But the only user I see is me. Now if I want to see fair share scores for other users on the cluster, then that's sshare -a. Now with sshare -a, I see the fair share scores not only for other groups, but also for other individual users.
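The two sshare invocations described here can be wrapped in a small helper, sketched below. The function name show_fairshare is hypothetical. The guard is only there so the snippet does something sensible on a machine without Slurm; on a login node it simply runs the two commands, trimming the longer listing to its first 20 lines.

```shell
#!/bin/sh
# Print fair share information, or a hint when Slurm is not installed.
show_fairshare() {
    if command -v sshare >/dev/null 2>&1; then
        # All accounts, plus the row for your own user.
        sshare
        # -a adds every user's row; the listing is long, so show the top only.
        sshare -a | head -n 20
    else
        echo "sshare not found; run this on a cluster login node"
    fi
}

show_fairshare
```

On the cluster, replacing head -n 20 with less lets you page through the whole listing and compare your score with other users'.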

If I want to try and start to figure out why some person's job may be running sooner than mine, this is what gives me the numbers to make a comparison. If users have recently used less than their target share, their job priority increases. Users using more than their target share see their priority decrease.

Job priority with sshare. What this is detailing are the various columns that are in the output, and, hold on, actually I'll probably be able to demonstrate that more easily by piping sshare through less. So there are various columns in the fair share output. And what this slide gives are the definitions of the various columns.

Again, it's a complex thing. It takes a while for fair share concept to sink in, but this is meant to give you an introduction to what's happening on the cluster when it comes to who's running first, who's running next. Next, home directories.

Free and paid users on the cluster both get a 50 gig home directory and 150,000 inodes in that home directory. And the inode count is just a measure of how many files you can have, as opposed to the 50 gig limit, which is how much space those files take up. However, labs also have a group shared scratch space.

Free users all share one scratch space that is one terabyte only. So even though a free user may not be in the same lab or even the same department as another free user, all of the free users share one terabyte of scratch space. So if you are a free user only and don't have a partition in a lab group, be nice.

Please don't use too many files. Leave space for others. Scratch space is more suitable for code, scripts, and small data sets.

An important thing to remember is that nothing is backed up on the cluster, neither your home directory nor your scratch space. Now, the scratch space is on these highly redundant storage nodes that we went over in previous slides, and these storage nodes are very, very reliable. However, everything is mortal. Even with machines, nothing is quite invincible and accidents may happen.

In the event that a file is lost, or if you mistakenly delete a file, no, we do not have backups to restore from. So if you have a file that is critically important on the cluster, you should keep a backup copy yourself, whether you have it on your local hard drive, whether you have it in cloud storage, whether you have it someplace else. So if it's absolutely critical, you are your own backup.

Also, jobs are not allowed to run on the login node. And this gets back to what Wakas was mentioning earlier when we were on the login nodes and how using an sbatch script from the login node sends your job off to run on a compute node. If you want to do a lot of work, again, don't do it on the login node, but instead, as we did earlier, start an srun session and then do a lot of work there on one of your assigned compute nodes.

And it will be a different node pretty much every time. So you may not always be on the same node, but again, it doesn't matter because you all have the same home directory, no matter where you are. Data transfer nodes.

Basically, a data transfer node is a node that is just dedicated to letting you move data up and down, back and forth between the cluster and your local machine. So on Insomnia, there are no dedicated data transfer nodes. So if you want to move a file on and off of the cluster, you just use scp or rsync to go directly to Insomnia.

On Ginsburg, there is a dedicated transfer node called motion. So if you had a file you wanted to put on Ginsburg, provided you were a user on Ginsburg, you would scp or rsync that file to motion and it would deposit it into your home directory. And again, your home directory is universal; it's shared, so it's the same no matter which node you log into.

You could put the file on motion, and then when you log into Ginsburg, you will see that file right there. The home directory is omnipresent on all nodes in the cluster. So you can use scp or rsync.

Also, another thing that we have is Globus, which is a very fast way to transfer files. It is a separate program. Here's an example of using scp to scp a file from a local computer onto Insomnia into your home directory.
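The scp example from the slide is not captured in the transcript, so here is a minimal sketch run from your local computer. The UNI ab1234 and the file and folder names are hypothetical; the host name is the Insomnia address given at the start of the class. The commands are printed with echo rather than executed, so nothing is transferred until you remove the echo.

```shell
#!/bin/sh
# Hypothetical UNI; replace with your own.
uni="ab1234"
host="insomnia.rcs.columbia.edu"

# "~/" after the colon means your home directory on the cluster.
target="${uni}@${host}:~/"

# Copy one file into your cluster home directory.
echo scp results.tar.gz "$target"

# Copy a whole folder; rsync only sends what changed on later runs.
echo rsync -av data/ "${target}data/"
```

Swapping the two arguments of either command copies in the other direction, from the cluster back to your machine.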

Globus is configured on all the login nodes. Globus is a separate program that is dedicated to doing file transfers, and it does them very fast and very efficiently, much faster than scp or rsync. And this number here is what we have for the transfer nodes; again, on Insomnia the login nodes and transfer nodes are the same.

We have a 10 gigabit pipe to the internet. Basically, without being too technical, it's faster than most connections. Again, especially if you're transferring large data files, we recommend using Globus.

You can find out more information about Globus here. On Insomnia, we have large pipe connections to the internet. And Insomnia has Globus documentation as well.

It shows you how to use it. It's a very easy web-based file transfer mechanism. So, execute node access.

Basically, what this slide is saying is that on the cluster, there are three types of nodes. So there are nodes that are owned by your account, nodes that are owned by other accounts, meaning other labs, other people own those nodes. And then there are the public nodes, which are the login nodes.

And the login nodes, again, as we saw in the earlier diagram, are the only thing facing the outside world. So again, when you SSH to Insomnia, you log into one of two login nodes transparently in the background. And from there is where you would do, say, your srun or sbatch to a compute node.

Job runtimes. So if your job asks for 12 hours of runtime or wall time, as we call it, or less, it can run on any node in the entire cluster using the short partition. And again, this gets back to what we saw in the previous slide when we were talking about nodes owned by your account and nodes owned by other accounts.

What this is is that, let's say for demonstration purposes, this user is part of the PMG group, but they have a job that is a short job. Their group owns their own particular nodes. But since the job is very short, it can run on any node in the cluster.

It doesn't have to run on their nodes. So what that does is increase the chances that the job will run immediately. Because instead of only running on just the nodes that this group owns, which I believe are 10 nodes out of the cluster, by specifying the short queue, it can run on any node, any one of the 96 nodes.

And here we've requested time of 11 hours and 59 minutes, because again, to run on any node in the cluster, a node that you don't own, you can only have 12 hours or less on the short partition. Free users can only run jobs for 12 hours no matter what. Because free users are, again, free.

You're not part of a group which is owning a node. So in this example, a free user is submitting a job and they're asking for 14 hours. That's going to fail with an error.

Because again, free users regardless, no matter what, can only run a job for 12 hours. And again, as we noted in an earlier slide, if it's a GPU job, you can run for less than that. It can only run for four hours.

Now, if your job needs more than 12 hours, then it's only going to run on a node that you own. So in this example, this person is part of the PMG group and they own 10 nodes in the cluster. So they're asking for a job that can run for seven whole days.

So that's fine. It can run. But then Slurm is going to direct it to only run on one of the 10 nodes that they own and not on any node in the cluster.
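The 12-hour rule can be checked before submitting with a little arithmetic. The sketch below is a hypothetical helper, not a Slurm tool: walltime_minutes converts a time limit to minutes and fits_short tests it against the 12-hour ceiling. It handles only the D-HH:MM, D-HH:MM:SS, and HH:MM:SS forms, and it ignores seconds. Slurm accepts other forms, such as plain minutes, that this helper does not parse.

```shell
#!/bin/sh
# Convert a time limit such as 0-11:59 or 14:00:00 to whole minutes.
walltime_minutes() {
    spec=$1
    days=0
    # Split off the day count when the limit is written as D-...
    case $spec in
        *-*) days=${spec%%-*}; spec=${spec#*-} ;;
    esac
    # Split the remainder on ":" into hours, minutes (and seconds).
    old_ifs=$IFS
    IFS=:
    set -- $spec
    IFS=$old_ifs
    h=${1:-0}
    m=${2:-0}
    # Strip one leading zero so values like 08 are not read as octal.
    days=${days#0}; h=${h#0}; m=${m#0}
    echo $(( ${days:-0} * 1440 + ${h:-0} * 60 + ${m:-0} ))
}

# Succeed when the limit is 12 hours (720 minutes) or less.
fits_short() {
    [ "$(walltime_minutes "$1")" -le 720 ]
}

for limit in 0-11:59 14:00:00 7-00:00:00; do
    if fits_short "$limit"; then
        echo "$limit fits the short partition"
    else
        echo "$limit is over 12 hours and needs nodes your group owns"
    fi
done
```

The three limits in the loop are the ones from the examples: 11 hours 59 minutes, the free user's 14 hours, and the PMG group's seven days. For a free-tier GPU job the ceiling would be 240 minutes instead of 720.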

Miscellaneous comments. We have the free tier and the educational tier on Insomnia. Again, for this class purposes, you're all part of the free tier.

The educational tier is for classes and that has similar limits. The free and educational tier on Insomnia is brought to you by the School of Arts and Sciences, the SEAS department and the EVPR department. Each year, these departments contribute funds to buy another node that we'll add into the cluster and make available to the free tier.

And that's about it for the free and the educational tier. ACCESS. Advanced Cyberinfrastructure Coordination Ecosystem: Services and Support.

To summarize, ACCESS is a collaboration brought to you, the users, by the National Science Foundation. Basically, ACCESS is a way that you can apply for a grant to get HPC or HTC resources and other services provided free through the NSF. You just need to write a grant and have it approved.

That way, if you're a free user, which has limited resources, and you need much more computing power, a way to get it is to write an ACCESS grant and apply for that. If you want to learn more about ACCESS, the CUIT website has information about the ACCESS program and all that it can do for you. Also, you can email rcs at columbia.edu and ask about how to get into the ACCESS program.

Empire AI. Empire AI was first announced in April 2024. Empire AI is a New York statewide initiative backed by the governor.

The purpose of Empire AI is to establish next generation GPU-only computing clusters specifically for AI research done by educational institutions in the state of New York. Only a select number of institutions are part of Empire AI, and Columbia University is one of the first. If you want to find out more about Empire AI, at this URL on the CUIT website, you can find out more about just what Empire AI is.

Again, this presentation is being recorded and will be online, so you don't need to rush to write this URL down. You'll be able to access it yourself in a few days. Also, there are additional workshops, events, and trainings that take place on campus sponsored by the Foundations of Research Computing.

Also, the Columbia University Library has different workshops, data clubs, boot camps, user groups. These are all things that are on campus, but you can find out information at these URLs or also by searching for them through the CUIT website. You can also find the Insomnia or Ginsburg documentation online that is public to anyone, whether you have an account or not, and this is at cuit.columbia.edu.hpc. You will see a box where you can find the documentation for our clusters for the free users, our training video library, which is where this video will be hosted after this class is over, information about ACCESS, and Empire AI.

And with that, we come to the end of our presentation. Thank you so much for attending. Again, if you have further questions, if you want to get a paid account and not a free account, if you're eligible, or just have any other questions about this, you can email us at hpc-support at columbia.edu. And with that, we're out of time, but if anyone has any questions, I open the floor up if anyone wants to ask something.

Topics:

HPC Cluster Training

Recorded November 13, 2025

In this workshop, we'll introduce you to high-performance computing (HPC), explaining what it is and why it's essential for solving complex computational problems. We'll cover how HPC is used to run simulations, process large datasets, and accelerate research across various fields. Additionally, we'll provide an overview of HPC clusters, highlighting their architecture, how they work, and how they are used to distribute computational tasks across multiple nodes to achieve high performance.

Note: The video is a recording of a class session taught live to users who had all been provisioned with user accounts on the Insomnia HPC cluster. Users outside of that setting will not have an account on Insomnia, so you won't be able to log in and do the steps, though you can watch the lesson recording.

To perform the exercises, ask your Department Administrator if your department has a partition on the Ginsburg or Insomnia HPC clusters, and if you can be given an account. If they don't have a partition, then you may apply for a free account as long as a faculty member is willing to sponsor your access. Note that free tier accounts are supported with online documentation, but are not eligible for direct support from CUIT's HPC Engineers.
