Intro to High Performance Computing 2025 | HPC Training #4

Hello, everyone, and welcome to the Intro to High Performance Computing class. It's the fourth class in our HPC Training series.

And my name is Halyna Levatic. I'm one of the systems engineers on the HPC Team. And my other teammates are here.

You may recognize them from the previous classes. Al Tucker, Ryan Abreca, Sam Cho, Syed Bukhari, Wakas Hanif, and our manager, Max Shortte. And they will be assisting me and answering questions in the chat, should you have any.

So if you have any questions, please post in the chat. I'm going to share my screen and we will begin. Okay.

So what subjects are we going to cover today? First of all, what is HPC? Then some examples of HPC research at Columbia. Then a Columbia cluster overview. We'll cover different job types, including interactive jobs and simple batch jobs.

We'll talk about GPUs and MPI jobs and some other subjects. So first of all, what is high performance computing? And one answer, it's a magical land where parallel doesn't mean that things finish faster. It just means they fail faster together.

But that's a joke. And a serious answer is that a high performance computing cluster is a grouping of many powerful compute servers that are connected together to work through complex computational problems. And each compute server is called a node.

The nodes in each cluster work in parallel with each other, boosting processing speed to deliver high performance computing. So HPC systems are parallel, and you usually need to be able to use all the resources in parallel to benefit from HPC. This parallelism can be achieved in a couple of ways.

From many independent tasks, high throughput computing, to a single large parallel computation, which communicates across servers. HPC focuses on executing a small number of complex tasks in parallel to achieve the high speed processing, while HTC is geared towards handling a massive quantity of relatively independent tasks. Why would you want to use high performance computing at all? One reason is to save time and speed up your research.

So some examples might be a statistics student who wants to cross validate their model. And it involves running the model a thousand times. But each run takes an hour.

And obviously, on their computer, it will take over a month. But on an HPC cluster, it will be much less. Another example would be to analyze huge amounts of data.

So for instance, a genomic researcher who has been using small data sets will be receiving more data sets that are 100 times as large. It's difficult to open all the files in their computer and to analyze it there. So doing this on HPC cluster, again, much more appropriate.

What industries utilize HPC these days? And the answer is pretty much everyone, from automotive, to financial, to life sciences, academic, and of course, AI. Some more examples. So for instance, in research labs, HPC is used to help scientists find sources of renewable energy, understand the evolution of the universe, predict and track storms, and so on.

So it's used to predict storms, create new materials. In AI and machine learning, HPC is used to train and run the models, teach self-driving vehicles, and improve cancer screening techniques. It's used in financial services to track real-time stock trends and automate trading.

It's used to edit feature films and create special effects. HPC is used in simulations to develop treatments for diseases like diabetes and cancer, and to make patient diagnoses more accurate. Those are some examples.

So we have had some impressive research conducted on Columbia HPC clusters. And a lot of the research has led to over 100 peer-reviewed publications in various top-tier research journals, as you can see here. So some of them.

Brent Stockwell, on a quest for a cure, using molecular dynamics to simulate drug interactions. Then Jacqueline, researching sea level changes ranging from past glacial cycles to millions of years ago in order to infer ice mass changes. Ryan, researching ocean dynamics.

David Kipping, researching moons outside of our solar system. So, let's get into more specifics and take a look at the HPC cluster overview and the main components. So, usually, an HPC cluster will consist of one or more head nodes.

It's the node where the job scheduler is run and most of the management happens. Then you have compute nodes for running the jobs. And in our clusters, we have hundreds of them.

You have storage to store data temporarily. And we would have scratch space and home user space on our clusters. And, of course, network for data and message transfer.

So, the current HPC footprint at Columbia: at the moment, we have four clusters, Terremoto, Manitou, Ginsburg, and Insomnia. But we plan to integrate all of them into one cluster, Insomnia, within the next six months, approximately.

But if you add up all the resources that we have here, we basically have two petabytes of storage, 20,000 compute cores, 20 terabytes of RAM, and roughly 200 GPUs. And everyone is always curious about the GPUs. So, we do have a lot of NVIDIA GPUs, a lot of the A6000 series, but also L40, L40S, and H100, and we do have some Intel GPUs.

More will be added as the cluster grows, not just from the old clusters, but we will also purchase more hardware in the future. I also wanted to note that Insomnia utilizes the GPFS file system, which is a parallel file system. It is a type that is designed to allow multiple processes or nodes to read and write simultaneously.

What are the benefits? High IO throughput, scalability, reduced bottlenecks, concurrent access, it is optimized for HPC and AI, and it's also fault tolerant, efficient at large file handling. And here's a comparison of parallel file system versus NFS, but I'm not going to read it all out loud. But the slides will be available if you want to familiarize yourself with the differences a little bit more.

Moving on, we offer a lot of different software on our clusters, and here's just some of it. Anything from R to Python to MATLAB, you name it. And you might remember from the previous lecture that we create environment modules to manage our software.

So the modules utility allows you to load customized environments, and each module corresponds to a particular software tool or library. And why use modules? Why not use packages directly? It ensures that your application runs in a consistent and functional environment, basically. And, you know, one day you want to use one version of software, you just have to load the module.

And next day you want to do something else, you will load a different module. So if you recall from the previous class, you would type module avail to see the list of the modules. And here we go.

CMake, Julia, a lot of oneAPI modules, different R versions. This is Insomnia. And let's say I want to load R, you would just do ml load R or module load R. Then you can type module list or just ml to see what you have loaded.
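If you're following along, the module commands just mentioned look like this (the module name R is simply what shows up on Insomnia):

    module avail      # list all available modules
    module load R     # load the default R module (shortcut: ml load R)
    module list       # show what is currently loaded (or just: ml)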

Here we go. So that's all good. But how does one actually interact with the HPC environment, and how do you run jobs? You would use a piece of software called a job scheduler.

And there are several popular ones out there: PBS, Moab, HTCondor. But you may know that here at Columbia we use something called Slurm, created by SchedMD. And let's now jump to the fun part and do some hands-on exercises.

And you're welcome to follow along with me or, you know, you can just watch and do this later. You will have access for another week to Insomnia, I believe. So you can do that at your leisure.

And as a refresher, to log into Insomnia, you would open a terminal on a Mac or CMD on Windows and SSH with your UNI to insomnia.rcs.columbia.edu. So let me do that. And for me it will be hl2472. It will ask you for your password.
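In other words, something like this, substituting your own UNI:

    ssh <your_uni>@insomnia.rcs.columbia.edu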

I have SSH keys in place so it didn't ask me. And I just I will give everybody a moment to catch up and log in. Okay.

So I hope everybody is logged in and you would land in your home directory. You can verify that by typing pwd. And so your home directory should look similar to mine, Insomnia home and your uni.

And we do have some exercise files in /tmp. Please copy them to your home directory. And I'll do the same.

Copy recursively /tmp/workshop4 to here. Well, I already copied this earlier, but you shouldn't be prompted. And you should now have a workshop4 directory.

Let's just cd in there and look around. There are a bunch of files here. And let's take a look at hello.sh first.

So hello.sh is a bash script, but it is also a Slurm batch file. #SBATCH is a keyword after which you pass various parameters to Slurm. And please note that this hash is not a comment.

This is how the #SBATCH keyword is entered. But a double hash here is a comment. So what do these parameters mean? Account, which for us today is free.

Job name. Cores. I'm asking for one core.

And time, one minute. Memory per CPU, one gig. And this is a partition, but let's ignore it for now.

And all our program does is it sleeps for 30 seconds and then it echoes hello world. And we're actually going to modify it because we have nothing here for the account. And we will change that to free.
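For reference, once the account is filled in, hello.sh looks roughly like this; the exact job name and flag spellings in your copy may differ, and I'm leaving out the partition line we're ignoring:

    #!/bin/bash
    #SBATCH --account=free        # the account we are setting to free
    #SBATCH --job-name=hello      # job name (a ## line would be a comment)
    #SBATCH -c 1                  # one core
    #SBATCH --time=00:01:00       # one minute of wall time
    #SBATCH --mem-per-cpu=1G      # one gigabyte of memory per CPU

    sleep 30
    echo "Hello world"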

So I'm going to use vim to modify the file. But you're welcome to use your favorite editor, nano, emacs. So account free.

And I'm going to save. I'm going to cat the file again. And I'm ready to launch it.

And I'm going to use a command called sbatch for that. This command submits a batch job to the scheduler. And here I see that my job was submitted with a job ID.

Your job ID will be different. And now I'm wondering: is my job running? Was it submitted successfully? How would I see what's going on? For that, we use the squeue command, possibly with different parameters. With -j or --job, you can pass the job ID to it and take a look.

So we see my job ID, partition free, name, user, state, time, number of nodes. And something to note that my state is pending. So the job is not running.

And the reason is that I don't have a priority right now. But that's OK. We're just looking at the stats.

Like, we're not going to do anything with the job. So it's OK that it's currently pending. I also like the squeue --me command.
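So, to recap the commands so far, with a placeholder job ID:

    sbatch hello.sh       # submit the batch script; prints the job ID
    squeue -j <jobid>     # check the state of that specific job
    squeue --me           # list my own current jobs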

It will show me all of my current running jobs. OK. I'm just going to quickly look.

I see somebody got errors about an invalid account. Yeah.

OK. I'll help with that. So we can also view all the running jobs on the cluster.

If we just issue a plain squeue command. And I'm going to pipe it to less because it's a lot of output. And we can sort of spy on what other users are doing.

Also, some are running jobs. Some are pending like us. So again, we see job IDs.

We see partition names. These are all paid accounts. And here's our free lab name.

And please note that there are different reasons for pending jobs. For instance, partition node limit or priority or resources. And we will get questions pretty often.

Like, my job is not running for some reason, why is that? So it is useful to take a look at the squeue output after you launch your job and to note the reason.

So here a person is asking for more resources than are available. And here, priority: somebody else has priority over them.

And we'll talk about priorities a little later in this series. But it's a good starting troubleshooting point. Another useful command to examine your current and historical jobs is sacct, for Slurm accounting.

And so I see all of my current and some of my past jobs that I launched, also with various additional information. And if you have a job ID for something that ran previously but is completed, you can also pass it as a parameter and examine that past job further.
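A quick sketch of that, again with a placeholder job ID:

    sacct             # my recent and current jobs from the accounting database
    sacct -j <jobid>  # details for one specific, possibly completed, job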

Another command is scontrol show job. And scontrol is a powerful command that has a lot of flags. But we're just going to use it to further examine the details of our running job.

So, scontrol show job. A lot of useful information, but pretty self-explanatory: my ID, submit time, start time, whether it was completed.

So this one was. What script I used to launch it. And then the error log.

So you can also view the partition and node information with scontrol. And we're going to do it now. So right now, me and most of you are hopefully on the free partition.

We can get a little more information here. For instance, which nodes it contains, which groups are permitted, the priority tier, and how many CPUs and how much memory are in this partition. We can also view node information.

Let's say we spotted that node ins022 is in this partition. So we want a little more information. We can see that there are 112 CPUs currently allocated and 160 total.

And we see the OS version. Which partitions are here. We can even see the boot time.

Something to note: the node state. So this one is mixed, which means that some jobs are running on it, but not all the resources are utilized.

So there are some available idle resources. But the state can also be completely idle, which means it's available and not used.

Allocated, which means it's currently assigned fully. Down, self-explanatory. Drain, node is offline for maintenance.

And no new jobs will be assigned. Completing, node is in the final stages of job completion. Pending, and so on.
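To recap the scontrol commands we just used, roughly (the job ID is a placeholder, and ins022 is simply the node we spotted in the free partition):

    scontrol show job <jobid>        # full details for one job
    scontrol show partition free     # nodes, limits, and priority tier of a partition
    scontrol show node ins022        # CPUs, memory, state, and boot time of a node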

Again, you can always just go back to the slide to take a deeper look, if you're interested. So, interactive jobs. We saw an example of running a job with sbatch.

But there's also an option to run Slurm interactive jobs, which allow you to run applications on compute nodes as if you were working on a regular login node, but with more resources and control. And this is achieved by requesting a certain amount of time and resources from Slurm.

And then connecting to the allocated node via srun or salloc commands. And generally, you want to use sbatch for the longer jobs. But srun is good for, let's say, troubleshooting or shorter jobs.

So, we're gonna take a look at the file in workshop4, interactive.sh. It contains the command that you would use to run a simple interactive job. So, --pty means obtain a pseudo-terminal.

Run my job for one minute. Account is free. And I'm just launching /bin/bash, a shell.
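So the line in interactive.sh boils down to roughly this:

    srun --pty --time=00:01:00 --account=free /bin/bash   # a one-minute interactive shell on a compute node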

So, you're welcome to just cat the file and copy-paste this, not retype the whole thing. So, I am waiting for the resources.

Because the free partition is currently a little busy. But maybe let's cheat a little bit. I'm also part of the RCS partition.

So, I'm just gonna demo how it works. I launched my /bin/bash. And I ended up on node ins053.

So, a little more information. salloc, like sbatch, allocates resources to run a job, while srun launches parallel tasks across those resources.

srun can be used to launch parallel tasks across some or all of the allocated resources. And srun can be run inside of an sbatch script to run tasks in parallel, in which case it will inherit the pertinent arguments or options.

But please be very careful with the nested srun. You need to understand exactly what you're doing. So, best to avoid it, honestly.

And, you know, that's what we just saw. srun can also be used on its own to request resources and start tasks, where the tasks will be spread across the resources.

So, we did try this. And, yeah. This is in my PS1, the host name.

But, yeah. So, yeah. My time limit is over.

And I will be kicked out. My bash shell will be terminated, because my Slurm job will be terminated after a minute.

And now a quick example of how to use salloc. You see here, my previous job was terminated after a minute. So, I'm also going to use our RCS account for demo purposes.

But, yeah. You would pass similar parameters that you would to the batch script. Number of tasks, memory, time.

So, once again, I got a shell here. Ended up on another node. And let's say I realized something was wrong with my parameters.

I just want to cancel my job. I'm gonna see what's here. I can use the scancel command.

It will wait a little bit before canceling it. But it should eventually terminate. Oh, there we go.
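Putting the salloc and scancel steps together, roughly, with example resource values:

    salloc --ntasks=1 --mem=1G --time=00:05:00 --account=free   # allocate resources and get a shell
    squeue --me                                                 # note the job ID of the allocation
    scancel <jobid>                                             # cancel it if something is wrong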

I also wanted to mention core-based versus node-based jobs. So, the core-based ones require less than a full node of 160 cores, and node-based jobs require one or more entire nodes. And I guess I went over this very quickly, but on Insomnia, we do have 160 cores on each node.

But essentially, it is two sockets with 40 cores each, and hyper-threading is enabled. So it does add up to 160 cores per node. If you're not sure how many cores you need, just use one.

It's a good default. And we do have 512 gig on most nodes on Insomnia. And some of them are one terabyte.

So you can specify a memory requirement. But also, please try not to request more than you need, because there is only so much to go around in the free account. So node-based jobs will specify a number of nodes required.

And you can also request exclusive, non-shared access. For that, you would use the -N or --nodes parameter in sbatch, and you can use the --exclusive flag.
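In a batch script, a node-based, exclusive request would look roughly like this:

    #SBATCH --nodes=2      # request two entire nodes
    #SBATCH --exclusive    # do not share those nodes with other jobs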

But I would say you cannot use this with the current free account, or you will probably not be successful, because currently, there aren't too many resources. If you have your own lab account, and you have a lot of nodes, and some of them can be idle at any time, then perhaps you can exclusively request several nodes at a time. And I just quickly wanted to mention array jobs.

Multiple copies of a job can be submitted using the job array mechanism. As with other Slurm directives, you can specify it in a batch file, or you can do it interactively. So let's try to run our hello.sh job here from the command line.

And we're going to run it five times. And I'm going to take a look. And you see there are five copies of it, but pending.
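From the command line, that looks roughly like:

    sbatch --array=1-5 hello.sh   # submit five copies of the job as an array
    squeue --me                   # the copies show up with the job ID plus an array index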

Because again, the free tier is a little busy today. We're going to finish with this. I wanted to share some other useful commands with you, because we do get a lot of inquiries of this sort, when users overutilize their space and their quota, and they are not sure what exactly is going on.

So on Insomnia, we use GPFS. And therefore, we would use the mmlsquota command with some parameters. And again, you don't have to memorize this.

The slides will be available. But it is mmlsquota with -u, my username, a block-size option, and the Insomnia home file system. I'm looking up my home quota.

And it's pretty, pretty self-explanatory. I'm using 270 meg of space... or rather, where am I here, I'm using 1.7. It shows me my quota is 50 gig.

This is true for all the users on Insomnia. And all the users on Insomnia also get allocated 150,000 inodes, which for this purpose are essentially files. So you cannot have more than 150,000 files in your home directory.
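A sketch of that quota check; the file system name insomnia_home is my assumption of what the last argument looks like, so go by what the slides show:

    mmlsquota -u $USER --block-size auto insomnia_home   # block (space) and inode (file) usage against your quota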

But let's say you're getting warnings that you have used up your quota. How would you see what is taking up the space? I created a directory to demo a du command. But instead of my temp example directory, you would use the path to your home directory.

So du is disk usage. Max depth is two directories down. I'm looking in my temp example directory.

Errors go to /dev/null. And I'm sorting it and just showing the top 10. But let me fix my sort flags, sort -rh.

This is better. So I see that something called my test file is taking up 11 meg. Maybe I want to clean that up.

You would examine where you're taking up the most space and maybe clean it up if you're over your quota. If you're over your inode quota, you can use a find command. And you would see I have six inodes used in my example directory, which means six files, essentially, right? And here I have one inode allocated.
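The disk-usage and file-count checks look roughly like this, with your own home directory in place of my example directory:

    du -h --max-depth=2 ~ 2>/dev/null | sort -rh | head -10   # the biggest directories, two levels down
    find ~ -type f | wc -l                                    # a rough count of the files (inodes) you are using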

And right, for the quota, the block size is just the amount of data, measured in bits or bytes, but in megabytes or gigs for us. And inodes are, for our purposes here, just the same as files, the number of files. So let's talk about parallel computing next.

That's one of the whole points of HPC. The Message Passing Interface (MPI) is an application programming interface that defines a model of parallel computing where each parallel process has its own local memory. And data must be explicitly shared by passing messages between processes.

Using MPI allows programs to scale beyond the processors and shared memory of a single compute server to the distributed memory and processors of multiple compute servers combined together. MPI consists of a communicator, which is the heart of MPI.

And every process gets assigned a unique rank. It's a unique number, like an ID. Messages use predefined data types, such as integer, float, and character.

There are functions for point-to-point communication, MPI_Send and MPI_Recv, where a process can send an individual message to, and receive one from, another process. And for collective communication, MPI_Bcast and MPI_Scatter are used.

And of course, MPI parallel code requires some changes from serial code, as MPI function calls to communicate data are added, and the data must somehow be divided across the processes. So we do offer Intel oneAPI MPI and Open MPI on our clusters.

But of course, programs need to be recompiled to realize the full performance. And we're going to see some examples. First of all, let's again see the list of available modules.

We see there's Open MPI here. But for our example, I'm going to use the Intel MPI module. Actually, I'm going to go to a compute node first.

And all the work you should be doing on compute nodes, not on the login nodes. So I'm going to load the module. I'm going to verify that it's loaded.

And if I'm curious which MPI commands are available, I can just type mpi and hit Tab to see what possible autocompletions are there. So there's a C compiler, GCC and Fortran wrappers, ifx. But we're just taking a look.

And we're going to compile a quick C program from our work directory. And we're going to launch it. You don't need to be a C expert to understand what's going on here.

I'm just going to quickly take a look. So I include the MPI and standard I/O headers. In the main function, MPI_Init initializes the MPI environment.

And then I'm getting some information: the number of processes, the rank of the process, the processor name. And then I'm just printing out hello world and the processor information.
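If you want to see it written out, the file being described is essentially the classic MPI hello world; this is a rough sketch, and the variable names in your copy may differ:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                      /* initialize the MPI environment */

        int world_size, world_rank, name_len;
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* rank of this process */
        MPI_Get_processor_name(processor_name, &name_len);

        printf("Hello world from processor %s, rank %d out of %d processes\n",
               processor_name, world_rank, world_size);

        MPI_Finalize();                              /* clean up */
        return 0;
    }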

So exit out of here. And let's compile this program with mpicc -o mpi_hello mpi_hello.c. And I'm just going to give you a minute to do that. And I'm going to compile.

Okay. So here's my compiled mpi_hello program. And I'm going to use a utility called mpirun to run it across four different processes.

Hopefully. And you can also just cat the MPI build shell script if you don't want to type all of this. Okay.
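So the build-and-run steps, assuming the Intel MPI module is loaded, are roughly:

    mpicc -o mpi_hello mpi_hello.c   # compile with the MPI compiler wrapper
    mpirun -np 4 ./mpi_hello         # run it across four processes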

So my MPI hello program ran across the four processors. And I think we're going to skip over this bonus example. But you can play with it on your own time because we do have some other stuff to cover.

So let's talk about GPUs for a minute. What is it? It's a general-purpose graphics processing unit. And essentially it's a very powerful, fancy processor.

And it is a happy accident of history, because initially it was just used for graphics, but then it was discovered that it's really good for high performance computing calculations. And the price is high, but the price-to-performance ratio is unbeatable. Again, on Insomnia we primarily have NVIDIA GPUs, L40s and H100s.

But the Ginsburg cluster also has a lot of the A6000 series. And we do have Intel GPUs, available for testing, I believe, even to the free tier people upon request. And we have a CUDA module for GPU work.

GPU computing is now supported by many applications, including Python packages, MATLAB, and machine learning applications. And demand continues to grow. We do plan to add more GPUs to the cluster.

So in the batch file, you can use the --gres directive to request a GPU. And let me do that in my file. But I don't believe you would be able to do that with a free account.
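For example, the extra directive in the batch file would look roughly like this, with a count of one GPU just as an example:

    #SBATCH --gres=gpu:1   # request one GPU on the allocated node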

So I'm going to modify hello.sh to use the RCS account. And I'm going to add this. And I'm going to sbatch it.

Hello, my job is submitted. Well, that actually didn't quite work as I intended. I'm going to go back to my home directory, to the workshop directory where the hello program lives, and sbatch it again.

Well, I should have ended up on a GPU node, but didn't, so let me try to demonstrate with srun instead. Okay. So from the command line, I said that I wanted to request one GPU and started a shell, and ended up on node ins035.

And how do I know what GPUs I have here? I can use the NVIDIA commands. nvidia-smi, for instance, will give me a little bit more detail about the GPUs. And on this node, if I do ml avail, I will see that I have CUDA available here.

So let's move on and talk about job priority and something called fair share. So how does Slurm allocate resources to users and decide the priority? By default, if we don't make any changes in Slurm.conf, it will be FIFO, first in, first out. But at Columbia, we use something called fair share.

And it is essentially a scheduling policy used to ensure that compute resources are shared fairly among users or groups over time. But we're going to focus mostly on user fair share today. And it prevents users who have used a lot of resources recently from monopolizing the system and gives priority to users who have used less.

And as this helpful picture shows, this tall user, I guess, used a lot of resources and the others haven't. So if it were FIFO, the resources wouldn't be distributed fairly, whereas with fair share, everybody gets a chance. So every job is assigned a priority, but some other factors, like job age and size, also play into deciding the priority.

Fair share calculation in Slurm requires that the Slurm accounting database provide the assigned shares and the currently consumed computing resources. It takes into consideration the computing resources allocated and the resources that have already been consumed. And the fair share factor will prioritize the queued jobs based on underutilization.

And this factor is a floating point number between 0.0 and 1.0. So the higher my fair share number, the closer it is to 1.0, the more priority I have. So if you and another user are on the same partition, let's say free, and you both submit jobs, and you haven't used the cluster in a while but they have been running many jobs, your job will probably run first, even if you submitted after them. And again, it's because of this fair share factor.

Yeah. So you can see that information with the sshare command, and it will actually give you more output, but this is just an excerpt. So you see Halyna used 0.02 of the shares and her fair share factor is 0.8, and Syed used 0.8, so his fair share value is lower.

Therefore, Halyna will get a higher priority for the jobs she submits. So you can also try typing the sshare command and taking a look. I'm going to pipe it to less.

There will be some other information. It will show me all the partitions and my user. And this is my user in the RCS partition, and this is my user in the free partition.

Note how the numbers are different. So in RCS partition, my fair share is pretty close to 1, which means I have a pretty high priority. I haven't run too many jobs compared to other users.

In the free partition, I've been running a lot of jobs, and my number is pretty low. So my priority here is pretty low, as you have seen, because my jobs kept sitting in the pending state. And if you want to see fair share for all the users, you can use the -a flag.

We're also going to pipe it to less. And yeah, you can see what's going on on the cluster. And you can use this with grep.

So let's say I want to see what everybody else is doing in the free partition and how come my jobs don't have a priority. So I can see, you know, a lot of people maybe have a higher priority than I do right now. Mm-hmm, that's been covered.
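The sshare commands just used look roughly like this; grepping for free assumes the free account appears under that name in the output:

    sshare | less           # my shares and fair share factor per account
    sshare -a | grep free   # fair share for all users under the free account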

And a little bit more information. This pertains to the lab accounts. So like I said, fair share can apply to users, but it also applies to the accounts that we have, meaning right now we're on a free account, but all the labs have their own separate accounts.

So raw shares is the number of shares a lab has been allotted, and usually that depends on the number of CPUs they purchased. Norm shares is the percentage of the group's raw shares relative to the total shares for the entire cluster. Raw usage is the amount of compute usage a lab has run or requested in their jobs, and effective usage is the percentage of actual cluster use by the lab. So all of these are used to calculate fair share for a particular lab, but we're not going to focus on this too much today.

But, you know, documentation is available if you get a paid lab account. And we did mention this when we were looking at the GPFS command, but both free and paid users get a 50 gig block of storage in their home directories and roughly 150,000 inodes, meaning you can write that many files, essentially, in your home directory. And there's also something called free scratch space.

It is a one terabyte block and 1 million inodes. It's available for all free users to use. But, of course, it's shared.

So there's some limitation and, you know, no one should be writing too much data there. So scratch space on Insomnia is in here in our GPFS mount departments free. And please keep in mind that it is suitable for code scripts, data sets, but there's no backups.

So if you write something into the scratch space or your home directory and you need it and it disappears, no backups. So please copy it out somewhere more secure. Don't keep it on the cluster if it's valuable data.

And I did mention that you shouldn't submit any jobs on the login nodes. And I'm going to step back a little bit. But when you SSH into Insomnia, you would end up either on this login 001 or login 002 node.

These are the login nodes, which are round-robined behind the F5 load balancer, so whichever has the fewest users will get the new connection. And you are welcome to run srun or sbatch commands from them.

But you should be working on the compute nodes. Generally, because if everybody does work on login nodes, there are just two of them. Of course, they can get overloaded.

And as I mentioned, don't keep any data on the cluster if you need it and you don't want to risk losing it. You can transfer it out either using SCP or rsync or using something called Globus that we offer on transfer nodes. And basically, on Manitou and Insomnia, the transfer nodes are the same as login nodes.

And on Ginsburg, it's motion.rcs.columbia.edu. If you want to learn a little bit more about Globus, you can take a look either at their website or at our Columbia Insomnia Globus documentation. All of our transfer servers have 10-gig connections. I also just wanted to reiterate the distinction between different types of execute nodes. There are nodes that are owned by an account.

Nodes owned by another account, and the login nodes, which are public-facing and actually routable from the outside, whereas all the compute nodes are not. And you can use the sinfo command if you want to see information about nodes in the cluster. Well, it's one of the commands.

So, it will tell you which nodes belong to a certain account, which ones are mixed, and so on. Another thing to note is that if you have a free account, you only have 12 hours of so-called "wall time". Wall time is essentially how long your jobs can run.

And on a free account, wall time is 12 hours. If you have a paid account, it will be five days.

I also wanted to mention some outside resources for free HPC access. It's ACCESS, the Advanced Cyberinfrastructure Coordination Ecosystem, and Columbia doesn't run it but participates in it. It is a virtual collaboration, and a lot of different universities contribute to it.

A little bit more info is in the next slide, but if you are interested, you can reach out to the RCS team and request more information or an account. So, for our own free/EDU tier, each year we will purchase a new node to add to the free tier. The free tier used to live on the old Terremoto cluster, and those users have been migrated from the retired equipment to the new Insomnia cluster. As we saw just now, it only has one node, ins022, but we do plan to add more. So, the free partition now lives here.

And if you're interested in any additional workshops, events, and trainings, you can always take a look at our website for Foundations of Research Computing to see what events we have coming up, and Columbia Libraries also run some events. You can check out the links here and see if there's anything HPC related.

And as always, documentation is available online in our Columbia Confluence. The links are here for Insomnia and Ginsburg, and you can also go to cuit.columbia.edu/hpc directly. And I do believe we will have some time for questions.

Okay. I hope you found this helpful and if you have any other questions, you can always reach out to HPC support email. It will generate a ticket for us.

So I'll take a couple of questions. Ajit asks: most of your use is interactive; any best practices other than srun usage? Well, the same best practices apply as to batch jobs.

Just be mindful of the flags you're passing to the srun command and the resources you are requesting. I would say that the biggest issue that I've seen is users requesting too many resources: too many CPUs, too much memory that's not actually available on the cluster.

How do you increase the amount of time it takes to auto log out? If you mean how do I request a longer time, then instead of the one minute or whatever we were specifying, you can specify up to 12 hours. So instead of 00:01:00, you would use 12:00:00. I hope that answers your question, Chase.

Five or so minutes. Well, so Chase is saying that he gets kicked out from the SSH shell after five minutes of inactivity. That shouldn't be the case.

Yeah. I would check your internet connection because our SSH defaults are much, much higher. It could be something maybe in your personal user settings.

But you should definitely have more than five minutes of inactivity. And I forget off the top of my head which flag you can pass to SSH to increase the time. But yeah, you should be able to do that.

There's also a keep-alive in your SSH settings. There's a config file with a keep-alive parameter which you can modify. If you share your email address, we can send you the actual parameters and how to modify that file.

This could also be something to do with your internet provider; when I switched to T-Mobile, it would kick me out after five minutes as well. And there is a keep-alive setting you can add. Al just shared it with you in the Zoom chat window.

But there is a permanent parameter you can add to that file. So if you share your email address, I can email you that syntax or that file to update. Okay.
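For reference, the keep-alive settings being referred to live in the standard OpenSSH client config on your own machine; this is a general OpenSSH option, not something specific to our clusters, and the values are just examples:

    # in ~/.ssh/config
    Host insomnia.rcs.columbia.edu
        ServerAliveInterval 60    # send a keep-alive every 60 seconds
        ServerAliveCountMax 5     # tolerate a few missed replies before disconnecting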

Thank you, everyone. I think we can finish this workshop. And again, slides should be available online tomorrow.

We will email a link. Thank you, everybody. I hope you found this informative.

Recorded April 2025

In this workshop, we'll introduce you to high-performance computing (HPC), explaining what it is and why it's essential for solving complex computational problems. We'll cover how HPC is used to run simulations, process large datasets, and accelerate research across various fields. Additionally, we'll provide an overview of HPC clusters, highlighting their architecture, how they work, and how they are used to distribute computational tasks across multiple nodes to achieve high performance.

Note: The video is a recording of a class session taught live to users who had all been provisioned with user accounts on the Insomnia HPC cluster. Users outside of that setting will not have an account on Insomnia, so you won't be able to log in and do the steps, though you can watch the lesson recording.

To perform the exercises, ask your Department Administrator if your department has a partition on the Ginsburg or Insomnia HPC clusters, and if you can be given an account. If they don't have a partition, then you may apply for a free account as long as a faculty member is willing to sponsor your access. Note that free tier accounts are supported with online documentation, but are not eligible for direct support from CUIT's HPC Engineers.