Scheduling Your First HPC Jobs with Bash | 2026 HPC Training Series #4
Transcript
Halyna: Okay, so let us start. Welcome, everybody, to the Research Computing Workshop Series, Part 4. This is a class about scheduling your first HPC jobs with Bash. My name's Halyna, I'm a systems engineer on the HPC team, and I have a co-host with me today. What are the class objectives for today? This is not a Bash coding class. We will focus on Bash specifically for the Slurm job scheduler. By the end, you should be able to run, view, and analyze Slurm jobs. You should be able to read and modify Slurm sbatch scripts confidently, request compute resources, and debug failed jobs. It is not imperative that you know Bash for this class. You can be an effective Slurm user without knowing the Bash programming language fluently.
Let's take a moment to set up our environment and log into the cluster. You don't have to; you can just listen today, and you will have access for another week if you want to review the slides at your own pace and go through the exercises. To SSH into the cluster from your terminal, you just do ssh, your UNI, at insomnia.rcs.columbia.edu, that is, ssh UNI@insomnia.rcs.columbia.edu, and I'm gonna do it with you. If you don't have SSH keys set up, you will most likely get prompted for your password and a Duo login, so I'm just gonna give everybody a moment to log in. Once you're in, perhaps you want to copy over our files for today. I see that I ended up on the login 002 node, and I am in my home directory. I have some old files here, so I'm just gonna make a new directory, bash workshop, and cd there. And I'm gonna recursively copy the files from the Insomnia intro series bash workshop directory, copying everything here. I see a typo in the slide, sorry about that: it's not "-", it's "-r" for recursive copy. So I should have some files in my directory now that we'll take a look at later.
And just a very quick recap of what Bash is. It is a shell for Linux. It is also the default shell scripting language on essentially every Linux system.
Control flow is built in, as in many other programming languages, so Bash supports loops, conditionals, and variables. Like we said, it is not a prerequisite for writing sbatch files for Slurm. And a quick recap of what Slurm is: it is the high-performance computing job scheduler that we use here at Columbia. It is open source, fault-tolerant, and highly scalable, and it is used for both cluster management and job scheduling. It manages resources like CPUs, GPUs, and memory, schedules and monitors job queues, and acts as the cluster's workload manager. More than 60% of the top 500 supercomputers run Slurm at this time.
seonjoo lee: Oh, but when I was copying it, it says permission denied. Okay. Is anybody else having that issue? Shami Chakrabarti: Yes, me too. Nope. Shami Chakrabarti: Same, yes, same problem. Intra-series. Hyunkyu Lee: Yeah, I also have the same problem. Okay, Sam or somebody, could you just correct the permissions real quick while I'm speaking, or I can just do it. I'm not sure what happened. I think you should be able to do it now. Is it working? Sam Cho: Lina, I ask, use your sword. Sam Cho: Type in their uni's code at that time to, access Royals. Sam Cho: I see Rob Lane says no. You know, I think actually what needs to happen here: the group needs to be free. Sam Cho: Yeah, it's… I'll see. Because we don't specifically have an intro series group for the Slurm scheduler, and for this it needed to be free, so we're just quickly gonna change the group. Sam Cho: Thank you for… Should be better.
But again, it's not essential that you copy over the files now. I think you should be good, but let's move on, because we have a lot of material to cover, and we can adjust the permissions later, or while I speak. So just to reiterate, Slurm's key functions and features: resource allocation, job management, and high-performance fair scheduling. A quick recap of commonly used Slurm commands.
srun is to run a job interactively. sbatch, which we will cover a lot today, is to submit a batch script for later execution. squeue lets us view the status of jobs in the queue. sinfo shows the status of partitions and nodes. scancel cancels a job or jobs, pending or running. seff examines a completed job's efficiency.
Alright, so let us do a quick demo and refresh our memory of the interactive job with srun. What is an interactive job? It lets you run applications on compute nodes as if you were on a regular login node, but with resources dedicated to you. What this command does: it allocates a pseudo-terminal (--pty), so basically it will be like a shell; it specifies the amount of time this is gonna run, in my case 10 minutes; it gives the account, and all of us should be part of the free account right now; and I'm just gonna launch the bash shell. So let me copy this into my terminal and press Enter. We see that the job is queued and waiting for resources. Of course, the free account is a little bit limited on Insomnia in terms of resources, and all of us are trying to start jobs right now, so it might take a little while until all of us get resources allocated. But okay, finally it went through. Something to remember: srun is good for quick tests and debugging.
So, we want to see what is going on with our job. We will use the squeue command with the job ID, passing either -j or --jobs. And we see that it's running. We see some other parameters: that it's in the short partition, the time it's been running, and the node that I'm on, which is ins051. You can specify different flags if you want to see slightly different information. Another useful flag, if you just want to see all of your own jobs, is squeue with --me. In my case, I currently only have one job. You can, of course, also pipe squeue into grep, but we discourage that because it puts pressure on the scheduler. Another way to look at the statistics for the job.
sacct gives accounting for current and past jobs, so when this job finishes, we can still examine it with the sacct command. But for now, you can see different stats as well. Once the job finishes, we can also examine it with the seff command, which is Slurm efficiency. It is very useful, especially when you're analyzing your jobs later. I don't think it will tell me the efficiency for a job that is still running. Oh, it does. So in my case, I'm not taking very good advantage of the CPU or memory. My efficiency is close to zero, but this is because I'm essentially not doing anything much; I just have a bash shell. But once you are writing your jobs and you see that your CPU or memory efficiency is low, you can adjust the resources that you request accordingly. And, of course, you can always cancel your jobs with the scancel command. Another way to look up various statistics and job parameters is scontrol.
And, of course, there is sbatch, to submit a batch job instead of an interactive one. When you are working on writing your jobs, it's always a good idea to test with srun in an interactive session, but once you have everything worked out, it's better to move your jobs into batch scripts. And once again, what are batch jobs? They differ from interactive ones because they do not require user input while they're running. You write a script, you submit it to the scheduler, and Slurm runs it on a compute node. You don't need to stay logged in, you don't care about terminal disconnects, and it enables automation. So let's take a quick look at a very simple sbatch script; I believe in our directory it's the Slurm batch 01 shell script, if you want to take a look. And if you want, you can submit it with the sbatch command right now, just to test and play around. Again, with the free account, it might take a little while for it to start. But our job was submitted, as you can see.
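The monitoring commands from the demo can be collected in a small helper. This is only a sketch: the function name inspect_job is made up for illustration, the sacct field list is one reasonable choice among many, and seff is a contributed tool that may not be installed on every Slurm cluster.

```shell
#!/bin/bash
# inspect_job (hypothetical helper): summarize one Slurm job by its ID.
# Usage: inspect_job <job_id>
inspect_job() {
    local job_id="${1:-}"
    if [ -z "$job_id" ]; then
        echo "usage: inspect_job <job_id>" >&2
        return 2
    fi
    if ! command -v squeue >/dev/null 2>&1; then
        echo "Slurm commands not found; run this on a cluster login node." >&2
        return 1
    fi
    # Queue view: state, partition, elapsed time, and node while queued or running.
    squeue -j "$job_id"
    # Accounting view: also works after the job has finished.
    sacct -j "$job_id" --format=JobID,JobName,State,Elapsed,MaxRSS
    # CPU and memory efficiency summary.
    seff "$job_id"
}
```

On a login node, squeue --me lists all of your own jobs, and scancel followed by a job ID cancels one.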
So these are the minimal directives that are important to have on our clusters. You do need to specify an account, and some of you have access to accounts other than free. You do need to give the job a name, and you do need to specify the amount of time, in this case just one minute. And this is the command that it executes: it just echoes the host name.
We did mention that Slurm batch scripts are Bash scripts, and like all of them, they start with the shebang. That's hash, exclamation point, /bin/bash (#!/bin/bash), pointing to the bash executable. They also have #SBATCH lines, which are directives to allocate resources or time, and these are not comments. You will also have module loads and environment setup, and then the actual work. So again, #SBATCH lines are not comments. They are scheduler directives that tell Slurm what resources your job needs. If you ever want to comment out an #SBATCH directive, just use a double hash (##SBATCH).
So, a slightly longer example of a job. Here we not only specify the job name and time, we also request two nodes and the amount of memory that we want, 8 gigs. We also load a Python module, and then we run our Python program. So what's the workflow for a batch script? Essentially, you submit it with the sbatch command, it gets scheduled by Slurm, and then it gets executed on a compute node. A slightly different view: login node, sbatch, scheduler, compute node, and then your program gets executed.
In this new example we had something about loading a module, and perhaps some of you are not familiar with what that is. Modules are essentially tools for managing software versions and environment settings on HPC systems. You can examine the available modules with the module avail command, or ml avail for short. Enter. So I can see that in our environment I have a bunch of GCC modules in different versions, CUDA, oneAPI, and, of course, much more.
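A minimal sketch of the batch script described above, assuming the workshop's free account; the job name and the file name hello.sh are placeholders, and the commented-out memory line is there only to show the double-hash trick.

```shell
#!/bin/bash
# Assumption: the workshop's free account; use your own group's account if you have one.
#SBATCH --account=free
# Hypothetical job name.
#SBATCH --job-name=hello_slurm
# Wall-time limit of one minute, written as HH:MM:SS.
#SBATCH --time=00:01:00
# Double hash: the next directive is commented out, so Slurm ignores it.
##SBATCH --mem=8G

# Everything below runs on the compute node that Slurm assigns.
node_name="$(hostname)"
echo "Job ran on host: ${node_name}"
```

Saved as hello.sh, it would be submitted with sbatch hello.sh. Run directly with bash, the #SBATCH lines are ordinary comments and the script simply prints the local host name.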
Why would we use modules instead of RPMs or packages in our own conda environment? Modules keep things more consistent and simplify managing complex dependencies. They also standardize the HPC environment for all users and prevent overhead: not everybody has to create their own conda environment, which prevents excessive use of disk space. To load a module, it's just ml load and the module name. I'm gonna load GCC, and I can list my modules with the module list command and see that I have GCC loaded. Some modules depend on other modules, and we will usually have all the dependencies built in already. So, for instance, I think if I load one of the oneAPI modules, it will also load GCC and some other things that it requires. So again, just a quick refresher of the module commands: module avail, list, load, unload, and you can always do module help for a help page.
The distinction between resource and runtime requests in batch scripts. So let's just look at our Slurm batch 01 script again. Requests like CPU, GPU, and memory are resource requests, and runtime is how long the job runs. A word of caution: over-requesting resources delays scheduling, and the bigger your ask, the longer your job will generally sit in the queue. So always examine your jobs after they run to see whether memory and CPU efficiency are good, and adjust the requests accordingly. A good rule of thumb: start small and submit shorter test jobs first, then inspect with sacct or seff and adjust the directives.
I just briefly wanted to show you some other resource directives that you can use beyond the basic ones that we covered: partition, for the partition you want to use; memory; GPUs; number of tasks; number of nodes; CPUs per task; and time, which we covered. We have a little more detail on what each directive means, but memory is pretty self-explanatory, right? The number of nodes and CPUs per task we will cover in depth a little later. GPUs.
Of course, everybody these days wants to use a GPU, but a word of caution: GPUs are an expensive resource, and while they are available to people in the free partition, please request them with care. Your program specifically needs to be written to use the GPU; otherwise the GPU node will just sit idle, wasting resources. And, of course, there's the very helpful sbatch manual page provided by Slurm, which we recommend everybody review in detail to see which flags are mutually exclusive and to learn them a little better.
Best practices. Always set a time limit, of course. Always capture output. Load modules explicitly inside the sbatch script, and test interactively with srun before sbatch.
So, when your jobs fail or don't start. I'm just gonna quickly take a look at squeue. We see a bunch of jobs are running; a bunch of jobs, however, are pending with a Priority reason, and some with a Dependency reason. This is how we can find out why our job is not running: by taking a look at squeue. Always check the output files to see what the errors are, and like we said, interactive debugging really helps.
Common reasons for job failures. Often your job will run out of memory, and you will see the error in your logs if you set them up correctly; then you would just bump up the amount of memory you're requesting. Timeout: the job exceeds its time limit, so maybe you need more time for your job to run. Command not found: maybe you didn't load your modules correctly, or the software you're trying to use is not in your PATH, or maybe you loaded the wrong module.
And some of the more common mistakes that we see users make on our clusters. Often somebody will run a job or a script on a login node; please don't do that. Always request resources with either srun or sbatch for jobs. Often people will request too many CPUs but have single-threaded code. Of course, asking for excessive wall time.
Forgetting the submit directory. Not capturing output logs. Submitting thousands of very small jobs instead of using arrays. Confusing CPUs with memory. Not testing with srun sufficiently.
We do offer a lot of additional workshops, events, and resources, and of course we will share these slides with you at the end so that you can check them out. The Columbia Libraries department offers classes, and of course there is our sister team, Research Computing, Emerging Technologies. And, source back. If you have any additional questions, you can always email us at hpc-support; it will open a ticket for us and we will check in with you. I think I'll give it a couple of minutes before Al takes over, in case anybody has questions. Because I cannot really see the chat, I don't know if there were common questions in the chat that I could maybe answer.
Sam: I just wanted to make sure: is anyone still having issues with their account? Okay, sounds like we're good, so I'm gonna let Al take over with his part.
Al: Thank you, everybody. Okay, everyone, give me a moment to start sharing my screen. Okay, so can everyone see my screen? Wonderful. So, everyone, welcome to Scheduling Your First HPC Jobs with Bash, Part 2: array jobs, Slurm variables, and cluster submission queues. I'm Al Tucker, and I'll be your instructor for this part. I am a Senior Research Systems Engineer with Columbia's HPC group.
First we'll talk about job arrays. Job arrays are a more efficient way to submit large numbers of similar tasks. Instead of submitting each job one at a time, you can submit them all together as a single group. This reduces the overhead on the Slurm scheduler and helps the system handle large workloads more smoothly. A job array won't make your calculations run faster, but it helps Slurm manage the tasks more efficiently and can improve overall scheduling responsiveness.
There are going to be some examples following, and this is the inefficient way to do it. Let's say you wanted to run a job, myjob.sh, 20 times. The normal go-to way to accomplish this would be a for loop: for x in 1 to 20, do, submit your job, and you would pass it the variable x, so that your job would be submitted the first time, the second time, the third time, and so on, up to 20. And inside myjob.sh, this is a sample of what your submission script would look like. The input variable, the loop counter index, would be set to what was passed to the job submission at the beginning. That's how you would know whether this job was running the first time, second time, or third time. What happens here is that this results in 20 separate job IDs, and for Slurm, tracking these individually creates a little more overhead. So job arrays are a more efficient way to do this. And this is the syntax for submitting a job array. You would just use sbatch with the array option.
That is --array, with the number of times you want your job to be run. Now, inside your sbatch script, that loop counter we had before, the x, is represented by the variable SLURM_ARRAY_TASK_ID. So inside your script, this variable is always going to be the index of the loop, and it changes depending on which run this is: 1, 2, 3, 4. But what's happening here for Slurm is that you're only submitting one job with 20 different tasks, and that's easier for Slurm. So again, just to repeat, SLURM_ARRAY_TASK_ID is something that Slurm supplies to you; you don't need to supply it, and it will always be the index of each pass through the loop.
To use an analogy about why a job array is easier: a job array would be like keeping track of one group of 20 children on a field trip to a museum, versus the loop, which would be like unstructured free time, setting all 20 children loose to run wherever they want, so that you would have to try to keep track of each one individually. Both you and Slurm would find the first situation easier than the second.
So, a little more on job arrays. As the index for each task of the job, SLURM_ARRAY_TASK_ID will be 1, 2, 3, etc. When you use the squeue command on the command line, you would see something like this, with the job ID followed by _1, _2, _3, and so on. That is what would appear visually to you, but the variable still only contains 1, 2, 3, or more. For the parent job, here we're just saying that the job ID of the parent job is this, 8541355. The parent job ID that started the whole array is contained in another variable, SLURM_ARRAY_JOB_ID. There are some other variables, too, that may be useful, and again, these are things that Slurm gives you when your array job is running. So, SLURM_ARRAY_TASK_COUNT is the total number of tasks in the array.
SLURM_ARRAY_TASK_MAX is the maximum index number, and for most jobs these two would be equivalent. Whether they are always equivalent depends on the type of array you submitted. For example, if you submitted an array with odd-numbered tasks, 1, 3, 5, 7, then the total number of tasks would only be 4, but the maximum index number would be 7. So that's a case where they would differ. And then there's SLURM_ARRAY_TASK_MIN, which is the minimum index number of the job array.
There are other variables that Slurm gives you as well, aside from the array-related ones. These can be useful whether you're using a job array or not. SLURM_JOB_ID is the ID of your job, and it is always present whenever you run a job. SLURM_NODELIST is the list of nodes allocated to your job at the start. Now, this doesn't necessarily mean they were all used. If you request more nodes than you need, this would still list all of the nodes requested, and your job might be a little slower to start if you ask for more than you need, because Slurm is going to be looking to clear that many nodes.
Other Slurm variables? SLURM_SUBMIT_DIR. That is the directory from which the job was submitted, meaning where your sbatch was run from. This may not necessarily be where all your files are, so I'll just give a quick example of how this could be useful. Say you had all your files in one directory. You would go there first by cd-ing into that directory, and then you would submit your script with sbatch from that directory. Then, inside your myscript.sh, you could have this statement to cd to SLURM_SUBMIT_DIR. That way, when the script is running, it would know to center itself where you submitted your sbatch file, where all your files are, and as each process within your script runs, everything would see paths relative to that top path. So now we're going to move away from job arrays.
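The array variables and the submit-directory trick can be sketched together in one script. The account, job name, and the input_N.txt naming scheme are assumptions made for illustration; the fallbacks after ":-" exist only so the script can be tried outside Slurm.

```shell
#!/bin/bash
# Assumption: the workshop's free account.
#SBATCH --account=free
#SBATCH --job-name=array_demo
#SBATCH --time=00:05:00
# One job with 20 tasks, indexed 1 through 20.
#SBATCH --array=1-20

# Start from the directory where sbatch was run, so relative paths resolve there.
cd "${SLURM_SUBMIT_DIR:-$PWD}" || exit 1

# Slurm sets these for every array task; the defaults apply only outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
task_count="${SLURM_ARRAY_TASK_COUNT:-1}"

# Hypothetical naming scheme: one input file per array task.
input_file="input_${task_id}.txt"
echo "Task ${task_id} of ${task_count} would process ${input_file}"
```

A single sbatch submission of this file replaces the 20-iteration loop. An index specification with a step, such as --array=1-7:2, gives the odd-numbered case mentioned above: tasks 1, 3, 5, 7, so the task count is 4 while the maximum index is 7.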
In the next section, we're going to be talking about four queue submission types on the HPC clusters, and we'll go into some options for using each of them. The four submission types we'll be talking about are your group's normal partition, the short queue, the HPC test queue, and the burst queue.
First, we're going to talk about your normal account. That's the standard way you submit a job on the cluster, using an sbatch script under your normal account. Here, in this example, my account is RCS. Yours, of course, will differ. One hallmark of your normal account is that when you run a job, it's going to run on the node or nodes that your group owns, and you can potentially have a time limit of up to 7 days. Now, always specify a wall time with your job when you're making sbatch submissions. If you don't specify a wall time, your job will quit with an error. Another thing: if you don't know how much runtime your job needs, start small and work your way up. Don't just ask for 7 days because you can ask for 7 days. In this example, I've specified 42 minutes of wall time for my job. If your job needs more than 42 minutes, Slurm will kill it, because that was what you requested, and then you would just start it again with an increased wall time.
Also, whenever you submit a job, you should ask for at least one node, and always request memory for your entire job. Again, these numbers are going to depend on what you're doing, and they'll be different for everyone. If you're not sure, it's always better to start small and scale up, because the more you ask for, the longer your job could potentially wait before it runs, while Slurm frees up enough room relative to all the other jobs that are running on the cluster. But if you don't supply a node count and memory, your job could quit with unexpected errors, so it's always important to have these two.
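A sketch of a normal-account submission with the three requests stressed above: wall time, node count, and memory. The account rcs and the 42 minutes come from the talk; the job name, the file name normal_demo.sh, and the 4G memory figure are placeholders to be sized for your own work.

```shell
#!/bin/bash
# Account from the talk's example; replace it with your own group's account.
#SBATCH --account=rcs
# Hypothetical job name.
#SBATCH --job-name=normal_demo
# 42 minutes of wall time, written as HH:MM:SS.
#SBATCH --time=00:42:00
# Ask for at least one node.
#SBATCH --nodes=1
# Assumption: 4G is a placeholder memory request.
#SBATCH --mem=4G

# SLURM_JOB_ID is set by Slurm; the fallback applies only when run outside Slurm.
job_label="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job ${job_label} started at $(date)"
```

If Slurm kills the job at the time limit, it can be resubmitted with a larger value. Options given on the sbatch command line take precedence over the directives in the file, so sbatch --time=01:30:00 normal_demo.sh raises the limit without editing the script.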
Now we'll talk about the short partition. The short partition makes it possible to run a job on any node in the cluster, not just nodes your group owns. But you can only run it for a short time, and that's 12 hours. A reason why you would use the short partition: if your group didn't own a GPU node and you needed a GPU, this would be a way to get access to someone else's GPU node temporarily. Now, again, 12 hours is the maximum wall time, but start small and scale up. Just because you can have 12 hours doesn't mean you need it. In this example, I've only asked for 7 hours and 10 minutes. Also, as a short aside, in this example I'm using a GPU. If you're using a GPU with your job, your code should always load the CUDA module. These are GPU libraries that are necessary for working with a GPU.
Next, we have two other queues that are relatively new on the HPC clusters, and the first we'll talk about is the HPC test queue. Now, the HPC test queue is only available to groups that own nodes. So if you're a free account user or an EDU account user, you would not have access to the HPC test queue. But this is how you would call the HPC test queue in your Bash script. This queue was designed for those times when, between your own jobs, jobs from the rest of the people in your group, and jobs from the short partition that could be running on your node, you might experience a delay in getting your job assigned to your node when you do the initial sbatch submission. What the HPC test queue does is give your job higher priority over other jobs, but only on your node or nodes, so you can gain access to your node more quickly. Now, the trade-off here is that with the HPC test queue, the maximum wall time you can ask for is only 6 hours. This is intended to strike a balance between high priority and equal access for everyone.
Again, with the HPC test queue, it's always good to start small and work your way up. Just because I could ask for 6 hours doesn't mean I need to. Here, I've only asked for 2 hours and 13 minutes.
So we're going to talk about a small caveat about different queue options. Please take a moment to look at this example here. I'm going to point out that sometimes not all possible combinations of options go well together, and it's up to you as the user to think through the ramifications of what you're requesting. The next slide is going to make this clear. Using the HPC test queue as an example, this GPU request might not work. The reason is that the HPC test queue gives you higher priority on a node that your group owns. So if your group doesn't own a GPU node, you would not be able to request a GPU, and your job would fail with an error. By the same token, if your group does own a GPU node or nodes, then you will automatically be put onto a GPU node. Now, most nodes have only two GPUs, so it does still make sense to request one or two, depending on how many GPUs you actually need for your job. And again, when using a GPU, load the CUDA module before the rest of your code.
One more aside here when talking about GPUs. Just because you're requesting a GPU node doesn't mean your job is going to use a GPU. To use a GPU, your program needs to be GPU-aware. Everyone's research varies, so it's up to you as the user to understand your research and make the appropriate choices as you construct your jobs. As a general rule of thumb, if your program ran perfectly on a standard node, meaning a node without a GPU, then unless you consciously made changes in your code, running it on a GPU node would not make it use a GPU or make it run any faster. It would just be wasting space on a GPU node. So now we're going to move on to the other queue, the burst queue.
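A rough sketch of a GPU request with a check that a GPU was actually handed to the job. Several things here are assumptions: the partition name short, the free account, the --gres=gpu:1 request style, and the helper name gpu_allocated. On typical Slurm GPU setups the scheduler sets CUDA_VISIBLE_DEVICES for jobs that were allocated GPUs, which is what the check relies on.

```shell
#!/bin/bash
# Assumption: account and partition names; substitute your own.
#SBATCH --account=free
#SBATCH --partition=short
#SBATCH --job-name=gpu_demo
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --mem=4G
# Request one GPU on the node.
#SBATCH --gres=gpu:1

# gpu_allocated (hypothetical helper): succeeds only if a GPU is visible to this job.
gpu_allocated() {
    [ -n "${CUDA_VISIBLE_DEVICES:-}" ]
}

if gpu_allocated; then
    echo "GPU(s) visible to this job: ${CUDA_VISIBLE_DEVICES}"
    # On the cluster, the CUDA module would be loaded here before the real work,
    # for example: module load cuda   (the module name is an assumption)
else
    echo "No GPU was allocated to this job."
fi
```

The check only confirms that a GPU was allocated. Whether the program uses it still depends on the code being GPU-aware, as described above.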
This is something similar to the short queue, in that the burst queue will let you run a job on any node in the cluster, and you can even run that job for as long as 14 days. In this example, I'm using the burst queue. And this is a directive you would also use with the burst queue: --requeue. I'll go into that a little later. And just because I can ask for 14 days doesn't mean I have to; I'm asking for only 10 days and 2 hours. The trade-off with the burst queue, though, is that since you could be occupying a node for as much as 2 weeks, this queue has low priority, and it can be preempted by jobs from the node's owner. That means your job could be forced to quit and then be put back on the queue to restart later. That's why we have the --requeue option as part of your script when requesting burst.
For this to be efficient, you are responsible for making sure your job writes temporary files, and that's a technique called checkpointing. That way, if your job is preempted and forced to stop, meaning it's requeued to start again later, the checkpoint file is something it can use when it restarts to look at its own progress and start up again approximately where it left off, instead of rerunning from the very beginning. Reading these checkpoint files in your code, so that the job knows where it can start, is something you must figure out and account for when you are writing your code. It's not something that Slurm just magically gives you. Fully explaining how checkpointing is used is beyond the scope of this training session, and it requires the user to understand when it may be useful. Because, for example, even if you want to use the burst queue and checkpointing, not every job can just be interrupted and started later.
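A toy sketch of the checkpoint-and-requeue pattern, not a general recipe. The partition name burst, the account, the file name checkpoint.txt, and the five-step loop are all assumptions; a real job would save whatever state its own program needs in order to resume.

```shell
#!/bin/bash
# Assumption: account and partition names; check your cluster's documentation.
#SBATCH --account=rcs
#SBATCH --partition=burst
# Let Slurm put the job back in the queue if it is preempted.
#SBATCH --requeue
#SBATCH --job-name=ckpt_demo
# 10 days and 2 hours, written as D-HH:MM:SS.
#SBATCH --time=10-02:00:00
#SBATCH --nodes=1
#SBATCH --mem=4G

# Hypothetical checkpoint file holding the number of the last finished step.
checkpoint_file="${CHECKPOINT_FILE:-checkpoint.txt}"
total_steps=5

# Resume after the last finished step if a checkpoint exists, else start at 1.
start_step=1
if [ -f "$checkpoint_file" ]; then
    start_step=$(( $(cat "$checkpoint_file") + 1 ))
fi

step="$start_step"
while [ "$step" -le "$total_steps" ]; do
    echo "Running step ${step} of ${total_steps}"
    # ... the real work for this step would go here ...
    # Record progress only after the step has finished.
    echo "$step" > "$checkpoint_file"
    step=$(( step + 1 ))
done
```

When a requeued job restarts, the batch script runs again from its first line, so the resume logic has to live in the script or the program itself, as it does here.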
What we do want to point out here, though, concerns writing temporary files, whether you're writing them in your scripts for checkpointing or for anything else, for whatever reason your job needs to write a file and then read it again. The place to write files is the /local path. Or you can always use your own home directory or group scratch space as well. But what we'd like to caution you not to do is, please, when writing temporary files, do not write in /tmp or /var/tmp, because on the nodes these paths are on a smaller file system, and there's a danger that you could run the node out of space. If you run the node out of space, this could cause problems for your job and for the node in general.
So now we'll talk about some more job options. When you ask for a node and some memory for your code, the job isn't alone on the node. A standard node on the Insomnia cluster has 512GB of RAM and 160 cores, with something called hyper-threading enabled. So your job will share the node's resources, and Slurm will handle this so that it keeps jobs separated and knows which resources your job is using and which resources other jobs are using. Now, it is possible to get exclusive access to an entire node by using the --exclusive option. However, unless you're absolutely sure you need exclusive use of a node, this is generally not a good option to use. Why? Because your job would have to wait until Slurm can clear an entire node of all other jobs before your job can begin. So in a shared environment like the HPC clusters, if you use --exclusive, your job could be waiting for days instead of just minutes or hours before it could even start. So, another point here about which options go well together and which ones don't. Even when you're using --exclusive, it does make sense to request a memory allocation.
--exclusive means that you can use all the resources on the node, but it doesn't mean that you're automatically given all the resources on the node. So here, even though you're asking for a node exclusively for your use, your job, without the memory request, would only be given a trivial amount of memory, and that may not be enough. This is similar to what we talked about before, how running on a GPU node doesn't mean your code would actually use a GPU. You need to be aware of your options and account for them.
So, when you ask for a node and some memory, Slurm, invisibly in the background, fills in a 1 for most of the other questions that would be applicable to your job, such as how many CPUs your job should use. For many jobs, this is fine. However, you can be more precise in allocating resources for your job. Now, again, the following options may or may not make sense for you, depending on specifically what you are doing. With these new options here, --ntasks, --cpus-per-task, and --mem-per-cpu, you can specify how many tasks your job should have in a node, as well as how many CPUs each task should be given, and how much memory each CPU in each task is given. On the next slide, I'll try to explain more about just what this means.
So, when you're asking for tasks and CPUs, envision the node itself as a department store. When you're asking for a number of tasks, that would be like the division into different departments within the department store. When you ask for CPUs per task, each CPU would be like an employee working in one of the departments. And memory per CPU would be the equivalent of how much money each employee is given to work with in their department. So you're specifying resources more granularly. If you don't specify tasks and processors like that, Slurm would invisibly just fill in one for everything. And again, for many jobs, this might be fine.
Because we are talking about processes here, and not literal people. Another thing to point out is that memory is cumulative. And some options may not mix well. And so both of these examples here are asking for 24GB of RAM. But they're asking for it very differently. In the top example, you're asking for 3GB of RAM, for each CPU in each of the four tasks. So 3 times 2 times 4 equals a total of 24GB of RAM for this entire job. Versus just a straight-out memory request for the job here. Now, again, about options that work well together and don't, what you don't want to do is mistakenly mix conflicting options, like here. So since the mem per CPU request is cumulative. You're already requesting 24 gig for the entire job. So if in the same S-batch file, you also use the dash dash mem request to request 24GB for the entire job, what you're doing is you set up a conflict for yourself. And this could cause your job, in the best case, to fail with an error. So again, it's another example of when you use job options, you need to put a little bit of thought into what you're using, and how things may or may not go well together. Now, note that either of these two examples would be the quote-unquote correct way to resolve the conflicting memory problem that I talked about in the previous slide. So, in the first example. you're still requesting tasks and CPUs, but by asking with dash dash for 24GB, You've given 24GB for the entire job, and the tasks and the CPUs are free to fluidly use more or less as needed as the job runs, out of an overall pool of 24GB. In the second example, by using mem for CPU, You are specifically given each individual CPU in each of the tasks 3GB of RAM to work with. So… If one of the CPUs in one of your tasks only used 1GB of RAM, You'd be wasting some memory. But, even more, If an individual CPU needed 4GB, Or more of memory, Your job could fail entirely. So neither of these solutions is inherently better Than the other one. 
And if this sounds confusing, yes, it can be confusing. In the world of Linux, there are often multiple ways to do the same thing. And each way is valid. Deciding how much memory you need. Do you need to specify tasks? How many tasks? How to allocate RAM. All of these questions do not have a right answer. There is only what's right for you, depending on what programs you're using to do your research. So, yes, sometimes this may be just as much art as it is science. When starting out. A good way to get guidance on questions like these are from other colleagues in your lab, or researchers in your field, doing the same kind of work that you're doing. Once you have a clear idea of what you want to do, and it still fails. That is where the HPC support can be of best help to you. So… Wrapping up… It's impossible to cover the full magnitude of slurm options and variables in any single session. And we hope that you found this broad overview informative as you begin using our clusters. As mentioned before, a recording of this session will be posted later, among other training videos that we have. in our video library at this URL right here. Also, more specific help documentation about our clusters. Insomnia, Ginsburg, and Teramoto. can be found linked off of our main HPC page at this URL here, where it will cover the things I've talked about, as well as other things. And, of course, as always, Google and Gemini are never more than a click away. If you have questions for us after this presentation, you can reach the HPC group at hpc-support at columbia.edu. And at this point, this is the end of my section of the presentation, and if anyone has any questions to ask me. Feel free to ask away, and hopefully I'll have an answer for you.
Sam: I just wanted to make sure: is anyone still having issues with their account? Okay, sounds like we're good, so I'm gonna let Al take over with his part.
Al: Thank you, everybody. Okay, everyone, give me a moment to start sharing my screen. Okay, can everyone see my screen? Wonderful. So, welcome to Scheduling Your First HPC Jobs with Bash, Part 2: array jobs, Slurm variables, and cluster submission queues. I'm Al Tucker, and I'll be your instructor for this part. I am a Senior Research Systems Engineer with Columbia's HPC group. First, we'll talk about job arrays. Job arrays are a more efficient way to submit large numbers of similar tasks. Instead of submitting each job one at a time, you can submit them all together as a single group. This reduces the overhead on the Slurm scheduler and helps the system handle large workloads more smoothly. A job array won't make your calculations run faster, but it helps Slurm manage the tasks more efficiently and can improve overall scheduling responsiveness. There are going to be some examples following, and this first one is the inefficient way to do it. Let's say you wanted to run a job, myjob.sh, 20 times. The normal go-to way to accomplish this would be with a for loop: for X in 1 to 20, do submit your job, and you would pass it the variable X, so that your job would be submitted the first time, the second time, the third time, and so on, up until 20. And inside of myjob.sh, this is a sample of what your submission script would look like. The loop counter index would be set to whatever was passed to the job submission at the beginning, and that's how you would know whether this job was running the first time, second time, or third time. What happens here is that this results in 20 separate job IDs, and for Slurm, tracking these individually creates a little more overhead. So job arrays are a more efficient way to do this, and this is the syntax for submitting a job array. You would just run sbatch with one added option:
--array, followed by the range of indices you want your job to run over, such as 1 to 20. Now, inside of your sbatch script, that loop counter that we had before, the X, is represented by the variable SLURM_ARRAY_TASK_ID. Inside of your script, this variable is always going to be the index of the loop, so it changes depending on which time through this is. So it will be 1, 2, 3, 4, and so on. But what's happening here for Slurm is that you're only submitting one job with 20 different tasks, and that's easier for Slurm. Again, just to repeat, SLURM_ARRAY_TASK_ID is something that Slurm supplies to you; you don't need to supply it, and it will always be the index of each time through the loop. To use an analogy about why a job array is easier: a job array would be like keeping track of one group of 20 children on a field trip to a museum, versus the loop, which would be like unstructured free time, just setting all 20 children loose to run wherever they want, where you would have to try to keep track of each one individually. Both you and Slurm would find the first situation easier than the second. So, a little more on job arrays. As the index for each task of the job, SLURM_ARRAY_TASK_ID will be 1, 2, 3, etc. When you use the squeue command on the command line, you would see something like this, with the job ID followed by a suffix: _1, _2, _3, and so on. That is what would appear visually to you, but the variable still only contains 1, 2, 3, and so on. As for the parent job, here we're just saying that the job ID of the parent job is 8541355. The parent job ID that started the whole array is contained in another variable, SLURM_ARRAY_JOB_ID. There are some other variables, too, that may be useful, and again, these are things that Slurm gives you when your array job is running. SLURM_ARRAY_TASK_COUNT is the total number of tasks in the array.
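As a rough sketch of the array version of myjob.sh described above: the job name and the input file naming are hypothetical placeholders, and the fallback index of 1 is an assumption added only so the script can be tried outside Slurm.

```shell
#!/bin/bash
# Hypothetical job name, for illustration only.
#SBATCH --job-name=array_demo
# One job ID with 20 tasks, indices 1 through 20.
#SBATCH --array=1-20
# Add the account, wall time, and memory directives your cluster requires.

# The loop version from the talk would instead call sbatch 20 times:
#   for x in $(seq 1 20); do sbatch myjob.sh "$x"; done

# Slurm sets SLURM_ARRAY_TASK_ID for each task of the array.
# The ":-1" fallback is an assumption so a dry run works without Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: each task picks its own input file.
input_file="input_${task_id}.txt"

echo "Task ${task_id} would process ${input_file}"
```

Submitted once with sbatch, this gives Slurm a single parent job to track, and each task learns its own index from the variable rather than from a command-line argument.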
SLURM_ARRAY_TASK_MAX is the maximum index number, and for most jobs these two would be equivalent. Whether they are always equivalent depends on the type of array you submitted. For example, if you submitted an array with tasks of odd numbers, 1, 3, 5, 7, then the total number of tasks would only be 4, but the maximum index number would be 7. So that's a case where they would differ. And then there's SLURM_ARRAY_TASK_MIN, which is the minimum index number of the job array. There are other variables that Slurm gives you as well, aside from the array-related ones. These can be useful whether you're using a job array or not. SLURM_JOB_ID is the ID of your job, and that is always present whenever you run a job. SLURM_NODELIST is the list of nodes allocated to your job at the start. Now, this doesn't necessarily mean they were all used. If you request more nodes than you need, this would still list all of the nodes requested. And your job might be a little slower to start if you ask for more than you need, because Slurm is going to be looking to clear that many nodes. Other Slurm variables? SLURM_SUBMIT_DIR, and that is the directory from which the job was submitted, meaning where your sbatch was run from. This may not necessarily be where all your files are, so I'll just give a quick example of how this could be useful. Say you had all your files in a directory. You would go there first by cd'ing into the directory that you wanted, and then you would submit your script with sbatch from that directory. Then, inside of your myscript.sh, you could have this statement to cd to SLURM_SUBMIT_DIR. That way, when the script is running, it would know to center itself where you submitted your sbatch file, where all your files are, and as each process within your script runs, everything would see paths relative to that top path. So now we're going to move away from job arrays.
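The variables just described could be used along these lines. This is only a sketch: the fallback values mirror the 1, 3, 5, 7 example and the current directory, and are assumptions added so the script also runs outside Slurm.

```shell
#!/bin/bash
# Odd indices only, as in the example from the talk.
#SBATCH --array=1,3,5,7

# Start from the directory where sbatch was run.
# The fallback to $PWD is an assumption for dry runs without Slurm.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
cd "$submit_dir" || exit 1

# Slurm provides these for array jobs; the defaults here are illustrative.
task_count="${SLURM_ARRAY_TASK_COUNT:-4}"
task_min="${SLURM_ARRAY_TASK_MIN:-1}"
task_max="${SLURM_ARRAY_TASK_MAX:-7}"

echo "Working in ${PWD}"
echo "${task_count} tasks, with indices from ${task_min} to ${task_max}"
```

With this array, the task count (4) and the maximum index (7) differ, which is the case called out above.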
In the next section, we're going to be talking about four queue submission types in the HPC clusters, and we'll go into some options for using each of them. The four submission types are your group's normal partition, the short queue, the HPC test queue, and the burst queue. First, we're going to talk about your normal account. That's the standard way you submit a job on the cluster: using an sbatch script, under your normal account. Here, in this example, my account is RCS. Yours, of course, will differ. One hallmark of your normal account is that when you run a job, it's going to run on the node or nodes that your group owns, and you can have a time limit of up to 7 days. Now, always specify a wall time with your job when you're making sbatch submissions. If you don't specify a wall time, your job will quit with an error. Another thing: if you don't know how much runtime your job needs, start small and work your way up. Don't just ask for 7 days because you can ask for 7 days. In this example, I've specified 42 minutes of wall time for my job. If your job needs more time than 42 minutes, Slurm will kill it, because that was what you requested, and then you would just submit it again with a larger wall time. Also, whenever you submit a job, you should ask for at least one node, and always request memory to be allocated for your entire job. Again, these numbers are going to depend on what you're doing, and they'll be different for everyone. If you're not sure, it's always better to start small and scale up, because the more you ask for, the longer your job could potentially wait before it runs, while Slurm frees up enough room relative to all the other jobs that are running on the cluster. But if you don't supply a node count and memory, your job could quit with unexpected errors, so it's always important to have these two.
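A minimal sketch of a normal-account submission with the three requests discussed above (wall time, node count, memory). The account spelling, job name, and memory amount are assumptions for illustration, and the timing lines are an optional addition to help with the "start small and scale up" advice.

```shell
#!/bin/bash
# Account from the talk's example; replace with your own group's account.
#SBATCH --account=rcs
# Hypothetical job name.
#SBATCH --job-name=first_job
# 42 minutes of wall time (hours:minutes:seconds).
#SBATCH --time=00:42:00
# At least one node, and memory for the whole job (amount is illustrative).
#SBATCH --nodes=1
#SBATCH --mem=4G

start=$(date +%s)
echo "Running on $(uname -n)"

# ... your real work would go here ...

end=$(date +%s)
elapsed=$((end - start))
echo "Finished in ${elapsed} seconds"
```

Logging the elapsed time is one simple way to see how close a run came to its limit before choosing the next wall time request.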
Now we'll talk about the short partition. The short partition makes it possible to run a job on any node in the cluster, not just nodes your group owns. But you can only run it for a short time, and that's 12 hours. A reason why you would use the short partition? If your group didn't own a GPU node and you needed a GPU, this would be a way to get access to someone else's GPU node temporarily. Now, again, 12 hours is the maximum wall time, but start small and scale up. Just because you can have 12 hours doesn't mean you need it. In this example, I've only asked for 7 hours and 10 minutes. Also, as a short aside, in this example I'm using a GPU. If you're using a GPU with your job, your script should always load the CUDA module. It provides the GPU libraries that are necessary for working with a GPU. Next, we have two other queues that are relatively new in the HPC, and the first we'll talk about is the HPC test queue. The HPC test queue is only available to groups that own nodes. So, if you're a free account user or an EDU account user, you would not have access to the HPC test queue. But this is how you would call the HPC test queue in your Bash script. This queue was designed for those times when, between other jobs from the rest of the people in your group and jobs in the short partition that could be running on your node, you experience a delay in getting your job assigned to your node when you do the initial sbatch submission. What the HPC test queue does is give your job higher priority over other jobs, but only on your node or nodes, so you can more quickly gain access to your node. Now, the trade-off here is that with the HPC test queue, the maximum wall time you can ask for is only 6 hours. This is intended to strike a balance between high priority and equal access for everyone.
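For the short-partition GPU example mentioned earlier in this passage, a sketch might look like the following. The partition name short, the module name cuda, the account, and the memory amount are all assumptions; check your cluster's documentation for the exact names. The guards around module and nvidia-smi are there so the script does not fail where those commands are missing.

```shell
#!/bin/bash
# Replace with your own account.
#SBATCH --account=rcs
# Assumed partition name for the short queue.
#SBATCH --partition=short
# 7 hours 10 minutes, under the 12-hour maximum.
#SBATCH --time=07:10:00
#SBATCH --nodes=1
# Illustrative memory amount.
#SBATCH --mem=8G
# Request one GPU.
#SBATCH --gres=gpu:1

# Load the CUDA libraries before the rest of the script.
# The module name "cuda" is an assumption.
if command -v module >/dev/null 2>&1; then
    module load cuda
fi

# Report whether a GPU is visible; the program itself must still be GPU-aware.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_status="nvidia-smi found"
    nvidia-smi
else
    gpu_status="nvidia-smi not found"
fi
echo "$gpu_status"
```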
Again, with the HPC test queue, it's always good to start small and work your way up. Just because I could ask for 6 hours doesn't mean I need to go for that. Here, I've only asked for 2 hours and 13 minutes. Now we're going to talk about a small caveat about different queue options, so please take a moment to look at this example here. What I'm going to point out is that sometimes not all possible combinations of options go well together, and it's up to you as the user to think through the ramifications of what you're requesting. The next slide is going to make this clear. Using the HPC test queue as an example, this GPU request might not work. The reason is that the HPC test queue gives you higher priority on a node that your group owns. So if your group doesn't own a GPU node, you would not be able to request a GPU, and your job would have an error. By the same token, if your group does own a GPU node or nodes, then automatically you will be put onto a GPU node. Now, most nodes have only two GPUs, so it does still make sense to request one or two, depending on how many GPUs you actually need for your job. And again, when using a GPU, load the CUDA module before the rest of your code. One more aside here when talking about GPUs: just because you're requesting a GPU node doesn't mean your job is going to use a GPU. To use a GPU, your program needs to be GPU-aware. Everyone's research varies, so it's up to you as the user to understand your research and make the appropriate choices as you construct your jobs. As a general rule of thumb, if your program ran perfectly on a standard node, meaning a node without a GPU, then unless you consciously made changes in your code, just running it on a GPU node would not make it use a GPU or make it run any faster. It would just be wasting space on a GPU node. So now we're going to move on to the other queue, the burst queue.
And this is something similar to the short queue, in that the burst queue will let you run a job on any node in the cluster. And you can even run that job for as long as 14 days. In this example, I'm using the burst queue, and this is a directive you would also use with the burst queue: requeue. I'll go into that a little bit later. And just because I can ask for 14 days doesn't mean I have to; I'm asking for only 10 days and 2 hours. The trade-off with the burst queue, though, is that since you could be occupying a node for as much as 2 weeks, this queue has low priority, and it can be preempted by jobs from the node's owner. That means your job could be forced to quit and then be put back on the queue to restart later. That's why we have the requeue option as part of your script when requesting burst. In order for this to be efficient, you are responsible for making sure your job writes temporary files, and that's a technique called checkpointing. That way, if your job is preempted and forced to stop, meaning it's requeued to start again later, the checkpoint file is something it can use when it restarts to look at its own progress and start up again approximately from where it left off, instead of rerunning from the very beginning. Reading these checkpoint files so that the job knows where it can start is also something that you must figure out and account for when you are writing your code. It's not something that Slurm just magically gives you. Fully explaining how checkpointing is used is beyond the scope of this training session, and it requires the user to understand when it may be useful. Because, for example, even if you want to use the burst queue and checkpointing, not every job can just be interrupted and started later.
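To make the checkpointing idea a little more concrete, here is a deliberately simple sketch, not a general recipe. The partition name burst, the checkpoint file name, the CHECKPOINT_FILE variable, and the five-step loop are all hypothetical; a real job would save whatever state its own program needs.

```shell
#!/bin/bash
# Assumed partition name for the burst queue.
#SBATCH --partition=burst
# Put the job back on the queue if it is preempted.
#SBATCH --requeue
# 10 days and 2 hours (days-hours:minutes:seconds).
#SBATCH --time=10-02:00:00
#SBATCH --nodes=1
# Illustrative memory amount.
#SBATCH --mem=4G

# Hypothetical checkpoint location; it must survive a requeue.
checkpoint_file="${CHECKPOINT_FILE:-$PWD/checkpoint.txt}"
total_steps=5

# On a restart, read how far the previous run got.
if [ -f "$checkpoint_file" ]; then
    last_done=$(cat "$checkpoint_file")
else
    last_done=0
fi

step=$((last_done + 1))
while [ "$step" -le "$total_steps" ]; do
    echo "Working on step ${step} of ${total_steps}"
    # ... the real work for this step would go here ...

    # Record progress; write to a temp name first so a preemption
    # in the middle of the write cannot leave a half-written checkpoint.
    echo "$step" > "${checkpoint_file}.tmp"
    mv "${checkpoint_file}.tmp" "$checkpoint_file"
    step=$((step + 1))
done

echo "All ${total_steps} steps complete"
```

If the job is preempted after step 3, the requeued run reads 3 from the file and resumes at step 4 instead of starting over.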
What we do want to point out here, though, is that when you're writing temporary files, whether that's for checkpointing in your scripts or just because your job needs to write a file and then read it again, the place to write files is the /local path. Or you can always use your own home directory or group scratch space as well. But what we'd like to caution you not to do is write temporary files in /tmp or /var/tmp, because on the nodes these paths are on a smaller file system, and there's a danger that you could run the node out of space. If you run the node out of space, this could cause problems for your job and for the node in general. So now we'll talk about some more job options. When you ask for a node and some memory for your code, the job isn't alone on the node. A standard node on the Insomnia cluster has 512GB of RAM and 160 cores, with something called hyper-threading enabled. So your job will share the node's resources, and Slurm will handle this so that it keeps them separated and knows which resources your job is using and which resources other jobs are using. Now, it is possible to get exclusive access to an entire node by using the --exclusive option. However, unless you're absolutely sure you need exclusive use of a node, this is generally not a good option to use. Why? It's because your job would have to wait until Slurm can clear an entire node of all other jobs before your job can begin. So in a shared environment like the HPCs, if you use --exclusive, your job could be waiting for days instead of just minutes or hours before it could even start. Another point here about which options go well together and which ones don't: even when you're using --exclusive, it does make sense to request a memory allocation.
The --exclusive option means that you can use all the resources on the node, but it doesn't mean that you're automatically given all the resources on the node. So here, even though you're asking for a node exclusively for your use, your job, without the memory request, would only be given a trivial amount of memory, and that may not be enough. This is similar to what we talked about before, how running on a GPU node doesn't mean your code would actually use a GPU. You need to be aware of your options and account for them. So, when you ask for a node and some memory, Slurm, invisibly in the background, fills in a 1 for most of the other questions that would be applicable to your job, such as how many CPUs your job should use. Now, for many jobs, this is fine. However, you can be more precise in allocating resources for your job. Again, the following options may or may not make sense for you, depending on specifically what you are doing. With these new options here, --ntasks, --cpus-per-task, and --mem-per-cpu, you can specify how many tasks your job should have on a node, how many CPUs each task should be given, and how much memory each CPU in each task is given. On the next slide, I'll try to explain more about just what this means. So, when you're asking for tasks and CPUs, envision the node itself as a department store. When you're asking for a number of tasks, that would be like the division into different departments within the department store. When you ask for CPUs per task, each CPU would be like an employee working in each department. And memory per CPU would be the equivalent of how much money each employee is given to work with in their department. So you're specifying resources more granularly. If you don't specify tasks and processors like that, Slurm invisibly would just fill in one for everything. And again, for many jobs, this might be fine.
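In script form, the three options above might be written as follows. The specific values (4 tasks, 2 CPUs per task, 3G per CPU) are illustrative, and the fallback of 1 in the shell lines is an assumption so the script also runs outside Slurm, echoing the "fills in a 1" behaviour described above.

```shell
#!/bin/bash
#SBATCH --nodes=1
# Departments in the store: 4 tasks.
#SBATCH --ntasks=4
# Employees per department: 2 CPUs for each task.
#SBATCH --cpus-per-task=2
# Budget per employee: 3G of memory for each CPU.
#SBATCH --mem-per-cpu=3G

# Slurm exports what was granted; default to 1 when a value is not set.
ntasks="${SLURM_NTASKS:-1}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-1}"
total_cpus=$((ntasks * cpus_per_task))

echo "Tasks: ${ntasks}"
echo "CPUs per task: ${cpus_per_task}"
echo "Total CPUs for the job: ${total_cpus}"
```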
Of course, the analogy only goes so far, because we are talking about processes here, and not literal people. Another thing to point out is that memory is cumulative, and some options may not mix well. Both of these examples here are asking for 24GB of RAM, but they're asking for it very differently. In the top example, you're asking for 3GB of RAM for each of the 2 CPUs in each of the 4 tasks. So 3 times 2 times 4 equals a total of 24GB of RAM for this entire job, versus just a straight-out memory request for the whole job in the other example. Now, again, about options that work well together and don't: what you don't want to do is mistakenly mix conflicting options, like here. Since the --mem-per-cpu request is cumulative, you're already requesting 24GB for the entire job. So if in the same sbatch file you also use --mem to request 24GB for the entire job, you've set up a conflict for yourself, and this could cause your job, in the best case, to fail with an error. So again, it's another example of how, when you use job options, you need to put a little bit of thought into what you're using and how things may or may not go well together. Now, note that either of these two examples would be the quote-unquote correct way to resolve the conflicting memory problem that I talked about on the previous slide. In the first example, you're still requesting tasks and CPUs, but by asking with --mem for 24GB, you've given 24GB to the entire job, and the tasks and the CPUs are free to fluidly use more or less as needed as the job runs, out of an overall pool of 24GB. In the second example, by using --mem-per-cpu, you have specifically given each individual CPU in each of the tasks 3GB of RAM to work with. So if one of the CPUs in one of your tasks only used 1GB of RAM, you'd be wasting some memory. But, even more, if an individual CPU needed 4GB or more of memory, your job could fail entirely. So neither of these solutions is inherently better than the other one.
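The arithmetic behind the two 24GB requests can be checked with a few lines of Bash. The helper name total_mem_gb is hypothetical; the numbers are the ones from the example above, on a single node.

```shell
#!/bin/bash
# Hypothetical helper: total memory in GB for a per-CPU request.
# Usage: total_mem_gb TASKS CPUS_PER_TASK MEM_PER_CPU_GB
total_mem_gb() {
    echo $(( $1 * $2 * $3 ))
}

# Option A: --ntasks=4 --cpus-per-task=2 --mem-per-cpu=3G
per_cpu_total=$(total_mem_gb 4 2 3)
echo "Per-CPU request adds up to ${per_cpu_total}G for the whole job"

# Option B: the same total as one pooled request for the job.
echo "Equivalent pooled request: --mem=${per_cpu_total}G"

# Use one option or the other in a given sbatch file, not both.
```

Option A fixes a 3GB share for every CPU, while option B lets tasks draw unevenly from one 24GB pool; which fits better depends on how the program actually uses memory.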
And if this sounds confusing, yes, it can be confusing. In the world of Linux, there are often multiple ways to do the same thing, and each way is valid. Deciding how much memory you need, whether you need to specify tasks, how many tasks, how to allocate RAM: none of these questions has a single right answer. There is only what's right for you, depending on what programs you're using to do your research. So, yes, sometimes this may be just as much art as it is science. When starting out, a good way to get guidance on questions like these is from colleagues in your lab or researchers in your field doing the same kind of work that you're doing. Once you have a clear idea of what you want to do and it still fails, that is where HPC support can be of best help to you. So, wrapping up: it's impossible to cover the full range of Slurm options and variables in any single session, and we hope that you found this broad overview informative as you begin using our clusters. As mentioned before, a recording of this session will be posted later, among the other training videos that we have in our video library at this URL right here. Also, more specific help documentation about our clusters, Insomnia, Ginsburg, and Terremoto, can be found linked off of our main HPC page at this URL here, where it will cover the things I've talked about, as well as other things. And, of course, as always, Google and Gemini are never more than a click away. If you have questions for us after this presentation, you can reach the HPC group at hpc-support@columbia.edu. And at this point, this is the end of my section of the presentation, so if anyone has any questions to ask me, feel free to ask away, and hopefully I'll have an answer for you.
