Harnessing Accelerated HPC Workloads with Singularity | 2026 HPC Training Series #6

Transcript

Good afternoon, everyone, and welcome. My name is Syed Bukhari, and I am a member of the CUIT HPC team here at Columbia University. Today's session is about Singularity, also known as Apptainer, and how it helps us run software reliably on HPC systems. The session is split into two parts. I will cover the concepts, and then we will move into a hands-on demo, which my co-worker Waqas will walk you through. I will keep this simple, so no prior container experience is needed. Over the next few minutes, I want you to understand what problem containers solve, why Singularity is used in HPC, and how it fits into real workflows like Slurm and GPU nodes.

So, let me quickly walk you through what we will cover. We will briefly cover what containers are, why we use them, where they come from, and how Singularity fits into HPC. Then we'll move into a hands-on demo.

So, what are containers? A container is a packaged software environment. It includes your application and everything it needs to run. It is lightweight: it shares the system kernel, but isolates the software environment. There are three key ideas. It is self-contained, so everything is bundled together. It is portable, so it runs the same across systems. And it is a single image, usually a .sif file in Singularity.

Now, let's look at why this matters in HPC. So why containers? The biggest reason we use containers is consistency. HPC systems are shared, and it's not practical to install every dependency globally. Containers solve this by letting users bring their own environment.

So, three quick benefits. Dependency management, so it avoids version conflicts. Reproducibility, so it gives the same results every time. And security, so no root access is required on a shared system. So containers are not just convenient, they solve real problems in HPC. So now, where do these containers come from?

So, containers come from registries, which are basically libraries of pre-built images. Docker Hub is the most common, and for HPC, NVIDIA NGC is important for GPU workloads. In practice, you usually start with an existing image and customize it.

So, another key question: why Singularity instead of Docker? This is actually a very important slide. It's why we use Singularity on HPC systems rather than Docker. Docker is great for development and cloud environments, but it relies on a daemon and elevated privileges. That is not ideal for shared HPC systems. Singularity, on the other hand, is designed for HPC: no root required, and it integrates with Slurm, GPUs, and shared file systems. And when you run a container, you stay yourself, so you don't need root access. So the key line is: Docker is great for building containers, while Singularity is built to run them safely on HPC systems.

So, now let's talk about how Singularity actually works on our HPC system. There are two important things to understand here: where your containers run, and how you run them.

First, the execution environment. Containers run on compute nodes, not login nodes. Login nodes are not for running heavy workloads. Containers can be resource intensive, so they need proper CPU, memory, and sometimes GPU allocation. That's why we run them through the scheduler. It gives us an isolated environment and fair resource usage.

The second is job submission. So how do we actually run containers? We use Slurm, just like any other HPC job. For testing or development, you can start an interactive job using commands like srun or salloc. This lets you run and test your container in real time without waiting in the batch queue.
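
As a rough illustration of the interactive job request mentioned above, here is a minimal sketch. The partition, account, and time limit are hypothetical placeholders, not values confirmed by the speakers, and the request only runs when RUN_INTERACTIVE=1 is set, so the snippet is safe to try on any machine.

```shell
# Hypothetical values: substitute your own partition, account, and time limit.
partition="short"
account="myaccount"
walltime="0-01:00"

# The interactive request: --pty attaches a terminal to the shell on the compute node.
# (salloc -t "$walltime" -p "$partition" -A "$account" is the allocation-only alternative.)
interactive_cmd="srun --pty -t $walltime -p $partition -A $account /bin/bash"

if [ "${RUN_INTERACTIVE:-0}" = "1" ] && command -v srun >/dev/null 2>&1; then
    # Opens a shell on a compute node; intentionally unquoted so the string splits into arguments.
    $interactive_cmd
else
    # Dry run: just show the command that would be used.
    echo "Would run: $interactive_cmd"
fi
```

Keeping the command in a variable makes it easy to inspect before anything is actually submitted to the scheduler.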

So, the key takeaway is simple: containers don't replace the HPC workflow, they fit into it. You still use Slurm, you still use compute nodes. The containers just give you a consistent environment to run your code. So, that was the short conceptual side of Singularity and how it fits into HPC workflows. I will now hand over to Waqas, who will walk you through the hands-on portion and show how this works in practice. Over to Waqas.

Waqas: Alright, thank you so much for the great introduction to Singularity, Syed. I'm going to share my screen. Can anyone confirm that they can see my screen? All right. Hi everyone, my name is Waqas, and I am part of the HPC team here. I'll take it from here and focus more on the practical side of things, specifically how you can use Apptainer, or Singularity. Those are two different names for the same thing: the old name was Singularity, and the new name is Apptainer. I'll walk you through how to access it on the cluster, along with some common commands and best practices to keep in mind while working with containers. By the end of this, you should feel comfortable getting started with Apptainer on our systems.

On the login node, if you try to check for Apptainer using which apptainer, nothing shows up, as you can see here. Why? Because we don't allow running containers on login nodes. Those nodes are meant for light tasks like editing, compiling, and job submission, not heavy workloads.

So, to actually use Apptainer and run containers, we first need to get onto a compute node. The way we do that is by requesting an interactive job using srun. I'm going to move on to the next slide as well, so that we can follow along. For example, we can run srun with hyphen p for the partition, which here is the short partition, then the account, which is free, and then /bin/bash. This command allocates a compute node for us, and once we are inside that environment, we have access to Singularity. From there, we can start running our container workloads.

And how do we access Apptainer, or Singularity? By loading the module. So I'm going to load the module. And if I list which modules are loaded, it confirms Apptainer 1.4. So I run which again, which apptainer, to be exact.
Now it shows me that the module has been loaded. And if we check the Singularity version, it shows that the Apptainer version is 1.4.5. So that's how you load the module on the compute nodes.

Another thing I want to mention is that we have to move from our home directory, which is under the Insomnia home path followed by my username, to the scratch space. By default, when you log into the cluster, you land in your home directory, which is what you see here. Your home directory has limited space, and it is not designed for heavy workloads or large files. If you start running container jobs or storing large data sets here, it's going to fill up pretty quickly and can cause issues. So the recommended approach is to do your work in scratch or group space instead. For example, I'm navigating with cd into insomnia, then depts, free, then users, and my username. You have to enter your own username to access your scratch space. Hit enter, and if I check, I'm in my scratch space. I don't have anything here. This is a critical part, by the way: you have to work inside your scratch space, because the images we pull later would fill up your home directory very quickly.

Now, this is how you pull the container image. Actually, I'm missing one slide, but anyway, I'm going to give you the command to pull the image. It is apptainer, then pull, which is the command for the image, then docker, which is the registry we are pulling the image from, and then the name of the image, which is python, colon, 3.11. There are a lot of images on the registry. In my case, I'm just pulling Python 3.11.
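
The pull step described above could be sketched roughly as follows. The check for an existing file and the runtime guard are additions for illustration, not part of the demo; the image reference and the resulting file name follow what the presenter describes.

```shell
# Image reference from the demo; Apptainer's default output name is <name>_<tag>.sif.
image_uri="docker://python:3.11"
sif_file="python_3.11.sif"

if [ -f "$sif_file" ]; then
    # The image only needs to be pulled once.
    echo "Reusing existing $sif_file"
elif command -v apptainer >/dev/null 2>&1; then
    # Run this from scratch space, not the home directory.
    apptainer pull "$sif_file" "$image_uri"
else
    # Typical on a login node or a machine without the module loaded.
    echo "apptainer not found; get onto a compute node and load the module first" >&2
fi
```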
So, we are pulling this from Docker Hub, which is one of the largest public container registries, as Syed mentioned. The reason we use Docker images is that there is a huge ecosystem, and most applications already have pre-built images available, so we don't have to build everything from scratch. I'm going to hit enter.

Alright, let me just deal with the cache. We need to clear our cache; that is what the error is about. While I was preparing for this workshop, I pulled some images, and some of them were still in the cache, so I just cleaned up my cache. This is the command to clear the cache: cache clean. Okay, and then we're going to try again. [inaudible] So this was giving me an error when pulling the Python image.

Alright, so it's going to take some time to pull the image. It's downloading the image from Docker Hub and converting it into a .sif file. Once it's done, you will see a file like python underscore 3.11.sif, with the SIF extension. That's your container. Just a reminder: make sure you run this in your scratch space, not your home directory. And you only need to pull it once. After that, you can reuse it anytime.

You might notice that this step can take a bit of time. That's because, in the background, Apptainer is downloading the entire image layer by layer and then converting it to a .sif file. So, depending on the image size and network speed, it can take a few minutes. The important thing is that this is a one-time process. Once the image is pulled, you can reuse it as many times as you want without downloading it again. So just bear with me.

Alright, so now, once you have your image, there are three main ways to run a container. The first is apptainer shell, which is here.
This drops you into an interactive shell inside the container. It is best for exploring, testing, or debugging. The second is apptainer exec, which is short for execute. This lets you run a single command inside the container, and that's it. The third is apptainer run. This runs the container's default entry point, if one is defined. For today's demo, we'll keep it simple and use the first option, apptainer shell, so we can go inside the container and interact with it directly.

So: singularity shell, and the name of the container. Alright, I'm inside the container. Now I'm launching Python inside the container, and this is isolated from the cluster. How do I do that? By entering python. If I run import sys, that gives me access to the sys module. And if I print sys.version, it prints out the Python version. So you can see we are running our own Python version here, independent of the system.

I'm going to exit out of Python. And if I want to install a package inside this container: pip install numpy. It says the requirement is already satisfied, so I have it installed inside the container. I'm going to exit the container by just typing exit. Everything we just did was without root access and without touching the system. That's the power of containers.

Now, let's say you want to customize a container, maybe install extra packages or make some changes. Since .sif files are read-only, we first convert them into something writable, called a sandbox. Step one is singularity build, hyphen hyphen sandbox, then mybox, where mybox is the name of the directory, and then python underscore 3.11.sif. This unpacks the image into a directory.
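
To make the three ways of running a container concrete, here is a minimal non-interactive sketch of the same Python version check using exec. It assumes the python_3.11.sif file from the pull step is in the current directory; the guard and the ran_in_container variable are illustrative additions.

```shell
sif_file="python_3.11.sif"   # image pulled earlier in the demo

if command -v apptainer >/dev/null 2>&1 && [ -f "$sif_file" ]; then
    # exec: run one command inside the container, then return to the host shell.
    apptainer exec "$sif_file" python -c 'import sys; print(sys.version)'
    # run: invoke the image's default entry point, if one is defined.
    #   apptainer run "$sif_file"
    # shell: open an interactive prompt inside the container.
    #   apptainer shell "$sif_file"
    ran_in_container="yes"
else
    echo "Skipping: need apptainer on PATH and $sif_file in the current directory" >&2
    ran_in_container="no"
fi
```

The exec form is the one that carries over most directly to batch jobs, since it needs no terminal.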
So now, for installing system packages, or any other packages, like vim, you name it: if you want to customize your container, you can do it by unpacking it as a writable image. That's what I did. Now I'm going to run singularity shell, hyphen hyphen writable, mybox. Okay, it's giving me an error. Hold on. mkdir -p, mybox [inaudible]. The error was no such file or directory. What I'm doing is creating a directory path so that I can bind that path into the container. That's what I did, and now I am inside the container itself.

Here, if I want to install any package: pip install, hyphen hyphen user, and then numpy. I can install different packages in my container, get out of the container, and each time I run the container, the packages will be there. Then apt-get update. So I'm inside the container, I'm updating the container, and I can install multiple packages. Then I'm going to exit out. All of this worked without sudo access.

So the idea is simple: unpack, modify, and then rebuild. I'm going to rebuild it using singularity build, then mybox.sif, which converts that customized container into my own version of the container, and then mybox. So first I unpacked, then modified, and now I'm rebuilding it.

While we're waiting on this, to recap: we pull an image from a registry, then we convert it into a sandbox so it becomes writable. Next we modify it, for example by installing Python packages like numpy, and finally we rebuild it back into a SIF file. So the full flow is pull, sandbox, modify, and then rebuild. And once you have that final .sif file, it's completely portable.
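
The unpack, modify, rebuild cycle described above might look roughly like this as a script. It is a sketch under assumptions: the file and directory names follow the demo, the modification is done with exec instead of an interactive shell, and whether writes into the sandbox succeed without extra options such as --fakeroot depends on how Apptainer is configured on a given system.

```shell
# Names follow the demo; adjust as needed.
src_sif="python_3.11.sif"
sandbox_dir="mybox"
out_sif="mybox.sif"

customize_image() {
    # 1. Unpack the read-only SIF into a writable directory tree.
    apptainer build --sandbox "$sandbox_dir" "$src_sif" || return 1
    # 2. Modify: run a command inside the sandbox with write access.
    apptainer exec --writable "$sandbox_dir" pip install numpy || return 1
    # 3. Rebuild the sandbox into a new, portable SIF.
    apptainer build "$out_sif" "$sandbox_dir"
}

if command -v apptainer >/dev/null 2>&1 && [ -f "$src_sif" ]; then
    customize_image
else
    echo "Skipping: need apptainer on PATH and $src_sif in the current directory" >&2
fi
```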
You can copy it, share it, and run it on any system with Apptainer. That's the full life cycle of working with containers. Alright, so I have completely rebuilt my own image of the container.

Building a container from scratch is another option. Think of this file, the definition file, as a recipe. It tells Apptainer what base image to use, what to install, and how the container should run. In this example, we start from Ubuntu. I'm not going to do it hands-on, because it would take a lot of time. But the idea is that we start from Ubuntu, install Python, add packages like numpy or pandas or any other packages, and we also define environment settings and the default run behavior. Once the file is ready, we build it using the same singularity build, then myapp, or whatever name you want to give it, and then the definition file. The definition file is the one I created, and the new container, in this case, is going to be myapp.sif. So you can build the container from scratch, install the packages, and build it into an image. Then you can lift and shift that image and run it anywhere.

So basically, building a container from scratch is easy. You start from a base image from the container registry, and then you beef it up: you add some packages, and then just build the image from that base container image.

Okay, moving on to the next one. This is about integrating containers with the cluster using module files. Instead of users running long apptainer exec commands, we can wrap the containers behind a simple command, like any other module. So users can just use module load myapp.
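
For the definition file described above, a minimal sketch could look like the following. This is not the presenter's file: the Ubuntu tag, the package list, and the environment setting are illustrative assumptions. The snippet only writes the recipe to disk; the build command is shown as a comment because it needs Apptainer and, for the apt-get steps, typically root or --fakeroot.

```shell
# Write an illustrative definition file (contents are assumptions, not the presenter's recipe).
cat > myapp.def <<'EOF'
Bootstrap: docker
From: ubuntu:22.04

%post
    # Commands run inside the container at build time.
    apt-get update
    apt-get install -y python3 python3-pip
    pip3 install numpy pandas

%environment
    # Environment settings applied whenever the container runs.
    export LC_ALL=C

%runscript
    # Default behavior for "apptainer run".
    exec python3 "$@"
EOF

# Build the image from the recipe:
#   apptainer build myapp.sif myapp.def
```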
For example, on the previous screen I created a container using my definition file, and now I want to use that container through a module. The user can just run module load myapp, and then just run myapp. Behind the scenes, the module file is just pointing to the container and running it using singularity exec. This makes it much easier for users, and it feels just like using any other software on the cluster. It also helps with version control and reproducibility, and it makes containers discoverable via module avail. So overall, module files make containers feel native to the system.

Alright, let's take a look at how to run containers in batch jobs using Slurm. The key idea is that singularity exec is just another command inside your job script. You write a normal Slurm script with your resource requests, like time, partition, and GPUs. Then, inside that script, you load Apptainer using module load, and then you give the apptainer command with the name of the container and the other flags. For example, the hyphen hyphen nv flag exposes the GPU to the container. From Slurm's perspective, nothing changes. You're just running a command, but that command happens inside a container. This makes it very easy to integrate containers into your existing workflows.

Alright, let's talk about bind paths, which are very important when working with containers. By default, containers are isolated, so they don't automatically see your files on the host system. A bind path lets you map a directory from the host into the container. For example, with singularity, hyphen hyphen bind, /data, colon, /mnt maps /data from the host to /mnt inside the container. So your code and data stay on the cluster, but the container can still access them.
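
A batch job along the lines described above could be sketched as follows. The account, partition, time limit, GPU request, module name, and train.py are hypothetical placeholders; the --nv and --bind /data:/mnt options are the ones mentioned in the talk. The snippet only writes the job script, which would then be submitted with sbatch.

```shell
# Write an illustrative Slurm job script (resource values are placeholders).
cat > container_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=container-demo
#SBATCH --account=myaccount
#SBATCH --partition=short
#SBATCH --time=00:30:00
#SBATCH --gres=gpu:1

# Make the container runtime available on the compute node.
module load apptainer

# --nv exposes the host NVIDIA GPU; --bind maps host /data to /mnt in the container.
apptainer exec --nv --bind /data:/mnt python_3.11.sif python /mnt/train.py
EOF

# Submit from a login node:
#   sbatch container_job.sh
```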
So this is basically mounting a directory from your system into the container. The syntax is simple: host path, colon, container path. In practice, you will usually bind directories like /scratch or /projects, so your jobs can read input data and write outputs to real storage. This is important because containers themselves are temporary, so you don't want to store important data inside them.

So, just to wrap things up, containers really help in three main areas. First is portability. You build an image once, and you can run it anywhere: your laptop, login node, or cluster. Second is reproducibility. Your environment stays the same, so what works today will still work tomorrow. And the third is ease of deployment. No dependency issues, no admin tickets, just pull the image and run it. So overall, containers make working on HPC systems much simpler and more consistent. That was all about Apptainer containers. I kept it simple, because the more you deep dive, the more you want to learn. So, do you have any questions?

How does this compare with conda environments?

Waqas: A Conda environment is not a container; you cannot lift and shift a Conda environment. Lifting and shifting, that is, portability, is the key thing on the container side. Say, for example, you create a definition file on your laptop and then want to run the container on the cluster. You're basically shifting your definition file to the cluster and running the container from there. Or, if you create a container on your laptop and later realize that your workload is too heavy for your laptop and you want to run it on an HPC cluster, you just lift the container, take it there, and start running the workload on the HPC cluster.

Max: Any other questions?

Waqas: Alright, thank you so very much for joining.
