Introduction to HPC | 2026 HPC Training Series #1

Transcript

Al: Welcome, everyone, to Research Computing Services Workshop Series Part 1: Introduction to High Performance Computing. My name is Al Tucker, and I will be your instructor for today. I am part of Research Computing Services' High Performance Computing Group, and you can ask questions of the HPC group by emailing us at hpc-support@columbia.edu. So, a brief overview of what this session is about. This is a total beginner's session, meant to provide an introduction to what an HPC is and how one is used, both here at Columbia and in the world beyond. No prior knowledge is assumed beyond having used a computer at some point. No account on the cluster is required. There will be no hands-on exercises. This session will also be recorded and placed online for later review at a link that will be announced after the class.

So, how does an HPC differ from consumer desktops? Well, nearly all high-performance computing clusters run on the Linux operating system, and this includes ours, which is built on a version called Red Hat Linux. There are other versions of Linux more suitable to use at home, such as Rocky Linux, Ubuntu, Linux Mint, and more. All of these are open source, meaning they're free to download. The standard way to use a Linux system is through the command line interface, or CLI.

So, the CLI is similar to what you see when using the Terminal application on macOS, or Windows Subsystem for Linux in Windows 10 or higher. Since some skill and comfort with the command line is required to work on the cluster, all of the following courses in this series teach commands and techniques to bring you up to speed on the CLI. Exploring the command line on your own computer can be a stepping stone to more easily mastering the CLI on the HPC.

So a good way to prepare yourself for the CLI is by doing your normal, everyday tasks on your laptop using the Mac Terminal, or WSL2, instead of using the typical mouse-and-menu graphical interface. Ordinary things, such as creating a TextEdit or Notepad file, or making a new folder, can be done on your own laptop from the command line. And here are just two examples in these pictures.

You can Google to find many tutorials to learn your own command line. These are just some that I came up with through Googling on the internet. However, there are many more out there for you to be able to experiment with the CLI on your own machine before you start using the HPC.

So, to define what an HPC is: in high-performance computing, rooms full of hundreds or thousands of computers, each one many times faster than your desktop or laptop, are networked together to solve large, complex problems in unison, much faster than any single machine. Another term you may hear is high throughput computing. It's similar, but the focus there is repeatedly doing large numbers of smaller, independent tasks. This focus is intended to solve a large volume of simpler tasks faster. So this isn't necessarily an either-or situation, and clusters, including Columbia's HPC, can do both. It all depends on what type of problem you choose to solve. There's another term you might hear, supercomputer, and that's actually somewhat misleading. A supercomputer is not one big computer. It's a term that refers to the top HPC clusters in the world, often custom-built at great expense to solve specific problems. At its heart, though, a supercomputer is merely a top-tier, elite HPC.
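As a rough illustration of the high-throughput pattern described above (many small tasks that do not depend on one another), here is a minimal Python sketch. It is only an analogy: the worker threads on one machine stand in for compute nodes, and `small_task` and `run_many` are hypothetical names invented for this example, not anything provided by the cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def small_task(n: int) -> int:
    """Stand-in for one small, independent job (hypothetical workload)."""
    return sum(i * i for i in range(n))


def run_many(inputs, workers=4):
    """Run every task independently; no task needs another task's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(small_task, inputs))


results = run_many([10, 100, 1000])
print(results)
```

Because the tasks share nothing, adding more workers (or, on a cluster, more nodes) lets more of them run at once without changing the task code.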

So another way an HPC differs from your typical laptop or desktop is that this is a shared environment that many people use at once. To handle this sharing, something called a workload manager is used. What we use is called Slurm. So when you have a task or a job you want to run on the cluster, first, you submit it to Slurm. Then it's Slurm that keeps track of how many other people are using the cluster at the same time and how many resources they're using, and Slurm will hand the job off to the appropriate compute node to run it. What this means is that on a cluster, your job might not start immediately. It could even start hours later, depending on how much power you want to use relative to how much power other people are using. And when I say power, I mean things like how much memory you're requesting, how many processors you need to run your job, and all the types of resources that you can define, which will be covered in later sessions in this series. However, when your job does start, it will run hundreds of times faster than it ever could on your laptop or desktop. Each one of the many networked computers is called a compute node. These nodes are commonly reached through one or more front-end login nodes, which control access.
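To see why a job on a shared cluster might wait before it starts, here is a toy first-come, first-served scheduler in Python. This is a sketch under simplifying assumptions and not how Slurm itself is implemented: it tracks only one resource (cores), and the `Job` class and `schedule` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int    # cores the job requests
    runtime: int  # time units the job runs once it has started


def schedule(jobs, total_cores):
    """Toy FIFO scheduler: return {job name: start time}."""
    running = []        # (end_time, cores) for jobs currently running
    free = total_cores
    now = 0
    starts = {}
    for job in jobs:
        if job.cores > total_cores:
            raise ValueError(f"{job.name} requests more cores than exist")
        # Wait for earlier jobs to finish until enough cores are free.
        while free < job.cores:
            running.sort()
            end_time, cores = running.pop(0)
            now = max(now, end_time)
            free += cores
        starts[job.name] = now
        running.append((now + job.runtime, job.cores))
        free -= job.cores
    return starts


starts = schedule(
    [Job("a", 4, 10), Job("b", 4, 5), Job("c", 6, 3)],
    total_cores=8,
)
print(starts)
```

In this run, jobs `a` and `b` fit on the 8 available cores together and start right away, while `c` asks for 6 cores and has to wait until both have finished. Larger requests tend to wait longer, which is the behavior described above.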

So, to use an analogy, you can think of the login nodes as the lobby, controlling access to the theater, which is where the compute nodes are and where the show actually goes on. The compute nodes in each cluster work in parallel with each other, boosting processing speed to deliver high-performance computing. And here's a simplified cluster diagram to more fully illustrate what I'm talking about. Here you are, and you access the cluster remotely. You're never going to be sitting at a keyboard in the same room as the cluster. You access the cluster remotely through the login node. The login node is connected via a high-speed network to all of the compute nodes that are on the cluster. And all of these nodes, through the high-speed network, are connected to massive storage. Since this storage is connected to everything, when you are on the login node, you see the same home directory and files that are on any of the compute nodes in the cluster. They all see the same thing, so it doesn't matter which node you are on; you will always have access to your files to run your jobs.

So, another thing about an HPC is that the typical storage is very, very different from any of the types you commonly see. Storage on an HPC is composed of distinct storage array nodes, and in this picture here, we're showing four of them. Each one of these storage nodes is filled with multiple high-capacity disks, all of them working in tandem and connected over that specialized internal network to provide one vast amount of space. And that space can be partitioned by the cluster operating system as needed. The format of these disks is also very different. On your personal computer, disks have a file system, and this picture represents some of the different types of file systems you may have heard of on things like Windows machines, macOS, or Linux. Each one is designed to hold data on the disk in a certain order, so the operating system can save and retrieve your files. Now, on an HPC, this is a vastly more complex situation. HPCs use what's called a parallel file system.

So, what is a parallel file system? It's a type of file system that allows individual parts of any single file to be written across multiple disks. These disks work in parallel, allowing multiple compute nodes to more easily read and write data at higher speeds. Insomnia, which is our newest cluster, uses a brand of parallel file system called GPFS, which stands for General Parallel File System.
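The idea of writing parts of a single file across multiple disks is often called striping. The Python sketch below shows one simple way it could work, round-robin over fixed-size chunks. It is an illustration only and does not describe GPFS's actual layout, which also handles metadata, redundancy, and much more; `stripe` and `reassemble` are hypothetical helper names, and each "disk" is just a bytes object in memory.

```python
def stripe(data: bytes, num_disks: int, stripe_size: int):
    """Split data into fixed-size chunks and deal them out round-robin."""
    disks = [bytearray() for _ in range(num_disks)]
    for offset in range(0, len(data), stripe_size):
        chunk_index = offset // stripe_size
        disks[chunk_index % num_disks] += data[offset:offset + stripe_size]
    return [bytes(d) for d in disks]


def reassemble(disks, stripe_size: int, total_len: int) -> bytes:
    """Read the chunks back in the same round-robin order."""
    out = bytearray()
    positions = [0] * len(disks)  # read position within each disk
    chunk_index = 0
    while len(out) < total_len:
        d = chunk_index % len(disks)
        out += disks[d][positions[d]:positions[d] + stripe_size]
        positions[d] += stripe_size
        chunk_index += 1
    return bytes(out)


data = b"ABCDEFGHIJ"
disks = stripe(data, num_disks=3, stripe_size=2)
print(disks)
restored = reassemble(disks, stripe_size=2, total_len=len(data))
print(restored)
```

Since consecutive chunks live on different disks, several disks can serve different parts of the same file at the same time, which is where the speed comes from.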

So some of the benefits of a parallel file system: you've got high input-output throughput, meaning multiple processes can read and write data simultaneously. This is something that scales very well, so hundreds or thousands of nodes and petabytes of data can be on the storage, and multiple nodes can read and write this at once. And as you add more storage, you can get even better performance. You also have reduced bottlenecks in accessing that storage, because the load is distributed across multiple disks. There's concurrent access, meaning many applications and many users can access the same disk or the same file simultaneously. This is something that's extremely optimized for HPC, and especially AI, workloads, and it's also very fault-tolerant. Meaning, unlike an individual computer, if a disk dies, or even more than one disk dies, the system can keep on running. Whereas on a personal computer, once your disk dies, if you don't have a backup, you're done. It's also extremely efficient at handling large files, because typically, many data sets have many gigabytes of data in them, especially when we're dealing with things like genomics projects.

So, something else you may have heard of as a file system is NFS, which stands for Network File System. So what's the difference between a parallel file system and an NFS mount? NFS is typically used within small work groups, like within a lab, or you can even use it on your home computer. It's good for handling small tasks, but it's extremely limited. You have serial access, meaning you're sharing a disk from one computer out to another, or to many computers, but they all have to go through that one computer in order to get to it. So this is limited bandwidth. It also makes a single point of failure, because if that disk in your computer that's being shared through NFS dies, none of the machines can access that data. It's easy to set up, and it's useful for small labs and small work groups. But parallel file systems are designed for large workloads, and they're designed to overcome all of these types of bottlenecks. You can have multiple clients, and it scales very well.

There are multiple types of parallel file systems. We use GPFS on Insomnia, our newest cluster. On one of our older clusters, we use something called Lustre, but there are other types out in the world as well: BeeGFS, DAOS, and other types that are a little more obscure, too many to actually go over in a single class. But through all of these types of things, like striping your data across multiple disks and using high-speed storage, parallel file systems are much, much faster and much more suitable for the workloads that an HPC typically has.

So now, how do you access the HPC? Well, you saw that dotted line in the previous slide that I showed you. What that dotted line represents is you using a command line program called Secure Shell, or SSH. And this is an example of an SSH session here. At first, I'm on my local computer. Then I use SSH to access Insomnia, our latest cluster, though we have some others in the HPC. And after accessing Insomnia, things don't change very much.

Now, one way you can tell that you're on an entirely different computer here than you are here is that your prompt line, this line with the dollar sign in it, changes. So that's one indication that you're on an entirely different computer. Also, if you list the files in your directory, you should see something different. Here, you'll see the files on your home computer, but after SSH-ing into the cluster, you should see an entirely different set of files: the things that you've put there to actually do work on the cluster.

So, when you think about it, the cloud is actually an example of an HPC at heart. It's just used for a different purpose. Behind things like Facebook, Instagram, X, or Bluesky are similar massive groups of individual node computers all working together. Where you submit a job on the HPC consisting of some type of scientific calculation, a job on Instagram would be your message to someone, or a post on their wall. Storage in our HPC, which you would use to hold research data, would function in a similar way, only the storage there is holding pictures of your cat, or pictures of some party. Online gaming, too, at heart, is an HPC. It's the same concept. The interface may look different. You log into an HPC using a username and password, but that's the same way you would log into an online game. Instead of research data, the data in the online game is statistics of your character and their place in the online world. Where jobs on an HPC would be running a calculation in MATLAB or Python code, jobs in an online game are similar, but they're just directions to move, kick, shoot, or talk to another character. Most of the large services you see in the world use similar paradigms, and they all have basically the same thing at heart: moving data very quickly to get results for the user.

So, some of the research uses of an HPC. Generally, an HPC serves one of two purposes for research. First, an HPC can process calculations that you could do on your laptop, only the HPC can do them so much faster that getting vastly more work done in short periods of time is possible. Or, as another purpose, because of the large processing power and vast amounts of storage, the HPC enables complex processing of large datasets that you simply would not be able to do any other way.

So, some HPC uses here at Columbia. An example is Professor Alex Urban's group in the Department of Chemical Engineering, which looks at various aspects of batteries and energy storage, and at what new materials can be used for this purpose, or how existing materials can be made more efficient. In one collaboration with a colleague, their group examines hydrolyzing seawater to get oxygen. Seawater being incredibly impure makes this a tough problem to solve. Various additives can be used for the job, but the process is so complex that running physical experiments to exhaustively test every combination is not practical. Their group uses GPUs on the HPC to train and refine models to simulate these experiments thousands of times, to narrow down which combinations have the most promise to work in the real world. It's the kind of work that, if done even on a powerful laptop, would take so much longer that advances in their field would be very few, and over much longer periods of time.

Professor Molly Przeworski's group in the Department of Biological Sciences focuses on the genetics of evolution. Different lab members focus on different aspects. Will Milligan studies the evolution of mutation in primates. His work involves analyzing very large data files and thousands of small cluster jobs running at the same time. Hannah Munbee looks for commonalities among large amounts of raw genetic sequencing data. This is an extremely memory-intensive process called variant calling. The high memory and storage requirements far exceed what a consumer machine can fulfill. And Misha Langley looks at gene expression in water buffalo, for the purpose of examining dubious claims made in previous studies. The extremely large amounts of data require the storage and speed of an HPC to work with.

Professor Yingwei and her Five Sigma group in the Department of Biostatistics do single-cell genomics work on the mental health aspects of various diseases, under grants through the Columbia Data Science Institute. Under a specific grant related to Alzheimer's disease, they try to find links among various types of data gathered under different circumstances: clinical data, medical data, observational data, self-reported data, drug interaction data, and more.

So, since all of this data is gathered under different circumstances, with different end goals in mind, even though it all falls under the umbrella of Alzheimer's, trying to find commonalities to create useful models about the progression of the disease is a massive task that simply is not possible without an HPC. Lab member Zhi Lu takes advantage of the short partition on the cluster, which allows a user to have limited access to every single node on the cluster, to debug code for later use during long runs on the specific nodes that their group owns.

Luca Kamaso is a PI in the Multi-Messenger Science Group in the Department of Astrophysics. His focus is studying plasma in the vicinity of black holes to determine the types of photons, neutrinos, cosmic rays, or other possible particles that may exist. This is the kind of work that can't be physically manipulated in a lab, so the cluster allows them to run processor- and memory-intensive simulations on huge data sets that only an HPC can handle. His resource requirements are so large that even with access to the HPC here at Columbia, larger jobs need to be run on even more powerful clusters at places like NASA or Lawrence Livermore Laboratories.

There are other HPC resources that you can have access to as a member of the Columbia community, and one of these is Empire AI. First announced in April 2024, Empire AI is a New York statewide initiative backed by the governor. Its purpose is to establish next-generation GPU-only computing clusters specifically for AI research done by educational institutions in the state of New York. Only a select number of institutions are part of Empire AI, and Columbia University was one of the first. If you want to find out more about Empire AI, you can visit the CUIT website at this URL here.

There are also other workshops, events, and trainings here at Columbia University. The Foundations of Research Computing has events, which can be found here at this URL. They have other workshops and boot camps, as well as a Python user group that meets in Butler Library. Also, speaking of the library, the library itself has other services that are at this URL here. You can find among them a data club, which talks about Python and R, and they also have a map club, which talks about GIS data. And this brings us to the end of our general overview about HPC. I see some questions were typed into the chat. I could not focus on them and talk at the same time, so if anyone has any questions, now is the point to please feel free to speak up and ask. And… I suppose everything was clear?

Jessica: One of the questions we got is, "As an IT staff member who may get asked this question from interested faculty and students, it seems like the HPC is able to handle the security requirements around handling potentially sensitive and confidential data." Is that true, Al?

Al: The HPC is not, at this time, rated to handle things like sensitive data, personally identifiable data, or HIPAA data, no. That is a project we are working on, and there will be announcements in the future. But at this point, no. Sensitive data is not allowed on the HPC.

Jessica: Not allowed on Columbia's shared HPC, that's true, but we do have sensitive data and confidential data platforms available at Columbia. The shared HPC is not one of them at this time, correct?

Al: Yes, Jessica is right. There are resources at Columbia that can handle it. The Secure Data Enclave can handle sensitive data, but no, the HPC that we run in our group is not rated for that.

Jessica: Okay, let's see… I think that was… oh, there was a question: "Are there resources provided for Slurm knowledge, or is Slurm something that can be learned on the go?"

Al: Well, both. This is only the first course in this series. As the other courses progress, there will be more lessons about Slurm, about the command line, and about Python in the rest of this series. But also, yes, since these are only short training classes, it's not possible to teach every possible thing, and users are encouraged both to use the cluster documentation, which you get links for when you get a cluster account, and also to work on your own. As system administrators, we are the experts on the cluster, but everyone has different types of scientific requirements, and we're not scientists. So oftentimes there may be things that you want to do that we're not familiar with, and you may be more familiar than we are with the research aspects. Hmm, anything else?

Jessica: Feel free to speak up if anybody has questions.

Al: Yes, please. But… Thank you. Thank you, I'll go back and correct that. So, if there aren't any more questions, I would like to thank all of you for taking time out of your day to make my efforts worthwhile. And remember, you will get a link for this presentation that you can view after class if there's anything you want to review. And… oh, you reminded me about the SDE, okay. Yes, thank you. The SDE: you can find information about it on the CUIT website. And other than that, again, thank you all so very much. And have a nice day!
