Chapter 8 Computing
It is entirely possible that you can do everything you want on your personal machine – be that a laptop or a desktop. If that is the case, then this chapter is not for you. If, however, you have spent many hours waiting for your computer to run models so that you can score them, waited on multiple imputations to ameliorate missing data in your datasets, or have data that surpasses your machine’s local memory, then this chapter is for you.
Definitions of what “big data” is are a slippery slope, because the amount of data we generate each day keeps increasing and our ability to store it is, for the most part, getting cheaper. That said, a working definition put forward by Garrett Grolemund is that big data is data greater than one third of your RAM. That isn’t very big for most people, but it is a good launching point, because big data is always context dependent. If you are used to working on your laptop and you try to open a 16 GB Excel file, then that might be too big (especially if you are working on a software platform that loads data into memory). If you are at this point then you need help (or you can just take the productivity hit and let your models run for days).
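The one-third-of-RAM rule of thumb can be sketched as a quick check. The following is a minimal illustration in shell; the function name `fits_in_memory`, the byte counts, and the verdict strings are all hypothetical names chosen for this example, not part of any standard tool:

```shell
#!/bin/sh
# Sketch of the "one third of your RAM" rule of thumb.
# fits_in_memory takes a data size and a RAM size (both in bytes)
# and reports whether the data counts as "big" by that definition.
fits_in_memory() {
  data_bytes=$1
  ram_bytes=$2
  if [ "$data_bytes" -gt $((ram_bytes / 3)) ]; then
    echo "big: consider a cluster or a bigger machine"
  else
    echo "fine: should fit comfortably in memory"
  fi
}

# A 16 GB file on a laptop with 16 GB of RAM is well past the threshold:
verdict=$(fits_in_memory $((16 * 1024 * 1024 * 1024)) $((16 * 1024 * 1024 * 1024)))
echo "$verdict"
```

In practice the threshold matters because your analysis environment needs headroom beyond the raw data: intermediate objects, copies made during transformations, and the software itself all share the same RAM.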
At this point, outside of buying a bigger machine (more CPUs for processing power, more RAM for data manipulation, and more storage for, well, storage), you will have to move to a server or a group of computers. Have no fear, though. This part of technology has been getting much easier since the early days of the internet, and even moving to the cloud has gotten almost to the point of point and click.
So what resources are available to you?
8.1 High Performance Clusters
High performance clusters are a resource typically available at research universities. Cluster computing involves multiple computers connected together and working as one large computer. These resources can drastically expand your computing power by taking your analysis, dividing it over many machines, performing each task in parallel, and then combining the results back together. This process is called parallel computing. Sometimes your job may be small enough that one node with a larger number of threads or cores than your local machine is all that is necessary.
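The divide-and-combine pattern described above can be sketched in plain shell. This is a toy illustration, with squaring a number standing in for a real model fit on its own core or node, not an actual cluster submission:

```shell
#!/bin/sh
# Toy split-apply-combine: each "task" squares a number in its own
# background subshell (the parallel step), then the partial results
# are gathered and summed (the combine step).
partials=$(
  for n in 1 2 3 4; do
    ( echo $((n * n)) ) &    # launch each task in parallel
  done
  wait                       # block until every task finishes
)
total=0
for p in $partials; do
  total=$((total + p))       # combine the partial results
done
echo "combined result: $total"
```

The order in which the partial results arrive may vary between runs, but the combined sum is deterministic – which is exactly the property a parallel analysis needs.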
Working with the cluster will typically require learning how to submit jobs through the cluster’s job manager. An example of a job manager is Slurm. There are others, but in each case you write an instruction script, often in bash, that tells the cluster how to run your analysis. This is an easy task, but it requires that you know how to move your data to the cluster environment, often through a file transfer tool like WinSCP or FileZilla. After you have moved your data, analysis scripts, and instruction script, you will need to SSH into a node (basically log in remotely). There you can execute your instruction script and the cluster will do the rest. It is good practice to include some logging and, if possible, email messaging so that you get updates when your script is initiated and completed. You can then log back in with your file transfer client and move your completed analysis to your local machine.
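As a concrete sketch, a minimal Slurm instruction script might look like the following. The job name, resource amounts, email address, and `analysis.R` are all hypothetical placeholders, and details like module names vary from cluster to cluster, so treat this as a template rather than something to run verbatim:

```shell
#!/bin/bash
#SBATCH --job-name=my_models         # hypothetical job name
#SBATCH --ntasks=1                   # one task...
#SBATCH --cpus-per-task=8            # ...with eight cores on one node
#SBATCH --mem=32G                    # request 32 GB of RAM
#SBATCH --time=12:00:00              # wall-clock limit of 12 hours
#SBATCH --output=models_%j.log       # log file (%j expands to the job id)
#SBATCH --mail-type=BEGIN,END,FAIL   # email updates on start, finish, failure
#SBATCH --mail-user=you@example.edu  # hypothetical address

module load R        # load R via environment modules, if your cluster uses them
Rscript analysis.R   # analysis.R is a placeholder for your analysis script
```

After copying your files up and logging in over SSH, you would submit this script with `sbatch` and check on its progress with `squeue`; the `--mail-type` lines handle the email updates mentioned above.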
Using a cluster harnesses the power of parallel computing and a multi-node environment. While a GUI is possible, clusters are definitely more conducive to command line programming. As such, the cluster is great for running long jobs that have already been debugged.
8.1.1 Compute or Memory
A key decision point for moving to an HPC, or any kind of non-local computing, is whether your process needs computational power (often for calculations) or memory. Often the answer will be both. The programming environments of R and even Python use RAM to store information while computations are running against it. As such, you may run into the limits of your machine’s RAM.