This example requires a Singularity installation on your local machine. It also assumes you have access to the IST-cluster; if not, please contact the cluster admin.
Example
This walkthrough example shows how to run TensorFlow (Python) on our cluster.
Organize your working space
In this step, we will download the dataset on our local machine and set up the working directory.
mkdir playground
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files
Download bench-tf.tar.gz and train.csv.zip into the playground directory.
unzip train.csv.zip
mv train.csv machine_learning_examples/large_files/
mkdir my_env
tar -xvf bench-tf.tar.gz -C my_env
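After these steps, the local layout should look roughly like the sketch below. The directories are stubbed here purely for illustration; in the real walkthrough they come from git clone, unzip, and tar:

```shell
# Expected local layout after the setup steps (stubbed for illustration)
mkdir -p playground/machine_learning_examples/large_files
mkdir -p playground/my_env
touch playground/machine_learning_examples/large_files/train.csv  # placeholder for the real dataset
ls playground
```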
Organize your cluster working space
Secure shell into the cluster front-end node
ssh -i ~/.ssh/vistec-id_rsa youraccount@ist-compute.vistec.ac.th
Create a directory for this example
mkdir playground
mkdir -p playground/output
Tensorflow-GPU (sbatch)
In this example, we will build a digit recognizer from Kaggle using TensorFlow 2.
Prerequisite
- Download the dataset to your local machine.
- Pack the required Anaconda environment (e.g. bench-tf.tar.gz).
Let’s start. On your local machine:
- Create a Singularity definition file.
touch Singularity
Then edit the file.
Bootstrap: docker
From: nvidia/cuda:10.2-base-ubuntu18.04
%post
apt-get -y update
apt-get install -y module-init-tools
%runscript
#!/bin/bash
cd /playground
source my_env/bin/activate
cd machine_learning_examples/ann_class2
python tensorflow2.py
nvidia-smi
- Build the Singularity container in --sandbox mode:
singularity build --sandbox --fakeroot ml-example Singularity
Building normally requires root privileges; the --fakeroot option allows an unprivileged user to build.
- Copy the libraries and environment into Singularity container.
cp -r machine_learning_examples ml-example/playground
cp -r my_env ml-example/playground
We need to edit a file in the library directory.
# Go to library directory
cd ml-example/playground/machine_learning_examples/ann_class2/
# Edit file tensorflow2.py so its import section reads:
...
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
...
- Build a SIF image from the sandbox container.
singularity build --fakeroot ml-example.sif ml-example
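Optionally, you can sanity-check the image on your local machine before copying it to the cluster (this assumes a local NVIDIA GPU; --nv is the standard Singularity flag that binds the host NVIDIA driver into the container):

```shell
# Run the container's %runscript locally; remove --nv if no local GPU is available
singularity run --nv ml-example.sif
```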
- Copy the SIF image into the cluster
scp ml-example.sif youraccount@ist-compute.vistec.ac.th:~/playground/
On the IST-cluster
- Go to the playground directory. It should contain a directory output and the file ml-example.sif. Create a file named job.sub:
touch job.sub
- At the head of the submit file (job.sub), you need to configure the job. More info about Slurm
#SBATCH --error=output/task.out.%j     # STDERR is written to output/task.out.JOBID
#SBATCH --output=output/task.out.%j    # STDOUT is written to output/task.out.JOBID
#SBATCH --job-name=tensorflow          # Job name
#SBATCH --mem=16GB                     # Memory request for this job
#SBATCH --nodes=1                      # The number of nodes
#SBATCH --partition=dgx
#SBATCH --account=student
#SBATCH --time=2:0:0                   # Running time 2 hours
You can change task.out.%j to task.out.%x to use the job name instead of the job ID in the output filename. See the sbatch documentation for more filename patterns.
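For instance, a sketch of the %x pattern (the directives below are illustrative only and not part of the final submit file):

```shell
#SBATCH --job-name=tensorflow
#SBATCH --output=output/task.out.%x   # expands to output/task.out.tensorflow
```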
The remaining part of the submit file activates the environment and executes the Python file. More about Environment Modules
Final submit file
#!/bin/bash -l
#SBATCH --error=output/task.out.%j     # STDERR is written to output/task.out.JOBID
#SBATCH --output=output/task.out.%j    # STDOUT is written to output/task.out.JOBID
#SBATCH --job-name=example1            # Job name
#SBATCH --mem=16GB                     # Memory request for this job
#SBATCH --nodes=1                      # The number of nodes
#SBATCH --partition=dgx
#SBATCH --account=student
#SBATCH --time=2:0:0                   # Running time 2 hours
#SBATCH --gpus=1                       # The number of GPUs

module load Anaconda3
module load CUDA/10.2
module load HDFS

CONTAINER=${HOME}/playground/ml-example.sif
singularity run --nv ${CONTAINER}
- Submit the job
sbatch job.sub
- View jobs in the queue with squeue or myjobs
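To show only your own jobs, squeue's standard -u flag can be used:

```shell
# Filter the queue to the current user's jobs
squeue -u "$USER"
```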
- The job’s output can be located in the ~/playground/output directory
Interactive Job (sinteractive)
An interactive job lets you run a Bash shell on a compute node, which is useful for testing or developing a model.
sinteractive -A student -p bash-cpu -N 1 -c 1 --mem=2G
-A    Slurm account
-p    Slurm partition
-N    number of nodes
-c    CPU cores
--mem memory
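Putting the flags together, a hypothetical GPU interactive session might look like the sketch below. The partition and resource values are illustrative only; check with the cluster admin for the valid partitions on your account:

```shell
# Request an interactive shell with one GPU (example values only)
sinteractive -A student -p dgx -N 1 -c 4 --mem=16G --gpus=1
# Once the shell starts on the compute node, verify GPU visibility:
nvidia-smi
```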