
This example requires a Singularity installation on your local machine and assumes that you have access to the IST cluster. If you do not, please contact the cluster admin.
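As a quick check, you can confirm both prerequisites from your local machine. A minimal sketch, using the key path and hostname that appear later in this walkthrough:

singularity --version                                                       # Singularity must be installed locally
ssh -i ~/.ssh/vistec-id_rsa youraccount@ist-compute.vistec.ac.th hostname   # confirms cluster access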

Example

This walkthrough example shows how to run TensorFlow (Python) on our cluster.

  1. Example
    1. Organize your working space
    2. Organize your cluster working space
    3. Tensorflow-GPU (sbatch)
    4. Interactive Job (sinteractive)

Organize your working space

In this step, we will download the dataset to our local machine and set up the working directory.

mkdir playground
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files

Download bench-tf.tar.gz and train.csv.zip into the playground directory.

unzip train.csv.zip
mv train.csv machine_learning_examples/large_files/
mkdir my_env
tar -xvf bench-tf.tar.gz -C my_env
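
As an optional sanity check, confirm that the files used in the later steps are in place (run from inside playground):

ls machine_learning_examples/large_files/train.csv   # dataset used by the example script
ls my_env/bin/activate                               # environment activation script used by the container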

Organize your cluster working space

Secure shell into the cluster front-end node

ssh -i ~/.ssh/vistec-id_rsa youraccount@ist-compute.vistec.ac.th

Create a directory for this example

mkdir playground
mkdir -p playground/output

Tensorflow-GPU (sbatch)

In this example, we will build a digit recognizer for the Kaggle digit-recognition dataset using TensorFlow 2.

Prerequisites

  • Download the dataset to your local machine.
  • Pack the required Anaconda environment (e.g. bench-tf.tar.gz); a sketch is shown below.
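
If you still need to produce the packed environment, here is a minimal sketch using conda-pack; the environment name bench-tf is an assumption, so substitute the name of your own Anaconda environment:

conda install -c conda-forge conda-pack      # one-time install of the packing tool
conda pack -n bench-tf -o bench-tf.tar.gz    # pack the environment into a relocatable tarball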

Let’s start. On your local machine:

  1. Create a Singularity definition file.
    touch Singularity
    

Then edit the file.

Bootstrap: docker
From: nvidia/cuda:10.2-base-ubuntu18.04

%post
    # build-time setup inside the container
    apt-get -y update
    apt-get install -y module-init-tools

%runscript
    #!/bin/bash
    # commands below run when the container is started with `singularity run`
    cd /playground
    source my_env/bin/activate               # activate the packed Anaconda environment
    cd machine_learning_examples/ann_class2
    python tensorflow2.py                    # run the training script
    nvidia-smi                               # report GPU status
  2. Build the container in --sandbox mode.
    singularity build --sandbox --fakeroot ml-example Singularity
    

    Building an image normally requires root privileges; --fakeroot lets an unprivileged user build, provided fakeroot support is configured on the machine.

  3. Copy the libraries and the environment into the Singularity container.
    cp -r machine_learning_examples ml-example/playground
    cp -r my_env ml-example/playground
    

We need to edit one file in the library directory so that the example script runs through the TensorFlow 1.x compatibility API.

# Go to library directory
cd ml-example/playground/machine_learning_examples/ann_class2/

# Edit file tensorflow2.py
...
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
...
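If you prefer to apply this change non-interactively, here is a sketch using GNU sed. It assumes the script's original import line is exactly `import tensorflow as tf`; check the file first if unsure:

# swap the import for the TF1 compatibility import and disable v2 behavior
sed -i 's/^import tensorflow as tf$/import tensorflow.compat.v1 as tf\ntf.disable_v2_behavior()/' tensorflow2.py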
  4. Build a sif from the container.
    singularity build --fakeroot ml-example.sif ml-example
    
  5. Copy the sif to the cluster (see the optional check below).
    scp ml-example.sif youraccount@ist-compute.vistec.ac.th:~/playground/
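
    Optionally, you can verify the image before copying it. A minimal sketch, run on your local machine; both commands only read the image:

    singularity inspect --runscript ml-example.sif   # print the %runscript that `singularity run` will execute
    singularity exec ml-example.sif ls /playground   # confirm machine_learning_examples and my_env were copied in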
    

On the IST cluster

  1. Go to the playground directory. It should contain a directory output and the file ml-example.sif. Create a file named job.sub
    touch job.sub
    
  2. In the head of the submit file (job.sub), you need to configure the job. More info about Slurm
    #SBATCH --error=output/task.out.%j  # STDERR is written to output/task.out.JOBID
    #SBATCH --output=output/task.out.%j # STDOUT is written to output/task.out.JOBID
    #SBATCH --job-name=tensorflow       # Job name
    #SBATCH --mem=16GB                  # Memory request for this job
    #SBATCH --nodes=1                   # The number of nodes
    #SBATCH --partition=dgx
    #SBATCH --account=student
    #SBATCH --time=2:0:0                # Running time 2 hours
    

    You can change task.out.%j to task.out.%x to use the job name in the output filename. More filename patterns

  3. The remaining part of the submit file activates the environment and executes the Python script. More about Environment Modules

    Final submit file

     #!/bin/bash -l
     #SBATCH --error=output/task.out.%j   # STDERR is written to output/task.out.JOBID
     #SBATCH --output=output/task.out.%j  # STDOUT is written to output/task.out.JOBID
     #SBATCH --job-name=example1          # Job name
     #SBATCH --mem=16GB                   # Memory request for this job
     #SBATCH --nodes=1                    # The number of nodes
     #SBATCH --partition=dgx
     #SBATCH --account=student
     #SBATCH --time=2:0:0                 # Running time 2 hours
     #SBATCH --gpus=1                     # The number of GPUs
    
    
     module load Anaconda3
     module load CUDA/10.2
     module load HDFS
          
          
     CONTAINER=/ist/users/${USER}/playground/ml-example.sif
    
     singularity run --nv ${CONTAINER}
    
  4. Submit the job
    sbatch job.sub
    
  5. View the job in the queue: squeue, myjobs
    myjobs
    
  6. The job’s output can be located in the ~/playground/output directory; see the example below.
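
    For example, a sketch of how you might check it (replace JOBID with the ID printed by sbatch):

    ls ~/playground/output                       # one task.out.JOBID file per submitted job
    tail -f ~/playground/output/task.out.JOBID   # follow the job's combined STDOUT/STDERR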

Interactive Job (sinteractive)

An interactive job lets you run a Bash shell on a compute node for testing or developing a model.

sinteractive -A student -p bash-cpu -N 1 -c 1 --mem=2G

  -A      Slurm account
  -p      Slurm partition
  -N      number of nodes
  -c      number of CPU cores
  --mem   amount of memory
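
Once the interactive shell starts on a compute node, you can work as usual. A short sketch (the container path is an assumption based on the earlier steps):

hostname                                        # confirm you are on a compute node, not the front end
singularity shell ~/playground/ml-example.sif   # open an interactive shell inside the container
exit                                            # leave the allocation when you are done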
