Example
This walkthrough shows an example execution of TensorFlow (Python) on our cluster.
Access to frontend
After getting a response from the admin, users can access our cluster with their SSH key.
ssh -i ~/.ssh/vistec_id_rsa songpon@10.204.100.209
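If you connect often, you can put the key and host into ~/.ssh/config so a plain ssh command works; the host alias vistec-frontend below is only an illustration, not an official name.
# ~/.ssh/config (hypothetical entry)
Host vistec-frontend
    HostName 10.204.100.209
    User songpon
    IdentityFile ~/.ssh/vistec_id_rsa
After that, ssh vistec-frontend is enough to reach the frontend.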
Organize your working space
In this step, we will set up the working directory on the cluster and clone the example repository; the dataset itself is transferred from our local machine in a later step.
mkdir playground
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files
mkdir output
TensorFlow-GPU (sbatch)
In this example, we will build a digit recognizer from Kaggle using TensorFlow 2.
Prerequisites
- Download the dataset to your local machine.
- Pack the required Anaconda environment (e.g. bench-tf.tar.gz); see the packing sketch below.
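A minimal packing sketch, assuming the environment was created with conda and is packed with conda-pack; the environment name bench-tf is an assumption based on the archive name.
# On the machine where the environment lives
conda install -c conda-forge conda-pack        # install the packing tool
conda pack -n bench-tf -o bench-tf.tar.gz      # archive the whole environment into one tarball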
Let’s start
- Directory setup
On the local machine: transfer the dataset and the packed Anaconda environment to the cluster.
rsync --progress -avz -e "ssh -i ~/.ssh/vistec_id_rsa" train.csv.zip songpon@ist-compute.vistec.ac.th:playground
rsync --progress -avz -e "ssh -i ~/.ssh/vistec_id_rsa" bench-tf.tar.gz songpon@ist-compute.vistec.ac.th:playground
In the cluster terminal:
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files
mkdir output
unzip train.csv.zip
mv train.csv machine_learning_examples/large_files/
- Load Anaconda env
# Extract Anaconda ENV
mkdir my_env
tar xvf bench-tf.tar.gz -C my_env
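If the archive was created with conda-pack (as sketched above), the unpacked environment can be activated directly and its hard-coded prefix paths fixed once with conda-unpack; this step is an assumption based on that packing method.
source my_env/bin/activate   # activate the unpacked environment
conda-unpack                 # rewrite prefix paths inside the environment (conda-pack convention)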
- In the header of the submit file (job.sub), you need to configure the job. More info about Slurm
#SBATCH --error=output/task.out.%j    # STDERR is written to output/task.out.JOBID
#SBATCH --output=output/task.out.%j   # STDOUT is written to output/task.out.JOBID
#SBATCH --job-name=tensorflow         # Job name
#SBATCH --mem=16GB                    # Memory request for this job
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --partition=gpu-cluster       # Partition (queue) to submit to
#SBATCH --account=scads               # Account to charge the job to
#SBATCH --time=2:0:0                  # Running time of 2 hours
#SBATCH --gpus=2                      # Number of GPUs
You can change task.out.%j to task.out.%x to use the job name instead of the job ID in the output filename; see the filename pattern section of the sbatch documentation for more patterns.
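For example, a hypothetical combined pattern that names the files after both the job name (%x) and the job ID (%j):
#SBATCH --output=output/%x.%j.out   # e.g. tensorflow.12345.out
#SBATCH --error=output/%x.%j.err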
The remaining part of the submit file activates the environment and executes the Python file. More about Environment Modules
Final submit file
#!/bin/bash
#SBATCH --error=output/task.out.%j    # STDERR is written to output/task.out.JOBID
#SBATCH --output=output/task.out.%j   # STDOUT is written to output/task.out.JOBID
#SBATCH --job-name=example1           # Job name
#SBATCH --mem=16GB                    # Memory request for this job
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --partition=dgx
#SBATCH --account=student
#SBATCH --time=2:0:0                  # Running time of 2 hours
#SBATCH --gpus=2                      # Number of GPUs

module load Anaconda3
module load CUDA/10.1
module load cuDNN/7
module load HDF5

cd ~/playground
source my_env/bin/activate
python machine_learning_examples/ann_class2/tensorflow2.py
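To verify that the job actually sees its GPUs, you could run a quick check before (or instead of) the training script; this one-liner is a sketch using the standard TensorFlow 2 device-listing call, not part of the original submit file.
# Optional sanity check: list the GPUs visible to TensorFlow inside the job
python -c "import tensorflow as tf; print(tf.config.experimental.list_physical_devices('GPU'))"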
- Submit the job
sbatch job.sub
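sbatch prints the assigned job ID (e.g. "Submitted batch job 12345"); once the job is running, you can follow its output file, whose name matches the --output pattern above. The <JOBID> placeholder below is whatever sbatch reported.
tail -f output/task.out.<JOBID>   # follow the job's output as it runs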
- View jobs in the queue:
squeue
myjobs
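For example, with standard Slurm commands you can limit the listing to your own jobs and cancel one if needed; scancel is standard Slurm, and <JOBID> is the ID reported by sbatch.
squeue -u $USER    # show only your own jobs
scancel <JOBID>    # cancel a job by its ID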