Example
This walkthrough shows an example execution of TensorFlow (Python) on our cluster.
Access to frontend
After getting a response from the admin, users can access our cluster with their SSH key.
ssh -i ~/.ssh/vistec_id_rsa songpon@10.204.100.209
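If you connect often, you can put the key and host into ~/.ssh/config so a plain ssh command works; the host alias vistec-frontend below is only an illustration, not an official name.
# ~/.ssh/config (hypothetical entry)
Host vistec-frontend
    HostName 10.204.100.209
    User songpon
    IdentityFile ~/.ssh/vistec_id_rsa
After that, ssh vistec-frontend is enough to reach the frontend.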
Organize your working space
In this step, we will set up the working directory on the cluster and clone the example repository; the dataset itself is transferred from our local machine in a later step.
mkdir playground
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files
mkdir output
TensorFlow-GPU (sbatch)
In this example, we will build a digit recognizer from Kaggle using TensorFlow 2.
Prerequisites
- Download the dataset to your local machine.
- Pack the required Anaconda environment (e.g. bench-tf.tar.gz); see the packing sketch below.
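A minimal packing sketch, assuming the environment was created with conda and is packed with conda-pack; the environment name bench-tf is an assumption based on the archive name.
# On the machine where the environment lives
conda install -c conda-forge conda-pack        # install the packing tool
conda pack -n bench-tf -o bench-tf.tar.gz      # archive the whole environment into one tarball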
Let’s start
- Directory setup
On the local machine: transfer the dataset and the packed Anaconda environment to the cluster.
rsync --progress -avz -e "ssh -i ~/.ssh/vistec_id_rsa" train.csv.zip songpon@ist-compute.vistec.ac.th:playground
rsync --progress -avz -e "ssh -i ~/.ssh/vistec_id_rsa" bench-tf.tar.gz songpon@ist-compute.vistec.ac.th:playground
In the cluster terminal:
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files
mkdir output
unzip train.csv.zip
mv train.csv machine_learning_examples/large_files/
- Load Anaconda env
# Extract Anaconda ENV
mkdir my_env
tar xvf bench-tf.tar.gz -C my_env
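If the archive was created with conda-pack (as sketched above), the unpacked environment can be activated directly and its hard-coded prefix paths fixed once with conda-unpack; this step is an assumption based on that packing method.
source my_env/bin/activate   # activate the unpacked environment
conda-unpack                 # rewrite prefix paths inside the environment (conda-pack convention)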
- In the header of the submit file (job.sub), you need to configure the job. More info about Slurm
#SBATCH --error=output/task.out.%j    # STDERR is written to output/task.out.JOBID
#SBATCH --output=output/task.out.%j   # STDOUT is written to output/task.out.JOBID
#SBATCH --job-name=tensorflow         # Job name
#SBATCH --mem=16GB                    # Memory request for this job
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --partition=gpu-cluster       # Partition (queue) to submit to
#SBATCH --account=scads               # Account to charge the job to
#SBATCH --time=2:0:0                  # Running time of 2 hours
#SBATCH --gpus=2                      # Number of GPUs
You can change task.out.%j to task.out.%x to use the job name instead of the job ID in the output filename; see the filename pattern section of the sbatch documentation for more patterns.
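For example, a hypothetical combined pattern that names the files after both the job name (%x) and the job ID (%j):
#SBATCH --output=output/%x.%j.out   # e.g. tensorflow.12345.out
#SBATCH --error=output/%x.%j.err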
The remaining part of the submit file activates the environment and executes the Python file. More about Environment Modules
Final submit file
#!/bin/bash
#SBATCH --error=output/task.out.%j    # STDERR is written to output/task.out.JOBID
#SBATCH --output=output/task.out.%j   # STDOUT is written to output/task.out.JOBID
#SBATCH --job-name=example1           # Job name
#SBATCH --mem=16GB                    # Memory request for this job
#SBATCH --nodes=1                     # Number of nodes
#SBATCH --partition=dgx
#SBATCH --account=student
#SBATCH --time=2:0:0                  # Running time of 2 hours
#SBATCH --gpus=2                      # Number of GPUs

module load Anaconda3
module load CUDA/10.1
module load cuDNN/7
module load HDF5

cd ~/playground
source my_env/bin/activate
python machine_learning_examples/ann_class2/tensorflow2.py
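To verify that the job actually sees its GPUs, you could run a quick check before (or instead of) the training script; this one-liner is a sketch using the standard TensorFlow 2 device-listing call, not part of the original submit file.
# Optional sanity check: list the GPUs visible to TensorFlow inside the job
python -c "import tensorflow as tf; print(tf.config.experimental.list_physical_devices('GPU'))"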
- Submit the job
sbatch job.sub
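sbatch prints the assigned job ID (e.g. "Submitted batch job 12345"); once the job is running, you can follow its output file, whose name matches the --output pattern above. The <JOBID> placeholder below is whatever sbatch reported.
tail -f output/task.out.<JOBID>   # follow the job's output as it runs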
- View jobs in the queue:
squeue
myjobs
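For example, with standard Slurm commands you can limit the listing to your own jobs and cancel one if needed; scancel is standard Slurm, and <JOBID> is the ID reported by sbatch.
squeue -u $USER    # show only your own jobs
scancel <JOBID>    # cancel a job by its ID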