
Example

This walkthrough shows an example run of TensorFlow (Python) on our cluster.

  1. Example
    1. Access to frontend
    2. Organize your working space
    3. Tensorflow-GPU (sbatch)

Access to frontend

After getting a response from the admin, you can access our cluster with your SSH key.

ssh -i ~/.ssh/vistec_id_rsa songpon@10.204.100.209
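
Optionally, you can add a host alias so you don't have to type the key path every time. A minimal sketch, assuming the alias name vistec (any name works) and the key shown above:

cat >> ~/.ssh/config <<'EOF'
Host vistec
    HostName 10.204.100.209
    User songpon
    IdentityFile ~/.ssh/vistec_id_rsa
EOF

ssh vistec   # the login now shortens to this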

Organize your working space

In this step, we set up the working directory; the dataset itself will be transferred from our local machine in the next section.

mkdir playground
cd playground
git clone https://github.com/lazyprogrammer/machine_learning_examples.git
mkdir -p machine_learning_examples/large_files
mkdir output
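
A quick sanity check (optional): the working directory should now contain the cloned repository and the output folder.

ls
# expected: machine_learning_examples  output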

Tensorflow-GPU (sbatch)

In this example, we will build a digit recognizer for the Kaggle Digit Recognizer dataset using TensorFlow 2.

Prerequisite

  • Download the dataset (train.csv.zip) to your local machine.
  • Pack the required Anaconda environment (e.g. bench-tf.tar.gz); see the sketches below.
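
    One possible way to fetch the dataset, assuming the Kaggle CLI is installed and authenticated on the local machine (the exact output filename may differ from train.csv.zip):

    kaggle competitions download -c digit-recognizer -f train.csv

    And a sketch of packing the environment with conda-pack, assuming your environment is named bench-tf (adjust to your own name):

    pip install conda-pack                      # or: conda install -c conda-forge conda-pack
    conda pack -n bench-tf -o bench-tf.tar.gz   # produces the archive transferred below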

Let’s start

  1. Directory setup
    On the local machine: transfer the dataset and the packed Anaconda environment to the cluster.
    rsync --progress -avz -e "ssh -i ~/.ssh/vistec_id_rsa" train.csv.zip songpon@ist-compute.vistec.ac.th:playground
    rsync --progress -avz -e "ssh -i ~/.ssh/vistec_id_rsa" bench-tf.tar.gz songpon@ist-compute.vistec.ac.th:playground
    

    In the cluster terminal:

    cd playground
    git clone https://github.com/lazyprogrammer/machine_learning_examples.git
    mkdir -p machine_learning_examples/large_files
    mkdir output
    unzip train.csv.zip
    mv train.csv machine_learning_examples/large_files/
    
  2. Load the Anaconda environment
    # Extract the packed Anaconda environment into my_env
    mkdir my_env
    tar xvf bench-tf.tar.gz -C my_env
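    # If the archive was produced with conda-pack (an assumption here), you can
    # additionally run conda-unpack once after activation so the environment's
    # hard-coded prefix paths are rewritten for the new location:
    source my_env/bin/activate
    conda-unpack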
    
  3. In the header of the submit file (job.sub), you need to configure the job. More Info about Slurm
    #SBATCH --error=output/task.out.%j  # STDERR is written to output/task.out.JOBID
    #SBATCH --output=output/task.out.%j # STDOUT is written to output/task.out.JOBID
    #SBATCH --job-name=tensorflow       # Job name
    #SBATCH --mem=16GB                  # Memory request for this job
    #SBATCH --nodes=1                   # The number of nodes
    #SBATCH --partition=gpu-cluster
    #SBATCH --account=scads
    #SBATCH --time=2:0:0                # Running time limit: 2 hours
    #SBATCH --gpus=2                    # Number of GPUs
    

    You can change task.out.%j to task.out.%x to use the job name in the output file name. More filename pattern
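
    For example, a pattern that combines the job name and the job ID (a sketch; adjust to your own file naming):

    #SBATCH --output=output/%x.out.%j   # e.g. output/tensorflow.out.12345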

  4. The remaining part of the submit file activates the environment and executes the Python script. More about Environment Modules

    Final submit file

     #!/bin/bash
     #SBATCH --error=output/task.out.%j  # STDERR is written to output/task.out.JOBID
     #SBATCH --output=output/task.out.%j # STDOUT is written to output/task.out.JOBID
     #SBATCH --job-name=example1         # Job name
     #SBATCH --mem=16GB                  # Memory request for this job
     #SBATCH --nodes=1                   # The number of nodes
     #SBATCH --partition=dgx
     #SBATCH --account=student
     #SBATCH --time=2:0:0                # Running time limit: 2 hours
     #SBATCH --gpus=2                    # Number of GPUs
    
     module load Anaconda3
     module load CUDA/10.1
     module load cuDNN/7
     module load HDF5
    
     cd ~/playground
     source my_env/bin/activate
     python machine_learning_examples/ann_class2/tensorflow2.py
    
  5. Submit the job
    sbatch job.sub
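    # On success sbatch prints the new job ID, e.g. "Submitted batch job 12345"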
    
  6. View your jobs in the queue: squeue, myjobs
    myjobs
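    # Equivalent standard Slurm commands (a sketch; myjobs is a site-specific helper):
    squeue -u $USER        # list only your own jobs
    scancel JOBID          # cancel a job by its ID if it is no longer needed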
    
