Just about two months ago, I became quite enamored with language models. It started with a GPT-3.5 Plus account, and then I rolled back to GPT-2 so I could try my hand at working with language models myself. And I quickly ran into hurdles.
I have quite a powerful home lab server: 32 blazing-fast AMD Threadripper cores running at 4 GHz, 256GB of DDR4 RAM, 27TB of usable ZFS storage, and a single 6GB GTX 1660 GPU. And this is where the problem started.
If you know anything about processing language models with PyTorch, then you already know what I am talking about. Even the simplest model can take hours to process on 24 CPU cores, and the 6GB of RAM on my Nvidia 1660 GPU is only enough to process very small batches. Though the GPU processes the data 10 times faster than even 24 CPU cores, you must work through everything in very small batches, reloading the previously saved model between batches. It is slow and time-consuming, both in human time and in processor time. Then I started researching cloud GPUs. They had the power I needed, but at $6/hr for a VM with four RTX 6000s, the bills can rack up quickly.
I knew I could automate this process with Ansible, but a lot of preparation work must happen before you can run the model. So I came up with this Ansible playbook, which I am now sharing with the world.
I have tested this playbook at least a dozen times, and it has worked in test runs ranging from a few minutes up to a few hours on a small dataset with 200 epochs.
This playbook uses the Hugging Face run_clm.py script to process the data and assumes you have a clean dataset formatted for Hugging Face Transformers. I have a Python script to clean my data, but it is specific to my dataset. If you would like help cleaning your data, get hold of me through social media or the Contact page for this site.
Instead of one of my long and winding posts explaining the deep whys of this project, we will jump straight into this one and walk through everything this playbook does.
Disclaimer
First, a disclaimer. This is only a playbook to train a GPT-2 language model using the Hugging Face Transformers library and the tools that go with it. This playbook will not start up or shut down your VM; it is up to you to get the VM started and shut it down when your job is complete. There are Ansible plugins to automatically create VMs and shut them down, and feel free to add those options to this playbook. I will not, as the liability is too high if someone forgets to shut down their VM after running this.
My personal version uses the Linode API to do this automatically.
The Quick Why
When you train a language model, it takes tons of GPU memory and is quite slow on CPUs. But once the model is trained, it takes far fewer resources to query it. While working on and testing this, I trained language models that took nearly all 26GB of memory on each of the 4 GPUs, 104GB in total, but once I downloaded them back to the local workstation, the CPUs were not too bad at querying the model, and the 6GB GPU has been able to handle it as well. So the point is to use the expensive GPUs to build the model and then test the model on your local workstation.
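As a quick sanity check after everything finishes, you can query the downloaded model right on your workstation. Here is a minimal sketch, assuming the transformers library is installed locally and that the model landed under the example local_output_dir used later in this post (adjust the path to wherever your copy actually lives):

# Query the freshly downloaded model locally.
# The path below is an assumption based on this post's example variables.
python3 -c "
from transformers import pipeline
gen = pipeline('text-generation', model='/path/to/your/workfolder/trained_output_identifier/gpt2_trained_model')
print(gen('Hello there,', max_new_tokens=40)[0]['generated_text'])
"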
And finally, this was designed to work with Ubuntu 22.04. It may work with other Debian-based distributions but will not work on any distros that do not use apt-get. You can probably modify it to work with yum, dnf, or other package managers.
Notes on How This Works
This playbook can work one of two ways. You can do everything as root, in which case it will use the /root folder as the remote_work_dir, but you will have to update the playbook to set /root for everything remote except the areas where it uses /tmp. Otherwise, and this is the route I use, you set up a user on your local workstation and then use a stack script, another Ansible script, or any other method you see fit to create a matching user on the remote machine. In most of my testing, I use gpuuser. Because of how Ansible works, it is better to have an SSH key between the user (gpuuser in this case) and the server, with the password disabled for that user, so that SSH key authentication and passwordless sudo both work. As I mentioned, I use Linode, so here is the bash script that automatically runs on the first bootup of the VM when I set up a Linode for this work.
#!/bin/bash

USERNAME="gpuuser"
PASSWORD="yourpassword"
PUBLIC_KEY="Your public key, the contents of id_rsa.pub on most linux systems"

# Update the system
apt-get update -y && apt-get upgrade -y

# Create a new user
useradd -m -s /bin/bash "${USERNAME}"
echo "${USERNAME}:${PASSWORD}" | chpasswd

# Add the new user to the sudo group.
# After the user is added we remove the password for this user,
# as we will be using ssh keys only.
usermod -aG sudo "${USERNAME}"
passwd -d "${USERNAME}"

# Set up the SSH directory and authorized_keys file for the new user.
# This is just a backup in case the user cannot authenticate for some reason.
# On Linode the root ssh keys are set by selecting the proper set of
# ssh keys during install of the VM, either GUI or API.
mkdir -p "/home/${USERNAME}/.ssh"
echo "${PUBLIC_KEY}" > "/home/${USERNAME}/.ssh/authorized_keys"
chown -R "${USERNAME}:${USERNAME}" "/home/${USERNAME}/.ssh"
chmod 700 "/home/${USERNAME}/.ssh"
chmod 600 "/home/${USERNAME}/.ssh/authorized_keys"
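If you do not already have a key pair for this, generating one and confirming passwordless access only takes a minute. A quick sketch, where gpuuser and the VM address are placeholders for your own values:

# Generate a key pair locally if you do not have one
# (creates ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

# Confirm key-only login and passwordless sudo work before running the playbook
ssh gpuuser@your.vm.ip.here 'sudo -n true && echo "ssh key auth and passwordless sudo OK"'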
To The Playbook
First, we check out the variables that should be set for your unique use case.
vars:
  training_data_local_path: /path/to/your/cleaned/data/happy.txt
  training_data_remote_path: /tmp/preprocessed_data.txt
  output_dir: /tmp/gpt2_trained_model
  remote_work_dir: /path/to/your/workfolder
  log_file: /tmp/training_output.log
  timestamp: "{{ ansible_date_time.iso8601_basic_short }}"
  local_output_dir: "/path/to/your/workfolder/trained_output_identifier"
  num_epochs: 10
  train_batch_size: 10
  eval_batch_size: 10
training_data_local_path: The local path to the cleaned training data file that will be uploaded to the remote virtual machine.
training_data_remote_path: The path on the remote virtual machine where the training data file will be uploaded.
output_dir: The path on the remote virtual machine where the trained model will be saved.
remote_work_dir: The path on the remote virtual machine where all the work will be done.
log_file: The path on the remote virtual machine where the output log file will be saved.
timestamp: The current timestamp of the virtual machine when the playbook is executed.
local_output_dir: The local path where the trained model will be downloaded from the remote virtual machine.
num_epochs: The number of training epochs that will be used to train the model.
train_batch_size: The batch size used during training.
eval_batch_size: The batch size used during evaluation.
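You can also leave the defaults in place and override any of these at runtime with Ansible's extra-vars flag. A quick sketch, assuming your inventory file is inventory.ini and the playbook is saved as train_gpt2.yml (name them whatever you like):

# Override epochs and batch sizes for a single run without editing the playbook
ansible-playbook -i inventory.ini train_gpt2.yml -e "num_epochs=20 train_batch_size=4 eval_batch_size=4"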
And now, the rest of the playbook.
name: Install required packages
This task uses the apt module to install the necessary packages on the remote machine. The packages that are installed include Python 3, Python 3 pip, Git, the NVIDIA 510 driver, and the NVIDIA CUDA toolkit.
- name: Install required packages
  apt:
    name:
      - python3
      - python3-pip
      - git
      - nvidia-driver-510
      - nvidia-cuda-toolkit
  register: apt_result
name: Reboot host after installing NVIDIA driver
If the apt task installs the NVIDIA driver, this task reboots the remote host. This is necessary because a reboot is required for the driver to take effect.
- name: Reboot host after installing NVIDIA driver
  reboot:
    post_reboot_delay: 30
  when: apt_result.changed
name: Wait for host to come back up
This task waits for the remote host to become available again after the reboot.
- name: Wait for host to come back up
  wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    delay: 10
    timeout: 300
name: Install Python libraries
This task uses the pip module to install the Python libraries Torch, Torchvision, Transformers, and Datasets on the remote machine.
- name: Install Python libraries
  pip:
    name:
      - torch
      - torchvision
      - transformers
      - datasets
    state: present
    executable: pip3
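Once this task has run (and the driver reboot has happened), it is worth confirming that PyTorch can actually see the GPU before kicking off hours of training. A minimal check you can run over SSH, where gpuuser and the address are placeholders:

# Should print True on the GPU VM; if it prints False, training will fall back to the CPU path
ssh gpuuser@your.vm.ip.here "python3 -c 'import torch; print(torch.cuda.is_available())'"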
name: Clone Hugging Face Transformers repository
This task uses the git module to clone the Hugging Face Transformers repository to the remote machine.
- name: Clone Hugging Face Transformers repository
  git:
    repo: 'https://github.com/huggingface/transformers.git'
    dest: '{{ remote_work_dir }}/transformers'
    force: yes
name: Install Transformers library from the repository
This task installs the Transformers library from the cloned repository on the remote machine.
- name: Install Transformers library from the repository
  pip:
    name: '{{ remote_work_dir }}/transformers'
    state: present
    executable: pip3
name: Install Transformers library in editable mode
This task installs the Transformers library in editable mode on the remote machine.
- name: Install Transformers library in editable mode
  pip:
    name: '{{ remote_work_dir }}/transformers'
    editable: yes
    state: present
    executable: pip3
name: Install Transformers library dependencies
This task installs the dependencies for the Transformers library's language-modeling example on the remote machine.
- name: Install Transformers library dependencies
  pip:
    requirements: '{{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/requirements.txt'
    state: present
    executable: pip3
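If you want to confirm the repository install took before moving on, a one-liner over SSH does it (again, gpuuser and the address are placeholders):

# Prints the installed transformers version; a dev version string means the repo install worked
ssh gpuuser@your.vm.ip.here "python3 -c 'import transformers; print(transformers.__version__)'"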
name: Upload preprocessed training data to VM
This task uses the synchronize module to copy the preprocessed training data from the local machine to the remote machine. Note that synchronize is a wrapper around rsync, so rsync must be installed on both machines.
- name: Upload preprocessed training data to VM
  synchronize:
    src: "{{ training_data_local_path }}"
    dest: "{{ training_data_remote_path }}"
    mode: push
name: Check if GPU is present
This task checks if the remote machine has a GPU by running the command nvidia-smi -L. It registers the output of the command in the gpu_check variable and ignores any errors that occur.
- name: Check if GPU is present
  shell: nvidia-smi -L
  register: gpu_check
  ignore_errors: true
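The two training tasks below branch on this command's exit code, so you can preview which path the playbook will take by running the same check by hand. The address is a placeholder:

# rc 0 means the GPU task runs; any non-zero rc means the CPU fallback runs instead
ssh gpuuser@your.vm.ip.here "nvidia-smi -L; echo exit code: \$?"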
name: Train model with GPU
This task trains the GPT-2 model on the remote machine using the GPU if it is available. It runs the command python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py with the appropriate options to start the training process. The output of the command is redirected to /tmp/training_output.log. If the GPU is not available, this task is skipped.
- name: Train model with GPU
  shell: >
    python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
    --model_name_or_path gpt2
    --train_file "{{ training_data_remote_path }}"
    --do_train
    --num_train_epochs {{ num_epochs }}
    --per_device_train_batch_size {{ train_batch_size }}
    --per_device_eval_batch_size {{ eval_batch_size }}
    --save_steps 10000
    --save_total_limit 2
    --evaluation_strategy epoch
    --output_dir "{{ output_dir }}"
    > /tmp/training_output.log 2>&1
  args:
    creates: "{{ output_dir }}"
  environment:
    OMP_NUM_THREADS: 1
    PYTHONUNBUFFERED: 1
  when: gpu_check.rc == 0
  async: 0
  poll: 0
  ignore_errors: yes
  register: gpu_train
  delegate_to: "{{ inventory_hostname }}"
  run_once: true
name: Train model without GPU
This task trains the GPT-2 model on the remote machine without the GPU when one is not available. It runs the same python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py command with the appropriate options to start the training process. The output of the command is redirected to /tmp/training_output.log. If the GPU is available, this task is skipped.
- name: Train model without GPU
  shell: >
    python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
    --model_name_or_path gpt2
    --train_file "{{ training_data_remote_path }}"
    --do_train
    --num_train_epochs 1
    --per_device_train_batch_size 1
    --per_device_eval_batch_size 1
    --save_steps 10000
    --save_total_limit 2
    --evaluation_strategy epoch
    --output_dir "{{ output_dir }}"
    > /tmp/training_output.log 2>&1
  args:
    creates: "{{ output_dir }}"
  environment:
    OMP_NUM_THREADS: 24
    PYTHONUNBUFFERED: 1
  when: gpu_check.rc != 0
  async: 0
  poll: 0
  ignore_errors: yes
  register: cpu
  delegate_to: "{{ inventory_hostname }}"
  run_once: true
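Whichever path runs, all of the training output lands in /tmp/training_output.log on the VM, so you can watch progress from your workstation while the job grinds away. A quick sketch, with gpuuser and the address as placeholders:

# Follow the training log live from your workstation
ssh gpuuser@your.vm.ip.here tail -f /tmp/training_output.log

# In another terminal, keep an eye on GPU memory and utilization
ssh -t gpuuser@your.vm.ip.here watch -n 5 nvidia-smi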
name: Download the trained model to the local machine
This task uses the synchronize module to copy the trained model from the remote machine to the local machine.
- name: Download the trained model to the local machine
  synchronize:
    src: "{{ output_dir }}"
    dest: "{{ local_output_dir }}"
    mode: pull
name: Clean up remote files
This task removes the preprocessed training data and the trained model from the remote machine.
- name: Clean up remote files
  ansible.builtin.file:
    path: "{{ item }}"
    state: absent
  loop:
    - "{{ training_data_remote_path }}"
    - "{{ output_dir }}"
Now that we have gone through what each part does, below is the full playbook.
Complete Playbook
- name: Set up and train GPT-2 model on GPU VM
  hosts: gpu_vm
  become: yes
  vars:
    training_data_local_path: /path/to/your/cleaned/data/happy.txt
    training_data_remote_path: /tmp/preprocessed_data.txt
    output_dir: /tmp/gpt2_trained_model
    remote_work_dir: /path/to/your/workfolder
    log_file: /tmp/training_output.log
    timestamp: "{{ ansible_date_time.iso8601_basic_short }}"
    local_output_dir: "/path/to/your/workfolder/trained_output_identifier"
    num_epochs: 10
    train_batch_size: 10
    eval_batch_size: 10

  tasks:
    - name: Update packages
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name:
          - python3
          - python3-pip
          - git
          - nvidia-driver-510
          - nvidia-cuda-toolkit
      register: apt_result

    - name: Reboot host after installing NVIDIA driver
      reboot:
        post_reboot_delay: 30
      when: apt_result.changed

    - name: Wait for host to come back up
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        delay: 10
        timeout: 300

    - name: Install Python libraries
      pip:
        name:
          - torch
          - torchvision
          - transformers
          - datasets
        state: present
        executable: pip3

    - name: Clone Hugging Face Transformers repository
      git:
        repo: 'https://github.com/huggingface/transformers.git'
        dest: '{{ remote_work_dir }}/transformers'
        force: yes

    - name: Install Transformers library from the repository
      pip:
        name: '{{ remote_work_dir }}/transformers'
        state: present
        executable: pip3

    - name: Install Transformers library in editable mode
      pip:
        name: '{{ remote_work_dir }}/transformers'
        editable: yes
        state: present
        executable: pip3

    - name: Install Transformers library dependencies
      pip:
        requirements: '{{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/requirements.txt'
        state: present
        executable: pip3

    - name: Upload preprocessed training data to VM
      synchronize:
        src: "{{ training_data_local_path }}"
        dest: "{{ training_data_remote_path }}"
        mode: push

    - name: Check if GPU is present
      shell: nvidia-smi -L
      register: gpu_check
      ignore_errors: true

    - name: Train model with GPU
      shell: >
        python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
        --model_name_or_path gpt2
        --train_file "{{ training_data_remote_path }}"
        --do_train
        --num_train_epochs {{ num_epochs }}
        --per_device_train_batch_size {{ train_batch_size }}
        --per_device_eval_batch_size {{ eval_batch_size }}
        --save_steps 10000
        --save_total_limit 2
        --evaluation_strategy epoch
        --output_dir "{{ output_dir }}"
        > /tmp/training_output.log 2>&1
      args:
        creates: "{{ output_dir }}"
      environment:
        OMP_NUM_THREADS: 1
        PYTHONUNBUFFERED: 1
      when: gpu_check.rc == 0
      async: 0
      poll: 0
      ignore_errors: yes
      register: gpu_train
      delegate_to: "{{ inventory_hostname }}"
      run_once: true

    - name: Train model without GPU
      shell: >
        python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
        --model_name_or_path gpt2
        --train_file "{{ training_data_remote_path }}"
        --do_train
        --num_train_epochs 1
        --per_device_train_batch_size 1
        --per_device_eval_batch_size 1
        --save_steps 10000
        --save_total_limit 2
        --evaluation_strategy epoch
        --output_dir "{{ output_dir }}"
        > /tmp/training_output.log 2>&1
      args:
        creates: "{{ output_dir }}"
      environment:
        OMP_NUM_THREADS: 24
        PYTHONUNBUFFERED: 1
      when: gpu_check.rc != 0
      async: 0
      poll: 0
      ignore_errors: yes
      register: cpu
      delegate_to: "{{ inventory_hostname }}"
      run_once: true

    - name: Download the trained model to the local machine
      synchronize:
        src: "{{ output_dir }}"
        dest: "{{ local_output_dir }}"
        mode: pull

    - name: Clean up remote files
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop:
        - "{{ training_data_remote_path }}"
        - "{{ output_dir }}"
The inventory file I have been using for this is quite simple since we define the variables in the actual playbook. Make sure to change 127.0.0.1 to the IP address of the VM you built to run this playbook against.
[gpu_vm]
127.0.0.1

[gpu_vm:vars]
ansible_become=yes
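With the inventory saved, kicking everything off is a single command. A sketch, assuming the files are named inventory.ini and train_gpt2.yml and that you created the gpuuser account from earlier:

# Run the playbook against the GPU VM as the gpuuser account
ansible-playbook -i inventory.ini -u gpuuser train_gpt2.yml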
And that is it. This playbook should build the entire system to train the language model in about 5 minutes, and then it's off to the races until everything is done and the trained model is downloaded back to your workstation. Don't forget to shut down the VM, and you should be good to go.
If you have any questions or comments, you can leave them here, reach me through the social media links on this page, or contact me directly through the contact form.
Have a wonderful day.