diff --git a/notebooks/en/fine_tuning_vlm_trl.ipynb b/notebooks/en/fine_tuning_vlm_trl.ipynb index 4981a362..29b5851d 100644 --- a/notebooks/en/fine_tuning_vlm_trl.ipynb +++ b/notebooks/en/fine_tuning_vlm_trl.ipynb @@ -1,4357 +1,4367 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "vKadZFQ2IdJb" - }, - "source": [ - "# Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)\n", - "\n", - "\n", - "\n", - "_Authored by: [Sergio Paniego](https://github.com/sergiopaniego)_\n", - "\n" - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "vKadZFQ2IdJb" + }, + "source": [ + "# Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)\n", + "\n", + "\n", + "\n", + "_Authored by: [Sergio Paniego](https://github.com/sergiopaniego)_\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JATmSI8mcyW2" + }, + "source": [ + "🚨 **WARNING**: This notebook is resource-intensive and requires substantial computational power. If you’re running this in Colab, it will utilize an A100 GPU.\n", + "\n", + "In this recipe, we’ll demonstrate how to fine-tune a [Vision Language Model (VLM)](https://huggingface.co/blog/vlms) using the Hugging Face ecosystem, specifically with the [Transformer Reinforcement Learning library (TRL)](https://huggingface.co/docs/trl/index).\n", + "\n", + "**🌟 Model & Dataset Overview**\n", + "\n", + "We’ll be fine-tuning the [Qwen2-VL-7B](https://qwenlm.github.io/blog/qwen2-vl/) model on the [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) dataset. This dataset includes images of various chart types paired with question-answer pairs—ideal for enhancing the model's visual question-answering capabilities.\n", + "\n", + "**📖 Additional Resources**\n", + "\n", + "If you’re interested in more VLM applications, check out:\n", + "- [Multimodal Retrieval-Augmented Generation (RAG) Recipe](https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_vlms): where I guide you through building a RAG system using Document Retrieval (ColPali) and Vision Language Models (VLMs).\n", + "- [Phil Schmid's tutorial](https://www.philschmid.de/fine-tune-multimodal-llms-with-trl): an excellent deep dive into fine-tuning multimodal LLMs with TRL.\n", + "- [Merve Noyan's **smol-vision** repository](https://github.com/merveenoyan/smol-vision/tree/main): a collection of engaging notebooks on cutting-edge vision and multimodal AI topics.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QoD6dxPeXDKR" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gSHmDKNFoqjC" + }, + "source": [ + "## 1. Install Dependencies\n", + "\n", + "Let’s start by installing the essential libraries we’ll need for fine-tuning! 🚀\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" }, + "id": "GCMhPmFdIGSb", + "outputId": "b08f3edd-03e2-42af-a075-04dafa232c66" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "JATmSI8mcyW2" - }, - "source": [ - "🚨 **WARNING**: This notebook is resource-intensive and requires substantial computational power. If you’re running this in Colab, it will utilize an A100 GPU.\n", - "\n", - "In this recipe, we’ll demonstrate how to fine-tune a [Vision Language Model (VLM)](https://huggingface.co/blog/vlms) using the Hugging Face ecosystem, specifically with the [Transformer Reinforcement Learning library (TRL)](https://huggingface.co/docs/trl/index).\n", - "\n", - "**🌟 Model & Dataset Overview**\n", - "\n", - "We’ll be fine-tuning the [Qwen2-VL-7B](https://qwenlm.github.io/blog/qwen2-vl/) model on the [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) dataset. This dataset includes images of various chart types paired with question-answer pairs—ideal for enhancing the model's visual question-answering capabilities.\n", - "\n", - "**📖 Additional Resources**\n", - "\n", - "If you’re interested in more VLM applications, check out:\n", - "- [Multimodal Retrieval-Augmented Generation (RAG) Recipe](https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_vlms): where I guide you through building a RAG system using Document Retrieval (ColPali) and Vision Language Models (VLMs).\n", - "- [Phil Schmid's tutorial](https://www.philschmid.de/fine-tune-multimodal-llms-with-trl): an excellent deep dive into fine-tuning multimodal LLMs with TRL.\n", - "- [Merve Noyan's **smol-vision** repository](https://github.com/merveenoyan/smol-vision/tree/main): a collection of engaging notebooks on cutting-edge vision and multimodal AI topics.\n" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + " Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n", + " Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n", + " Preparing metadata (pyproject.toml) ... \u001b[?25l\u001b[?25hdone\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m844.5/844.5 kB\u001b[0m \u001b[31m15.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m59.6/59.6 MB\u001b[0m \u001b[31m43.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m324.6/324.6 kB\u001b[0m \u001b[31m30.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25h" + ] + } + ], + "source": [ + "!pip install -U -q git+https://github.com/huggingface/trl.git bitsandbytes peft qwen-vl-utils trackio\n", + "# Tested with trl==0.22.0.dev0, bitsandbytes==0.47.0, peft==0.17.1, qwen-vl-utils==0.0.11, trackio==0.2.8" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V0-2Lso6wkIh" + }, + "source": [ + "Log in to Hugging Face to upload your fine-tuned model! 🗝️\n", + "\n", + "You’ll need to authenticate with your Hugging Face account to save and share your model directly from this notebook.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 17, + "referenced_widgets": [ + "3d7dbcc18c1b4dceae74bdfb75a8da8e", + "b2ec4b8bf53245c88cb74e59612713fb", + "c865a38342ac4b49951d3dad4a774838", + "92b7e089d8804be2a4f7a1e7cd82e89d", + "b49628cfbdfc40b7956baa4cbf93829b", + "8ade50abf0714d2d921fbd74648c5004", + "fcb2f724d1ad4157b907d48fa993efd4", + "14de7b80d5bf46e7b900682b20d61cef", + "10561db48a4b430e98b0e11bd8aa1ef6", + "23511e8b562e4f7e99aa2854a9d98c17", + "76467e7df06e43cdb2a76c9ecc0f8c15", + "582995cfeaf44e118d498cb7ccfd857a", + "8a732dcf3d1e4664b99b6f68df75ae17", + "23d8b425314a447b8239de0b294e5db9", + "30c9d00326c04024a9a171a83352281b", + "4cc0e10451564ae190dbe249b15ca94d", + "ddaddfb293174a79840350d5c5d98a3e", + "a2e17e3dcd684ac89e09de8fa4082e90", + "efbd24054785495f9c521b097d487b83", + "0bf82ddbeeab48eab8f77dd8d99071c1" + ] }, + "id": "xcL4-bwGIoaR", + "outputId": "289d8507-dde1-46c4-e6ad-eb1d072cc868" + }, + "outputs": [ { - "cell_type": "markdown", - "metadata": { - "id": "QoD6dxPeXDKR" + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "3d7dbcc18c1b4dceae74bdfb75a8da8e", + "version_major": 2, + "version_minor": 0 }, - "source": [ - "" + "text/plain": [ + "VBox(children=(HTML(value='