..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

CLIP
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The CLIP model was proposed in `Learning Transferable Visual Models From Natural Language Supervision
<https://arxiv.org/abs/2103.00020>`__ by Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger and Ilya Sutskever. CLIP
(Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be
instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing
for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3.

The abstract from the paper is the following:

*State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This
restricted form of supervision limits their generality and usability since additional labeled data is needed to specify
any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a
much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes
with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400
million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference
learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study
the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks
such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The
model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need
for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot
without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained
model weights at this https URL.*

Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CLIP is a multi-modal vision and language model. It can be used for image-text similarity and for zero-shot image
classification. CLIP uses a ViT-like Transformer to get visual features and a causal language model to get the text
features. Both the text and visual features are then projected to a latent space of identical dimension. The dot
product between the projected image and text features is then used as a similarity score.

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches,
which are then linearly embedded. A [CLS] token is added to serve as a representation of the entire image. The authors
also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
The :class:`~transformers.CLIPFeatureExtractor` can be used to resize (or rescale) and normalize images for the model.

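As a minimal sketch, the feature extractor can also be used on its own (this reuses the checkpoint and example image
from the snippet further below):

.. code-block:: python

    >>> from PIL import Image
    >>> import requests

    >>> from transformers import CLIPFeatureExtractor

    >>> feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> # resize to the model's input resolution and normalize with the CLIP image mean and std
    >>> pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values  # shape (1, 3, 224, 224) for this checkpoint
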
The :class:`~transformers.CLIPTokenizer` is used to encode the text. The :class:`~transformers.CLIPProcessor` wraps
:class:`~transformers.CLIPFeatureExtractor` and :class:`~transformers.CLIPTokenizer` into a single instance to both
encode the text and prepare the images. The following example shows how to get the image-text similarity scores using
:class:`~transformers.CLIPProcessor` and :class:`~transformers.CLIPModel`.

.. code-block:: python

    >>> import torch
    >>> from PIL import Image
    >>> import requests

    >>> from transformers import CLIPProcessor, CLIPModel

    >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

    >>> outputs = model(**inputs)
    >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
    >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

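The projected embeddings can also be computed separately with :meth:`~transformers.CLIPModel.get_text_features` and
:meth:`~transformers.CLIPModel.get_image_features`. The following is a minimal sketch that reuses the checkpoint,
processor and image from the example above; note that it omits the model's learned logit scale, so the resulting
values are plain cosine similarities rather than the ``logits_per_image`` returned by the model:

.. code-block:: python

    >>> from PIL import Image
    >>> import requests

    >>> from transformers import CLIPProcessor, CLIPModel

    >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    >>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)

    >>> text_inputs = processor(text=["a photo of a cat", "a photo of a dog"], return_tensors="pt", padding=True)
    >>> image_inputs = processor(images=image, return_tensors="pt")

    >>> text_features = model.get_text_features(**text_inputs)     # shape (2, projection_dim)
    >>> image_features = model.get_image_features(**image_inputs)  # shape (1, projection_dim)

    >>> # normalize and take the dot product to get cosine similarities between the image and each text
    >>> text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    >>> image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    >>> similarity = image_features @ text_features.T  # shape (1, 2)
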
This model was contributed by `valhalla <https://huggingface.co/valhalla>`__. The original code can be found `here
<https://github.com/openai/CLIP>`__.

CLIPConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPConfig
    :members: from_text_vision_configs


CLIPTextConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPTextConfig
    :members:


CLIPVisionConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPVisionConfig
    :members:


CLIPTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary


CLIPTokenizerFast
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPTokenizerFast
    :members:


CLIPFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPFeatureExtractor
    :members:


CLIPProcessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPProcessor
    :members:


CLIPModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPModel
    :members: forward, get_text_features, get_image_features


CLIPTextModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPTextModel
    :members: forward


CLIPVisionModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.CLIPVisionModel
    :members: forward