Closed
Description
System Info
- `transformers` version: 4.35.2
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- Tensorflow version (GPU?): 2.14.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.7.5 (gpu)
- Jax version: 0.4.20
- JaxLib version: 0.4.20
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help?
@ArthurZucker
@younesbelkada
@amyeroberts
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run the sample code of OWLv2: https://huggingface.co/docs/transformers/model_doc/owlv2#transformers.Owlv2ForObjectDetection.image_guided_detection.example
Expected behavior
- The bounding boxes that should surround the two cats in the sample code appear in the wrong positions.
- In the sample code (https://huggingface.co/docs/transformers/model_doc/owlv2#transformers.Owlv2ForObjectDetection.image_guided_detection.example), 13 boxes appear in locations unrelated to the cats. I ran the exact same code in Google Colab and reproduced the same inference results.
- The previous model (OWL-ViT) detected the cats without any issues.
I am unsure whether this behavior is expected, so I would appreciate some advice. If you know of any way to improve the inference results, please let me know.
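One thing worth checking (a guess on my side, not something the docs state): the OWLv2 processor appears to pad the target image to a square before resizing, so the returned boxes may be expressed in the padded image's coordinate frame rather than the original image's. A minimal sketch for testing that hypothesis (`pad_to_square` and `draw_boxes` are hypothetical helpers written for this check, not `transformers` APIs):

```python
from PIL import Image, ImageDraw


def pad_to_square(image: Image.Image, fill=(114, 114, 114)) -> Image.Image:
    """Pad an image (top-left aligned) to a square canvas, roughly mimicking
    a pad-then-resize preprocessing step. For visual inspection only."""
    side = max(image.size)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(image, (0, 0))
    return canvas


def draw_boxes(image: Image.Image, boxes) -> Image.Image:
    """Draw [x0, y0, x1, y1] boxes on a copy of the image."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for x0, y0, x1, y1 in boxes:
        draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=3)
    return out
```

If the predicted boxes line up on `pad_to_square(image)` but not on `image` itself, the problem would be the coordinate frame used for visualization rather than the model's detections.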