Running Phi-3-Vision Inference Locally

Phi-3-vision-128k-instruct extends Phi-3 beyond language understanding to visual input. With it, we can tackle visual tasks such as OCR, table analysis, object recognition, and image description, and complete tasks that previously required a large amount of training data. The sections below walk through related techniques and usage scenarios for Phi-3-vision-128k-instruct.

0. Preparation

Before use, make sure the following Python libraries are installed (Python 3.10+ is recommended):

pip install transformers -U
pip install datasets -U
pip install torch -U

It is recommended to use CUDA 11.6+ and to install flash-attn (FlashAttention):

pip install flash-attn --no-build-isolation
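
Before loading the model, it can help to confirm that the packages above are actually present. The following is a minimal sketch that uses only the Python standard library; the function names installed_versions and python_is_recent are illustrative and are not part of any of the listed libraries.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def python_is_recent(min_version=(3, 10)):
    """Check the running interpreter against the recommended minimum version."""
    return sys.version_info >= min_version


# Report the packages named in this section.
required = ["transformers", "datasets", "torch", "flash-attn"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
print("Python 3.10+:", python_is_recent())
```

A value of "not installed" points at the pip command to rerun.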

Create a new Notebook. To follow the examples, first add the setup code below.

from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# Load the weights in bfloat16 to reduce GPU memory use.
kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, **kwargs).cuda()

user_prompt = '<|user|>\n'
assistant_prompt = '<|assistant|>\n'
prompt_suffix = "<|end|>\n"
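
The three marker strings above wrap every prompt in the examples that follow. As a sketch, the hypothetical helper build_prompt below assembles them into one prompt string, assuming image placeholders are numbered from 1 and placed before the question, as the examples in this document do.

```python
USER_PROMPT = "<|user|>\n"
ASSISTANT_PROMPT = "<|assistant|>\n"
PROMPT_SUFFIX = "<|end|>\n"


def build_prompt(question, num_images=1):
    """Build a single-turn prompt with one <|image_N|> placeholder per image.

    build_prompt is an illustrative helper, not part of transformers.
    """
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"{USER_PROMPT}{placeholders}{question}{PROMPT_SUFFIX}{ASSISTANT_PROMPT}"


print(build_prompt("Could you please introduce this stock to me?"))
```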

1. Image analysis with Phi-3-Vision

We want the model to analyze the content of an image and return a relevant description.

prompt = f"{user_prompt}<|image_1|>\nCould you please introduce this stock to me?{prompt_suffix}{assistant_prompt}"


url = "https://g.foolcdn.com/editorial/images/767633/nvidiadatacenterrevenuefy2017tofy2024.png"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, 
                                  skip_special_tokens=True, 
                                  clean_up_tokenization_spaces=False)[0]

Running the code above in the Notebook and printing response gives an answer such as:

Certainly! Nvidia Corporation is a global leader in advanced computing and artificial intelligence (AI). The company designs and develops graphics processing units (GPUs), which are specialized hardware accelerators used to process and render images and video. Nvidia's GPUs are widely used in professional visualization, data centers, and gaming. The company also provides software and services to enhance the capabilities of its GPUs. Nvidia's innovative technologies have applications in various industries, including automotive, healthcare, and entertainment. The company's stock is publicly traded and can be found on major stock exchanges.
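
The four steps in the code above (preprocess, generate, drop the prompt tokens, decode) can be wrapped in one function. This is a sketch under the assumption that model and processor are the objects created in the preparation step; generate_response is a hypothetical name, and the device default simply mirrors the "cuda:0" used above.

```python
def generate_response(model, processor, prompt, images,
                      max_new_tokens=1000, device="cuda:0"):
    """Run one prompt (plus image or list of images) and return the decoded reply.

    Illustrative wrapper around the calls shown above; not a library function.
    """
    # Tokenize the prompt and preprocess the image(s), then move to the device.
    inputs = processor(prompt, images, return_tensors="pt").to(device)

    generate_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

    # generate() returns prompt tokens followed by new tokens; keep only the new ones.
    prompt_length = inputs["input_ids"].shape[1]
    generate_ids = generate_ids[:, prompt_length:]

    return processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )[0]
```

With this in place, a call such as generate_response(model, processor, prompt, image) would stand in for the repeated block.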

2. OCR with Phi-3-Vision

Besides describing an image, we can also extract information from it. This is the OCR process, which previously required writing complex code.

prompt = f"{user_prompt}<|image_1|>\nHelp me get the title and author information of this book?{prompt_suffix}{assistant_prompt}"

url = "https://marketplace.canva.com/EAFPHUaBrFc/1/0/1003w/canva-black-and-white-modern-alone-story-book-cover-QHBKwQnsgzs.jpg"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]

The result is:

The title of the book is "ALONE" and the author is Morgan Maxwell.
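
If the extracted fields are needed as data rather than as a sentence, the response can be post-processed. The sketch below assumes the model answers in exactly the sentence pattern shown above, which is not guaranteed; parse_book_info is a hypothetical helper, and an author name containing a period would need a looser pattern.

```python
import re

# Matches: The title of the book is "<title>" and the author is <author>.
_BOOK_PATTERN = re.compile(
    r'title of the book is "(?P<title>[^"]+)" and the author is (?P<author>[^.]+)\.'
)


def parse_book_info(response):
    """Return {'title': ..., 'author': ...}, or None if the pattern is absent."""
    match = _BOOK_PATTERN.search(response)
    if match is None:
        return None
    return {"title": match.group("title"), "author": match.group("author").strip()}


sample = 'The title of the book is "ALONE" and the author is Morgan Maxwell.'
print(parse_book_info(sample))
```

Returning None on a mismatch lets the caller fall back to the raw text when the model phrases its answer differently.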

3. Comparing multiple images

Phi-3 Vision accepts several images in one prompt, so we can use the model to find the differences between them.

prompt = f"{user_prompt}<|image_1|>\n<|image_2|>\n What is difference in this two images?{prompt_suffix}{assistant_prompt}"

print(f">>> Prompt\n{prompt}")

url = "https://hinhnen.ibongda.net/upload/wallpaper/doi-bong/2012/11/22/arsenal-wallpaper-free.jpg"

image_1 = Image.open(requests.get(url, stream=True).raw)

url = "https://assets-webp.khelnow.com/d7293de2fa93b29528da214253f1d8d0/news/uploads/2021/07/Arsenal-1024x576.jpg.webp"

image_2 = Image.open(requests.get(url, stream=True).raw)

images = [image_1, image_2]

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

The result is:

The first image shows a group of soccer players from the Arsenal Football Club posing for a team photo with their trophies, while the second image shows a group of soccer players from the Arsenal Football Club celebrating a victory with a large crowd of fans in the background. The difference between the two images is the context in which the photos were taken, with the first image focusing on the team and their trophies, and the second image capturing a moment of celebration and victory.
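
In the multi-image example, the prompt contains one numbered placeholder per image in the list. A quick check before calling the processor can catch a mismatch early. This is a sketch that assumes placeholders should be numbered consecutively from 1, as in the example above; placeholders_match_images is a hypothetical name.

```python
import re

_PLACEHOLDER = re.compile(r"<\|image_(\d+)\|>")


def placeholders_match_images(prompt, num_images):
    """True if the prompt uses exactly <|image_1|> .. <|image_N|> for N images."""
    ids = sorted({int(n) for n in _PLACEHOLDER.findall(prompt)})
    return ids == list(range(1, num_images + 1))


prompt = "<|user|>\n<|image_1|>\n<|image_2|>\n What is difference in this two images?<|end|>\n<|assistant|>\n"
print(placeholders_match_images(prompt, 2))
```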

Disclaimer:
This document was translated using the AI translation service Co-op Translator. While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
