How to make gemini-cli understand https://xkcd.com/ humor #2604
Closed
Manamama
started this conversation in
Show and tell
Replies: 0 comments
Maybe it is banal, but at first Gemini CLI did not realize that it can "see images on web pages":
How to Analyze a Web Page with Both Text and Images
This is a note to myself (Gemini CLI) on the correct procedure for analyzing a web page that includes images, based on a successful session. My default `web_fetch` tool is insufficient, as it only provides text content and does not see the underlying HTML structure where image links reside.
The Correct Multi-Step Workflow:
1. Get the Raw HTML: Use `run_shell_command` with `curl` or `wget` to download the complete, raw HTML source code of the target URL into a local file (e.g., `page.html`). This is the most critical step.
2. Read the Raw HTML: Use `read_file` to read the contents of the downloaded `.html` file. This tool provides the raw, "ugly" code, including all tags.
3. Find Image URLs and Alt-Text: Within the raw HTML content from `read_file`, I must parse the text to find `<img>` tags. From these tags, I can extract the image URLs from the `src` attribute and important context from the `alt` and `title` attributes.
4. Download the Image(s): Use `run_shell_command` with `wget` or `curl` again to download the specific image files from the URLs discovered in the previous step. Save them locally (e.g., `image1.png`).
5. "See" the Image: Use `read_file` on the locally downloaded image file. This action triggers my VLM (Visual Language Model) capabilities, allowing me to analyze the image's content.
6. Synthesize: Combine the information from the text on the page, the alt-text, and my own visual analysis of the image to provide a complete, holistic answer. This allows me to understand context, humor, and the relationship between the text and the visuals.
(Gemini wrote it as an FYI for itself, with:
)
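For anyone who wants to try the "Find Image URLs and Alt-Text" step outside Gemini CLI, here is a minimal Python sketch using only the standard library. The names `ImageTagCollector` and `extract_images` are hypothetical helpers made up for this example, not Gemini CLI tools, and the sample HTML is invented rather than copied from a real page. It assumes the raw HTML has already been saved and read in, as in the first two steps of the note.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageTagCollector(HTMLParser):
    """Hypothetical helper: collect src, alt, and title from every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return  # an <img> without src gives us nothing to download
        self.images.append({
            # Resolve relative and protocol-relative URLs against the page URL.
            "url": urljoin(self.base_url, src),
            "alt": attr_map.get("alt") or "",
            "title": attr_map.get("title") or "",
        })


def extract_images(html_text, base_url):
    """Return a list of dicts with 'url', 'alt', and 'title' for each image."""
    collector = ImageTagCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.images


# Invented sample markup for illustration only.
SAMPLE_HTML = """
<div id="comic">
  <img src="//imgs.example.com/comics/sample.png"
       alt="Sample" title="Hover text with the extra joke">
</div>
<img src="/static/logo.png">
"""

if __name__ == "__main__":
    for image in extract_images(SAMPLE_HTML, "https://example.com/123/"):
        print(image["url"], "|", image["alt"], "|", image["title"])
```

The resulting URLs are what the "Download the Image(s)" step would pass to `curl` or `wget`. Keeping `title` alongside `alt` matters for xkcd in particular, because the hover text lives in the `title` attribute.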
FYI, the web cousin of Gemini CLI, https://gemini.google.com/app/, cannot do this via its `browse` or other function calls. It replies 'The `image_retrieval` tool returned this URL: ...' and then hallucinates what might be there instead.
A random test of whether it works in Gemini CLI when it is instructed as above:
So, given enough time and tokens, Gemini CLI could write a book about "what is so special about humor in https://xkcd.com/?", animate them all into bespoke new comic strips, make a movie, graph the themes, and much more...
I have asked Gemini itself for its ideas about what it can do, and they are also interesting:
Ver. 1.1