Problems running CDio 0.45.7 headless on Alpine 3.19_alpha20230901 (edge) VM #1970

ChristopherW · 2023-11-11T23:46:20Z

ChristopherW
Nov 11, 2023

I'm running version 0.45.7 headless on Alpine 3.19 alpha; modifications to content_fetcher.py were seemingly required to get everything running.

I first had to update Python 3.7 to 3.11, as various things weren't able to compile (pillow and cffi wheel errors and so on); these were resolved by unpicking older dependencies on other software. Once everything compiled, I did an apk add chromium chromium-webdriver, which installed Chrome and driver 118.0.5993.117. Selenium is at 4.15.2.

On starting CDio, I found that any attempt to use the WebDriver Chrome/Javascript method failed immediately at the Chrome stage. I saw various errors and possible red herrings, including "session not created: DevToolsActivePort file doesn't exist" and other immediate bailouts.

CDio was also not capable of detecting the $WEBDRIVER_URL environment variable, so I resorted to adding export WEBDRIVER_URL="http://localhost:3456" in /etc/profile, then issued source /etc/profile before restarting the service, which appeared to work. (I chose a non-default port for testing.)
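A minimal sketch of how a Python process sees WEBDRIVER_URL, assuming nothing about CDio's internals: the variable is only visible if it is present in the environment the process was started from. The function name `resolve_webdriver_url` and the fallback URL are hypothetical and used only for illustration.

```python
import os

# Hypothetical fallback for illustration; not necessarily CDio's real default.
FALLBACK_WEBDRIVER_URL = "http://localhost:4444/wd/hub"


def resolve_webdriver_url(env=None):
    """Return the WebDriver endpoint from the environment, or the fallback."""
    env = os.environ if env is None else env
    # Treat an empty or whitespace-only value the same as an unset variable.
    value = env.get("WEBDRIVER_URL", "").strip()
    return value or FALLBACK_WEBDRIVER_URL


# Shows what the current process would use.
print(resolve_webdriver_url())
```

Running this from the same shell or init script that launches the service is a quick way to confirm whether the export actually reached the process.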

I then found that Chrome was not being opened with the correct parameters, causing it to crash immediately. Example from the chromedriver log:

[1699648988.978][INFO]: Launching chrome: /usr/lib/chromium/chrome --allow-pre-commit-input --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --enable-automation --enable-logging --log-level=0 --no-first-run --no-service-autorun --password-store=basic --remote-debugging-port=0 --test-type=webdriver --use-mock-keychain --user-data-dir=/tmp/.org.chromium.Chromium.fbINeH data:,
[12224:12244:1110/204311.333601:ERROR:bus.cc(407)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[12224:12224:1110/204311.677624:ERROR:ozone_platform_x11.cc(239)] Missing X server or $DISPLAY
[12224:12224:1110/204311.677705:ERROR:env.cc(255)] The platform failed to initialize.  Exiting.
[1699648992.057][INFO]: [bbf9c9d9865a0ceb8f89ff6a4855ce89] RESPONSE InitSession ERROR session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/lib/chromium/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
[1699648992.057][DEBUG]: Log type 'driver' lost 0 entries on destruction
[1699648992.057][DEBUG]: Log type 'browser' lost 0 entries on destruction
[1699649092.171][INFO]: [0a1f368cd79d28ee0a11b81ab436b9f9] COMMAND InitSession {
   "capabilities": {
      "alwaysMatch": {
         "browserName": "chrome",
         "goog:chromeOptions": {
            "args": [  ],
            "extensions": [  ]
         },
         "pageLoadStrategy": "normal"
      },
      "firstMatch": [ {
      } ]
   }
}
[1699649092.210][INFO]: Populating Preferences file: {
   "alternate_error_pages": {
      "enabled": false
   },
   "autofill": {
      "enabled": false
   },
   "browser": {
      "check_default_browser": false
   },
   "distribution": {
      "import_bookmarks": false,
      "import_history": false,
      "import_search_engine": false,
      "make_chrome_default_for_user": false,
      "skip_first_run_ui": true
   },
   "dns_prefetching": {
      "enabled": false
   },
   "profile": {
      "content_settings": {
         "pattern_pairs": {
            "https://*,*": {
               "media-stream": {
                  "audio": "Default",
                  "video": "Default"
               }
            }
         }
      },
      "default_content_setting_values": {
         "geolocation": 1
      },
      "default_content_settings": {
         "geolocation": 1,
         "mouselock": 1,
         "notifications": 1,
         "popups": 1,
         "ppapi-broker": 1
      },
      "password_manager_enabled": false
   },
   "safebrowsing": {
      "enabled": false
   },
   "search": {
      "suggest_enabled": false
   },
   "translate": {
      "enabled": false
   }
}
[1699649092.210][INFO]: Populating Local State file: {
   "background_mode": {
      "enabled": false
   },
   "ssl": {
      "rev_checking": {
         "enabled": false
      }
   }
}

This went on for a while as I gradually picked my way through errors. This server is a minimal headless Alpine VM, so has none of the usual GUI things. In an effort to resolve this issue, I resorted to installing the following:

dbus, dbus-dev and dbus-x11
xfce4 and xfce4-terminal
xvfb and xvfb-run
jpeg-dev and zlib1g-dev (for Chrome)

I did also try installing firefox and the gecko-driver, but the CDio code seems to expect Chromium and its webdriver, so I didn't pursue that further.

These packages seemed to improve things, but I still found Chrome wasn't running properly. I then looked at how I could get Selenium to launch Chrome 'properly', as various things have changed or been permanently deprecated (disabled), as I found out while hacking about in the code.

I resorted to focusing on content_fetcher.py as it appeared to be doing the legwork of calling the browser. After line 613 (options = ChromeOptions()), I added

        options.add_argument("--no-sandbox")
        # Plain string: the f-prefix was unnecessary because nothing is interpolated.
        options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")
        options.add_argument("--headless=new")

Specifying ChromeOptions.add_argument("headless") makes no difference as it appears to be for either a capabilities-based launch or for the local webdriver (CDio is using the remote webdriver with a few differences, which caught me out for a while due to unfamiliarity). I tried the --remote-debugging-pipe method as I'd noted some references to fixed bidi control, but it didn't solve my scenario. The --disable-gpu switch is an older switch which some say is Windows-specific and was required on older versions of Selenium, but has been deprecated as it's no longer needed. I left it because it wasn't breaking anything, and I might run this on Windows in future.

The recommended headless switch has recently changed from --headless to --headless=new (a change in Chrome itself that surfaces through Selenium). This is discussed on Stack Overflow in the comments on an answer (including comments by someone on the Selenium steering group who offered some useful insight), but it's been poorly communicated, as older answers to similar browser problems abound online.

Because my specific site had a mandatory cookie banner, while testing I added the following xpath clicker after line 622 (its syntax has also very recently changed, thanks Selenium). I've included my comments to myself as a reference:

            # Click the cookie banner's accept button, using the string locator form.
            self.driver.find_element("xpath", '//*[@id="onetrust-accept-btn-handler"]').click()
            # Previous form, which did not work for me. By.XPATH is the same "xpath" string,
            # so this probably only needs `from selenium.webdriver.common.by import By`:
            # self.driver.find_element(By.XPATH, '//*[@id="onetrust-accept-btn-handler"]').click()
            # Example in the same By.* style from the docs. The helpers that were actually
            # removed in Selenium 4 are the find_element_by_* ones, still widely documented online:
            # driver.find_element(By.LINK_TEXT, "See an example alert").click()
            # https://www.selenium.dev/documentation/webdriver/interactions/alerts/
            # https://selenium-python.readthedocs.io/locating-elements.html

I realised that I could specify this xpath filter in the Filters & Triggers section, but wasn't sure whether it was also possible to append a .click() command. Due to the extreme resource constraints of my VM, running any test takes upwards of a minute, so I was running out of patience debugging my issues. Could this feature perhaps be added in an "actionable" section, to dismiss banners prior to screengrabs, unless it's already possible some other way?

For the sake of completeness, I also added the following after line 651:

        options.add_argument("--no-sandbox")
        options.add_argument("--headless=new")
        options.add_argument("--disable-dev-shm-usage")
        options.add_argument("--disable-gpu")

I modified

        self.driver = webdriver.Remote(
            command_executor=self.command_executor,
            options=ChromeOptions())

to become

        self.driver = webdriver.Remote(
            command_executor=self.command_executor,
            options=options)
#            options=ChromeOptions())

Otherwise these arguments did not seem to be passed to chromedriver, causing an error on testing the webdriver connection. (It also seems to run a crawl/capture after any modification to a check. Is this entirely necessary, or could it be offered as a per-check option?)
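The change from options=ChromeOptions() to options=options can be sketched as follows: build one options object, add the flags to it, and hand that same object to the remote driver. This is an illustration, not CDio's code. The helper names `build_options` and `connect` are hypothetical, and the factories are passed in so the sketch runs without Selenium installed.

```python
# Flags taken from the modifications described in this post.
HEADLESS_ARGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--disable-gpu",
    "--headless=new",
]


def build_options(options_factory, extra_args=()):
    """Create one options object and add every argument to it."""
    options = options_factory()
    for arg in list(HEADLESS_ARGS) + list(extra_args):
        options.add_argument(arg)
    return options


def connect(remote_factory, command_executor, options):
    """Pass the populated options object, not a fresh empty one, to the driver."""
    return remote_factory(command_executor=command_executor, options=options)


# With Selenium installed, usage would look like this (left commented out
# because it needs a running chromedriver):
#   from selenium import webdriver
#   from selenium.webdriver import ChromeOptions
#   options = build_options(ChromeOptions)
#   driver = connect(webdriver.Remote, "http://localhost:3456", options)
```

Constructing a second, empty ChromeOptions() at the call site discards every add_argument call made earlier, which matches the empty "args" list seen in the InitSession log.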

I also modified line 628's self.driver.set_window_size(1280, 720) to self.driver.set_window_size(1920, 1080) without any apparent ill effects.

After these modifications, testing did appear to work cleanly. However, I noticed that any small error in content_fetcher.py (my fault from the amateur hacking) would often leave chromedriver unable to open or pass control signals to Chromium properly, so the spawned Chromium processes were not killed. This resulted in several abandoned Chromium processes consuming a lot of resources, and I had to do some pkill tidy-up and restart chromedriver. This seemed to be due to Selenium not being told to quit the sessions properly following an exception or error. Perhaps that's something which could be done as a failsafe if CDio/content_fetcher errors out mid-loop?
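One way to get that failsafe is to wrap the driver in a context manager so quit() always runs, even when the fetch code raises. This is a sketch under the assumption that the driver object exposes quit(), as Selenium's WebDriver does; `managed_driver` is a hypothetical helper, not something CDio provides.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver_factory):
    """Yield a driver and always call quit() on it, even if the body raises."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # quit() ends the WebDriver session so chromedriver can stop the browser.
        driver.quit()


# Usage sketch (commented out because it needs Selenium and a running chromedriver):
#   with managed_driver(lambda: webdriver.Remote(command_executor=url, options=options)) as driver:
#       driver.get("https://example.com")
```

If creating the driver itself fails, there is no session to clean up, so the factory call sits outside the try block.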

I ended up hacking the file directly as I could see no way of specifying arguments to pass to the webdriver. If I were more skilled with GitHub I could submit a PR, but I don't have any kind of managed code environment set up on this machine. I think it would be more useful if a user could configure custom Chromium flags to pass to the webdriver in the CDio settings, and if CDio also did some informal checks for things like an X environment and other apparent prerequisites (dbus etc.) that Chrome needs to run properly on a headless machine.
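A rough sketch of what such a setting and check could look like, purely as an illustration of the request. Both function names are hypothetical, the socket path is the one from the dbus error in the chromedriver log, and treating a missing socket as a warning rather than a failure is an assumption.

```python
import os
import shlex

# Socket path taken from the dbus error in the chromedriver log.
DBUS_SYSTEM_SOCKET = "/var/run/dbus/system_bus_socket"


def parse_extra_chrome_args(raw):
    """Split a user-supplied string such as '--no-sandbox --headless=new' into flags."""
    return [arg for arg in shlex.split(raw or "") if arg.startswith("--")]


def preflight_warnings(env=None, path_exists=os.path.exists, headless=False):
    """Return human-readable warnings about missing browser prerequisites."""
    env = os.environ if env is None else env
    warnings = []
    # Without headless mode, Chrome needs an X server reachable through DISPLAY.
    if not headless and not env.get("DISPLAY"):
        warnings.append("DISPLAY is not set and headless mode is off")
    if not path_exists(DBUS_SYSTEM_SOCKET):
        warnings.append("dbus system socket not found: " + DBUS_SYSTEM_SOCKET)
    return warnings


for message in preflight_warnings(headless=True):
    print("warning:", message)
```

The parsed flags could then be added to the options object with add_argument, one per flag.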

Finally, there's the apparent 'quirk' of changedetection.io not exiting cleanly on the first Ctrl-C, whether running interactively or daemonised. Interactively, this forces you to press Ctrl-C again, but start-stop-daemon invoked by OpenRC is easily confused and bails with an error. Early on I had to manually kill the process and tidy up the pidfile, and a couple of times I had to issue a zap command (rc-service changedetection zap) to make OpenRC 'forget' the service was in a crashed state. There's probably a neater way of dealing with this than the one I ended up using in my init file.
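The cause of the double Ctrl-C isn't established in this thread. As a general pattern only, a Python service can exit on the first signal by recording the request in a handler and letting its main loop wind down. The names below are hypothetical and this is not how CDio is known to be structured.

```python
import signal
import threading

# Set once a shutdown has been requested; worker loops can poll or wait on it.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: record the request so the main loop can exit on its own."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    # SIGINT is Ctrl-C; SIGTERM is what start-stop-daemon sends by default.
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)


# A main loop would then look something like:
#   install_handlers()
#   while not shutdown_requested.wait(timeout=1.0):
#       do_one_unit_of_work()
```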

On the whole now, things are running well. My VM is running on a Synology NAS, so it's not very powerful. However it can manage one Chrome check alongside other plaintext checks no issue, which is all I want it to do at the moment. Chrome takes around 40-60 seconds to spawn and very occasionally bails, though it's now able to fairly reliably perform a successful check. Occasionally it fails or times out, so I don't know whether being able to manually dial in higher timeout thresholds or adjust any implicit/explicit waits for Selenium into the check options through the web GUI might help.
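Selenium's own mechanism for explicit waits is WebDriverWait. The idea behind a configurable timeout can be sketched without the library as a simple polling loop; `wait_until` and its parameters are hypothetical names for illustration, not a CDio or Selenium API.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns a truthy value or timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(poll_interval)


# Usage sketch on a slow machine: allow up to two minutes for a page title to appear.
#   wait_until(lambda: driver.title, timeout=120, poll_interval=2)
```

Exposing the timeout as a per-check setting would let slow hosts raise it without touching the code.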

Fantastic application though, really useful. I tried it out after failing with other similar apps/scripts that had steeper learning curves once you really got into them, and the Telegram integration was surprisingly simple once I'd followed some instructions online. I'm not sure to what extent I can include excerpts of monitored text or screengrabs in Telegram notifications; I've not played with it much yet due to the effort required just to get even one test run out of my VM. However, it is working. I call that a win for now. 😄

dgtlmoon · 2023-11-11T23:52:55Z

dgtlmoon
Nov 11, 2023
Maintainer

Moved to 'discussions', there's way too much in your post to unpack

1 reply

ChristopherW Nov 12, 2023
Author

Understood. It began as a bug report but developed...

dgtlmoon · 2023-11-11T23:52:57Z

dgtlmoon
Nov 11, 2023
Maintainer

Heya

1 You don't mention how you installed it at all: docker? pip? unzip? tape drive?

2
> CDio was also not capable of detecting the $WEBDRIVER_URL environment variable, so I resorted to adding export WEBDRIVER_URL="http://localhost:3456"

But how did you run it? Having to export the env var is always how it works for any application...?

If we don't know anything about how you tried to run it, then I don't know how to help you here at all.

2 replies

ChristopherW Nov 12, 2023
Author

I installed with pip3, everything is native on the VM as it didn't have enough resources to run Docker images.

ChristopherW Nov 12, 2023
Author

I was expecting chrome-webdriver to have set an environment variable as part of its install, which was a little gotcha.

dgtlmoon · 2023-11-11T23:56:29Z

dgtlmoon
Nov 11, 2023
Maintainer

In #1783 the Selenium driver options were tweaked; maybe you have an old Selenium library version?

4 replies

ChristopherW Nov 12, 2023
Author

I have the latest publicly available version for Python, per pip3 list: selenium 4.15.2

sofakng Nov 15, 2023

I think I might be having a similar problem.

Does Changedetection pass '--no-sandbox' to ChromeOptions by default?

It doesn't seem to do so and I can't add it using the URL parameters either...

dgtlmoon Nov 15, 2023
Maintainer

@sofakng but you don't say exactly what the problem is. Without any pasted errors, version information etc., there's no context to help you.

sofakng Nov 15, 2023

Sorry... I'm trying to use CDio with Browserless (Docker image) and WebDriver (instead of Playwright or Puppeteer).

I've set WEBDRIVER_URL to 'http://browserless:3000/webdriver' but when changing the Fetch Method it returns an error:
Content fetcher 'html_webdriver' did not respond properly, unable to use it. Message: session not created: Chrome failed to start: exited normally. (session not created: DevToolsActivePort file doesn't exist) (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.) Stacktrace: #0 0x55f8a3014fb3 #1 0x55f8a2ce84a7 #2 0x55f8a2d1bc93 #3 0x55f8a2d1810c #4 0x55f8a2d5aac6 #5 0x55f8a2d51713 #6 0x55f8a2d2418b #7 0x55f8a2d24f7e #8 0x55f8a2fda8d8 #9 0x55f8a2fde800 #10 0x55f8a2fe8cfc #11 0x55f8a2fdf418 #12 0x55f8a2fac42f #13 0x55f8a30034e8 #14 0x55f8a30036b4 #15 0x55f8a3014143 #16 0x7fcdc01b76ba

Browserless shows the following log messages:

browserless  | [1700067016.437][INFO]: Launching chrome: /usr/bin/google-chrome --allow-pre-commit-input --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --enable-automation --enable-logging --log-level=0 --no-first-run --no-service-autorun --password-store=basic --remote-debugging-port=0 --test-type=webdriver --use-mock-keychain --user-data-dir=/tmp/browserless-data-dir-GdhYUY
browserless  | [53:53:1115/165016.475568:FATAL:zygote_host_impl_linux.cc(127)] No usable sandbox! Update your kernel or see https://chromium.googlesource.com/chromium/src/+/main/docs/linux/suid_sandbox_development.md for more information on developing with the SUID sandbox. If you want to live dangerously and need an immediate workaround, you can try using --no-sandbox.

If I understand correctly it seems like CDio might not be passing the no-sandbox argument?
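The InitSession payload in the chromedriver log earlier in this thread shows an empty "args" list, which fits that reading. For comparison, here is a sketch of the same payload shape with flags populated, plus a small check for one flag. The function names are hypothetical and the structure is copied from the logged payload.

```python
import json


def chrome_capabilities(args):
    """Build a capabilities payload in the logged shape, carrying Chrome flags."""
    return {
        "capabilities": {
            "alwaysMatch": {
                "browserName": "chrome",
                "goog:chromeOptions": {"args": list(args), "extensions": []},
                "pageLoadStrategy": "normal",
            },
            "firstMatch": [{}],
        }
    }


def has_flag(payload, flag):
    """Return True if the payload's Chrome args include the given flag."""
    always = payload.get("capabilities", {}).get("alwaysMatch", {})
    return flag in always.get("goog:chromeOptions", {}).get("args", [])


# Print the payload so it can be compared against the InitSession entry in a log.
print(json.dumps(chrome_capabilities(["--no-sandbox", "--headless=new"]), indent=3))
```

If the InitSession entry in the Browserless or chromedriver log still shows an empty "args" list after a change, the flags are not reaching the browser.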
