Skip to content

Conversation

navin772
Copy link
Member

@navin772 navin772 commented Oct 16, 2025

User description

🔗 Related Issues

In here - #16419 (comment), the txt and image files were getting downloaded as html. The webserver wasn't handling the mimetypes, this PR fixes it.

💥 What does this PR do?

🔧 Implementation Notes

I have used filetype for mimetype guessing but it isn't very good for text file mime guessing since it works on magic bytes of files. We are already using filetype and it was added in this PR - #14771 as per discussion in #14765 (comment).

We can also use the mimetypes library.

💡 Additional Considerations

🔄 Types of changes

  • Bug fix (backwards compatible)

PR Type

Bug fix


Description

  • Add proper MIME type detection using mimetypes library for file serving

  • Handle binary files (images, audio, video) separately from text files

  • Remove workarounds in download tests that accepted incorrect file extensions

  • Simplify test assertions to expect correct file types


Diagram Walkthrough

flowchart LR
  A["Webserver File Handler"] -->|"guess_type()"| B["MIME Type Detection"]
  B -->|"Binary Content"| C["Read as Binary"]
  B -->|"Text Content"| D["Read as Text + Encode"]
  C -->|"Correct Extension"| E["File Downloaded Correctly"]
  D -->|"Correct Extension"| E
  F["Test Assertions"] -->|"Remove Workarounds"| G["Expect Correct File Types"]
Loading

File Walkthrough

Relevant files
Bug fix
webserver.py
Implement proper MIME type detection for file serving       

py/test/selenium/webdriver/common/webserver.py

  • Import mimetypes module for MIME type detection
  • Refactor _serve_file() to handle binary and text files separately
    based on MIME type
  • Update do_GET() to use mimetypes.guess_type() instead of hardcoded
    .json check
  • Binary files (images, audio, video, applications) are read in binary
    mode; text files use latin-1 encoding
+16/-3   
Tests
remote_downloads_tests.py
Remove download test workarounds for correct file types   

py/test/selenium/webdriver/remote/remote_downloads_tests.py

  • Remove TODO comments about Chrome downloading files as HTML
  • Simplify test_get_downloadable_files() to assert exact file names
    instead of accepting multiple extensions
  • Simplify test_download_file() to search for .txt files directly
    without fallback extensions
  • Update _browser_downloads() wait condition to expect correct
    file_2.jpg instead of multiple extensions
  • Reduce wait timeout from 5 to 3 seconds
+6/-16   

@selenium-ci selenium-ci added the C-py Python Bindings label Oct 16, 2025
Copy link
Contributor

qodo-merge-pro bot commented Oct 16, 2025

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🟡
🎫 #1234
🔴 Ensure clicking links with JavaScript in href triggers expected behavior in Firefox 42
similar to Selenium 2.47.1.
Provide a fix or regression handling for versions 2.48.0/2.48.2 where alerts no longer
show.
🟡
🎫 #14765
🟢 Ensure tests continue to validate image/content types without imghdr, potentially via
alternative libraries or methods.
🔴 Remove or replace usage of deprecated imghdr module in tests to be compatible with Python
3.13+.
🟡
🎫 #5678
🔴 Investigate and resolve "ConnectFailure (Connection refused)" errors when instantiating
multiple ChromeDriver instances on Ubuntu.
Provide guidance or code changes to prevent connection issues after the first ChromeDriver
instance.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
No custom compliance provided

Follow the guide to enable custom compliance check.

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

Copy link
Contributor

qodo-merge-pro bot commented Oct 16, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Simplify file serving by reading all files in binary mode
Suggestion Impact:The commit changed _serve_file to always open files in binary mode and return the raw bytes, removing the previous latin-1 text re-encoding path. It also added MIME type detection with filetype and simple fallbacks.

code diff:

     def _serve_file(self, file_path):
         """Serve a file from the HTML root directory."""
-        content_type, _ = mimetypes.guess_type(file_path)
-
-        if content_type and (
-            content_type.startswith("image/")
-            or content_type.startswith("application/")
-            or content_type.startswith("video/")
-            or content_type.startswith("audio/")
-        ):
-            with open(file_path, "rb") as f:
-                return f.read()
+        with open(file_path, "rb") as f:
+            content = f.read()
+
+        kind = filetype.guess(content)
+        if kind is not None:
+            return content, kind.mime
+
+        # fallback for text files that filetype can't detect
+        if file_path.endswith(".txt"):
+            return content, "text/plain"
+        elif file_path.endswith(".json"):
+            return content, "application/json"
         else:
-            # text files
-            with open(file_path, encoding="latin-1") as f:
-                return f.read().encode("utf-8")
+            return content, "text/html"
 

Simplify the _serve_file method by opening all files in binary mode ("rb") to
avoid unnecessary and potentially corrupting text re-encoding.

py/test/selenium/webdriver/common/webserver.py [72-87]

 def _serve_file(self, file_path):
     """Serve a file from the HTML root directory."""
-    content_type, _ = mimetypes.guess_type(file_path)
+    with open(file_path, "rb") as f:
+        return f.read()
 
-    if content_type and (
-        content_type.startswith("image/")
-        or content_type.startswith("application/")
-        or content_type.startswith("video/")
-        or content_type.startswith("audio/")
-    ):
-        with open(file_path, "rb") as f:
-            return f.read()
-    else:
-        # text files
-        with open(file_path, encoding="latin-1") as f:
-            return f.read().encode("utf-8")
-

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that re-encoding text files from latin-1 to utf-8 is problematic and can cause corruption. The proposed change to read all files in binary mode is simpler, more robust, and the correct approach for a web server.

Medium
Learned
best practice
Default Content-Type when unknown
Suggestion Impact:The commit refactored content-type detection to return both content and MIME type from _serve_file using filetype.guess, and added explicit fallbacks (e.g., text/plain, application/json, text/html). This ensures a valid Content-Type is always set even when guessing fails, aligning with the suggestion's intent.

code diff:

     def _serve_file(self, file_path):
         """Serve a file from the HTML root directory."""
-        content_type, _ = mimetypes.guess_type(file_path)
-
-        if content_type and (
-            content_type.startswith("image/")
-            or content_type.startswith("application/")
-            or content_type.startswith("video/")
-            or content_type.startswith("audio/")
-        ):
-            with open(file_path, "rb") as f:
-                return f.read()
+        with open(file_path, "rb") as f:
+            content = f.read()
+
+        kind = filetype.guess(content)
+        if kind is not None:
+            return content, kind.mime
+
+        # fallback for text files that filetype can't detect
+        if file_path.endswith(".txt"):
+            return content, "text/plain"
+        elif file_path.endswith(".json"):
+            return content, "application/json"
         else:
-            # text files
-            with open(file_path, encoding="latin-1") as f:
-                return f.read().encode("utf-8")
+            return content, "text/html"
 
     def _send_response(self, content_type="text/html"):
         """Send a response."""
@@ -102,8 +103,7 @@
                 self._send_response("text/html")
                 self.wfile.write(html)
             elif os.path.isfile(file_path):
-                content_type, _ = mimetypes.guess_type(file_path)
-                content = self._serve_file(file_path)
+                content, content_type = self._serve_file(file_path)
                 self._send_response(content_type)
                 self.wfile.write(content)

Ensure a valid Content-Type is always sent by defaulting to a sensible MIME type
when guessing fails. This avoids sending a None header value.

py/test/selenium/webdriver/common/webserver.py [104-108]

 elif os.path.isfile(file_path):
     content_type, _ = mimetypes.guess_type(file_path)
+    if not content_type:
+        content_type = "application/octet-stream"
     content = self._serve_file(file_path)
     self._send_response(content_type)
     self.wfile.write(content)

[Suggestion processed]

Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Validate inputs and states early, providing safe defaults to prevent logic errors and None-related issues.

Low
  • Update

@cgoldberg
Copy link
Member

good find! thanks for fixing the tests also.

@navin772 navin772 merged commit 8a91342 into SeleniumHQ:trunk Oct 17, 2025
4 checks passed
@navin772 navin772 deleted the content-type-webserver branch October 17, 2025 04:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants