maxvst · corsair20141 · Oct 8, 2020 · Oct 8, 2020 · Oct 8, 2020 · Oct 24, 2020
diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
@@ -0,0 +1,39 @@
+# This workflow will upload a Python Package using Twine when a release is created
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python#publishing-to-package-registries
+
+# This workflow uses actions that are not certified by GitHub.
+# They are provided by a third-party and are governed by
+# separate terms of service, privacy policy, and support
+# documentation.
+
+name: Upload Python Package
+
+on:
+  release:
+    types: [published]
+
+permissions:
+  contents: read
+
+jobs:
+  deploy:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python
+      uses: actions/setup-python@v3
+      with:
+        python-version: '3.10'
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install build
+    - name: Build package
+      run: python -m build
+    - name: Publish package
+      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
+      with:
+        user: __token__
+        password: ${{ secrets.PYPI_API_TOKEN }}
diff --git a/.gitignore b/.gitignore
@@ -126,3 +126,5 @@ dmypy.json
 *.pdf
 *.html
 chromedriver
+requirements.txt
+main.py
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2019 Maksim
+Copyright (c) 2020
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -1,66 +1,80 @@
-# python-selenium-chrome-html-to-pdf-converter
+# pyhtml2pdf
 Simple python wrapper to convert HTML to PDF with headless Chrome via selenium.
 
-## Installation
-Clone repository, move to project root dir, install virtualenv, install dependencies:
+## Install
 ```
-git clone https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter.git
-cd python-selenium-chrome-html-to-pdf-converter
-python3 -m venv venv
-source venv/bin/activate
-pip install -r requirements.txt
+pip install pyhtml2pdf
 ```
-Install chrome (chromium) browser.
 
-Download chromedriver from http://chromedriver.chromium.org/ and put it to project root directory.
+## Dependencies
+
+ - [Selenium Chrome Webdriver](https://chromedriver.chromium.org/downloads) (if Chrome is installed on the machine, you won't need to install the Chrome driver)
+ - [Ghostscript](https://www.ghostscript.com/download.html)
+
+## Example
+
+### **Convert to PDF**
+
+**Use with website url**
 
-## Demo
 ```
-cd examples
-python converter.py https://google.com google.pdf
+from pyhtml2pdf import converter
+
+converter.convert('https://pypi.org', 'sample.pdf')
 ```
 
-## Why use selenium?
-TODO: Add description
+**Use with html file from local machine**
 
-## CSS recomendations
+```
+import os
+from pyhtml2pdf import converter
 
-Basic configuration for single page:
+path = os.path.abspath('index.html')
+converter.convert(f'file:///{path}', 'sample.pdf')
 ```
-@page {
-    size: A4;
-    margin: 0mm;
-}
+
+**Some JS objects may have animations or take some time to render. You can set a timeout, in seconds, to give those objects time to render.**
+
+```
+converter.convert(source, target, timeout=2)
 ```
 
-For printing double-sided documents use
+**Compress the converted PDF**
+
+Some PDFs may be oversized, so there is a built-in PDF compression feature.
+
+The compression power can be one of:
+ - 0: default
+ - 1: prepress
+ - 2: printer
+ - 3: ebook
+ - 4: screen
+
 ```
-@page :left {
-    margin-left: 4cm;
-    margin-right: 2cm;
-}
-
-@page :right {
-    margin-left: 4cm;
-    margin-right: 2cm;
-}
-
-@page :first {
-    margin-top: 10cm    /* Top margin on first page 10cm */
-}
+converter.convert(source, target, compress=True, power=0)
 ```
 
-Control pagination with page-break-before, page-break-after, page-break-inside like
+### **Pass Print Options**
+
+You can pass any of the print options listed in the [`Page.printToPDF` reference](https://vanilla.aslushnikov.com/?Page.printToPDF).
+
 ```
-h1 { page-break-before : right }
-h2 { page-break-after : avoid }
-table { page-break-inside : avoid }
+converter.convert(f"file:///{path}", "sample.pdf", print_options={"scale": 0.95})
 ```
-Control widows and оrphans like
+
+### **Compress PDF**
+
+**Use it to compress a PDF file from local machine**
+
 ```
-@page {
-    orphans:4;
-    widows:2;
-}
+import os
+from pyhtml2pdf import compressor
+
+compressor.compress('sample.pdf', 'compressed_sample.pdf')
 ```
-More descriptions see at https://www.tutorialspoint.com/css/css_paged_media.htm
+
+Inspired by the works from:
+
+ - https://github.com/maxvst/python-selenium-chrome-html-to-pdf-converter.git
+ - https://github.com/theeko74/pdfc
+
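Review note on the local-file example in the README above: `f'file:///{path}'` depends on how the absolute path is spelled on each OS (a POSIX path already starts with `/`, and a Windows path contains backslashes). A minimal sketch of one alternative, using the standard library's `pathlib.Path.as_uri()`. The helper name `local_file_url` is hypothetical and is not part of pyhtml2pdf.

```python
from pathlib import Path


def local_file_url(path: str) -> str:
    """Return a file:// URL for a local path (hypothetical helper, not in pyhtml2pdf)."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    # and percent-encodes characters such as spaces.
    return Path(path).resolve().as_uri()


url = local_file_url("my page.html")
print(url)
```

The resulting string could then be passed as the first argument to `converter.convert`, in place of the hand-built f-string.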
diff --git a/SECURITY.md b/SECURITY.md
@@ -0,0 +1,51 @@
+# Security Policy
+
+## Supported Versions
+
+We support security fixes for the latest released version and the `master` branch.
+
+| Version | Supported |
+| ------- | --------- |
+| Latest  | ✅        |
+| Older   | ❌        |
+
+## Reporting a Vulnerability
+
+If you believe you’ve found a security vulnerability, **please do not open a public GitHub issue**.
+
+Instead, report it privately using one of the following:
+
+### Preferred: GitHub Private Vulnerability Reporting
+- Go to: **Security** → **Advisories** → **Report a vulnerability**
+- Provide as much detail as possible (see “What to include” below).
+
+### Alternative: Email
+- Email: **[email protected]**
+
+## What to Include
+
+Please include:
+- A clear description of the issue and potential impact
+- Steps to reproduce (proof-of-concept if available)
+- Affected versions/branches
+- Any suggested fix or mitigation (if you have one)
+
+## Response Timeline
+
+We aim to:
+- Acknowledge receipt within **3 business days**
+- Provide a status update within **7 business days**
+- Release a fix as soon as practical based on severity and complexity
+
+## Coordinated Disclosure
+
+We follow coordinated disclosure practices. Please allow reasonable time to investigate and remediate before any public disclosure.
+
+## Security Updates
+
+Security fixes may be released as:
+- Patch releases
+- Advisory notes (GitHub Security Advisory)
+- Changelog entries (when appropriate)
+
+Thank you for helping keep this project and its users safe.
diff --git a/examples/converter.py b/examples/converter.py
diff --git a/pyhtml2pdf/__init__.py b/pyhtml2pdf/__init__.py
diff --git a/pyhtml2pdf/compressor.py b/pyhtml2pdf/compressor.py
@@ -0,0 +1,131 @@
+import logging
+import os
+import platform
+import subprocess
+from pathlib import Path
+from tempfile import NamedTemporaryFile, _TemporaryFileWrapper
+from typing import Literal, Union
+
+from .utils import _pdf_has_suspicious_content
+
+MAX_BYTES = 25 * 1024 * 1024
+
+logger = logging.getLogger(__name__)
+
+
+def compress(
+    source: str | os.PathLike | _TemporaryFileWrapper,
+    target: str | os.PathLike,
+    power: int = 0,
+    ghostscript_command: Union[Literal["gs", "gswin64c", "gswin32c"], None] = None,
+    max_pdf_size: int = MAX_BYTES,
+    timeout: int = 10,
+    force_process: bool = False,
+) -> None:
+    """Compress a PDF file with Ghostscript.
+
+    :param source: Source PDF file
+    :param target: Target location to save the compressed PDF
+    :param power: Power of the compression. Default value is 0. This can be
+                    0: default,
+                    1: prepress,
+                    2: printer,
+                    3: ebook,
+                    4: screen
+    :param ghostscript_command: The name of the ghostscript executable. If left as None (the default), it is
+                                inferred from the OS.
+                                If the OS is not Windows, "gs" is used as executable name.
+                                If the OS is Windows, and it is a 64-bit version, "gswin64c" is used. If it is a 32-bit
+                                version, "gswin32c" is used.
+    :param max_pdf_size: Maximum allowed size for the PDF in bytes. Default is 25 MB.
+    :param timeout: Timeout in seconds
+    :param force_process: Whether to process even if suspicious content is found (Be extra careful with this setting).
+    """
+    quality = {0: "/default", 1: "/prepress", 2: "/printer", 3: "/ebook", 4: "/screen"}
+
+    if ghostscript_command is None:
+        if platform.system() == "Windows":
+            if platform.machine().endswith("64"):
+                ghostscript_command = "gswin64c"
+            else:
+                ghostscript_command = "gswin32c"
+        else:
+            ghostscript_command = "gs"
+
+    if isinstance(source, _TemporaryFileWrapper):
+        source = source.name
+
+    source = Path(source)
+    target = Path(target)
+
+    if not source.is_file():
+        raise FileNotFoundError("Source file does not exist")
+
+    if source.suffix.lower() != ".pdf":
+        raise ValueError("Source file is not a PDF")
+
+    issues = _pdf_has_suspicious_content(source, max_pdf_size)
+
+    if issues:
+        logger.warning(
+            "Warning: The PDF file has been flagged for suspicious content.\n\n- %s\n\nProcessing has been skipped to avoid potential security risks.\n\n"
+            "If you believe this is an error, you can set force_process=True to override this behavior. Proceed with caution!\n",
+            "\n- ".join(issues),
+        )
+
+        if not force_process:
+            logger.error(
+                "PDF file flagged for suspicious content. Process aborted.\n\n"
+            )
+            raise RuntimeError(
+                "PDF file flagged for suspicious content. Process aborted."
+            )
+
+    try:
+        subprocess.call(
+            [
+                ghostscript_command,
+                "-dSAFER",
+                "-sDEVICE=pdfwrite",
+                "-dCompatibilityLevel=1.4",
+                "-dPDFSETTINGS={}".format(quality[power]),
+                "-dNOPAUSE",
+                "-dQUIET",
+                "-dBATCH",
+                "-sOutputFile={}".format(target.as_posix()),
+                source.as_posix(),
+            ],
+            shell=platform.system() == "Windows",
+            timeout=timeout,
+        )
+    except subprocess.TimeoutExpired:
+        logger.error(
+            "PDF processing took too long (DoS protection triggered). If you believe this is an error, try increasing the timeout parameter."
+        )
+
+        raise TimeoutError
+
+
+def _compress(
+    result: bytes,
+    target: str | os.PathLike,
+    power: int,
+    timeout: int,
+    ghostscript_command: Union[Literal["gs", "gswin64c", "gswin32c"], None] = None,
+):
+    with NamedTemporaryFile(
+        suffix=".pdf", delete=platform.system() != "Windows"
+    ) as tmp_file:
+        tmp_file.write(result)
+        tmp_file.flush()  # ensure the bytes are on disk before Ghostscript and stat() read the file
+        # Ensure a minimum timeout of 20 seconds for compression when called from converter.py
+        _timeout: int = max(timeout, 20)
+
+        compress(
+            source=tmp_file,
+            target=target,
+            power=power,
+            ghostscript_command=ghostscript_command,
+            max_pdf_size=Path(tmp_file.name).stat().st_size + 1_000_000,
+            timeout=_timeout,
+        )
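Review note: the executable inference and the argument list in `compress` could be factored into pure functions, so they can be unit-tested without Ghostscript installed. The sketch below assumes the same flags as the diff above. The names `infer_ghostscript_command`, `build_gs_args` and `QUALITY` are hypothetical, and the explicit `ValueError` for an unknown `power` is a suggested addition (the code in the diff would raise `KeyError` instead).

```python
import platform
from pathlib import Path

# Same mapping as the `quality` dict in compress().
QUALITY = {0: "/default", 1: "/prepress", 2: "/printer", 3: "/ebook", 4: "/screen"}


def infer_ghostscript_command(system=None, machine=None):
    """Pick the Ghostscript executable name (hypothetical helper)."""
    # Arguments default to the current platform; passing them explicitly
    # makes the function testable on any OS.
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if system == "Windows":
        return "gswin64c" if machine.endswith("64") else "gswin32c"
    return "gs"


def build_gs_args(command, source, target, power=0):
    """Build the Ghostscript argument list used for compression (hypothetical helper)."""
    if power not in QUALITY:
        raise ValueError(f"power must be one of {sorted(QUALITY)}, got {power!r}")
    return [
        command,
        "-dSAFER",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={QUALITY[power]}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={Path(target).as_posix()}",
        Path(source).as_posix(),
    ]


args = build_gs_args(
    infer_ghostscript_command("Linux", "x86_64"), "in.pdf", "out.pdf", power=3
)
print(args)
```

With this split, `compress` would only need to validate the source file and hand the list to `subprocess`, while the platform branches and the `power` lookup get direct test coverage.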