rewrote scrapers for new ui by mueslimak3r · Pull Request #3 · hoelty/litepubify

mueslimak3r · 2024-06-02T23:18:18Z

Opening this as a draft because it hasn't been thoroughly tested and its output doesn't yet respect the "debug" flag.
Also, because I scrape the list of series parts from each series' own page, series parts/stories don't have stats such as rating. Only one-shot stories get those stats.

This was easier than handling a click on the "View Full 128 Part Series" button on the author works page and then parsing the list from there. The list on the author's page is the one that has the stats in each story's card.

In its current state this is adequate for me, so I'll leave it as-is.
If anyone wants to finish what I've started, I'll keep an eye out and update this PR as needed.

Closes #2

domaniko · 2024-08-13T18:23:36Z

Thank you for the PR.

It works for quite a few texts, but I found some issues with others.

These small changes improved it a lot for me:

@@ -390,7 +390,7 @@ def parse_series_page(page_url, author):
 
 def parse_author_works_page(html):
     soup = bs4.BeautifulSoup(html, 'html.parser')
-    author_element = soup.find('h1', class_='headline__title')
+    author_element = soup.find('title')
     if not author_element:
         error("Cannot determine author on member page.")
     if "Stories by " in author_element.text.strip():

and

@@ -478,16 +478,16 @@ def get_story_text(st):
     #[0].select("div[class^=_item_title]")[0]['href']
 
     #vals = re.findall('<option value=".*?">(\d+)</option>', sel_match.group(1))
-    if not paginator_elements: # just one page
-        error("Couldn't find paginator elements.")
     complete_text = ""
 
     end = 1
-    for pe in paginator_elements:
-        if pe.text.strip() == '' or not pe.text.strip().isnumeric():
-            continue
-        if int(pe.text.strip()) > end:
-            end = int(pe.text.strip())
+    if paginator_parent_element:
+        for pe in paginator_elements:
+            if pe.text.strip() == '' or not pe.text.strip().isnumeric():
+                continue
+            if int(pe.text.strip()) > end:
+                end = int(pe.text.strip())
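
The page-count logic in the second hunk can be sketched in isolation as follows. This is a minimal illustration, not the code in the PR: the function name `last_page_number` and the idea of passing the paginator labels as plain strings are assumptions made so the example runs without bs4. It uses `str.isdecimal()` rather than `isnumeric()`, because `isnumeric()` also accepts characters such as "½" that `int()` rejects.

```python
def last_page_number(labels):
    """Return the highest numeric page label, or 1 if there is no paginator.

    `labels` is a hypothetical list of the text of each paginator element
    (for example ["1", "2", "7", "Next"]); None or an empty list means the
    story has a single page.
    """
    end = 1
    for label in labels or []:
        text = label.strip()
        # Skip empty labels and non-numeric ones such as "Next".
        if not text.isdecimal():
            continue
        end = max(end, int(text))
    return end
```

With this shape, a story without a paginator falls through to `end = 1` instead of raising an error, which is the behavior the change above aims for.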

The generated EPUBs now lack the additional line break between paragraphs, which makes reading on some e-book readers a bit more awkward.
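
One possible way to restore the spacing is sketched below. It is not taken from litepubify: the helper `paragraphs_to_xhtml`, the `PARAGRAPH_CSS` rule, and the assumption that the story text arrives as plain text with blank lines between paragraphs are all hypothetical. The idea is to wrap each paragraph in its own `<p>` element and let a stylesheet rule add the gap, escaping `&`, `<` and `>` along the way.

```python
from html import escape

# Hypothetical stylesheet rule: one line of space after every paragraph.
PARAGRAPH_CSS = "p { margin: 0 0 1em 0; }"


def paragraphs_to_xhtml(text):
    """Wrap blank-line-separated paragraphs in <p> tags.

    Assumes `text` is plain text, not markup. The characters &, < and >
    are escaped so the result is safe to embed in an XHTML chapter file.
    """
    parts = [p.strip() for p in text.split("\n\n")]
    return "\n".join(f"<p>{escape(p, quote=False)}</p>" for p in parts if p)
```

For example, `paragraphs_to_xhtml("Tom & Jerry\n\nThe end")` returns two `<p>` elements, with the ampersand written as `&amp;`.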

mueslimak3r · 2024-10-10T04:55:51Z

@domaniko can you open a PR to merge your changes into my branch so they can be added to this PR with proper attribution?

I'm also happy to just make your changes on my end and push them.

domaniko · 2024-10-10T16:33:14Z

Done: mueslimak3r#1

domaniko · 2024-10-19T08:19:32Z

@mueslimak3r Could you please consider mueslimak3r#2?

Issue with authors who only published single stories

mueslimak3r added 2 commits June 2, 2024 16:13

rewrote scrapers using bs4 for new ui

b2f89ba

fixed title parsing for series part

13b0f77

mueslimak3r marked this pull request as ready for review July 15, 2024 02:29

Additional fix needed for new UI

31df3c7

mueslimak3r and others added 2 commits October 10, 2024 15:56

Merge pull request #1 from domaniko/new-ui-fix

6e5182d

Issue with authors who only published single stories

24bbf71

mueslimak3r and others added 6 commits January 20, 2025 23:00

Merge pull request #2 from domaniko/new-ui-fix

ac26d6a

Issue with authors who only published single stories

support series url instead of needing url of first story in series

a4d2b58

Merge branch 'new-ui-fix'

333ae81

Add license that matches upstream

1533ba1

Update README.md

e9bcf3c

fixed HTML error where & wasn't escaped

c145124

mueslimak3r force-pushed the new-ui-fix branch from 00e1ba2 to c145124 Compare April 20, 2025 22:39

mueslimak3r wants to merge 11 commits into hoelty:master from mueslimak3r:new-ui-fix
