rewrote scrapers for new ui #3

Open
mueslimak3r wants to merge 11 commits into hoelty:master from mueslimak3r:new-ui-fix

@mueslimak3r mueslimak3r commented Jun 2, 2024

Opening this as a draft because it hasn't been thoroughly tested, and hasn't been formatted to respect the "debug" flag.
Also, because I scrape the list of series parts from each series' own page, the parts of a series don't get stats such as rating. Only one-shot stories get those stats.

This was easier than clicking the "View Full 128 Part Series" button on the author's works page and then parsing the list from there. The list on the author's page is the one that has the stats in each story's card.

In its current state this is adequate for me, and I'll leave this as-is.
If anyone wants to finish what I've started, I'll keep an eye out and update this PR as needed.

Closes #2

@mueslimak3r mueslimak3r marked this pull request as ready for review July 15, 2024 02:29
@domaniko

Thank you for the PR.

It works for quite some texts, but I found some issues with others.

These small changes improved it a lot for me:

@@ -390,7 +390,7 @@ def parse_series_page(page_url, author):
 
 def parse_author_works_page(html):
     soup = bs4.BeautifulSoup(html, 'html.parser')
-    author_element = soup.find('h1', class_='headline__title')
+    author_element = soup.find('title')
     if not author_element:
         error("Cannot determine author on member page.")
     if "Stories by " in author_element.text.strip():
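A minimal sketch of how the two lookups in this diff could be combined, so the headline is tried first and `<title>` serves as the fallback. It uses only the standard library rather than `bs4`, and the names `_AuthorTextCollector` and `find_author_text` are hypothetical; only the `headline__title` class and the `<title>` element come from the diff.

```python
from html.parser import HTMLParser


class _AuthorTextCollector(HTMLParser):
    """Collects the text of <h1 class="headline__title"> and <title>."""

    def __init__(self):
        super().__init__()
        self._target = None  # element we are currently inside, if any
        self.found = {"h1": "", "title": ""}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title" or (tag == "h1" and "headline__title" in classes):
            self._target = tag

    def handle_endtag(self, tag):
        if tag == self._target:
            self._target = None

    def handle_data(self, data):
        if self._target:
            self.found[self._target] += data


def find_author_text(html):
    """Return the headline text if present, else the <title> text, else None."""
    collector = _AuthorTextCollector()
    collector.feed(html)
    for key in ("h1", "title"):
        text = collector.found[key].strip()
        if text:
            return text
    return None
```

The returned string would still need the existing "Stories by " check applied to it.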

and

@@ -478,16 +478,16 @@ def get_story_text(st):
     #[0].select("div[class^=_item_title]")[0]['href']
 
     #vals = re.findall('<option value=".*?">(\d+)</option>', sel_match.group(1))
-    if not paginator_elements: # just one page
-        error("Couldn't find paginator elements.")
     complete_text = ""
 
     end = 1
-    for pe in paginator_elements:
-        if pe.text.strip() == '' or not pe.text.strip().isnumeric():
-            continue
-        if int(pe.text.strip()) > end:
-            end = int(pe.text.strip())
+    if paginator_parent_element:
+        for pe in paginator_elements:
+            if pe.text.strip() == '' or not pe.text.strip().isnumeric():
+                continue
+            if int(pe.text.strip()) > end:
+                end = int(pe.text.strip())
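The page-count logic in this diff can also be sketched as a small standalone helper, which makes the single-page case explicit. The name `last_page_number` is hypothetical, and the helper takes plain strings instead of `bs4` elements so that it can be tried in isolation.

```python
def last_page_number(labels):
    """Return the highest page number among paginator labels, defaulting to 1."""
    end = 1
    for label in labels or ():
        text = label.strip()
        # isdecimal() is stricter than isnumeric(): it rejects characters
        # such as "½" that isnumeric() accepts but int() cannot parse.
        if not text.isdecimal():
            continue
        end = max(end, int(text))
    return end
```

With the variables from the diff it would be called as `last_page_number(pe.text for pe in paginator_elements)`. A story without a paginator yields 1.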

The generated EPUBs now do not have an additional line break between paragraphs, which makes reading in some e-book readers a little more awkward.

@mueslimak3r
Author

@domaniko can you open a PR to merge your changes into my branch so they can be added to this PR with proper attribution?

I'm also happy to just make your changes on my end and push them.

@domaniko

Done, see mueslimak3r#1

@domaniko

@mueslimak3r Could you please consider mueslimak3r#2?

Development

Successfully merging this pull request may close these issues.

needs updating