
Fix Static Crawling Issue Due to Newly Implemented Anti-Scraping Mechanism #109

Open
JunTingLin wants to merge 3 commits into mlouielu:master from JunTingLin:fix-static-crawl-issue

@JunTingLin

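Hello author,

First of all, thank you for developing and sharing such a useful project. While using it, I found that since the Lunar New Year, the previous approach of fetching all the stock code data from http://isin.twse.com.tw/isin/C_public.jsp?strMode=2 with the static crawler requests no longer works. I suspect this is because the site has strengthened its anti-scraping measures.

To solve this, I modified the fetch_data function in fetch.py to use Selenium for dynamic crawling instead. Since some users may run this project in an environment without a GUI, I enabled headless mode. However, once headless mode was enabled, I frequently ran into connection failures. After some experimentation I found a workable solution: first visit the main page https://isin.twse.com.tw and pause for a few seconds, then visit the target URL. With that sequence, the required data can be fetched successfully.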
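The sequence described above could be sketched roughly as follows. This is not the PR's actual diff: the function names `build_headless_driver` and `fetch_with_warmup`, the constant names, and the injectable `sleep` parameter are hypothetical, and the sketch assumes Chrome and the Selenium package are installed. Only the two URLs and the 5-second pauses come from this PR.

```python
import time

MAIN_PAGE_URL = "https://isin.twse.com.tw"
TARGET_URL = "http://isin.twse.com.tw/isin/C_public.jsp?strMode=2"
WARMUP_PAUSE_SECONDS = 5  # the 5-second pause used in the PR, given a name


def build_headless_driver():
    """Create a headless Chrome driver (assumes Chrome and Selenium are installed)."""
    # Imported here so this module still loads where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without a GUI
    return webdriver.Chrome(options=options)


def fetch_with_warmup(driver, url=TARGET_URL, main_page_url=MAIN_PAGE_URL,
                      pause=WARMUP_PAUSE_SECONDS, sleep=time.sleep):
    """Visit the main page, pause, then load the target URL and return its HTML.

    `driver` is any object with `get(url)` and `page_source`;
    `sleep` is injectable so the pause can be replaced in tests.
    """
    driver.get(main_page_url)  # warm-up visit to the main page
    sleep(pause)
    driver.get(url)            # the page that lists the stock codes
    sleep(pause)
    return driver.page_source
```

A caller would create the driver with `build_headless_driver()`, pass it to `fetch_with_warmup`, and call `driver.quit()` in a `finally` block so the browser process is always released.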

If there is any problem with my changes, or if there is a better solution, please feel free to contact me.

@JeffBla
Contributor

JeffBla commented Apr 11, 2024

Hello JunTingLin! I think I ran into the same problem as you: the update function fails.
I analyzed it and made some adjustments in #110.
I'm wondering whether it would be better not to use Selenium?

@mitchhuang777 mitchhuang777 left a comment

  1. Consider adding try-except blocks to handle potential exceptions.
  2. Use WebDriverWait(driver, 10).until rather than time.sleep.
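Both suggestions can be combined in one helper. The following is a minimal sketch, not code from this PR: the name `load_page_source` is hypothetical, and waiting for a `<table>` element is an assumption about what the target page renders. The Selenium imports are placed inside the function only so that the snippet loads on machines without Selenium.

```python
import logging

WAIT_TIMEOUT_SECONDS = 10  # the timeout suggested in the review


def load_page_source(driver, url, timeout=WAIT_TIMEOUT_SECONDS):
    """Load `url` and return its HTML once a <table> is present, or None on failure."""
    from selenium.common.exceptions import TimeoutException, WebDriverException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        driver.get(url)
        # Poll until the element appears instead of sleeping a fixed time.
        # Waiting for a <table> is an assumption about the page's content.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "table"))
        )
        return driver.page_source
    except TimeoutException:
        logging.warning("Timed out after %s s waiting for %s", timeout, url)
        return None
    except WebDriverException as exc:
        logging.warning("WebDriver error while loading %s: %s", url, exc)
        return None
```

Unlike a fixed `time.sleep(5)`, the wait returns as soon as the element is found and only takes the full timeout when the page fails to render.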

driver.get(main_page_url)
time.sleep(5) # Wait for JavaScript rendering to finish
driver.get(url)
time.sleep(5) # Wait for JavaScript rendering to finish

A magic number is not a good approach :(

777

# Use WebDriver to visit the main page first, then the specified URL
main_page_url = "https://isin.twse.com.tw"
driver.get(main_page_url)
time.sleep(5) # Wait for JavaScript rendering to finish

A magic number is not a good approach :(

4 participants