
Commit ea901b2

rename to v2ex_scrapy

1 parent 4d643d1

File tree

19 files changed: +283 -171 lines


README.md

Lines changed: 55 additions & 0 deletions
# A scraper for v2ex.com

A small crawler written while learning Scrapy.

## Running the spider yourself is not recommended; the data is already available

## About the scraped data

All data is stored in a SQLite database for easy analysis; the full database is about 2.1 GB.

The spider's source code is on GitHub, where the complete SQLite database file is also published as a release.

The spider starts from `topic_id = 1` and requests `https://www.v2ex.com/t/{topic_id}`. The server may answer 404/403/302/200: 404 means the topic was deleted, 403 means the crawler is being rate-limited, 302 usually redirects to the login page (occasionally to the home page), and 200 is a normal topic page.
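As a rough sketch of how these statuses could be handled in a Scrapy spider (illustrative only; the spider name, callback, and yielded items below are hypothetical, not this project's actual code):

```python
import scrapy


class TopicSketchSpider(scrapy.Spider):
    """Hypothetical sketch of per-status handling, not the project's spider."""

    name = "v2ex-sketch"
    # Let 403/404 (and unredirected 302) responses reach the callback.
    handle_httpstatus_list = [302, 403, 404]

    def start_requests(self):
        for topic_id in range(1, 100):
            yield scrapy.Request(
                f"https://www.v2ex.com/t/{topic_id}",
                callback=self.parse_topic,
                # Disable RedirectMiddleware so a 302 is seen directly.
                meta={"dont_redirect": True, "topic_id": topic_id},
            )

    def parse_topic(self, response):
        topic_id = response.meta["topic_id"]
        if response.status == 404:
            return  # topic deleted; nothing to record
        if response.status == 403:
            self.logger.warning("rate limited at topic %s", topic_id)
            return
        if response.status == 302:
            # Login-required (or home-page) redirect: record only the id.
            yield {"topic_id": topic_id, "login_required": True}
            return
        # 200: parse the topic page normally.
        yield {"topic_id": topic_id, "title": response.css("h1::text").get()}
```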
The spider does not log in, so the data is incomplete: for example, topics visible only after login (水下火热) were not crawled. For topics that return 302, the topic id is recorded; 404/403 topics are not recorded.

While crawling, the spider collects topic content, comments, and the profile of each commenting user.

Note 1: Only halfway through did I notice that topic supplements (附言) were not being crawled; supplements are only collected from `topic_id = 448936` onward.

Note 2: `select count(*) from member` gives a fairly small user count, roughly 200K, because users are discovered through comments and topic authorship: an account that has never commented or posted is never reached. Some accounts are also missing because certain topics are inaccessible, and some because the accounts themselves were deleted. (The code has since been changed so these can be crawled, but the crawl was already finished...)

Note 3: All timestamps are seconds since the Unix epoch (UTC+0).
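For instance, converting a stored value back to a readable UTC datetime in Python (the timestamp here is made up for illustration):

```python
from datetime import datetime, timezone

# 1690000000 is an example create_at value, not real data.
print(datetime.fromtimestamp(1690000000, tz=timezone.utc))
# -> 2023-07-22 04:26:40+00:00
```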
## Running

### Install dependencies

```bash
pip install -r .\requirements.txt
```

### Configuration

#### Proxy

Change the value of `PROXIES` in `v2ex_scrapy/settings.py`, for example:

```python
[
    "http://127.0.0.1:7890"
]
```

Each request picks one of these proxies at random. If you need a more sophisticated proxy strategy, use a third-party library or implement your own downloader middleware, as sketched below.
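A minimal sketch of such a custom downloader middleware (my own illustration, assuming the `PROXIES` list from settings; the project's built-in proxy handling may be implemented differently):

```python
import random


class RandomProxyMiddleware:
    """Assigns a random proxy from the PROXIES setting to each request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Read the PROXIES list from the project settings.
        return cls(crawler.settings.getlist("PROXIES"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
```

Enable it by registering the class under `DOWNLOADER_MIDDLEWARES` in `settings.py`.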
### Run the spider

```bash
scrapy crawl v2ex
```

### Caveats

A 403 during crawling almost always means your IP has been rate-limited; wait a while and it will recover.
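If you would rather reduce 403s than wait them out, one option (an assumption on my part; this project may not configure any of these) is Scrapy's built-in throttling settings:

```python
# settings.py -- hypothetical tuning, not necessarily what this project uses
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
DOWNLOAD_DELAY = 0.5
RETRY_HTTP_CODES = [403]  # retry rate-limited responses after a delay
```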

analysis.py

Lines changed: 57 additions & 0 deletions
```python
import sqlite3

import pandas
import plotly.express as px

conn = sqlite3.connect("./v2ex.sqlite")

# Topics created per month.
topic = pandas.read_sql(
    """
    SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS topic_count
    FROM topic
    GROUP BY date;
    """,
    conn,
)
# Skip the first row: zero/invalid timestamps land in 1970.
fig = px.line(topic[1:], x="date", y="topic_count")
fig.show()

# Comments posted per month.
comment = pandas.read_sql(
    """
    SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS comment_count
    FROM comment
    GROUP BY date;
    """,
    conn,
)
fig = px.line(comment, x="date", y="comment_count")
fig.show()

# Users registered per month.
user = pandas.read_sql(
    """
    SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS user_count
    FROM member
    GROUP BY date;
    """,
    conn,
)
# Skip the first row: zero/invalid timestamps land in 1970.
fig = px.line(user[1:], x="date", y="user_count")
fig.show()
```
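Assuming the released database is saved as `v2ex.sqlite` next to the script, `python analysis.py` produces the three charts; each `fig.show()` opens an interactive Plotly figure in the browser.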

db.sql

Lines changed: 0 additions & 40 deletions
This file was deleted.

query.sql

Lines changed: 96 additions & 0 deletions
```sql
select count(*)
from comment;
-- where create_at between strftime('%s', '2013-01-01') and strftime('%s', '2014-12-31');

select count(*)
from member;

select count(*)
from topic;

-- top comments by thank_count
select topic_id, c.id, thank_count
from comment c
         left join topic t on t.id = c.topic_id
order by thank_count desc;

-- top topics by votes
select id, title, votes
from topic
order by votes desc;

-- top topics by clicks
select id, title, clicks
from topic
order by clicks desc;

-- top nodes
select node, count(node) as count
from topic
group by node
order by count desc;

-- comment count per user
select commenter, count(commenter) as comment_count
from comment
group by commenter
order by comment_count desc;

-- topic count per user
select author, count(author) as topic_count
from topic
group by author
order by topic_count desc;

-- cumulative topic count by month
SELECT date,
       SUM(topic_count) OVER (ORDER BY date) AS cumulative_topic_count
FROM (SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS topic_count
      FROM topic
      GROUP BY date)
ORDER BY date;

-- cumulative user count by month
SELECT date,
       SUM(user_count) OVER (ORDER BY date) AS cumulative_user_count
FROM (SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS user_count
      FROM member
      GROUP BY date)
ORDER BY date;

-- cumulative comment count by month
SELECT date,
       SUM(comment_count) OVER (ORDER BY date) AS cumulative_comment_count
FROM (SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS comment_count
      FROM comment
      GROUP BY date)
ORDER BY date;

-- new topics per month
SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS topic_count
FROM topic
GROUP BY date;

-- new users per month
SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS user_count
FROM member
GROUP BY date;

-- new comments per month
SELECT strftime('%Y-%m', create_at, 'unixepoch') AS date, COUNT(*) AS comment_count
FROM comment
GROUP BY date;

-- tag usage count (tag is stored as a JSON array, so json_each expands it)
select t.value as tag, count(*) as count
from topic,
     json_each(tag) as t
group by t.value
order by count desc;

-- node usage count
select node, count(*) as count
from topic
group by node
order by count desc;
```
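Assuming the released database file is named `v2ex.sqlite`, any of these queries can be run with the standard sqlite3 CLI, e.g. `sqlite3 v2ex.sqlite "select count(*) from topic;"`.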

requirements.txt

Lines changed: 5 additions & 0 deletions
```
arrow==1.2.3
pandas==2.0.1
plotly==5.14.1
Scrapy==2.9.0
SQLAlchemy==2.0.17
```

scrapy.cfg

Lines changed: 2 additions & 2 deletions
```diff
@@ -4,8 +4,8 @@
 # https://scrapyd.readthedocs.io/en/latest/deploy.html

 [settings]
-default = tutorial_scrapy.settings
+default = v2ex_scrapy.settings

 [deploy]
 #url = http://localhost:6800/
-project = tutorial_scrapy
+project = v2ex_scrapy
```

tutorial_scrapy/spiders/V2exCommentSpider.py

Lines changed: 0 additions & 51 deletions
This file was deleted.

tutorial_scrapy/spiders/quotes_spider.py

Lines changed: 0 additions & 20 deletions
This file was deleted.

tutorial_scrapy/DB.py renamed to v2ex_scrapy/DB.py

Lines changed: 5 additions & 8 deletions
```diff
@@ -1,17 +1,14 @@
 import json
-import sqlite3
-from typing import List, Type, Union
+from typing import Type, Union

-from sqlalchemy import create_engine, exists, func, select, text
+from sqlalchemy import create_engine, text
 from sqlalchemy.orm import Session

-from tutorial_scrapy import utils
-from tutorial_scrapy.items import (
+from v2ex_scrapy.items import (
     Base,
     CommentItem,
     MemberItem,
     TopicItem,
-    TopicSupplementItem,
 )


@@ -25,7 +22,7 @@ def __new__(cls):

     def __init__(self):
         self.engine = create_engine(
-            "sqlite:///v2ex.sqlite",
+            "sqlite:///v2ex2.sqlite",
             echo=False,
             json_serializer=lambda x: json.dumps(x, ensure_ascii=False),
         )
@@ -53,6 +50,6 @@ def exist(
     def get_max_topic_id(self) -> int:
         result = self.session.execute(text("SELECT max(id) FROM topic")).fetchone()
-        if result is None:
+        if result is None or result[0] is None:
             return 1
         return int(result[0])
```
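The `get_max_topic_id` change matters because `SELECT max(id)` on an empty table returns one row containing NULL rather than no row at all, so the old `result is None` check never triggered; the added `result[0] is None` check makes an empty database correctly fall back to topic id 1.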
File renamed without changes.
