GitHub - xiaoxiaosuaxuan/newscrapy: newscrapy

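Project structure

After opening the top-level folder, these are the directories and files that matter:

newscrapy\ : the Scrapy project directory
    ---newscrapy\spiders\: the spiders written for this project
results\ : the directory where crawl results are written
test.py : a script for running a spider

Before running a spider, install the required libraries, for example with pip install scrapy pymongo. If other dependencies turn out to be missing, install them yourself.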

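A simple test

Once the required libraries are installed, run test.py. If it succeeds, you will see output in the terminal, and the crawled content appears in the corresponding result file under the results\ directory.

In essence, test.py invokes the following on the command line: scrapy crawl <name> -a start=<start date> -a end=<end date>. You can therefore change name, start, and end in that file to control what the spider does.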
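
As a rough illustration of the command above, the sketch below builds the same scrapy crawl invocation from three values and runs it with subprocess. It is not the project's actual test.py: the helper names build_crawl_command and run_spider are hypothetical, and the date format the spiders expect is not stated here, so any sample dates are an assumption.

```python
import subprocess


def build_crawl_command(name, start, end):
    """Return the argv list for: scrapy crawl <name> -a start=... -a end=..."""
    return ["scrapy", "crawl", name, "-a", f"start={start}", "-a", f"end={end}"]


def run_spider(name, start, end):
    """Run one spider; call this from the directory that contains scrapy.cfg."""
    # check=True raises CalledProcessError if scrapy exits with an error.
    subprocess.run(build_crawl_command(name, start, end), check=True)
```

Calling run_spider("workerdaily", "20210101", "20210103") would then start the workerdaily spider; the two date strings are placeholders in an assumed format.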

spiders

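newscrapy\spiders is the spider directory; the spiders written for the different newspapers live there. It already contains several finished spiders, for example baotoudaily.py, the spider for Baotou Daily (《包头日报》).

Each spider file defines a spider class that inherits from CrawlSpider. This class specifies the crawl rules, namely:

- which links to crawl
- what content to extract from the page behind each link

Worked example

The spider defined in workerdaily.py illustrates how to write a spider class: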

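import re
from urllib import parse

from scrapy import FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Assumed location: Scrapy's default items module for this project.
from newscrapy.items import NewscrapyItem
# dateGen is a project helper that produces date strings; import it from
# wherever the project defines it.


class mySpider(CrawlSpider):
    name = "workerdaily"
    newspapers = "工人日报"  # Workers' Daily
    allowed_domains = ['www.workercn.cn']

    def start_requests(self):
        # Seed the crawl with page 1 of the issue for every generated date.
        dates = dateGen(self.start, self.end, "%Y/%m/%d")
        template = "https://www.workercn.cn/papers/grrb/{date}/1/page.html"
        for d in dates:
            # Without formdata, FormRequest issues a plain GET.
            yield FormRequest(template.format(date=d))

    rules = (
        # Layout pages: only follow their links, nothing to parse.
        Rule(LinkExtractor(allow=r'grrb/\d+/\d+/\d+/\d+/page\.html', restrict_xpaths="//*[@id='pageTitle']")),
        # Article pages: pass the response to parse_item.
        Rule(LinkExtractor(allow=r'grrb/\d+/\d+/\d+/\d+/news-\d+\.html'), callback="parse_item"),
    )

    def parse_item(self, response):
        try:
            title1 = response.xpath("//*[@id='pretitle']").xpath("string(.)").get()
            title2 = response.xpath("//*[@id='ctitle']").xpath("string(.)").get()
            title = title1 + ' ' + title2
            content = response.xpath("//*[@id='ccontent']").xpath('string(.)').get()
            url = response.url
            # Turn the "YYYY/MM/DD" path segment into "YYYY-MM-DD".
            date = re.search(r'grrb/(\d+/\d+/\d+)/', url).group(1)
            date = '-'.join([date[0:4], date[5:7], date[8:10]])
            imgs = response.xpath("//*[@id='imgs']//img/@src").getall()
            # Resolve relative image paths against the page URL.
            imgs = [parse.urljoin(url, imgurl) for imgurl in imgs]
            html = response.text
        except Exception:
            # Skip pages where extraction fails, e.g. a missing title element
            # makes the string concatenation above raise.
            return

        item = NewscrapyItem()
        item['title'] = title
        item['content'] = content
        item['date'] = date
        item['imgs'] = imgs
        item['url'] = response.url
        item['newspaper'] = self.newspapers
        item['html'] = html
        yield item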

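The class is named mySpider and inherits from CrawlSpider. For consistency, every spider class is named mySpider.
Attributes:
- name: the spider's name (not to be confused with the class name); it must be unique. It is the name used in the command scrapy crawl <name> -a start=<start date> -a end=<end date>. For consistency, a spider's name should match the name of the Python file that contains it.
- newspapers: the Chinese name of the newspaper this spider covers.
- allowed_domains: the domains the spider is allowed to crawl.
start_requests: this method defines the spider's seeds, that is, the first URLs it requests. In this example it requests page 1 of the issue for each date produced by dateGen.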
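
dateGen itself is not shown in this README. As a sketch of what such a helper might look like, the function below yields one formatted string per day from start to end. The name date_gen, the YYYYMMDD input format, and the inclusive end date are all assumptions made for illustration, not the project's actual implementation.

```python
from datetime import datetime, timedelta


def date_gen(start, end, fmt, input_fmt="%Y%m%d"):
    """Yield each day from start to end (inclusive), formatted with fmt.

    Hypothetical stand-in for the project's dateGen; input_fmt is assumed.
    """
    day = datetime.strptime(start, input_fmt)
    last = datetime.strptime(end, input_fmt)
    while day <= last:
        yield day.strftime(fmt)
        day += timedelta(days=1)


# Build seed URLs the same way start_requests does.
template = "https://www.workercn.cn/papers/grrb/{date}/1/page.html"
seed_urls = [template.format(date=d)
             for d in date_gen("20210130", "20210201", "%Y/%m/%d")]
```

With these sample dates, seed_urls holds three URLs, one per day, crossing the month boundary from January into February.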
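rules: specifies which links on a crawled page are followed, and what is done with each kind of link once it has been fetched. Every rule here exists to extract links, so each has the form Rule(LinkExtractor(...)). The parameters mean:
- allow: a regular expression that a link must match to be extracted.
- callback: a callback that defines what to do with the page fetched from a matching URL. Concretely, Scrapy wraps the fetched page in a response object and passes it to this function as an argument.
- restrict_xpaths: restricts link extraction to one part of the page; optional.
Two rules are defined here, which is the usual case for daily-newspaper spiders. Why two? Because not every page has content to extract. Pages matched by the first rule are the newspaper's layout pages: they carry no article text, so we only extract their links and need neither parsing nor a callback. Pages matched by the second rule are articles whose content must be parsed, so that rule needs a callback.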
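
To see how the two allow patterns divide the links, the sketch below applies them with re.search, so a match anywhere in the URL is enough, which is also how LinkExtractor tests allow patterns. The patterns are the ones from the rules, with the dots escaped; the function classify and the sample URLs are made up for illustration and only follow the shape of the seed template.

```python
import re

# Patterns taken from the two rules, with the literal dots escaped.
PAGE_RE = re.compile(r'grrb/\d+/\d+/\d+/\d+/page\.html')
NEWS_RE = re.compile(r'grrb/\d+/\d+/\d+/\d+/news-\d+\.html')


def classify(url):
    """Say which rule, if any, would pick up this URL."""
    if NEWS_RE.search(url):
        return "article"   # second rule: parsed by the callback
    if PAGE_RE.search(url):
        return "layout"    # first rule: links followed, no callback
    return "ignored"       # matches neither rule


# Hypothetical sample URLs.
base = "https://www.workercn.cn/papers/grrb/2021/01/05/2/"
kinds = [classify(base + "page.html"),
         classify(base + "news-3.html"),
         classify("https://www.workercn.cn/index.html")]
```

Here kinds comes out as layout, article, ignored, in that order.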
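parse_item: the callback used to parse a page. Its response parameter is the page Scrapy fetched; you can think of it as the root node of the HTML document. Inside this function we use XPath expressions to pull out what we need from the HTML, such as the title and the body text.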
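
Two steps in parse_item are plain string handling and can be tried without Scrapy: pulling the issue date out of the URL, and turning relative image paths into absolute URLs. The sketch below isolates them using only the standard library. The function names and the sample URL are hypothetical, and the date is assembled from regex groups rather than by slicing, which gives the same result for YYYY/MM/DD paths.

```python
import re
from urllib.parse import urljoin


def issue_date(url):
    """Return the issue date found in the URL as YYYY-MM-DD, or None."""
    m = re.search(r'grrb/(\d+)/(\d+)/(\d+)/', url)
    return '-'.join(m.groups()) if m else None


def absolute_images(page_url, srcs):
    """Resolve each image src against the URL of the page it came from."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical article URL and image paths.
page = "https://www.workercn.cn/papers/grrb/2021/01/05/1/news-2.html"
date = issue_date(page)
imgs = absolute_images(page, ["images/a.jpg", "/static/b.png"])
```

A path without a leading slash is resolved relative to the article's directory, while a path starting with a slash is resolved from the site root.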

