Crawler Example 1: Scraping a News List and Publication Times

admin, July 24, 2021, 05:29:15

1. Create the project

scrapy startproject shop

2. Code for items.py:

import scrapy


class ShopItem(scrapy.Item):
    # One Field per value the spider collects.
    title = scrapy.Field()  # news headlines
    time = scrapy.Field()   # publication times

3. Spider code in shopspider.py

# -*- coding: utf-8 -*-
import scrapy

from shop.items import ShopItem


class shopSpider(scrapy.Spider):
    name = "shop"
    allowed_domains = ["news.xxxxxxx.xx.cn"]
    start_urls = ["http://news.xxxxx.xxx.cn/hunan/"]

    def parse(self, response):
        item = ShopItem()
        # Every headline in the news list, as a list of strings.
        item['title'] = response.xpath("//div[@class='txttotwe2']/ul/li/a/text()").extract()
        # The publication times from the same list, in page order.
        item['time'] = response.xpath("//div[@class='txttotwe2']/ul/li/font/text()").extract()
        # A single item carries both lists.
        yield item
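To see what the two XPath expressions in parse() select, the sketch below runs the same paths against a small, made-up HTML fragment. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selector (an assumption, made so the snippet runs without Scrapy installed), and the sample markup, the headlines, and the extract_news helper are all hypothetical. Unlike the spider above, it walks each <li> in turn so that a title stays paired with its own time.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating the list the spider targets; the real page may differ.
SAMPLE_HTML = """
<html><body>
  <div class="txttotwe2">
    <ul>
      <li><a href="/a.html">First headline</a><font>(2017-06-16)</font></li>
      <li><a href="/b.html">Second headline</a><font>(2017-06-15)</font></li>
    </ul>
  </div>
</body></html>
"""


def extract_news(html):
    """Return one {'title', 'time'} dict per <li> in the news list."""
    root = ET.fromstring(html)
    rows = []
    # Same path as the spider's XPath, stopping at <li> so the fields stay paired.
    for li in root.findall(".//div[@class='txttotwe2']/ul/li"):
        link = li.find("a")
        stamp = li.find("font")
        rows.append({
            "title": link.text if link is not None else None,
            "time": stamp.text if stamp is not None else None,
        })
    return rows


news = extract_news(SAMPLE_HTML)
for row in news:
    print(row["title"], row["time"])
```

In a Scrapy spider, the same per-entry idea would probably be a loop over response.xpath("//div[@class='txttotwe2']/ul/li") that yields one item per <li>, rather than one item holding two parallel lists.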

4. Code for pipelines.py (prints the scraped content):

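Note: if the content is printed from shopspider.py, it appears as Unicode escape sequences, whereas the same information printed from pipelines.py displays as normal text. Under Python 2 this is most likely because the spider holds whole lists, and printing a list shows the escaped repr of each string, while the pipeline prints the strings one at a time.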
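One likely explanation for this difference is how a list is printed, rather than which file does the printing. The following is a minimal sketch, assuming Python 3: ascii() is used only to imitate the escaped form that Python 2 shows when a whole list of unicode strings is printed, and the sample titles are made up.

```python
# Hypothetical sample titles standing in for the scraped list.
titles = ["建成示范城市", "考试开始报名"]

# Printing a whole list shows each element's repr. Python 2 escaped
# non-ASCII characters in that repr; ascii() imitates the effect on Python 3.
escaped = ascii(titles)
print(escaped)

# Printing the strings one at a time, as the pipeline does, shows readable text.
readable = [str(t) for t in titles]
for t in readable:
    print(t)
```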

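class ShopPipeline(object):
    def process_item(self, item, spider):
        count = len(item['title'])
        print('news count: %d' % count)
        # Pair each headline with its time; zip stops at the shorter list,
        # so a missing time cannot raise an IndexError.
        for title, time in zip(item['title'], item['time']):
            print('biaoti: ' + title)
            print('shijian: ' + time)
        return item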
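Scrapy only calls a pipeline that is enabled in the project's settings.py, a step the article does not show. Assuming the default layout created by scrapy startproject shop, the entry would look roughly like this; the priority value 300 is a conventional example value, not something taken from the article.

```python
# settings.py (fragment)
# Map the pipeline's import path to a priority; lower numbers run first.
ITEM_PIPELINES = {
    "shop.pipelines.ShopPipeline": 300,
}
```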

5. Crawl output:

root@kali:~/shop# scrapy crawl shop --nolog

news count: 40

biaoti: xxx建成国家食品安全示范城市

shijian: (2017-06-16)

biaoti: xxxx考试开始报名

……
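The output shows each time wrapped in parentheses, for example (2017-06-16). If a real date value is wanted, a small helper along these lines could strip the parentheses and parse the rest. parse_time is a hypothetical name, and the YYYY-MM-DD format is assumed from the single sample shown in the output.

```python
from datetime import datetime


def parse_time(raw):
    """Convert a string such as '(2017-06-16)' into a datetime.date."""
    cleaned = raw.strip().strip("()")
    return datetime.strptime(cleaned, "%Y-%m-%d").date()


print(parse_time("(2017-06-16)"))
```

A value that does not match the assumed format raises ValueError, which makes a changed page layout easy to notice.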

This article was first published on the WeChat official account (飓风网络安全): Crawler Example 1: Scraping a News List and Publication Times

