Building Your Own URL Extractor with Python


Environment

Python3

requests

parsel

threading

queue

argparse

sys

Strictly speaking, only the first three libraries are needed; the rest are just there to make the script a bit more polished.

General approach

  1. Pick a search keyword, then analyze the search URL and its pagination.

  2. Fetch one result page and extract the URLs we want from it.

  3. Extend the extraction to multiple pages.

  4. Add multi-threading and command-line parameters.

1. Analyzing the URL

01. Controlling the search keyword

Take SecIn as the example keyword:


Looking at the search URL, you will see the keyword SecIn in it.

The &wd parameter carries the search keyword; a quick test confirms that this is indeed the case.

Strip the redundant parameters and keep only the useful one:
https://www.baidu.com/s?&wd=SecIn

02. Controlling pagination

Flip through several pages and watch which parameter changes back and forth.

Conclusion: pn = page number * 10 - 10.

Concatenated URL: https://www.baidu.com/s?&wd=SecIn&pn=60

The maximum number of result pages is only 76.
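To make the pattern concrete, here is a minimal sketch that builds the paged search URL from a keyword and page number (the helper name `build_search_url` is just for illustration and does not appear in the final script):

```python
# Minimal sketch: build the Baidu search URL for a given keyword and page.
# pn = page * 10 - 10, so page 7 maps to pn=60.
def build_search_url(keyword, page):
    return 'https://www.baidu.com/s?&wd=%s&pn=%d' % (keyword, page * 10 - 10)

print(build_search_url('SecIn', 7))  # https://www.baidu.com/s?&wd=SecIn&pn=60
```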

2. Crawling a single page

```python
import requests
import parsel
import pprint

url = 'https://www.baidu.com/s?&wd=SecIn&pn=60'
baidu_header={
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Cookie': 'BD_UPN=12314753; BAIDUID=B9C663242EBFE646684C271051BFEA9E:FG=1; PSTM=1585744388; BIDUPSID=94FB3558C0BDAA06C330FDC42A453409; H_WISE_SIDS=139912_143932_143381_142018_144883_145118_141744_144419_144134_144472_144483_136861_144490_131246_144682_137749_138883_140259_141941_127969_144171_140066_144341_140593_142421_144607_144727_143922_144485_131423_100806_142206_107316_144306_143478_144966_142426_144534_143667_144333_144238_143853_142273_110085; MSA_WH=1141_666; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=RLMWhkcS1aZWRSVUtHY05wbnFja2xqZmJjZ0s4VjRwM3JjRlQzQTNmOUZTTjllRVFBQUFBJCQAAAAAAAAAAAEAAABhYeIxTElYTkhPTkcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEW7t15Fu7ded1; yjs_js_security_passport=d9a54629350d171585f4bd079e15e2b0d527848e_1589101657_js; delPer=0; BD_CK_SAM=1; PSINO=1; COOKIE_SESSION=11125_0_7_2_7_20_0_3_4_3_2_4_0_0_6_0_1589110468_0_1589121587%7C9%2388918_49_1589018226%7C9; ZD_ENTRY=baidu; BDRCVFR[C0p6oIjvx-c]=mbxnW11j9Dfmh7GuZR8mvqV; H_PS_PSSID=1463_31325_21110_31594_30841_31464_31322_30823_31164_22157; H_PS_645EC=ecec2k70KZZ5j0c3Qg7w5uVxJu8RsA1acZ7WKo%2Bq%2BO7eYcEW2gDIzOg8B0Y'
,'Host': 'www.baidu.com',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
header={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

response = requests.get(url, headers=baidu_header, timeout=5)
# print(url)
response.encoding = response.apparent_encoding
sel = parsel.Selector(response.text)
links2 = sel.xpath('//*[@data-click]')
links1 = links2.xpath('./@href').getall()
pprint.pprint(links1)

```

The pprint output:


Here we only want the entries that contain link?url=; everything else is noise. Loop over the list and filter:

```python
for i in links1:
    if 'link?url=' in i:
        print(i)
```

Result:


These are now the URLs we are after.

However, they are Baidu redirect URLs, not the real site URLs, so we still need to request each one and let the redirect lead us to the real URL.

On top of the filter above, add one more request: if the link is reachable, the final response URL is the real one.

```python
real_url = requests.get(i, headers=header, timeout=6)
if real_url.status_code == 200:
    print(real_url.url)
```

With that, a simple single-page URL extractor is done.
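As an optional variant that is not part of the original code, a HEAD request with `allow_redirects=True` can resolve the redirect without downloading the full page body (some servers reject HEAD, so treat this only as a sketch; `resolve_real_url` is a hypothetical helper name):

```python
import requests

# Sketch: resolve the Baidu redirect with HEAD instead of GET.
# `headers` is the UA-only header dict defined above.
def resolve_real_url(baidu_link, headers, timeout=6):
    try:
        resp = requests.head(baidu_link, headers=headers,
                             allow_redirects=True, timeout=timeout)
        if resp.status_code == 200:
            return resp.url  # final URL after following the redirect
    except requests.RequestException:
        pass
    return None
```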


01. Crawling multiple pages

Just iterate pn in steps of 10 up to a maximum of 760. Some of the result sites are hosted abroad and the request will time out, so wrap it in a try/except to skip errors.

```python
for i in range(0, 760, 10):
    url = 'https://www.baidu.com/s?&wd=SecIn&pn=%s' % i
    print(url)
    try:
        response = requests.get(url, headers=baidu_header, timeout=5)
        response.encoding = response.apparent_encoding
        sel = parsel.Selector(response.text)
        links2 = sel.xpath('//*[@data-click]')
        links1 = links2.xpath('./@href').getall()
        # pprint.pprint(links1)
        for link in links1:
            if 'link?url=' in link:
                # print(link)
                real_url = requests.get(link, headers=header, timeout=6)
                if real_url.status_code == 200:
                    print(real_url.url)
    except Exception:
        # Skip pages that time out or fail to parse.
        pass
```

Multi-threaded crawling

The thread subclass here is standard boilerplate. The idea: when the script runs, check whether arguments were supplied. If the wd and pn parameters are given, call main() with them, build the paged URLs and push them onto a queue, let the worker threads run the spider method, and write the results to a file; otherwise print the help menu and prompt for input.
Here is the complete code:

```python
import requests
import threading
import sys
import argparse
import parsel
from queue import Queue

baidu_header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'BD_UPN=12314753; BAIDUID=B9C663242EBFE646684C271051BFEA9E:FG=1; PSTM=1585744388; BIDUPSID=94FB3558C0BDAA06C330FDC42A453409; H_WISE_SIDS=139912_143932_143381_142018_144883_145118_141744_144419_144134_144472_144483_136861_144490_131246_144682_137749_138883_140259_141941_127969_144171_140066_144341_140593_142421_144607_144727_143922_144485_131423_100806_142206_107316_144306_143478_144966_142426_144534_143667_144333_144238_143853_142273_110085; MSA_WH=1141_666; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDUSS=RLMWhkcS1aZWRSVUtHY05wbnFja2xqZmJjZ0s4VjRwM3JjRlQzQTNmOUZTTjllRVFBQUFBJCQAAAAAAAAAAAEAAABhYeIxTElYTkhPTkcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEW7t15Fu7ded1; yjs_js_security_passport=d9a54629350d171585f4bd079e15e2b0d527848e_1589101657_js; delPer=0; BD_CK_SAM=1; PSINO=1; COOKIE_SESSION=11125_0_7_2_7_20_0_3_4_3_2_4_0_0_6_0_1589110468_0_1589121587%7C9%2388918_49_1589018226%7C9; ZD_ENTRY=baidu; BDRCVFR[C0p6oIjvx-c]=mbxnW11j9Dfmh7GuZR8mvqV; H_PS_PSSID=1463_31325_21110_31594_30841_31464_31322_30823_31164_22157; H_PS_645EC=ecec2k70KZZ5j0c3Qg7w5uVxJu8RsA1acZ7WKo%2Bq%2BO7eYcEW2gDIzOg8B0Y',
    'Host': 'www.baidu.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}


class ThreadRun(threading.Thread):
    def __init__(self, q):
        super(ThreadRun, self).__init__()
        self.q = q

    def run(self):
        # Keep pulling search-page URLs off the queue until it is empty.
        while not self.q.empty():
            url = self.q.get()
            try:
                self.spider(url)
            except Exception:
                pass

    def spider(self, url):
        response = requests.get(url, headers=baidu_header, timeout=5)
        response.encoding = response.apparent_encoding
        sel = parsel.Selector(response.text)

        links2 = sel.xpath('//*[@data-click]')
        links1 = links2.xpath('./@href').getall()
        for link in links1:
            if 'link?url=' in link:
                # Follow the Baidu redirect to get the real URL.
                real_url = requests.get(link, headers=header, timeout=6)
                if real_url.status_code == 200:
                    print(real_url.url)
                    with open('url.txt', mode='a') as f1:
                        f1.write(real_url.url + '\n')


def main(wd, pn):
    q = Queue()
    thread = []
    threadnum = 50
    pn = int(pn) * 10 - 10
    for i in range(0, int(pn) + 10, 10):
        q.put('https://www.baidu.com/s?wd=%s&pn=%s' % (wd, str(i)))

    for i in range(threadnum):
        thread.append(ThreadRun(q))
    for t in thread:
        t.start()
    for t in thread:
        t.join()


if __name__ == '__main__':
    # Truncate url.txt so each run starts with a clean result file.
    f = open('url.txt', mode='w')
    f.close()
    parser = argparse.ArgumentParser(description='----Sec-in----')
    parser.add_argument('-W', '--keyword', help='search context!', metavar='')
    parser.add_argument('-P', '--page', metavar='', type=int, help='page')
    args = parser.parse_args()
    if len(sys.argv) < 2:
        print('-----Please enter: search context and page-----')
        parser.print_help()
        sys.exit(-1)
    else:
        main(args.keyword, args.page)
```
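If the script is saved as, say, `baidu_url_spider.py` (the filename is arbitrary), a run looks like `python baidu_url_spider.py -W SecIn -P 10`: it queues 10 pages of results for the keyword SecIn, lets the worker threads resolve the redirect links, and appends the real URLs to url.txt.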

Thanks for reading QAQ
