Using ChatGPT to Assist with Processing HTML Data (Original)

admin

May 30, 2023, 23:35:40

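When using ChatGPT as a programming aid, start by describing your requirements clearly, then debug and refine the generated code step by step until the task is done. For programmers, having it is both a good thing and a bad thing.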

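I. Confirm the requirements

Please write a program that does the following:

1. All HTML files are stored in the all folder.

2. Process every HTML file in the all folder and extract the data from its table.

3. Extract the table header automatically, separating the column names with commas.

4. Write the processed output to a file named out plus a timestamp, so that files produced by different runs can be told apart.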
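As a small illustration of requirements 1 and 4, the sketch below lists the .html files in a folder and builds a timestamped output name. The helper names `list_html_files` and `timestamped_name` are made up for this example and do not appear in the generated script; the timestamp format is the one that script uses.

```python
import os
import tempfile
from datetime import datetime


def list_html_files(folder):
    """Return the paths of all .html files directly inside folder, sorted by name."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".html")
    )


def timestamped_name(prefix="out", when=None):
    """Build a name such as out_2023-05-30_233540.csv for the given moment."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y-%m-%d_%H%M%S')}.csv"


# Demonstrate on a throwaway folder rather than a real 'all' directory.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.html", "b.html", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in list_html_files(folder)]

print(found)
print(timestamped_name(when=datetime(2023, 5, 30, 23, 35, 40)))
```

Because the name has one-second resolution, two runs started within the same second would produce the same file name; that is probably acceptable for a manually started script.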

II. Programming with ChatGPT

The source code is as follows:

import os
import csv
from datetime import datetime
from bs4 import BeautifulSoup

def extract_data(file):
    # Parse one HTML file and collect the rows of its first table
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    table = soup.find('table')
    if table is None:  # skip files that contain no table
        return

    # Take the column names from the first file that has a table
    if not headers:
        ths = table.find_all('th')
        headers.extend([th.text.strip().replace('\n', '') for th in ths])

    # Skip the header row, then clean each cell
    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all(['th', 'td'])
        row = []
        for td in tds:
            text = td.text.strip().replace('\n', '').replace(',', ';')
            row.append(text)

        # Keep only rows that have one cell per column
        if len(row) == len(headers):
            rows.append(row)

def write_data(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []

for file in html_files:
    extract_data(file)

output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
write_data(output_file)
print(f'Data has been saved to {output_file}')

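ChatGPT's explanation:

Specifically, we use the BeautifulSoup library to parse the HTML files and extract the data by locating the <table> tag and the <th> and <td> tags inside it. A headers list stores the table header, and a rows list stores each row of data.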
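To make the tag walk concrete without installing anything, here is a dependency-free sketch that swaps BeautifulSoup for the standard library's `html.parser`. It is not the generated script's method: the `TableExtractor` class and the sample HTML are invented for illustration, and the sketch assumes well-formed markup with no nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the cell text of the first <table> in a document, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row being read, or None
        self._cell = None       # text fragments of the cell being read, or None
        self._tables_seen = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
        elif self._tables_seen == 1 and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("th", "td"):
            self._cell = []

    def handle_data(self, data):
        # Text only matters while a cell is open
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip().replace("\n", ""))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


sample = """
<table>
  <tr><th>Host</th><th>Port</th></tr>
  <tr><td>10.0.0.1</td><td>22</td></tr>
  <tr><td>10.0.0.2</td><td>443</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample)
headers, data_rows = parser.rows[0], parser.rows[1:]
print(headers)
print(data_rows)
```

For real-world pages, BeautifulSoup is likely the better choice because it tolerates broken markup that this small state machine does not.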

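When processing each HTML file, we first look for the table header and add it to the headers list. We then iterate over each row of the table and add its data to the rows list. For each row, we check whether its length equals that of the headers list; if it does not, the row is malformed and we do not save it, which keeps bad rows out of the result. Finally, we save all the data to a single CSV file named out plus the current time, with a .csv extension, so that files processed at different times can be told apart.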
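The cleaning and length check described here can be tried in isolation. The sketch below runs on hand-written sample rows instead of parsed HTML; `clean_cell` and `keep_valid_rows` are hypothetical helper names, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def clean_cell(text):
    """Trim whitespace, drop line breaks and swap commas for semicolons."""
    return text.strip().replace("\n", "").replace(",", ";")


def keep_valid_rows(headers, raw_rows):
    """Clean every cell and keep only rows with exactly one cell per header."""
    cleaned = ([clean_cell(cell) for cell in row] for row in raw_rows)
    return [row for row in cleaned if len(row) == len(headers)]


headers = ["Name", "Role"]
raw_rows = [
    ["Alice ", "admin, owner"],   # comma inside a cell becomes ';'
    ["Bob"],                      # too short: dropped
    ["Carol", "guest\n"],
]

buffer = io.StringIO()            # stands in for the output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(headers)
writer.writerows(keep_valid_rows(headers, raw_rows))
print(buffer.getvalue())
```

Note that `csv.writer` already quotes cells that contain commas, so the comma replacement is a stylistic choice rather than a necessity. Silently dropping short rows can also hide tables that use merged cells, so logging the rejected rows may be worth considering.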

With that, I believe this code will meet your needs. If you have any questions, feel free to contact me.

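III. Further optimization

1. Improve efficiency

2. Show a progress bar

3. Save data promptly so that nothing is lost if something unexpected happens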
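One possible way to combine points 1 and 3 is to let worker threads hand rows to a small writer that appends to a single CSV file in batches, with a lock protecting the shared buffer. This is only a sketch under assumptions: `BatchedCsvWriter`, `fake_extract` and `work` are invented names, the "files" are simulated, and the progress bar is left out so the example needs nothing beyond the standard library.

```python
import csv
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class BatchedCsvWriter:
    """Appends rows to one CSV file in batches; safe to call from several threads."""

    def __init__(self, path, headers, batch_size=1000):
        self.path = path
        self.batch_size = batch_size
        self._pending = []
        self._lock = threading.Lock()
        with open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerow(headers)

    def add(self, row):
        with self._lock:
            self._pending.append(row)
            if len(self._pending) >= self.batch_size:
                self._flush()

    def close(self):
        with self._lock:
            self._flush()

    def _flush(self):
        # Caller must hold the lock.
        if not self._pending:
            return
        with open(self.path, "a", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(self._pending)
        self._pending.clear()


def fake_extract(n):
    """Stand-in for parsing one HTML file: returns three rows per 'file'."""
    return [[f"file{n}", str(i)] for i in range(3)]


out_path = os.path.join(tempfile.mkdtemp(), "out.csv")
sink = BatchedCsvWriter(out_path, ["source", "index"], batch_size=4)


def work(n):
    for row in fake_extract(n):
        sink.add(row)


with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(work, range(5)))   # consume the iterator so errors surface
sink.close()

with open(out_path, encoding="utf-8", newline="") as f:
    saved = list(csv.reader(f))
print(len(saved) - 1, "rows saved")
```

Appending to one file avoids a separate merge step, at the cost of rows from different source files being interleaved in the output. Since parsing in pure Python is largely CPU-bound, threads may give only a modest speed-up; that is a general property of CPython rather than something measured here.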

The final code is as follows:

import os
import csv
import threading
from datetime import datetime
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def extract_data(file):
    with open(file, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    table = soup.find('table')
    if table is None:  # skip files that contain no table
        return

    # Parse outside the lock; only the shared lists below need protecting
    file_headers = [th.text.strip().replace('\n', '') for th in table.find_all('th')]
    file_rows = []
    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all(['th', 'td'])
        file_rows.append([td.text.strip().replace('\n', '').replace(',', ';') for td in tds])

    with lock:
        if not headers:
            headers.extend(file_headers)
        for row in file_rows:
            if len(row) == len(headers):
                rows.append(row)
                # Save the current data after every 1000 rows
                if len(rows) % 1000 == 0:
                    output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}_{len(files)}.csv"
                    write_data(output_file)
                    files.append(output_file)
                    rows.clear()

# Merge all temporary CSV files into a single file
def combine_files(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)

        for file in files:
            with open(file, 'r', encoding='utf-8', newline='') as f2:
                reader = csv.reader(f2)
                next(reader)  # skip the header row
                for row in reader:
                    writer.writerow(row)
            os.remove(file)  # delete the temporary file

    print('All files have been combined into one CSV file.')

def write_data(output_file):
    with open(output_file, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)

file_dir = 'all'
html_files = [os.path.join(file_dir, f) for f in os.listdir(file_dir) if f.endswith('.html')]
headers = []
rows = []
files = []
lock = threading.Lock()  # guards headers, rows and files across worker threads

with ThreadPoolExecutor(max_workers=4) as executor:
    # Wrap the map in tqdm to show overall progress across all files
    for _ in tqdm(executor.map(extract_data, html_files), total=len(html_files), desc='Extracting data'):
        pass

# Save any remaining rows to a CSV file
if rows:
    output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}_{len(files)}.csv"
    write_data(output_file)
    files.append(output_file)

# Merge all CSV files into one
output_file = f"out_{datetime.now().strftime('%Y-%m-%d_%H%M%S')}.csv"
combine_files(output_file)
print(f'Data has been saved to {output_file}')

Execution result: (screenshot)

Originally published on the WeChat official account 小兵搞安全: Using ChatGPT to Assist with Processing HTML Data (Original)
