I regularly use four browsers, Edge, Firefox, Chrome, and Brave, to keep different work and personal contexts separate: Chrome for work, Firefox for casual browsing, Brave as a privacy browser, and so on. The annoying part is that each browser accumulates its own pile of bookmarks, many of them duplicates, and there is no single place to manage them all. So I decided to write a simple script that merges and organizes the bookmarks from all four browsers.
The script provides the following features:
- Parse bookmark files: extract bookmark entries from exported HTML files.
- Normalize URLs: put URLs into a canonical form so duplicates can be matched later.
- Deduplicate: remove duplicate bookmarks.
- Categorize: group bookmarks by domain.
- Save the result: write the organized bookmarks to a new HTML file.
First, the bookmark URLs need to be normalized, stripping unnecessary parts such as the trailing slash of the path (the code below also drops query strings and fragments), so that URLs can be compared accurately. Here is the normalize_url function:
```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # Rebuild the URL from scheme, host and path only, with the
    # trailing slash removed; params, query and fragment are dropped.
    parsed_url = urlparse(url)
    normalized_path = parsed_url.path.rstrip('/')
    normalized_url = urlunparse((parsed_url.scheme, parsed_url.netloc, normalized_path, '', '', ''))
    return normalized_url
```
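A quick check of what this does in practice (the example URLs are made up):

```python
# Both variants collapse to the same canonical string,
# so they will later deduplicate as one bookmark.
print(normalize_url('https://example.com/docs/'))         # https://example.com/docs
print(normalize_url('https://example.com/docs?ref=nav'))  # https://example.com/docs
```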
Next, I use the BeautifulSoup library to parse each bookmark HTML file and pull out the links together with their <a> tags. Here is the parse_bookmarks function:
```python
from bs4 import BeautifulSoup

def parse_bookmarks(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
    bookmarks = {}
    for a_tag in soup.find_all('a'):
        url = a_tag.get('href')
        if url:
            normalized_url = normalize_url(url)
            # Keyed by normalized URL: if two files bookmark the same
            # page, the tag from the later file wins.
            bookmarks[normalized_url] = a_tag
    return bookmarks, soup
```
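Browser exports all use the old Netscape bookmark format, where every bookmark is an <A> tag inside nested <DL>/<DT> lists. A minimal illustration of what the parser sees (the sample markup is made up):

```python
from bs4 import BeautifulSoup

sample = '''<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
    <DT><A HREF="https://example.com/" ADD_DATE="1700000000">Example</A>
</DL><p>'''

# html.parser lowercases tag and attribute names, so find_all('a')
# and get('href') match the uppercase markup as well.
soup = BeautifulSoup(sample, 'html.parser')
for a in soup.find_all('a'):
    print(a.get('href'), a.get_text())  # https://example.com/ Example
```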
A set is used to remove duplicate bookmarks:
```python
def remove_duplicates(bookmarks):
    # Identical <a> tags collapse into one set entry;
    # note that a set does not preserve the original order.
    return list(set(bookmarks.values()))
```
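If the original order matters, an order-preserving variant (my own alternative, not part of the original script) can key a dict on the serialized tag instead:

```python
def remove_duplicates_ordered(bookmarks):
    # dicts keep insertion order in Python 3.7+, so this deduplicates
    # while retaining the first occurrence of each tag.
    seen = {}
    for a_tag in bookmarks.values():
        seen.setdefault(str(a_tag), a_tag)
    return list(seen.values())
```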
To keep the bookmarks better organized, I categorize them by domain:
```python
from urllib.parse import urlparse

def categorize_by_domain(bookmarks):
    categories = {}
    for url, a_tag in bookmarks.items():
        parsed_url = urlparse(url)
        domain = parsed_url.netloc
        if domain not in categories:
            categories[domain] = []
        categories[domain].append(a_tag)
    return categories
```
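The same grouping can be written a bit more compactly with collections.defaultdict; a sketch of an equivalent version:

```python
from collections import defaultdict
from urllib.parse import urlparse

def categorize_by_domain(bookmarks):
    # defaultdict creates the empty list on first access to a domain.
    categories = defaultdict(list)
    for url, a_tag in bookmarks.items():
        categories[urlparse(url).netloc].append(a_tag)
    return dict(categories)
```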
Finally, the deduplicated and categorized bookmarks are saved as a new HTML file. The script offers two ways to save: a flat, uncategorized list, and one grouped by domain.
```python
def save_to_html(bookmarks, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        if head:
            # Carry over the <head> of the first input file, if any.
            file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')
        for a_tag in bookmarks:
            file.write(f'{a_tag}\n')
        file.write('</p></dl>\n')
        file.write('</body>\n')

def save_to_html_with_categories(categories, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        if head:
            file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')
        for category, a_tags in categories.items():
            # Each domain becomes an <h3> folder heading in the export.
            file.write(f'<dt><h3>{category}</h3></dt>\n')
            file.write('<dl><p>\n')
            for a_tag in a_tags:
                file.write(f'    {a_tag}\n')
            file.write('</p></dl>\n')
        file.write('</p></dl>\n')
        file.write('</body>\n')
```
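A minimal usage sketch of the two writers, assuming all_bookmarks and original_soup were produced by parse_bookmarks above (the output file names are made up):

```python
flat = remove_duplicates(all_bookmarks)
save_to_html(flat, original_soup, 'bookmarks_flat.html')

by_domain = categorize_by_domain(all_bookmarks)
save_to_html_with_categories(by_domain, original_soup, 'bookmarks_by_domain.html')
```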
During deduplication, conflicting bookmarks may turn up (the same URL mapped to more than one bookmark tag). These conflicts need to be detected and handled:
```python
import collections

def check_conflicts(bookmarks):
    # Count how often each normalized URL occurs. Note that because
    # bookmarks is a dict, every key occurs exactly once here; see the
    # sketch below for detecting conflicts across files.
    counter = collections.Counter(bookmarks.keys())
    conflicts = [url for url, count in counter.items() if count > 1]
    return conflicts

def resolve_conflicts(bookmarks, conflicts):
    resolved_bookmarks = {}
    conflicted_bookmarks = []
    for url, a_tag in bookmarks.items():
        if url in conflicts:
            conflicted_bookmarks.append(a_tag)
        else:
            resolved_bookmarks[url] = a_tag
    return resolved_bookmarks, conflicted_bookmarks
```
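One caveat: because parse_bookmarks keys its dict by normalized URL, a later file simply overwrites an earlier entry, so check_conflicts can never actually see a count above one. To genuinely surface cross-file conflicts, the tags would have to be accumulated per URL while merging. A minimal sketch of that idea (my own variant, not part of the original script):

```python
from collections import defaultdict

def merge_with_conflicts(files):
    # Map each normalized URL to every <a> tag seen for it across files.
    merged = defaultdict(list)
    first_soup = None
    for file_path in files:
        bookmarks, soup = parse_bookmarks(file_path)
        first_soup = first_soup or soup
        for url, a_tag in bookmarks.items():
            merged[url].append(a_tag)
    # A URL with more than one distinct tag is a real conflict.
    conflicts = {url: tags for url, tags in merged.items()
                 if len(set(str(t) for t in tags)) > 1}
    return merged, conflicts, first_soup
```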
Here is the main program. You enter the paths of several bookmark files at the prompt, and the script merges them, removes the duplicates, and saves the result:
```python
if __name__ == "__main__":
    input_files = []
    while True:
        file_path = input("Enter a bookmark file path (empty line to finish): ")
        if not file_path:
            break
        input_files.append(file_path)

    if not input_files:
        print("No bookmark file paths were given.")
        exit()

    all_bookmarks = {}
    original_soup = None
    for file_path in input_files:
        bookmarks, soup = parse_bookmarks(file_path)
        all_bookmarks.update(bookmarks)
        if original_soup is None:
            original_soup = soup

    conflicts = check_conflicts(all_bookmarks)
    resolved_bookmarks, conflicted_bookmarks = resolve_conflicts(all_bookmarks, conflicts)

    output_file = 'processed_bookmarks.html'
    unique_bookmarks = remove_duplicates(resolved_bookmarks)
    save_to_html(unique_bookmarks + conflicted_bookmarks, original_soup, output_file)
    print(f"Bookmarks saved to {output_file}")
```
Putting it all together, here is the complete script; this final version also lets you choose between the two output modes:

```python
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urlunparse
import collections

def normalize_url(url):
    parsed_url = urlparse(url)
    normalized_path = parsed_url.path.rstrip('/')
    return urlunparse((parsed_url.scheme, parsed_url.netloc, normalized_path, '', '', ''))

def parse_bookmarks(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
    bookmarks = {}
    for a_tag in soup.find_all('a'):
        url = a_tag.get('href')
        if url:
            bookmarks[normalize_url(url)] = a_tag
    return bookmarks, soup

def remove_duplicates(bookmarks):
    return list(set(bookmarks.values()))

def categorize_by_domain(bookmarks):
    categories = {}
    for url, a_tag in bookmarks.items():
        domain = urlparse(url).netloc
        if domain not in categories:
            categories[domain] = []
        categories[domain].append(a_tag)
    return categories

def save_to_html(bookmarks, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        if head:
            file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')
        for a_tag in bookmarks:
            file.write(f'{a_tag}\n')
        file.write('</p></dl>\n')
        file.write('</body>\n')

def save_to_html_with_categories(categories, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        if head:
            file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')
        for category, a_tags in categories.items():
            file.write(f'<dt><h3>{category}</h3></dt>\n')
            file.write('<dl><p>\n')
            for a_tag in a_tags:
                file.write(f'    {a_tag}\n')
            file.write('</p></dl>\n')
        file.write('</p></dl>\n')
        file.write('</body>\n')

def check_conflicts(bookmarks):
    counter = collections.Counter(bookmarks.keys())
    return [url for url, count in counter.items() if count > 1]

def resolve_conflicts(bookmarks, conflicts):
    resolved_bookmarks = {}
    conflicted_bookmarks = []
    for url, a_tag in bookmarks.items():
        if url in conflicts:
            conflicted_bookmarks.append(a_tag)
        else:
            resolved_bookmarks[url] = a_tag
    return resolved_bookmarks, conflicted_bookmarks

if __name__ == "__main__":
    input_files = []
    while True:
        file_path = input("Enter a bookmark file path (empty line to finish): ")
        if not file_path:
            break
        input_files.append(file_path)

    if not input_files:
        print("No bookmark file paths were given.")
        exit()

    all_bookmarks = {}
    original_soup = None
    for file_path in input_files:
        bookmarks, soup = parse_bookmarks(file_path)
        all_bookmarks.update(bookmarks)
        if original_soup is None:
            original_soup = soup

    conflicts = check_conflicts(all_bookmarks)
    resolved_bookmarks, conflicted_bookmarks = resolve_conflicts(all_bookmarks, conflicts)

    print("Choose a categorization option:")
    print("1: no categories")
    print("2: categorize by domain")
    choice = input("Enter an option (1/2): ")

    output_file = 'processed_bookmarks.html'
    if choice == '1':
        unique_bookmarks = remove_duplicates(resolved_bookmarks)
        save_to_html(unique_bookmarks + conflicted_bookmarks, original_soup, output_file)
    elif choice == '2':
        categories = categorize_by_domain(resolved_bookmarks)
        if conflicted_bookmarks:
            categories['Conflicted Bookmarks'] = conflicted_bookmarks
        save_to_html_with_categories(categories, original_soup, output_file)
    else:
        print("Invalid option")
        exit()
    print(f"Bookmarks saved to {output_file}")
```
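If you would rather pass the files as command-line arguments than type them at prompts, the entry point can be swapped for a small argv-based variant (my own sketch; the script name merge_bookmarks.py is made up):

```python
import sys

if __name__ == "__main__":
    # e.g. python merge_bookmarks.py edge.html firefox.html chrome.html brave.html
    input_files = sys.argv[1:]
    if not input_files:
        print("usage: python merge_bookmarks.py <bookmarks.html> [more.html ...]")
        sys.exit(1)

    all_bookmarks = {}
    original_soup = None
    for file_path in input_files:
        bookmarks, soup = parse_bookmarks(file_path)
        all_bookmarks.update(bookmarks)
        original_soup = original_soup or soup

    save_to_html(remove_duplicates(all_bookmarks), original_soup, 'processed_bookmarks.html')
```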
This article was originally published on the WeChat public account imBobby的自留地 under the title "Python Script: Automatic Organization and Deduplication of Browser Bookmarks".