Python Script: Automatically Organizing and Deduplicating Browser Bookmarks

admin · 2024-05-21 20:42

I regularly use four browsers, Edge, Firefox, Chrome, and Brave, to keep different parts of work and life separate: Chrome for work, Firefox for slacking off, Brave as the privacy browser, and so on. The annoying part is that each browser has accumulated its own pile of bookmarks, many of them duplicates, and there is no central place to manage them. So I decided to write a simple script that merges and tidies up the bookmarks from all four browsers.

Script Features

The script does the following:

  • Parse bookmark files: extract bookmark entries from each browser's exported HTML file.
  • Normalize URLs: put URLs into a canonical form so they can be compared for deduplication.
  • Deduplicate: remove duplicate bookmarks.
  • Categorize: group bookmarks by domain.
  • Save the result: write the cleaned-up bookmarks to a new HTML file.

URL Normalization

First, I normalize each bookmark's URL, stripping unnecessary parts such as a trailing slash on the path, so that URLs can be compared accurately. Here is the normalize_url function:

from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    parsed_url = urlparse(url)
    normalized_path = parsed_url.path.rstrip('/')
    normalized_url = urlunparse((parsed_url.scheme, parsed_url.netloc, normalized_path, '', '', ''))
    return normalized_url
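A quick sanity check of what normalization does (the URLs below are made-up examples). Note that because the URL is rebuilt with empty params, query, and fragment fields, those parts are dropped too, not just the trailing slash:

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    parsed_url = urlparse(url)
    normalized_path = parsed_url.path.rstrip('/')
    # Rebuild with empty params/query/fragment, so those parts are dropped too
    return urlunparse((parsed_url.scheme, parsed_url.netloc, normalized_path, '', '', ''))

# Both variants collapse to the same canonical key:
print(normalize_url('https://example.com/docs/'))                  # https://example.com/docs
print(normalize_url('https://example.com/docs?utm_source=x#top'))  # https://example.com/docs
```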
Parsing Bookmark Files

I use the BeautifulSoup library to parse each bookmark HTML file and extract the links and their corresponding <a> tags. Here is the parse_bookmarks function:

from bs4 import BeautifulSoup

def parse_bookmarks(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')

    bookmarks = {}
    for a_tag in soup.find_all('a'):
        url = a_tag.get('href')
        if url:
            normalized_url = normalize_url(url)
            bookmarks[normalized_url] = a_tag

    return bookmarks, soup
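If you would rather avoid the bs4 dependency, the standard library's html.parser can do the same extraction. This is a minimal sketch of that alternative, not the script's actual code:

```python
from html.parser import HTMLParser

class BookmarkExtractor(HTMLParser):
    """Collects (href, text) pairs from <a> tags; stdlib-only alternative to BeautifulSoup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, even for <A HREF=...>
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

parser = BookmarkExtractor()
parser.feed('<DL><DT><A HREF="https://example.com/">Example</A></DT></DL>')
print(parser.links)  # [('https://example.com/', 'Example')]
```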
Deduplication

A set is used to drop duplicate bookmarks:

def remove_duplicates(bookmarks):
    return list(set(bookmarks.values()))
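One caveat: set() does not preserve insertion order. If you want the output to keep the order in which bookmarks were first seen, dict.fromkeys is an order-preserving alternative (a small variant of my own, not what the script above uses):

```python
def remove_duplicates_ordered(items):
    # dicts preserve insertion order in Python 3.7+, unlike sets
    return list(dict.fromkeys(items))

print(remove_duplicates_ordered(['a', 'b', 'a', 'c']))  # ['a', 'b', 'c']
```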
Categorization

To keep the bookmarks better organized, I group them by domain:

from urllib.parse import urlparse

def categorize_by_domain(bookmarks):
    categories = {}
    for url, a_tag in bookmarks.items():
        parsed_url = urlparse(url)
        domain = parsed_url.netloc
        if domain not in categories:
            categories[domain] = []
        categories[domain].append(a_tag)
    return categories
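The same grouping can be written a little more compactly with collections.defaultdict; the demo below uses plain title strings in place of the real <a> tags:

```python
from collections import defaultdict
from urllib.parse import urlparse

def categorize_by_domain(bookmarks):
    # defaultdict removes the "if domain not in categories" boilerplate
    categories = defaultdict(list)
    for url, a_tag in bookmarks.items():
        categories[urlparse(url).netloc].append(a_tag)
    return dict(categories)

demo = {
    'https://docs.python.org/3': 'Python docs',
    'https://peps.python.org/pep-0008': 'PEP 8',
    'https://example.com/blog': 'A blog',
}
print(categorize_by_domain(demo))
# {'docs.python.org': ['Python docs'], 'peps.python.org': ['PEP 8'], 'example.com': ['A blog']}
```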
Saving the Results

Finally, the deduplicated and categorized bookmarks are written to a new HTML file. The script supports two output modes: a flat list, or entries grouped by domain.

def save_to_html(bookmarks, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')

        for a_tag in bookmarks:
            file.write(f'{str(a_tag)}\n')

        file.write('</p></dl>\n')
        file.write('</body>\n')

def save_to_html_with_categories(categories, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')

        for category, a_tags in categories.items():
            file.write(f'<dt><h3>{category}</h3></dt>\n')
            file.write('<dl><p>\n')
            for a_tag in a_tags:
                file.write(f'    {str(a_tag)}\n')
            file.write('</p></dl>\n')

        file.write('</p></dl>\n')
        file.write('</body>\n')
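For reference, the skeleton both save functions emit is the old Netscape bookmark format, which browser import dialogs still accept. A stripped-down sketch of just that skeleton, with a hard-coded example link and no BeautifulSoup involved:

```python
import os
import tempfile

def write_minimal_bookmarks(path, links):
    """links: iterable of (url, title) pairs; writes the Netscape bookmark skeleton."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        f.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        f.write('<TITLE>Bookmarks</TITLE>\n')
        f.write('<DL><p>\n')
        for url, title in links:
            f.write(f'    <DT><A HREF="{url}">{title}</A>\n')
        f.write('</DL><p>\n')

path = os.path.join(tempfile.gettempdir(), 'mini_bookmarks.html')
write_minimal_bookmarks(path, [('https://example.com', 'Example')])
print(open(path, encoding='utf-8').read())
```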
Handling Conflicts

Deduplication can run into conflicting bookmarks (the same URL mapped to more than one <a> tag). These need to be detected and handled:

import collections

def check_conflicts(bookmarks):
    counter = collections.Counter(bookmarks.keys())
    # URLs that appear more than once are treated as conflicts.
    # NOTE: dict keys are unique, so over a single merged dict this list
    # stays empty; duplicates would have to be tracked before the merge.
    conflicts = [url for url, count in counter.items() if count > 1]
    return conflicts


def resolve_conflicts(bookmarks, conflicts):
    resolved_bookmarks = {}
    conflicted_bookmarks = []
    for url, a_tag in bookmarks.items():
        if url in conflicts:
            conflicted_bookmarks.append(a_tag)
        else:
            resolved_bookmarks[url] = a_tag
    return resolved_bookmarks, conflicted_bookmarks
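Worth noting: once all bookmarks have been merged into one dict, its keys are unique by construction, so a Counter over them never exceeds 1 and check_conflicts returns an empty list. To actually catch the case where two browsers bookmark the same URL under different titles, the conflict has to be detected during the merge. A sketch of that idea, using plain title strings instead of <a> tags (merge_with_conflicts is my own helper name, not part of the script above):

```python
def merge_with_conflicts(per_file_bookmarks):
    """per_file_bookmarks: one {normalized_url: title} dict per browser export.
    Returns (merged, conflicts); conflicts maps a URL to its competing titles."""
    merged = {}
    conflicts = {}
    for bookmarks in per_file_bookmarks:
        for url, title in bookmarks.items():
            if url in merged and merged[url] != title:
                # Same URL, different title: keep the first title, record both
                conflicts.setdefault(url, {merged[url]}).add(title)
            else:
                merged[url] = title
    return merged, conflicts

merged, conflicts = merge_with_conflicts([
    {'https://example.com': 'Example'},
    {'https://example.com': 'Example (work)'},
])
print(sorted(conflicts['https://example.com']))  # ['Example', 'Example (work)']
```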
Main Program

Here is the main program. You enter one or more bookmark file paths, and the script merges them, removes duplicates, and saves the result:

if __name__ == "__main__":
    input_files = []
    while True:
        file_path = input("Enter a bookmark file path (empty to finish): ")
        if not file_path:
            break
        input_files.append(file_path)

    if not input_files:
        print("No bookmark file paths were entered.")
        exit()

    all_bookmarks = {}
    original_soup = None

    for file_path in input_files:
        bookmarks, soup = parse_bookmarks(file_path)
        all_bookmarks.update(bookmarks)
        if original_soup is None:
            original_soup = soup

    conflicts = check_conflicts(all_bookmarks)
    resolved_bookmarks, conflicted_bookmarks = resolve_conflicts(all_bookmarks, conflicts)

    output_file = 'processed_bookmarks.html'

    unique_bookmarks = remove_duplicates(resolved_bookmarks)
    save_to_html(unique_bookmarks + conflicted_bookmarks, original_soup, output_file)

    print(f"Bookmarks saved to {output_file}")
The complete script:

from bs4 import BeautifulSoup
from urllib.parse import urlparse, urlunparse
import collections


def normalize_url(url):
    parsed_url = urlparse(url)
    normalized_path = parsed_url.path.rstrip('/')
    normalized_url = urlunparse((parsed_url.scheme, parsed_url.netloc, normalized_path, '', '', ''))
    return normalized_url


def parse_bookmarks(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')

    bookmarks = {}
    for a_tag in soup.find_all('a'):
        url = a_tag.get('href')
        if url:
            normalized_url = normalize_url(url)
            bookmarks[normalized_url] = a_tag

    return bookmarks, soup


def remove_duplicates(bookmarks):
    return list(set(bookmarks.values()))


def categorize_by_domain(bookmarks):
    categories = {}
    for url, a_tag in bookmarks.items():
        parsed_url = urlparse(url)
        domain = parsed_url.netloc
        if domain not in categories:
            categories[domain] = []
        categories[domain].append(a_tag)
    return categories


def save_to_html(bookmarks, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')

        for a_tag in bookmarks:
            file.write(f'{str(a_tag)}\n')

        file.write('</p></dl>\n')
        file.write('</body>\n')


def save_to_html_with_categories(categories, original_soup, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        head = original_soup.find('head')
        file.write('<!DOCTYPE NETSCAPE-Bookmark-file-1>\n')
        file.write('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\n')
        file.write('<TITLE>Bookmarks</TITLE>\n')
        file.write(str(head))
        file.write('<body>\n')
        file.write('<dl><p>\n')

        for category, a_tags in categories.items():
            file.write(f'<dt><h3>{category}</h3></dt>\n')
            file.write('<dl><p>\n')
            for a_tag in a_tags:
                file.write(f'    {str(a_tag)}\n')
            file.write('</p></dl>\n')

        file.write('</p></dl>\n')
        file.write('</body>\n')


def check_conflicts(bookmarks):
    counter = collections.Counter(bookmarks.keys())
    # URLs that appear more than once are treated as conflicts
    conflicts = [url for url, count in counter.items() if count > 1]
    return conflicts


def resolve_conflicts(bookmarks, conflicts):
    resolved_bookmarks = {}
    conflicted_bookmarks = []
    for url, a_tag in bookmarks.items():
        if url in conflicts:
            conflicted_bookmarks.append(a_tag)
        else:
            resolved_bookmarks[url] = a_tag
    return resolved_bookmarks, conflicted_bookmarks


if __name__ == "__main__":
    input_files = []
    while True:
        file_path = input("Enter a bookmark file path (empty to finish): ")
        if not file_path:
            break
        input_files.append(file_path)

    if not input_files:
        print("No bookmark file paths were entered.")
        exit()

    all_bookmarks = {}
    original_soup = None

    for file_path in input_files:
        bookmarks, soup = parse_bookmarks(file_path)
        all_bookmarks.update(bookmarks)
        if original_soup is None:
            original_soup = soup

    conflicts = check_conflicts(all_bookmarks)
    resolved_bookmarks, conflicted_bookmarks = resolve_conflicts(all_bookmarks, conflicts)

    print("Choose a grouping option:")
    print("1: flat list")
    print("2: group by domain")
    choice = input("Enter an option (1/2): ")

    output_file = 'processed_bookmarks.html'

    if choice == '1':
        unique_bookmarks = remove_duplicates(resolved_bookmarks)
        save_to_html(unique_bookmarks + conflicted_bookmarks, original_soup, output_file)
    elif choice == '2':
        categories = categorize_by_domain(resolved_bookmarks)
        if conflicted_bookmarks:
            categories['Conflicted Bookmarks'] = conflicted_bookmarks
        save_to_html_with_categories(categories, original_soup, output_file)
    else:
        print("Invalid option")

    print(f"Bookmarks saved to {output_file}")
The resulting HTML file: (screenshot omitted)

Originally published on the WeChat public account imBobby的自留地: Python脚本:浏览器书签的自动整理与去重
