A Small Tool for Counting the Words in an EPUB E-book and Converting It to a TXT File

Posted on 2025-06-02, updated on 2025-06-03, filed under Computers

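I have accumulated quite a few EPUB e-books downloaded from Z-library. Sometimes I want to know how many words they contain, and I also wanted a simple full-text search so that I don't have to import a book into a reader app just to look something up. With the help of AI, I put this tool together.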

1. Background: The EPUB Format

Here is the definition given by Baidu Baike:

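ePub (short for Electronic Publication) is a free and open standard for "reflowable" content: the text can be laid out in whatever way best suits the reading device. Internally, an EPUB file uses XHTML or DTBook (an XML standard proposed by the DAISY Consortium) to represent the text, and packages the content in ZIP format.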

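In essence, an EPUB file is a ZIP archive. Unpacking it with any ZIP tool reveals roughly the following internal structure (the left image is the preview of the e-book of 《悉达多》 that I downloaded from Z-library; the right image is the unpacked content). It contains the META-INF and OEBPS folders (some e-books may have other folders as well), and under them a set of XML or HTML files that record the book's table-of-contents information and the actual chapter content.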

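Among the unpacked files, META-INF/container.xml is the entry point of the EPUB. Its format is roughly as follows, and it records the path of the root file, that is, the file that lists the book's contents (here, OEBPS/content.opf).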
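To make the role of container.xml concrete, here is a minimal sketch that reads the root file path out of it. The XML below is a hand-written sample of a typical container.xml, not the actual file from the book, and the helper name `find_rootfile_path` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A typical container.xml, written by hand for illustration (an assumption,
# not copied from any particular book).
SAMPLE_CONTAINER_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Namespace used by container.xml; the prefix "c" is our own choice.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

def find_rootfile_path(container_xml):
    """Return the full-path attribute of the first <rootfile> element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find(".//c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]

print(find_rootfile_path(SAMPLE_CONTAINER_XML))  # OEBPS/content.opf
```

The returned path is relative to the root of the ZIP archive, which is why it can be passed straight to the archive's open call.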

We can also take a look at that root file:

This is the book's content listing. It references a set of HTML files, and each HTML file typically corresponds to one chapter of the book.
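As a rough illustration of what such a root file looks like and how the chapter references can be pulled out of it, here is a small sketch. The OPF content is a hand-written, heavily shortened sample (the ids and the `Styles/style.css` entry are invented), and `list_xhtml_hrefs` is a hypothetical helper name. It reads the `<manifest>` in document order; the `<spine>` element is what defines the reading order in the EPUB format, so manifest order is only an approximation of it.

```python
import xml.etree.ElementTree as ET

# A shortened, hand-written OPF sample (an assumption for illustration only).
SAMPLE_OPF = b"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="cover" href="Text/cover_page.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="Text/part0000.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="Styles/style.css" media-type="text/css"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="cover"/>
    <itemref idref="ch1"/>
  </spine>
</package>
"""

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def list_xhtml_hrefs(opf_xml):
    """Return the href of every manifest item whose media type is XHTML."""
    root = ET.fromstring(opf_xml)
    return [
        item.attrib["href"]
        for item in root.findall(".//opf:item", OPF_NS)
        if item.attrib.get("media-type") == "application/xhtml+xml"
    ]

print(list_xhtml_hrefs(SAMPLE_OPF))
```

Note that the hrefs are relative to the folder that contains the OPF file, not to the root of the archive.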

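There are plenty of introductions to HTML online, so I won't repeat them here. Note that the HTML packaged in an EPUB file does not support the full HTML standard (many dynamic effects, for example, are unavailable). It supports only basic text markup, such as headings (<h1>, <h2>, and so on), paragraphs (<p>), references and links (<a>), images (<img>), and basic CSS style sheets, which lets publishers add a variety of rich-text effects to a book.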

2. How It Works, and the Code

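Based on the description of the EPUB format in the previous section, we have a rough idea of how to parse an EPUB file: open the archive through a ZIP interface and read META-INF/container.xml; use it to locate the root file (the .opf file); then follow the content files listed in the root file and parse each of the book's HTML files. That yields the text of the entire EPUB.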
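One step in this chain that is easy to get wrong is turning an href from the root file into a member name inside the ZIP archive. The sketch below shows one possible way to do it, assuming that hrefs are relative to the root file's folder, that they may be percent-encoded, and that ZIP member names use forward slashes. The helper `resolve_href` and the tiny in-memory archive are made up for illustration; this is not how the tool below does it.

```python
import io
import posixpath
import zipfile
from urllib.parse import unquote

def resolve_href(opf_path, href):
    """Turn a manifest href into a ZIP member name relative to the OPF's folder."""
    base = posixpath.dirname(opf_path)  # "" when the OPF sits at the archive root
    return posixpath.normpath(posixpath.join(base, unquote(href)))

# Build a tiny EPUB-like archive in memory (hypothetical content).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("OEBPS/content.opf", "<package/>")
    zf.writestr("OEBPS/Text/chapter 1.xhtml", "<html><body><p>Hello</p></body></html>")

# Read a chapter back through the resolved member name.
with zipfile.ZipFile(buf) as epub:
    member = resolve_href("OEBPS/content.opf", "Text/chapter%201.xhtml")
    chapter_bytes = epub.read(member)

print(member)
print(chapter_bytes.decode("utf-8"))
```

Using `posixpath` rather than `os.path` keeps the separator as `/` on every platform, which matches how member names are stored in a ZIP archive.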

We have three goals:

Count the words in an e-book
Full-text search
Convert EPUB to txt

All three are easy to implement once the EPUB file has been parsed along the lines above.

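In Python, the following tools are available to us:

zipfile: reads ZIP archives; it gives us access to the individual files inside an EPUB
xml: reads and parses XML files; it lets us parse container.xml and the root file
BeautifulSoup (bs4): an HTML parsing tool; it is more forgiving than xml and is used to process the XHTML file of each chapter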

Here is the code. See the comments for how each feature is implemented:

import os,sys,zipfile,re,json
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

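## Count words and characters in mixed Chinese/English text.
# Each Chinese character counts as one word. English words are runs of ASCII letters
# delimited by \b. Chinese characters are word characters for \b, so a Latin run that
# directly touches a Chinese character (no space in between) is not counted.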
def count_words_and_chars(text):
    chinese_char_pattern = re.compile(r'[\u4e00-\u9fff]')
    english_word_pattern = re.compile(r'\b[a-zA-Z]+\b')
    chinese_word_count = len(chinese_char_pattern.findall(text)) # number of Chinese characters
    english_word_count = len(english_word_pattern.findall(text)) # number of English words
    char_count = len(text) # total characters, including punctuation and whitespace
    total_word_count = chinese_word_count + english_word_count # total word count
    return total_word_count, char_count, chinese_word_count, english_word_count

## Stream the text out of an EPUB file, count words, and write it to a file (if convert_to_txt is true).
def extract_epub_text(epub_path,convert_to_txt=False,text_to_find=None):
    try:
        text_find_result_dt = {} # dictionary holding the text search results
        if(convert_to_txt):      # if a txt file is requested, create it here and stream into it later
            output_file_path = epub_path+".txt"
            output = open(output_file_path,'w',encoding="utf-8")
        with zipfile.ZipFile(epub_path, 'r') as epub: # open the EPUB; everything below works on this archive object
            # locate and parse container.xml
            container_file = "META-INF/container.xml"
            with epub.open(container_file) as container:
                tree = ET.parse(container)
                root = tree.getroot()
                # get the path of the root file
                rootfile_path = root.find(".//{urn:oasis:names:tc:opendocument:xmlns:container}rootfile").attrib['full-path']
            # open the root file (usually an .opf file)
            print(f"rootfile:\t{rootfile_path}")
            with epub.open(rootfile_path) as rootfile:
                tree = ET.parse(rootfile)
                root = tree.getroot()
                # find the references to all content files
                items = root.findall(".//{http://www.idpf.org/2007/opf}item")
                text_files = [item.attrib['href'] for item in items if item.attrib['media-type'] == 'application/xhtml+xml']
            total_word_count = total_char_count = total_zh_word_count = total_en_word_count = 0 # running totals, initialised to 0
            for text_file in text_files: # read each referenced content file in turn and extract its text
                print(f"File:\t{text_file}")
                try:
                    dirname = os.path.dirname(rootfile_path)
                    if(dirname==""): text_fpath = text_file
                    else: # the root file is in a subfolder: join with whichever separator the href uses
                        if("\\" in text_file): text_fpath = dirname+"\\"+text_file
                        else:                  text_fpath = dirname+"/"+text_file
                    with epub.open(text_fpath) as tf: # open the content file and parse it
                        txt1 = tf.read()
                        bs4obj = BeautifulSoup(txt1,'lxml')      # parse with BeautifulSoup, which copes with imperfect markup better than xml
                        text   = bs4obj.get_text().strip() # get all the text on the page
                        if(convert_to_txt): # if txt export was requested, write the text out here
                            output.write(text)
                            output.write("\n")
                        if(text_to_find is not None): # check whether the search string occurs in this chapter
                            if(text_to_find in text):
                                text_find_result_dt[text_file] = text
                        word_count, char_count, chinese_word_count, english_word_count = count_words_and_chars(text)
                        print(f"\tword_count:\t{word_count}")
                        print(f"\tchar_count:\t{char_count}")
                        print(f"\tchinese_word_count:\t{chinese_word_count}")
                        print(f"\tenglish_word_count:\t{english_word_count}")
                        total_word_count += word_count
                        total_char_count += char_count
                        total_zh_word_count += chinese_word_count
                        total_en_word_count += english_word_count
                except Exception as e:
                    print(f"Warning: Failed to parse `{text_file}` - message:{e}", file=sys.stderr)
            if(convert_to_txt):
                output.close()
                print(f"Text file is saved as `{output_file_path}`.")
            if(text_to_find is not None):
                print("\n\n===> Text finding result: <===")
                print(text_find_result_dt.keys())
                json_text = json.dumps(text_find_result_dt,ensure_ascii=False,indent="\t")
                with open("text_finding_result.json",'w',encoding="utf-8") as jsonfile:
                    jsonfile.write(json_text)
                print("See more details in `text_finding_result.json`")
            return [total_word_count,total_char_count,total_zh_word_count,total_en_word_count]
    except Exception as e:
        print(f"Error: Unable to process EPUB file - {e}", file=sys.stderr)
        return [0,0,0,0]

def main():
    if len(sys.argv) < 2:
        print("Usage: python epub_convert_tool.py <epub_file> [convert_to_txt] [text_to_find]")
        print("\t<epub_file>        Input file path")
        print("\t[convert_to_txt]   0: not convert, 1: convert to a txt file in the same name")
        print("\t[text_to_find]     Text you want to find")
        sys.exit(1)

    epub_file = sys.argv[1]
    convert_to_txt = 0
    if(len(sys.argv)>2):
        if(sys.argv[2]=="1"):
            convert_to_txt = 1
    if(len(sys.argv)==4):
        text_to_find = sys.argv[3]
    else:
        text_to_find = None

    if not os.path.isfile(epub_file) or not epub_file.lower().endswith('.epub'):
        print("Error: Please provide a valid EPUB file.")
        sys.exit(1)

    wc,cc,zh_wc,en_wc = extract_epub_text(epub_file,convert_to_txt,text_to_find)
    print(f"\n\n>>> total count of `{epub_file}` <<<\n")
    print(f"word_count:\t{wc}")
    print(f"char_count:\t{cc}")
    print(f"chinese_word_count:\t{zh_wc}")
    print(f"english_word_count:\t{en_wc}")
    
if __name__ == "__main__":
    main()
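One detail of `count_words_and_chars` is worth a closer look. In Python 3, `\b` treats Chinese characters as word characters, so an English word written directly next to Chinese text (with no space) has no word boundary around it and is not matched. The sketch below compares the tool's pattern with a variant that adds the `re.ASCII` flag, under which only ASCII letters, digits and underscore count as word characters. This is a possible alternative, not the author's implementation, and the names `ENGLISH_WORD_UNICODE`, `ENGLISH_WORD_ASCII` and `count_english_words` are invented for the example.

```python
import re

# Same pattern as in the tool: \b uses Unicode word characters by default.
ENGLISH_WORD_UNICODE = re.compile(r"\b[a-zA-Z]+\b")
# Variant: with re.ASCII, \b only considers [a-zA-Z0-9_] to be word characters,
# so a Chinese character next to a Latin letter forms a boundary.
ENGLISH_WORD_ASCII = re.compile(r"\b[a-zA-Z]+\b", re.ASCII)

def count_english_words(text, pattern):
    """Count the non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))

sample = "我用Python写了一个EPUB工具 and it works"

# "Python" and "EPUB" touch Chinese characters, so only "and", "it", "works" match.
print(count_english_words(sample, ENGLISH_WORD_UNICODE))  # 3
# With re.ASCII all five Latin runs are counted.
print(count_english_words(sample, ENGLISH_WORD_ASCII))    # 5
```

Which behaviour is preferable depends on the books being counted; switching the flag would change the English word totals for texts that mix the two scripts without spaces.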

3. Running the Tool

This is a command-line application, so running it with no arguments prints the following help text.

Usage: python epub_convert_tool.py <epub_file> [convert_to_txt] [text_to_find]
        <epub_file>        Input file path
        [convert_to_txt]   0: not convert, 1: convert to a txt file in the same name
        [text_to_find]     Text you want to find

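It accepts three arguments:

epub_file is the path of the input EPUB file
convert_to_txt is a number that says whether to produce a txt file: 0 means no, 1 means yes. This argument is optional and defaults to no conversion
text_to_find is the string to search for. This argument is optional and defaults to none; because the arguments are positional, convert_to_txt must be given before text_to_find can be used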
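Because the arguments are positional, searching without converting still requires passing `0` for convert_to_txt. If that ever becomes inconvenient, one possible alternative is to parse the command line with `argparse` and named options. The sketch below is only an illustration of that idea: the option names `--txt` and `--find` are invented here and are not part of the tool.

```python
import argparse

def build_parser():
    """Build a parser with one positional argument and two optional flags."""
    parser = argparse.ArgumentParser(
        description="Count words in an EPUB, optionally export txt and search text."
    )
    parser.add_argument("epub_file", help="input file path")
    # Hypothetical flag replacing the positional 0/1 convert_to_txt argument.
    parser.add_argument("--txt", action="store_true",
                        help="also write a txt file next to the EPUB")
    # Hypothetical option replacing the positional text_to_find argument.
    parser.add_argument("--find", metavar="TEXT", default=None,
                        help="text to search for")
    return parser

# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(["book.epub", "--find", "下雪了"])
print(args.epub_file, args.txt, args.find)
```

With this layout the two options can be given in any order, or omitted independently of each other.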

The following demonstrates it on the book 《李娟阿勒泰系列典藏合集》:

Counting words only:

~$ python epub_file_convert_tool.py 李娟阿勒泰系列典藏合集.epub
rootfile:       OEBPS/content.opf
File:   Text/cover_page.xhtml
        word_count:     11
        char_count:     11
        chinese_word_count:     11
        english_word_count:     0
File:   Text/part0000.xhtml
        word_count:     25
        char_count:     25
        chinese_word_count:     25
        english_word_count:     0
File:   Text/part0001.xhtml
        word_count:     1
        char_count:     5
        chinese_word_count:     0
        english_word_count:     1

... ...

File:   Text/part0142.xhtml
        word_count:     2976
        char_count:     3458
        chinese_word_count:     2976
        english_word_count:     0
File:   Text/part0143.xhtml
        word_count:     1912
        char_count:     2234
        chinese_word_count:     1912
        english_word_count:     0


>>> total count of `李娟阿勒泰系列典藏合集.epub` <<<

word_count:     468396
char_count:     537405
chinese_word_count:     468380
english_word_count:     16

To export a txt file and search for text at the same time (using the phrase “下雪了” as an example):

~$ python epub_file_convert_tool.py 李娟阿勒泰系列典藏合集.epub  1  下雪了
rootfile:       OEBPS/content.opf
File:   Text/cover_page.xhtml
        word_count:     11
        char_count:     11
        chinese_word_count:     11
        english_word_count:     0
File:   Text/part0000.xhtml
        word_count:     25
        char_count:     25
        chinese_word_count:     25
        english_word_count:     0
File:   Text/part0001.xhtml
        word_count:     1
        char_count:     5
        chinese_word_count:     0
        english_word_count:     1

... ...

File:   Text/part0142.xhtml
        word_count:     2976
        char_count:     3458
        chinese_word_count:     2976
        english_word_count:     0
File:   Text/part0143.xhtml
        word_count:     1912
        char_count:     2234
        chinese_word_count:     1912
        english_word_count:     0
Text file is saved as `李娟阿勒泰系列典藏合集.epub.txt`.


===> Text finding result: <===
dict_keys(['Text/part0018.xhtml', 'Text/part0038.xhtml', 'Text/part0059.xhtml', 'Text/part0064.xhtml', 'Text/part0079.xhtml', 'Text/part0096.xhtml', 'Text/part0097.xhtml', 'Text/part0119.xhtml'])
See more details in `text_finding_result.json`


>>> total count of `李娟阿勒泰系列典藏合集.epub` <<<

word_count:     468396
char_count:     537405
chinese_word_count:     468380
english_word_count:     16

The exported txt file is 李娟阿勒泰系列典藏合集.epub.txt; opened, its content looks like this:

In addition, the search found several chapter files containing “下雪了”, including part0018.xhtml and part0038.xhtml. The detailed results are written to text_finding_result.json; let's open it and look at its content:

As you can see, these chapters include 《住在山野》, 《九篇雪》, and 《唯一的水》, which we can use to narrow down the passage we are looking for.

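One more thing: combined with a shell script, this tool can extract the text of EPUB files in batches, which can then be used for training large models or for building a knowledge base for them.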
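The batch idea can also be sketched in Python instead of shell. The two helpers below, `find_epubs` and `batch_process`, are hypothetical names; the sketch only collects EPUB files under a folder and applies a caller-supplied function to each, skipping books that fail.

```python
from pathlib import Path

def find_epubs(root):
    """Return all .epub files under root (searched recursively), sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".epub"
    )

def batch_process(root, process):
    """Apply process(path_string) to every EPUB under root; collect results by path."""
    results = {}
    for path in find_epubs(root):
        try:
            results[str(path)] = process(str(path))
        except Exception as exc:  # keep going when a single book fails
            print(f"Skipped {path}: {exc}")
    return results
```

For example, assuming the tool's `extract_epub_text` is importable, something like `batch_process("books", lambda p: extract_epub_text(p, convert_to_txt=True))` would write one txt file per book; the folder name `books` is just a placeholder.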

That's all.