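[Repost + Modified] An Offline Translation Program for the PC

Posted on 2023-11-03, updated on 2024-01-14, filed under Computers

An offline translation program built on PyTorch, transformers, and other foundational libraries.

References: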

https://www.cnblogs.com/weskynet/p/16740041.html

https://huggingface.co/Helsinki-NLP

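0. Introduction

Translating text is something we run into constantly in research work.
Online translation tools, including Google Translate, Baidu Translate, and large language models such as ChatGPT, all handle translation well.
When the network connection is poor, however, an offline translation program becomes a natural need.

Many translation applications, Youdao Translate among them, offer offline translation plug-ins. Still, in the spirit of "do it yourself and you will want for nothing", we are going to build a translation program from open-source models.

Hugging Face hosts a family of natural-language translation models developed at the University of Helsinki (Helsinki-NLP). They cover 1,440 language pairs, including Chinese-to-English and English-to-Chinese, and we can build an offline translation tool on top of them.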
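
The two models used later in this post are named Helsinki-NLP/opus-mt-zh-en and Helsinki-NLP/opus-mt-en-zh, which suggests a simple naming pattern. The sketch below assumes that other language pairs follow the same `opus-mt-<source>-<target>` pattern. The helper name `opus_mt_repo` is made up for illustration, and not every pair it can produce is guaranteed to exist on the Hub.

```python
def opus_mt_repo(src, tgt):
    """Build a Hub repository id, assuming the opus-mt-<src>-<tgt> naming pattern."""
    return "Helsinki-NLP/opus-mt-{}-{}".format(src, tgt)


# The two pairs used in this post
print(opus_mt_repo("zh", "en"))
print(opus_mt_repo("en", "zh"))
```

It is worth checking the Helsinki-NLP page linked above before relying on a generated name.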

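1. Environment Setup

Python 3 must be installed beforehand, along with the following Python module:

transformers

(transformers is the Hugging Face library named after the Transformer, a machine learning architecture proposed by a Google team. It uses the attention mechanism to encode and decode sequences and is widely used in natural language processing. The technology underlying ChatGPT and GPT-4 is also the Transformer. The Helsinki-NLP models additionally need a backend such as PyTorch, and their tokenizers need the sentencepiece package.)

Installing the module with conda is recommended:

conda install -c conda-forge transformers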
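
To confirm that the environment is ready, a short check like the one below can help. It is only a sketch: the function name `missing_modules` is made up here, and listing `torch` and `sentencepiece` next to `transformers` is an assumption based on what the Helsinki-NLP models normally need, not something the install command above guarantees.

```python
import importlib.util


def missing_modules(names):
    """Return the modules from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirements: the library itself, a PyTorch backend, and the tokenizer dependency
required = ["transformers", "torch", "sentencepiece"]
print("Missing modules:", missing_modules(required) or "none")
```

Any module reported as missing can be installed into the same conda environment before moving on.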

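2. Code and How It Works

The code below is adapted from the article 《手把手搭建基于Hugging Face模型的离线翻译系统，并通过C#代码进行访问》 ("Building an offline translation system on Hugging Face models step by step, and accessing it from C# code"; the first reference above). That article uses the Helsinki-NLP models to set up a Windows server and embeds it in a .NET application. We do not need anything that elaborate; a console program is enough.

The principle is simple. The Transformer model itself is complex, but its implementation details are already wrapped up in the transformers module. We load pretrained translation models and call the corresponding interface functions to use them.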
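
As a minimal sketch of "load a pretrained model, then call its interface", the function below wraps a transformers `pipeline`. It assumes the library and a PyTorch backend are installed and that the model can be downloaded or is already cached. The name `quick_translate` is made up for illustration and is not part of the program in this post.

```python
def quick_translate(text, model_name="Helsinki-NLP/opus-mt-en-zh"):
    """Translate `text` with a pretrained model loaded through a transformers pipeline."""
    # Imported lazily so that defining this function does not require the library
    from transformers import pipeline

    translator = pipeline("translation", model=model_name)
    # The pipeline returns a list with one dict per input text
    return translator(text)[0]["translation_text"]


# Example call (downloads the model on first use):
# print(quick_translate("Machine translation can run offline."))
```

Building the pipeline on every call is slow, which is one reason the full program loads both models once at start-up.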

Our code follows. Copy it into a text file and name the file HelsinkiTranslation.py. Section 3 explains how to use it.

#!/usr/bin/env python
# coding=utf-8
import os
import sys
import warnings

# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for translation models
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer

warnings.filterwarnings('ignore')
print("+------------+")
print("|  pyNLP-MT  |")
print("+------------+")
print('Offline translation program')

# Load the required models. Here we pick two: Chinese-to-English and English-to-Chinese.
# The code below downloads the models from the Hugging Face website and stores them
# in the `$HOME/.cache/huggingface/hub/` folder.
try:
    print('Loading model: Chinese -> English ...')
    model_zh2en = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-zh-en')
    token_zh2en = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-zh-en')
    trans_zh2en = pipeline('translation_zh_to_en',
                           model=model_zh2en,
                           tokenizer=token_zh2en)
    print('Loading model: English -> Chinese ...')
    model_en2zh = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
    token_en2zh = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
    trans_en2zh = pipeline('translation_en_to_zh',
                           model=model_en2zh,
                           tokenizer=token_en2zh)
except Exception as err:
    # Report the cause instead of hiding it, then stop
    print('An exception occurred when loading model: {}'.format(err))
    sys.exit(1)

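# Decide whether a sentence is Chinese or English from the share of ASCII characters in the string
def isChinese(text):
    n = sum(1 for c in text if ord(c) < 128)  # count ASCII characters (code points 0-127)
    # More than 50% ASCII characters: treat the text as English
    return n <= 0.5 * len(text)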

# Translation function: takes the string to translate and the translation mode,
# and returns the translated string.
# The `mod` parameter is the translation mode and defaults to automatic detection.
# Only Chinese-to-English and English-to-Chinese are supported.
MAX_LENGTH = 512  # cap on generated tokens; these Marian models are configured for 512 positions
def translate(text, mod="auto"):  # valid values of mod: "auto" (default), "zh2en", "en2zh"
    if mod == 'auto':
        mod = 'zh2en' if isChinese(text) else 'en2zh'
    if mod == 'zh2en':
        translator = trans_zh2en
    elif mod == 'en2zh':
        translator = trans_en2zh
    else:
        # Fail loudly instead of silently returning None
        raise ValueError("Unavailable mod: {}".format(mod))
    return translator(text, max_length=MAX_LENGTH)[0]['translation_text']

# Function for switching the translation mode
Mod = "auto"
def setmod(mod):
    global Mod
    available_mod = ["auto","zh2en","en2zh"]
    if(mod in available_mod):
        Mod = mod
    else:
        print("Unavailable mod: {}\n".format(mod))

# "Help" function
def help():
    help_text="""
All available commands:
    /?      print this information
    /help   print this information
    /exit   quit program
    /quit   quit program
    /mod [mode]
            set translation mode. 
            Available value: 
                "auto"(default): Automatically select translation mode.
                "zh2en": Chinese translate into English.
                "en2zh": English translate into Chinese.
    /clear  clean screen

If you want to translate any sentence, just type it after prompt.
    """
    print(help_text)

# Parse and execute a command
def command(cmd):
    parts = cmd.split()
    name, args = parts[0], parts[1:]
    if name in ("/exit", "/quit"):
        sys.exit(0)
    elif name in ("/help", "/?"):
        help()
    elif name == "/mod":
        # `/mod` without an argument passes an empty mode, which setmod rejects
        setmod(args[0] if args else "")
    elif name == "/clear":
        os.system("cls" if sys.platform == "win32" else "clear")
    else:
        print("{} is not recognized as a command.".format(cmd))
        help()

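# Main routine.
# This is an interactive translation program: once it starts, strings to translate can be
# entered one after another until the user quits with the `/exit` command.
if __name__ == '__main__':
    print("input `/help` for help")
    while True:
        try:
            word = input('({}) >>> '.format(Mod)).strip()
        except (EOFError, KeyboardInterrupt):
            # Ctrl-D or Ctrl-C leaves the program without a traceback
            print()
            sys.exit(0)
        if len(word) == 0:
            continue
        if word[0] == '/':
            command(word)
        else:
            try:
                text = translate(word, mod=Mod)
                print("\nTranslation:")
                print(text)
            except Exception as err:
                print("Cannot translate! ({})".format(err))
            finally:
                print("\n\n")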
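
The ASCII-ratio test in isChinese is simple, but it also classifies any other non-ASCII script (Cyrillic, for example) as Chinese. One possible alternative, shown here only as a sketch, counts characters in the CJK Unified Ideographs block instead. The names `cjk_ratio` and `looks_chinese` and the 0.3 threshold are arbitrary choices for illustration, not part of the original program.

```python
def cjk_ratio(text):
    """Fraction of non-space characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if '\u4e00' <= c <= '\u9fff')
    return cjk / len(chars)


def looks_chinese(text, threshold=0.3):
    """Treat the text as Chinese when enough of it consists of CJK ideographs."""
    return cjk_ratio(text) >= threshold


print(looks_chinese("现在我们知道，每个文明的历程都是这样"))
print(looks_chinese("Sleep is an integral part of our daily routine."))
```

Mixed-language input remains a judgment call under either heuristic, which is why the program also offers the explicit /mod command.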

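3. Usage

3.1 Activating the conda environment

Note that the environment to activate is the one in which the transformers module is installed. If that is not the default base environment, append its name to the command.

conda activate

3.2 Running the code for the first time

Run the following command in a terminal or command prompt:

python HelsinkiTranslation.py

The first run needs a network connection (through a proxy server if necessary). The program downloads the translation models from Hugging Face and saves them locally, so later runs can load them directly.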
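
Once the models are cached, the Hugging Face libraries can be asked not to contact the network at all. The sketch below relies on two mechanisms that, to the best of my knowledge, the libraries provide: the `HF_HUB_OFFLINE` environment variable and the `local_files_only` argument of `from_pretrained`. The function name `load_offline` is made up for illustration.

```python
import os

# Ask the Hugging Face libraries not to contact the Hub.
# This should be set before transformers is imported.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_offline(model_name):
    """Load a tokenizer and a model strictly from the local cache."""
    # Imported lazily so the environment variable above is already in place
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, local_files_only=True)
    return tokenizer, model


# Example (raises an error if the model was never downloaded):
# tokenizer, model = load_offline("Helsinki-NLP/opus-mt-zh-en")
```

With this in place a missing model fails immediately instead of waiting on a network timeout.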

The program prints the following:

+------------+
|  pyNLP-MT  |
+------------+
Offline translation program

Loading model: Chinese -> English ...
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.39k/1.39k [00:00<00:00, 562kB/s]
Downloading pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 312M/312M [00:14<00:00, 22.2MB/s]
Downloading (…)neration_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 293/293 [00:00<00:00, 192kB/s]
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.0/44.0 [00:00<00:00, 38.5kB/s]
Downloading (…)olve/main/source.spm: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 805k/805k [00:00<00:00, 2.70MB/s]
Downloading (…)olve/main/target.spm: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 807k/807k [00:00<00:00, 3.11MB/s]
Downloading (…)olve/main/vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1.62M/1.62M [00:00<00:00, 5.01MB/s]
Loading model: English -> Chinese ...
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.40k/1.40k [00:00<00:00, 898kB/s]
Downloading pytorch_model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 312M/312M [00:14<00:00, 22.2MB/s]
Downloading (…)neration_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 293/293 [00:00<00:00, 254kB/s]
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.0/44.0 [00:00<00:00, 28.2kB/s]
Downloading (…)olve/main/source.spm: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 806k/806k [00:00<00:00, 3.15MB/s]
Downloading (…)olve/main/target.spm: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 805k/805k [00:00<00:00, 2.71MB/s]
Downloading (…)olve/main/vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1.62M/1.62M [00:00<00:00, 4.99MB/s]
input `/help` for help
(auto) >>>

Once every model has downloaded successfully and the interactive prompt (auto) >>> appears, the setup has succeeded.
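
To see which models ended up in the local cache, the folder mentioned in the code comments (`$HOME/.cache/huggingface/hub/`) can be listed. The sketch below assumes the cache keeps one folder per model named `models--<organisation>--<model>`. The function name `cached_models` is made up for illustration, and the cache location may differ if it has been reconfigured.

```python
from pathlib import Path


def cached_models(hub_dir=None):
    """List repository ids found in a Hugging Face hub cache directory."""
    if hub_dir is None:
        hub_dir = Path.home() / ".cache" / "huggingface" / "hub"
    hub_dir = Path(hub_dir)
    if not hub_dir.is_dir():
        return []
    repos = []
    for entry in sorted(hub_dir.iterdir()):
        # Assumed layout: one folder per model, named models--<org>--<name>
        if entry.is_dir() and entry.name.startswith("models--"):
            repos.append(entry.name[len("models--"):].replace("--", "/"))
    return repos


print(cached_models())
```

After a successful first run, both Helsinki-NLP models should appear in the printed list.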

3.3 Everyday use

Run python HelsinkiTranslation.py in a terminal or command prompt to enter the interactive translation interface.

The main screen looks like this:

+------------+
|  pyNLP-MT  |
+------------+
Offline translation program
(created by Wesky, modified by WZ on Dec 9th, 2022)

Loading model: Chinese -> English ...
Loading model: English -> Chinese ...
input `/help` for help
(auto) >>>

3.3.1 Help screen

In interactive mode, type /help to get help.

(auto) >>> /help

All available commands:
    /?      print this information
    /help   print this information
    /exit   quit program
    /quit   quit program
    /mod [mode]
            set translation mode.
            Available value:
                "auto"(default): Automatically select translation mode.
                "zh2en": Chinese translate into English.
                "en2zh": English translate into Chinese.
    /clear  clean screen

If you want to translate any sentence, just type it after prompt.

3.3.2 Translating and switching the translation mode

Detect the language automatically and translate (the default):

(auto) >>> Sleep is an integral part of our daily routine, and, among the myriad functions attributed to this state, emotional processing stands out as a crucial aspect. Everyone has experienced at least once how a poor night’s sleep can wreak havoc on our emotions.

Translation:
睡眠是我们日常工作不可分割的一部分,在这种状态造成的众多功能中,情感处理是一个至关重要的方面。 每个人至少都经历过一次晚上睡得不好会给我们的情绪带来破坏。



(auto) >>> 现在我们知道，每个文明的历程都是这样：从一个狭小的摇篮世界中觉醒，蹒跚地走出去，飞起来，越飞越快，越飞越远，最后与宇宙的命运融为一体。

Translation:
Now we know that every civilization goes through this: awakening from a tiny cradle world, walking out, flying, flying faster, flying further away, and finally integrating with the fate of the universe.

Switch the translation mode to English-to-Chinese only:

(auto) >>> /mod en2zh
(en2zh) >>> For markets to achieve biodiversity conservation, biodiversity must endure. A “buy and forget” model, where credits exist forever regardless of whether the biodiversity persists, would lead to perverse outcomes.

Translation:
为了实现生物多样性保护,市场必须保持生物多样性。 一种“买和忘”模式 — — 不论生物多样性是否持续存在 — — 永远存在信贷,会导致错误的结果。

Switch the translation mode to Chinese-to-English only:

(auto) >>> /mod zh2en
(zh2en) >>> 由于水体失去约束大量蒸发，小宇宙中云雾迷漫，太阳在云后朦胧地照耀着，出现了一道横跨宇宙的绚丽彩虹。

Translation:
Due to the loss of water bodies, which binds large amounts of evaporation, clouds in the small universe evaporate, the sun glitters behind them, and there is a beautiful rainbow across the universe.
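
The examples above are single paragraphs. For much longer passages, one common approach is to split the text into sentence groups and translate each group separately, since the models accept only a limited input length. The sketch below is one way to do that. The names `split_sentences`, `chunk_text`, and `translate_long` and the 200-character limit are assumptions made for illustration; `translate_fn` stands for any function that maps a string to its translation, such as the translate function defined earlier.

```python
import re


def split_sentences(text):
    """Split after English sentence punctuation followed by whitespace, or after Chinese sentence punctuation."""
    parts = re.split(r'(?<=[.!?])\s+|(?<=[。！？])', text)
    return [p.strip() for p in parts if p.strip()]


def chunk_text(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters where possible."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence) if current else sentence
    if current:
        chunks.append(current)
    return chunks


def translate_long(text, translate_fn, max_chars=200):
    """Translate a long text chunk by chunk and join the results with spaces."""
    return " ".join(translate_fn(chunk) for chunk in chunk_text(text, max_chars))
```

A single sentence longer than `max_chars` is kept whole, so the limit is a target rather than a guarantee. Translating chunks independently also drops context across chunk boundaries, which can affect quality.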

3.3.3 Exiting the program

In interactive mode, type /exit or /quit to leave.