如何使用Tensorflow检查处理自然语言拼写错误？（附实例教程&源代码）

2018-08-13 · Ryan · 发表评论

项目背景与目标

在机器学习项目中，数据质量至关重要。自然语言处理（NLP）项目通常需要处理人类书写的文本，而拼写错误是常见问题。例如，从社交媒体或论坛收集的数据集可能包含大量拼写错误，这会影响模型的训练效果。因此，构建一个拼写检查器是一个有价值的项目，有助于提升数据质量。

本项目将使用序列到序列（seq2seq）模型，并利用 TensorFlow 实现。我们将重点介绍如何为模型准备数据，并讨论模型的一些功能。项目使用 Python 3 和 TensorFlow 1.x 版本（请注意，当前 TensorFlow 2.x 已广泛使用，但核心概念相通）。数据源来自古腾堡项目的二十本流行书籍，如需提高模型准确性，可以扩展使用更多书籍。

完整代码可在 GitHub 查看：https://github.com/Currie32/Spell-Checker

模型能力预览

以下是一些模型纠正拼写错误的示例：

输入： Spellin is difficult, whch is wyh you need to study everyday.
输出： Spelling is difficult, which is why you need to study everyday.
输入： The first days of her existence in th country were vrey hard for Dolly.
输出： The first days of her existence in the country were very hard for Dolly.
输入： Thi is really something impressiv thaat we should look into right away!
输出： This is really something impressive that we should look into right away!

数据加载与预处理

加载书籍文本

首先，将所有书籍文件放在名为“books”的文件夹中。以下是加载单本书籍的函数：

def load_book(path):
    input_file = os.path.join(path)
    with open(input_file) as f:
        book = f.read()
    return book

获取书籍文件列表：

path = './books/'
book_files = [f for f in listdir(path) if isfile(join(path, f))]
book_files = book_files[1:]

将所有书籍文本加载到列表中：

books = []
for book in book_files:
    books.append(load_book(path+book))

如需查看每本书的单词数，可使用以下代码：

for i in range(len(books)):
    print("There are {} words in {}.".format(len(books[i].split()), book_files[i]))

注意： 如果代码中不使用 .split()，则返回的是字符数而非单词数。

文本清洗

由于模型以字符为输入，我们无需进行词干提取或去除停用词，只需移除不想要的字符和多余空格：

def clean_text(text):
    '''Remove unwanted characters and extra spaces from the text'''
    text = re.sub(r'n', ' ', text)
    text = re.sub(r'[{}@_*>()\#%+=[]]','', text)
    text = re.sub('a0','', text)
    text = re.sub("'92t","'t", text)
    text = re.sub("'92s","'s", text)
    text = re.sub("'92m","'m", text)
    text = re.sub("'92ll","'ll", text)
    text = re.sub("'91",'', text)
    text = re.sub("'92",'', text)
    text = re.sub("'93",'', text)
    text = re.sub("'94",'', text)
    text = re.sub('.','. ', text)
    text = re.sub('!','! ', text)
    text = re.sub('?','? ', text)
    text = re.sub(' +',' ', text) # Removes extra spaces
    return text

词汇表构建

词汇表构建是标准步骤，具体代码可在 GitHub 找到。模型输入包含的字符如下（共78个）：

[' ', '!', '"', '$', '&', "'", ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '', '', '', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

可以进一步删除特殊字符或统一转为小写，但为了保持拼写检查器的实用性，我们保留了当前字符集。

句子分割与处理

数据在输入模型前需组织成句子。我们以句点加空格（“. ”）为分隔符进行分割。但需注意，有些句子以问号或感叹号结尾，这可能导致分割问题。不过，只要合并后的句子长度不超过最大限制，模型仍能处理。

示例：

Today is a lovely day. I want to go to the beach. （将被拆分为两个句子）
Is today a lovely day? I want to go to the beach. （将作为一个长句子处理）

分割代码如下：

sentences = []
for book in clean_books:
    for sentence in book.split('. '):
        sentences.append(sentence + '.')

句子长度筛选

为控制训练时间，我们筛选长度在指定范围内的句子：

max_length = 92
min_length = 10

good_sentences = []

for sentence in int_sentences:
    if len(sentence) <= max_length and len(sentence) >= min_length:
        good_sentences.append(sentence)

数据集划分

将数据划分为训练集和测试集，测试集占比15%：

training, testing = train_test_split(good_sentences,
                                     test_size = 0.15,
                                     random_state = 2)

按长度排序

按句子长度排序可以提升训练效率，因为同一批次内的句子长度相近，所需填充较少：

training_sorted = []
testing_sorted = []

for i in range(min_length, max_length+1):
    for sentence in training:
        if len(sentence) == i:
            training_sorted.append(sentence)
    for sentence in testing:
        if len(sentence) == i:
            testing_sorted.append(sentence)

生成含错误的训练数据

本项目的一个关键部分是生成含有拼写错误的句子作为模型输入。错误类型包括：

交换相邻两个字符（例如：hlelo → hello）
插入一个额外字母（例如：heljlo → hello）
删除一个字符（例如：helo → hello）

每种错误发生的概率相等，且每个字符有5%的概率发生错误。平均每20个字符会出现一个错误。

定义字母列表：

letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',]

错误生成函数：

def noise_maker(sentence, threshold):
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0,1,1)
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0,1,1)
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    continue
                else:
                    noisy_sentence.append(sentence[i+1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(vocab_to_int[random_letter])
                noisy_sentence.append(sentence[i])
            else:
                pass
        i += 1
    return noisy_sentence

批次数据生成

与许多项目不同，我们在训练过程中动态生成输入数据。每个批次的目标句子（正确句子）都会通过 noise_maker 函数生成新的含错误输入。这种方法极大地扩展了训练数据的多样性。

def get_batches(sentences, batch_size, threshold):

    for batch_i in range(0, len(sentences)//batch_size):
        start_i = batch_i * batch_size
        sentences_batch = sentences[start_i:start_i + batch_size]

        sentences_batch_noisy = []
        for sentence in sentences_batch:
            sentences_batch_noisy.append(
                noise_maker(sentence, threshold))

        sentences_batch_eos = []
        for sentence in sentences_batch:
            sentence.append(vocab_to_int[''])
            sentences_batch_eos.append(sentence)

        pad_sentences_batch = np.array(
            pad_sentence_batch(sentences_batch_eos))
        pad_sentences_noisy_batch = np.array(
            pad_sentence_batch(sentences_batch_noisy))

        pad_sentences_lengths = []
        for sentence in pad_sentences_batch:
            pad_sentences_lengths.append(len(sentence))

        pad_sentences_noisy_lengths = []
        for sentence in pad_sentences_noisy_batch:
            pad_sentences_noisy_lengths.append(len(sentence))

        yield (pad_sentences_noisy_batch,
               pad_sentences_batch,
               pad_sentences_noisy_lengths,
               pad_sentences_lengths)

总结与展望

本项目展示了一个基于 seq2seq 模型的拼写检查器实现。虽然结果令人鼓舞，但模型仍有改进空间，例如使用更先进的架构（如 CNN 或 Transformer）或扩展训练数据。欢迎社区贡献改进想法或代码。

希望本文能帮助你理解如何使用 TensorFlow 构建拼写检查模型。如有疑问或建议，欢迎讨论。