博客 / Others/ 如何使用Tensorflow检查处理自然语言拼写错误?(附实例教程&源代码)

如何使用Tensorflow检查处理自然语言拼写错误?(附实例教程&源代码)

如何使用Tensorflow检查处理自然语言拼写错误?(附实例教程&源代码)

项目背景与目标

在机器学习项目中,数据质量至关重要。自然语言处理(NLP)项目通常需要处理人类书写的文本,而拼写错误是常见问题。例如,从社交媒体或论坛收集的数据集可能包含大量拼写错误,这会影响模型的训练效果。因此,构建一个拼写检查器是一个有价值的项目,有助于提升数据质量。

本项目将使用序列到序列(seq2seq)模型,并利用 TensorFlow 实现。我们将重点介绍如何为模型准备数据,并讨论模型的一些功能。项目使用 Python 3 和 TensorFlow 1.x 版本(请注意,当前 TensorFlow 2.x 已广泛使用,但核心概念相通)。数据源来自古腾堡项目的二十本流行书籍,如需提高模型准确性,可以扩展使用更多书籍。

完整代码可在 GitHub 查看:https://github.com/Currie32/Spell-Checker

模型能力预览

以下是一些模型纠正拼写错误的示例:

  • 输入: Spellin is difficult, whch is wyh you need to study everyday.
    输出: Spelling is difficult, which is why you need to study everyday.
  • 输入: The first days of her existence in th country were vrey hard for Dolly.
    输出: The first days of her existence in the country were very hard for Dolly.
  • 输入: Thi is really something impressiv thaat we should look into right away!
    输出: This is really something impressive that we should look into right away!

数据加载与预处理

加载书籍文本

首先,将所有书籍文件放在名为“books”的文件夹中。以下是加载单本书籍的函数:

def load_book(path):
    input_file = os.path.join(path)
    with open(input_file) as f:
        book = f.read()
    return book

获取书籍文件列表:

path = './books/'
book_files = [f for f in listdir(path) if isfile(join(path, f))]
book_files = book_files[1:]

将所有书籍文本加载到列表中:

books = []
for book in book_files:
    books.append(load_book(path+book))

如需查看每本书的单词数,可使用以下代码:

for i in range(len(books)):
    print("There are {} words in {}.".format(len(books[i].split()), book_files[i]))

注意: 如果代码中不使用 .split(),则返回的是字符数而非单词数。

文本清洗

由于模型以字符为输入,我们无需进行词干提取或去除停用词,只需移除不想要的字符和多余空格:

def clean_text(text):
    '''Remove unwanted characters and extra spaces from the text'''
    text = re.sub(r'n', ' ', text)
    text = re.sub(r'[{}@_*>()\#%+=[]]','', text)
    text = re.sub('a0','', text)
    text = re.sub("'92t","'t", text)
    text = re.sub("'92s","'s", text)
    text = re.sub("'92m","'m", text)
    text = re.sub("'92ll","'ll", text)
    text = re.sub("'91",'', text)
    text = re.sub("'92",'', text)
    text = re.sub("'93",'', text)
    text = re.sub("'94",'', text)
    text = re.sub('.','. ', text)
    text = re.sub('!','! ', text)
    text = re.sub('?','? ', text)
    text = re.sub(' +',' ', text) # Removes extra spaces
    return text

词汇表构建

词汇表构建是标准步骤,具体代码可在 GitHub 找到。模型输入包含的字符如下(共78个):

[' ', '!', '"', '$', '&', "'", ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '', '', '', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

可以进一步删除特殊字符或统一转为小写,但为了保持拼写检查器的实用性,我们保留了当前字符集。

句子分割与处理

数据在输入模型前需组织成句子。我们以句点加空格(“. ”)为分隔符进行分割。但需注意,有些句子以问号或感叹号结尾,这可能导致分割问题。不过,只要合并后的句子长度不超过最大限制,模型仍能处理。

示例:

  • Today is a lovely day. I want to go to the beach. (将被拆分为两个句子)
  • Is today a lovely day? I want to go to the beach. (将作为一个长句子处理)

分割代码如下:

sentences = []
for book in clean_books:
    for sentence in book.split('. '):
        sentences.append(sentence + '.')

句子长度筛选

为控制训练时间,我们筛选长度在指定范围内的句子:

max_length = 92
min_length = 10

good_sentences = []

for sentence in int_sentences:
    if len(sentence) <= max_length and len(sentence) >= min_length:
        good_sentences.append(sentence)

数据集划分

将数据划分为训练集和测试集,测试集占比15%:

training, testing = train_test_split(good_sentences,
                                     test_size = 0.15,
                                     random_state = 2)

按长度排序

按句子长度排序可以提升训练效率,因为同一批次内的句子长度相近,所需填充较少:

training_sorted = []
testing_sorted = []

for i in range(min_length, max_length+1):
    for sentence in training:
        if len(sentence) == i:
            training_sorted.append(sentence)
    for sentence in testing:
        if len(sentence) == i:
            testing_sorted.append(sentence)

生成含错误的训练数据

本项目的一个关键部分是生成含有拼写错误的句子作为模型输入。错误类型包括:

  1. 交换相邻两个字符(例如:hlelo → hello)
  2. 插入一个额外字母(例如:heljlo → hello)
  3. 删除一个字符(例如:helo → hello)

每种错误发生的概率相等,且每个字符有5%的概率发生错误。平均每20个字符会出现一个错误。

定义字母列表:

letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',]

错误生成函数:

def noise_maker(sentence, threshold):
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0,1,1)
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0,1,1)
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    continue
                else:
                    noisy_sentence.append(sentence[i+1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(vocab_to_int[random_letter])
                noisy_sentence.append(sentence[i])
            else:
                pass
        i += 1
    return noisy_sentence

批次数据生成

与许多项目不同,我们在训练过程中动态生成输入数据。每个批次的目标句子(正确句子)都会通过 noise_maker 函数生成新的含错误输入。这种方法极大地扩展了训练数据的多样性。

def get_batches(sentences, batch_size, threshold):

    for batch_i in range(0, len(sentences)//batch_size):
        start_i = batch_i * batch_size
        sentences_batch = sentences[start_i:start_i + batch_size]

        sentences_batch_noisy = []
        for sentence in sentences_batch:
            sentences_batch_noisy.append(
                noise_maker(sentence, threshold))

        sentences_batch_eos = []
        for sentence in sentences_batch:
            sentence.append(vocab_to_int[''])
            sentences_batch_eos.append(sentence)

        pad_sentences_batch = np.array(
            pad_sentence_batch(sentences_batch_eos))
        pad_sentences_noisy_batch = np.array(
            pad_sentence_batch(sentences_batch_noisy))

        pad_sentences_lengths = []
        for sentence in pad_sentences_batch:
            pad_sentences_lengths.append(len(sentence))

        pad_sentences_noisy_lengths = []
        for sentence in pad_sentences_noisy_batch:
            pad_sentences_noisy_lengths.append(len(sentence))

        yield (pad_sentences_noisy_batch,
               pad_sentences_batch,
               pad_sentences_noisy_lengths,
               pad_sentences_lengths)

总结与展望

本项目展示了一个基于 seq2seq 模型的拼写检查器实现。虽然结果令人鼓舞,但模型仍有改进空间,例如使用更先进的架构(如 CNN 或 Transformer)或扩展训练数据。欢迎社区贡献改进想法或代码。

希望本文能帮助你理解如何使用 TensorFlow 构建拼写检查模型。如有疑问或建议,欢迎讨论。

发表评论

您的邮箱不会公开。必填项已用 * 标注。