Study Notes CB010: Recurrent Neural Networks, LSTM, and Automatic Subtitle Crawling

Published by mikyou on 2019-06-26 15:54 / 1721 views

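Abstract: Recurrent neural networks are neural networks that can store memory; LSTM is one kind and works well in NLP. Temporal recurrent neural networks. Neural network architecture design. Applying deep learning to chatbots: selecting, combining, and optimizing network structures. The script for automatically removing empty directories, and step 6, removing non-subtitle files.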

Recurrent neural networks are neural networks that can store memory. LSTM is one kind, and it works well in NLP.

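The abbreviation RNN covers two families: the temporal recurrent neural network (recurrent neural network) and the structural recursive neural network (recursive neural network). In a recurrent network the connections between neurons form a directed graph with cycles. A recursive network applies the same network structure recursively to build a deeper, more complex network. The two are trained with variants of the same algorithm.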

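Temporal recurrent neural networks. A traditional FNN (feed-forward neural network) passes information in one direction only. An RNN introduces directed cycles: the neurons are nodes in a directed loop, so the network can express dependencies between earlier and later items. The hidden-layer nodes are fully connected to each other, so the output of one hidden node can be the input of another hidden node or of itself. U, V, and W are weight matrices, x is the input, and o is the output. The key to an RNN is the hidden layer, which captures sequence information and gives the network its memory. The parameters U, V, and W are shared across time steps: every step does the same computation on a different input, which reduces the number of parameters and the amount of computation. RNNs are widely used in NLP. A language model predicts the probability of the next word given the words seen so far. It is a sequential model in which the next word depends on the preceding words, and this corresponds to the connections between the RNN's hidden states.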

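Training an RNN. Parameters are updated with the error backpropagation (BP) algorithm, applied across time steps. The number of steps from input to output is not fixed, so the forward pass is computed step by step in time order. Let x be the input, s the value obtained by transforming x with U and adding the previous hidden state transformed by W, h the hidden-layer activation, o the output-layer value, f the hidden-layer activation function, and g the output-layer activation function. At t=0 the input is x0 and the hidden state is h0. At t=1 the input is x1, and s1 = U x1 + W h0, h1 = f(s1), o1 = g(V h1). At t=2, s2 = U x2 + W h1, h2 = f(s2), o2 = g(V h2). In general, s_t = U x_t + W h_(t-1), h_t = f(s_t), o_t = g(V h_t). In words, h = f(current input + summary of past memory), which is exactly where the RNN's memory shows up.
To summarize the notation: U, V, W are weight matrices, x is the input, s is the transformed value, f is the hidden-layer activation function, h is the hidden activation, g is the output-layer activation function, and o is the output. The dependency chain is output(hidden(transform(time, input, previous hidden))). In the backward pass, the error between each step's output o and the target value is propagated backward, the chain rule gives the gradient at every layer, and the parameters are updated.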
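The forward recurrence above can be sketched in a few lines of plain Python. This is only an illustration under assumptions that are not in the notes: f is taken to be tanh, g is taken to be softmax, and the toy weights are made up.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, h0):
    """Run s_t = U x_t + W h_(t-1), h_t = tanh(s_t), o_t = softmax(V h_t)."""
    h, outputs = h0, []
    for x in xs:
        s = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(v) for v in s]        # assumed f = tanh
        z = matvec(V, h)
        m = max(z)
        e = [math.exp(v - m) for v in z]     # assumed g = softmax
        total = sum(e)
        outputs.append([v / total for v in e])
    return outputs, h

# Toy sizes: 2-dim input, 2 hidden units, 2 outputs. The weights are made up.
U = [[0.5, -0.2], [0.1, 0.3]]
W = [[0.0, 0.4], [-0.3, 0.2]]
V = [[1.0, -1.0], [0.5, 0.5]]
outputs, h_last = rnn_forward([[1.0, 0.0], [0.0, 1.0]], U, W, V, [0.0, 0.0])
```

The same U, W, and V are reused at every step, which is the parameter sharing described earlier; only x and the carried hidden state h change.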

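LSTM (Long Short-Term Memory networks). Plain RNNs suffer from the long-term dependency problem: the probability of the next word may depend on words that appeared far back, but to keep computation manageable the dependency length is limited. See http://colah.github.io/posts/... . In the diagram of a traditional RNN, the repeating module contains a single layer with tanh as the activation function. The "memory" lives in a sliding window over t: the number of time steps is the amount of memory.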

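LSTM design. The diagram has two kinds of elements: neural network layers (weights plus an activation function, where σ denotes sigmoid and tanh denotes tanh) and pointwise operations (multiplication or addition). Historical information is carried along and remembered in the cell state, and gates act like adjustable valves (multiplication by a coefficient between 0 and 1). The first sigmoid layer outputs a coefficient between 0 and 1 that feeds the × gate. It decides how much of the memory passed from the previous step is kept and how much is forgotten, and it depends on the previous hidden output h{t-1} and the current input x{t}. Next, h{t-1} and x{t} produce new information to store in memory: a tanh layer computes the candidate value for C_t and a sigmoid layer computes its scaling coefficient (sigmoid outputs lie between 0 and 1 and serve as a proportion, tanh outputs lie between -1 and 1 and serve as a value). Finally the hidden output h is computed from all current information (the previous hidden output, the current input x, and the overall memory): the cell state C is passed through tanh and then filtered by a sigmoid coefficient computed from the previous output and the current input. The words of a sentence are the inputs x at successive time steps. The probability that word A appears at time t can be computed by the LSTM. It depends on the preceding words, but on how many is not fixed, and because the LSTM stores memory in C it can produce a reasonably close estimate.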
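The gate arithmetic described above can be made concrete with a scalar LSTM cell. This is a minimal sketch, not the notes' implementation: it follows the standard forget/input/candidate/output formulation, uses scalar weights purely for readability, and the parameter layout `p` is a hypothetical choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, act):
        w_x, w_h, b = p[name]
        return act(w_x * x + w_h * h_prev + b)
    f = gate("forget", sigmoid)        # how much old memory to keep
    i = gate("input", sigmoid)         # how much new information to store
    g = gate("candidate", math.tanh)   # candidate value between -1 and 1
    o = gate("output", sigmoid)        # how much of the memory to expose
    c = f * c_prev + i * g             # updated cell state C_t
    h = o * math.tanh(c)               # hidden output h_t
    return h, c
```

With the forget gate near 1 and the input gate near 0, the cell state passes through almost unchanged, which is how information survives over many time steps.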

A chatbot is a generalized question-answering system.

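Building the corpus. A generalized QA system usually collects corpus data from the internet, for example from Baidu or Google, and assembles question-answer pairs into a corpus. The corpus is split into several training, development, and test sets. Training a QA system means learning a model that picks the correct answer out of a pile of candidates. Training does not put every answer into one vector space. Instead, answers are grouped: for each question sampled from the corpus, collect a pool of 500 candidate answers. The pool contains the positive samples plus randomly chosen negative samples, which makes the positive samples stand out.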
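As an illustration of the sampling step, the sketch below pairs each positive answer with a few randomly drawn negatives from the candidate pool. The function name, the triple layout, and the number of negatives per positive are assumptions made for this example.

```python
import random

def sample_triples(question, positives, answer_pool, n_negatives, rng=random):
    """Pair each positive answer with randomly drawn negatives from the pool."""
    negatives_pool = [a for a in answer_pool if a not in positives]
    triples = []
    for pos in positives:
        k = min(n_negatives, len(negatives_pool))
        for neg in rng.sample(negatives_pool, k):
            triples.append((question, pos, neg))
    return triples
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible between runs.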

A CNN-based design. CNNs offer sparse interaction, parameter sharing, and equivariant representation, which make them a good fit for training the answer-selection model of an automatic QA system.

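General training method. At each training step, take the question vector Vq (word vectors can be trained with Google's word2vec), one positive answer vector Va+, and one negative answer vector Va-, and compare the similarity of the question to each answer. If the gap cos(Vq, Va+) - cos(Vq, Va-) is smaller than a margin m, update the model parameters. If the gap is already at least m, leave the model unchanged and draw another negative answer from the candidate pool. Parameters are updated by gradient descent using the chain rule. At test time, compute the cosine similarity between the question and every candidate answer; the candidate with the highest similarity is the predicted answer.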
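One common way to write this margin objective is the hinge loss max(0, m - cos(Vq, Va+) + cos(Vq, Va-)): the loss is nonzero, and so triggers a parameter update, only while the positive answer does not beat the negative one by at least m. The sketch below assumes plain Python lists as vectors and an arbitrary default margin of 0.1.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hinge_loss(vq, va_pos, va_neg, m=0.1):
    """max(0, m - cos(q, a+) + cos(q, a-)); zero means no update is needed."""
    return max(0.0, m - cosine(vq, va_pos) + cosine(vq, va_neg))

def best_answer(vq, candidates):
    """Index of the candidate vector most similar to the question."""
    return max(range(len(candidates)), key=lambda i: cosine(vq, candidates[i]))
```

`best_answer` corresponds to the test-time rule: score every candidate by cosine similarity and return the top one.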

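Network architecture design. HL is a hidden layer with activation z = tanh(Wx+B), CNN is a convolution layer, P is a pooling layer with pooling stride 1, and T is a tanh layer. The output of P+T is the vector representation, and the final output is the cosine similarity of the two vectors. Where the HL or CNN blocks of the question and answer branches are drawn joined together, they share the same weights. The CNN output dimension depends on how many convolution features are computed. Paper: "Applying Deep Learning to Answer Selection: A Study and an Open Task".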
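A rough sketch of the CNN, P, and T stages follows. It is a simplification under assumptions: the HL stage is left out, pooling is taken as a maximum over all window positions, and each filter is a flat weight list spanning `width` consecutive word vectors. The input needs at least `width` words.

```python
import math

def encode(word_vectors, filters, width=2):
    """Convolve each filter over windows of `width` words, max-pool over
    positions, then apply tanh. Returns one value per filter."""
    windows = [sum(word_vectors[i:i + width], [])   # concatenate the window
               for i in range(len(word_vectors) - width + 1)]
    pooled = [max(sum(w * x for w, x in zip(f, win)) for win in windows)
              for f in filters]
    return [math.tanh(p) for p in pooled]
```

Calling `encode` with the same `filters` for both the question and the answer is what weight sharing amounts to here, and the output length equals the number of filters, matching the remark about the CNN output dimension.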

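Applying deep learning to chatbots: 1. Select, combine, and optimize neural network structures. 2. For natural language processing, represent words as vectors the machine can work with. 3. Where a similarity or matching relation is involved, use a similarity measure; cosine distance is the typical choice. 4. To capture the global information of a text sequence, use a CNN or LSTM. 5. If accuracy is low, add layers. 6. If computation is too heavy, use parameter sharing and pooling.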

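Training a chatbot requires a massive conversational corpus. One source is the subtitles of US TV series: subtitle files of foreign films and TV shows are a natural dialogue corpus, and dialogue-heavy American series work best. Subtitle site: www.zimuku.net.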

Crawling subtitles automatically. The crawler code is at https://github.com/warmheartl... . Create a result directory under subtitle, and pass dont_filter=True when calling scrapy.Request:

# coding:utf-8


import scrapy
from subtitle_crawler.items import SubtitleCrawlerItem

class SubTitleSpider(scrapy.Spider):
    name = "subtitle"
    allowed_domains = ["zimuku.net"]
    start_urls = [
            "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=20",
            "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=21",
            "http://www.zimuku.net/search?q=&t=onlyst&ad=1&p=22",
    ]

    def parse(self, response):
        # Single quotes outside so the double-quoted XPath literal is valid Python.
        hrefs = response.selector.xpath('//div[contains(@class, "persub")]/h1/a/@href').extract()
        for href in hrefs:
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_detail, dont_filter=True)
            yield request

    def parse_detail(self, response):
        # Take the first download link on the detail page.
        url = response.selector.xpath('//li[contains(@class, "dlsub")]/div/a/@href').extract()[0]
        print("processing: ", url)
        request = scrapy.Request(url, callback=self.parse_file, dont_filter=True)
        yield request

    def parse_file(self, response):
        body = response.body
        item = SubtitleCrawlerItem()
        item["url"] = response.url
        item["body"] = body
        return item

# -*- coding: utf-8 -*-

class SubtitleCrawlerPipeline(object):
    def process_item(self, item, spider):
        url = item["url"]
        file_name = url.replace("/","_").replace(":","_")+".rar"
        # Write the downloaded bytes; the result/ directory must already exist.
        with open("result/"+file_name, "wb") as fp:
            fp.write(item["body"])
        return item

Check the results with ls result/|head -1 , ls result/|wc -l , and du -hs result/ .

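Unpacking the subtitle archives. On Linux, zip files are handled directly with unzip file.zip. For rar files, download from http://www.rarlab.com/downloa... : run wget http://www.rarlab.com/rar/rar... and then tar zxvf rarlinux-x64-5.4.0.tar.gz , which gives ./rar/unrar . The extract command is unrar x file.rar . For 7z files, download the source from http://downloads.sourceforge.... , unpack it, and run make; bin/7za is then available, used as bin/7za x file.7z.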

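The programs and scripts are at https://github.com/warmheartl... . Step 1: crawl film and TV subtitles. Step 2: sort by archive format. There are too many files to ls, file names contain special characters, duplicate names would overwrite each other, and extensions are all over the place, so use the Python script mv_zip.py: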

import glob
import os
import fnmatch
import shutil
import sys

def iterfindfiles(path, fnexp):
    for root, dirs, files in os.walk(path):
        for filename in fnmatch.filter(files, fnexp):
            yield os.path.join(root, filename)

i=0
for filename in iterfindfiles(r"./input/", "*.ZIP"):
    i=i+1
    newfilename = "zip/" + str(i) + "_" + os.path.basename(filename)
    print(filename + " <===> " + newfilename)
    shutil.move(filename, newfilename)
    #sys.exit(-1)

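Change the extension pattern to match the archive type: .rar, .RAR, .zip, .ZIP. Step 3: unpack. Download the appropriate tools for your operating system (unrar and unzip are recommended) and unpack in bulk with a script: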

i=0; for file in *; do [ -f "$file" ] || continue; mkdir -p "output/${i}"; echo "unzip $file -d output/${i}"; unzip -P abc "$file" -d "output/${i}" > /dev/null; ((i++)); done
i=0; for file in *; do [ -f "$file" ] || continue; mkdir -p "output/${i}"; echo "${i} unrar x $file output/${i}"; unrar x "$file" "output/${i}/" > /dev/null; ((i++)); done
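Because the extensions are unreliable, one alternative to sorting by file name is to look at the first bytes of each file. The sketch below uses the well-known signatures of the ZIP, RAR, and 7z formats; the function names are hypothetical, and this is not part of the original scripts.

```python
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"Rar!\x1a\x07": "rar",
    b"7z\xbc\xaf\x27\x1c": "7z",
}

def archive_type(header):
    """Return 'zip', 'rar', '7z' or None from the first bytes of a file."""
    for magic, kind in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return None

def archive_type_of_file(path):
    """Read only the first 8 bytes of the file and classify it."""
    with open(path, "rb") as fp:
        return archive_type(fp.read(8))
```

A file that returns None is either not an archive or uses a format outside these three, which fits the later step of setting aside a small remainder of odd files.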

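Step 4: sort the srt, ass, and ssa subtitle files. Subtitle file types encountered: srt, lrc, ass, ssa, sup, idx, str, vtt. Step 5: clean up directories. Script for automatically removing empty directories, clear_empty_dir.py: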

import glob
import os
import fnmatch
import shutil
import sys

def remove_empty_dirs(path):
    # Walk bottom-up so a parent emptied by removing its children is caught too.
    for root, dirs, files in os.walk(path, topdown=False):
        if not os.listdir(root):
            print(root)
            os.rmdir(root)

remove_empty_dirs(r"./input/")

Step 6: remove non-subtitle files. Bulk deletion script del_file.py:

import glob
import os
import fnmatch
import shutil
import sys

def iterfindfiles(path, fnexp):
    for root, dirs, files in os.walk(path):
        for filename in fnmatch.filter(files, fnexp):
            yield os.path.join(root, filename)

for suffix in ("*.mp4", "*.txt", "*.JPG", "*.htm", "*.doc", "*.docx", "*.nfo", "*.sub", "*.idx"):
    for filename in iterfindfiles(r"./input/", suffix):
        print(filename)
        os.remove(filename)

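Step 7: unpack nested archives. Step 8: discard the small remainder: files with no extension, odd extensions, and a few leftover archives, under 50 MB in total. Step 9: detect encodings and convert. Files come in utf-8, utf-16, gbk, unicode, and iso8859; convert everything to utf-8 with get_charset_and_conv.py: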

import chardet
import sys
import os

if __name__ == "__main__":
    if len(sys.argv) == 2:
        for root, dirs, files in os.walk(sys.argv[1]):
            for file in files:
                file_path = os.path.join(root, file)
                # Read raw bytes: chardet works on bytes, not on str.
                with open(file_path, "rb") as f:
                    data = f.read()
                encoding = chardet.detect(data)["encoding"]
                if encoding not in ("UTF-8-SIG", "UTF-16LE", "utf-8", "ascii"):
                    try:
                        # Treat the file as a GB-family encoding and rewrite it as UTF-8.
                        gb_content = data.decode("gb18030")
                        with open(file_path, "wb") as f:
                            f.write(gb_content.encode("utf-8"))
                    except UnicodeDecodeError:
                        print("except:", file_path)
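Where chardet is unavailable or guesses wrongly, a deterministic fallback chain is one option. The sketch below is an assumption-laden alternative, not the script above: it checks for a byte-order mark first, then tries strict UTF-8, then GB18030. It relies only on the standard library.

```python
import codecs

def decode_subtitle(data):
    """Decode bytes to text: BOM first, then strict UTF-8, then GB18030."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")       # the codec consumes the BOM itself
    for encoding in ("utf-8", "gb18030"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None
```

UTF-16 is only accepted when a BOM is present, because most even-length byte strings decode as UTF-16 without error and would otherwise yield garbage silently.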

Step 10: keep only Chinese text. extract_sentence_srt.py:

# coding:utf-8
import chardet
import os
import re

# CJK ideographs, hiragana, and katakana ranges
cn = r"([\u4e00-\u9fa5]+)"
pattern_cn = re.compile(cn)
jp1 = r"([\u3040-\u309F]+)"
pattern_jp1 = re.compile(jp1)
jp2 = r"([\u30A0-\u30FF]+)"
pattern_jp2 = re.compile(jp2)

for root, dirs, files in os.walk("./srt"):
    for file in files:
        # Read bytes so that chardet can guess the encoding.
        with open(os.path.join(root, file), "rb") as f:
            content = f.read()
        encoding = chardet.detect(content)["encoding"]
        try:
            for sentence in content.decode(encoding).split("\n"):
                if len(sentence) > 0:
                    match_cn = pattern_cn.findall(sentence)
                    match_jp1 = pattern_jp1.findall(sentence)
                    match_jp2 = pattern_jp2.findall(sentence)
                    sentence = sentence.strip()
                    # Keep short lines that contain Chinese and no Japanese kana.
                    if len(match_cn)>0 and len(match_jp1)==0 and len(match_jp2) == 0 and len(sentence)>1 and len(sentence.split(" ")) < 10:
                        print(sentence)
        except (TypeError, LookupError, UnicodeDecodeError):
            # Encoding unknown or wrong: skip this file.
            continue
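The keep-or-drop rule inside the loop can be pulled out into a small pure function, which makes it easy to check against sample lines. The sketch below mirrors the same tests; the function name and the merged kana range are choices made for this example.

```python
import re

CHINESE = re.compile("[\u4e00-\u9fa5]")
KANA = re.compile("[\u3040-\u30ff]")   # hiragana and katakana together

def keep_sentence(sentence, max_tokens=10):
    """Same tests as the script: has Chinese, no kana, short enough."""
    sentence = sentence.strip()
    return (len(sentence) > 1
            and CHINESE.search(sentence) is not None
            and KANA.search(sentence) is None
            and len(sentence.split(" ")) < max_tokens)
```

SRT index lines and timestamp lines contain no Chinese characters, so they are rejected by the same rule without any explicit SRT parsing.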

Step 11: extract sentences from the subtitles (ssa files).

# coding:utf-8
import chardet
import os
import re

# CJK ideographs, hiragana, and katakana ranges
cn = r"([\u4e00-\u9fa5]+)"
pattern_cn = re.compile(cn)
jp1 = r"([\u3040-\u309F]+)"
pattern_jp1 = re.compile(jp1)
jp2 = r"([\u30A0-\u30FF]+)"
pattern_jp2 = re.compile(jp2)

for root, dirs, files in os.walk("./ssa"):
    for file in files:
        # Read bytes so that chardet can guess the encoding.
        with open(os.path.join(root, file), "rb") as f:
            content = f.read()
        encoding = chardet.detect(content)["encoding"]
        try:
            for line in content.decode(encoding).split("\n"):
                if line.find("Dialogue") == 0 and len(line) < 500:
                    # The spoken text is the last comma-separated field.
                    fields = line.split(",")
                    sentence = fields[-1]
                    # Drop style override tags such as {\an8}.
                    tag_fields = sentence.split("}")
                    if len(tag_fields) > 1:
                        sentence = tag_fields[-1]
                    match_cn = pattern_cn.findall(sentence)
                    match_jp1 = pattern_jp1.findall(sentence)
                    match_jp2 = pattern_jp2.findall(sentence)
                    sentence = sentence.strip()
                    if len(match_cn)>0 and len(match_jp1)==0 and len(match_jp2) == 0 and len(sentence)>1 and len(sentence.split(" ")) < 10:
                        # Remove the ASS hard line-break marker \N.
                        sentence = sentence.replace("\\N", "")
                        print(sentence)
        except (TypeError, LookupError, UnicodeDecodeError):
            # Encoding unknown or wrong: skip this file.
            continue
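Taking the last comma-separated field loses part of any line whose text itself contains a comma. An ASS/SSA Dialogue line has ten fields with the text last, so splitting at most nine times keeps the text whole. The sketch below is a hedged variant of the extraction, not the script above; the function name is hypothetical, and replacing line-break markers with a space rather than nothing is a choice made here.

```python
import re

OVERRIDE_TAG = re.compile(r"\{[^}]*\}")   # style overrides such as {\an8}

def dialogue_text(line):
    """Return the text of an ASS/SSA 'Dialogue:' line, or None."""
    if not line.startswith("Dialogue:"):
        return None
    parts = line.split(",", 9)            # the 10th field is the text
    if len(parts) < 10:
        return None
    text = OVERRIDE_TAG.sub("", parts[9])
    return text.replace("\\N", " ").replace("\\n", " ").strip()
```

The result can then be passed through the same Chinese-only filter as the srt sentences.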

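Step 12: content filtering. Filter out special unicode characters and keywords, and strip subtitle style tags, html tags, runs of repeated special characters, escape sequences, and episode information: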

# coding:utf-8
import sys
import re
import chardet

if __name__ == "__main__":
    #illegal=r"([\u2000-\u2010]+)"
    illegal=r"([\u0000-\u2010]+)"
    pattern_illegals = [re.compile(r"([\u2000-\u2010]+)"), re.compile(r"([\u0090-\u0099]+)")]
    filters = ["字幕", "时间轴:", "校对:", "翻译:", "后期:", "监制:"]
    filters.append("时间轴：")
    filters.append("校对：")
    filters.append("翻译：")
    filters.append("后期：")
    filters.append("监制：")
    filters.append("禁止用作任何商业盈利行为")
    filters.append("http")
    htmltagregex = re.compile(r"<[^>]+>",re.S)
    brace_regex = re.compile(r"\{.*\}",re.S)
    # A backslash followed by a word character, e.g. \N
    slash_regex = re.compile(r"\\\w",re.S)
    repeat_regex = re.compile(r"[-=]{10}",re.S)
    # Read bytes so that invalid UTF-8 can be detected line by line.
    f = open("./corpus/all.out", "rb")
    count=0
    while True:
        line = f.readline()
        if line:
            line = line.strip()

            # Encoding check: drop anything that is not valid UTF-8
            gb_content = ""
            try:
                gb_content = line.decode("utf-8")
            except UnicodeDecodeError:
                sys.stderr.write("decode error: %r\n" % line)
                continue
            # From here on, work with the decoded text.
            line = gb_content

            # Character check: drop lines that contain illegal characters
            need_continue = False
            for pattern_illegal in pattern_illegals:
                match_illegal = pattern_illegal.findall(gb_content)
                if len(match_illegal) > 0:
                    sys.stderr.write("match_illegal error: %s\n" % line)
                    need_continue = True
                    break
            if need_continue:
                continue

            # Keyword filter
            need_continue = False
            for filter in filters:
                try:
                    line.index(filter)
                    sys.stderr.write("filter keyword of %s %s\n" % (filter, line))
                    need_continue = True
                    break
                except:
                    pass
            if need_continue:
                continue

            # Drop season, episode, and frame information
            if re.match(".*第.*季.*", line):
                sys.stderr.write("filter copora %s\n" % line)
                continue
            if re.match(".*第.*集.*", line):
                sys.stderr.write("filter copora %s\n" % line)
                continue
            if re.match(".*第.*帧.*", line):
                sys.stderr.write("filter copora %s\n" % line)
                continue

            # Strip html tags
            line = htmltagregex.sub("",line)

            # Strip curly-brace style tags
            line = brace_regex.sub("", line)

            # Strip escape sequences
            line = slash_regex.sub("", line)

            # Drop lines with long runs of repeated characters
            new_line = repeat_regex.sub("", line)
            if len(new_line) != len(line):
                continue

            # Strip special characters
            line = line.replace("-", "").strip()

            if len(line) > 0:
                sys.stdout.write("%s\n" % line)
            count+=1
        else:
            break
    f.close()
    pass
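The filters above can also be packaged as one pure function, which is easier to test on sample lines than a script that reads a file. The sketch below is a condensed variant under assumptions: the keyword list is shortened, the three episode patterns are merged into one, and the function name is hypothetical.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
BRACES = re.compile(r"\{.*?\}")
ESCAPE = re.compile(r"\\\w")
EPISODE = re.compile(r"第.*[季集帧]")
REPEAT = re.compile(r"[-=]{10}")
KEYWORDS = ("字幕", "翻译", "校对", "时间轴", "http")

def clean_line(line):
    """Return the cleaned sentence, or None if the line should be dropped."""
    if any(k in line for k in KEYWORDS) or EPISODE.search(line):
        return None
    if REPEAT.search(line):
        return None
    for pattern in (HTML_TAG, BRACES, ESCAPE):
        line = pattern.sub("", line)
    line = line.replace("-", "").strip()
    return line or None
```

The episode pattern is deliberately crude, as in the script: an ordinary sentence that happens to contain 第 followed later by 季, 集, or 帧 is dropped as well.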



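Copyright belongs to the author; do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: https://www.ucloud.cn/yun/18380.html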
