A brief look at the deep stacking network: a fairly practical deep learning algorithm

wenhai.he, posted 2019-04-25 17:57 / 2765 views

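Sharing the notes from my group-meeting talk.
The slides from the talk are attached:
http://vdisk.weibo.com/s/zfic-IP2yagqu
About a dozen original papers are involved. I printed them out to read; if anyone needs them, I will go back through the printouts and type up the titles. They are along the lines of DEEP STACKING NETWORKS FOR INFORMATION RETRIEVAL.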

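Main text
The deep stacking network (DSN) is a discriminative model proposed by Li Deng. Its current applications are mainly CTR, IR, and classification and regression on language and images.
The overall structure is shown in the figure below.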

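1. Brief introduction

Why DSN
DNNs already work quite well and there are plenty of packages for them, so why use a DSN?
A major reason is that a DNN uses stochastic gradient descent in the fine-tuning phase, which is hard to parallelize across machines.
The DSN, by contrast, attacks the learning scalability problem.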

Central Idea - Stacking
As the name suggests, the core idea of the DSN is stacking. It comes from the idea of stacked generalization that Wolpert proposed in 1992.
See the figure below.

Level-0 models are based on different learning models and use original data (level-0 data)
Level-1 models are based on results of level-0 models (level-1 data are outputs of level-0 models) -- also called “generalizer”
It is widely used as an ensemble method.
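The level-0 / level-1 split can be sketched in a few lines. This is only a minimal illustration, assuming two toy level-0 learners (a least-squares linear model and a decision stump) and a hold-out split for producing the level-1 data; the function names `fit_linear`, `fit_stump` and `stack` are made up for this sketch and do not come from Wolpert's paper or from any library.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model with a bias term; returns a predict function."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ w

def fit_stump(X, y):
    """One-feature threshold model (decision stump) chosen by squared error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all():          # skip splits that leave one side empty
                continue
            lo, hi = y[left].mean(), y[~left].mean()
            err = ((np.where(left, lo, hi) - y) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, lo, hi)
    _, j, t, lo, hi = best
    return lambda Z: np.where(Z[:, j] <= t, lo, hi)

def stack(X_train, y_train, X_hold, y_hold):
    # Level-0 models: different learners trained on the original (level-0) data.
    level0 = [fit_linear(X_train, y_train), fit_stump(X_train, y_train)]
    # Level-1 data: the level-0 outputs on data those models did not train on.
    level1_data = np.column_stack([m(X_hold) for m in level0])
    # Level-1 model (the "generalizer") is trained on the level-1 data.
    generalizer = fit_linear(level1_data, y_hold)
    return lambda Z: generalizer(np.column_stack([m(Z) for m in level0]))
```

Calling `stack(X_a, y_a, X_b, y_b)` returns a predict function for new inputs. A DSN applies the same idea repeatedly, with each module acting as the generalizer for the modules below it.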

Central Idea - DCN VS DSN
Interestingly, the DSN was originally proposed under the name DCN, that is, Deep Convex Network.
Deng's explanation is that the two names emphasize different things:
Deep Convex Network: accentuates the role of convex optimization
Deep Stacking Network: the key operation of stacking is emphasized

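2. Algorithm details
There is not much formula derivation in this part. On one hand the formulas really are quite simple; on the other, I want to talk about what matters more in engineering practice, such as choosing hyper-parameters and initializing weights.
The main parts are: input and output, W & U, fine-tuning, hyper-parameters, regularization, over-fitting.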

Input
In the DSN structure, this is the part circled in red in the figure below.

The inputs are illustrated with three applications: image, speech, and semantic utterance classification.

Sometimes we do not need feature selection:
For images, the input can be a number of pixels or extracted features, or values based at least in part upon intensity values, RGB values (or the like) corresponding to the respective pixels.
For speech, it can be samples of the speech waveform or features extracted from speech waveforms (such as power spectra or cepstral coefficients).
Note that using the speech waveform as the raw features for a speech recognizer is not a crazy idea.

In other cases we do need feature selection.
For example:
Acoustic models have 9 standard MFCC features and millions of frames as training samples, so of course no feature selection is needed.
Semantic utterance classification, on the other hand, has as many as 125,000 unique trigrams as potential features
but only 16,000 utterances. This makes the feature space sparse, so feature selection is needed.

Which feature selection algorithm you use makes little difference. Here is the approach that uses a boosting classifier:
Semantic Classification
-The input feature space is shrunk using the n-grams selected by the Boosting classifier.
-The weights coming with the decision stump are ignored; only binary features indicating absence or presence are used. (A decision stump is a single-node decision tree.)
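Once the n-grams have been selected, building the binary input vector is straightforward. The sketch below assumes the selection has already been done elsewhere (the boosting step itself is not shown), and the three trigrams in `selected` are invented purely for illustration.

```python
def trigrams(utterance):
    """Set of word trigrams in a whitespace-tokenized, lower-cased utterance."""
    words = utterance.lower().split()
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}

def binary_features(utterance, selected):
    """1 if the selected trigram is present, else 0; stump weights are ignored."""
    present = trigrams(utterance)
    return [1 if g in present else 0 for g in selected]

# Hypothetical trigrams standing in for the ones a boosting classifier kept.
selected = ["i want to", "want to fly", "what is the"]
x = binary_features("I want to fly to Boston", selected)
print(x)  # [1, 1, 0]
```

The resulting vector has one entry per selected n-gram, so the input dimension drops from the full trigram vocabulary to the size of the selected set.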

Output

This depends on how the problem to be solved is defined:
-representative of the values 0, 1, 2, 3, and so forth up to 9 with a 0-1 coding scheme
-representative of phones
-HMM states of phones
-context-dependent HMM states of phones
and so on.
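For the digit case, the 0-1 coding scheme is what is usually called one-hot coding. A minimal sketch, assuming targets are stored as a matrix with one column per sample (the name `one_hot` and that layout are my own choices for illustration):

```python
import numpy as np

def one_hot(labels, num_classes):
    """0-1 coding: column n has a single 1 in the row given by labels[n]."""
    T = np.zeros((num_classes, len(labels)))
    T[labels, np.arange(len(labels))] = 1.0
    return T

T = one_hot([3, 0, 9], 10)   # three samples with digit labels 3, 0 and 9
```

The same coding works for phones or HMM states; only `num_classes` changes.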

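Learning
Let's start with the bottom module shown in the figure below to explain how learning works.

(Douban's editor is too much of a hassle, so I just pasted a picture.)
That is all there is to the computation for a single layer; it is actually quite simple.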
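The pasted picture is not available here, so the following is not the formula from that image. It is a minimal sketch of the single-module computation as I understand it from the DSN papers, assuming sigmoid hidden units H = sigmoid(W^T X), a linear output layer Y = U^T H, and U obtained in closed form by least squares. The function names are invented for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_module(X, T, W):
    """Learn the upper-layer weights U of one module with W held fixed.

    X: (D, N) inputs, T: (C, N) targets, W: (D, L) lower-layer weights.
    Returns U: (L, C), the least-squares solution of U^T H ~= T.
    """
    H = sigmoid(W.T @ X)                            # hidden outputs, (L, N)
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)   # solves H^T U ~= T^T
    return U

def predict_module(X, W, U):
    """Module output Y = U^T sigmoid(W^T X), shape (C, N)."""
    return U.T @ sigmoid(W.T @ X)
```

Because U has a closed-form solution given W, learning U is a convex problem, which is presumably where the "Deep Convex Network" name comes from.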

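But many issues come up in the actual computation:
1. How to initialize W
2. How to tune the hyper-parameters
3. When regularization is needed
4. Over-fitting when solving a concrete problem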

Setting Weight Matrices W
The single-layer computation was described above. For multiple layers the computation is structured as follows.

As the overall structure figure shows, the output of a module is added to the input of the next one. The remaining steps are the same as in the single-layer case.
Initializing W, though, is rather interesting. There are four methods:
1. Take the same W from the immediately lower module already adjusted via fine tuning.
2. Take a copy of the RBM that initialized W at bottom module.
3. Use random numbers, making the full W maximally random before fine tuning.
4. Mix the above three choices with various weighting and with randomized order or otherwise.
Note, of course, that the sub-matrix of W corresponding to the output units from the lower modules is always initialized with random numbers.
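The stacked input and the first initialization strategy can be sketched as follows. This assumes the outputs of all lower modules are concatenated below the raw input, in module order, and that the random sub-matrix is drawn from a small Gaussian; the names `next_module_input` and `init_W` and the `scale` parameter are hypothetical.

```python
import numpy as np

def next_module_input(X, lower_outputs):
    """Stack the raw input X (D, N) on top of each lower module's output (C, N)."""
    return np.vstack([X] + list(lower_outputs))

def init_W(W_lower, num_new_rows, rng, scale=0.1):
    """Strategy 1: copy W from the module below; the rows for the newly added
    output units are always random. `scale` is an arbitrary choice here."""
    L = W_lower.shape[1]
    W_out = scale * rng.standard_normal((num_new_rows, L))
    return np.vstack([W_lower, W_out])

rng = np.random.default_rng(0)
X = rng.random((784, 4))            # 784 input features, 4 samples
Y1 = rng.random((10, 4))            # stand-in for the bottom module's output
X2 = next_module_input(X, [Y1])     # (794, 4)
W1 = rng.standard_normal((784, 30)) # stand-in for the fine-tuned bottom W
W2 = init_W(W1, 10, rng)            # (794, 30)
```

Each extra module therefore grows the input dimension by the number of output units, which is one reason the output size matters for how far stacking can go.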

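You may notice that these four methods differ quite a lot. Interestingly, though, with sufficient efforts put to adjust all other hyper-parameters, all four strategies above eventually gave similar classification accuracy.
As the saying goes, tune the parameters well and even a poor structure is nothing to fear. With the random initialization strategy there is no loss in classification accuracy, but it takes many more modules and fine-tuning iterations than the other strategies.

Note in particular that the bottom module must not be randomly initialized; it still needs an RBM.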

Fine-tuning
Use batch-mode gradient descent. There is not much to say, so I will just paste the formula.
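The pasted formula is not available here, so what follows is my own sketch rather than the author's formula. It assumes the objective is the squared error E = ||U^T H - T||^2 with H = sigmoid(W^T X), and that U is always set to its least-squares optimum for the current W. Because U is at its optimum, the gradient of E with respect to W can be computed with U treated as fixed. The learning rate and step count are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def closed_form_U(H, T):
    """Least-squares U for hidden outputs H (L, N) and targets T (C, N)."""
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)
    return U

def loss(W, X, T):
    """Squared error with U re-solved for the given W."""
    H = sigmoid(W.T @ X)
    U = closed_form_U(H, T)
    return np.sum((U.T @ H - T) ** 2)

def grad_W(W, X, T):
    """Gradient of loss() with respect to W, shape (D, L)."""
    H = sigmoid(W.T @ X)
    U = closed_form_U(H, T)
    dH = 2.0 * U @ (U.T @ H - T)     # dE/dH, (L, N)
    dZ = dH * H * (1.0 - H)          # back through the sigmoid
    return X @ dZ.T

def fine_tune(W, X, T, lr=0.01, steps=50):
    """Batch-mode gradient descent: every step uses the whole training set."""
    for _ in range(steps):
        W = W - lr * grad_W(W, X, T)
    return W
```

Since each step is a handful of matrix products over the full batch, this is the kind of computation that is comparatively easy to spread across machines, in contrast to the sample-by-sample SGD mentioned earlier.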

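Regularization
Regularization is not needed for images and language, but it is needed for IR. This is because IR has few outputs, usually a binary classification problem, which weakens the propagation effect of stacking.
The approach is L2 regularization on U, plus a data reconstruction error term for W.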
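For the U part, L2 regularization keeps the closed-form solution and only adds a multiple of the identity to the matrix being inverted. A minimal sketch, assuming the objective ||U^T H - T||^2 + lam * ||U||^2 (the name `ridge_U` and the parameter `lam` are mine; the reconstruction error term for W is not shown because the post does not give its form):

```python
import numpy as np

def ridge_U(H, T, lam):
    """Minimize ||U^T H - T||^2 + lam * ||U||^2 over U.

    H: (L, N) hidden outputs, T: (C, N) targets. Returns U: (L, C).
    """
    L = H.shape[0]
    return np.linalg.solve(H @ H.T + lam * np.eye(L), H @ T.T)
```

A larger `lam` shrinks U toward zero; like the hidden layer size, it would be chosen on the development data.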

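Hyper-parameter
The hyper-parameter of the algorithm is the number of neurons in the hidden layer.
This is chosen experimentally by splitting the data set into train data, test data, and development data. Roughly right is good enough; a few times the input size generally works.
The general feel is as follows:

That is roughly 784 input features, 3000 hidden units, and 10 outputs.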

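Over-fitting
Finally, the question of how many layers to stack.
Too few layers and the results are poor; too many and you get over-fitting.
Generally speaking, if your features are well chosen, the parameters are cleverly initialized, and the hyper-parameters are well tuned, fewer layers are needed; if you do these things badly, more layers are needed. (Imagine the extreme case: with very clever features, one layer is enough.)
The number of layers is also tuned by comparing train error and test error.
In general, the more layers, the smaller the train error, but past a certain point the test error starts to rise. At that point we consider over-fitting to have set in and fix the number of layers at the turning point, as in the figure below for two application scenarios. Neither actually uses many layers.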
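Picking the turning point can be written down directly. The sketch below assumes you have recorded one test error per number of stacked modules; the error values are made-up toy numbers for illustration, not results from any experiment.

```python
def pick_num_modules(test_errors):
    """Return the 1-based depth with the lowest test error (earliest on ties)."""
    best = min(range(len(test_errors)), key=lambda i: test_errors[i])
    return best + 1

# Made-up toy curves: train error keeps falling, test error turns back up.
train_err = [0.080, 0.050, 0.030, 0.020, 0.015]
test_err = [0.090, 0.070, 0.065, 0.068, 0.075]
depth = pick_num_modules(test_err)
print(depth)  # 3
```

In the toy curves the train error alone would suggest five modules, while the test error says to stop at three.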

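3. Summary
DSN actually works quite well on classification problems with large inputs and large outputs, and it does slightly better than many state-of-the-art classification methods. Most importantly, it is easier to tune than a DNN and convenient in engineering practice.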

When reposting, please cite the address of this article: https://www.ucloud.cn/yun/4278.html
