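Original paper: IJCAI 2018
A blog post for reference
Grounding a chat system in a personal profile to obtain coherent, consistent dialogue generation
Consistency (coherence) is maintained through personality. The profile is represented as key-value pairs.
In this paper, personality = profile = identity.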

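To give the chatbot an identity, the paper uses a supervised Profile Detector to decide whether the user's question refers to the chatbot's profile and, if so, which preset attribute value it concerns. To generate a consistent response that contains that attribute value, the paper splits the response at the value: a forward decoder is trained from the value to the end of the response, and a backward decoder from the value to the beginning. Together they form a Bidirectional Decoder that produces a complete response containing the value. Finally, to handle the mismatch between the training data and the preset attribute values, the paper uses an unsupervised Position Detector to help the model train better.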

0.Abstract

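Goal: generate profile-related responses using general chat data from social media that carries no speaker annotations.
Core idea: a profile detector checks whether the profile is needed when responding to a user post. If it is, one key-value pair is selected from the profile, and a bidirectional decoder generates the response in the forward and backward directions, yielding a personality-related (personalized) reply.
To train the bidirectional decoder on general conversation data, a position detector is designed to predict, given a profile value, the word position from which decoding should start.
Main contributions: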

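Introduces a profile expressed as a list of key-value pairs, so the personality can be specified clearly and explicitly. "Explicit" here means that the generated responses are consistent with the preset personality.
Three components:
- profile detector
- bidirectional decoder
- position detector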
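3.Model
3.1 Task Definition
Input: a post $X = x_1x_2\dots x_n$ and an explicit profile defined as a set of key-value pairs $\{\langle k_i,v_i\rangle \mid i=1,2,\dots,K\}$. The personality attributes are name, gender, age, weight, location, and constellation (zodiac sign).
Output: a response $Y=y_1y_2\dots y_m$ that is consistent with the profile.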

Generation process:

$P(Y|X,\{\langle k_i,v_i\rangle\}) = P(z=0|X)\cdot P^{fr}(Y|X) + P(z=1|X)\cdot P^{bi}(Y|X,\{\langle k_i,v_i\rangle\})$

$P(z|X)$ is the probability of using the profile given post $X$; it is computed by the Profile Detector.
$P^{fr}(Y|X)=\prod^m_{t=1}P^{fr}(y_t|Y_{<t},X)$ is computed by an ordinary forward decoder.
$P^{bi}(Y|X,\{\langle k_i,v_i\rangle\})$ is computed by the Bidirectional Decoder.

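The post/response pairs $\langle X,Y\rangle$ are collected from social media, so the profile value may not appear in the response $Y$. This creates a discrepancy between training and testing.
The Profile Detector decides whether to use the profile; the Bidirectional Decoder generates $Y$ from $X$ plus the profile.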
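
The mixture above can be read as a gate between two decoders. The following is a minimal sketch, not the paper's implementation: the function name `response_probability` and the numbers are hypothetical, and the two sequence probabilities are assumed to have been computed elsewhere.

```python
def response_probability(p_use_profile, p_forward_only, p_bidirectional):
    """Mix the two decoders' sequence probabilities with the detector's gate.

    p_use_profile   -- P(z=1|X), from the profile detector
    p_forward_only  -- P^fr(Y|X), from the ordinary forward decoder
    p_bidirectional -- P^bi(Y|X, {<k_i, v_i>}), from the bidirectional decoder
    """
    if not 0.0 <= p_use_profile <= 1.0:
        raise ValueError("P(z=1|X) must lie in [0, 1]")
    # P(z=0|X) is simply 1 - P(z=1|X) because z is binary.
    return (1.0 - p_use_profile) * p_forward_only + p_use_profile * p_bidirectional


# Hypothetical numbers: the detector is fairly sure the post asks about the profile.
p = response_probability(p_use_profile=0.9, p_forward_only=0.02, p_bidirectional=0.10)
```

When the gate saturates at 0 or 1, the expression reduces to a single decoder, which matches the hard either/or routing described in the overview.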

3.2 Overview

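Given a post, the profile detector predicts whether the profile is needed. If not, a general seq2seq decoder generates the response. If so, the profile detector further selects an appropriate profile key and value, and the bidirectional decoder generates the response backward and forward starting from the selected profile value. To train the bidirectional decoder on general conversation data, a position detector predicts the word position at which decoding starts, given the selected profile value. The position detector is not used at test time.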

3.3 Encoder

$h_t = GRU(h_{t-1}, x_t)$

3.4 Profile Detector

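It serves two purposes:

Detect whether the post should be answered with the profile.
$P(z|X)$ with $z \in \{0,1\}$, where $z=1$ means the profile is used. E.g., for "How are you today?" $P(z=1|X)\approx 0$; for "How old are you this year?" $P(z=1|X)\approx 1$. $P(z|X)$ is a binary classifier trained on labeled data. More formally: $P(z|X)=P(z|\widetilde{h})=\sigma(W_p\widetilde{h})$, where $W_p$ is the classifier parameter and $\widetilde{h}=\sum_j h_j$ is the sum of all encoder hidden states. More elaborate methods, such as attention-based models, would also work.
Select a specific key-value pair to respond with. $\beta_i=MLP([\widetilde{h},k_i,v_i])=softmax(W\cdot[\widetilde{h};k_i;v_i])$, where $W$ is the weight, $k_i$ and $v_i$ are the embeddings of the profile key and value, and $\widetilde{h}=\sum_j h_j$ is the representation of the post. $\beta$ is a probability distribution over the profile keys.
The value with the highest probability is chosen as the optimal profile value: $\widetilde{v}=v_j$, where $j=argmax_i(\beta_i)$. Once the profile value $\widetilde{v}$ is obtained, decoding is determined by the bidirectional decoder: $P^{bi}(Y|X,\{\langle k_i,v_i\rangle\})=P^{bi}(Y|X,\widetilde{v})$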
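
A minimal pure-Python sketch of the two detector steps, under stated assumptions: `w` is taken to be a single weight vector that maps each concatenation $[\widetilde{h};k_i;v_i]$ to a scalar score, with the softmax taken across the $K$ pairs. The function name `detect_profile` and all toy vectors are hypothetical and only illustrate the shapes involved.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def detect_profile(hidden_states, w_p, w, keys, values):
    """Return (P(z=1|X), beta, index of the selected key-value pair)."""
    dim = len(hidden_states[0])
    # h~ = sum_j h_j : the post representation is the sum of encoder states.
    h_sum = [sum(h[d] for h in hidden_states) for d in range(dim)]
    p_use = sigmoid(dot(w_p, h_sum))
    # Assumed scoring: one scalar per pair from w . [h~; k_i; v_i].
    scores = [dot(w, h_sum + k + v) for k, v in zip(keys, values)]
    beta = softmax(scores)
    best = max(range(len(beta)), key=beta.__getitem__)  # j = argmax_i beta_i
    return p_use, beta, best


# Toy inputs: two encoder states, two profile pairs with 1-d embeddings.
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0], [0.0]]
values = [[0.0], [1.0]]
p_use, beta, best = detect_profile(
    hidden_states, w_p=[2.0, 1.0], w=[0.0, 0.0, 1.0, 0.0], keys=keys, values=values
)
```

With these toy weights the first pair receives the higher score, so `best` is index 0 and its value would be handed to the bidirectional decoder as $\widetilde{v}$.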

3.5 Bidirectional Decoder

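The goal is to generate a response that uses the profile value. The bidirectional decoder consists of a backward decoder and a forward decoder; the most important difference is that a position detector is used to predict the start decoding position.
Suppose the generated response is $Y=(Y^b,\widetilde{v},Y^f)=(y_1^b,\dots,y^b_{t-1},\widetilde{v},y_{t+1}^f,\dots,y^f_m)$, where $\widetilde{v}$ is the selected profile value. The bidirectional decoder generates $Y^b$ backward and $Y^f$ forward.
The backward decoder ($P^b$) generates $Y^b$ from the given profile value $\widetilde{v}$ to the start of the response.
The forward decoder ($P^f$) takes the already generated first half (the output of $P^b$) as input and generates $Y^f$ from $\widetilde{v}$ to the end of the response. Note that this decoder differs from $P^{fr}(y_t|Y_{<t},X)$, where fr denotes the case in which no profile is used.
Both directions are defined relative to the center word, so the backward decoder produces the first half and the forward decoder the second half.
Formal definition: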

$\begin{aligned} P^{bi}(Y|X,\widetilde{v})&=P^b(Y^b|X,\widetilde{v})\cdot P^f(Y^f|Y^b,X,\widetilde{v})\\ P^b(Y^b|X,\widetilde{v})&=\prod^1_{j=t-1}P^b(y^b_j|Y^b_{>j},X,\widetilde{v})\\ P^f(Y^f|Y^b,X,\widetilde{v})&=\prod^m_{j=t+1}P^f(y^f_j|Y^f_{<j},Y^b,X,\widetilde{v}) \end{aligned}$

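Note that $P^b$ conditions on two inputs, $X$ and $\widetilde{v}$, whereas $P^f$ conditions on three: it additionally takes $Y^b$, the output of $P^b$.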
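
The decoding order implied by this factorization can be sketched as follows. This is an illustration under assumptions, not the paper's code: decoding is greedy, the profile value is a single token, the boundary markers `<s>` and `</s>` are assumed, and `backward_step` / `forward_step` are hypothetical callables standing in for the trained decoders (here replaced by toy lookup tables).

```python
BOS, EOS = "<s>", "</s>"  # assumed boundary markers


def bidirectional_decode(value, backward_step, forward_step, max_len=20):
    """Generate Y = (Y^b, value, Y^f) starting from the profile value."""
    # Backward pass: grow the response leftwards until the start marker.
    left = []
    for _ in range(max_len):
        token = backward_step(left + [value])  # sees everything to its right
        if token == BOS:
            break
        left.insert(0, token)
    # Forward pass: conditioned on Y^b and the value, grow rightwards.
    right = []
    for _ in range(max_len):
        token = forward_step(left + [value] + right)
        if token == EOS:
            break
        right.append(token)
    return left + [value] + right


# Toy stand-ins for the two decoders: each looks only at the adjacent token.
prev_token = {"piano": "play", "play": "I", "I": BOS}
next_token = {"piano": "well", "well": EOS}

response = bidirectional_decode(
    "piano",
    backward_step=lambda right_context: prev_token[right_context[0]],
    forward_step=lambda prefix: next_token[prefix[-1]],
)
```

The forward pass receives `left + [value]` as its prefix, which mirrors the extra conditioning on $Y^b$ that distinguishes $P^f$ from $P^b$.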

To encode more context in the forward decoder, the generated first half $Y^b$ and the profile value $\widetilde{v}$ serve as its initial input. The probabilities $P^b$ and $P^f$ are computed as:

$P^b(y^b_j|Y^b_{>j},X,\widetilde{v})\propto MLP([s^b_j;y^b_{j+1};c^b_j])$ $P^f(y^f_j|Y^f_{<j},Y^b,X,\widetilde{v})\propto MLP([s^f_j;y^f_{j-1};c^f_j])$

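where $s_j^{(*)}$ is the state of the corresponding decoder and $c_j^{(*)}$ is the context vector, with $* \in \{b,f\}$:
$b$ denotes the backward decoder and $f$ the forward decoder. The vectors are updated as follows: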

$s_j^{(*)}=GRU(s_{j+l}^{(*)},[y_{j+l}^{(*)};c_j^{(*)}])$ $c_j^{(*)}=\sum^n_{t=1}\alpha^{(*)}_{j,t}h_t$

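where $\alpha^{(*)}_{j,t} \propto MLP([s_{j+l}^{(*)},h_t])$ can be viewed as the similarity between the decoder state $s_{j+l}^{(*)}$ and the encoder hidden state $h_t$. (Attention can be read as a similarity between encoder and decoder hidden states.)
$l=1$ when $*=b$ (backward) and $l=-1$ when $*=f$ (forward); that is, the index is offset by $+1$ or $-1$ depending on the decoding direction.
These MLPs have the same form as the one used for $\beta_i$ in Section 3.4,

$\beta_i=MLP([\widetilde{h},k_i,v_i])=softmax(W\cdot[\widetilde{h};k_i;v_i])$

but with different parameters.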
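
To make the context-vector update concrete, here is a small sketch of $c_j=\sum_t \alpha_{j,t}h_t$. It rests on an assumption: a plain dot product is swapped in for the MLP scorer, purely to keep the example short. The function name `attention_context` and the toy vectors are hypothetical.

```python
import math


def attention_context(decoder_state, encoder_states, score):
    """Return (alphas, context) for one decoding step."""
    scores = [score(decoder_state, h) for h in encoder_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]  # normalized attention weights
    dim = len(encoder_states[0])
    # c_j = sum_t alpha_{j,t} * h_t
    context = [
        sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)
    ]
    return alphas, context


def dot_score(s, h):
    # Assumption: dot product stands in for the MLP similarity.
    return sum(x * y for x, y in zip(s, h))


alphas, context = attention_context(
    decoder_state=[1.0, 0.0],
    encoder_states=[[1.0, 0.0], [0.0, 1.0]],
    score=dot_score,
)
```

The same routine would serve both directions; only the decoder state passed in changes ($s_{j+1}^{(b)}$ for the backward decoder, $s_{j-1}^{(f)}$ for the forward one).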

3.6 Position Detector

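It gives the bidirectional decoder additional supervision by providing a start decoding position during training.
In other words, it determines the center word.
Purpose: find a suitable position in the response that can be replaced by the profile value.
e.g.

post X = '你1-有2-什么3-特长4' (What are you good at?)
response Y = '我1-非常2-擅长3-小提琴4' (I am very good at the violin)
profile k-v: <特长, 钢琴> (<specialty, piano>)

The position detector predicts that "小提琴4" (violin) in the response can be grammatically replaced by the profile value "钢琴" (piano), so "小提琴4" is passed to the decoder as the start decoding position.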

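At test time the whole sequence is likewise predicted in both directions starting from the profile value, but in the training set that value may appear rarely or never in the responses, which makes training and testing inconsistent (exposure bias?).
e.g.

profile k-v: <特长, 冰球> (<specialty, ice hockey>)
"冰球" (ice hockey) may rarely appear in the training set.

In other words, even given a training instance $(X,Y,\langle k,v\rangle)$, the value $v$ may never occur in $Y$. The bidirectional decoder therefore does not know which word to treat as the center word (the word the response must contain).
This causes the train/test inconsistency: during training the decoder does not know the start decoding position, whereas at test time the position is already specified.

$P(j|y_1y_2\dots y_m,\langle k,v\rangle),\ 1\le j\le m$ is the probability that word $y_j$ can be replaced by the profile value $v$.
Idea: the more similar a word is to the value, the more likely it is to be replaced.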

$P(j|Y,<k,v>) \propto cos(y_j,v)$

$cos(y_j,v)$ is the cosine similarity between the embedding of a word in the response and that of the profile value.
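
A minimal sketch of this similarity-based position choice, using the violin/piano example from above. The embeddings are invented toy vectors, the function name `detect_position` is hypothetical, and picking the single most similar word (argmax) is an assumption about how the distribution is turned into a start position.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def detect_position(response_tokens, value, embeddings):
    """Return (index of the word most similar to the value, all similarities)."""
    sims = [cosine(embeddings[tok], embeddings[value]) for tok in response_tokens]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Toy 3-d embeddings (made up): the two instruments point in a similar direction.
embeddings = {
    "我": [1.0, 0.0, 0.0],      # I
    "非常": [0.0, 1.0, 0.0],    # very
    "擅长": [0.0, 0.8, 0.2],    # good at
    "小提琴": [0.1, 0.0, 1.0],  # violin
    "钢琴": [0.0, 0.0, 1.0],    # piano (the profile value)
}
response = ["我", "非常", "擅长", "小提琴"]
position, sims = detect_position(response, "钢琴", embeddings)
```

Here `position` is a 0-based index, so it points at the fourth word, "小提琴", which is the word that would be swapped for the profile value during training.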

3.7 Loss Function and Training

Two loss functions are defined:

Generation probability
$D^{(c)}$ contains only post/response pairs, and $D^{(pr)}$ contains the pairs that need a profile value. The loss is the sum of the negative log-likelihoods over the two parts. $\begin{align} \mathcal{L}_1(\theta,D^{(c)},D^{(pr)}) &=-\sum_{(X,Y)\in D^{(c)}\cup D^{(pr)}} \log P(Y|X,\{\langle k_i,v_i\rangle\})\\ &=-\sum_{(X,Y)\in D^{(c)}}\log P^{fr}(Y|X)-\sum_{(X,Y)\in D^{(pr)}}\log P^{bi}(Y|X,\widetilde{v}) \end{align}$
profile detector
$\mathcal{L}_2$ is the loss for the profile detector's predictions of whether to use the profile and which key to use, based on $P(z|X)$ and $\beta_i$ defined earlier.
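
As a small numeric illustration of the first loss, the sketch below sums negative log-likelihoods over the two data subsets. It assumes the per-example sequence probabilities have already been computed by the respective decoders; the function name `generation_loss` and the probabilities are hypothetical.

```python
import math


def generation_loss(probs_forward, probs_bidirectional):
    """L1: negative log-likelihood summed over D^(c) and D^(pr).

    probs_forward       -- P^fr(Y|X) for each pair in D^(c)
    probs_bidirectional -- P^bi(Y|X, v~) for each pair in D^(pr)
    """
    loss_c = -sum(math.log(p) for p in probs_forward)
    loss_pr = -sum(math.log(p) for p in probs_bidirectional)
    return loss_c + loss_pr


# Hypothetical probabilities: two profile-free pairs and one profile pair.
loss = generation_loss(probs_forward=[0.5, 0.25], probs_bidirectional=[0.125])
```

Each example contributes to exactly one of the two sums, which is why the combined likelihood splits cleanly by subset.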

4.Experiment

Appendix: Bidirectional Decoder

The bidirectional decoding in this paper follows the paper "Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation"; a few notes are available for reference:
https://github.com/xwzhong/papernote/blob/master/chatbot/Sequence%20to%20Backward%20and%20Forward%20Sequences%20A%20Content-Introducing%20Approach%20to%20Generative%20Short-Text%20Conversation.md

https://chuansongme.com/n/1613737151949

http://www.xuwei.io/2019/02/20/%E3%80%8Asequence-to-backward-and-forward-sequences-a-content-introducing-approach-to-generative-short-text-conversation%E3%80%8B%E8%AE%BA%E6%96%87%E7%AC%94%E8%AE%B0/