Special topic: seeking

SFT is mean-seeking; RL is mode-seeking.

KL divergence is written D(P||Q), and the order of P and Q matters: P comes first, so the samples are drawn from P.

$$D_{KL}(P||Q)=E_{x\sim P(x)}\left[\log\frac{P(x)}{Q(x)}\right]$$

Imagine that, for some x, Q=