Everything Has Its Price: How to Price Words and Phrases in Online Ad Bidding
This article sketches an NLP approach to pricing natural language words or phrases. It creatively leverages (1) word2vec, a model that learns the context of and associations between words from a given corpus; and (2) the Mondovo dataset, which provides the basic building blocks for bootstrapping our application. This solution has interesting applications in fields such as online ad bidding, online marketing, and search engine optimization. The article serves as an illustration of an initial baseline solution to the pricing problem, and readers eager to learn how I do it in practice, or who want a more in-depth treatment of the topic, are welcome to tune in for my follow-up publication.
People are quantifying everything. When we are unable to do that to something, we call it either worthless or mysterious, or dismiss it adroitly as hallucination; such is the case with things like love, loyalty, honesty etc.
The online ad bidding industry is definitely not an exception, and one of its biggest problems is how to come up with accurate bid prices for chosen ad keywords or phrases in order to secure hot ad spots on publishers’ websites. The quandary goes like this: if the bid price is set too high, you may well secure the ad spot, but you will also have to pay the hefty price you bid; if you set the bid price too low, chances are you will have a hard time getting that ad spot at all. Clearly, this delicate trade-off calls for creative solutions to the problem of quantifying words/phrases into prices.
Fortunately, we can rest assured of the resounding good news: words can be priced too! For this problem, we might not have the luxury of a well-crafted recipe like the Black-Scholes model for options pricing, but there are multiple ways by which we can take a crack at it.
In this article, I will sketch out a simple solution to the keyword pricing problem that makes basic use of a natural language processing technique called word2vec. The following sections will show how to handle the data, where to employ word2vec, how to transform our problem into a regression task, and finally the performance of the whole pipeline.
Let us get started.
Brief Intro to word2vec
It might be helpful to trace the evolution of statistical language models. At the beginning, we have the naive bag-of-words model, in which we treat each word in the corpus discretely: no context, no dependencies, just independent words. The best you can do with such a model is to come up with a chart of word frequencies.
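As a toy illustration (my own example, not part of the original dataset), the “chart of word frequencies” a bag-of-words model gives you is essentially just a word count:

from collections import Counter

# a made-up toy corpus, for illustration only
toy_corpus = "how much does it cost to advertise how much does it cost"
word_counts = Counter(toy_corpus.split())
print(word_counts.most_common(3))   # [('how', 2), ('much', 2), ('does', 2)]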
Next comes the n-gram model. Unigrams, namely individual words, are not that powerful, but we can extend to bigrams, trigrams, 4-grams and beyond, in which every N (2, 3, 4 or more) consecutive words are treated as a whole (as a single unit). Arguably, such models are able to capture word context of size N and enable us to make more sophisticated predictions and inferences. For example, we can easily build more powerful probabilistic state-transition models such as Markov chains, which support everyday applications like word auto-suggestion or autocomplete.
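To make the n-gram idea concrete, here is a minimal bigram sketch of the kind of next-word suggestion mentioned above; the toy corpus is made up for illustration and is not part of our dataset:

from collections import Counter, defaultdict

toy_corpus = "how much does it cost how much is it"   # made-up example text
tokens = toy_corpus.split()

# count bigram transitions: current word -> Counter of next words
transitions = defaultdict(Counter)
for current_word, next_word in zip(tokens, tokens[1:]):
    transitions[current_word][next_word] += 1

# suggest the most likely continuation of 'how'
print(transitions['how'].most_common(1))   # [('much', 2)]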
In contrast, word embedding is a family of language models in which words or phrases from the vocabulary are described/represented using vectors, and word2vec is one of the most popular techniques for doing so. Generally speaking, it uses neural networks to learn word associations/relations from a given corpus, and uses vectors of a given length to represent each word such that the semantic similarity between words correlates with the vector similarity between their vector representations. The Wikipedia page provides a good initial pointer, and for a more in-depth treatment of this topic, please stay tuned for my future posts.
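To give a feel for what “semantic similarity correlates with vector similarity” means in practice, here is a minimal sketch using gensim’s KeyedVectors interface; it assumes the pre-trained model wv that we will load in the Model Import section below, and the example words are arbitrary:

# wv is the pre-trained gensim KeyedVectors model loaded later in this article
print(wv.similarity('cheap', 'inexpensive'))    # cosine similarity between the two word vectors
print(wv.most_similar('advertising', topn=3))   # the 3 nearest neighbors in vector space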
Data Processing
This is an extremely important step. In order for us to come up with any model, we need data first. Further, in order for our model to learn any meaningful relationship among the data, we want the data to contain sample mappings from natural language words to prices. Unfortunately, there are not many such datasets available on the Internet, and the one I was able to find is from Mondovo. This specific dataset contains the top 1000 most asked questions on Google and their associated global cost-per-click values, which, although a fairly small dataset, provides the basic ingredients we need: words and their prices.
It is fairly easy to wrap the 1000 rows of data into a pandas dataframe with two columns: keyword and price, and let us call this dataframe df from now on.
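For concreteness, a minimal sketch of this step could look like the following; the file name mondovo_top_1000.csv and its raw column layout are hypothetical, so adjust them to however you saved the Mondovo table:

import pandas as pd

# hypothetical CSV export of the Mondovo top-1000-questions table
df = pd.read_csv('mondovo_top_1000.csv')
df.columns = ['keyword', 'price']   # keyword text and its global cost-per-click in dollars
print(df.shape)                     # expected: (1000, 2)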
Then let us perform the following step to ensure the order of the data is indeed randomized:
df = df.sample(frac=1).reset_index(drop=True)
That is it for our data preprocessing.
Model Import
Now let us concern ourselves a bit with word2vec. In this task, instead of learning a word vector representation from our own corpus, namely the 1000 phrases, we will rely on a ready-to-use vector representation. The following code snippet loads an out-of-the-box solution from Google:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')
According to this source, the model was built on ‘pre-trained Google News corpus (3 billion running words), (and contains) word vector model (3 million 300-dimension English word vectors)’.
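A quick sanity check on the loaded model (the lookup word is arbitrary, and minor attribute differences exist across gensim versions):

vec = wv['advertising']      # vector for a single word
print(vec.shape)             # expected: (300,)
print('advertising' in wv)   # the membership test we will rely on later to skip missing words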
From Word to Sentence
Here is the catch: the word2vec model contains only vector representations of individual words, but we need vector representations of short sentences/phrases like those in our dataset.
There are at least three ways to get around this problem:
(1) Take the average of the vectors of all words in the short sentence;
(2) Similarly, take the average, but weight each vector using the idf (inverse document frequency) score of the word;
(3) Use doc2vec, instead of word2vec.
Here I am curious to see how a baseline model would perform, so let us go with (1) for now and leave the other options for future explorations.
The following code snippet will provide a straightforward example to implement the averaging function:
import numpy as np

def get_avg(phrase, wv):
    vec_result = []
    tokens = phrase.split(' ')
    for t in tokens:
        if t in wv:
            vec_result.append(wv[t].tolist())
        else:
            # 300 is the dimension of the Google wv model
            vec_result.append([0.0] * 300)
    return np.average(vec_result, axis=0)
Please note that the if condition is necessary because certain ‘stop-words’ (those extremely common and generally uninformative words in a given language; in English, think of ‘the’, ‘it’, ‘which’, etc.) have been excluded from the Google model. In the snippet above, I took some leeway and skipped dealing in detail with the topic of missing words or stop-words. A more in-depth discussion will follow in my future posts. Please stay tuned!
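As a quick usage check (the phrase below is arbitrary), the function collapses any short phrase into a single 300-dimensional vector:

phrase_vec = get_avg('how to price ad keywords', wv)
print(phrase_vec.shape)   # expected: (300,)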
Regression Problem Setup
Remember that, fundamentally, almost all machine learning algorithms expect numerical inputs: e.g., in image processing problems, black-and-white pictures are fed to algorithms as matrices of 0s and 1s, and colored pictures as RGB tensors. Our problem is no exception, and that is why we took all the trouble to introduce word2vec.
With that in mind, let us build our feature matrix and target vector for use in machine learning algorithms:
X = np.array([get_avg(phrase, wv) for phrase in df['keyword']])
y = df['price']
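If everything went through, X and y should line up one row per phrase (assuming all 1000 rows of the dataset were kept):

print(X.shape)   # expected: (1000, 300)
print(len(y))    # expected: 1000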
And since we are predicting some numerical values, this is a regression problem. Let us choose some handy regression algorithm for this task:
from sklearn.ensemble import RandomForestRegressor

# leaving out all params tuning to show absolute baseline performance
reg = RandomForestRegressor(random_state=0)
Performance
Now we are finally able to see how our absolute baseline model performs. Let us set up a 10-fold cross validation scheme as follows:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# set up 10-fold Cross Validation
kf = KFold(n_splits=10)

# loop over each fold and retrieve the result
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    reg.fit(X_train, y_train)
    print(mean_absolute_error(y_test, reg.predict(X_test)))
In my experiment, running the code above gave MAE scores of 1.53, 0.98, 1.06, 1.23, 1.02, 1.01, 1.06, 1.19, 0.96 and 0.96, leading to an average MAE of 1.1, which means that on average our estimated price could deviate by about $1.1 from the true value.
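As a side note, essentially the same per-fold MAE scores can be obtained in a single call with scikit-learn’s cross_val_score, which reports MAE as a negated score; this is just an alternative sketch under the same setup, not how the numbers above were produced:

from sklearn.model_selection import cross_val_score

neg_mae_scores = cross_val_score(reg, X, y, cv=10, scoring='neg_mean_absolute_error')
print(-neg_mae_scores.mean())   # should land in the neighborhood of the 1.1 average reported above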
Given the scanty data available, the lack of word redundancy in the training data, the sparsity of in-sample data points, and our absolute-baseline setup without any parameter optimization, I am really impressed with how far we have been able to push our current methodology. It is not hard to imagine that avid readers doing their own experiments are certain to achieve better results.
Translated from: https://towardsdatascience.com/everything-has-its-price-how-to-price-words-for-ad-bidding-etc-7df38e1d152