Everything Has Its Price: How to Price Words and Phrases in Online Ad Bidding

2023-11-03 03:59

This article sketches an NLP approach to pricing natural-language words and phrases. It makes creative use of (1) word2vec, a model that learns the context of and associations between words from a given corpus; and (2) the Mondovo dataset, which provides the basic building blocks from which we bootstrap our application. The solution has interesting applications in fields such as online ad bidding, online marketing, and search engine optimization. This article illustrates an initial baseline solution to the pricing problem; readers eager to learn how I do it in practice, or who want a more in-depth treatment of the topic, are welcome to tune in for my follow-up publication.

People are quantifying everything. When we are unable to do that to something, we call it either worthless or mysterious, or dismiss it adroitly as hallucination; such is the case with things like love, loyalty, honesty etc.

The online ad bidding industry is no exception, and one of its biggest problems is how to come up with accurate bid prices for chosen ad keywords or phrases in order to secure hot ad spots on publishers' websites. The quandary goes like this: if the bid price is set too high, you are likely to get the ad spot, but you will also have to pay the hefty price you bid; if you set the bid price too low, chances are that you will have a hard time getting that ad spot at all. This delicate trade-off calls for creative solutions to the problem of turning words and phrases into prices.

Fortunately, we can rest assured of the resounding good news: words can be priced too! For this problem, we might not have the luxury of a well-crafted recipe like the Black-Scholes model for options pricing, but there are multiple ways by which we can take a crack at it.

In this article, I will sketch a simple solution to the keyword pricing problem that makes basic use of a natural language processing technique called word2vec. The following sections show how to handle the data, where to employ word2vec, how to turn our problem into a regression task, and finally how the whole pipeline performs.

Let us get started.

Photo by John King on Unsplash

Brief Intro to word2vec

It might be helpful to trace back the evolution of the statistical language models. At the beginning, we have the naive bag-of-words model, in which we treat each word in the corpus discretely; no context, no dependency, just independent words. And the best you can do with such a model is come up with a chart of word frequencies.

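As a rough illustration of what a bag-of-words model yields, the sketch below counts word frequencies with Python's standard library. The two sample phrases are made up for illustration and do not come from the Mondovo dataset.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for illustration only.
corpus = [
    "how to tie a tie",
    "how to lose weight",
]

# Bag of words: ignore order and context, just count each token.
word_counts = Counter(token for phrase in corpus for token in phrase.split())

print(word_counts.most_common(3))
```

Note that the counts carry no information about which words appear next to each other, which is exactly the limitation described above.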
Next comes the n-gram model. Unigrams, namely individual words, are not that powerful, but we can extend to bigrams, trigrams, 4-grams and beyond, in which every N (2, 3, 4 or more) consecutive words are treated as a single unit. Such models can capture a word context of size N and enable more sophisticated predictions and inferences. For example, we can easily build probabilistic state-transition models such as Markov chains, which support everyday applications such as word autosuggestion and autocomplete.
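To make the autosuggestion idea concrete, here is a minimal sketch of a bigram-based Markov chain that suggests the most frequent next word. The training phrases and the function name `suggest` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training phrases, invented for illustration only.
phrases = [
    "how to lose weight",
    "how to lose weight fast",
    "how to tie a tie",
]

# Count bigram transitions: next_counts[w1][w2] = times w2 follows w1.
next_counts = defaultdict(Counter)
for phrase in phrases:
    tokens = phrase.split()
    for current, following in zip(tokens, tokens[1:]):
        next_counts[current][following] += 1


def suggest(word):
    """Return the most frequent successor of `word`, or None if unseen."""
    if word not in next_counts:
        return None
    return next_counts[word].most_common(1)[0][0]


print(suggest("to"))
```

In this toy corpus "to" is followed by "lose" twice and by "tie" once, so the sketch suggests "lose". A real autocomplete system would use far more data and usually longer contexts.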

In contrast, word embedding is a family of language models in which words or phrases from the vocabulary are represented as vectors, and word2vec is one of the most popular techniques for doing that. Generally speaking, it uses neural networks to learn word associations from a given corpus and represents each word with a vector of a given length, such that the semantic similarity between words correlates with the similarity between their vectors. The Wikipedia page on word2vec is a good starting point; for a more in-depth treatment of this topic, please stay tuned for my future posts.
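Vector similarity is commonly measured with cosine similarity. The sketch below computes it by hand on made-up 3-dimensional vectors; the numbers are assumptions chosen so that the two near-synonyms point in a similar direction, and they are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings; the Google News vectors have 300 dims.
toy_vectors = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "giraffe": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["cheap"], toy_vectors["inexpensive"])
sim_far = cosine_similarity(toy_vectors["cheap"], toy_vectors["giraffe"])
print(sim_close > sim_far)
```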

Data Processing

This is an extremely important step. To come up with any model, we need data first. Further, for our model to learn any meaningful relationship, the data must contain sample mappings from natural language words to prices. Unfortunately, there are not many such datasets available on the Internet, and the one I was able to find is from Mondovo. This dataset contains the 1000 most asked questions on Google and their associated global cost-per-click values. Although fairly small, it provides the basic ingredients we need: words and their prices.

It is fairly easy to wrap the 1000 rows of data into a pandas dataframe with two columns: keyword and price, and let us call this dataframe df from now on.

Then let us shuffle the rows to ensure the order of the data is indeed randomized:

df = df.sample(frac=1).reset_index(drop=True)

That is it for our data preprocessing.

Model Import

Now let us concern ourselves a bit with word2vec. In this task, instead of learning word vector representations from our own corpus, namely the 1000 phrases, we will rely on a ready-to-use vector representation. The following code snippet loads an out-of-the-box model from Google:

import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

According to this source, the model was built on ‘pre-trained Google News corpus (3 billion running words), (and contains) word vector model (3 million 300-dimension English word vectors)’.

From Word to Sentence

Here is the catch: the model word2vec contains only vector representations of individual words, but we need vector representations of short sentences/phrases like those in our dataset.

There are at least three ways to get around this problem:

(1) Take the average of the vectors of all words in the short sentence;

(2) Similarly, take the average, but weight each vector by the idf (inverse document frequency) score of the word;

(3) Use doc2vec instead of word2vec.
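Option (2) can be sketched as follows. This is a minimal illustration, not the article's method: it assumes the plain idf variant log(N / df) computed over the phrases themselves, and it uses a hypothetical dictionary `toy_wv` of 2-dimensional vectors in place of the real model.

```python
import math

# Hypothetical phrases and 2-dimensional vectors, invented for illustration.
phrases = ["how to lose weight", "how to tie a tie", "what is love"]
toy_wv = {
    "how": [1.0, 0.0], "to": [1.0, 0.0], "lose": [0.0, 1.0],
    "weight": [0.0, 1.0], "tie": [0.5, 0.5], "a": [1.0, 0.0],
    "what": [1.0, 0.0], "is": [1.0, 0.0], "love": [0.0, 1.0],
}


def idf_scores(docs):
    """idf(w) = log(N / df(w)), where df(w) counts the phrases containing w."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for token in set(doc.split()):
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def idf_weighted_avg(phrase, wv, idf, dim=2):
    """Average the word vectors of `phrase`, weighting each by its idf."""
    total = [0.0] * dim
    weight_sum = 0.0
    for token in phrase.split():
        if token in wv and token in idf:
            w = idf[token]
            total = [t + w * x for t, x in zip(total, wv[token])]
            weight_sum += w
    # Fall back to the zero vector when no token carries any weight.
    return [t / weight_sum for t in total] if weight_sum > 0 else total


idf = idf_scores(phrases)
print(idf_weighted_avg("how to lose weight", toy_wv, idf))
```

Because "how" and "to" appear in two of the three phrases, they receive a lower weight than "lose" and "weight", so the result leans toward the rarer, more informative words instead of sitting at the plain average.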

Here I am curious to see how a baseline model would perform, so let us go with (1) for now and leave the other options for future explorations.

The following code snippet will provide a straightforward example to implement the averaging function:

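import numpy as np

def get_avg(phrase, wv):
    # Collect one 300-dimensional vector per token in the phrase
    vec_result = []
    tokens = phrase.split(' ')
    for t in tokens:
        if t in wv:
            vec_result.append(wv[t].tolist())
        else:
            # 300 is the dimension of the Google wv model;
            # tokens missing from the model contribute a zero vector
            vec_result.append([0.0]*300)
    return np.average(vec_result, axis=0)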

Please note that the if condition is necessary because certain 'stop-words' (extremely common and generally uninformative words in a given language; in English, think of 'the', 'it', 'which', etc.) have been excluded from the Google model. In the snippet above, I took some leeway and skipped a detailed treatment of missing words and stop-words. A more in-depth discussion will follow in my future posts. Please stay tuned!
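One consequence of the zero-vector fallback is that every missing word pulls the average toward the origin. The sketch below contrasts it with an alternative that averages only the known tokens. This alternative is not the article's approach, and `toy_wv` is a plain dictionary of made-up 2-dimensional vectors standing in for the real model.

```python
# Hypothetical 2-dimensional vectors standing in for the Google News model;
# "the" is deliberately absent, mimicking a word missing from the model.
toy_wv = {"cheap": [1.0, 3.0], "flights": [3.0, 1.0]}


def avg_with_zeros(phrase, wv, dim=2):
    """Mirror of the article's approach: unknown tokens count as zero vectors."""
    vecs = [wv.get(t, [0.0] * dim) for t in phrase.split(" ")]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def avg_known_only(phrase, wv, dim=2):
    """Alternative: average only the tokens the model actually knows."""
    vecs = [wv[t] for t in phrase.split(" ") if t in wv]
    if not vecs:  # no known token at all: fall back to the zero vector
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


print(avg_with_zeros("the cheap flights", toy_wv))
print(avg_known_only("the cheap flights", toy_wv))
```

With the zero fallback, the phrase vector shrinks by a third because one of its three tokens is unknown; averaging only the known tokens keeps the magnitude unchanged. Which behavior suits the regression better is an empirical question.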

Photo by Mika Baumeister on Unsplash

Regression Problem Setup

Remember that, fundamentally, almost all machine learning algorithms expect numerical inputs: in image processing problems, for example, black-and-white pictures are fed to algorithms as matrices of 0s and 1s, and color pictures as RGB tensors. Our problem is no exception, and that is why we took all the trouble to introduce word2vec.

With that in mind, let us build our feature matrix and target vector for use in machine learning algorithms:

X = np.array([get_avg(phrase, wv) for phrase in df['keyword']])
y = df['price']

And since we are predicting some numerical values, this is a regression problem. Let us choose some handy regression algorithm for this task:

from sklearn.ensemble import RandomForestRegressor

# leaving out all params tuning to show absolute baseline performance
reg = RandomForestRegressor(random_state=0)

Performance

Now we are finally able to see how our absolute baseline model performs. Let us set up a 10-fold cross validation scheme as follows:

from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error

# set up 10-fold cross validation
kf = KFold(n_splits=10)

# loop over each fold and print its MAE
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # use positional indexing on the pandas Series
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    reg.fit(X_train, y_train)
    print(mean_absolute_error(y_test, reg.predict(X_test)))
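By default, scikit-learn's KFold does not shuffle, so each test fold is a contiguous block of rows; this is why shuffling the dataframe beforehand matters. The pure-Python sketch below imitates that splitting behavior to show what the folds look like. The helper name `kfold_indices` is hypothetical and is not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# With 1000 rows and 10 folds, every test fold holds 100 consecutive rows.
folds = list(kfold_indices(1000, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # prints: 10 900 100
```

Each row lands in exactly one test fold, so every keyword is predicted once by a model that never saw it during training.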

In my experiment, running the code above gave MAE scores of 1.53, 0.98, 1.06, 1.23, 1.02, 1.01, 1.06, 1.19, 0.96 and 0.96, for an average MAE of 1.1. In other words, our estimated price deviates from the true value by about $1.10 on average.
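The reported average can be checked directly from the ten per-fold scores. The short sketch below also computes the spread across folds, which the article does not report but which gives a sense of how stable the estimate is.

```python
from statistics import mean, stdev

# Per-fold MAE scores as reported in the article.
fold_mae = [1.53, 0.98, 1.06, 1.23, 1.02, 1.01, 1.06, 1.19, 0.96, 0.96]

avg_mae = mean(fold_mae)
spread = stdev(fold_mae)  # sample standard deviation across folds

print(f"mean MAE = {avg_mae:.2f}, std across folds = {spread:.2f}")
```

The mean comes out at 1.10, matching the figure quoted above; the first fold (1.53) is noticeably worse than the rest, which accounts for most of the spread.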

Given the scanty data available, the lack of word redundancy in the training data, the sparsity of in-sample data points, and our absolute baseline assumptions without any parameter optimization, I am really impressed with how far we have been able to push with our current methodology. It is not hard to imagine that avid readers doing their own experiments will achieve better results.

Translated from: https://towardsdatascience.com/everything-has-its-price-how-to-price-words-for-ad-bidding-etc-7df38e1d152
