Machine Learning Systems
An exciting development in the past few years is the proliferation of machine learning in products. We now see state-of-the-art computer vision models being deployed on mobile phones. State-of-the-art natural language processing models are being used to improve search.
While there have been many articles on new, exciting machine learning algorithms, there aren’t as many on productionizing machine learning models. There has been growing interest in the engineering of these systems, as seen in the ScaledML conference and the birth of MLOps. However, productionizing machine learning models is a skill that is still in short supply.
In my career so far, I’ve gotten a few hard-earned lessons in productionizing machine learning models. Here are some of them.
Build or Buy?
Twitter has recently been abuzz about GPT-3. Primed with the right seed, it turns out that GPT-3 can do many amazing things, such as generating buttons on webpages and React code. Spurred by these exciting developments, your stakeholder has an idea: what if you build a GPT-3 equivalent to help speed up frontend development at the company you work at?
Should you train your own GPT-3 model?
Well, no! GPT-3 is a 175 billion parameter model that requires resources that most companies outside of the likes of Google and Microsoft would be hard-pressed to muster. Since OpenAI is trialling a beta of its GPT-3 API, it would be more prudent to try it out and see whether the model solves your use case and whether the API scales to it.
So then, to buy (more like rent) or build? It depends.
It may look like a cop-out answer, but it is an age-old question in the software industry. It is also such a large and complicated topic, that it deserves a post of its own.
However, here I can offer some guidelines that can help you discuss the build vs. buy question with a stakeholder. The first is
Is this use case not a core competence? Is it generic enough for an existing/potential AWS¹ offering?
The underlying principle of this question is to ask yourself if what you’re about to work on is part of the company’s core competence. If it is not, and your stakeholders can afford it, it may be worth buying an off-the-shelf solution.
Are resources besides data available? E.g., talent and infrastructure?
At its heart, building and deploying a reliable machine learning system is still mostly software engineering, with an added twist: data management. If the team doesn’t have experience building models as service APIs, or collecting, cleaning, labelling and storing data, then the amount of time needed can be prohibitive, say about 1 to 2 years (or even more). By contrast, deploying a third-party solution could take 3 to 6 months, since it’s mostly about integrating the solution into the product. Business owners would find these savings of time and cost extremely attractive.
On the other hand, if it’s related to the company’s core competence, then investing in building up a team that can design and deploy models based on the company’s proprietary data is the better way to go. It allows much more control and understanding of the process, though it does come with a high initial investment overhead.
Just remember
All that matters to a business owner is: how will the output of your model be useful in the context of the business?
Broadly speaking, machine learning is the next step up in automation. Thinking of it this way, your model either helps to increase revenue or improves business efficiency by reducing costs.
For instance, integrating a better recommendations system should lead to higher user satisfaction and engagement. Higher satisfaction leads to more subscriptions and organic promotion, which leads to higher revenue and lower churn.
As a data scientist, you have to help the business owner make the best decision on whether to build or buy, and that means assessing the specific pros and cons of either approach in the context of the business.
The points in this article are very useful in approaching the “build vs. buy” question, and they also apply to machine learning in production. There’s a case for building it yourself, as argued here, if it is a core business competence.
It all starts from the source
Consider the search results for a user who searched for photos of puppies. Here we have an example from Unsplash, one from Shutterstock, and finally one from EyeEm, this time on a mobile device.
We can see that each user interface (UI) displays a wide selection of images, but each lays them out differently. Suppose we want to use clicks on the images to train a model that gives better search results.
Which image would you be drawn to first if you were a user?
In the Shutterstock example, you may first be drawn to the puppy on the rightmost side of the window, but in the EyeEm example, it could be the image in the center.
Clearly, differences between the UIs cause differences in the distribution of the data collected.
Information retrieval literature has demonstrated that positional bias² and presentation bias³ exist when displaying results. It is entirely possible for users to click on the first ranked result, even if it’s not relevant to their search query, simply because it’s the first thing they see, and they might be curious about the result!
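One way to check whether this bias might be present in your own logs is to compute the click-through rate (CTR) at each rank position. The following is a minimal sketch, not something from the original analysis: the log format of `(position, clicked)` tuples and the sample values are assumptions made purely for illustration.

```python
from collections import defaultdict


def ctr_by_position(impressions):
    """Return {position: click-through rate} from (position, clicked) records."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        if was_clicked:
            clicked[position] += 1
    return {pos: clicked[pos] / shown[pos] for pos in sorted(shown)}


# Hypothetical log: (rank position on the results page, whether it was clicked).
log = [
    (1, True), (1, True), (1, False), (1, True),
    (2, True), (2, False), (2, False), (2, False),
    (3, False), (3, False), (3, True), (3, False),
]

rates = ctr_by_position(log)
for pos, rate in rates.items():
    print(f"position {pos}: CTR = {rate:.2f}")
```

A steep drop in CTR after the first position is only a hint of positional bias, since top-ranked results also tend to be more relevant. Separating the two effects usually needs an experiment, such as occasionally shuffling the order of results.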
We can see that the click data used to train search ranking models will have these biases in them, driven by how the UI was designed. So,
Know where and how the model will ultimately be used in the product
A data scientist and machine learning engineer cannot build models in isolation from where the model will ultimately be used. A consequence of this is that a data scientist has to work with all manner of specialists involved in the product: the product manager, the designer, the engineers (frontend and backend) and quality assurance engineers.
Cross-functional teamwork provides several benefits. One is briefing all team members on the limitations of the model and how they affect UI design, and vice versa. Another is finding simple, reliable ways to solve problems, for instance, not using machine learning to solve an inherent UI (and user experience) issue. Designers can also come up with ways to make the presentation of results from the model look amazing, as well as constrain user actions in the UI to limit the space of inputs that go into a model.
A data scientist must build up domain knowledge about the product, including why certain UI decisions were made, and how they can impact your data downstream if product changes are made. The latter point is especially important, since it can mean your model is outdated very quickly when product changes are rolled out.
Moreover, how the results from the model are presented is important so as to minimize bias in future training data, as we shall see next.
Data collection is still challenging
Analytics collection typically depends on engineers setting up analytics events in the services they are responsible for across various parts of the product. Sometimes analytics are not collected, or are collected with noise and bias in them. Analytics events can be dropped, resulting in missing data. Clicks on a button may not be debounced, resulting in multiple duplicate events.
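Fixing debouncing at the source is the better option, but duplicates already in the logs can often be cleaned up downstream. The sketch below drops repeated clicks by the same user on the same button within a short window. The event shape `(user_id, button_id, timestamp_ms)`, the function name and the 500 ms window are all assumptions for illustration.

```python
def drop_duplicate_clicks(events, window_ms=500):
    """Drop a click if the same user clicked the same button less than
    `window_ms` milliseconds after the last click that was kept."""
    last_kept = {}  # (user_id, button_id) -> timestamp of the last kept click
    kept = []
    for user_id, button_id, ts_ms in sorted(events, key=lambda e: e[2]):
        key = (user_id, button_id)
        if key in last_kept and ts_ms - last_kept[key] < window_ms:
            continue  # treat as a duplicate of the previously kept click
        last_kept[key] = ts_ms
        kept.append((user_id, button_id, ts_ms))
    return kept


# Hypothetical click events.
events = [
    ("u1", "download", 1000),
    ("u1", "download", 1100),  # 100 ms after the kept click: dropped
    ("u1", "download", 1250),  # 250 ms after the kept click: dropped
    ("u2", "download", 1200),  # different user: kept
    ("u1", "download", 2000),  # 1000 ms after the kept click: kept
]

cleaned = drop_duplicate_clicks(events)
print(cleaned)
```

The right window depends on the product, so it is worth agreeing on it with the engineer who owns the event.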
Machine learning models are, unfortunately, typically downstream consumers of this analytics data, which means problems with measurement and data collection have a huge impact on the models.
If there’s one thing to remember about data it’s that
Data is a byproduct of measurement
This means that it is important to understand where the data is coming from, and how it is being collected. It is also then paramount that measurements are correctly done, with tests on the measurement system conducted periodically to ensure good data quality. It would also be prudent to work with an engineer and a data analyst to understand the issues first-hand because there’s usually a “gotcha” in collected data.
Stored data needs to be encoded to suit the business domain. Take image storage, for instance. In most domains, lossy JPEG may be fine for encoding your stored images, but not in domains where very high resolution is needed, such as using machine learning to detect anomalies in MRI scans. It’s important that these issues be communicated up-front to the engineers who are helping you collect the data.
Machine learning in research differs from machine learning in the product in that generally your biggest wins come from better data, not so much better algorithms. There are exceptions, though, such as deep learning improving performance metrics in many areas. Even then, these algorithms are still sensitive to the training data, and you need a whole lot more data.
It is always worth convincing your stakeholder to continually invest in improving the data collection infrastructure. Sometimes, even just a new source of data for constructing new features can boost model performance metrics. Not only does it help with better models, it also helps with better tracking of the company’s key metrics and product features overall. The data analysts you work with will be very thankful for that.
As an ex-Dropbox data scientist told me recently
Your biggest improvements usually come from sitting down with an engineer to improve analytics
Check your assumptions on the data
Here is one story I experienced first-hand from the trenches.
During the early days of incorporating learning-to-rank models to improve image search at Canva, we planned to use a set of relevance judgments sourced from Mechanical Turk. Canva has a graphic design editor tool that allows a user to search over 50 million images to include in their design.
Basically, we collected sets of pairwise judgements: for each search term, raters are given a pair of images and give a thumbs up to one of them. An example pairwise judgement rating for the term firefly is shown below. Although both are relevant, the rater preferred the image on the left over the image on the right.
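To make the shape of this data concrete, here is a minimal sketch of how pairwise judgements could be represented and turned into training examples using the common pairwise transform (feature differences with a binary label). It is not necessarily what was used at Canva: the class, the function, the image names and the feature values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairwiseJudgement:
    query: str
    preferred_image: str  # the image the rater gave a thumbs up to
    other_image: str


def to_training_pairs(judgements, features):
    """Turn each judgement into two labelled feature-difference examples.

    `features[(query, image)]` is a list of floats. The label is 1 when the
    difference is (preferred - other) and 0 when it is (other - preferred).
    """
    examples = []
    for j in judgements:
        a = features[(j.query, j.preferred_image)]
        b = features[(j.query, j.other_image)]
        diff = [x - y for x, y in zip(a, b)]
        examples.append((diff, 1))
        examples.append(([-d for d in diff], 0))
    return examples


# Hypothetical judgement and query-image features.
judgements = [PairwiseJudgement("firefly", "img_left", "img_right")]
features = {
    ("firefly", "img_left"): [1.0, 0.25],
    ("firefly", "img_right"): [0.5, 0.75],
}

pairs = to_training_pairs(judgements, features)
print(pairs)
```

Any binary classifier can then be trained on these examples, and its score on a single image's features can be used to rank results for a query.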
We then trained a model using these pairwise judgements, and another model that relied solely on clicks from search logs. The latter has more data, but the data is of lower quality due to positional bias and errant clicks. Both models looked great on offline ranking metrics and in a visual check of the results (i.e., manually, using our eyes).
In online experiments, however, the model that relied on these sourced relevance judgements had a very poor performance, tanking business metrics, compared to the control and the other model.
What happened?
As it turned out, we sourced the judgements without the context in which users search: they look for images that fit their design, not for pure search relevance alone. This missing context contributed to a dataset whose distribution differed from the data distribution in the product.
Needless to say, we were wrong in our assumption.
Data distribution skew is a very real problem: even the team that implemented the Quick Access functionality in Google Drive at Google faced it⁴. In this case, the distribution of data collected for their development environment did not match the data distribution of the final deployed environment, as the training data was not collected from a production service.
The data used to train your models should match the final environment where your model will be deployed
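One simple way to test this assumption is to compare the distribution of a feature in the training set against the same feature logged in production. The sketch below uses the population stability index (PSI), a technique I am swapping in for illustration; it is not mentioned in the cited paper. The bin edges, the sample values and the `eps` floor are assumptions, and both samples are assumed to be non-empty.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between two samples of one numeric feature.

    `bin_edges` are the inner boundaries of the bins. Empty bins are floored
    at `eps` so that the logarithm stays defined.
    """
    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # The bin index is the number of edges the value has passed.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical feature values from training data and from production logs.
training = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
production = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.85]
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(training, training, edges)
psi_shifted = population_stability_index(training, production, edges)
print(f"same data: {psi_same:.3f}, shifted data: {psi_shifted:.3f}")
```

A PSI of zero means the binned distributions match, and larger values mean a larger shift. What counts as too large is a judgment call for each feature.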
Other assumptions to check include
the appearance of a concept drift, where a significant shift in the data distribution happens (a very recent example is the COVID-19 outbreak messing with prediction models),
the predictive power of features going into the model as some features decay in predictive power over time, and
product deprecations, resulting in features disappearing.
These are partially solvable by having monitoring systems in place to check data quality and model performance metrics, as well as flag anomalies. Google has provided an excellent checklist on scoring your current production machine learning system here, including data issues to watch out for.
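As a small illustration of such monitoring, the sketch below flags features that are missing too often in recent data, which is one symptom of a product deprecation. The row format (a list of dicts), the feature names and the 20% threshold are assumptions, and the function expects at least one row.

```python
def find_degraded_features(rows, feature_names, max_missing_rate=0.2):
    """Return {feature: missing rate} for features whose missing rate
    exceeds `max_missing_rate`. A value counts as missing when the key
    is absent or its value is None."""
    flagged = {}
    for name in feature_names:
        missing = sum(1 for row in rows if row.get(name) is None)
        rate = missing / len(rows)
        if rate > max_missing_rate:
            flagged[name] = rate
    return flagged


# Hypothetical recent feature rows pulled from production logs.
rows = [
    {"query_length": 2, "image_age_days": 10},
    {"query_length": 1, "image_age_days": None},
    {"query_length": 3},
    {"query_length": 2, "image_age_days": None},
]

alerts = find_degraded_features(rows, ["query_length", "image_age_days"])
print(alerts)
```

In practice a check like this would run on a schedule and feed an alerting system, alongside checks on model performance metrics.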
Remember, if a team at Google, whose maturity in deploying machine learning systems is one of the best in the world, had data assumption problems, you should certainly double-check your assumptions about your training data.
Wrapping it all up
Machine learning systems are very powerful systems. They provide companies with new options in creating new product features, improving existing product features, and improving automation.
However, these systems are still fragile, as there are more considerations about how they are designed, and how data is collected and used to train models.
I still have more to say on the topic, but these will do for now.
[1] Substitute with your favorite cloud provider.
[2] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay, Accurately Interpreting Clickthrough Data As Implicit Feedback, In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp.154–161 (2005).
[3] Yisong Yue, Rajan Patel, and Hein Roehrig, Beyond Position Bias: Examining Result Attractiveness As a Source of Presentation Bias in Clickthrough Data, In Proc. of the 19th International Conference on World Wide Web (WWW), pp. 1011–1018 (2010).
[4] Sandeep Tata, Alexandrin Popescul, Marc Najork, Mike Colagrosso, Julian Gibbons, Alan Green, Alexandre Mah, Michael Smith, Divanshu Garg, Cayden Meyer, Reuben Kan, Quick Access: Building a Smart Experience for Google Drive, KDD ’17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.1643–1651 (2017)
Source: https://towardsdatascience.com/getting-your-machine-learning-model-out-to-the-real-world-30c550876174