OpenShift 4 - 在 Jupyter Notebook 中使用 Elyra 执行 AI 处理流水线

本文主要是介绍OpenShift 4 - 在 Jupyter Notebook 中使用 Elyra 执行 AI 处理流水线，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

《OpenShift / RHEL / DevSecOps 汇总目录》
说明：本文已经在 OpenShift 4.14 + RHODS 2.50 的环境中验证

说明：请先根据《OpenShift 4 - 管理和使用 OpenShift AI 运行环境》一文完成 MinIO 的安装。
注意：如无特殊说明，和 OpenShift AI 相关的 Blog 均无需 GPU。

什么是 Elyra

Elyra 是 JupyterLab Notebook 的扩展插件。它提供了一个可视化管道编辑器，用于从 Python 和 R 脚本以及 Jupyter Notebook 构建管道，从而有助于简化将多个文件转换为批处理作业或工作流的过程。

Elyra 的管道由节点组成，节点之间相互连接，以定义执行依赖关系。为了组装管道，Elyra 的可视化管道编辑器支持将文件拖放到画布上并定义其依赖关系。在组装好管道并准备运行后，Elyra 编辑器会生成 Tekton YAML 定义，并将其提交给 OpenShift Pipeline 执行。

准备 Elyra 管道运行环境

使用定制镜像

用管理员登录 OpenShift AI 控制台，然后进入 Settings > Notebook image settings。
在 Notebook image settings 页面点击 Import new image 按钮，然后在 Import notebook image 窗口按以下配置导入镜像。
Repository: quay.io/mmurakam/workbenches:fraud-detection-v1.0.1
Name: Fraud detection workbench

准备模型

在 MinIO 中创建名为 fraud-detection 的 Bucket。
从 https://github.com/mamurak/os-mlops-artefacts/tree/fraud-detection-model-v0.1/models/fraud-detection 和 https://github.com/mamurak/os-mlops-artefacts/tree/fraud-detection-model-v0.1/data/fraud-detection 分别下载 model-latest.onnx 和 live-data.csv 文件，然后上传到 MinIO 的 fraud-detection 路径下。

配置 Workbench、 Data Connection 和 Cluster storage

使用一般用户在 OpenShift AI 控制台中创建名为 pipelines-example 的 Data Science Project。
在 pipelines-example 项目下按照以下配置创建名为 fraud-detection-workbench 的 Workbench。
将 Name 设置 fraud-detection-workbench
在 Notebook image 中为 Notebook image 选择 Fraud detection workbench
在 Deployment size 中为 Container size 选择 Small
分别为 fraud-detection 应用和运行的 Pipeline 创建名为 fraud-detection 和 offline-scoring-data-volume 的 Cluster storage，都为 5 GB。
按照以下配置创建 Data Connection，用来访问对象存储中的模型。
Name: fraud-detection
Access key: minio
Secret key: minio123
Endpoint: http://minio-service.minio.svc.cluster.local:9000
Bucket: fraud-detection
完成配置以后 fraud-detection 项目下的配置如下图：

创建 Elyra Server 运行环境

配置和运行 Elyra 管道的生命周期需要一个 Elyra Server 环境，Elyra Serve 包含以下组件和功能：

运行 Pipeline Server 的容器
用来保存 Pipeline 定义和运行结果的 MariaDB
用来定时运行 Pipeline 的调度器
A Persistent Agent to record the set of containers that executed as well as their inputs and outputs.
用来记录容器集执行的输入和输出的持久代理。

配置 Elyra Server 的过程如下：

在 pipelines-example 项目下按下图创建一个 Data Connection。其中名为 fraud-detection-pipelines 的 Bucket 会被自动创建。
Name: fraud-detection-pipelines
Access key: minio
Secret key: minio123
Endpoint: http://minio-service.minio.svc.cluster.local:9000
Bucket: fraud-detection-pipelines
在 pipelines-example 项目下点击 Create a pipeline server 按钮。
在 Configure pipeline server 窗口中选择其所使用的 Data Connection 为 fraud-detection-pipelines。
在配置好 Pipeline server 后，可以在 OpenShift 开发者视图中看到部署的相关资源如下图。

配置 Elyra 管道

打开 fraud-detection-workbench，然后在 Jupyter Notebook 中导入 https://github.com/RedHatQuickCourses/rhods-qc-apps.git。
进入 rhods-qc-apps/5.pipelines/elyra 目录，然后在 Jupyter Notebook 的 Launcher 中点击 Elyra 区域的 Pipeline Editor，然后将创建的文件改名为 offline-scoring.pipeline。
打开的 offline-scoring.pipeline 文件，然后将 rhods-qc-apps/5.pipelines/elyra 目录中的 5 个 py 文件拖拽到 offline-scoring.pipeline 文件的编辑区，然后按下图顺序将其连接起来。
进入左侧工具条的 Runtime Images，此时界面如下图。然后点击 “+” 来创建新的运行时镜像。
在 Add new Runtime Image runtime image 界面中按以下配置创建，然后 Save & Close。
Display Name: fraud detection runtime
Image Name: quay.io/mmurakam/runtimes:fraud-detection-v0.2.0
打开 Open Panel，然后在右侧进入 PIPELINE PROPERTIES。
在 Generic Node Defaults 区域将 Runtime Image 设为 fraud detection runtime。
增加 4 个 Kubernetes Secrets。将每个 Secret 的 Environment Variable 和 Secret Key 设为相同内容，分别为以下 4 组，并且将所有 Secret Name 设为 aws-connection-fraud-detection。
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_S3_ENDPOINT
AWS_S3_BUCKET
在 Pipeline 中选择 model_loading.py 节点，此时界面右侧区域会切换至 NODE PROPERTIES，然后将 Output Files 设为 model.onnx。
在 Pipeline 中依次选择 data_ingrestion.py、preprocessing.py、scoring.py、results_upload.py节点，然后在 NODE PROPERTIES 中按以下配置为每个节点增加一个 Data Volume。
Mount Path: /data
Persistent Volume Claim Name: offline-scoring-data-volume

运行Elyra 管道

在 offline-scoring.pipeline 界面的工具条中点击 Run pipeline 图标运行管道，然后在提示窗口接受缺省选项，最后点击 OK。
管道运行后，在下面成功提示的提示框点击 OK。注意：如果管道没有运行起来，可以先关闭再重启 Workbench。
在 OpenShift AI 控制台进入 Data Science Pipelines > Pipelines，可以看到管道的运行状态，如下图。
点击上图 Runs 下的 offline-scoring-1223032941 的链接，可以看到此次管道运行的详细情况，如下图。
另外也可在 OpenShift 开发者视图中的管道中看到 OpenShift Pipeline 的运行情况。注意：下图中名为 pipeline/runid 的标签和上图的 Run ID 有相同的内容。
在 MinIO 中查看 fraud-detection 和 fraud-detection-pipelines 存储桶生成的内容，分别是模型和管道生成的数据。