ceph rgw reshard (by quqi99)

2024-08-31 19:20
Tags: ceph quqi99 rgw reshard


Author: Zhang Hua  Published: 2024-08-31
Copyright notice: This article may be freely reproduced, but please cite the original source and author with a hyperlink and include this copyright notice (http://blog.csdn.net/quqi99)

Problem

While on support shift today I ran into a ceph problem: an OSD device went bad and client access was interrupted. The issue was resolved later, and I took the opportunity to learn a bit about it along the way.

Root cause

The bucket contained too many objects, so its metadata omap index grew far too large (6 million+ keys, about 60x more than recommended; objectcacher caches omap on the client side - https://blog.csdn.net/quqi99/article/details/140525441). This caused operation timeouts, the OSDs could no longer exchange heartbeats, they were reported as failed, and the OSDs kept flapping down and up.

#ceph-mon logs
2024-07-14T20:09:49.618+0000 'OSD::osd_op_tp thread 0x7f7d25ec3640' had timed out after 15.000000954s
...
#slow op comes from the pool xxx.rgw.buckets.index (id=31)
2024-07-14T20:18:08.389+0000 7f7d26ec5640  0 bluestore(/var/lib/ceph/osd/ceph-131) log_latency_fn slow operation observed for _remove, latency = 373.801788330s, lat = 6m cid =31.7c_head oid =#31:3e20eb48:::.dir.255acf83-1053-45db-8646-a1f05dee5002.1125451.6.8:head#
...
2024-07-14T08:41:02.450403+0000 osd.181 (osd.181) 159371 : cluster [WRN] Large omap object found. Object: 31:ff7b6861:::.dir.255acf83-1053-45db-8646-a1f05dee5002.1125451.4.4:head PG: 31.8616deff (31.ff) Key count: 2407860 Size (bytes): 851728316
...
2024-07-14T20:53:51.562499+0000 osd.38 (osd.38) 21985 : cluster [WRN] 3 slow requests (by type [ 'delayed' : 3 ] most affected pool [ 'xxx.rgw.buckets.index' : 3 ])
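
Before deciding how to reshard, it helps to confirm which buckets are behind the large omap warnings. A minimal diagnostic sketch (my-bucket is a placeholder name; the key-count threshold default may differ between releases):

# List the current large omap warnings; the .dir.<marker>.<shard> object names embed the bucket instance marker
ceph health detail | grep -i 'large omap'
# Per-bucket objects-per-shard fill status; buckets flagged OVER are candidates for resharding
radosgw-admin bucket limit check
# Shard count and object count of a specific bucket (my-bucket is a placeholder)
radosgw-admin bucket stats --bucket=my-bucket
# OSD warning threshold for omap keys per object (200000 by default on recent releases)
ceph config get osd osd_deep_scrub_large_omap_object_key_threshold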

In this situation the bucket should be resharded so that its index entries are spread evenly across more shards. Automatic resharding has been supported since the L (Luminous) release, and rgw dynamic resharding is enabled by default. However, with multisite enabled, resharding a bucket breaks the existing metadata mapping and that bucket can no longer sync its data, so a later Luminous PR disabled automatic resharding under multisite. For oversized omap under multisite the bucket has to be resharded manually, which carries significant risk in production. (See: https://www.cnblogs.com/dengchj/p/11424644.html). So the options for handling this problem are:

  • Disable deep-scrub on the index pool - ceph osd pool set {pool-name} nodeep-scrub 1 (see the sketch after this list)
  • Reshard the bucket manually
  • Upgrade ceph to the Reef release and enable the multisite dynamic resharding feature so that resharding happens automatically from then on
  • Add hardware: put the RGW metadata pools on NVMe or SSD OSDs
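
A minimal sketch for the first option, assuming the index pool name xxx.rgw.buckets.index seen in the logs above; deep-scrub should be re-enabled once the index has been resharded:

# Temporarily stop deep-scrub on the index pool while it still holds oversized omap objects
ceph osd pool set xxx.rgw.buckets.index nodeep-scrub 1
# Verify the pool flag (the pool line in 'ceph osd dump' should list nodeep-scrub)
ceph osd dump | grep 'xxx.rgw.buckets.index'
# Re-enable deep-scrub after resharding
ceph osd pool set xxx.rgw.buckets.index nodeep-scrub 0
# Check whether dynamic resharding is configured for the RGW daemons (it does not run under pre-Reef multisite)
ceph config get client.rgw rgw_dynamic_resharding
ceph config get client.rgw rgw_max_objs_per_shard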

Manual reshard procedure

1. On a node within the master zone of the master zone group, execute the following command:
# radosgw-admin bucket sync disable --bucket=BUCKET_NAME
Wait for sync status on all zones to report that data synchronization is up to date.
2. Stop ALL ceph-radosgw daemons in ALL zones.
3. On a node within the master zone of the master zone group, reshard the bucket.
# radosgw-admin bucket reshard --bucket=BUCKET_NAME --num-shards=NEW_SHARDS_NUMBER
4. Start ceph-radosgw daemons in the master zone to restore the client services.
// Important: step 5 deletes the whole bucket on the secondary zone. Confirm with the customer whether any data exists only in the secondary zone; if so, sync it to the primary zone first, otherwise there will be data loss.
5. On EACH secondary zone, execute the following:
# radosgw-admin bucket rm --purge-objects --bucket=BUCKET_NAME
6. Start ceph-radosgw daemons in the secondary zone.
7. On a node within the master zone of the master zone group, execute the following command:
# radosgw-admin bucket sync enable --bucket=BUCKET_NAME
The metadata synchronization process will fetch the updated bucket entry point and bucket instance metadata. The data synchronization process will perform a full synchronization.
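
Around the manual reshard it is worth checking multisite sync state before stopping the daemons and again after re-enabling sync. A sketch, reusing the BUCKET_NAME placeholder:

# Overall multisite sync status for the local zone (run in each zone)
radosgw-admin sync status
# Per-bucket sync status for the bucket being resharded
radosgw-admin bucket sync status --bucket=BUCKET_NAME
# After the reshard, confirm the new shard count (num_shards appears in bucket stats on recent releases)
radosgw-admin bucket stats --bucket=BUCKET_NAME | grep -i num_shards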

Upgrade ceph to reef

# Confirm source is set to distro
juju config ceph-mon source=distro
juju config ceph-osd source=distro
juju config ceph-radosgw source=distro
# Update monitors/managers to Reef channel
juju refresh ceph-mon --channel reef/stable
# Change to Reef installation source
juju config ceph-mon source=cloud:jammy-bobcat
# Monitors/managers will upgrade and restart one at a time. 
# Set 'noout' on the cluster
juju ssh ceph-mon/leader "sudo ceph osd set noout"
# Update OSDs to Reef channel, OSD restarts are possible due to bug.
juju refresh ceph-osd --channel reef/stable
# Update OSDs to Reef
juju config ceph-osd source=cloud:jammy-bobcat
# OSDs will restart
# Unset 'noout'
juju ssh ceph-mon/leader "sudo ceph osd unset noout"
# Update RGWs, this will cause a service interruption.
juju refresh ceph-radosgw --channel reef/stable
juju config ceph-radosgw source=cloud:jammy-bobcat
# Restore and test the DNS bucket certs and configuration.
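
A few sanity checks between the upgrade phases (a sketch, run via the ceph-mon leader as in the rest of this section):

# Every daemon should report a Reef (18.2.x) build once its phase has finished
juju ssh ceph-mon/leader "sudo ceph versions"
# Cluster health, and that no OSDs stayed down after the rolling restarts
juju ssh ceph-mon/leader "sudo ceph -s"
# Minimum OSD release recorded in the osdmap
juju ssh ceph-mon/leader "sudo ceph osd dump | grep require_osd_release"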

Add hardware: dedicate NVMe devices to RGW metadata

0) Stop OSDs on host
systemctl stop ceph-osd.target
1) Remove the caching devices from the bcache.
#!/bin/bash
# Disable the cache on each bcache
for i in $(ls -d /sys/block/sd*)
do
echo "Disabling caching on ${i}"
echo 1 > ${i}/bcache/detach
done
#!/bin/bash
# Wait for cache to drain for each bcache
echo "Waiting for cache devices to drain."
for i in $(ls -d /sys/block/sd*)
do
while [ "$(cat ${i}/bcache/state)" != "no cache" ]
do
echo "Cache still dirty on ${i}."
sleep 5
done
done
2) Unregister the caching devices
#!/bin/bash
# Unregister cache sets
for i in $(ls -d /sys/fs/bcache/*)
do
echo "Unregistering ${i}"
echo 1 > ${i}/unregister
done
#!/bin/bash
# Double check with wipefs
for i in $(ls -d /sys/block/nvme0n1/nvme*)
do
dev=$(echo ${i} | cut -d '/' -f 5)
echo "Wiping ${dev}"
wipefs -a /dev/${dev}
done
for i in $(ls -d /sys/block/nvme1n1/nvme*)
do
dev=$(echo ${i} | cut -d '/' -f 5)
echo "Wiping ${dev}"
wipefs -a /dev/${dev}
done
echo "Wiping /dev/nvme0n1"
wipefs -a /dev/nvme0n1
echo "Wiping /dev/nvme1n1"
wipefs -a /dev/nvme1n1
echo "Ready to reformat!"3) Reformat NVMes to 4k blocks, which wipes them.
nvme format --lbaf=1 /dev/nvme0n1
nvme format --lbaf=1 /dev/nvme1n1
4) Leave OSDs "bcached" with no cache.
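
Before re-adding the NVMes as OSDs it is worth double-checking that the caches really detached and that the 4k LBA format is in use. A rough sketch, assuming the same device names as above:

# Every bcache backing device should now report "no cache"
grep . /sys/block/sd*/bcache/state
# The LBA format marked "(in use)" should be the 4KiB one after the nvme format
nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'
nvme id-ns -H /dev/nvme1n1 | grep 'LBA Format'
# The NVMes should be empty and show a 4k logical sector size
lsblk -o NAME,SIZE,LOG-SEC,FSTYPE /dev/nvme0n1 /dev/nvme1n1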

part 2 - CRUSH changes

0) Delete the ceph-benchmarking pool.
sudo ceph config set mon mon_allow_pool_delete true
sudo ceph osd pool delete ceph-benchmarking --yes-i-really-really-mean-it
sudo ceph config set mon mon_allow_pool_delete false
1) Tweak CRUSH rules to target the HDDs to prevent NVMes from being used when added.
# EC
ceph osd erasure-code-profile set ssd-only k=5 m=3 crush-failure-domain=host crush-device-class=ssd
ceph osd crush rule create-erasure rgwdata ssd-only
ceph osd pool set default.rgw.buckets.data crush_rule rgwdata
ceph osd pool set xxx-backup.rgw.buckets.data crush_rule rgwdata
ceph osd pool set dev.rgw.buckets.data crush_rule rgwdata
ceph osd pool set velero.rgw.buckets.data crush_rule rgwdata
2) Add /dev/nvme*n1 devices to Juju OSD disks.
juju run-action ceph-osd/X zap-disk osd-devices="/dev/nvme0n1"
juju run-action ceph-osd/X zap-disk osd-devices="/dev/nvme1n1"
juju run-action ceph-osd/X add-disk osd-devices="/dev/nvme0n1"
juju run-action ceph-osd/X add-disk osd-devices="/dev/nvme1n1"3) Confirm addition of NVMe OSDs as NVMe
ceph osd tree
ceph osd crush tree --show-shadow
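
If the new OSDs do not come up with the expected nvme device class (for example they are auto-detected as ssd), the class can be corrected before applying the rules below; a sketch with osd.42 as a placeholder id:

# Show the device class assigned to each new OSD
ceph osd tree | grep nvme
# Reassign the class of a misdetected OSD (osd.42 is a placeholder)
ceph osd crush rm-device-class osd.42
ceph osd crush set-device-class nvme osd.42
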
4) Alter CRUSH rules to map the metadata pools onto the NVMes.
# Replicated
ceph osd crush rule create-replicated replicated-nvme default host nvme
ceph osd crush rule create-replicated replicated-nvme default host ssd
ceph osd pool set .mgr crush_rule replicated-nvme
ceph osd pool set default.rgw.control crush_rule replicated-nvme
ceph osd pool set default.rgw.data.root crush_rule replicated-nvme
ceph osd pool set default.rgw.gc crush_rule replicated-nvme
ceph osd pool set default.rgw.log crush_rule replicated-nvme
ceph osd pool set default.rgw.intent-log crush_rule replicated-nvme
ceph osd pool set default.rgw.meta crush_rule replicated-nvme
ceph osd pool set default.rgw.otp crush_rule replicated-nvme
ceph osd pool set default.rgw.usage crush_rule replicated-nvme
ceph osd pool set default.rgw.users.keys crush_rule replicated-nvme
ceph osd pool set default.rgw.users.email crush_rule replicated-nvme
ceph osd pool set default.rgw.users.swift crush_rule replicated-nvme
ceph osd pool set default.rgw.users.uid crush_rule replicated-nvme
ceph osd pool set default.rgw.buckets.extra crush_rule replicated-nvme
ceph osd pool set default.rgw.buckets.index crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.control crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.data.root crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.gc crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.log crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.intent-log crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.meta crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.otp crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.usage crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.users.keys crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.users.email crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.users.swift crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.users.uid crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.buckets.extra crush_rule replicated-nvme
ceph osd pool set xxx-backup.rgw.buckets.index crush_rule replicated-nvme
ceph osd pool set ceph-benchmarking crush_rule replicated-nvme
ceph osd pool set .rgw.root crush_rule replicated-nvme
ceph osd pool set velero.rgw.log crush_rule replicated-nvme
ceph osd pool set velero.rgw.control crush_rule replicated-nvme
ceph osd pool set velero.rgw.meta crush_rule replicated-nvme
ceph osd pool set dev.rgw.log crush_rule replicated-nvme
ceph osd pool set dev.rgw.control crush_rule replicated-nvme
ceph osd pool set dev.rgw.meta crush_rule replicated-nvme
ceph osd pool set dev.rgw.data.root crush_rule replicated-nvme
ceph osd pool set dev.rgw.gc crush_rule replicated-nvme
ceph osd pool set dev.rgw.intent-log crush_rule replicated-nvme
ceph osd pool set dev.rgw.otp crush_rule replicated-nvme
ceph osd pool set dev.rgw.usage crush_rule replicated-nvme
ceph osd pool set dev.rgw.users.keys crush_rule replicated-nvme
ceph osd pool set dev.rgw.users.email crush_rule replicated-nvme
ceph osd pool set dev.rgw.users.swift crush_rule replicated-nvme
ceph osd pool set dev.rgw.users.uid crush_rule replicated-nvme
ceph osd pool set dev.rgw.buckets.extra crush_rule replicated-nvme
ceph osd pool set dev.rgw.buckets.index crush_rule replicated-nvme
ceph osd pool set velero.rgw.data.root crush_rule replicated-nvme
ceph osd pool set velero.rgw.gc crush_rule replicated-nvme
ceph osd pool set velero.rgw.intent-log crush_rule replicated-nvme
ceph osd pool set velero.rgw.otp crush_rule replicated-nvme
ceph osd pool set velero.rgw.usage crush_rule replicated-nvme
ceph osd pool set velero.rgw.users.keys crush_rule replicated-nvme
ceph osd pool set velero.rgw.users.email crush_rule replicated-nvme
ceph osd pool set velero.rgw.users.swift crush_rule replicated-nvme
ceph osd pool set velero.rgw.users.uid crush_rule replicated-nvme
ceph osd pool set velero.rgw.buckets.extra crush_rule replicated-nvme
ceph osd pool set velero.rgw.buckets.index crush_rule replicated-nvme
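
After switching the rules, the assignments and the resulting data movement can be verified roughly as follows (a sketch):

# Confirm which rule a pool now uses (repeat per pool, or list them all at once)
ceph osd pool get default.rgw.buckets.index crush_rule
ceph osd pool ls detail | grep crush_rule
# Dump the rule to confirm it selects the nvme device class
ceph osd crush rule dump replicated-nvme
# Watch the backfill triggered by the rule change until the cluster returns to HEALTH_OK
ceph -s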

enable multisite resharding

radosgw-admin zonegroup modify --rgw-zonegroup=xxx --enable-feature=resharding
radosgw-admin period update --commit
radosgw-admin zone modify --rgw-zone=xxx --enable-feature=resharding
radosgw-admin period update --commit
radosgw-admin zone modify --rgw-zone=xxx-backup --enable-feature=resharding
radosgw-admin period update --commit
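
Once the periods are committed, the resharding feature should be listed under the zonegroup's enabled_features, and dynamic resharding activity can then be watched; a sketch (my-bucket is a placeholder):

# 'resharding' should appear in enabled_features after the period commit
radosgw-admin zonegroup get --rgw-zonegroup=xxx | grep -A3 enabled_features
# Buckets currently queued for dynamic resharding (empty output means nothing pending)
radosgw-admin reshard list
# Spot-check a bucket's shard count afterwards
radosgw-admin bucket stats --bucket=my-bucket | grep -i num_shards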
