CephFS: Common Commands and Troubleshooting

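Our production environment recently started using CephFS for file system storage. This post records the problems we ran into along the way, together with some commonly used commands.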

1. Common commands

1.1 ceph daemon mds.xxx help

ceph daemon is a very frequently used command for inspecting the state of Ceph's daemons. Its help subcommand lists every subcommand the MDS daemon supports:

$ sudo ceph daemon mds.cephfs-master1 help
{"cache status": "show cache status","config diff": "dump diff of current config and default config","config diff get": "dump diff get <field>: dump diff of current and default config setting <field>","config get": "config get <field>: get the config value","config help": "get config setting schema and descriptions","config set": "config set <field> <val> [<val> ...]: set a config variable","config show": "dump current config settings","dirfrag ls": "List fragments in directory","dirfrag merge": "De-fragment directory by path","dirfrag split": "Fragment directory by path","dump cache": "dump metadata cache (optionally to a file)","dump loads": "dump metadata loads","dump tree": "dump metadata cache for subtree","dump_blocked_ops": "show the blocked ops currently in flight","dump_historic_ops": "show slowest recent ops","dump_historic_ops_by_duration": "show slowest recent ops, sorted by op duration","dump_mempools": "get mempool stats","dump_ops_in_flight": "show the ops currently in flight","export dir": "migrate a subtree to named MDS","flush journal": "Flush the journal to the backing store","flush_path": "flush an inode (and its dirfrags)","force_readonly": "Force MDS to read-only mode","get subtrees": "Return the subtree map","get_command_descriptions": "list available commands","git_version": "get git sha1","help": "list available commands","log dump": "dump recent log entries to log file","log flush": "flush log entries to log file","log reopen": "reopen log file","objecter_requests": "show in-progress osd requests","ops": "show the ops currently in flight","osdmap barrier": "Wait until the MDS has this OSD map epoch","perf dump": "dump perfcounters value","perf histogram dump": "dump perf histogram values","perf histogram schema": "dump perf histogram schema","perf reset": "perf reset <name>: perf reset all or one perfcounter name","perf schema": "dump perfcounters schema","scrub_path": "scrub an inode and output results","session evict": "Evict a CephFS client","session ls": "Enumerate connected CephFS clients","status": "high-level status of MDS","tag path": "Apply scrub tag recursively","version": "get ceph version"
}

1.2 ceph daemon mds.xxx cache status

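This command shows how much memory the Ceph MDS cache is using. By default the cache is configured to use 1G of memory, but this is not a hard cap, and actual usage can exceed the configured value.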

$ sudo ceph daemon mds.cephfs-master1 cache status
{"pool": {"items": 321121429,"bytes": 25797208658}
}
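The raw byte count is hard to read at a glance. The following is a minimal Python sketch, not part of the original workflow, that converts the output above into GiB and an average size per cached item. The function name summarize_cache_status is a hypothetical helper, and the sample string is copied from the output above.

```python
import json

GIB = 1024 ** 3  # bytes in one GiB


def summarize_cache_status(raw: str) -> dict:
    """Return item count, size in GiB and average bytes per cached item."""
    pool = json.loads(raw)["pool"]
    items, size = pool["items"], pool["bytes"]
    return {
        "items": items,
        "gib": size / GIB,
        # Guard against an empty cache to avoid dividing by zero.
        "bytes_per_item": size / items if items else 0.0,
    }


# Sample copied from the `cache status` output shown above.
sample = '{"pool": {"items": 321121429, "bytes": 25797208658}}'
summary = summarize_cache_status(sample)
print(f"{summary['gib']:.2f} GiB in {summary['items']} items, "
      f"{summary['bytes_per_item']:.1f} bytes/item")
```

In practice the input string would come from the ceph daemon command rather than a literal.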

1.3 ceph mds stat

Shows the state of the MDS component. The output in the example below means there is only one MDS, and that it is up and working normally.

$ ceph mds stat
cephfs-1/1/1 up  {0=cephfs-master1=up:active}
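For monitoring scripts it can help to turn this line into structured data. Below is a rough Python sketch that extracts only the part inside the braces, read as rank=name=state. The helper name parse_mds_ranks is hypothetical, the comma-separated handling of multiple ranks is an assumption, and the line format may differ between Ceph releases.

```python
import re


def parse_mds_ranks(line: str) -> dict:
    """Map rank -> (daemon name, state) from the {...} part of the line."""
    match = re.search(r"\{(.*)\}", line)
    if not match:
        return {}
    ranks = {}
    # Assumption: multiple ranks, if present, are separated by commas.
    for entry in match.group(1).split(","):
        entry = entry.strip()
        if not entry:
            continue
        rank, name, state = entry.split("=", 2)
        ranks[int(rank)] = (name, state)
    return ranks


# Line copied from the `ceph mds stat` output shown above.
line = "cephfs-1/1/1 up  {0=cephfs-master1=up:active}"
ranks = parse_mds_ranks(line)
# Healthy here means: at least one rank, and every rank is up:active.
all_active = bool(ranks) and all(s == "up:active" for _, s in ranks.values())
print(ranks, all_active)
```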

1.4 ceph daemon mds.xxx perf dump mds

Shows the MDS performance counters.

$ sudo ceph daemon mds.cephfs-master1 perf dump mds
{"mds": {"request": 4812776,"reply": 4812772,"reply_latency": {"avgcount": 4812772,"sum": 4018.941028931,"avgtime": 0.000835057},"forward": 0,"dir_fetch": 170753,"dir_commit": 3253,"dir_split": 9,"dir_merge": 6,"inode_max": 2147483647,"inodes": 9305913,"inodes_top": 1617338,"inodes_bottom": 7688575,"inodes_pin_tail": 0,"inodes_pinned": 6995430,"inodes_expired": 13937,"inodes_with_caps": 6995443,"caps": 7002958,"subtrees": 2,"traverse": 5076658,"traverse_hit": 4835068,"traverse_forward": 0,"traverse_discover": 0,"traverse_dir_fetch": 91030,"traverse_remote_ino": 0,"traverse_lock": 109,"load_cent": 5356538,"q": 1,"exported": 0,"exported_inodes": 0,"imported": 0,"imported_inodes": 0}
}

1.5 ceph daemon mds.xxx dirfrag ls /

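This command lists the fragments (dirfrags) of a directory in the file system; the help output in 1.1 describes it as "List fragments in directory". In the example below the root directory has a single fragment, 0/0.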

$  sudo ceph daemon mds.cephfs-master1 dirfrag ls /
[{"value": 0,"bits": 0,"str": "0/0"}
]

1.6 ceph daemon mds.xxx session ls

This command lists the client sessions connected to CephFS.

$ sudo ceph daemon mds.cephfs-master1 session ls
[{"id": 9872,"num_leases": 0,"num_caps": 1,"state": "open","replay_requests": 0,"completed_requests": 0,"reconnecting": false,"inst": "client.9872 192.168.250.1:0/1887245819","client_metadata": {"entity_id": "k8s.training.cephfs-teamvolume-aaaaaa-pvc","hostname": "GPU-P100","kernel_version": "4.9.107-0409107-generic","root": "/prod/training/cephfs-teamvolume-aaaaaa-pvc"}},......
]
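When a client is reported for cache pressure, a useful first step is to see which sessions hold the most caps. The following Python sketch ranks sessions by num_caps. The helper name top_clients_by_caps is hypothetical; the first sample session uses values from the output above, while the second one (id 1, example-host) is invented purely for illustration.

```python
import json


def top_clients_by_caps(raw: str, limit: int = 5) -> list:
    """Return (id, hostname, num_caps) tuples, largest cap holders first."""
    sessions = json.loads(raw)
    ranked = sorted(sessions, key=lambda s: s["num_caps"], reverse=True)
    return [
        (s["id"], s.get("client_metadata", {}).get("hostname", "?"), s["num_caps"])
        for s in ranked[:limit]
    ]


# First entry: values from the output above. Second entry: hypothetical.
sample = """
[{"id": 9872, "num_caps": 1, "state": "open",
  "client_metadata": {"hostname": "GPU-P100"}},
 {"id": 1, "num_caps": 120, "state": "open",
  "client_metadata": {"hostname": "example-host"}}]
"""
top = top_clients_by_caps(sample)
for client_id, hostname, caps in top:
    print(f"client {client_id} on {hostname}: {caps} caps")
```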

2. Troubleshooting

2.1 Client cephfs-master1 failing to respond to cache pressure client_id: 9807

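This warning happened to appear right after I changed the MDS cache size, so at first I suspected that enlarging the cache had caused it. But after I restored the cache to its default value, the problem persisted. Searching the Ceph mailing list for similar reports, I found that this problem is usually caused by inode_max being set too low, so I checked the current inodes and inode_max values: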

$ sudo ceph daemon mds.cephfs-master1 perf dump mds
{"mds": {"request": 404611246,"reply": 404611201,"reply_latency": {"avgcount": 404611201,"sum": 9613563.153437701,"avgtime": 0.023760002},......"inode_max": 2147483647,"inodes": 3907095,......
}

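inodes is far smaller than inode_max, so this setting is not the problem either. Further searching showed that it is not only the number of inodes that can trigger this warning; inodes that have already expired matter as well.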

$ sudo ceph daemon mds.cephfs-master1 perf dump mds
{......"inodes_expired": 21999096501,......
}

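Sure enough, inodes_expired is already very large. Digging further, the main cause is that CephFS does not clean up expired inodes automatically, so after accumulating for a long time they tend to exhaust the available capacity. The fix is as follows: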

$ sudo vim /etc/ceph/ceph.conf
……
[client]
client_try_dentry_invalidate = false
……
$ sudo systemctl restart ceph-mds@cephfs-master1.service

2.2 MDS cache configuration

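The officially recommended MDS setup is currently still single-active, meaning that only one MDS in a cluster serves requests. Ceph MDS performs well, but it is still a single point, and the physical machine running our MDS has plenty of spare memory, so using memory as a cache to improve MDS performance is a natural idea. However, the MDS has many cache-related options, and it was not obvious which one to use or how large to make it.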

After sorting things out, I broke the cache configuration down into the following four smaller questions.

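Which option actually sets the cache size
Why the configured amount of memory is rarely reached
Why the MDS sometimes uses far more memory than the configured cache
How large the cache should be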

2.2.1 Which option actually sets the cache size

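There are two main options: mds_cache_size and mds_cache_memory_limit. mds_cache_size is the legacy parameter, measured in inodes; its current default is 0, which means unlimited. mds_cache_memory_limit is the recommended one, measured in bytes, with a default of 1G. So to adjust the cache size, mds_cache_memory_limit is the one to change.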

2.2.2 Why the configured amount of memory is rarely reached

For example, with mds_cache_memory_limit set to 30G (mds_cache_memory_limit = 32212254720), the cache usage observed at run time looks like this:

$ sudo ceph daemon mds.cephfs-master1 cache status
{"pool": {"items": 321121429,"bytes": 31197237046}
}

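The gap is small, but why does usage never quite reach the configured amount?
The reason is the mds_cache_reservation parameter. It tells the MDS to keep part of the cache memory in reserve. The reserve has no specific purpose other than leaving some headroom. When the MDS starts eating into this reserve, it automatically frees the portion that exceeds the quota.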

The default value of mds_cache_reservation is 5%, which explains what we observed.
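The arithmetic behind the reserve can be sketched in a few lines of Python. This is only an illustration under the figures given in this section (a 30G limit and a 5% reservation); the helper names cache_limit_bytes and reservation_threshold are hypothetical, and the observed value is the bytes field from the cache status output in this section.

```python
GIB = 1024 ** 3  # bytes in one GiB


def cache_limit_bytes(gib: float) -> int:
    """Convert a size in GiB to the byte value used by mds_cache_memory_limit."""
    return round(gib * GIB)


def reservation_threshold(limit_bytes: int, reservation: float = 0.05) -> int:
    """Bytes of cache usable before the MDS starts using its reserve."""
    return round(limit_bytes * (1 - reservation))


limit = cache_limit_bytes(30)             # 30G expressed in bytes
threshold = reservation_threshold(limit)  # point where the reserve begins
observed = 31197237046                    # "bytes" from cache status above

print(f"limit:     {limit}")
print(f"threshold: {threshold}")
print(f"observed:  {observed} ({observed / limit:.1%} of the limit)")
```

With these inputs the observed usage lies between the threshold and the limit, that is, inside the reserved region and below the configured value.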

2.2.3 Why the MDS sometimes uses far more memory than the configured cache

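At other times, however, the MDS uses far more memory than the configured cache. This is because mds_cache_memory_limit is not a hard upper bound: under certain conditions the process can exceed it at run time. It is therefore advisable not to set this value too close to the machine's total memory, otherwise the MDS may use up all of the server's memory.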

2.2.4 How large the cache should be

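The official documentation states explicitly that values above 64G are not recommended. The main reason is a Ceph bug: many users have found that above 64G the MDS is fairly likely to use far more memory than configured, and this bug has not been fixed yet.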

3. References

Why ceph status showing cephfs client failing to respond to cache pressure in RHCS
MDS CONFIG REFERENCE
Understanding MDS Cache Size Limits
