This article is part 1 of my Ceph architecture study notes; I hope it offers developers some useful reference. Follow along and let's learn together!
These notes work through the official Ceph architecture documentation at http://ceph.com/docs/master/architecture/
First, a look at Ceph's API types:
- librados is the library for RADOS, with bindings for C, C++, Java, Python, Ruby, and PHP;
- radosgw is a bucket-based REST gateway, compatible with the S3 and Swift APIs;
- rbd is a reliable, fully distributed block device, with a Linux kernel client and a QEMU/KVM driver;
- ceph-fuse is the module I am personally most interested in: it exposes the file system through FUSE. Later I will compare ceph-fuse, MooseFS, and HDFS in detail;
rados stands for Reliable, Autonomic, Distributed Object Store. Its killer feature is using the CRUSH algorithm to achieve high data availability, doing away with the namenode that other distributed file systems depend on.
The following sentence is the key to understanding the Ceph architecture:
Storage cluster clients and each Ceph OSD Daemon use the CRUSH algorithm to efficiently compute information about data location, instead of having to depend on a central lookup table. (Storage cluster clients and every OSD daemon use CRUSH to compute an object's location efficiently, rather than querying a central lookup table.) If you understand the HDFS and MooseFS architectures, you can appreciate this point deeply: HDFS Federation in Hadoop 2.0, and the HA work before it, are both attempts to patch over the namenode single point of failure. At the bottom, Ceph uses librados to provide a unified interface for the services layered on top.
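The deterministic-placement idea can be sketched in a few lines of Python. The hash scheme, OSD names, and replica count below are illustrative stand-ins, not real CRUSH, which walks the cluster map with placement rules:

```python
import hashlib

def place(object_id: str, osds: list, replicas: int = 2) -> list:
    # Toy stand-in for CRUSH: any client or OSD computes the same
    # placement from the object id alone, so no central lookup table
    # is needed. (Real CRUSH is hierarchy-aware; this plain hash is
    # only an illustration of "compute, don't look up".)
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    start = h % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

osds = ["osd.0", "osd.1", "osd.2"]
# Two independent "clients" agree on the location without asking a server:
assert place("john", osds) == place("john", osds)
print(place("john", osds))
```

The point is that location is a pure function of the object id and the (shared, versioned) cluster state, which is exactly what makes a namenode unnecessary.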
All data is ultimately stored in RADOS as objects. OSDs store every object in a flat namespace (there is no directory hierarchy). An object consists of an identifier, binary data, and metadata made up of name/value pairs. CephFS, for example, uses the metadata to store a file's owner, creation date, and last-modified date.
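As a rough sketch, the flat object model described above can be pictured like this (the class and field names are my own illustration, not librados types):

```python
from dataclasses import dataclass, field

@dataclass
class RadosObject:
    # An object: a cluster-wide-unique identifier, opaque binary data,
    # and name/value metadata.
    oid: str
    data: bytes
    metadata: dict = field(default_factory=dict)

# The "namespace" is flat: a mapping from oid to object, no directories.
store = {}
obj = RadosObject(
    oid="john",
    data=b"...file contents...",
    # e.g. CephFS keeps file attributes in the object metadata:
    metadata={"owner": "alice", "created": "2014-01-23", "modified": "2014-03-17"},
)
store[obj.oid] = obj
print(store["john"].metadata["owner"])  # prints: alice
```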
In HDFS and MooseFS, that metadata lives on the namenode/master. In Ceph, the object id is unique across the whole cluster, and Ceph lets OSDs and clients connect directly, which is made possible by CRUSH: by cleanly distributing the work across the cluster's clients and OSDs, CRUSH lets the cluster scale. Below is a brief look at how the cluster maps work; I will cover the CRUSH algorithm itself in detail later.
The cluster map actually consists of five maps:
- monitor map: [root@rados01 ~]# ceph mon dump
dumped monmap epoch 1
epoch 1
fsid 96aa2924-f83f-4145-94ad-0fdb71fa8afa
last_changed 2014-01-23 07:10:30.526764
created 2014-01-23 07:10:30.526764
0: 192.168.15.56:6789/0 mon.rados01
1: 192.168.15.57:6789/0 mon.rados02
2: 192.168.15.58:6789/0 mon.rados03
- osd map: [root@rados01 ~]# ceph osd dump
epoch 217
fsid 96aa2924-f83f-4145-94ad-0fdb71fa8afa
created 2014-01-23 07:10:44.930304
modified 2014-03-17 15:41:57.110964
flags
pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
pool 3 '.rgw' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 13 owner 0
pool 4 '.rgw.control' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 14 owner 0
pool 5 '.rgw.gc' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 15 owner 0
pool 6 '.log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 16 owner 0
pool 7 '.intent-log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 17 owner 0
pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 18 owner 0
pool 9 '.users' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 19 owner 0
pool 10 '.users.email' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 20 owner 0
pool 11 '.users.swift' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 21 owner 0
pool 12 '.users.uid' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 22 owner 0
pool 13 '.rgw.buckets.index' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 23 owner 0
pool 14 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 24 owner 0
pool 15 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 owner 0
pool 16 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 160 owner 18446744073709551615
pool 18 'hadoop02' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 183 owner 0
pool 19 'hadoop03' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 185 owner 0
pool 22 'hadoop01' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 214 owner
max_osd 3
osd.0 up in weight 1 up_from 168 up_thru 216 down_at 163 last_clean_interval [157,162) 192.168.15.56:6802/5328 192.168.15.56:6803/5328 192.168.15.56:6804/5328 192.168.15.56:6805/5328 exists,up 4332f60b-f55a-46da-9994-fe1e3cf1d3a7
osd.1 up in weight 1 up_from 209 up_thru 216 down_at 208 last_clean_interval [206,207) 192.168.15.57:6801/1398 192.168.15.57:6802/1398 192.168.15.57:6803/1398 192.168.15.57:6804/1398 exists,up 966ecd52-9cdd-44b8-8c68-ae3451d158b9
osd.2 up in weight 1 up_from 173 up_thru 216 down_at 169 last_clean_interval [149,169) 192.168.15.58:6801/30110 192.168.15.58:6802/30110 192.168.15.58:6803/30110 192.168.15.58:6804/30110 exists,up 7ab39fe2-64ff-4c9b-9ab6-5759887a1f95
- pg map: ceph pg dump
- crush map: I will study this one in detail later.
- mds map: [root@rados01 ~]# ceph mds dump
dumped mdsmap epoch 84
epoch 84
flags 0
created 2014-01-23 07:10:44.930114
modified 2014-03-10 15:57:12.995183
tableserver 0
root 0
session_timeout 60
session_autoclose 300
last_failure 74
last_failure_osd_epoch 193
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
max_mds 1
in 0
up {0=5500}
failed
stopped
data_pools 0,17,21,22,30
metadata_pool 1
5500: 192.168.15.56:6800/4809 'rados01' mds.0.13 up:active seq 4346
6066: 192.168.15.57:6800/1217 'rados02' mds.-1.0 up:standby seq 1
Multiple monitors reach agreement on messages via the Paxos algorithm; I will explain Paxos in detail later. My impression is that ZooKeeper provides similar functionality in the Hadoop ecosystem, but that is something to dig into another time.
Ceph authorizes OSD daemons to check whether neighboring OSDs are down, update the cluster map, and report to the Ceph monitors; this keeps the monitors lightly loaded.
To keep data consistent and clean, Ceph OSDs scrub the objects within a PG: an OSD compares the metadata of its objects against the metadata of the replicas held by the other OSDs in that PG, catching OSD and file-system errors. OSDs can also compare replica objects bit for bit; this deep scrubbing finds bad data segments on a drive that a light scrub misses.
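The two scrub levels can be contrasted with a small sketch (the metadata fields and the digest choice are illustrative assumptions, not Ceph's actual on-disk checks):

```python
import hashlib

def light_scrub(replica_meta):
    # Light scrub: compare object *metadata* across replicas; cheap,
    # catches size/attribute mismatches.
    return all(m == replica_meta[0] for m in replica_meta[1:])

def deep_scrub(replica_data):
    # Deep scrub: compare object *contents* bit for bit (via a digest);
    # catches corruption that light scrubbing cannot see.
    return len({hashlib.sha256(d).digest() for d in replica_data}) == 1

print(light_scrub([{"size": 4096}, {"size": 4096}]))  # True: metadata agrees
print(deep_scrub([b"abc", b"abd"]))                   # False: a replica is corrupt
```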
OSDs use CRUSH for replicating objects and balancing load: the primary OSD uses the CRUSH map to locate the secondary and tertiary OSDs. In Ceph, replication happens per object, not per PG or per chunk of a PG. The primary copies the object to the secondary and tertiary OSDs, and only then acknowledges the write to the client.
This takes the replication burden off the client while ensuring data availability and safety.
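The write path just described (client to primary, primary to secondary/tertiary, then ack) can be simulated like so; the class names and the "ack" string are my own, for illustration only:

```python
class OSD:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, oid, data):
        self.store[oid] = data

def client_write(oid, data, primary, replicas):
    # The client talks only to the primary OSD; the primary copies the
    # object to the secondary/tertiary OSDs (found via the CRUSH map)
    # and acknowledges the client once every replica holds the object.
    primary.write(oid, data)
    for osd in replicas:
        osd.write(oid, data)
    return "ack"

primary, secondary, tertiary = OSD("osd.0"), OSD("osd.1"), OSD("osd.2")
print(client_write("john", b"data", primary, [secondary, tertiary]))  # prints: ack
assert secondary.store["john"] == tertiary.store["john"] == b"data"
```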
Next we analyze CRUSH's dynamic management in detail; this is central to understanding how Ceph works. We will look at how CRUSH, in a cloud storage architecture, places data, balances load across machines, and recovers from failures automatically.
Pools are logical partitions for storing objects; a pool mainly sets the following parameters:
- Ownership/Access to Objects
- The Number of Object Replicas
- The Number of Placement Groups
- The CRUSH Ruleset to Use.
Each OSD holds many PGs. PGs are an indirection layer between OSDs and clients, and this layer is another guarantee of balanced data placement.
Now let's see how the PG id for an object is computed:
- the client inputs the pool name and the object id; (e.g., pool = “liverpool” and object-id = “john”)
- CRUSH takes the object id and hashes it;
- PG ID = hash value modulo the number of PGs in the pool (e.g., 0x58);
- CRUSH looks up the pool ID from the given pool name; (e.g., “liverpool” = 4)
- CRUSH prepends the pool ID to the PG ID (e.g., 4.0x58).
See, simple, right? Anyone who has written a bit of code could implement this.
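Here is a sketch of those five steps. CRC32 stands in for Ceph's rjenkins hash, and the pool table and pg_num are invented for illustration:

```python
import zlib

POOLS = {"liverpool": 4}      # hypothetical pool-name -> pool-id table
PG_NUM = {"liverpool": 256}   # assumed pg_num for the pool

def compute_pg(pool_name, object_id):
    # 1. the client supplies the pool name and object id
    # 2. hash the object id (real Ceph uses the rjenkins hash;
    #    CRC32 here is only a deterministic stand-in)
    h = zlib.crc32(object_id.encode())
    # 3. the hash modulo the pool's PG count gives the PG id
    pg = h % PG_NUM[pool_name]
    # 4./5. look up the pool id and prepend it to the PG id, e.g. "4.58"
    return "{}.{:x}".format(POOLS[pool_name], pg)

print(compute_pg("liverpool", "john"))
```

Note that the modulus is the pool's PG count, not the OSD count: it is CRUSH's separate PG-to-OSD mapping that absorbs OSDs joining and leaving.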
OSDs perform two kinds of actions among themselves:
- they check each other's heartbeats and report failures to the monitors;
- they peer with one another, meaning the OSDs in a PG agree on the state of all the objects in that PG; the primary OSD drives this process;
- terminology: the acting set is the set of OSDs in a PG, and the up set is the subset of those OSDs that are in the up state.
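A tiny example of the acting-set/up-set distinction (OSD names and statuses invented for illustration):

```python
# The acting set: the OSDs responsible for one PG.
acting_set = ["osd.0", "osd.1", "osd.2"]
# Current daemon status; say osd.1 has gone down:
status = {"osd.0": "up", "osd.1": "down", "osd.2": "up"}
# The up set: the members of the acting set currently in the up state.
up_set = [o for o in acting_set if status[o] == "up"]
print(up_set)  # prints: ['osd.0', 'osd.2']
```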
Next, let's look at how the rebalancing process works.
That wraps up part 1 of these Ceph architecture study notes; I hope the article is of some help to fellow programmers!