This article is part 1 of my Ceph architecture study notes; I hope it offers developers some useful reference. Follow along and let's learn together!
These notes work through the official Ceph architecture documentation at http://ceph.com/docs/master/architecture/
First, a look at Ceph's API types:
- librados is the library for RADOS, with bindings for C, C++, Java, Python, Ruby, and PHP;
- radosgw is a bucket-based REST gateway, compatible with the S3 and Swift APIs;
- rbd is a reliable, fully distributed block device, with a Linux kernel client and a QEMU/KVM driver;
- ceph-fuse is the module I am personally most interested in: it exposes the file system through FUSE. Later I will compare ceph-fuse, MooseFS, and HDFS in detail;
rados stands for Reliable, Autonomic, Distributed Object Store. Its killer feature is using the CRUSH algorithm to achieve high data availability, doing away with the namenode that other distributed file systems depend on.
The following sentence is the key to understanding the Ceph architecture:
Storage cluster clients and each Ceph OSD Daemon use the CRUSH algorithm to efficiently compute information about data location, instead of having to depend on a central lookup table. (Storage cluster clients and every OSD daemon use CRUSH to compute an object's location efficiently, rather than querying a central lookup table.) If you understand the HDFS and MooseFS architectures, you can appreciate this point deeply: HDFS Federation in Hadoop 2.0, and the HA work before it, are both attempts to patch over the namenode single point of failure. At the bottom, Ceph uses librados to provide a unified interface for the services layered on top.
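The deterministic-placement idea can be sketched in a few lines of Python. The hash scheme, OSD names, and replica count below are illustrative stand-ins, not real CRUSH, which walks the cluster map with placement rules:

```python
import hashlib

def place(object_id: str, osds: list, replicas: int = 2) -> list:
    # Toy stand-in for CRUSH: any client or OSD computes the same
    # placement from the object id alone, so no central lookup table
    # is needed. (Real CRUSH is hierarchy-aware; this plain hash is
    # only an illustration of "compute, don't look up".)
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    start = h % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

osds = ["osd.0", "osd.1", "osd.2"]
# Two independent "clients" agree on the location without asking a server:
assert place("john", osds) == place("john", osds)
print(place("john", osds))
```

The point is that location is a pure function of the object id and the (shared, versioned) cluster state, which is exactly what makes a namenode unnecessary.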
All data is ultimately stored in RADOS as objects. OSDs store every object in a flat namespace (there is no directory hierarchy). An object consists of an identifier, binary data, and metadata made up of name/value pairs. CephFS, for example, uses the metadata to store a file's owner, creation date, and last-modified date.
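As a rough sketch, the flat object model described above can be pictured like this (the class and field names are my own illustration, not librados types):

```python
from dataclasses import dataclass, field

@dataclass
class RadosObject:
    # An object: a cluster-wide-unique identifier, opaque binary data,
    # and name/value metadata.
    oid: str
    data: bytes
    metadata: dict = field(default_factory=dict)

# The "namespace" is flat: a mapping from oid to object, no directories.
store = {}
obj = RadosObject(
    oid="john",
    data=b"...file contents...",
    # e.g. CephFS keeps file attributes in the object metadata:
    metadata={"owner": "alice", "created": "2014-01-23", "modified": "2014-03-17"},
)
store[obj.oid] = obj
print(store["john"].metadata["owner"])  # prints: alice
```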
In HDFS and MooseFS, that metadata lives on the namenode/master. In Ceph, the object id is unique across the whole cluster, and Ceph lets OSDs and clients connect directly, which is made possible by CRUSH: by cleanly distributing the work across the cluster's clients and OSDs, CRUSH lets the cluster scale. Below is a brief look at how the cluster maps work; I will cover the CRUSH algorithm itself in detail later.
The cluster map actually consists of five maps:
- monitor map: [root@rados01 ~]# ceph mon dump
dumped monmap epoch 1
epoch 1
fsid 96aa2924-f83f-4145-94ad-0fdb71fa8afa
last_changed 2014-01-23 07:10:30.526764
created 2014-01-23 07:10:30.526764
0: 192.168.15.56:6789/0 mon.rados01
1: 192.168.15.57:6789/0 mon.rados02
2: 192.168.15.58:6789/0 mon.rados03
- osd map: [root@rados01 ~]# ceph osd dump
epoch 217
fsid 96aa2924-f83f-4145-94ad-0fdb71fa8afa
created 2014-01-23 07:10:44.930304
modified 2014-03-17 15:41:57.110964
flags
pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 256 pgp_num 256 last_change 1 owner 0
pool 3 '.rgw' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 13 owner 0
pool 4 '.rgw.control' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 14 owner 0
pool 5 '.rgw.gc' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 15 owner 0
pool 6 '.log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 16 owner 0
pool 7 '.intent-log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 17 owner 0
pool 8 '.usage' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 18 owner 0
pool 9 '.users' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 19 owner 0
pool 10 '.users.email' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 20 owner 0
pool 11 '.users.swift' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 21 owner 0
pool 12 '.users.uid' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 22 owner 0
pool 13 '.rgw.buckets.index' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 23 owner 0
pool 14 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 24 owner 0
pool 15 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 owner 0
pool 16 '' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 160 owner 18446744073709551615
pool 18 'hadoop02' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 183 owner 0
pool 19 'hadoop03' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 185 owner 0
pool 22 'hadoop01' rep size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 214 owner
max_osd 3
osd.0 up in weight 1 up_from 168 up_thru 216 down_at 163 last_clean_interval [157,162) 192.168.15.56:6802/5328 192.168.15.56:6803/5328 192.168.15.56:6804/5328 192.168.15.56:6805/5328 exists,up 4332f60b-f55a-46da-9994-fe1e3cf1d3a7
osd.1 up in weight 1 up_from 209 up_thru 216 down_at 208 last_clean_interval [206,207) 192.168.15.57:6801/1398 192.168.15.57:6802/1398 192.168.15.57:6803/1398 192.168.15.57:6804/1398 exists,up 966ecd52-9cdd-44b8-8c68-ae3451d158b9
osd.2 up in weight 1 up_from 173 up_thru 216 down_at 169 last_clean_interval [149,169) 192.168.15.58:6801/30110 192.168.15.58:6802/30110 192.168.15.58:6803/30110 192.168.15.58:6804/30110 exists,up 7ab39fe2-64ff-4c9b-9ab6-5759887a1f95
- pg map: ceph pg dump
- crush map: I will study this one in detail later.
- mds map: [root@rados01 ~]# ceph mds dump
dumped mdsmap epoch 84
epoch 84
flags 0
created 2014-01-23 07:10:44.930114
modified 2014-03-10 15:57:12.995183
tableserver 0
root 0
session_timeout 60
session_autoclose 300
last_failure 74
last_failure_osd_epoch 193
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
max_mds 1
in 0
up {0=5500}
failed
stopped
data_pools 0,17,21,22,30
metadata_pool 1
5500: 192.168.15.56:6800/4809 'rados01' mds.0.13 up:active seq 4346
6066: 192.168.15.57:6800/1217 'rados02' mds.-1.0 up:standby seq 1
Multiple monitors reach agreement on messages via the Paxos algorithm; I will explain Paxos in detail later. My impression is that ZooKeeper provides similar functionality in the Hadoop ecosystem, but that is something to dig into another time.
Ceph authorizes OSD daemons to check whether neighboring OSDs are down, update the cluster map, and report to the Ceph monitors; this keeps the monitors lightly loaded.
To keep data consistent and clean, Ceph OSDs scrub the objects within a PG: an OSD compares the metadata of its objects against the metadata of the replicas held by the other OSDs in that PG, catching OSD and file-system errors. OSDs can also compare replica objects bit for bit; this deep scrubbing finds bad data segments on a drive that a light scrub misses.
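The two scrub levels can be contrasted with a small sketch (the metadata fields and the digest choice are illustrative assumptions, not Ceph's actual on-disk checks):

```python
import hashlib

def light_scrub(replica_meta):
    # Light scrub: compare object *metadata* across replicas; cheap,
    # catches size/attribute mismatches.
    return all(m == replica_meta[0] for m in replica_meta[1:])

def deep_scrub(replica_data):
    # Deep scrub: compare object *contents* bit for bit (via a digest);
    # catches corruption that light scrubbing cannot see.
    return len({hashlib.sha256(d).digest() for d in replica_data}) == 1

print(light_scrub([{"size": 4096}, {"size": 4096}]))  # True: metadata agrees
print(deep_scrub([b"abc", b"abd"]))                   # False: a replica is corrupt
```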
OSDs use CRUSH for replicating objects and balancing load: the primary OSD uses the CRUSH map to locate the secondary and tertiary OSDs. In Ceph, replication happens per object, not per PG or per chunk of a PG. The primary copies the object to the secondary and tertiary OSDs, and only then acknowledges the write to the client.
This takes the replication burden off the client while ensuring data availability and safety.
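The write path just described (client to primary, primary to secondary/tertiary, then ack) can be simulated like so; the class names and the "ack" string are my own, for illustration only:

```python
class OSD:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, oid, data):
        self.store[oid] = data

def client_write(oid, data, primary, replicas):
    # The client talks only to the primary OSD; the primary copies the
    # object to the secondary/tertiary OSDs (found via the CRUSH map)
    # and acknowledges the client once every replica holds the object.
    primary.write(oid, data)
    for osd in replicas:
        osd.write(oid, data)
    return "ack"

primary, secondary, tertiary = OSD("osd.0"), OSD("osd.1"), OSD("osd.2")
print(client_write("john", b"data", primary, [secondary, tertiary]))  # prints: ack
assert secondary.store["john"] == tertiary.store["john"] == b"data"
```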
Next we analyze CRUSH's dynamic management in detail; this is central to understanding how Ceph works. We will look at how CRUSH, in a cloud storage architecture, places data, balances load across machines, and recovers from failures automatically.
Pools are logical partitions for storing objects; a pool mainly sets the following parameters:
- Ownership/Access to Objects
- The Number of Object Replicas
- The Number of Placement Groups
- The CRUSH Ruleset to Use.
Each OSD holds many PGs. PGs are an indirection layer between OSDs and clients, and this layer is another guarantee of balanced data placement.
Now let's see how the PG id for an object is computed:
- the client inputs the pool name and the object id; (e.g., pool = “liverpool” and object-id = “john”)
- CRUSH takes the object id and hashes it;
- PG ID = hash value modulo the number of PGs in the pool (e.g., 0x58);
- CRUSH looks up the pool ID from the given pool name; (e.g., “liverpool” = 4)
- CRUSH prepends the pool ID to the PG ID (e.g., 4.0x58).
See, simple, right? Anyone who has written a bit of code could implement this.
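Here is a sketch of those five steps. CRC32 stands in for Ceph's rjenkins hash, and the pool table and pg_num are invented for illustration:

```python
import zlib

POOLS = {"liverpool": 4}      # hypothetical pool-name -> pool-id table
PG_NUM = {"liverpool": 256}   # assumed pg_num for the pool

def compute_pg(pool_name, object_id):
    # 1. the client supplies the pool name and object id
    # 2. hash the object id (real Ceph uses the rjenkins hash;
    #    CRC32 here is only a deterministic stand-in)
    h = zlib.crc32(object_id.encode())
    # 3. the hash modulo the pool's PG count gives the PG id
    pg = h % PG_NUM[pool_name]
    # 4./5. look up the pool id and prepend it to the PG id, e.g. "4.58"
    return "{}.{:x}".format(POOLS[pool_name], pg)

print(compute_pg("liverpool", "john"))
```

Note that the modulus is the pool's PG count, not the OSD count: it is CRUSH's separate PG-to-OSD mapping that absorbs OSDs joining and leaving.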
OSDs perform two kinds of actions among themselves:
- they check each other's heartbeats and report failures to the monitors;
- they peer with one another, meaning the OSDs in a PG agree on the state of all the objects in that PG; the primary OSD drives this process;
- terminology: the acting set is the set of OSDs in a PG, and the up set is the subset of those OSDs that are in the up state.
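A tiny example of the acting-set/up-set distinction (OSD names and statuses invented for illustration):

```python
# The acting set: the OSDs responsible for one PG.
acting_set = ["osd.0", "osd.1", "osd.2"]
# Current daemon status; say osd.1 has gone down:
status = {"osd.0": "up", "osd.1": "down", "osd.2": "up"}
# The up set: the members of the acting set currently in the up state.
up_set = [o for o in acting_set if status[o] == "up"]
print(up_set)  # prints: ['osd.0', 'osd.2']
```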
Next, let's look at how the rebalancing process works.
That wraps up part 1 of these Ceph architecture study notes; I hope the article is of some help to fellow programmers!