ZooKeeper

1 Introduction to ZooKeeper

ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.

Mainly excerpted from the official documentation: https://zookeeper.apache.org/doc/trunk/zookeeperOver.html

1.1 ZooKeeper is replicated

Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts.

zk_service.jpg

Figure 1: ZooKeeper Service

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.

As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.

Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.

1.2 Data model and the hierarchical namespace

The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a path.

zk_namespace.jpg

Figure 2: ZooKeeper's Hierarchical Namespace

Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file-system that allows a file to also be a directory. We use the term znode to make it clear that we are talking about ZooKeeper data nodes.

1.3 Simple API

ZooKeeper provides a very simple programming interface.

Table 1: ZooKeeper Simple API
API           Meaning
create        creates a node at a location in the tree
delete        deletes a node
exists        tests if a node exists at a location
get data      reads the data from a node
set data      writes data to a node
get children  retrieves a list of children of a node
sync          waits for data to be propagated
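
These operations map directly onto the ZooKeeper Java client API. The sketch below is not from the original document; it exercises most of the calls in Table 1, and the path /demo, the data strings, and the 5-second session timeout are arbitrary illustrative choices. It assumes a server is already running on localhost:2181 (see section 2.1).

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SimpleApiDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a local standalone server (see section 2.1); 5000 ms session timeout.
        // A production client should wait for the SyncConnected event before issuing requests.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // create: make a persistent znode /demo holding some bytes
        zk.create("/demo", "my_data".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // exists: returns a Stat if the znode is there, null otherwise
        Stat stat = zk.exists("/demo", false);

        // get data / set data
        byte[] data = zk.getData("/demo", false, stat);
        zk.setData("/demo", "my_new_data".getBytes(), stat.getVersion());

        // get children: list the children of the root znode
        List<String> children = zk.getChildren("/", false);
        System.out.println(new String(data) + " / children of /: " + children);

        // delete (version -1 means "any version")
        zk.delete("/demo", -1);
        zk.close();
    }
}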

2 Basic Usage of ZooKeeper

2.1 Standalone Mode

For simplicity, we first cover configuring and using ZooKeeper in standalone mode (i.e. a single server; if that server goes down the service is no longer available, so do not use standalone mode in production).

Download zookeeper-3.3.6.tar.gz (https://archive.apache.org/dist/zookeeper/zookeeper-3.3.6/); after extracting it, the layout looks like this:

cig01@MacBook-Pro ~/Downloads/zookeeper-3.3.6$ ls
CHANGES.txt              contrib                  src
LICENSE.txt              dist-maven               zookeeper-3.3.6.jar
NOTICE.txt               docs                     zookeeper-3.3.6.jar.asc
README.txt               ivy.xml                  zookeeper-3.3.6.jar.md5
bin                      ivysettings.xml          zookeeper-3.3.6.jar.sha1
build.xml                lib
conf                     recipes

2.1.1 Prepare the configuration file

Create the configuration file conf/zoo.cfg (ZooKeeper reads this location by default) with the following content:

tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181

The meanings of these three configuration parameters are shown in Table 2.

Table 2: ZooKeeper configuration parameters (partial) and their meanings
Parameter   Meaning
tickTime    the basic time unit in milliseconds used by ZooKeeper; it is used for heartbeats, and the minimum session timeout is twice the tickTime
dataDir     the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database
clientPort  the port to listen for client connections

2.1.2 Start the ZooKeeper server

Start the ZooKeeper server:

$ ./bin/zkServer.sh start

2.1.3 Start the ZooKeeper client

Start the ZooKeeper client and connect it to the server we just started, e.g.:

$ ./bin/zkCli.sh -server localhost:2181           # port 2181 comes from the clientPort setting in conf/zoo.cfg
[zk: localhost:2181(CONNECTED) 0]

2.1.4 Test basic commands

Below we test a few basic commands in the client, such as ls, create, get, set, and delete:

[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 1] create /zk_test my_data
Created /zk_test
[zk: localhost:2181(CONNECTED) 2] ls /
[zookeeper, zk_test]
[zk: localhost:2181(CONNECTED) 3] get /zk_test
my_data
cZxid = 0x2
ctime = Sat Jun 17 09:49:42 CST 2017
mZxid = 0x2
mtime = Sat Jun 17 09:49:42 CST 2017
pZxid = 0x2
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 7
numChildren = 0
[zk: localhost:2181(CONNECTED) 4] set /zk_test my_new_data
cZxid = 0x2
ctime = Sat Jun 17 09:49:42 CST 2017
mZxid = 0x3
mtime = Sat Jun 17 09:50:05 CST 2017
pZxid = 0x2
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
[zk: localhost:2181(CONNECTED) 5] get /zk_test
my_new_data
cZxid = 0x2
ctime = Sat Jun 17 09:49:42 CST 2017
mZxid = 0x3
mtime = Sat Jun 17 09:50:05 CST 2017
pZxid = 0x2
cversion = 0
dataVersion = 1
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 11
numChildren = 0
[zk: localhost:2181(CONNECTED) 6] delete /zk_test
[zk: localhost:2181(CONNECTED) 7] ls /
[zookeeper]

2.2 Replicated Mode

The previous section covered ZooKeeper in standalone mode; this section describes how to deploy a ZooKeeper ensemble (replicated mode).

Since no spare machines are at hand, we deploy 3 ZooKeeper servers on a single machine here.

First, create 3 directories (server1, server2, server3) and extract the ZooKeeper package into each of them:

~/Downloads/server1/zookeeper-3.3.6
~/Downloads/server2/zookeeper-3.3.6
~/Downloads/server3/zookeeper-3.3.6

Then, create a data directory under each server folder, and inside it create a file named myid containing a single number: the myid file for server1 contains 1, for server2 it contains 2, and for server3 it contains 3.

$ cat ~/Downloads/server1/data/myid
1
$ cat ~/Downloads/server2/data/myid
2
$ cat ~/Downloads/server3/data/myid
3

2.2.1 Configure conf/zoo.cfg

Go into each server's ZooKeeper installation directory and create a conf/zoo.cfg file with content like the following:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/cig01/Downloads/server1/data/
clientPort=2181
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890

Note 1: Since all 3 servers run on the same machine, each must use a different clientPort to avoid port conflicts.
Note 2: The X in server.X=serverhost:port1:port2 corresponds to the number in that server's data/myid file. Each server entry has two ports: the first port (port1) is used for information exchange among ensemble members, and the second port (port2) is used to elect a new leader when the current leader fails. Since all 3 servers run on the same machine, these ports must also all be different to avoid port conflicts.

In our example, the configuration files of the 3 servers are as follows:

$ cat ~/Downloads/server1/zookeeper-3.3.6/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/cig01/Downloads/server1/data/
clientPort=2181
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
$ cat ~/Downloads/server2/zookeeper-3.3.6/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/cig01/Downloads/server2/data/
clientPort=2182
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
$ cat ~/Downloads/server3/zookeeper-3.3.6/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/cig01/Downloads/server3/data/
clientPort=2183
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890

2.2.2 Start the ZooKeeper servers

Start each ZooKeeper server:

$ cd ~/Downloads/server1/zookeeper-3.3.6/
$ ./bin/zkServer.sh start
$ cd ~/Downloads/server2/zookeeper-3.3.6/
$ ./bin/zkServer.sh start
$ cd ~/Downloads/server3/zookeeper-3.3.6/
$ ./bin/zkServer.sh start

2.2.3 Start the ZooKeeper client

You can connect to one specific ZooKeeper server, e.g.:

$ ./bin/zkCli.sh -server 127.0.0.1:2181

You can also specify several ZooKeeper servers, separated by commas, e.g.:

$ ./bin/zkCli.sh -server 127.0.0.1:2182,127.0.0.1:2183

In this case, the ZooKeeper client picks a server at random from the server list to connect to; if that fails, it tries another server. If the server it is connected to suddenly goes down, it also reconnects to another server.
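
The Java client works the same way: the whole comma-separated server list can be passed as the connect string, and the client library handles picking a server and failing over. Below is a small illustrative sketch (not from the original document); the server addresses come from the configuration above, and the 5-second session timeout is arbitrary.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class EnsembleClientDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(
                "127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183",  // the 3 local servers
                5000,                                            // session timeout in ms
                event -> {
                    // Count down once the client has connected to some server in the list.
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
        connected.await();
        System.out.println("Connected, session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}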

3 ZooKeeper Application Examples

ZooKeeper can be used to build facilities such as distributed queues, distributed barriers, and distributed locks.

References:
http://zookeeper.apache.org/doc/current/recipes.html
https://zookeeper.apache.org/doc/trunk/zookeeperTutorial.html

3.1 Implementing a Distributed Lock

ZooKeeper can be used to implement a distributed lock.

3.1.1 Approach 1 (suffers from the "herd effect")

Approach 1: To acquire the lock, every client tries to create the /lock node (of EPHEMERAL type, i.e. an ephemeral node). Only one client will succeed; the winner holds the lock, while the losers use ZooKeeper's watch mechanism to wait for notifications about changes to /lock (once a loser sees that /lock has been deleted, it can try to create /lock again). To release the lock, simply delete the /lock node.
Note 1: The /lock node must be of EPHEMERAL type; if the client holding the lock crashes, /lock is deleted automatically, which avoids deadlock.
Note 2: Instead of relying on the watch mechanism to learn that /lock has been deleted, a loser could also simply retry after some delay. However, it is hard to know how long a suitable delay is.
Note 3: The drawback of this approach is the "herd effect": if many clients are competing for the lock, they are all woken up at once whenever /lock is deleted.
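
A minimal sketch of Approach 1 on the raw ZooKeeper Java API follows. It is illustrative only: the class SimpleLock and its method names are made up for this note, and an already-connected ZooKeeper handle is assumed.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class SimpleLock {
    private static final String LOCK_PATH = "/lock";
    private final ZooKeeper zk;

    SimpleLock(ZooKeeper zk) { this.zk = zk; }

    // Block until this client manages to create the ephemeral /lock node.
    void lock() throws KeeperException, InterruptedException {
        while (true) {
            try {
                zk.create(LOCK_PATH, new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return;                            // created it: we hold the lock
            } catch (KeeperException.NodeExistsException e) {
                waitForChange();                   // someone else holds it: wait, then retry
            }
        }
    }

    void unlock() throws KeeperException, InterruptedException {
        zk.delete(LOCK_PATH, -1);                  // -1 means "any version"
    }

    // Set a watch on /lock and wait until it changes (normally: gets deleted).
    // This is where the herd effect appears: every waiting client is notified at once.
    private void waitForChange() throws KeeperException, InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        Stat stat = zk.exists(LOCK_PATH, event -> latch.countDown());
        if (stat != null) {
            latch.await();                         // the outer loop re-checks by trying create again
        }
    }
}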

3.1.2 Approach 2

Approach 2: To acquire the lock, every client creates a node "/lock-node/child-" (with the node type set to EPHEMERAL_SEQUENTIAL). If a client finds that the sequence number of the node it created (the X in child-X) is the smallest among the nodes under /lock-node/, it acquires the lock. Otherwise, it watches the node whose sequence number is the largest among those smaller than its own, and waits. To release the lock, the client simply deletes the node it created.
Note: Using ephemeral sequential nodes (EPHEMERAL_SEQUENTIAL) to implement a distributed lock effectively queues the clients in order. It avoids the "herd effect": when multiple clients are waiting for the lock, only one client is woken up when the lock is released.

In the example of Figure 3 (image taken from: https://nofluffjuststuff.com/blog/scott_leberknight/2013/07/distributed_coordination_with_zookeeper_part_5_building_a_distributed_lock), the client that created child-1 acquires the lock first, then the client that created child-2, and so on.

zk_parent_lock_znode_child_znodes.png

Figure 3: Parent lock znode and child znodes

3.1.2.1 Implementation
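
Below is a minimal sketch of Approach 2 on the raw ZooKeeper Java API. It is illustrative only: the class SequentialLock is made up for this note, an already-connected ZooKeeper handle is assumed, and the persistent parent znode /lock-node is assumed to already exist.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class SequentialLock {
    private static final String LOCK_ROOT = "/lock-node";   // persistent parent, created beforehand
    private final ZooKeeper zk;
    private String myNode;                                   // full path of the node we created

    SequentialLock(ZooKeeper zk) { this.zk = zk; }

    void lock() throws KeeperException, InterruptedException {
        // Create an ephemeral sequential child; ZooKeeper appends the sequence number to "child-".
        myNode = zk.create(LOCK_ROOT + "/child-", new byte[0],
                           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myName = myNode.substring(myNode.lastIndexOf('/') + 1);
        while (true) {
            List<String> children = zk.getChildren(LOCK_ROOT, false);
            Collections.sort(children);                      // zero-padded sequence numbers sort lexically
            int myIndex = children.indexOf(myName);
            if (myIndex == 0) {
                return;                                      // smallest sequence number: lock acquired
            }
            // Watch only the node immediately before ours, which avoids the herd effect.
            String previous = LOCK_ROOT + "/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            Stat stat = zk.exists(previous, event -> latch.countDown());
            if (stat != null) {
                latch.await();                               // wait until the previous node changes or disappears
            }
            // Loop and re-check: the previous node may already be gone, or may have been
            // deleted without its owner ever holding the lock (e.g. its session expired).
        }
    }

    void unlock() throws KeeperException, InterruptedException {
        zk.delete(myNode, -1);
    }
}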

3.1.3 Shortcomings

Both Approach 1 and Approach 2 have a potential timeout problem.

For Approach 1, the following scenario (taken from: https://fpj.me/2016/02/10/note-on-fencing-and-distributed-locks/) goes wrong:

  1. Client C1 creates /lock
  2. Client C1 goes into a long GC pause
  3. The session of C1 expires
  4. Client C2 creates /lock and becomes the new lock holder
  5. Client C1 comes back from the GC pause and it still thinks it holds the lock

At step 5, we have both clients C1 and C2 thinking they hold the lock… Trouble ahead!

Approach 2 has a similar timeout problem; see: https://stackoverflow.com/questions/14275613/concerns-about-zookeepers-lock-recipe

4 Curator: A ZooKeeper Client Framework

Curator is an open-source ZooKeeper client (originally from Netflix). It implements facilities such as distributed queues, distributed barriers, and distributed locks, which greatly reduces the amount of work needed to develop ZooKeeper clients.

References:
Apache Curator入门实战 (a hands-on introduction, in Chinese): http://blog.csdn.net/dc_726/article/details/46475633

4.1 Implementing a Distributed Lock

InterProcessMutex implements a distributed lock and is very easy to use:

InterProcessMutex lock = new InterProcessMutex(client, lockPath);
if ( lock.acquire(maxWait, waitUnit) )
{
    try
    {
        // do some work inside of the critical section here
    }
    finally
    {
        lock.release();
    }
}
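
For completeness, here is one way the snippet above might be wired up end to end. This is a sketch, not taken from the Curator documentation; the connect string 127.0.0.1:2181 and the lock path /my_lock are placeholder choices.

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorLockDemo {
    public static void main(String[] args) throws Exception {
        // Retry policy: start at 1000 ms between retries, retry at most 3 times.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "127.0.0.1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/my_lock");
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                System.out.println("Got the lock, doing work in the critical section...");
            } finally {
                lock.release();                  // always release in finally
            }
        } else {
            System.out.println("Could not acquire the lock within 10 seconds");
        }
        client.close();
    }
}

Internally, InterProcessMutex follows essentially the ephemeral sequential recipe described in section 3.1.2.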

References:
http://curator.apache.org/curator-recipes/


Author: cig01

Created: <2017-06-11 Sun 00:00>

Last updated: <2017-10-17 Tue 23:47>

Creator: Emacs 25.1.1 (Org mode 9.0.7)