How to Monitor Zookeeper

Update: We hosted a live Hangout on Air with some members of the Server Density engineering and operations teams, in which we discussed, among other topics, how we use Zookeeper here at Server Density. We’ve made the video available; it is embedded at the bottom of this blog post.

Apache Zookeeper works at the zoo—not your usual zoo, but similar—and does what you’d expect. You know, keep your service-oriented architecture nice and clean.

It provides a distributed hierarchical file system that helps with the difficulties of coordinating services running on different machines (discovery, registration, configuration, locking, leader election, queueing, etc.). All data is replicated across all nodes, and the leader performs atomic broadcasts to the other servers, which guarantees a strong ordering of changes as they propagate.

Zookeeper nodes (ZNodes) are like files in a hierarchical file system (e.g. /foo/foo1, /bar/taz, /dev/null/full). They can store arbitrary data and notify watchers of any event pertaining to them.

Zookeeper can be quite a tricky service to manage. From a client programming point of view there are plenty of low-level and error-handling pitfalls. That explains the popularity of higher-level API wrappers, like Curator, created by the Netflix team.

With that in mind, here is our very own checklist of best practices, including key Zookeeper metrics and alerts we monitor with Server Density.

Monitoring Zookeeper: Metrics and Alerts

As per previous articles, our general rule of thumb is “collect all possible/reasonable metrics that can help when troubleshooting, alert only on those that require an action from you”. Well, the Zookeeper list that satisfies these criteria is not that long.

Zookeeper process is running

Metric	Comments	Suggested Alert
Zookeeper process	Is the right binary daemon process running?	When no process in the process list matches the regexp /usr/bin/java.*org\.apache\.zookeeper

You can also use the following script to check if the server is running:

$INSTALL_PREFIX/zk-server-3/bin/zkServer.sh status

Or, if you run Zookeeper via supervisord (recommended), you can alert on the supervisord resource instead.
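As a rough illustration of the process check above, here is a minimal Python sketch. The pattern, the function names and the use of `ps -A -o args` are our own assumptions for the example; adjust the pattern to match how Java and Zookeeper are launched on your hosts.

```python
import re
import subprocess

# Assumed pattern: a java command line that mentions a class under
# org.apache.zookeeper. The java path is left out because it varies by host.
ZK_PATTERN = re.compile(r"java.*org\.apache\.zookeeper")


def zookeeper_running(process_lines):
    """Return True if any command line in the list looks like Zookeeper."""
    return any(ZK_PATTERN.search(line) for line in process_lines)


def read_process_list():
    """Return the full command line of every process, one per list item."""
    result = subprocess.run(
        ["ps", "-A", "-o", "args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    if not zookeeper_running(read_process_list()):
        print("ALERT: no Zookeeper process found")
```

The alert fires when nothing matches, which is the condition you actually want to be paged for.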

System Metrics

Metric	Meaning / Comments	Suggested Alert
Memory usage	Zookeeper should run entirely in RAM. To avoid swapping, the JVM heap size shouldn’t be bigger than your available RAM.	None
Swap usage	Watch for swap usage, as it will degrade Zookeeper performance and lead to operations timing out (set vm.swappiness = 0).	When used swap is > 128MB.
Network bandwidth	Zookeeper servers can incur high network usage. Keep an eye on this, especially if you notice any performance degradation. Also look out for dropped packet errors. The typical Zookeeper workload is 20% writes, 80% reads. More nodes result in more writes and higher overall traffic.	None
Disk usage	Zookeeper data is usually ephemeral and small. Still, we recommend putting dataLogDir on a dedicated partition and watching its disk usage. Use the purge task to clean up dataDir and dataLogDir.	When disk usage is > 85%.

Zookeeper disk writes are asynchronous, which means they shouldn’t have high I/O requirements. Still, keep an eye on this, especially if your server is shared with other services, say Kafka.
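The swap and disk thresholds from the table can be checked with a few lines of Python. This is only a sketch under some assumptions: it reads swap figures from Linux's /proc/meminfo, and the function names and the choice of path to measure are hypothetical.

```python
import shutil

SWAP_LIMIT_BYTES = 128 * 1024 * 1024   # "used swap > 128MB" from the table
DISK_LIMIT_PERCENT = 85.0              # "disk usage > 85%" from the table


def swap_used_bytes(meminfo_text):
    """Compute used swap from /proc/meminfo contents (values are in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    used_kb = fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)
    return used_kb * 1024


def disk_used_percent(path):
    """Return the used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def system_alerts(meminfo_text, disk_percent):
    """Return the names of the system checks that exceed their threshold."""
    alerts = []
    if swap_used_bytes(meminfo_text) > SWAP_LIMIT_BYTES:
        alerts.append("swap")
    if disk_percent > DISK_LIMIT_PERCENT:
        alerts.append("disk")
    return alerts
```

In practice you would pass the contents of /proc/meminfo and `disk_used_percent()` of the partition that holds dataLogDir.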

Here is how Server Density graphs disk usage and memory usage. Note the up and down curves created by the purge task:

[Figure: Server Density graphs of Zookeeper disk usage and memory usage]

And here are some Zookeeper alerts configured in Server Density:

[Figure: Zookeeper alerts configured in Server Density]

Zookeeper Metrics

Metric	Meaning / Comments	Suggested Alert
Request Avg/Max Latency	Amount of time it takes for the server to respond to a client request (since the server was started).	When latency > 10 (Ticks).
Outstanding Requests	Number of queued requests in the server. This goes up when the server receives more requests than it can process.	When count > 10.
Received	Number of client requests (typically operations) received.	None
Sent	Number of client packets sent (responses and notifications).	None
File Descriptors	Number of open file descriptors relative to the limit.	When FD percentage > 85%.
Mode	Serving mode: leader or follower, or standalone if not running in an ensemble.	None
Pending syncs	(Only exposed by the leader) number of pending syncs from the followers.	When pending > 10.
Followers	(Only exposed by the leader) number of followers within the ensemble. You can deduce the number of servers from the MBean Quorum Size.	When followers != (number of ensemble servers - 1).
Node count	Number of znodes in the Zookeeper namespace	None
Watch count	Number of watches set on Zookeeper nodes.	None
Heap Memory Usage	Memory allocated dynamically by the Java process, Zookeeper in this case.	None
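To make the suggested alerts concrete, the sketch below applies the thresholds from the table to a dictionary of metrics. The key names follow the `zk_*` names Zookeeper reports through its mntr command; the function name, the alert labels and the choice to check only the average latency are our own assumptions.

```python
def zookeeper_alerts(stats, ensemble_size):
    """Return the names of the suggested alerts that currently fire.

    stats maps zk_* metric names to their values (strings or numbers).
    ensemble_size is the number of servers you expect in the ensemble.
    """
    alerts = []

    if float(stats.get("zk_avg_latency", 0)) > 10:
        alerts.append("latency")

    if float(stats.get("zk_outstanding_requests", 0)) > 10:
        alerts.append("outstanding_requests")

    # File descriptor counts are only reported on Unix platforms.
    open_fds = float(stats.get("zk_open_file_descriptor_count", 0))
    max_fds = float(stats.get("zk_max_file_descriptor_count", 0))
    if max_fds > 0 and 100.0 * open_fds / max_fds > 85:
        alerts.append("file_descriptors")

    # Pending syncs and follower counts are only exposed by the leader.
    if stats.get("zk_server_state") == "leader":
        if float(stats.get("zk_pending_syncs", 0)) > 10:
            alerts.append("pending_syncs")
        if int(stats.get("zk_followers", 0)) != ensemble_size - 1:
            alerts.append("followers")

    return alerts
```

Metrics with no suggested alert (received, sent, node count, watch count) are left out on purpose: collect them for troubleshooting, but don't page anyone for them.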

Here is a Zookeeper monitoring graph including Latency average and Outstanding requests:

[Figure: Zookeeper monitoring graph of average latency and outstanding requests]

Zookeeper Monitoring Tools

The simplest way to monitor Zookeeper and collect these metrics is by using the commands known as “4 letter words” within the ZK community. You can run these using telnet or netcat directly:

$ echo ruok | nc 127.0.0.1 5111
imok
 
$ echo mntr | nc localhost 5111
zk_version  3.4.0
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received 70
zk_packets_sent 69
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count   4
zk_watch_count  0
zk_ephemerals_count 0
zk_approximate_data_size    27
zk_followers    4                   - only exposed by the Leader
zk_synced_followers 4               - only exposed by the Leader
zk_pending_syncs    0               - only exposed by the Leader
zk_open_file_descriptor_count 23    - only available on Unix platforms
zk_max_file_descriptor_count 1024   - only available on Unix platforms
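If you would rather collect these values from a script than from netcat, the same exchange is easy to reproduce with a plain socket. The following is a minimal sketch, not an official client: the function names are hypothetical, and the default port of 2181 is an assumption (the examples above use 5111), so pass whatever client port your server listens on.

```python
import socket


def four_letter_word(command, host="127.0.0.1", port=2181, timeout=5.0):
    """Send a four letter word (e.g. "ruok", "mntr") and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Turn mntr output (one "key value" pair per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


if __name__ == "__main__":
    if four_letter_word("ruok").strip() != "imok":
        print("ALERT: Zookeeper did not answer imok")
    for name, value in sorted(parse_mntr(four_letter_word("mntr")).items()):
        print(name, value)
```

Values are kept as strings because some of them, such as zk_version and zk_server_state, are not numeric. Note that newer Zookeeper releases only answer four letter words that have been explicitly enabled in the server configuration.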

We’ve looked at mytop for MySQL, and memcache-top for Memcached. Well, Zookeeper has one too, zktop:

$ ./zktop.py --servers "localhost:2181,localhost:2182,localhost:2183"
Ensemble -- nodecount:10 zxid:0x1300000001 sessions:4
SERVER           PORT M      OUTST    RECVD     SENT CONNS MINLAT AVGLAT MAXLAT
localhost        2181 F          0       93       92     2      2      7     13
localhost        2182 F          0       37       36     1      0      0      0
localhost        2183 L          0       36       35     1      0      0      0

CLIENT           PORT I   QUEUE RECVD  SENT
127.0.0.1       34705 1       0    56    56
127.0.0.1       35943 1       0     1     0
127.0.0.1       33999 1       0     1     0
127.0.0.1       37988 1       0     1     0

If you are after more detailed metrics, you can access those through JMX. You could also take the DIY road and go for JMXTrans and Graphite, or use Nagios/Cacti/Ganglia with check_zookeeper.py. Alternatively, you can save time (and preserve your sanity) by choosing a hosted service like Server Density (that’s us!).

If you want to test the quality and performance of your Zookeeper ensemble, then zk-smoketest with zk-smoketest.py and zk-latencies.py are great tools to check out.

Zookeeper Management Tools

There are not too many management options out there. The folks at Netflix have released Exhibitor, a tool that provides some basic monitoring, log cleanup (for old versions), backup/restore, ensemble configuration and node visualization. There is also zookeeper_dashboard, but it hasn’t been updated in years.


Tech chat: processing billions of events a day with Kafka, Zookeeper and Storm

The post How to Monitor Zookeeper appeared first on Server Density Blog.
