Time synchronization in data centres
Oleg Obleukhov, Production Engineer, Meta and Ahmad Byagowi, Research Scientist, Meta
Time synchronization is extremely important for almost every software application within a data centre.
Time is used to correlate and order near-simultaneous events across millions of servers.
In security, reliable timekeeping is essential for cache expiration and invalidation, short-lived certificates, and intrusion detection. Time synchronization also helps engineers correlate log entries, which are often timestamped in Coordinated Universal Time (UTC).
As transaction throughput constantly increases, time differences of even just a couple of milliseconds can cause serious issues. How time reaches and propagates within the data centre, therefore, is crucial.
Global navigation satellite systems
There are different ways to propagate accurate time to data centres. In many cases, it starts with receiving a radio-frequency broadcast from global navigation satellite system (GNSS) constellations such as GPS, GLONASS, Galileo and BeiDou via special devices called time appliances.
Due to irregularities in the Earth’s rotation, the difference between UTC and astronomical time based on the Earth’s rotation (UT1) constantly fluctuates. Before it can exceed the permitted limit of ±0.9 seconds, the International Earth Rotation and Reference Systems Service (IERS) issues an instruction for a leap second to be either added or removed from UTC. As a result, the offset between monotonically increasing International Atomic Time (TAI) and UTC changes by one second with every leap event.
This is further complicated by each constellation implementing its own operational time and additional conversion steps to UTC. For example, GPS time has a constant 19-second offset from TAI, while GLONASS is based on UTC.
Such complexity often falls on time appliances and, as with any other moving parts, occasionally causes problems.
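As a rough illustration of these conversion steps, the Python sketch below turns a GPS-time reading into TAI and then into UTC. The 19-second GPS offset comes from the text. The TAI minus UTC value of 37 seconds is the offset in effect since 1 January 2017 and is hard-coded here as an assumption; a real appliance would have to track leap second announcements. The function names are hypothetical and this is not the time appliance's actual code.

```python
from datetime import datetime, timedelta

# GPS time runs a constant 19 s behind TAI (see the text above).
GPS_TO_TAI = timedelta(seconds=19)

# TAI - UTC in whole seconds. 37 s has applied since 1 January 2017.
# Assumption: hard-coded here; it changes whenever a leap second is applied.
TAI_MINUS_UTC = timedelta(seconds=37)


def gps_to_tai(gps: datetime) -> datetime:
    """Convert a reading on the GPS timescale to TAI."""
    return gps + GPS_TO_TAI


def tai_to_utc(tai: datetime, tai_minus_utc: timedelta = TAI_MINUS_UTC) -> datetime:
    """Convert TAI to UTC using the supplied leap-second offset."""
    return tai - tai_minus_utc


# Naive datetime interpreted as a reading on the GPS timescale.
gps_reading = datetime(2023, 6, 1, 12, 0, 18)
tai = gps_to_tai(gps_reading)
utc = tai_to_utc(tai)
print(tai.isoformat())  # 2023-06-01T12:00:37
print(utc.isoformat())  # 2023-06-01T12:00:00
```

A GLONASS reading would need a different path, since that constellation is based on UTC rather than on a fixed offset from TAI.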
Open source time appliance
Under Meta’s Open Compute Project, we have started a Time Appliances Project workstream dedicated to developing the Open Source Time Appliance. We wanted to liberate the industry from proprietary solutions, facilitate transparency, and significantly reduce the cost of the time appliance.
While implementing open-source time-appliance software, we had to write complex logic for handling different constellations and leap second indicators to produce TAI. We published an in-depth article detailing our approach, motivations, and the process of building our time appliance.
Once the time appliance is synchronized, we are ready to propagate time across a packet-switched network.
Network time protocol
Network time protocol (NTP) is one of the most common types of time synchronization within data centres. It is a very reliable, battle-tested technology. Most servers and end-user devices around the world rely on NTP to keep their time up to date.
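To make the mechanism concrete, the sketch below shows the standard NTP on-wire calculation, in which a client estimates its clock offset and the round-trip delay from four timestamps. The formulas follow the NTP specification (RFC 5905); the function name and the sample timestamps are illustrative and are not taken from any production implementation.

```python
from typing import Tuple


def ntp_offset_and_delay(t1: int, t2: int, t3: int, t4: int) -> Tuple[float, int]:
    """Estimate clock offset and round-trip delay from one NTP exchange.

    t1: request sent       (client clock)
    t2: request received   (server clock)
    t3: response sent      (server clock)
    t4: response received  (client clock)
    All values share one unit, here microseconds.
    """
    # Positive offset means the client clock is behind the server clock.
    offset = ((t2 - t1) + (t3 - t4)) / 2
    # Time spent on the network, excluding server processing time.
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Illustrative exchange: the client is 525 us behind, each direction takes 125 us.
offset_us, delay_us = ntp_offset_and_delay(1_000_000, 1_000_650, 1_000_700, 1_000_300)
print(offset_us, delay_us)  # 525.0 250
```

The offset estimate is exact only when the two network directions take equal time, which is one reason low jitter matters for the accuracy NTP can reach.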
At Meta, we run a state-of-the-art low-jitter NTP, which we constantly validate using extremely precise and accurate timing equipment. NTP can reliably bring synchronization error down to hundreds of microseconds with a window of uncertainty under 100 milliseconds. This leaves two options to handle a leap event: stepping the clock or smearing.
Stepping is known to cause issues, making smearing (a technique of spreading or “smearing” time over a period of hours to account for leap seconds) the preferred option in most cases. Our equipment allows us to measure the impact of leap-second smearing down to a few nanoseconds (see figure).
From these measurements we know the sizes of adjustments can reach tens of microseconds per second — large enough to crash software unless a monotonic clock is used. This puts additional pressure on our engineering teams and frequently causes issues within different parts of the infrastructure.
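The arithmetic behind such adjustments can be sketched as follows. This is a minimal example assuming a linear smear over a 24-hour window; both the linear shape and the window length are assumptions chosen for illustration, since the text only says that the smear spans a period of hours. The function names are hypothetical.

```python
def smear_offset(elapsed_s: float, window_s: float, leap_s: float = 1.0) -> float:
    """Portion of the leap second applied elapsed_s seconds into a linear smear."""
    if elapsed_s <= 0:
        return 0.0
    if elapsed_s >= window_s:
        return leap_s
    return leap_s * elapsed_s / window_s


def smear_rate_us_per_s(window_s: float, leap_s: float = 1.0) -> float:
    """Clock rate adjustment during the smear, in microseconds per second."""
    return leap_s / window_s * 1e6


WINDOW_S = 24 * 3600  # assumed 24-hour smear window

print(smear_offset(WINDOW_S / 2, WINDOW_S))     # 0.5
print(round(smear_rate_us_per_s(WINDOW_S), 2))  # 11.57
```

A shorter window produces a proportionally larger adjustment per second, and passing leap_s=-1.0 models a negative leap second.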
Similar pressures are felt across the digital industry. Given such challenges, we are not looking forward to the first-ever negative leap second.
Precision time protocol
Even though network time protocol is fine for most user applications today, we find it increasingly difficult, or even impossible, to use for distributed storage systems, where demanding applications require much tighter guarantees.
This is why companies like Meta deploy additional synchronization solutions such as precision time protocol (PTP), which pushes the window of uncertainty down to nanoseconds.
This level of precision makes it simply impossible to smear a leap second safely. Therefore, precision time protocol is mostly used with TAI. When conversion to UTC is required, it has to be performed separately for each client, which means degrading the window of uncertainty by several orders of magnitude.
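One way to reason about a window of uncertainty is to treat every clock reading as an interval rather than a single instant. The sketch below is a hypothetical illustration and not Meta's API: two events can be ordered with confidence only when their intervals do not overlap, which is why a narrower window matters. The uncertainty values of 100 nanoseconds and 100 microseconds are assumed purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeWindow:
    """A clock reading with its uncertainty, in nanoseconds (hypothetical type)."""
    earliest_ns: int
    latest_ns: int

    @classmethod
    def from_reading(cls, reading_ns: int, uncertainty_ns: int) -> "TimeWindow":
        """Build the interval that is guaranteed to contain the true time."""
        return cls(reading_ns - uncertainty_ns, reading_ns + uncertainty_ns)

    def definitely_before(self, other: "TimeWindow") -> bool:
        """True only if this event certainly happened before the other one."""
        return self.latest_ns < other.earliest_ns


# Two events whose readings are 1 microsecond apart.
a_ns, b_ns = 1_000_000_000, 1_000_001_000

# Assumed nanosecond-scale uncertainty: the windows do not overlap.
tight_a = TimeWindow.from_reading(a_ns, 100)
tight_b = TimeWindow.from_reading(b_ns, 100)
print(tight_a.definitely_before(tight_b))  # True

# Assumed microsecond-scale uncertainty: the windows overlap, order is unknown.
loose_a = TimeWindow.from_reading(a_ns, 100_000)
loose_b = TimeWindow.from_reading(b_ns, 100_000)
print(loose_a.definitely_before(loose_b))  # False
```

Any conversion step that adds uncertainty widens these intervals, so events that could be ordered before the conversion may no longer be distinguishable after it.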
It’s time
We support the decision of the International Bureau of Weights and Measures (BIPM) to discontinue the leap second in practice by 2035.
A fixed UTC will slowly diverge from observed solar time, but it will increase the stability of critical systems. Having a leap hour or a daylight-saving time correction once every few millennia will be a much safer and more sustainable approach for everyone.
This article first appeared in ITU News Magazine: The future of Coordinated Universal Time – part of a series of editions on topics to be discussed at the World Radiocommunication Conference (WRC-23), from 20 November to 15 December in Dubai, UAE.