Cloudflare has reported on the upgrade of one of its key databases, the one behind its authoritative DNS service, which is described as the largest in the world with a reported 14.3% share (GoDaddy follows at 10.3%). The database stores zone data specifying the IP addresses that names resolve to. It is accessed through the DNS Records API and is synced out to key-value databases worldwide to answer public queries.
The database, named cfdb, runs on PostgreSQL; the report does not specify the version. Other services previously shared this database but have since migrated out, leaving DNS as the primary service with only two tables remaining: cf_rec and cf_archived_rec. Even with just these two tables, the database holds 1.7 billion rows and occupies 1.5 TB on disk. Normal daily activity amounts to 3-5 million rows added, about 1 million rows edited, and 3-5 million rows deleted.
The upgrade plan had two hard requirements: no data could be lost, and downtime had to be kept to a few seconds at most. The upgrade lets the team use the latest PostgreSQL features (the version is again unspecified; it may be the recently released version 17). The new database is named dnsdb to reflect its DNS purpose.
The team first evaluated pglogical, but it did not meet several requirements: the ability to roll back to the old database if problems arose, support for partitioning, and access from dnsdb to other tables that remain in cfdb. The team therefore designed its own migration process, adding special tables to track migration progress. The initial data transfer avoided pg_dump to minimize the impact on production traffic and instead used bulk COPY commands of 1 million rows each, written directly to the new database. A script then sent subsequent changes to the destination database every 3 seconds and ran for several weeks to keep the two databases synchronized in near real time.
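The batched copy followed by periodic change replay described above can be illustrated with a minimal sketch. This is not Cloudflare's code: it uses in-memory SQLite databases as stand-ins for cfdb and dnsdb so it can run anywhere, and the `change_log` table, its columns, the `content` column, and all function names are hypothetical. The batch size of 3 stands in for the 1-million-row batches.

```python
import sqlite3

BATCH_SIZE = 3  # stand-in for the 1-million-row batches in the article


def make_db(with_change_log: bool = False) -> sqlite3.Connection:
    """Create an in-memory stand-in database with an assumed cf_rec schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE cf_rec (id INTEGER PRIMARY KEY, content TEXT)")
    if with_change_log:
        # Hypothetical tracking table: one row per change made on the source.
        db.execute(
            "CREATE TABLE change_log ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, rec_id INTEGER, op TEXT)"
        )
    return db


def copy_in_batches(src, dst, batch_size: int = BATCH_SIZE) -> int:
    """Copy cf_rec in primary-key order, one bounded batch at a time."""
    last_id, batches = 0, 0
    while True:
        rows = src.execute(
            "SELECT id, content FROM cf_rec WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return batches
        dst.executemany("INSERT OR REPLACE INTO cf_rec VALUES (?, ?)", rows)
        dst.commit()
        last_id = rows[-1][0]  # resume after the last key copied
        batches += 1


def sync_changes(src, dst, last_seq: int) -> int:
    """Replay logged changes newer than last_seq; return the new high-water mark."""
    changes = src.execute(
        "SELECT seq, rec_id, op FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, rec_id, op in changes:
        if op == "delete":
            dst.execute("DELETE FROM cf_rec WHERE id = ?", (rec_id,))
        else:
            row = src.execute(
                "SELECT id, content FROM cf_rec WHERE id = ?", (rec_id,)
            ).fetchone()
            if row is not None:  # a later logged delete may have removed it
                dst.execute("INSERT OR REPLACE INTO cf_rec VALUES (?, ?)", row)
        last_seq = seq
    dst.commit()
    return last_seq
```

In a real deployment, `sync_changes` would be called on a timer (every 3 seconds in the article), with the returned high-water mark persisted between runs so that a restart does not replay or skip changes.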
The final step before the cutover was creating a table, cf_migration_manager, that lets the DNS Records API check whether the migration is in progress or whether the new database is ready to accept writes. If the new database is not ready, the API holds requests. The team also shortened the interval of the process (apparently the synchronization script) to 0.5 seconds. Once everything was ready, they locked the old database against writes, moved the remaining data, and enabled writes to dnsdb. The entire cutover took less than 2 seconds. After the DNS Records API switched over, latency spiked for approximately 7 minutes before returning to normal.
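To make the gating idea concrete, here is a minimal sketch of how an API might consult a migration state before each write and hold requests while writes are paused. It is an illustration under assumptions, not the actual DNS Records API logic: the state names, the `MigrationManager` class (an in-memory stand-in for a row in cf_migration_manager), and `route_write` are all hypothetical, and plain lists stand in for the two databases.

```python
import time
from enum import Enum


class MigrationState(Enum):
    # Hypothetical state names; the article does not list the real ones.
    SOURCE = "write_to_cfdb"
    LOCKED = "writes_paused"
    TARGET = "write_to_dnsdb"


class MigrationManager:
    """In-memory stand-in for the state stored in cf_migration_manager."""

    def __init__(self) -> None:
        self.state = MigrationState.SOURCE


def route_write(manager, record, cfdb, dnsdb,
                timeout=5.0, poll=0.01, sleep=time.sleep) -> str:
    """Append record to whichever store the manager designates.

    While writes are paused, the request is held and the state is polled
    again, up to `timeout` seconds of accumulated waiting.
    """
    waited = 0.0
    while manager.state is MigrationState.LOCKED:
        if waited >= timeout:
            raise TimeoutError("migration lock held too long")
        sleep(poll)
        waited += poll
    if manager.state is MigrationState.TARGET:
        dnsdb.append(record)
        return "dnsdb"
    cfdb.append(record)
    return "cfdb"
```

Holding requests instead of rejecting them is what keeps a cutover of a couple of seconds invisible to most callers: they see added latency rather than errors. The timeout bounds how long a request can wait if the cutover stalls.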
TLDR: Cloudflare upgraded their massive Authoritative DNS database, migrating to a new PostgreSQL database named dnsdb with a meticulous plan to ensure data integrity and minimal downtime, utilizing bulk data transfer and real-time synchronization for a seamless transition.