Slow MySQL Start Time in GTID mode? Binary Log File Size May Be The Issue

Have you been experiencing slow MySQL startup times in GTID mode? We recently ran into this issue on one of our MySQL hosting deployments and set out to solve the problem. In this blog, we break down the issue that could be slowing down your MySQL restart times, how to debug it on your own deployment, and what you can do to decrease your start time and improve your understanding of GTID-based replication.

How We Found The Problem

We were investigating slow MySQL startup times on a low-end, disk-based MySQL 5.7.21 deployment which had GTID mode enabled. The system was part of a master-slave pair and was under a moderate write load. When restarting it during a scheduled maintenance window, we noticed that the database server took 5-10 minutes to start up and begin accepting connections. That kind of delay didn’t make sense, so we set out to investigate.

Debugging Your Slow MySQL Start Time

We used the popular Percona tool pt-ioprofile to see what the database was doing. pt-ioprofile is part of Percona Toolkit, a collection of utilities used to debug MySQL issues, and you can see its complete list of features in the Percona documentation. The pt-ioprofile tool uses strace and lsof to watch a process’s I/O and print out a table of files and I/O activity.

So, we started MySQL, waited for the mysqld process to get spawned, and started pt-ioprofile to see what the issue might be:

# pt-ioprofile --profile-process mysqld --run-time 200
Tue Oct 9 15:42:24 UTC 2018
Tracing process ID 18677
total      pread       read     pwrite      write      fsync  fdatasync       open      close   getdents      lseek      fcntl filename
216.550641   0.000000  216.550565   0.000000   0.000000   0.000000   0.000000   0.000015   0.000040   0.000000   0.000021   0.000000 /mysql_data/binlogs/mysql-bin.000014

What’s Slowing Your MySQL Restart?

On running this multiple times, we observed the following:

  • The mysqld process was spending most of its time reading the latest binary log file. This was the case even when the server had been stopped gracefully, so no crash recovery was needed.
  • The server also spent a considerable amount of time loading the InnoDB data files, but that time was much smaller than the time spent reading the latest binary log file.
  • If the server was restarted again immediately, the subsequent restart was much faster.
  • Since the server creates a new binary log file at each startup, we ran an additional experiment – we flushed the binary logs just before shutting down the server, so that the newest log file was nearly empty. The subsequent server start was fast again.

These observations clearly pointed to the fact that MySQL was spending a lot of time reading the latest binary log file. If that file was small – as it was when we flushed the logs just before shutdown – the startup was fast.
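To check whether the same thing applies to your deployment, you can inspect the sizes of the binary log files from the MySQL client before a restart – a latest file that is hundreds of MB or larger is a candidate cause of slow GTID recovery:

```sql
-- List the binary log files with their sizes; the last row is the file
-- the server will have to parse in full at the next startup.
SHOW BINARY LOGS;

-- Check the configured rotation threshold (default 1 GB).
SHOW VARIABLES LIKE 'max_binlog_size';
```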


Understanding Binlog GTID Recovery

As it turns out, in order to populate the values of gtid_executed and gtid_purged, the MySQL server must parse the binary log files.

Here’s a summary of the recovery method described in the MySQL 5.7 documentation, for each setting of binlog_gtid_simple_recovery:

When binlog_gtid_simple_recovery = FALSE:

To compute gtid_executed:

  • Iterate binary log files from the newest, stopping at the first file that has a Previous_gtids_log_event entry.
  • Consume all GTIDs from Previous_gtids_log_event and Gtid_log_events from this binary log file, and store this GTID set internally. It is referred to as gtids_in_binlog.
  • Value of gtid_executed is computed as the union of gtids_in_binlog and the GTIDs in the mysql.gtid_executed table.

This process can be very time consuming if there are a large number of binary log files without GTIDs, for example, created when gtid_mode=OFF.

Similarly, to compute gtid_purged:

  • Iterate binary log files from the oldest to the newest, stopping at the first binary log that contains either a nonempty Previous_gtids_log_event (has at least one GTID), or that has at least one Gtid_log_event.
  • Read Previous_gtids_log_event from this file. Compute the internal variable gtids_in_binlog_not_purged as this GTID set subtracted from gtids_in_binlog.
  • Value of gtid_purged is set to gtid_executed, minus gtids_in_binlog_not_purged.
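The steps above can be sketched in Python. Note that this is an illustration only – the data model (a list of per-file dicts, ordered oldest to newest, with previous_gtids and gtids sets) and the function name are our own invention, not MySQL internals:

```python
# Illustrative sketch of binlog_gtid_simple_recovery = FALSE.
# The dict-based binlog representation is invented for this example;
# it is NOT how MySQL stores binary logs internally.

def recover_gtids_full_scan(binlogs, gtid_executed_table):
    """binlogs is ordered oldest -> newest.

    previous_gtids is None when a file has no Previous_gtids_log_event
    (e.g. it was written with gtid_mode=OFF); gtids is the set of GTIDs
    from the Gtid_log_events in that file.
    """
    # gtid_executed: scan from the newest file backwards until we hit a
    # file that carries a Previous_gtids_log_event.
    gtids_in_binlog = set()
    for log in reversed(binlogs):
        if log["previous_gtids"] is not None:
            gtids_in_binlog = log["previous_gtids"] | log["gtids"]
            break
    gtid_executed = gtids_in_binlog | gtid_executed_table

    # gtid_purged: scan from the oldest file forwards until we hit a file
    # with a nonempty Previous_gtids_log_event or at least one GTID.
    gtid_purged = set()
    for log in binlogs:
        if log["previous_gtids"] or log["gtids"]:
            not_purged = gtids_in_binlog - (log["previous_gtids"] or set())
            gtid_purged = gtid_executed - not_purged
            break
    return gtid_executed, gtid_purged
```

Both scans can touch many files, which is why this mode is slow when there are many binary logs without GTIDs.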

This covers how things work in older versions, or when the optimization is disabled. However, certain shortcuts can be taken when binlog_gtid_simple_recovery is TRUE. This is the case we are interested in:

When binlog_gtid_simple_recovery = TRUE:

(Note, this is the default in MySQL 5.7.7 and later)

  • Read just the oldest and the newest binary log files.
  • Compute gtid_purged from the Previous_gtids_log_event or Gtid_log_event found in the oldest binary log file.
  • Compute gtid_executed from the Previous_gtids_log_event or Gtid_log_event found in the newest binary log file.
  • Thus, only two binary log files are read during the server restart or when purging binary logs.
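A rough Python sketch of the TRUE case, using an invented representation of the binlogs as a list of dicts (ordered oldest to newest, each with previous_gtids and gtids sets – our illustration, not MySQL internals):

```python
# Illustrative sketch of binlog_gtid_simple_recovery = TRUE.
# The dict-based binlog representation is invented for this example;
# it is NOT how MySQL stores binary logs internally.

def recover_gtids_simple(binlogs, gtid_executed_table):
    """binlogs is ordered oldest -> newest; only two files are read."""
    oldest, newest = binlogs[0], binlogs[-1]

    # gtid_purged comes straight from the oldest file's
    # Previous_gtids_log_event, which is the first event in the file,
    # so this read is cheap.
    gtid_purged = oldest["previous_gtids"] or set()

    # gtid_executed still requires parsing the whole newest file to
    # collect its Gtid_log_events -- which is why startup time grows
    # with the size of the latest binary log file.
    gtids_in_binlog = (newest["previous_gtids"] or set()) | newest["gtids"]
    gtid_executed = gtids_in_binlog | gtid_executed_table
    return gtid_executed, gtid_purged
```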

So, for MySQL versions 5.7.7 and above, the newest and the oldest binary log files are always read during system startup to correctly initialize the GTID system variables. Reading the oldest binary log file isn’t expensive, since the event MySQL is looking for, Previous_gtids_log_event, is always the first event in a binary log file.

However, in order to correctly compute gtid_executed, the server must read through the entire latest binary log file and collect all the events in that file. So, the system startup time becomes directly proportional to the size of the latest binary log file.

Note that the situation is even worse when binlog_gtid_simple_recovery is FALSE. Since it is no longer the default option in recent releases, it is not much of a concern.

How To Resolve Your Slow Start Time

Having understood the cause of the issue we were running into, the solution we decided on was fairly obvious – reduce the size of the binary log files. The default value of max_binlog_size is 1GB, and parsing a file of that size during startup takes time, so it makes sense to set max_binlog_size to a lower value.
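For example, to cap the binary logs at 256MB (the exact value is a tuning choice, not a recommendation from MySQL):

```sql
-- Applies to binary log files created from now on.
-- 268435456 bytes = 256 MB.
SET GLOBAL max_binlog_size = 268435456;
```

To make the change survive a restart, also set max_binlog_size under the [mysqld] section of your configuration file.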

If decreasing the size of the binary log file is not an option, then flushing the binary log files just before a maintenance shutdown of the mysqld process can help to decrease the binlog GTID recovery times.
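Rotating the log just before a planned shutdown leaves a nearly empty file for the next startup to parse:

```sql
-- Rotate to a fresh, near-empty binary log, then shut down.
FLUSH BINARY LOGS;
SHUTDOWN;   -- MySQL 5.7.9+; alternatively stop the service from the OS
```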


Vaibhaw is a Member of the Technical Staff at (Formerly MongoDirector). You can reach out to him at @_vaibhaw

  • Jean-François Gagné

    Great post. A similar problem happens when a slave connects to a master in GTID Auto-Positioning: the matching binlog can easily be found by a linear search in the header of the binlog, but the exact data to send to a slave needs parsing the content of the binlog. And there is no easy solution to that, binlog “indexing” has been suggested by some, but I do not like the idea of trying to improve a clunky technology, I would prefer not to need parsing the binlogs by dropping GTID in favour of Binlog Servers.

    Coming back to the subject of your post and IMHO, this slow start is a bug. Completely reading the last binlog file on startup is needed in case of a crash, but it should not be needed after a normal shutdown. Your trick of flushing the binlog before shutdown would make a very nice feature request (it could also be considered an S3 bug). As you documented the behaviour in this post, you should be credited for the bug – have you considered opening one?

    About reducing the binlog size, it is a nice workaround, but there are drawbacks of shrinking the size of the binlogs. One of them is creating more files, which should not be needed. There is a 32-bit limit in the current binlog implementation, which makes it hard to have binlogs larger than 4 GB (hence the 1 GB “normal” limit), even if the last transaction in a binlog can bring the binlog size to more than 4 GB. When externalising binlogs in HDFS, bigger is better and I would prefer to make progress by getting bigger binlogs, not smaller.

    • Vaibhaw

      Thanks for your thoughtful response Jean!
      You make a very interesting point there about initial slave handshake.
      I have no prior experience of reporting issues to the MySQL team – not encouraged by the number of old bugs they have lying around. :) But since you suggested this – I will report it and see where it goes. Like you said, if folks prefer to have large binary log files around – fixing this might come in handy.

      • Jean-François Gagné

        About the initial slave handshake, I seem to remember an old bug where, while the master was parsing the binlogs, the master was not sending heartbeats, so if the slave had a too small timeout, it would reconnect to the master and there would now be 2 processes parsing the binlogs at the same time on the master –> waste of CPU on the master.

        About bugs, it is true that some old bugs are lying around unprocessed. Some people (including me) are sometimes raising attention on those on Twitter using the @MySQL and #bugoftheday tags (the initiator for this is Valerii Kravchuk whose Twitter handle is @mysqlbugs). Eventually, things are moving. One thing is clear though: without reporting bugs, they will never get fixed.

        If/when you open your bug/feature request, can you share the bug number?