Wednesday, September 22, 2010

Have you seen my replication files?

I recently started looking into how to get information about the relay log files and binary log files using an SQL interface. Being able to do that can be quite handy when working with replication in various ways. In my particular case, I wanted to get the paths to the relay log index file and the binary log index file so that I could read the binary log files as well as the relay log files directly. You are probably familiar with the --relay-log-index and --relay-log options that can be set to specify where these files are placed. These options can be given either an absolute or a relative path: if the option value starts with a /, it is considered an absolute path (drive letters are allowed on Windows though); otherwise, the path is relative to the data directory (which is specified through the --datadir option). The values supplied to these options are available from SQL as the system variables relay_log_index and relay_log respectively.

The recommendation is to always set --relay-log and --relay-log-index, since the default values for these options contain the hostname. The problem with this is that if the database files are moved to a new machine with a different hostname, the server will not be able to pick up the files correctly and will assume that they do not exist. The logic for finding the location of the relay log files can be quite daunting; to find the location of the relay log index file:
  1. If relay_log_index is set, this is the location of the relay log index file.
  2. If relay_log_index is not set, then the value supplied to the relay_log option is used to figure out the name of the relay log index file.
  3. If neither relay_log_index nor relay_log is set, then the name of the relay log index file is taken by stripping the directory and extension from the pid_file variable (set using the --pid-file option), if supplied, and adding -relay-bin.index to the end of the string.
    • The pid_file variable has a default value which consists of datadir/hostname.pid, which would give the relay log index file a name of datadir/hostname-relay-bin.index.
  4. If the path is a relative path—that is, the path does not start with a directory separator—then the value of datadir is prepended to the relay log index file name.
Keeping track of all these details is not something I want to spend my time on, so I wrote a stored function for computing the name of the relay log index file which I simply called relay_log_index_file:
CREATE FUNCTION relay_log_index_file () RETURNS VARCHAR(256)
  DETERMINISTIC
  READS SQL DATA
BEGIN
  DECLARE rli_name VARCHAR(256);
  IF @@relay_log_index IS NOT NULL THEN
    -- Step 1: the option gives the index file name directly
    SET rli_name = @@relay_log_index;
  ELSEIF @@relay_log IS NOT NULL THEN
    -- Step 2: derive the index file name from the relay log base name
    -- (assuming the base name is given without an extension)
    SET rli_name = CONCAT(@@relay_log, '.index');
  ELSE
    -- Step 3: derive the name from the base name of the PID file
    BEGIN
      DECLARE l_pid_file VARCHAR(256);
      DECLARE l_pid_base VARCHAR(256);
      SET l_pid_file = SUBSTRING_INDEX(@@pid_file, '/', -1);  -- strip directory
      SET l_pid_base = SUBSTRING_INDEX(l_pid_file, '.', 1);   -- strip extension
      SET rli_name = CONCAT(l_pid_base, '-relay-bin.index');
    END;
  END IF;

  -- Step 4: prepend the data directory to relative paths
  IF rli_name NOT LIKE '/%' THEN
    RETURN CONCAT(@@datadir, rli_name);
  END IF;

  RETURN rli_name;
END
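The function can then be called from any connection—for example, with SELECT relay_log_index_file()—to get the full path to the relay log index file.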
This is a quite complicated way of figuring out the location of the relay log files and hardly something that I consider very useful. It would be much better if the relay_log_index variable gave the complete path to the file, regardless of what was given to the --relay-log-index option (or even if the option was given at all).

Being able to fetch the relay log index file is quite convenient, but being able to fetch the binary log index file would be even more convenient. Unfortunately, there is no such variable. The --log-bin option can be used to supply a base name to use for the binary log, but the log_bin variable can only be ON or OFF, which in my book is not very smart. To fix this, I created WL#5465, which introduces three new variables—log_bin_basename, relay_log_basename, and log_bin_index—and changes the behaviour of relay_log_index.

log_bin_basename
This is a global read-only variable that contains the base file name used for the binary log files, that is, the full path to the files but without the extension.
  • If a full path was given to --log-bin, this will be the value of log_bin_basename.
  • If a relative path was given to --log-bin, the contents of datadir will be used as the directory and prepended to the value of --log-bin.
  • Otherwise, the value of datadir will be used as the directory of the file, and the base name is created by taking the base name of pid_file (the name without directory and extension) and adding '-bin'.
log_bin_index
This is a global read-only variable containing the full name of the binary log index file. If no value is given, the value of log_bin_basename is used and the extension '.index' is added.
relay_log_basename
This is a global read-only variable containing the base file name used for the relay log file, that is, the full path to the relay logs but not including the extension. The value of this variable is created in the same way as for log_bin_basename with the only difference that the '-relay-bin' suffix is used instead of '-bin'.
relay_log_index
This is a global read-only variable containing the full name of the relay log index file. If no value is given, the value of relay_log_basename is used and the extension '.index' is added.
With these new variables, fetching the full path of the binary log index file is as easy as doing a 'SELECT @@log_bin_index'.
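For example, an application can use the variable to locate and read the binary log index file directly to get the list of binlog files. Here is a minimal sketch, assuming a server where log_bin_index is available and an application running on the same host as the server (so that the file can be opened directly):

def get_binlog_files(connection):
    # Fetch the full path of the binary log index file from the server
    cursor = connection.cursor()
    cursor.execute("SELECT @@log_bin_index")
    index_file = cursor.fetchone()[0]
    # The index file lists one binlog file per line
    with open(index_file) as f:
        return [line.strip() for line in f]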
If you're interested in whether the patch for this worklog will be in any particular server version, or if it will be pushed at all, you have to check the status of the worklog: even though I have described the architecture and implemented a patch, there is no way to know where the patch ends up, or if it ends up anywhere at all.

An alternative: let the application do the job

Creating a stored function for computing the relay log index file name might be overkill in many situations. If the value is needed from several different connections, it makes sense to create a stored function so that it can be used by different applications. It can, however, just as well be placed in the application code, which would then compute the location of the relay log index file using a single query to the server.

The information you need is the data directory from datadir, the pid file name from pid_file (in the event that the relay log or the relay log index option does not have a value), and the relay_log and relay_log_index values.

For example, the following Python code could be used to compute the data directory, the base name used for creating relay log files, and the name of the index file using a single query to the database server:

import os.path

def get_relay_log_info(connection):
    cursor = connection.cursor()
    cursor.execute("SELECT @@datadir, @@pid_file, @@relay_log, @@relay_log_index")
    datadir, pid_file, relay_log, relay_log_index = cursor.fetchone()

    def _add_datadir(filename):
        # Relative paths are interpreted relative to the data directory
        if os.path.isabs(filename):
            return filename
        else:
            return os.path.join(datadir, filename)

    # Fall back on the base name of the PID file when the options are not set
    pidfile_base = os.path.basename(os.path.splitext(pid_file)[0])
    base_name = _add_datadir(relay_log or pidfile_base + '-relay-bin')
    index_file = _add_datadir(relay_log_index or base_name + '.index')

    return { 'datadir': datadir, 'base': base_name, 'index': index_file }
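As a usage sketch—assuming the MySQLdb driver and illustrative connection parameters—the function can be used like this:

import MySQLdb

connection = MySQLdb.connect(host='127.0.0.1', user='root')
info = get_relay_log_info(connection)
print(info['index'])   # e.g. /var/lib/mysql/hostname-relay-bin.index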

Wednesday, August 18, 2010

Binary Log Group Commit - Recovery

It has been a while since I wrote the previous article, but the merging of Oracle and Sun here resulted in quite a lot of time having to be spent on attending various events and courses for legal reasons (one of the reasons I prefer working for smaller companies), and together with a summer vacation spent on looking over the house, there was little time for anything else. This is the second post of three; in the last one I will cover some optimizations that improve performance significantly.

In the previous article, an approach was outlined to handle binary log group commit. The basic idea is to use the binary log as a ticketing system by reserving space in it for the transactions that are going to be written. This provides an order on the transactions as well as allowing the transactions to be written in parallel to the binary log, thereby boosting performance. As noted in the previous post, a crash while writing transactions to the binary log requires recovery. To understand what needs to be changed, it is necessary to understand the structure of the binary log as well as how recovery after a crash currently works together with the implementation of two-phase commit that MySQL uses.

Figure 1. Binlog file structure

A quick intro to the structure of the binary log

Figure 1 gives the rough structure of the binary log: a set of binlog files and a binlog index file. The binlog index file just lists the binlog files that make up the binary log, while each binlog file holds the real contents of the binary log, which you can see when executing SHOW BINLOG EVENTS.

Each binlog file consists of a sequence of binlog events, where the most important event from our perspective is the Format description event. In addition, each binlog file is normally terminated by a Rotate event that refers to the next binlog file in the sequence.

The Format description event is used to describe the contents of the binlog file and therefore contains a lot of information about it. In this case we are interested in a special flag called LOG_EVENT_BINLOG_IN_USE_F, which is used to tell whether the binlog file is actively being written by the server. When the server opens a new binlog file, this flag is set to indicate that the file is in use, and when the binary log is rotated and a new binlog file created, this flag is cleared when closing the old binlog file.

In the event of a crash, the flag will therefore be set and the server can see that the file was not closed properly and start with performing recovery.
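In other words, checking whether a binlog file was closed properly amounts to inspecting the flags field of its Format description event. Here is a schematic sketch in Python: the offsets follow the common event header layout (a 19-byte header holding timestamp, type code, server ID, event length, next position, and flags), and LOG_EVENT_BINLOG_IN_USE_F is the lowest bit of the flags field.

import struct

def binlog_in_use(path):
    with open(path, 'rb') as f:
        f.read(4)                 # skip the binlog magic number
        header = f.read(19)       # header of the Format description event
        flags = struct.unpack('<H', header[17:19])[0]
        return bool(flags & 0x1)  # LOG_EVENT_BINLOG_IN_USE_F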

Recovery and the binary log

When recovering, the server has to find all transactions that were partially executed and decide if they are going to be rolled back or committed properly. The deciding point when a transaction will be committed instead of rolled back is when the transaction has been written to the binary log. To do this, the server has to find all transactions that were written to the binary log and tell all storage engines to commit these transactions.

The recovery procedure is executed when the binary log is opened—which the server does by calling TC_LOG_BINLOG::open during startup. When the binary log is opened, recovery is done if the last open binlog file was not closed properly. An outline of the procedure is given below (a schematic sketch follows the list):

  1. Open the binlog index file and go through it to find the last binlog file mentioned there [TC_LOG_BINLOG::open]
  2. Open this binlog file and check if the LOG_EVENT_BINLOG_IN_USE_F flag is set
  3. If the flag was clear, then the server stopped properly and no recovery is necessary. Otherwise, the server did not stop properly and recovery is started.
  4. The last binlog file is now open, so the entire binlog file is scanned and the XID of each Xid event is recorded. These XIDs denote the transactions that were properly written to the binary log—that is, the transactions that shall be committed [TC_LOG_BINLOG::recover].
  5. Each storage engine is handed the list of XIDs of transactions to commit through the handlerton::recover interface function [ha_recover].
  6. The storage engine will then commit each transaction in the list and roll back all the others.
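Expressed as code, the scan and the hand-off to the engines could look as follows. This is a schematic sketch only: read_events() and the recover() method on the engines are hypothetical stand-ins for the server internals referenced in the list above.

def recover(binlog_file, engines):
    # Scan the binlog file and collect the XID of every Xid event;
    # these transactions reached the binary log and shall be committed
    commit_list = set()
    for event in read_events(binlog_file):
        if event.type == 'Xid':
            commit_list.add(event.xid)
    # Each engine commits every prepared transaction whose XID is in
    # the list and rolls back all other prepared transactions
    for engine in engines:
        engine.recover(commit_list)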
Figure 2. Parallel binary log group commit

So, what's the problem?

The procedure above works fine, so what are the problems we have to solve to implement the procedure described in the previous article? If you look at Figure 2, you have a hint of what the problem is.

Now, assume that threads 1, 2, and 3 in Figure 2 are writing transactions to disk (starting at positions Trans_Pos1, Trans_Pos2, and Trans_Pos3 respectively) and that a preceding thread (a thread that got a binlog position before Last_Complete) decides that it is time to call fsync to group commit the state so far. The binlog file will then be written in this state—where some transactions are partially written—and Last_Committed will be set to the value of Last_Complete, leading to the situation depicted in Figure 2.

As you can see in the figure, thread 2 has already finished writing its data to the binary log, so that data has been written to durable storage. However, since thread 1—which precedes thread 2 in the binary log—has not completed yet, thread 2 has not yet committed and is still waiting for all the preceding transactions to complete. If a crash occurs in this situation, it is necessary to somehow find the XIDs of all transactions that have committed—excluding the transaction that thread 2 has completed—and commit them in the storage engines when recovering.

A proposal for a new recovery algorithm

In the original algorithm, the scan of the binlog file stopped when the file ended, but since there can be partially written events in the binlog file after the "real" end of the file (the binlog file ends logically at Last_Committed/Last_Complete), we have to find some other way to detect the logical end of the file.

To handle this, it is necessary to somehow mark events that are not yet committed so that the recovery algorithm can find the correct position where the binlog file ends. The same problem occurs if one wants to persist the end of the binlog file when preallocating the binlog file. There are basically three ways to handle this:

  • Write the end of the binlog file in the binlog file header (that is, the Format description log event).
  • Mark each event by zeroing out a field that cannot be zero—for example, the length, the event type, or event position—before writing the event to the binary log. Then write this field with the correct value after the entire event has been written.
  • Checksum the events and find the end of the binlog by scanning for the first event with an incorrect checksum.
Write the length in the binlog file header
Finding the length of the binlog file in this case is easy: just inspect the header and find the length of the binlog file there. In this case, it is necessary to update the length after each event has been written, since there may be an fsync call at any time between starting to write the event data and finishing writing the event. Normally, this means updating two blocks of the file for each event written, which can be a problem since it requires at least the block containing the header and all the blocks that were written since the last group commit to be written when calling fsync. If a large number of events is written between each fsync, this might not impose a large penalty, but with sync_binlog=1 it might become quite expensive. Some experiments done by Yoshinori showed a drop from 15k events/sec to 10k events/sec, which means that we lose one third in performance.

Digression. The measurements that Yoshinori did consisted of one pwrite to write the event, one pwrite to write the length to the header, and then a call to fsync. It is, in other words, most similar to using sync_binlog=1. In reality, however, this will not be the case, since a user that is using binary log group commit will have several events written between each call to fsync. Since these writes will be to memory (the file pages are in memory), performance will not drop as much. To evaluate the behavior in a group commit situation better, writing 10 events at a time was compared as well (pretending to be sync_binlog=10). Straight append (using write) gave at that point 110k events/sec, and writing to the header before calling fsync gave 80k events/sec. This means a performance reduction of 27%, which is an improvement but still a very large overhead.

Use a marker field
The second alternative is to use one of the fields as a marker field. By setting one of the fields that cannot be zero to zero, it is possible to detect that the event is incorrect and stop at the event before it. Good candidate fields are the length—which cannot be zero for any event and is four bytes—and the event type, which is one byte and where zero denotes an unknown event and never occurs naturally in a binlog file. The technique would be to first blank out the type field of the event, write the event to the binlog file, and then use pwrite to fill in the correct type code after the entire event is written. If an fsync occurs before the event type is written, the event will be marked as unknown, and if a crash occurs before the event is completely written (and written to disk), it will be possible to scan the binlog file to find the first event that is marked as unknown. In order for this technique to work, it is necessary to zero the unused part of the binlog file before starting to write anything there (or at least zero out the event type). Otherwise, crash recovery will not be able to correctly detect where the last completely written event is located.
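As a sketch of the idea—using Python 3's os.pwrite for pwrite semantics and assuming the type code is the fifth byte of the event, as in the binlog event header:

import os

UNKNOWN_EVENT = 0   # a type code of zero never occurs in a valid binlog file

def write_event(fd, pos, event):
    # Write the event with the type field blanked out, so that an fsync
    # at any point leaves a detectably incomplete event on disk
    blanked = event[:4] + bytes([UNKNOWN_EVENT]) + event[5:]
    os.pwrite(fd, blanked, pos)
    # Fill in the real type code only after the entire event is written
    os.pwrite(fd, event[4:5], pos + 4)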

Compared to the previous approach, this does not require writing to locations far apart (except in rare circumstances when the event spans two pages). It also has the advantage of not requiring any change of the binlog format. This technique is likely to be quite efficient. (Note that most of the writes will be to memory, so there will not be any extraneous "seeks" over the disk to zero out parts of the file.)

Checksum on each event
The third alternative is to rely on an event checksum to detect events that are incompletely written. This approach is by far the most efficient of the three, since the event checksum is naturally written last. It also has the advantage of not requiring the unused parts of the binlog file to be zeroed, since it is unlikely that the checksum will be correct unless the event has been fully written. This also makes it a very good candidate for detecting the end of the binlog file when preallocating the binlog file. The disadvantage is, of course, that it requires checksums to be enabled and implemented.

With this in mind, the best approach seems to be to checksum each event and use that to detect the end of the binary log. If necessary, the second approach can be implemented for the case when the binlog is not checksummed.
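To illustrate, recovery could locate the logical end of the binlog file by scanning events until the first checksum mismatch. A schematic sketch, assuming a CRC-32 event checksum and a hypothetical helper read_raw_events() that yields each event's bytes together with its stored checksum:

import zlib

def find_logical_end(binlog_file):
    pos = 4                            # skip the binlog magic number
    for event, stored_checksum in read_raw_events(binlog_file):
        if zlib.crc32(event) & 0xffffffff != stored_checksum:
            break                      # first incomplete event found
        pos += len(event) + 4          # event bytes plus the checksum
    return pos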

The next article will wrap up the description by pointing out some efficiency issues and how to solve them to get an efficient implementation.

Friday, April 30, 2010

Binary Log Group Commit - An Implementation Proposal

It is with interest that I read Kristian's three blog posts on binary log group commit. In them, he mentions InnoDB's prepare_commit_mutex as the main hindrance to accomplishing group commit—which it indeed is—and proposes to remove it, with the motivation that FLUSH TABLES WITH READ LOCK can be used to get a good binlog position instead. That is a solution—but not really a good one—as Kristian points out in the last post.

The prepare_commit_mutex is used to ensure that the order of transactions in the binary log is the same as the order of transactions in the InnoDB log—and keeping the same order in the logs is critical for getting a true on-line backup to work—so removing it is not really an option, as Kristian points out in his third article. In other words, it is necessary to ensure that the InnoDB transaction log and the binary log have the same order of transactions.

To understand how to solve the problem, it is necessary to take a closer look at the XA commit procedure and see how we can change it to implement a group commit of the binary log.

The transaction data is stored in a per-thread transaction cache and the transaction size is the size of the data in the transaction cache. In addition, each transaction will have a transaction binlog position (or just transaction position) where the transaction data is written in the binary log.

The procedure can be outlined in the following steps:

  1. Prepare InnoDB [ha_prepare]:
    1. Write prepare record to log buffer
    2. fsync() log file to disk (this can currently do group commit)
    3. Take prepare_commit_mutex
  2. Log transaction to binary log [TC_LOG_BINLOG::log_xid]:
    1. Lock binary log
    2. Write transaction data to binary log
    3. Sync binary log based on sync_binlog. This forces the binlog to always fsync() (no group commit) due to prepare_commit_mutex
    4. Unlock binary log
  3. Commit InnoDB:
    1. Release prepare_commit_mutex
    2. Write commit record to log buffer
    3. Sync log buffer to disk (this can currently do group commit)
    4. InnoDB locks are released
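Expressed schematically in code, the commit path looks like this; the function and lock names are illustrative stand-ins, not the actual server symbols:

def xa_commit(trx):
    innodb_prepare(trx)               # 1a-1b: write and fsync prepare record
    prepare_commit_mutex.acquire()    # 1c: serializes everything below
    binlog_lock.acquire()             # 2a
    binlog_write(trx)                 # 2b: write transaction data
    binlog_fsync()                    # 2c: one fsync per transaction, since
                                      #     the mutex prevents grouping
    binlog_lock.release()             # 2d
    prepare_commit_mutex.release()    # 3a
    innodb_commit(trx)                # 3b-3d: commit record, sync, unlock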
There are two main problems with this approach:
  • The InnoDB row-level and table-level locks are released very late in the sequence, which affects concurrency. Ideally, we need to release the locks very early, preferably as soon as we have prepared InnoDB.
  • It is not possible to perform a group commit in step 2.
As you can see here, the prepare of the storage engines (in this case just InnoDB) is done before the binary log mutex is taken, which means that if the prepare_commit_mutex is removed, it is possible for one thread to overtake another so that the prepare and the writing to the binary log are done in different orders.

To solve this, Mark suggests using a queue or a ticket system to ensure that transactions are committed in the same order, but we actually already have such a system that we can use to assign tickets: namely the binary log.

The idea is to allocate space in the binary log for the transaction to be written. This gives us a sequence number that we can use to order the transactions.

In the worklog on binary log group commits you will find the complete description as well as the status of the evolving work.

In this post, I will outline an approach that Harrison and I have discussed, which we think will solve the problems mentioned above. Here I will outline the procedure during normal operation; in the next post I will discuss recovery; and in the third post (but likely not the last on the subject) I will discuss some optimizations that can be done.

I want to emphasize that the fact that we have a worklog does not imply any guarantees or promises about what, when, or even if any patches will be pushed to any release of MySQL.

In Worklog #4007, an approach for writing the binary log is suggested where space is allocated for the transaction in the binary log before actually starting to write it. In addition to avoiding unnecessary locking of the binary log, this also allows us to use the binary log to order the transactions in place. We will use this idea of reserving space in the binary log to implement binary log group commit.

By re-structuring the procedure above slightly, we can ensure that the transactions are written in the same order in both the InnoDB transaction log and the binary log.

There are two ways to re-structure the code: one simple, and one more complicated that can potentially give better performance. To simplify the presentation, it is assumed that pre-allocation is handled elsewhere, for example using Worklog #4925. In a real implementation, pre-allocation can be handled either when a new binlog file is created, or when transaction data is being written to the binary log.

The sequential write approach

Figure 1. Sequential binary log group commit
In the sequential write approach, the transactions are still written to the binary log in order, and the code is just re-ordered to avoid holding mutexes while calling fsync(). To describe the algorithm, three shared variables are introduced to keep track of the status of replication:
Next_Available
This variable keeps track of where a new transaction can be written
Last_Committed
This variable keeps track of the last committed transaction, meaning that all transactions preceding this position are actually on disk. This variable is not necessary in the real implementation, but it is kept here to simplify the presentation of the algorithm.
Last_Complete
This variable keeps track of the last complete transaction. All transactions preceding this point are actually written to the binary log, but not necessarily flushed to disk yet.
You can see an illustration of how the variables are used with the binary log in Figure 1, where you can also see three threads, each waiting to write a transaction. All three variables are initially set to the beginning of the binary log, and it is always true that Last_Committed ≤ Last_Complete ≤ Next_Available. The procedure can be described in the following steps:
  1. Lock the binary log
  2. Save value of Next_Available in a variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Wait until Last_Complete = Trans_Pos. (This can be implemented using a condition variable and a mutex.)
    2. Write transaction data to binary log using pwrite. At this point, it is not really necessary to use pwrite since the transaction data is simply appended, but it will be used in the second algorithm, so we introduce it here.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log buffer
    2. Sync log buffer to disk, which currently can be group committed.
To implement group commit, it is sufficient to have a condition variable and wait on it for a specified interval. Once the interval has passed, the thread can call fsync(), after which it broadcasts the fact that data has been flushed to disk to the other waiting threads so that they can skip this step. Typically, the code looks something along these lines (we ignore checking error codes here to simplify the description):
pthread_mutex_lock(&binlog_lock);
while (Last_Committed < Last_Complete) {
  struct timeval now;
  struct timespec timeout;
  gettimeofday(&now, NULL);
  timeout.tv_sec = now.tv_sec;
  timeout.tv_nsec = (now.tv_usec + 1000) * 1000;    /* wait 1 msec */
  if (timeout.tv_nsec >= 1000000000) {              /* handle overflow */
    timeout.tv_sec += 1;
    timeout.tv_nsec -= 1000000000;
  }
  int error = pthread_cond_timedwait(&binlog_flush, &binlog_lock, &timeout);
  if (error == ETIMEDOUT) {
    fsync(binlog_fd);                   /* fsync takes a file descriptor */
    Last_Committed = Last_Complete;
    pthread_cond_broadcast(&binlog_flush);
  }
}
pthread_mutex_unlock(&binlog_lock);
There are a few observations regarding this approach:
  • Step 6a requires a condition variable and a mutex when waiting for Last_Complete to reach Trans_Pos. Since there is just a single condition variable, it is necessary to broadcast a wakeup to all waiting threads, each of which will evaluate the condition just to find the single thread that should continue, while the other threads go back to sleep.

    This means that the condition will be checked O(N²) times to commit N transactions. This is a waste of resources, especially if there are a lot of threads waiting, and if we can avoid this, we can gain performance.

  • Since the thread already has a position in the binary log where it can write, it could just as well start writing instead of waiting. It will not interfere with any other threads, regardless of whether locks are kept or not.
These observations lead us to the second approach, that of writing transaction data to the binary log in parallel.

A parallel write approach

Figure 2. Parallel binary log group commit
In this approach, each session is allowed to write to the binary log at the same time using pwrite since the space for the transaction data has already been allocated when preparing the engines. Figure 2 illustrates how the binary log is filled in (grey areas) by multiple threads at the same time. Similar to the sequential write approach, we still have the Last_Complete, Last_Committed, and Next_Available variables.

Each thread does not have to wait for other threads before writing, but it does have to wait for the preceding threads before it can commit. This is necessary since we require the order of commits in the InnoDB log and the binary log to be the same. In reality, this does not pose a problem, since the I/O is buffered and hence the writes are done to in-memory file buffers.

The algorithm looks quite similar to the sequential write approach, but notice that in step 6, the transaction data is simply written to the binary log using pwrite (a sketch of the reservation and write steps follows the list).

  1. Lock the binary log
  2. Save value of Next_Available in a local variable Trans_Pos and increase Next_Available with the size of the transaction.
  3. Prepare InnoDB:
    1. Write prepare record to log buffer (but do not fsync() buffer here)
    2. Release row locks
  4. Unlock binary log
  5. Post prepare InnoDB:
    1. fsync() log file to disk, which can now be done using group commit since no mutex is held.
  6. Log transaction to binary log:
    1. Write transaction data to binary log using pwrite. There is no need to keep a lock to protect the binary log here since all threads will write to different positions.
    2. Wait until Last_Complete = Trans_Pos.
    3. Update Last_Complete to Trans_Pos + transaction size.
    4. Broadcast the new position to all waiting threads to wake them up.
    5. Call fsync() to persist binary log on disk. This can now be group committed.
  7. Commit InnoDB:
    1. Write commit record to log
    2. Sync log file to disk
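As a sketch of the reservation and parallel write steps—using Python's threading and os.pwrite as stand-ins, with names and structure that are illustrative only:

import os
import threading

binlog_lock = threading.Lock()
Next_Available = 0

def reserve_slot(size):
    # Steps 1-2 and 4: reserve space for the transaction under the lock
    global Next_Available
    with binlog_lock:
        trans_pos = Next_Available
        Next_Available += size
    return trans_pos

def write_transaction(fd, trans_pos, data):
    # Step 6a: each thread writes to its own reserved region, so no lock
    # is needed to protect the binary log here
    os.pwrite(fd, data, trans_pos)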
This new algorithm has some advantages, but there are a few things to note:
  • When a transaction is committed, it is guaranteed that Trans_Pos ≤ Last_Committed for all threads (recall that Trans_Pos is a thread-local variable).
  • Writes are done in parallel, but waiting for the condition in step 6b still requires a broadcast that wakes up all waiting threads, while only one will be allowed to proceed. This means that we still have the O(N²) complexity of the sequential algorithm. However, for the parallel algorithm it is possible to improve the performance significantly, which we will demonstrate in the third part, where we will discuss optimizations to the algorithms.
  • Recovery in the sequential algorithm is comparably simple since there are no partially written transactions. If you consider that a crash can occur in the situation described in Figure 2, it is necessary to devise a method for recovering correctly. This we will discuss in the second part of these posts.

Tuesday, April 13, 2010

MySQL Conference Replication tutorial: Article and Demo Software

The MySQL Conference and Expo started with me and Lars Thalmann doing the replication tutorial. Unfortunately, we cannot at this time distribute the slides (please watch the replication tutorial page at the conference site), but there is a replication tutorial package for easy setup of servers to play around with—including some sample scripts—and a paper that both explains how the package can be used and gives some example setups.

Friday, March 05, 2010

Going to the O'Reilly MySQL Conference & Expo

As I've been doing the last couple of years, I will be going to the O'Reilly MySQL Conference & Expo. In addition to the tutorial and the replication sessions that I will be holding together with Lars, I will be holding a session about the binary log together with Chuck from the Backup team, which the Replication team normally works very closely with.

This year, O'Reilly also has a Friend of the Speaker discount of 25% that you can use when you register using the code mys10fsp.

The sessions that we are going to hold are listed below. Note that I am using Microformats, which will allow you to easily extract and add the events to your calendar using, for example, the Operator plugin for Firefox.

See you there!

Mysteries of the Binary Log
April 14th, 2010 10:50am - 11:50am Room: Ballroom F
New Replication Features
April 13th, 2010 2:00pm - 3:00pm Room: Ballroom A
Replication Tricks & Tips
April 14th, 2010 2:00pm - 3:00pm Room: Ballroom B
The Replication Tutorial
April 12th, 2010 8:30am - 12:00pm Room: Ballroom E

Wednesday, February 03, 2010

MySQL Replicant: Architecture

MySQL Replicant Library
Class Design
In the previous post I described the first steps of a Python library for controlling the replication of large installations. The intention of the library is to provide a uniform interface to such installations, allowing procedures for handling various situations to be written in a uniform language.

For the library to be useful, it is necessary to support installations that use different operating systems for the machines, as well as different versions of the servers. Specifically, it is necessary to allow some aspects of the system to vary.

  • Depending on the operating system, or even just how the server is installed on the machine, the procedures for bringing the server down and up will differ.

  • Configurations are managed in different ways depending on the deployment, and there are various other tools to manage configurations of large systems.

    As part of the management of the topology, it is necessary to change the configuration files, but this should play well with other tools.

    In either case, any specific method for configuration handling should neither be required nor enforced.

  • In the example in the previous article, the technique for cloning a server was demonstrated. In that case, the naive method of copying the database files was used. In the general case, however, which backup method to use depends on the requirements of the deployment. In other words, it is necessary to parameterize the backup method as well.
  • Each server in the system has a specific role to fulfill. Some servers are final slaves whose only purpose is to answer queries, at least one server is a master, and some servers are relay servers.

To allow the system to be parameterized on these aspects, a set of abstract classes is introduced. In the figure you can see a UML diagram describing the high-level architecture of the Replicant library.

In the figure, there are four abstract classes:
Machine
The responsibility of this class is to handle all issues that are specific to the remote operating system, for example, fetching files or issuing commands to start and stop the server.
Config
The responsibility of this class is to maintain the configuration of a server. To do this, it may need to parse configuration files to be able to extract the specific section containing the definition.
BackupMethod
The responsibility of this class is to provide the primitives to create a backup and to restore one, in both cases potentially with the backup image placed on a different machine.
Role
The responsibility of this class is to provide all the information necessary to configure a server in a role. Since a role does not only entail pure configuration information, but can also involve keeping certain tables and other database objects available, this is modeled as a separate class.
The central Server class relies on a Machine instance and a Config instance to implement the interface to the machine and to the configuration, respectively.
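To make the structure concrete, here is a sketch of what the abstract classes could look like; the method sets are illustrative only and not the actual definitions in the Replicant library.

class Machine(object):
    "Interface to the operating system the server runs on."
    def start_server(self): raise NotImplementedError
    def stop_server(self): raise NotImplementedError

class Config(object):
    "Interface to the configuration of a server."
    def get(self, option): raise NotImplementedError
    def set(self, option, value=None): raise NotImplementedError
    def remove(self, option): raise NotImplementedError

class BackupMethod(object):
    "Interface for taking and restoring backups."
    def backup_to(self, server, url): raise NotImplementedError
    def restore_from(self, server, url): raise NotImplementedError

class Role(object):
    "Interface for imbuing a server with a role."
    def imbue(self, server): raise NotImplementedError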

Configuration Management

The configuration of the server is made part of the Replicant library since manipulating the server configuration is usually necessary when changing roles of servers.

Depending on the deployment, configuration managers such as cfengine or puppet may be used to administer the configuration of all servers, while other deployments hand-edit the configuration files (which only really works for small deployments, since it would be a pain to administer larger ones this way).

Long-term, there should be support for some safety measures when working with server configurations, so implementing an interface for handling server configurations in a safe transaction-like manner—or maybe this should be called an RCU-style manner—seems like a good idea. To support that, the following methods to fetch and replace configurations are introduced:

Server.fetch_config()
Returns a Config instance of the configuration for the server.
Server.replace_config(config)
Replace the configuration of the server with the modified configuration instance config.

This will allow an implementation to keep version numbers around to avoid conflicts, but is not required by the interface.

Each Config instance can then be manipulated by using the following methods:

Config.get(option)
Get the value of option as a string.
Config.set(option[, value])
Set the value of option to value. If no value is supplied, None is used, which denotes that the option is set but not given a specific string value.
Config.remove(option)
Remove the option from the configuration instance entirely.
So, for example, the log-bin option can be set in the following manner:
config = server.fetch_config()
config.set('log-bin', 'master-bin')
server.replace_config(config)

Machines

A MySQL server can run on many different machines and in many setups. A server can run on Linux, Solaris, or Windows, and even in those cases, there can be multiple servers on a single machine.

For a Linux machine with a single server, one usually uses the script /etc/init.d/mysql to start and stop the server—at least on my Ubuntu—but if multiple servers are used on a single machine, then mysqld_multi should be used instead.

For Windows and Solaris, the procedures for starting and stopping servers are entirely different. Windows starts and stops the servers using net start MySQL and net stop MySQL, while Solaris uses the svcadm(1M) command.

To parameterize the system over the various ways it can be installed, the concept of a Machine is introduced (I actually had problems figuring out a name for this, but this name was suggested to me and seems good enough).

The responsibility of the Machine class is to provide an interface to access the installed server together with installation information such as the location of configuration files.

BackupMethod

One of the more important techniques when managing a set of servers is the ability to clone a slave or a master to create new slaves. Cloning involves taking a backup of a server and then restoring the backup image on the new slave. Since the techniques for taking backups vary a lot and different techniques will be used in different situations, parameterizing over the various backup methods is sensible.

BackupMethod.backup_to(server, url)
This method will take a backup of server and store it at the location indicated by url.
BackupMethod.restore_from(server, url)
This method will restore the backup image indicated by url into server.
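For example, cloning a master into a new slave could then be expressed as follows, where SomeBackupMethod and the URL are hypothetical placeholders for whatever backup technique the deployment uses:

method = SomeBackupMethod()
method.backup_to(master, 'file:///tmp/master-backup.img')
method.restore_from(new_slave, 'file:///tmp/master-backup.img')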

Role

In a deployment, each server is configured to play a specific role: it can act as a master, a slave, or even a relay. To represent a role, a separate Role class is introduced. Once a role is created, a server can be imbued with it.

  • Not every server has an assigned role.
  • Each server can have just a single role.
  • Each role can be assigned to multiple servers.

Since a role may encompass much more than just setting some configuration parameters, this more flexible approach was chosen. When imbuing a server with a role, a piece of Python code is executed to configure the server correctly.

The use of roles in this case is actually just one of many choices, and when using this approach, there are actually two different ways that roles can be used. I am slightly undecided on the two and would like to hear comments on which one to use.

  1. Roles are just applied to the initial deployment and do not play any part after the system has been deployed. Roles are imbued into a server initially, and then the configuration of the server can be changed by procedures that manipulate the deployment.
  2. Roles exist in the entire deployment, and when a server changes roles in the deployment, the Role instance will also change. Every server is assigned a role in the system, which is represented using a subclass of the Role class.

The first is by far the easiest to implement, which is why I chose it at this time. Since the roles are just containers for configuration options and other items that need to be added, they are easy to write. Since this is what is used in the library currently, it is also what you see in the class design above.

The second approach seems better, but it has a number of consequences:

  • Every server has to have a role class associated with it; even the "initial" role is required.
  • If the role changes, another role class will be associated with the server. This forces the role class not only to be able to imbue a server with a role, but also to unimbue the server from that role.
  • It must not be possible to change the configuration of a server directly; it has to be done by defining a role and then changing the server to that role. Unimbuing the server from a role becomes very hard if the configuration of the server is changed outside the control of the role.