Wednesday, September 22, 2010

Have you seen my replication files?

I recently started looking over how to get information about relay log file and binary log file using an SQL interface. Being able to do that can be quite handy when one is going to work with replication in various ways. In my particular case, I wanted to get the path to the relay log index file and binary log index file to be able to read the binary log files as well as the relay log files directly. You are probably familiar with the --relay-log-index and --relay-log options that can be set to specify where the index file. These options can either be used to set an absolute path or a relative path to use for the files. If the option starts with a /, it is considered an absolute path (drive letters are allowed on Windows though), otherwise the path is relative to the data directory (which is specified through the --datadir option). The values supplied to these options are provided from SQL as the system variables relay-log-index and relay-log respectively. The recommendation is to always set the --relay-log and --relay-log-index since the default value for these options contain the hostname. The problem with this is that if the database files is moved to a new machine with a different hostname, the server will not be able to pick up the files correctly and will assume that they do not exist. The logic for finding the location of the relay log files can be quite daunting; to find the location of the relay log index file:
  1. If relay_log_index is set, this is the location of the relay log index file.
  2. If relay_log_index is not set, then the value supplied to the relay_log option is used to figure out the name of the relay log index file.
  3. If neither relay_log_index nor relay_log is set, then the name of the relay log index file is taken by stripping the directory and extension from the pid_file variable (set using the --pid-file option), if supplied, and adding -relay-bin.index to the end of the string.
    • The pid_file variable has a default value which consists of datadir/hostname.pid, which would give the relay log index file a name of datadir/hostname-relay-bin.index.
  4. If the path is a relative path—that is, the path does not start with a directory separator—then the value of datadir is prepended to the relay log index file name.
Keeping track of all these details is not something I want to spend my time on, so I wrote a stored function for computing the name of the relay log index file which I simply called relay_log_index_file:
CREATE FUNCTION relay_log_index_file () RETURNS VARCHAR(256)
  DETERMINISTIC
  READS SQL DATA
BEGIN
  DECLARE rli_name VARCHAR(256);
  IF @@relay_log_index IS NOT NULL THEN
    SET rli_name = @@relay_log_index;
  ELSEIF @@relay_log IS NOT NULL THEN
    SET rli_name = @@relay_log;
  ELSE
    BEGIN
      DECLARE l_pid_file VARCHAR(256);
      DECLARE l_pid_base VARCHAR(256);
      SET l_pid_file = SUBSTRING_INDEX(@@pid_file, '/', -1);
      SET l_pid_base = SUBSTRING_INDEX(l_pid_file, '.', 1);
      SET rli_name = CONCAT(l_pid_base, '-relay-bin.index');
    END;
  END IF;

  IF rli_name NOT LIKE '/%' THEN
    RETURN CONCAT(@@datadir, rli_name);
  END IF;

  RETURN rli_name;
END
This is a quite complicated way of figuring out the location of the relay log files and hardly something that I consider very useful. It would be much better if the relay_log_index variable gave the complete path to the file, regardless of what was given to the --relay-log-index option (or even if the option was given at all).

Being able to fetch the relay log index file is quite convenient, but being able to fetch the binary log index file would be even more convenient. Unfortunately, there is no such variable. The --log-bin option can be used to supply a base name to use for the binary log, but the log_bin variable can only be ON or OFF, which in my book is not very smart. To fix this, I created WL#5465, which introduces three new variables—log_bin_basename, relay_log_basename, and log_bin_index—and changes behaviour of relay_log_index.

log_bin_basename
This is a global read-only variable that contain the base file name used for the binary log files, the path to the files but omitting the extension.
  • If a full path was given to --log-bin-index, this will be stored in log_bin_index.
  • If a relative path was given to --log-bin-index, the contents of datadir will be used as directory and prepended to the value of --log-bin-index
  • Otherwise, the value of datadir will be used as the directory of the file and the base name is created by taking the basename of pid_file (name without extension) and adding '-bin'.
log_bin_index
This is a global read-only variable containing the full name to the binary log index file. If no value is given, the value of log_bin_basename is used and the extension '.index' is added.
relay_log_basename
This is a global read-only variable containing the base file name used for the relay log file, that is, the full path to the relay logs but not including the extension. The value of this variable is created in the same way as for log_bin_basename with the only difference that the '-relay-bin' suffix is used instead of '-bin'.
relay_log_index
This is a global read-only variable containing the full name of the relay log index file. If no value is given, the value of relay_log_basename is used and the extension '.index' is added.
With these new variables, fetching the full path of the binary log index file is as easy as doing a 'SELECT @@log_bin_index'.
If you're interested in if the patch for this worklog will be in any particular server version or f it is pushed at all, you have to check the status of the worklog. Even if I have described the architecture and implemented a patch, there is no way to know where it ends up or even if it is pushed at all.

An alternative: let the application do the job

Creating a stored function for computing the relay log index file name might be overkill in many situation. If the value is needed from serveral different connections it makes sense to create it as a stored function to allow it to be used by different applications. It can, however, just as well be placed in the application code which would then compute the location of the relay log index file using a single query to the server.

The information you need is the data directory from datadir, the pid file name from pid_file (in the event that the relay log or the relay log index option does not have a value), and the relay_log and relay_log_index values.

For example, the following Python code could be used to compute the data directory, the base use for creating relay log files, and the name of the index file using a single query to the database server:

import os.path

def get_relay_log_info(connection):
    cursor = connection.cursor()
    cursor.execute("SELECT @@datadir, @@pid_file, @@relay_log, @@relay_log_index")
    datadir, pid_file, relay_log, relay_log_index = cursor.fetchone()
               
    def _add_datadir(filename):
        if os.path.isabs(filename):
            return filename
        else:
            return os.path.join(datadir, filename)

    pidfile_base = os.path.basename(os.path.splitext(pid_file)[0])
    base_name = _add_datadir(relay_log or pidfile_base + '-relay-bin')
    index_file = _add_datadir(relay_log_index or base_name + '.index')

    return { 'datadir': datadir, 'base': base_name, 'index': index_file }