MySQL Musings: 2013

Monday, October 21, 2013

MySQL Fabric: High Availability Groups

As you might have noticed, we have released a framework for managing farms (or grids, as Justin suggested) of MySQL servers called MySQL Fabric. MySQL Fabric is focused on being easy to use and extensible, and two extensions are currently part of the framework: one to manage high-availability and one to implement sharding.

High-Availability Group

High-Availability Groups

One of the central concepts used to construct a farm is the high-availability group (or just group when there is no risk of confusion) and is introduced by the high-availability extension. As mentioned in the previous post, the group concept does not really represent anything new but is rather a formalization of how we think and work with the structure of the farm. The key to supporting high-availability is to have redundancy in the system: if one component fail, another one should be ready to pick up the job of the failing component. Hardening the systems (by using hardware less prone to fail or hardware with built-in redundancy) can help reduce the chance of a component failing, but not completely eliminate it. Even a hardened system is susceptible to failure in a power outage or an earthquake. With this in mind, we introduced the group concept for managing pieces of data in our farm:

each group consists of several machines that are responsible for managing the same piece of data.The concept of a group is an abstraction to model the basic concept that we're after, but does not really say anything about how it is implemented. This is intentional: it should be concrete enough to support all the operations we need, but abstract enough to not restrict how it is implemented. This is important because connectors (or any other "outside" observer) that work with groups should not have to be updated whenever new implementations are added. For example, it should not make a difference to a connector if the group is implemented using a traditional Master-Slave setup, a MySQL Cluster, or using replicated storage such as DRBD.

Server properties in groups

There are a few key properties that we assume for groups:

A server belong to (at most) one group.
At any time, each server in the group have a designated a status.
At any time, each server has a mode indicating if it accepts reads, writes, both, or neither.
Each server also has a weight, which is the relative power of the server and is used to balance the load.

Note that these properties might change over time, depending on events that happen.For handling load-balancing and high-availability the properties Status, Mode, and Weight where introduced. The mode and weight properties are used by a connector when it comes to deciding where to send a transaction, while the status property is used by the Fabric to keep track of the status of the server. Let's take a closer look at the properties.

Figure 1. Server Status

Server Status (or Role). The status of the server provide information about what the server is currently doing in the group. The status of a server is Fabric's view of the status of the server and changes as time passes and the Fabric notice changes. A primary server accept both a write and a read load and sending high-priority read transactions here mean that they get current data. A secondary server can handle reads but, in the case of a master-slave configuration, it should not accept writes since that would lead to a split-brain situation. Secondary servers are servers waiting to pick up the job of the primary if it fails. Spare servers do not accept reads nor writes, but are ready and running and can therefore change status in the group to replace other servers in the event of failures. In addition, spare servers can be used to handle reads.

In Figure 1 you can see an example of how servers could change status, but note that at this time, we do not track all states. For example, we are considering how to handle the provisioning of new servers in flexible and extensible way, but more about that in a separate post.

Server Mode. The mode of the server gives information on whether it can be read or written and provide information for the connector on how it should send queries. For now, we only have three modes: Offline, Read-only, and Read-Write. Offline servers cannot be read from or written to, and usually does not accept connections. Read-only servers can only be read from and write transactions should not be sent to these. Read-Write servers are usually primaries of the group. They can accept writes and will propagate them correctly to other servers in the group.

Server Weight. The weight of a server is used to balance the load between servers. The weight represent the relative power of the server. When balancing the load between servers, the connector will figure out what servers are eligible for accepting a transaction and then pick one of the servers in such a way that the distribution over time will be proportional to the weight of the server.

What's up with transactions?

So what is it with transactions that makes it necessary to declare the properties of the transaction before it actually? The reason is quite simple: the transaction is fed to the server one line at a time. This means that it is not always possible to know if it is a read-only or read-write transaction until the entire transaction has been seen.

An example is this transaction (yeah, it's a little constructed, but it's just an example):

START TRANSACTION;
SELECT salary INTO @salary FROM salaries
 WHERE emp_no = 20101 AND to_date = DATE('9999-01-01');
UPDATE titles SET to_date = CURRENT_DATE()
 WHERE emp_no = 20101 and to_date = DATE('9999-01-01');
UPDATE salaries SET to_date = CURRENT_DATE()
 WHERE emp_no = %s and to_date = DATE('9999-01-01');
INSERT INTO titles VALUES
  (20101, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'));
INSERT INTO salaries VALUES
  (20101, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'));
COMMIT;

On line 1 or 2, it is not possible to know if this is a read-only or a read-write transaction, it cannot be known until line 3, where an UPDATE is executed.

Transaction properties

As mentioned before, one of the goals is to support sharding in the presence of transactions and to make that work correctly, it is necessary to declare up-front what the transaction will contain. Not everything, but the key elements of the transaction: what tables it will access, what sharding key is used, and if it is a read-only or read-write transaction. The first two properties are only necessary if you are working with a sharded system, so we skip those for now; the last one, however, is important for handling load-balancing in the connector.When executing transactions using a Fabric-aware connector, you provide the information about the transaction using transaction properties. There are several properties available, but we will focus on the ones related to group handling: group and type. The group property is used to provide the name of the group you want to connect to (you can have several), and the type property is used to tell if this is a read-only or read-write transaction. In the future, we might add more properties such as priority to indicate that this is an urgent transaction and a prompt reply is needed. For example, the following code is using a Fabric-aware connector to promote an employee.

from mysql.connector.fabric import (
   TYPE_READWRITE,
)

def promote_employee(conn, emp_no):
    stmts = [
        ("SELECT salary INTO @salary FROM salaries"
         " WHERE emp_no = %s AND to_date = DATE('9999-01-01')"),
        ("UPDATE titles SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s and to_date = DATE('9999-01-01')"),
        ("UPDATE salaries SET to_date = CURRENT_DATE()"
         " WHERE emp_no = %s and to_date = DATE('9999-01-01')"),
        ("INSERT INTO titles VALUES"
         " (%s, 'Master of the Universe', CURRENT_DATE(), DATE('9999-01-01'))"),
        ("INSERT INTO salaries VALUES"
         " (%s, 10 * @salary, CURRENT_DATE(), DATE('9999-01-01'))"),
        ]

    # Use the group for the ACME company
    conn.set_property('group', 'ACME')
    conn.set_property('type', TYPE_READWRITE)
    conn.start_transaction()
    cur = conn.cursor()
    for stmt in stmts:
        print "Executing:", stmt % (emp_no,)
        cur.execute(stmt, (emp_no,))
    conn.commit()

On line 20 and 21 you see how the properties of the transaction is set. In this case, we declare the group that we will access (for example, a fictional company "ACME") and also the type of the transaction. After that, a transaction is started as normal and executed. The Fabric-aware connector will pick the right server to send the transaction to and you will get the result back in the normal fashion.

Note that the property type is not yet implemented in Connector/Python, some work remains to make it support load-balancing fully.

Picking a server

But are these server and transaction properties sufficient for a connector to make a decision on what to do with a transaction? Let's take a look and see how the server can be selected.A server can be chosen by first selecting a set of candidates and then picking one of the candidates based on the weight of the server. Picking the candidates are done by matching the transaction properties and the server properties to find all server that are eligible for accepting the transaction. When a list of candidates are available you can, for example, pick one at random based on the weight of the servers. You can see an example Python code below that illustrates how this could be done. The first function find_candidates computes the set of candidates from the set of all servers SERVERS, while the second function pick_server pick one of the servers at random based on the weight of the server.

def find_candidates(props):
   candidates = []
   for srv in SERVERS:
      if props.group == srv.group and (props.mode & srv.mode):
         candidates.append(srv)
   return candidates

def pick_server(servers):
   random_weight = random() * sum(srv.weight for srv in servers)
   sum_weight = 0.0
   for idx, srv in enumerate(servers):
      sum_weight += srv.weight
      if sum_weight > random_weight:
         return servers[idx]
   return servers[-1]    # Last server in list

# Example code for picking a server based on transaction properties
pick_server(find_candidates(trans.props))

Implementation of groups

The reason to why we introduced the group concept in this manner is to be able to vary the implementation of a group, so the question is then, does it work? To see if it works, it is good to consider some sample implementations of high-availability groups and see if they can be described in this manner, so let's do that. Note that the only version that is currently implemented is the primary-secondary approach: the other ones are just food for thought (at this point).

The primary-secondary approach (also known as primary-backup or master-slave) is the traditional way to set up MySQL servers for high-availability. The idea is that there is a single primary managing the data and one or more secondaries that replicate the data from the primary and are ready to become primary in the event that the primary dies. In addition, there is a number of pure read slaves that are used to scale-out reads.

In this approach, the primary would be in read-write mode, and the secondaries could either be offline or in read mode. Secondaries cannot accept writes since that might cause a split-brain situation, but they can either be in read-only mode or offline. Not loading the servers with read-only transactions can make it easier for the secondaries to be up to date with the primary, but this depends on the general load on the system. Scale-out slaves added would then, of course, be pure read-only servers, and they cannot be promoted to be masters because they do not have the binary log enabled. However, if the primary master fails, they still need to fail-over to the new primary.

If (when?) the primary master fails, MySQL Fabric will detect the failure and start executing a procedure to promote one of the secondary master to be primary instead of the one that failed. MySQL Fabric have to do this because the servers do not know how to handle the fail-over themselves, and in addition it is necessary to inform the connectors about the new topology. In this procedure, the scale-out servers have to be moved over to the new primary as well.

Another popular solution for high-availability shared storage (for example, using shared network disks) or replicated storage (for example, using DRBD to replicate the block device). In this case, one of the server will be on-line, but the other will be on standby. For both DRBD and shared storage, it is necessary that the standby is completely offline and in the case of DRBD the server should not even be running on the standby machine. In addition to the primary and the secondary, you could have read slaves attached to the primary.

In this setup, the primary would then be a read-write server, while the standby server would be in offline mode. Scale-out servers would be in read-only mode, in this case attached to the primary.

Another approach is to use MySQL Cluster as a group. The cluster consists of several data nodes and employ a shared-nothing architecture to ensure high availability. In this case, all the servers will be both write and read servers, and all might be primaries. In the event that an NDB data node fails, the other nodes are always ready to pick up the job, so a MySQL Cluster group is self-managing. (There is an excellent overview of MySQL Cluster at http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-overview.html.)

The two solutions above employ different fail-over procedures that are executed by the Fabric node when it notices the failure. In contrast with the solutions above, MySQL Cluster is self-governing and does not require any fail-over handling implemented in the Fabric node.

Summary and considerations for the future

For the examples above, the properties we have outlined is definitely sufficient, but there might be other cases where more information is needed.One property that is missing in the current implementation is a way to select a server based on the proximity to the connector. For example, it could be possible to put the primaries and secondaries in a group in different data centers to ensure that it can handle a catastrophic failure. This, however, opens two issues:

There will be a set of read servers in each data center that should be connected to the primary or secondary in the same data center.
When the connector picks one of the candidates, it should prefer to use those in the same data center.

Both these issues mean that we need some measure of the distance between the servers and connectors so that when adding new scale-out servers it is not added in such a way that the same data is shipped several times between data centers, nor should a connector connect to a server in a different data center. Adding a complete matrix with distances between each and every server would not really work well, so it is likely that some other way to model the proximity is needed.Another case that might require some additional information is if the Fabric node fails or is unavailable temporarily (for instance, is restarting). Such an event should not block the entire system and since the connectors have information it would be possible to "run on the cache" for a brief period. The key issues here is that nothing can be updated, so each connector need to have a fallback plan in the event that a server fails. For example, if a master fails, the fail-over will not be executed, but it would still be possible to read information from the slaves.

Tuesday, October 08, 2013

MySQL Connect presentations on MySQL Fabric available on SlideShare

Going to MySQL Connect was truly a blast. We got a lot of good questions and feedback in the sessions and there were a lot of interest in both MySQL Fabric and the MySQL Applier for Hadoop.

A big thank you to all that attended the talks, I got a lot of good questions and comments that will help us build good solutions.

The talks are available on SlideShare:

Saturday, September 21, 2013

A Brief Introduction to MySQL Fabric

As you saw on the keynote, we are introducing an integrated framework for managing farms of MySQL servers with support for both high-availability and sharding. It should be noted that this is a very early alpha and that it at this point is not ready for production use.

MySQL Fabric is an integrated system for managing a collection of MySQL servers and is the framework on which high-availability and sharding is built. MySQL Fabric is open-source and is intended to be extensible, easy to use, and support procedure execution even in the presence of failure, an execution model we call resilient execution.

To ensure high-availability, it is necessary to have redundancy in the system. For database systems, the redundancy traditionally takes the form of having a primary server acting as a master and using replication to keep secondaries available to take over in case the primary fails. This means that the "server" that the application connects to is in reality a collection of servers, not a single server. In a similar manner, if the application is using a sharded database, it is in reality working with a collection of servers, not a single server. In this case, we refer to a collection of servers as a farm.

Now, just having a collection of servers does not really help us that much: it is necessary to have some structure imposed on the farm as well as an API to work with the farm, and this is where MySQL Fabric comes in.

Before going over the concepts, have a look at the farm below. In the figure, there is an application that want to connect to the farm and there are a set of database servers at the bottom organized into groups called high-availability groups. To manage the structure of the farm, there is a MySQL Fabric Node available that keeps, among other things, track of the meta-data as well as handles procedure execution. Both the application and an operator can connect to the Fabric node to get information about the farm being managed.

The MySQL Fabric nodes are responsible for keeping track of the structure of the farm as well as all information about their status and is also where any management procedures are executed. If a server fails, a the Fabric node will handle the procedure of promoting one of the slaves to be the new master, but it also contain the logic for handling shard moving and shard splitting.

High-Availability Group

High-availability Groups

The central concept to handling high-availability in Fabric are the high-availability groups. These are collections of servers that shall work together to deliver a database service to the application that connects. The groups are introduced to give structure to the farm and allow you to describe how the servers are organized to support the redundancy necessary to ensure high-availability.

Inside the group, all the servers manage the same data and hence have the same schema. In addition, each server have distinct roles and no server can belong to more than one group. The group concept does not really represent anything new: it is just a way to structure the farm in such a manner that managing it is easy to understand and the roles of servers are clear. To manage redundancy, a traditional Master-Slave Setup is used, a topology that we all are familiar with (it is often called the Primary-Secondary Approach, hence the names that follow). Each group have a primary that is is the master for all the data. Any queries that update data is sent here and that data is propagated to the other servers in the group. Redundancy is achieved by keeping one or more secondaries in the group that receive changes from the primary and are ready to take over the role as primary should the primary dissapear. To handle scale-out, the group also contain scale-out servers which are severs that receive changes from the primary but are not eligable for being promoted to primary. In the group, there are also spares, which are servers that are available for use but which are not assigned any active role yet.

Sharding

In addition to high-availability support, MySQL Fabric also offer support for sharding, which is a technique for handling very large databases and/or very high write loads. The database is split into a large number of shards, where each shard contain a fragment of the data in the database. Each shard is stored on a separate server (or a separate set of servers if you want to ensure high-availability) and the transactions are directed to each shard based on the shard key. Splitting the database in this way allow you both to manage a larger database by separating it onto more servers, but it also scale the write traffic because you can execute writes independently.

When using sharding, MySQL Fabric separate tables into sharded tables MySQL Fabric allow you to shard just some of the tables in the database and keep the other tables available on all shards, which we call global tablesand global tables. Since databases usually have multiple tables with foreign key relationships between, it is critical to be able to shard several tables the same way (but possibly on different columns), which is something that MySQL Fabric supports. Using this support, you can give all the tables that are sharded and what column should be used as the sharding key and MySQL Fabric will shard all tables and distribute the rows on the shards. Tables that are not sharded are the global tables and they will be available on all shards.

If you want to know more about how Fabric support sharding, or about sharding in general, you should come to the session MySQL Sharding: Tools and Best Practices for Horizontal Scaling, September 21, 4:00pm-5:00pm in Imperial Ballroom B.

Connecting to a MySQL Farm

To provide better control when working with a MySQL farm we have extended the connector API in such a manner that it hides the complexities of handling fail-over in the event of a server failure as well as dispatching transactions to shards correctly. There is currently support for Fabric-aware versions of Connector/J, Connector/PHP, Connector/Python as well as some rudimentary support for Hibernate and Doctrine. If you are interested in how the extensions to the interface look and how you can use them to scale your application, you should come to Scaling PHP Applications, September 22, 10:00am-11:00am in Union Square Room 3/4.

More information about MySQL Fabric

There are several blogs being published on high-availability and sharding from the developers working on the Fabric system or the connectors.

If you are interested in discussing and asking questions about MySQL Fabric, or sharding and high-availability in general, the forum on Fabric, Sharding, HA, Utilities is an excellent place for discussions. Also, if you are at MySQL Connect, going to MySQL Sharding, Replication, and HA (September 21, 5:30-6:30pm in Imperial Ballroom B) is an excellent opportunity to learn more about the project, meet us developers and team leads, and provide feedback to us. The BOF will cover several areas, some overlapping, but the discussion is likely to cover MySQL Fabric and MySQL replication.

Thursday, August 29, 2013

Going to MySQL Connect 2013

MySQL Connect 2013 is coming up with several interesting new sessions. Some sessions that I am participating in got accepted for the conference, so if you are going there, you might find the following sessions interesting. For your convenience, the sessions have hCalendar markup, so it should be easier to add them to your calendar.

MySQL Sharding, Replication, and HA (September 21, 5:30-6:30pm in Imperial Ballroom B)

This session is an opportunity for you to meet the MySQL engineering team and discuss the latest tools and best practices for sharding MySQL across distributed server farms while maintaining high availability.

Come here and meet Lars, Luis, Johannes, and me, and bring up any questions or comments you have regarding sharding, high-availability, and replication. We might have some informal presentations available to discuss sharding and high-availability issues.

MySQL Sharding: Tools and Best Practices for Horizontal Scaling (September 21, 4:00pm-5:00pm in Imperial Ballroom B)

In this session, Alfranio and I will discuss how to create and manage a sharded MySQL database system. The session covers tools, techniques, and best practices for handing various aspects of sharding such as:

Hybrid approaches mixing sharding and global tables.
Sharding in the precense of transactions and how that affect
applications
Turning an unsharded database into a sharded database
system.
Handling schema changes to a sharded database.

MySQL and Hadoop: Big Data Integration—Unlocking New Insights (September 22,
11:30am-12:30pm in Taylor)

Hadoop enables organizations to gain deeper insight into their customers, partners, and processes. As the world's most popular open source database, MySQL is a key component of many big data platforms. This session discusses technologies that enable integration between MySQL and Hadoop, exploring the lifecycle of big data, from acquisition via SQL and NoSQL APIs through to delivering operational insight and tools such as Sqoop and the Binlog API with Hadoop Applier enabling both batch and real-time integration.

MySQL High Availability: Managing Farms of Distributed Servers (September 22, 5:30pm-6:30pm in Imperial Ballroom B)

In this session, Alfranio and I will discuss tools, best practices, and frameworks for delivering high availability using MySQL. For example, handling such issues as ensuring server redundancy, handling failure detection and executing recovery, re-directing clients in the event of failure.

Scaling PHP Applications (September 22, 10:00am-11:00am in Union Square Room 3/4)

When building a PHP application, it is necessary to consider both scaling and high-availability issues. In this session, Johannes and I will discuss how to ensure that your PHP application scales well and can work in a high-availability environment.