Tuesday, August 14, 2007

Stopping the slave exactly at a specified binlog position

Catching up on some articles on the Planet MySQL feed, I just read the post by Dathan on how to promote a slave to master by using MASTER_POS_WAIT(). MASTER_POS_WAIT() is an excellent function that allows you to wait until the slave reaches a point at or after the given binlog position. Observe, however, that when the statement issuing MASTER_POS_WAIT() returns, the slave threads are still running, so even if a STOP SLAVE is issued immediately afterwards, the slave is bound to move a little further before actually stopping. For Dathan's situation this doesn't matter, but wouldn't it be great if you could stop a slave at exactly the position you want? Well, that is possible.

There is another command that does exactly what you want: START SLAVE UNTIL. The catch is that it can only be executed when the slave threads are stopped, but that is trivial to arrange by issuing a STOP SLAVE first. In other words, instead of doing:

SELECT MASTER_POS_WAIT('dbmaster1-bin.000002', 4);
SLAVE STOP;
like Dathan suggests, do:
STOP SLAVE;
START SLAVE UNTIL
    MASTER_LOG_FILE='dbmaster1-bin.000002', MASTER_LOG_POS=4;
SELECT MASTER_POS_WAIT('dbmaster1-bin.000002', 4);

Sunday, July 01, 2007

Musings on MySQL Proxy

When seeing that the MySQL Proxy was released, I decided to try to experiment with it since I see strong potential with this tool, both for replication and for other uses (recall that I'm a replication guy, so this is my primary focus). I'm actually on vacation, but this will of course not stop me from tinkering with things (I know, I'm just a hopeless case in this aspect ;) ).

After reporting a minor bug, I managed to build and run it with some sample scripts. I'm using Kubuntu Feisty, and had some initial problems, but it was actually pretty straightforward. I'll repeat the steps anyway, in case anybody else has problems.

  1. Get the source from the repository
    svn co http://svn.mysql.com/svnpublic/mysql-proxy/ mysql-proxy
  2. Make sure you have all packages necessary. Several of the packages below were not installed for me.
    apt-get install pkg-config liblua5.1-0 liblua5.1-dev libevent-dev libevent1 libglib2.0 libglib2.0-dev libmysqlclient-dev
  3. Switch to the directory where the source is.
    cd mysql-proxy/trunk
  4. Set up all the stuff necessary for the autotools (Autoconf, Automake, and Libtool)

    ./autogen.sh

  5. Run configure, but make sure to tell the configuration script that it should use Lua 5.1

    ./configure --with-lua=lua5.1

  6. Build the proxy

    make

Some applications

After having experimented a little, I see some of the potential applications of MySQL Proxy, but there are pieces missing to make these scenarios possible. I will first present the ideas, and then describe what I see as the missing pieces that would make them possible.

Vertical and horizontal partitioning

If it were possible to parse the query and decompose it into its fragments, the query could be separated into two queries that are sent off to different servers. The result sets could then be composed into a new result set that is delivered to the final client. This can, of course, be accomplished through other means, but if you take a look at the next item, you have a variation that is not that simple to handle.
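The decompose-and-compose step can be sketched roughly as follows. This is a hypothetical Python model, not proxy code: the two backends, the column split, and the join key are all made up for illustration.

```python
# Vertical partitioning sketch: the column list of a query is split between
# two backends, and the partial result sets are joined on the primary key
# to form the final result set delivered to the client.

# Partial result sets from two backends, keyed by primary key.
server_a = {1: {"name": "alice"}, 2: {"name": "bob"}}            # holds `name`
server_b = {1: {"email": "a@x.com"}, 2: {"email": "b@x.com"}}    # holds `email`

def compose(keys, *backends):
    """Merge per-key column fragments from several backends into full rows."""
    rows = []
    for key in keys:
        row = {"id": key}
        for backend in backends:
            row.update(backend[key])
        rows.append(row)
    return rows

result = compose([1, 2], server_a, server_b)
```

The interesting part in a real proxy would of course be parsing the query to decide the split, which is exactly the missing piece discussed below.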

It could be interesting to handle horizontal partitioning in the case where data for, e.g., a user is stored on different machines depending on geographic location. This is something that is interesting for companies like Flickr, Google, and YouTube, since contacting a server that is geographically near you significantly improves response times.

BLOB Streaming

In order to handle BLOB streaming, as I outlined in a previous post, it could be left to the proxy to build a final result set where the BLOB Locators (URIs in my example) are replaced with the actual BLOBs by contacting a dedicated BLOB server that holds the BLOB and building the final result set inside the proxy.
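The locator-rewriting step could look roughly like this. This is a Python sketch with the blob server stubbed out as a dictionary; in reality the proxy would issue an HTTP GET per locator, and all names here are illustrative.

```python
# Sketch: the proxy scans a result set for BLOB locators (URIs) and replaces
# each with the blob fetched from a dedicated blob server, building the final
# result set before it is sent to the client.

blob_server = {"http://blobs.example.com/42": b"...actual blob bytes..."}

def fetch_blob(uri):
    return blob_server[uri]  # stand-in for an HTTP GET to the blob server

def resolve_locators(rows, blob_columns):
    """Return rows with each locator column replaced by the blob it names."""
    resolved = []
    for row in rows:
        row = dict(row)
        for col in blob_columns:
            row[col] = fetch_blob(row[col])
        resolved.append(row)
    return resolved

rows = [{"id": 1, "photo": "http://blobs.example.com/42"}]
final = resolve_locators(rows, ["photo"])
```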

Pre-heating slave threads

By placing the proxy in between a master and a slave, it could be possible to pre-heat the slave by issuing a SELECT query through a client connection to the slave. This is the oracle algorithm presented by Paul Tuckfield from YouTube, but the solution in this case is simpler since it is not necessary to read the events from the relay log: they can instead be caught "in the air" and acted upon.

This actually requires some parsing of the replication stream, so it might be better to handle this by embedding Lua or Perl into the slave threads. (Personally, I would prefer Perl. Not because it is a better language, or easier to embed, but because I've been using Perl on an almost daily basis since 1988. OTOH, Lua is designed to be efficient and map internal structures to the language and seems very easy to work with, so we'll see what happens.)
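The pre-heating idea can be modeled very roughly like this. This is a Python toy where the slave's storage is a dict behind a cache; none of the names are MySQL APIs, it only illustrates why reading the row ahead of the SQL thread helps.

```python
# Sketch: every update seen "in the air" between master and slave triggers a
# read of the same row on the slave, so the page is already warm when the
# slave's SQL thread applies the update.

disk = {("t", 1): "old"}    # slow storage
cache = {}                  # warm pages

def preheat(table, key):
    # The "oracle": read the row on the slave ahead of the SQL thread.
    cache[(table, key)] = disk[(table, key)]

def apply_update(table, key, value):
    # The SQL thread finds the page already cached.
    hit = (table, key) in cache
    disk[(table, key)] = value
    cache[(table, key)] = value
    return hit

preheat("t", 1)
was_cached = apply_update("t", 1, "new")
```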

The missing pieces

Others have focused on what can be done with the MySQL Proxy, but I see some omissions that, if implemented, would turn the MySQL Proxy into an incredibly flexible tool.
  • It should be possible to parse and rewrite a query before sending it to the server. This is already possible, but a library to parse and build SQL queries would help a lot here.
  • It should be possible to rewrite the result set in a convenient manner. Jan just added the ability to rewrite the packet that is sent back, so the proxy is heading in that direction, but a more convenient interface would be an advantage here as well.
  • It should be possible to keep connections to several servers active, and decide what server to send the query to on a per-query basis (even sending queries to all servers, or different queries to different servers). AIUI, there is some rudimentary support for it right now, but not to the extent that I describe here.
  • It should be possible to send several query-result sequences to servers for each query-result sequence sent to the proxy. This will make it possible to act on the result of a response to one server, and dynamically decide, e.g., what server to contact next in order to get the data that forms the final result set.
All in all, MySQL Proxy is showing incredible potential, and I, for one, will see what I can contribute to make it even better.

Monday, June 18, 2007

BLOB locators + BLOB streaming + Replication = Yeah!

At the MySQL Conference & Expo 2007, I had the chance to meet up with Paul (the author of PBXT) and Mikael. We briefly touched on the topic of the BLOB Streaming Protocol that Paul is working on, which I find really neat. On the way back home, I traveled with Anders Karlsson (one of MySQL's Sales Engineers), who is responsible for the BLOB Locator worklog, and he described the concepts from his viewpoint.

Since I work with replication, these things got me thinking about their impact on replication and how they affect usability, efficiency, and scale-out. Being a RESTful guy, I started thinking about URIs both when Paul described the BLOB Streaming Protocol and when Anders started describing the BLOB Locators. Apparently, I wasn't the only one.

Combining BLOB Locators with the BLOB Streaming Protocol has a significant impact on the scalability and performance of replication, and I'm going to show how by giving a typical use of replication: scaling out reads from an installation by replicating from a single master to several slaves.

Now, when a client connects to get a blob from the database, the server delivers a result set containing one or more blob locators. Since we are using URIs and the HTTP protocol, the blobs can be served by a normal web server, and the client can fetch the blob data using HTTP and build the real result set. The existence of blob locators is completely transparent to the client, who sees no difference from the previous implementation.

Now, what does this give us that make this setup so scalable?

  • Instead of storing the actual blob data, we store a reference to the data (in the form of a URI). When working with the blob and copying it to another table, we actually just copy the reference, which is a very quick operation compared to copying most blobs. The use of the blob locator is entirely transparent to any operations on the blob: reading is not affected, and changing the blob can be accomplished using copy-on-write semantics (which of course makes that operation slower).

  • Since we have a unique reference to a blob, it is possible to implement caching mechanisms to cache the results of, e.g., fulltext searches in the blobs.

  • The use of a URI makes the blob locator server-agnostic, which means that we can reliably replicate the URI instead of the blob and still expect any client that connects to the slave server to be able to fetch the blob using HTTP. No translation is necessary when replicating, and the URI can be treated as just a string. This means that a scale-out strategy is trivial to implement. This is just a generalization of the recommended practice of storing the blobs as files on a server and saving the file names in the tables instead: we just make it transparent to the user and simplify the deployment.

  • By using a URI as the reference, we can put the blob data on a separate server, which can be dedicated to delivering blob data to requesters. Since everything goes via this server, it is very likely that "hot" data is available immediately, and since we are using a URI, delivery over the Internet can rely on Web caches to avoid re-sending data that is already cached somewhere.

    We do not lose the ability to count the number of deliveries of the data, since we can always count the number of blob locators that have been delivered instead of the number of blobs that have been delivered.

  • The HTTP protocol supports both GET and PUT, for reading and writing data on the server.

  • We offload a significant amount of "dumb" work from the server (assembling result sets consisting of blob data and other data) and therefore allow the server to spend more time on the "intelligent" work of doing database searches.

  • The design is incredibly flexible since it is possible, for example, to place the blob server(s) anywhere, even in different towns, while still keeping the main operating site in one location.
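The cheap-copy and copy-on-write behavior from the first point above might be sketched like this. This is a hypothetical Python model: the blob store and the URI scheme are made up, and a real implementation lives inside the server and the blob streaming engine.

```python
# Copying a blob column copies only the locator (a short string), while a
# write through the locator allocates a new blob under a new URI, leaving
# other holders of the old locator unaffected.

import itertools

blob_store = {"http://blobs.example.com/1": b"original"}
_next_id = itertools.count(2)

def copy_blob(locator):
    # Copying is O(1): only the reference is duplicated.
    return locator

def write_blob(locator, data):
    # Copy-on-write: a modification creates a fresh blob under a new URI.
    new_uri = "http://blobs.example.com/%d" % next(_next_id)
    blob_store[new_uri] = data
    return new_uri

a = "http://blobs.example.com/1"
b = copy_blob(a)               # cheap: just the string
b = write_blob(b, b"changed")  # expensive: new blob, new URI
```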

Post on replication poll was lost

My last post on the replication poll was apparently lost from Planet MySQL. If you're interested, I commented on the replication poll, our future plans, and how they were affected by the poll.

Sunday, June 03, 2007

The replication poll and our plans for the future

We've been running the replication poll and we've got some answers, so I thought I would comment a little on the results of the poll and on what our future plans with respect to replication are as a result of the feedback. As I commented in the previous post, there are some items that require a significant development effort, but the feedback we got helps us prioritize.

The top five items from the poll stand out, so I thought I would comment on each of them in turn. The results of the poll were (when this post was written):

Online check that Master and Slave tables are consistent 45.4%
Multi-source replication: replicating from several masters to one slave 36.3%
Multi-threaded application of data on slave to improve performance 29.2%
Conflict resolution: earlier replicated rows are not applied 21.0%
Semi-synchronous replication: transaction copied to slave before commit 20.3%

Online check that Master and Slave tables are consistent

The most natural way to check that tables are consistent is to compute a hash of the contents of the table and compare it with a hash of the same table on the slave. Some storage engines have support for incrementally computing a hash, and for the other cases the table checksum released by Baron "Xaprb" can be used. The problem is doing the comparison while replication is running, since any change to the table between computing the hash on the master and computing it on the slave will indicate that the tables are different when in reality they are not. To solve this, we are planning to introduce support for transferring the table hash and performing the check while replication is running. By adding the hash to the binary log, we have a computed hash for a table at a certain point in time, and the slave can then compare the contents of the table as it sees the event, being sure that this is the same (relative) point in time as it was on the master. We will probably add a default hash function for those engines that do not have one, and allow storage engines to return a hash of the data in the table (probably computed incrementally for efficiency).
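The scheme could be sketched like this. This is a Python model where tables are lists of tuples; a real implementation would hash incrementally inside the storage engine and carry the hash as a binary log event.

```python
# The master computes a hash over the table contents at a known point in the
# replication stream; the slave recomputes the hash over its own copy when it
# sees the event, so both hashes describe the same (relative) point in time.

import hashlib

def table_hash(rows):
    h = hashlib.sha1()
    for row in sorted(rows):  # canonical order, independent of storage order
        h.update(repr(row).encode())
    return h.hexdigest()

master = [(1, "alice"), (2, "bob")]
slave_ok = [(2, "bob"), (1, "alice")]    # same data, different order
slave_bad = [(1, "alice"), (2, "eve")]   # diverged row

consistent = table_hash(master) == table_hash(slave_ok)
diverged = table_hash(master) != table_hash(slave_bad)
```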

Multi-source replication: replicating from several masters to one slave

This is something that was actually started a while ago, but for several reasons it is not finished yet. A large portion of the work is done, but since the code is a tad old (enough to be non-trivial to incorporate into the current clone), some work remains to actually close this one. Since there seems to be considerable interest in this, both in the poll and at the MySQL Conference, we are considering finishing off this feature sometime in the aftermath of 5.1 GA. No promises here, though. There are a lot of things that we need to consider to build a high-quality replication solution, and we're a tad strained when it comes to manpower in the team.

Multi-threaded application of data on slave to improve performance

This is something that we really want to do, but currently do not have the manpower for. It is a significant amount of work; it would be a huge improvement to replication, but it would leave us unable to do anything else for a significant period. Sorry folks: however much I would like to see this happen, it would be irresponsible to promise that we will implement it in the near future. That said, there are some changes going on internally with the threading model, so it might become easier to implement.

Conflict resolution: earlier replicated rows are not applied

When multi-source comes into the picture, it is inevitable that some form of conflict resolution will be needed. We are currently working on providing a simple version of timestamp-based conflict resolution in the form of "latest change wins". This is ongoing work, so you will see it in a post-5.1 release in the near future.

Semi-synchronous replication: transaction copied to slave before commit

There is already a MySQL 4 patch for this written by folks at Google under the Mysql4Patches work. The idea is to not commit the ongoing transaction until the entire transaction has been successfully transferred to at least one slave. The reason is that it should be possible to switch to a slave in the event of a failure of the master, so it has to be certain that the transaction exists somewhere else (at least on disk). We consider this very important for our ongoing work of being the best on-line database server for modern applications, so you will probably see it pretty soon. Compared to the patch above, we would like to generalize it slightly, making it configurable how many slaves must have received the transaction before it is committed. This will of course reduce the performance of the master, but it provides better redundancy in the case of a serious failure, and it is a minor addition to the work anyway.
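The commit rule can be sketched as follows. This is a toy Python model, not the actual patch's protocol: slaves are stand-in callables that return True when the transfer succeeds, and the acknowledgement count is the configurable parameter mentioned above.

```python
# Semi-synchronous commit sketch: the master refuses to commit until at
# least `n` slaves have acknowledged receipt of the transaction.

def semi_sync_commit(transaction, slaves, n=1):
    acks = sum(1 for send in slaves if send(transaction))
    if acks < n:
        raise RuntimeError("not enough slaves acknowledged; cannot commit")
    return "committed"

slaves = [lambda tx: True, lambda tx: False]   # one healthy, one failing
status = semi_sync_commit("tx1", slaves, n=1)  # one ack is enough
```

Raising `n` buys more redundancy at the cost of commit latency, which is exactly the trade-off discussed above.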

Binary log event checksum

In addition to the things mentioned above, this is very important for both finding and repairing problems with replication. The relay log is a potential source of problems, as is the code that writes the events to the relay log, so it is prudent to add a simple CRC checksum to each event to check its integrity. Sadly enough, this does not exist currently, so we're trying to get it into the code base as soon as possible, maybe even for 5.1 (keep your fingers crossed). This is not a promise: we're doing what we can, but there are no guarantees.
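Such a per-event checksum might look like this. This is a Python sketch with the event framing reduced to raw bytes; the real binary log format is richer, but the principle is the same.

```python
# A CRC-32 is appended to each event when it is written, and verified when
# the event is read back from the relay log.

import struct
import zlib

def write_event(payload: bytes) -> bytes:
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("<I", crc)

def read_event(data: bytes) -> bytes:
    payload, (crc,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("relay log event corrupted")
    return payload

event = write_event(b"INSERT INTO t VALUES (1)")
assert read_event(event) == b"INSERT INTO t VALUES (1)"
```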

Friday, May 11, 2007

The coolest future replication features...

...are something that you can influence.

The problem with replication is that we have so many things we want to do, but not that many people. So we do what everybody does when the to-do list is too long: prioritize. Since the replication features are developed for you (yes, you), we have added a quickpoll on http://dev.mysql.com/ where you can pick the three most important replication features that you would like to see us focus on next (after the 5.1 GA).

Do you think that on-line checks for table consistency are for weenies who cannot write a simple little script to do that themselves? Please tell us that.

Do you prefer to live on the edge and think that semi-synchronous replication is for safety junkies? Well, we'll be glad to hear your opinion.

Do you think that the YouTube oracle algorithm hack is the coolest thing on earth and that we should make sure to have it in a release soon? In this case you should especially tell me, because I think it is a pretty cool idea as well.

We cannot promise that they will all be done, and some features on the list require a substantial amount of work, so we might decide to deliver many small features rather than one big one... but we need your input, so please go and take the quickpoll on http://dev.mysql.com/, because you can be part of making MySQL the best on-line database for modern applications in the world.

Wednesday, April 18, 2007

Heading off to the MySQL Conference

MySQL Conference & Expo
After some long months of intensive bug fixing, it's time to pack up the stuff and head off to the MySQL Conference & Expo. Since this is actually my first MySQL (User's) Conference, it's bound to be interesting.

We've got a full schedule here, so it will definitely not be boring.

Monday, April 23, 8:30am - 5:00pm, Ballroom D (with lunch 12:00pm to 1:30pm). Lars, JDD, and I are going to hold a full-day replication tutorial where we'll go over some basics on how to set up replication to work properly, but also delve into some advanced stuff like row-based replication and cluster replication to see what they can do for you. This is a must for anybody who is seriously going to work with replication.

Tuesday, April 24, 10:45am - 11:45am, Ballroom F. Lars and I will hold a "Tips and Tricks" session on replication and bring a bunch of useful tricks for getting replication to work the way you want.

Tuesday, April 24, 7:30pm - 9:00pm, Bayshore. Lars, Jeremy Cole, and I will have a BoF session with an open discussion about the future of replication. We will also bring some food for thought and a digression into upcoming features like conflict handling.

Tuesday, April 24, 5:30pm - 6:15pm, Ballroom E. There will be a session on the replication roadmap where you will find Lars and me presenting some of the ideas that MySQL envisions for the future. If you want to take part in making MySQL the best on-line database in the world, make sure to be there!

You will also find Chuck and me in the Guru bar from 2pm to 4:30pm the same day, so bring your problems and we'll give them a good beating!

On Friday, you will also find me at the Storage Engine Summit hosted by Brian.