Wednesday, April 26, 2006

Row-based replication for the future

I just read Eric Bergen's blog on row-based replication and application development from the presentation of Row-based replication at the User's Conference. Eric is giving all kinds of new ways to use the replication, for example that for a statement you can now configure the replication to only replicate changes to one of the tables in a multi-table statement. He is, however, missing the most important aspect: everything that we can do now and will be able to do in the future that we couldn't do with just statement-based replication. Here are some things that you can do with row-based replication that was not possible with statement-based replication.

Cluster Replication Cluster replication is already in 5.1, but it's worth to mention since it could not be handled with statement-based replication. Inside the cluster, everything is rows: rows are passed back and forth between the nodes and rows are collected to form result of queries. If you are inserting data into the tables using the MySQL NDB Cluster handler ha_ndbcluster, it will be replicated as usual and both statement-based and row-based replication will work. However, if you are using the NDB API to insert rows into the cluster, the rows are "lost" since the server never see those rows. To solve this problem, we invented an "injector" whose sole purpose is to inject rows into the binary log. The injector is created inside ha_ndbcluster and the rows that are inserted into the cluster are injected into the binary log.

Replication of individual partitions to different servers This is not in any version of the server and there are (not yet) any plans to add it, but it's something that is feasible when having row-based replication. Eric is mentioning ways to separate the rows going to different tables, and only replicate one of the tables. This is not limited to replicating different tables to different servers, you could even replicate different parts of the same table to different servers. For example, assume that you want to partition your data depending on how frequently it is accessed. Now, imagine that you could set up partitions for your table of blogs like this:

CREATE TABLE blogs (id INT ..., freq INT, ...)
  PARTITION BY RANGE (freq) (
   PARTITION blog0 VALUES LESS THAN (10),
   PARTITION blog1 VALUES LESS THAN (100),
      ...
      PARTITION blog5 VALUES LESS THAN (1000000)
 );
Further imagine that you could set up replication to replicate the different partitions to different servers. We could, for example assume that we wanted to place the blog5 partition on high-end servers dedicated for handling high loads, while the less frequently accessed blogs would be placed on low-end servers. Each access would then also update the statistics, causing the row to move to other servers as it becomes less frequently accessed. To prevent the row from moving back and forth between partitions when it is bordering one of the ranges, we could do the partitioning on the result of calling a UDF, which implements hysteresis by taking trends into account.

No comments: