Galera clusters help with performance and scalability, but it is important to understand that data replication on the secondary databases works differently than on the primary. The primary database can perform bulk operations in a single commit, while the secondaries apply the replicated row changes one at a time. For example, if a single delete removes 67,000,000 rows, each secondary applies those 67,000,000 row changes individually. Reviewing the MySQL slow query logs shows that such an operation can take a very long time. Because the secondaries apply changes one at a time, they can fall out of sync with the primary, and the performance impact becomes noticeable.
As a solution, batch your large operations where possible so the secondary databases can keep in sync with the primary. Working in small batches lets each action be applied in a timely manner and keeps the performance impact to a minimum.
Today we are talking about Galera replication, and specifically about flow control in Galera replication. First, let's define Galera replication: it is the replication scheme used to move data from the writer to the readers, that is, from the primary node to the secondary nodes. In this diagram, we have a simplified representation of the data flow in Galera. Let's say you have your app, your web server, sending requests to the primary. That data gets replicated: it is encapsulated into a write set, and the write set is broadcast to the secondaries. How that broadcast works is not today's topic; just keep in mind that, for any given transaction, the time it takes on the primary is much shorter than the time it takes to execute the same transaction on the secondary or secondaries. That is due to Galera's internal design, which we might dig into in a future presentation. The pain point is flow control. The definition is simple: flow control means pause. The application gets paused or stopped. It's as simple as that. Here's the use case: you run SHOW PROCESSLIST and you see slow queries stuck in the commit state.
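The pause behavior described above can be sketched with a toy simulation. This is a minimal sketch, not Galera's actual algorithm: the queue limit, produce rate, and apply rate below are made-up numbers chosen only to show when a flow-control pause would kick in (in real Galera the threshold is governed by settings such as `gcs.fc_limit`).

```python
# Toy simulation of Galera-style flow control (illustrative sketch only).
# The primary produces write-sets faster than a secondary can apply them;
# once the secondary's receive queue exceeds a limit, the cluster pauses
# writes until the queue drains below the limit again.

FC_LIMIT = 16          # hypothetical queue limit (cf. gcs.fc_limit)
PRODUCE_PER_TICK = 5   # write-sets the primary commits per tick (made up)
APPLY_PER_TICK = 2     # write-sets a secondary applies per tick (made up)

queue = 0
paused_ticks = 0
for tick in range(100):
    if queue <= FC_LIMIT:
        queue += PRODUCE_PER_TICK            # primary keeps committing
    else:
        paused_ticks += 1                    # flow control: writes paused
    queue = max(0, queue - APPLY_PER_TICK)   # secondary applies its queue

print(f"ticks paused by flow control: {paused_ticks}")
```

Because the apply rate never catches up with the produce rate, the queue quickly hits the limit and the simulation spends a large share of its ticks paused, which is exactly the stall the application experiences.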
If you look at the top, you'll see this one: it's in the commit state and it's deleting data, and here's the thread number. It's true that there are other threads at the bottom, but they are in sleep mode, and the remaining ones are system threads; we are not interested in those internal threads. We are talking about client threads, threads created to execute queries for the Magento application. Here's thread number 0084. This query has been running for 740 seconds. It's true that subsequent queries take even longer, but there's a reason for that, and I'm going to explain why. Let's copy that thread number and go to the slow query log to look for it. You see the first query, and its timestamp coincides with the time of the SHOW PROCESSLIST capture. The delete was running, and it was deleting more than 67 million records at once. The shortcoming of Galera is this: on the primary, the statement is executed at once, in bulk. The server performs one commit and applies the data to the primary. But once the transaction's write set is broadcast to the secondaries, they do not apply those 67 million row changes at once; they apply them one by one. That is one of the biggest shortcomings of Galera. That's why it's much faster on the primary, and that's why the secondary nodes cannot keep up. If you keep looking, you see another delete; this one is nearly 20,000 records, not as many as 67 million. But then the third one was deleting 82 million records at once. That means 82 million individual row operations on the secondary nodes. That's by design; again, this is a shortcoming as far as Galera is concerned.
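The scale of the problem is easy to estimate with back-of-the-envelope arithmetic. The apply rate below is a made-up number for illustration only; real row-apply throughput on a secondary depends on hardware, schema, and indexes.

```python
# Back-of-the-envelope: how long 67 million row-by-row applies could take
# on a secondary. The 10,000 rows/sec apply rate is a hypothetical
# illustrative figure, not a measured Galera number.
rows = 67_000_000
apply_rate = 10_000                 # assumed rows applied per second
seconds = rows / apply_rate
print(f"{seconds:.0f} s ≈ {seconds / 3600:.1f} hours")  # 6700 s ≈ 1.9 hours
```

Even at that optimistic rate, a delete the primary commits in one shot keeps the secondaries busy for the better part of two hours, during which they fall further and further behind.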
Remember we asked why subsequent queries were taking longer? This is the explanation: the delete was being executed in a loop, so there were multiple queries, but all on the same thread. The remedy is to have good practices in place. As far as I know there is only one, because this is a shortcoming by design; there is no way to fully solve it, only to alleviate it. The one practice I know from experience is this: instead of trying to delete 82 million records at once on the primary, delete in small batches of between 2,000 and 5,000 records. You delete 2,000, then you take a break: in your code, call sleep or whatever command you have at your disposal, and wait for, say, five seconds so the secondaries can apply those 2,000 row changes before you send the next batch. In a nutshell, that's how we should approach this issue and how we can at least alleviate it. That's the recommendation. Thank you for listening.
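The batch-and-pause practice can be sketched as a loop. This is a minimal sketch under assumptions: the row count, the `execute_delete` helper, and the table/condition mentioned in its comment are all hypothetical stand-ins. In a real script the helper would run a `DELETE ... LIMIT` statement through your MySQL driver; here it "deletes" from an in-memory counter so the shape of the loop is clear and self-contained.

```python
import time

# Sketch of batched deletes with a pause between batches, per the talk's
# recommendation (2,000-5,000 rows per batch, sleep a few seconds).
BATCH_SIZE = 2_000
PAUSE_SECONDS = 5       # give the secondaries time to apply each batch

remaining = 10_500      # pretend the table has 10,500 matching rows

def execute_delete(limit):
    """Stand-in for running e.g. 'DELETE FROM some_table WHERE ... LIMIT %s'
    via a MySQL driver; returns the number of rows actually deleted."""
    global remaining
    deleted = min(limit, remaining)
    remaining -= deleted
    return deleted

batches = 0
while True:
    deleted = execute_delete(BATCH_SIZE)
    if deleted == 0:
        break                        # nothing left to delete
    batches += 1
    # time.sleep(PAUSE_SECONDS)      # uncomment in a real run

print(f"done in {batches} batches")  # prints "done in 6 batches"
```

Stopping when a batch deletes zero rows avoids needing to know the total count up front, and the pause between batches is what gives the secondaries room to drain their apply queues.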