This page provides information on how to troubleshoot replication issues.
Replication (non-reverse replication) is failing for some reason.
There are various reasons for replication to fail. This article explains the approach one might take when analyzing these issues.
Are replications triggered at all when you click the Activate button? If not, do the following:
Are the replications getting queued up in the replication agent queues?
Check this by going to /etc/replication/agents.author.html, then click each replication agent to inspect its queue.
If one agent queue or a few agent queues are stuck:
Does the queue show a blocked status? If so, is the publish instance down or completely unresponsive? Check the publish instance to see what is wrong with it (for example, check the logs for an OutOfMemory error or some other issue). If it is just generally slow, take thread dumps and analyze them.
Does the queue status show Queue is active - # pending? The replication job could be stuck in a socket read, waiting for the publish instance or dispatcher to respond. This could mean that the publish instance or dispatcher is under high load or stuck in a lock. Take thread dumps from author and publish in this case.
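The two checks above can be sketched as a small triage helper. This is only an illustration in Python, not part of the product: the function name triage_queue_status is hypothetical, and the matching assumes that the status text contains either the word "blocked" or the "Queue is active - N pending" pattern described above.

```python
import re


def triage_queue_status(status_text: str) -> str:
    """Map a replication queue status line to the next diagnostic step.

    Hypothetical helper: the status wording is taken from this article,
    and the exact text shown by a given instance may differ.
    """
    text = status_text.strip().lower()
    if "blocked" in text:
        # Publish may be down or unresponsive: start with its logs.
        return "check publish logs for OutOfMemory or other errors"
    if re.search(r"queue is active\s*-\s*\d+\s*pending", text):
        # The job may be waiting in a socket read on publish or dispatcher.
        return "take thread dumps from author and publish"
    return "no rule matched: inspect the agent manually"


if __name__ == "__main__":
    print(triage_queue_status("Queue is active - 3 pending"))
```

The helper only encodes the decision order from the list above. It does not contact any instance, so the status text has to be copied from the agent page.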
If all agent queues are stuck:
It is possible that a certain piece of content cannot be serialized under /var/replication/data due to repository corruption or some other issue. Check the logs/error.log for a related error. To clear out the bad replication item, do the following:
There might be something wrong with the Sling eventing framework job queues. Try restarting the org.apache.sling.event bundle in /system/console.
Job processing might be completely turned off. You can check this in the Felix Console under the Sling Eventing tab. Check whether it displays: Apache Sling Eventing (JOB PROCESSING IS DISABLED!)
The DefaultJobManager configuration might also have ended up in an inconsistent state. This can happen when someone manually modifies the ‘Apache Sling Job Event Handler’ configuration via the OSGi console (for example, by disabling and re-enabling the ‘Job Processing Enabled’ property and saving the configuration).
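The check for the disabled-job-processing banner can also be scripted. The following is a minimal sketch, assuming the banner text appears verbatim in the HTML of the Sling Eventing tab. The function names are hypothetical, and the URL of the tab is not given in this article, so it is passed in by the caller.

```python
import base64
import urllib.request

# Banner text as quoted in this article.
DISABLED_BANNER = "JOB PROCESSING IS DISABLED"


def job_processing_disabled(page_text: str) -> bool:
    """Return True if the console page text contains the disabled banner."""
    return DISABLED_BANNER in page_text.upper()


def fetch_console_page(url: str, user: str, password: str) -> str:
    """Fetch a console page using HTTP basic authentication.

    The caller supplies the URL of the Sling Eventing tab; this sketch
    does not assume a particular path.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

A typical use would be job_processing_disabled(fetch_console_page(url, "admin", password)), with url pointing at the Sling Eventing tab of the instance being checked.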
Create a replication.log
It can be very helpful to send all replication logging to a separate log file at DEBUG level. To do this:
Go to https://host:port/system/console/configMgr and log in as admin.
Find the Apache Sling Logging Logger factory and create an instance by clicking the + button on the right of the factory configuration. This will create a new logging logger.
Set the configuration like this:
If you suspect that the problem is related to Sling eventing or jobs in any way, you can also add this Java package under categories: org.apache.sling.event
Sometimes it is useful to pause the replication queue to reduce load on the author system without disabling it. Currently this is only possible through a hack: temporarily configuring an invalid port. From 5.4 onwards, a pause button is available in the replication agent queue, but it has some limitations.
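As a rough illustration, the logger settings described above can be written down as a set of properties. This sketch is hedged in several ways: the property keys are the ones used by Apache Sling Commons Log logger configurations, the file path logs/replication.log is an assumption based on the log name used in this section, and com.day.cq.replication is assumed to be the replication category. The function name is hypothetical.

```python
import json


def replication_logger_config(include_sling_event: bool = False) -> dict:
    """Build the property set for a dedicated replication logger.

    Assumptions: 'com.day.cq.replication' as the replication category
    and 'logs/replication.log' as the target file.
    """
    categories = ["com.day.cq.replication"]  # assumed replication package
    if include_sling_event:
        # Optional extra category named in this article.
        categories.append("org.apache.sling.event")
    return {
        "org.apache.sling.commons.log.level": "debug",
        "org.apache.sling.commons.log.file": "logs/replication.log",
        "org.apache.sling.commons.log.names": categories,
    }


if __name__ == "__main__":
    print(json.dumps(replication_logger_config(include_sling_event=True), indent=2))
```

The values map one-to-one to the fields of the logger form in the configuration manager (log level, log file, categories), so the output can serve as a checklist when filling in the form by hand.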
Page permissions are not replicated because they are stored under the nodes to which access is granted, not with the user.
In general page permissions should not be replicated from the author to publish and are not by default. This is because access rights should be different in those two environments. Therefore it is recommended to configure ACLs on publish separately from author.
In some cases the replication queue is blocked when trying to replicate namespace information from the author instance to the publish instance. This happens because the replication user does not have the jcr:namespaceManagement privilege. To avoid this issue, make sure that:
The replication user has the jcr:namespaceManagement privilege at the repository level. You can grant the privilege as follows:
Log in to CRX/DE (https://localhost:4502/crx/de/index.jsp) as administrator.
Select jcr:namespaceManagement from the privileges list.