I was recently performing local mailbox moves to help redistribute IOPS and database sizes across a six-node, cross-site DAG. Some of these mailboxes were rather large, and even though the 200MB pipe between the two sites wasn't saturated, I still ran into some interesting errors in the Application event log:
It's important to note that I didn't actually notice this message until the day after I completed my move requests, and they DID complete successfully; there were no errors or issues during the move process. Regardless, I hate seeing error messages in my logs, so naturally I dug in further. Upon investigation, I found that this issue typically crops up during mailbox moves only. The reason is largely the "DataMoveReplicationConstraint," a property set on each and every mailbox database in your DAG. This property tells the Mailbox Replication Service how many database copies must be evaluated as part of your move request.
As a whole, it's important to understand that the DataMoveReplicationConstraint is only one piece of a larger system of checks and balances during mailbox moves. Formally, this series of checks is referred to as the Data Guarantee API. The Data Guarantee API is used to check the following:
Check Replication Health - Confirm that the prerequisite number of database copies is available.
Check Replication Flush - Confirm that the required log files have been replayed against the prerequisite number of database copies.
In my specific case, the Event ID 10011 entries I saw in my Application log were a direct result of this "Check Replication Flush" component. In the environment I was working in, the DataMoveReplicationConstraint was configured as SecondDatacenter. What does this mean, and what other values can be configured?
DataMoveReplicationConstraint has the following possible values:
None: This is the default value when a mailbox database is created. When set to None, the Data Guarantee API conditions are ignored. This setting should only be used for mailbox databases that are NOT part of a DAG.
SecondCopy: At least one passive database copy must meet the Data Guarantee API conditions. This is the default value when you add the second copy of a mailbox database.
SecondDatacenter: At least one passive database copy in another Active Directory site must meet the Data Guarantee API conditions.
AllDatacenters: At least one passive database copy in each Active Directory site must meet the Data Guarantee API conditions.
AllCopies: All copies of the mailbox database must meet the Data Guarantee API conditions.
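If you need to change the constraint on a database, Set-MailboxDatabase handles it. A minimal sketch, run from the Exchange Management Shell — the database name "DB01" is a placeholder for illustration:

```powershell
# Hypothetical database name "DB01"; substitute one of your own.
# Require at least one passive copy in another AD site to satisfy
# the Data Guarantee API before a move is committed.
Set-MailboxDatabase -Identity "DB01" -DataMoveReplicationConstraint SecondDatacenter

# Verify the change took effect.
Get-MailboxDatabase -Identity "DB01" | Select-Object Name,DataMoveReplicationConstraint
```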
NOTE: Want to know which constraint your databases are operating under?
Run the following cmdlet from the Exchange Management Shell:
Get-MailboxDatabase | select name,datamove*
I highly recommend utilizing SecondDatacenter when you have a DAG that spans sites. It ensures the data of each mailbox database is healthy and replicated to at least one other copy in your alternate site, which protects you in case of a server, datacenter, or WAN failure during MRS operations. Before the Mailbox Replication Service can move a mailbox from one database to another, it must verify the replication health and replication flush of the database involved in the move request. In my case, with the DataMoveReplicationConstraint set to SecondDatacenter, the following items must be satisfied by the Data Guarantee API:
1. At least one passive copy of the database in another Active Directory site must be healthy.
2. This same passive copy must have a replay queue within 10 minutes of the replay lag time.
3. This same passive copy must have a copy queue length less than 10 logs.
4. This same passive copy must have an average copy queue length less than 10 logs.
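You can sanity-check these conditions yourself with Get-MailboxDatabaseCopyStatus. A sketch, assuming a hypothetical database "DB01" and a server "MBX-SITE2" in the alternate site:

```powershell
# Hypothetical names; run from the Exchange Management Shell.
# CopyQueueLength and ReplayQueueLength map directly to the
# thresholds above - each should be under 10 for a healthy copy.
Get-MailboxDatabaseCopyStatus -Identity "DB01\MBX-SITE2" |
    Format-List Name,Status,CopyQueueLength,ReplayQueueLength,LastInspectedLogTime
```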
As I mentioned earlier, the Event ID 10011 in my logs indicated that a passive copy in the SecondDatacenter was behind in log replay, which prompted the Data Guarantee API to respond with "NotSatisfied". The specific line in the event log that shows this is:
minimum replay time: 10/10/2016 4:55:14 PM, maximum replay time: 10/10/2016 4:55:14 PM, commit time: 10/10/2016 4:55:26 PM.
The Check Replication Flush component validates that the prerequisite number of database copies (in my case, four) have replayed their required transaction logs. It verifies this by comparing the last log replayed time stamp against the originating server's commit time stamp, plus an additional five seconds to account for clock drift/skew. If the replay time stamp from the receiving server (the server you are moving the mailbox to) is greater than the commit time from the source server, then the DataMoveReplicationConstraint is "Satisfied". If the replay time is not greater than the commit time, it is NOT satisfied.

Keep in mind, you as the admin do not need to preoccupy yourself with the timestamps of each MRS request. The system manages this for you, and things should just run smoothly. You should have a rough understanding of what is happening behind the scenes, though, in case something goes bump.

Once I refreshed my brain on all of this, I found that both of the passive database copies had fallen behind in log replay by about 800 logs. That's not a catastrophe by any means, but referencing the Data Guarantee API health prerequisites above, we know we needed at least one healthy, up-to-date passive copy in the alternate site, and we did not have one. Further investigation also confirmed some latency and packet loss on the WAN link between the two sites, which caused this issue. Needless to say, it has since been resolved, and all my mailbox moves completed without issue. The only symptom was in the Application logs, telling me that the passive copies in my alternate site had fallen behind in log replication.
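The timestamp comparison can be sketched in a few lines of PowerShell. This is only an illustration of the logic described above, not the actual MRS code, and the exact direction of the five-second skew allowance is my simplification; the sample values come from the event log entry shown earlier:

```powershell
# Timestamps from the 10011 event: the passive copy's last replayed
# log vs. the source's commit time.
$replayTime = Get-Date "2016-10-10 16:55:14"
$commitTime = Get-Date "2016-10-10 16:55:26"

# The flush check passes when replay has caught up to the commit
# time, with ~5 seconds of allowance for clock skew (my assumption
# about how the allowance is applied).
if ($replayTime.AddSeconds(5) -ge $commitTime) {
    "Satisfied"
} else {
    "NotSatisfied"   # the result in my case: replay was 12 seconds behind
}
```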
Running Get-MailboxDatabaseCopyStatus on each of the servers confirms the databases are mounted/healthy and free of any excessive copy queue or replay queue problems.
I hope this information helps!