Commit bc9ef10b authored by Philipp Reisner

drbd_uuid_compare(): Handle loss of the last P_WRITE_ACK packet of a resync correctly. (The loss caused missing resyncs.) Bugz 246

Connection drop while transmitting the last ack:
SyncSource loses the connection, SyncTarget sees the end of the resync.

Aug 18 08:39:42 uml1 drbd0: Handshake successful: Agreed network protocol version 90
Aug 18 08:39:42 uml1 drbd0: conn( WFConnection -> WFReportParams )
Aug 18 08:39:42 uml1 drbd0: drbd_sync_handshake:
Aug 18 08:39:42 uml1 drbd0: self 81DAF2FF6134FC1E:16EF5753AD5FA994:95B9E9AD329C137B:A4B1B25AC5927436 bits:4255 flags:0
Aug 18 08:39:42 uml1 drbd0: peer 16EF5753AD5FA994:0000000000000000:95B9E9AD329C137A:A4B1B25AC5927436 bits:0 flags:0
Aug 18 08:39:42 uml1 drbd0: uuid_compare()=1 by rule 70
Aug 18 08:39:42 uml1 drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> UpToDate )
Aug 18 08:39:42 uml1 drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Aug 18 08:39:42 uml1 drbd0: Began resync as SyncSource (will sync 17020 KB [4255 bits set]).
Aug 18 08:39:43 uml1 drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> Disconnecting )

Aug 18 08:39:42 uml2 drbd0: Handshake successful: Agreed network protocol version 90
Aug 18 08:39:42 uml2 drbd0: conn( WFConnection -> WFReportParams )
Aug 18 08:39:42 uml2 drbd0: drbd_sync_handshake:
Aug 18 08:39:42 uml2 drbd0: self 16EF5753AD5FA994:0000000000000000:95B9E9AD329C137A:A4B1B25AC5927436 bits:0 flags:0
Aug 18 08:39:42 uml2 drbd0: peer 81DAF2FF6134FC1E:16EF5753AD5FA994:95B9E9AD329C137B:A4B1B25AC5927436 bits:4255 flags:0
Aug 18 08:39:42 uml2 drbd0: uuid_compare()=-1 by rule 50
Aug 18 08:39:42 uml2 drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 18 08:39:42 uml2 drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 18 08:39:42 uml2 drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Aug 18 08:39:43 uml2 drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )

Only uml2 recognised the end of resync.

Aug 18 09:49:51 uml1 drbd0: Handshake successful: Agreed network protocol version 90
Aug 18 09:49:51 uml1 drbd0: conn( WFConnection -> WFReportParams )
Aug 18 09:49:51 uml1 drbd0: drbd_sync_handshake:
Aug 18 09:49:51 uml1 drbd0: self 81DAF2FF6134FC1E:CB7A2BEB83B25C28:16EF5753AD5FA994:95B9E9AD329C137B bits:3 flags:0
Aug 18 09:49:51 uml1 drbd0: peer 81DAF2FF6134FC1E:0000000000000000:CB7A2BEB83B25C28:16EF5753AD5FA994 bits:0 flags:0
Aug 18 09:49:51 uml1 drbd0: uuid_compare()=0 by rule 40
Aug 18 09:49:51 uml1 drbd0: No resync, but 3 bits in bitmap!
Aug 18 09:49:51 uml1 drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( Inconsistent -> UpToDate )

Aug 18 09:49:51 uml2 drbd0: Handshake successful: Agreed network protocol version 90
Aug 18 09:49:51 uml2 drbd0: conn( WFConnection -> WFReportParams )
Aug 18 09:49:51 uml2 drbd0: drbd_sync_handshake:
Aug 18 09:49:51 uml2 drbd0: self 81DAF2FF6134FC1E:0000000000000000:CB7A2BEB83B25C28:16EF5753AD5FA994 bits:0 flags:0
Aug 18 09:49:51 uml2 drbd0: peer 81DAF2FF6134FC1E:CB7A2BEB83B25C28:16EF5753AD5FA994:95B9E9AD329C137B bits:3 flags:0
Aug 18 09:49:51 uml2 drbd0: uuid_compare()=0 by rule 40
Aug 18 09:49:51 uml2 drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> UpToDate )

=> On the next connect, uml1 logs "No resync, but 3 bits in bitmap!".

Notation: C = current UUID, B = bitmap UUID, H1/H2 = first and second
history UUID; suffix s = self, p = peer.

rule 3.4:
  If Cs = Cp & Bs != 0 & Bp = 0 & Bs = H1p & H1s = H2p
  => I have not realized the end of the resync. I was SyncSource, the target saw the end of the resync.

    Correct my UUIDs: Bs = 0 (with rotate)

rule 3.5:
  If Cs = Cp & Bs = 0 & Bp != 0 & H1s = Bp & H2s = H1p
  => Peer has not realized the end of the resync. I was SyncTarget, the resync is actually done.

    Correct peer's UUIDs: Bp = 0 (with rotate)
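
The two rules can be checked against the UUIDs from the second handshake in the logs above. The following is only a minimal standalone sketch, not the kernel code: `struct uuid_set`, `uuid_eq()`, `clear_bitmap_with_rotate()`, `resync_end_fixup()` and `check_logged_handshake()` are hypothetical names introduced here for illustration. It assumes that the current UUIDs are compared with the lowest bit masked like the other slots, and that "with rotate" means the same shift the patch applies to the peer's UUIDs (H2 = H1, H1 = B, B = 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the four UUID slots printed in the logs:
 * current (C), bitmap (B) and the two history slots (H1, H2). */
struct uuid_set {
    uint64_t current;
    uint64_t bitmap;
    uint64_t history1;
    uint64_t history2;
};

/* Compare two UUIDs while ignoring the lowest bit, mirroring the
 * "& ~((u64)1)" masking used in the patch. */
static bool uuid_eq(uint64_t a, uint64_t b)
{
    return (a & ~UINT64_C(1)) == (b & ~UINT64_C(1));
}

/* "B = 0 (with rotate)": push the bitmap UUID into the history, then clear it.
 * Assumption: this matches what the patch does by hand for the peer. */
static void clear_bitmap_with_rotate(struct uuid_set *u)
{
    u->history2 = u->history1;
    u->history1 = u->bitmap;
    u->bitmap = 0;
}

/* Returns 1 (rule 3.4, self corrected), -1 (rule 3.5, peer corrected) or
 * 0 if neither rule applies. In the sketch, *rule_nr == 0 means "no fix-up". */
static int resync_end_fixup(struct uuid_set *self, struct uuid_set *peer,
                            int *rule_nr)
{
    *rule_nr = 0;
    if (!uuid_eq(self->current, peer->current))
        return 0;

    /* Rule 3.4: I was SyncSource and missed the end of the resync. */
    if (peer->bitmap == 0 && self->bitmap != 0 &&
        uuid_eq(self->bitmap, peer->history1) &&
        uuid_eq(self->history1, peer->history2)) {
        clear_bitmap_with_rotate(self);
        *rule_nr = 34;
        return 1;
    }

    /* Rule 3.5: I was SyncTarget, the peer missed the end of the resync. */
    if (self->bitmap == 0 && peer->bitmap != 0 &&
        uuid_eq(self->history1, peer->bitmap) &&
        uuid_eq(self->history2, peer->history1)) {
        clear_bitmap_with_rotate(peer);
        *rule_nr = 35;
        return -1;
    }
    return 0;
}

/* UUIDs copied from the 09:49:51 handshake, seen from uml1. */
void check_logged_handshake(void)
{
    struct uuid_set uml1 = { UINT64_C(0x81DAF2FF6134FC1E), UINT64_C(0xCB7A2BEB83B25C28),
                             UINT64_C(0x16EF5753AD5FA994), UINT64_C(0x95B9E9AD329C137B) };
    struct uuid_set uml2 = { UINT64_C(0x81DAF2FF6134FC1E), 0,
                             UINT64_C(0xCB7A2BEB83B25C28), UINT64_C(0x16EF5753AD5FA994) };
    int rule = 0;

    assert(resync_end_fixup(&uml1, &uml2, &rule) == 1);
    assert(rule == 34);
    /* After the correction uml1 carries the same bitmap/history UUIDs as uml2. */
    assert(uml1.bitmap == 0);
    assert(uml1.history1 == uml2.history1 && uml1.history2 == uml2.history2);
}
```

Called with the roles swapped (uml2 as self, uml1 as peer), the same sketch takes the rule 3.5 branch instead and rotates the peer's copy of the UUIDs, so both nodes end up with an identical view.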
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
parent ff34dbb0
@@ -2332,14 +2332,45 @@ static int drbd_uuid_compare(struct drbd_conf *mdev, int *rule_nr) __must_hold(l
 	    (peer == UUID_JUST_CREATED || peer == (u64)0))
 		return 2;
-	*rule_nr = 40;
-	if (self == peer) { /* Common power [off|failure] */
+	if (self == peer) {
 		int rct, dc; /* roles at crash time */
+		if (mdev->p_uuid[UI_BITMAP] == (u64)0 &&
+		    mdev->ldev->md.uuid[UI_BITMAP] != (u64)0 &&
+		    (mdev->ldev->md.uuid[UI_BITMAP] & ~((u64)1)) == (mdev->p_uuid[UI_HISTORY_START] & ~((u64)1)) &&
+		    (mdev->ldev->md.uuid[UI_HISTORY_START] & ~((u64)1)) == (mdev->p_uuid[UI_HISTORY_START + 1] & ~((u64)1))) {
+			dev_info(DEV, "was SyncSource, missed the resync finished event, corrected myself:\n");
+			drbd_uuid_set_bm(mdev, 0UL);
+			drbd_uuid_dump(mdev, "self", mdev->ldev->md.uuid,
+				       mdev->state.disk >= D_NEGOTIATING ? drbd_bm_total_weight(mdev) : 0, 0);
+			*rule_nr = 34;
+			return 1;
+		}
+		if (mdev->ldev->md.uuid[UI_BITMAP] == (u64)0 &&
+		    mdev->p_uuid[UI_BITMAP] != (u64)0 &&
+		    (mdev->ldev->md.uuid[UI_HISTORY_START] & ~((u64)1)) == (mdev->p_uuid[UI_BITMAP] & ~((u64)1)) &&
+		    (mdev->ldev->md.uuid[UI_HISTORY_START + 1] & ~((u64)1)) == (mdev->p_uuid[UI_HISTORY_START] & ~((u64)1))) {
+			dev_info(DEV, "was SyncTarget, peer missed the resync finished event, correced peer:\n");
+			mdev->p_uuid[UI_HISTORY_START + 1] = mdev->p_uuid[UI_HISTORY_START];
+			mdev->p_uuid[UI_HISTORY_START] = mdev->p_uuid[UI_BITMAP];
+			mdev->p_uuid[UI_BITMAP] = 0UL;
+			drbd_uuid_dump(mdev, "peer", mdev->p_uuid, mdev->p_uuid[UI_SIZE], mdev->p_uuid[UI_FLAGS]);
+			*rule_nr = 35;
+			return -1;
+		}
+		/* Common power [off|failure] */
 		rct = (test_bit(CRASHED_PRIMARY, &mdev->flags) ? 1 : 0) +
 			(mdev->p_uuid[UI_FLAGS] & 2);
 		/* lowest bit is set when we were primary,
 		 * next bit (weight 2) is set when peer was primary */
+		*rule_nr = 40;
 		switch (rct) {
 		case 0: /* !self_pri && !peer_pri */ return 0;
...