Case study: Scutech (鼎甲) repairs an ASM disk group failure with kfed
Posted by: scutech  Date: 2013-05-21 15:55:12
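Foreword
On March 30, the Oracle RAC database of an organization in Guangzhou failed after new disks were added, leaving a disk group unmountable. The organization first asked Oracle for help, but the quoted fee was very high, so it turned to Scutech (鼎甲科技) on the off chance that we could help. As a leading Chinese vendor of data disaster-recovery and backup products, Scutech works closely with customers in many industries, backed by strong technical capability and a complete service system. Our engineers resolved the failure in the shortest possible time and at no charge, and the organization's business returned to normal; the customer spoke highly of Scutech's technical expertise and service quality. This case study walks through how kfed was used to repair the ASM disk group.
I. Symptoms
After learning of the situation, Scutech immediately sent an engineer on site. A check of the ASM view v$asm_disk showed every disk in a normal state. Running "alter diskgroup dgdata mount;" reported success, yet a subsequent query of v$asm_diskgroup showed the disk group still in the DISMOUNTED state. The alert log showed the disk group reporting an error and being dismounted immediately after it was mounted. The alert log reads: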
Sat Mar 30 10:51:59 2013
NOTE: erasing incomplete header on grp 1 disk VOL19
NOTE: cache opening disk 0 of grp 1: VOL10 label:VOL10
NOTE: F1X0 found on disk 0 fcn 0.4276074
NOTE: cache opening disk 1 of grp 1: VOL11 label:VOL11
NOTE: cache opening disk 2 of grp 1: VOL12 label:VOL12
NOTE: cache opening disk 3 of grp 1: VOL13 label:VOL13
NOTE: cache opening disk 4 of grp 1: VOL14 label:VOL14
NOTE: cache opening disk 5 of grp 1: VOL3 label:VOL3
NOTE: cache opening disk 6 of grp 1: VOL4 label:VOL4
NOTE: cache opening disk 7 of grp 1: VOL5 label:VOL5
NOTE: cache opening disk 8 of grp 1: VOL6 label:VOL6
NOTE: cache opening disk 9 of grp 1: VOL7 label:VOL7
NOTE: cache opening disk 10 of grp 1: VOL8 label:VOL8
NOTE: cache opening disk 11 of grp 1: VOL9 label:VOL9
NOTE: cache opening disk 12 of grp 1: VOL1 label:VOL1
NOTE: cache opening disk 13 of grp 1: VOL2 label:VOL2
NOTE: cache opening disk 14 of grp 1: VOL15 label:VOL15
NOTE: cache opening disk 15 of grp 1: VOL16 label:VOL16
NOTE: cache opening disk 16 of grp 1: VOL17 label:VOL17
NOTE: cache opening disk 17 of grp 1: VOL18 label:VOL18
NOTE: cache mounting (first) group 1/0x36E8615F (DGDATA)
* allocate domain 1, invalid = TRUE
kjbdomatt send to node 1
Sat Mar 30 10:51:59 2013
NOTE: attached to recovery domain 1
Sat Mar 30 10:51:59 2013
NOTE: starting recovery of thread=1 ckpt=75.5792 group=1
NOTE: advancing ckpt for thread=1 ckpt=75.5792
NOTE: cache recovered group 1 to fcn 0.5174872
Sat Mar 30 10:51:59 2013
NOTE: opening chunk 1 at fcn 0.5174872 ABA
NOTE: seq=76 blk=5793
Sat Mar 30 10:51:59 2013
NOTE: cache mounting group 1/0x36E8615F (DGDATA) succeeded
WARNING: offlining disk 16.3915944441 (VOL17) with mask 0x3
NOTE: PST update: grp = 1, dsk = 16, mode = 0x6
Sat Mar 30 10:51:59 2013
ERROR: too many offline disks in PST (grp 1)
NOTE: cache closing disk 16 of grp 1: VOL17 label:VOL17
NOTE: cache closing disk 16 of grp 1: VOL17 label:VOL17
Sat Mar 30 10:51:59 2013
SUCCESS: diskgroup DGDATA was mounted
Sat Mar 30 10:51:59 2013
ERROR: PST-initiated MANDATORY DISMOUNT of group DGDATA
NOTE: cache dismounting group 1/0x36E8615F (DGDATA)
Sat Mar 30 10:51:59 2013
NOTE: halting all I/Os to diskgroup DGDATA
Sat Mar 30 10:51:59 2013
kjbdomdet send to node 1
detach from dom 1, sending detach message to node 1
Sat Mar 30 10:51:59 2013
Dirty detach reconfiguration started (old inc 2, new inc 2)
List of nodes:
 0 1
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
10 GCS resources traversed, 0 cancelled
4014 GCS resources on freelist, 6138 on array, 6138 allocated
Dirty Detach Reconfiguration complete
Sat Mar 30 10:51:59 2013
freeing rdom 1
Sat Mar 30 10:51:59 2013
WARNING: dirty detached from domain 1
Sat Mar 30 10:51:59 2013
SUCCESS: diskgroup DGDATA was dismounted
Received detach msg from node 1 for dom 2
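From knowledge of and experience with ASM, it was fairly clear that one disk had a problem and was being taken offline, which in turn caused the disk group to be dismounted because a disk was missing.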
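The reading of the alert log above boils down to spotting three messages in sequence: a disk being offlined, the PST reporting too many offline disks, and the mandatory dismount. A minimal sketch of pulling those lines out of an alert log follows; the function name and the example log path are illustrative assumptions, and the patterns are simply the message texts quoted in this article.

```shell
#!/bin/sh
# Print the alert-log lines that explain a forced ASM dismount.
# Usage (path is only an example): asm_offline_events /path/to/alert_+ASM1.log
asm_offline_events() {
    # -n prefixes each hit with its line number so it can be found in the log.
    grep -n -E 'offlining disk|could not validate disk|too many offline disks in PST|PST-initiated MANDATORY DISMOUNT' "$1"
}
```

Running it against the logs of both RAC nodes side by side makes it easier to see which node offlined which disk first.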
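The biggest problem at this point was that, with the disk group unmountable, no operation could be performed on it at all, not even dropping the disk suspected to be faulty. Talking with the organization's technician revealed the operations that led to the failure: three new disks were first added to the disk group, which raised an error; the new disks were then added again individually, which also raised errors; after that the disk group was found to be dismounted.
Consulting and comparing related material turned up solutions on Metalink for similar cases: clear the header of the faulty disk with dd, or force the faulty disk into a new disk group so that the original disk group no longer recognizes it, after which the original group can be mounted. To avoid further damage, I agreed with the customer that no modification would be made until a plan had been confirmed to work.
II. Failure analysis
I started with the logs, looking for the operation that first triggered the problem and the related log entries.
On node 1, the first attempt to add VOL17, VOL18 and VOL19 together showed no obvious error, but there was one warning, "WARNING: offlining disk 18.3915945713 (VOL19) with mask 0x3", which suggested that something went wrong while adding VOL19. The log reads: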
Fri Mar 29 18:31:37 2013
SQL> alter diskgroup DGDATA add disk 'ORCL:VOL17','ORCL:VOL18','ORCL:VOL19'
Fri Mar 29 18:31:37 2013
NOTE: reconfiguration of group 1/0x44e8663d (DGDATA), full=1
Fri Mar 29 18:31:38 2013
NOTE: initializing header on grp 1 disk VOL17
NOTE: initializing header on grp 1 disk VOL18
NOTE: initializing header on grp 1 disk VOL19
NOTE: cache opening disk 16 of grp 1: VOL17 label:VOL17
NOTE: cache opening disk 17 of grp 1: VOL18 label:VOL18
NOTE: cache opening disk 18 of grp 1: VOL19 label:VOL19
NOTE: PST update: grp = 1
NOTE: requesting all-instance disk validation for group=1
Fri Mar 29 18:31:38 2013
NOTE: disk validation pending for group 1/0x44e8663d (DGDATA)
SUCCESS: validated disks for 1/0x44e8663d (DGDATA)
Fri Mar 29 18:31:40 2013
NOTE: requesting all-instance membership refresh for group=1
Fri Mar 29 18:31:40 2013
NOTE: membership refresh pending for group 1/0x44e8663d (DGDATA)
SUCCESS: refreshed membership for 1/0x44e8663d (DGDATA)
Fri Mar 29 18:31:43 2013
WARNING: offlining disk 18.3915945713 (VOL19) with mask 0x3
NOTE: PST update: grp = 1, dsk = 18, mode = 0x6
NOTE: PST update: grp = 1, dsk = 18, mode = 0x4
NOTE: cache closing disk 18 of grp 1: VOL19
NOTE: PST update: grp = 1
NOTE: requesting all-instance membership refresh for group=1
Fri Mar 29 18:31:49 2013
NOTE: membership refresh pending for group 1/0x44e8663d (DGDATA)
NOTE: cache closing disk 18 of grp 1: VOL19
SUCCESS: refreshed membership for 1/0x44e8663d (DGDATA)
Received dirty detach msg from node 1 for dom 1
Fri Mar 29 18:31:51 2013
Dirty detach reconfiguration started (old inc 4, new inc 4)
List of nodes:
 0 1
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
2817 GCS resources traversed, 0 cancelled
1981 GCS resources on freelist, 7162 on array, 6138 allocated
1719 GCS shadows traversed, 0 replayed
Dirty Detach Reconfiguration complete
Fri Mar 29 18:31:51 2013
NOTE: PST enabling heartbeating (grp 1)
Fri Mar 29 18:31:51 2013
NOTE: SMON starting instance recovery for group 1 (mounted)
NOTE: F1X0 found on disk 0 fcn 0.4276074
NOTE: starting recovery of thread=1 ckpt=39.5722 group=1
NOTE: advancing ckpt for thread=1 ckpt=39.5722
NOTE: smon did instance recovery for domain 1
Fri Mar 29 18:31:53 2013
NOTE: recovering COD for group 1/0x44e8663d (DGDATA)
SUCCESS: completed COD recovery for group 1/0x44e8663d (DGDATA)
Fri Mar 29 18:32:18 2013
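At the same moment node 2 reported "ERROR: group 1/0x44e86390 (DGDATA): could not validate disk 18". VOL19 (that is, disk 18) was then taken offline and the disk group was dismounted on that node; part of the error output matches the log from the later failed mounts. The log reads: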
Fri Mar 29 18:31:37 2013
NOTE: reconfiguration of group 1/0x44e86390 (DGDATA), full=1
NOTE: disk validation pending for group 1/0x44e86390 (DGDATA)
ERROR: group 1/0x44e86390 (DGDATA): could not validate disk 18
SUCCESS: validated disks for 1/0x44e86390 (DGDATA)
NOTE: membership refresh pending for group 1/0x44e86390 (DGDATA)
NOTE: PST update: grp = 1, dsk = 18, mode = 0x4
Fri Mar 29 18:31:43 2013
ERROR: too many offline disks in PST (grp 1)
Fri Mar 29 18:31:43 2013
SUCCESS: refreshed membership for 1/0x44e86390 (DGDATA)
ERROR: ORA-15040 thrown in RBAL for group number 1
Fri Mar 29 18:31:43 2013
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm2_rbal_14019.trc:
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "" may result in a data loss
ORA-15042: ASM disk "18" is missing
NOTE: cache closing disk 18 of grp 1:
NOTE: membership refresh pending for group 1/0x44e86390 (DGDATA)
NOTE: cache closing disk 18 of grp 1:
NOTE: cache opening disk 16 of grp 1: VOL17 label:VOL17
NOTE: cache opening disk 17 of grp 1: VOL18 label:VOL18
SUCCESS: refreshed membership for 1/0x44e86390 (DGDATA)
Fri Mar 29 18:31:50 2013
ERROR: PST-initiated MANDATORY DISMOUNT of group DGDATA
NOTE: cache dismounting group 1/0x44E86390 (DGDATA)
Fri Mar 29 18:31:51 2013
NOTE: halting all I/Os to diskgroup DGDATA
Fri Mar 29 18:31:51 2013
kjbdomdet send to node 0
detach from dom 1, sending detach message to node 0
Fri Mar 29 18:31:51 2013
Dirty detach reconfiguration started (old inc 4, new inc 4)
List of nodes:
 0 1
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
2214 GCS resources traversed, 0 cancelled
5528 GCS resources on freelist, 7162 on array, 6138 allocated
Dirty Detach Reconfiguration complete
Fri Mar 29 18:31:51 2013
WARNING: dirty detached from domain 1
Fri Mar 29 18:31:51 2013
SUCCESS: diskgroup DGDATA was dismounted
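From this I judged that VOL19 most likely had a permission problem on node 2 when the disks were added: typically the oracle user lacks access to the device. To check, I ran RAC in my own virtual machines and simulated the mistake: on node 1 I granted the oracle user access to the new disks, on node 2 I did not, and then I added the disks. The resulting log was almost identical, with one difference: my simulated environment logged "ORA-15075: disk(s) are not visible cluster-wide", which the customer's log did not contain, so I still could not conclude that it was the same problem.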
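The permission hypothesis is cheap to check before any disk is added. The sketch below, meant to be run as the oracle OS user on every node, reports each device path that the current user cannot both read and write. The function name is hypothetical, and the device paths in the usage line assume the ASMLib layout (/dev/oracleasm/disks/) that appears later in this article.

```shell
#!/bin/sh
# Report every path the current user cannot both read and write.
# Returns 0 when all paths are accessible, 1 otherwise.
# Usage: check_disk_access /dev/oracleasm/disks/VOL17 /dev/oracleasm/disks/VOL18
check_disk_access() {
    rc=0
    for dev in "$@"; do
        if [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "ok      $dev"
        else
            # Also reached when the path does not exist on this node.
            echo "DENIED  $dev"
            rc=1
        fi
    done
    return "$rc"
}
```

A node where any disk shows DENIED would be expected to fail disk validation in the way node 2 did here.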
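Later, the customer's operation record turned out to contain the ORA-15075 error after all, which confirmed that the first attempt to add the disks failed because of permissions. Repeated tests around this mistake showed that, in the simulated environment, it never left the disk group unmountable: any node on which the oracle user had access to the disks could mount the disk group.
I then continued to replay the actual operations. Re-issuing the add-disk command after the failure caused no further damage; each time Oracle correctly reported "ORA-15029: disk '…' is already mounted by this instance". The customer's operation record shows that the second attempt, adding VOL17 and VOL18, also got ORA-15029, which means VOL17 and VOL18 had already joined the disk group.
The next operation in the record, however, behaved abnormally: adding VOL17 yet again produced "ORA-15033: disk 'ORCL:VOL17' belongs to diskgroup "DGDATA"". This error is unexpected here. Judging from the earlier tests, it means "VOL17 belongs to another disk group and cannot be added to the specified one unless the FORCE option is used". In other words, on the second attempt VOL17 was still recognized as a member of the mounted DGDATA disk group, but on the third attempt it was not. The log also turned abnormal at this point: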
Fri Mar 29 18:35:41 2013
SQL> alter diskgroup DGDATA add disk 'ORCL:VOL17'
Fri Mar 29 18:35:41 2013
NOTE: reconfiguration of group 1/0x44e8663d (DGDATA), full=1
Fri Mar 29 18:35:41 2013
WARNING: ignoring disk ORCL:VOL18 in deep discovery
WARNING: ignoring disk ORCL:VOL19 in deep discovery
NOTE: requesting all-instance membership refresh for group=1
Fri Mar 29 18:35:41 2013
NOTE: membership refresh pending for group 1/0x44e8663d (DGDATA)
SUCCESS: validated disks for 1/0x44e8663d (DGDATA)
NOTE: PST update: grp = 1, dsk = 16, mode = 0x4
Fri Mar 29 18:35:45 2013
ERROR: too many offline disks in PST (grp 1)
Fri Mar 29 18:35:45 2013
SUCCESS: refreshed membership for 1/0x44e8663d (DGDATA)
ERROR: ORA-15040 thrown in RBAL for group number 1
Fri Mar 29 18:35:45 2013
Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm1_rbal_13974.trc:
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "" may result in a data loss
ORA-15042: ASM disk "16" is missing
Fri Mar 29 18:35:45 2013
ERROR: PST-initiated MANDATORY DISMOUNT of group DGDATA
NOTE: cache dismounting group 1/0x44E8663D (DGDATA)
Fri Mar 29 18:35:45 2013
NOTE: halting all I/Os to diskgroup DGDATA
Fri Mar 29 18:35:45 2013
kjbdomdet send to node 1
detach from dom 1, sending detach message to node 1
Fri Mar 29 18:35:45 2013
Dirty detach reconfiguration started (old inc 4, new inc 4)
List of nodes:
 0 1
Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE
1291 GCS resources traversed, 0 cancelled
2347 GCS resources on freelist, 7162 on array, 6138 allocated
Dirty Detach Reconfiguration complete
Fri Mar 29 18:35:45 2013
freeing rdom 1
Fri Mar 29 18:35:45 2013
WARNING: dirty detached from domain 1
Fri Mar 29 18:35:46 2013
SUCCESS: diskgroup DGDATA was dismounted
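At this point the disk group was dismounted on node 1 as well, and it is this anomaly that evidently caused the later failure.
Because repeated add-disk operations in the simulated environment never reproduced the failure, my working conclusion was that it was probably an Oracle bug: the add-disk command may have happened to interfere with Oracle's rebalance onto the new disks, after which Oracle marked the disk offline and the disk group was dismounted. I shared the test results with the customer's technician and learned that the oracle user's access to the disks had since been set up on node 2 in their environment, but the failure persisted. I then went on to test the dd approach and forcing the faulty disk into a new disk group.
A series of tests followed. Since disks in the test environment do not fail on their own, I had to take the disk group offline by hand, apply the "repair", and then try to mount the disk group. Both overwriting the header of the "faulty disk" with dd and adding the "faulty disk" to a new disk group and then dropping it left the original disk group unmountable. In the test environment, however, every failed mount reported "ORA-15042: ASM disk "…" is missing", rather than reporting success as the real environment did. Comparing with articles online about dd and forced adding showed one big difference: the repaired cases all used "normal redundancy" disk groups. Such groups hold redundant data, so one failed disk does not get the disk group dismounted, and many operations remain possible. The customer's failed system used "external redundancy", where from Oracle's point of view the data has no redundancy; this is one reason the disk group could not be mounted.
Given this, two options came to mind. One was to see whether Oracle has a command to force-mount a disk group; such a command might exist for users to repair failures. The other: kfed can modify disk header information, so if I took the header of a healthy disk, edited it, and restored it onto the faulty disk, would the faulty disk be recognized correctly? The first option was soon ruled out: the documentation showed that only 11g has a force option for mounting a disk group, and, crucially, it can be used only on "normal redundancy" disk groups. The second option was tested on the simulated environment that had been broken the day before, and it worked. I sent the procedure to the customer's technician so that he could verify it on his own test environment.
The kfed header repair works as follows: pick a healthy disk, dump its header with kfed, compare it with the header dumped from the faulty disk, merge the two into a repaired header for the faulty disk, and write that back to the faulty disk. For example, if the healthy disk is /dev/rdsk/c1t0d0s3 and the faulty disk is /dev/rdsk/c1t1d0s1, the steps are: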
kfed read /dev/rdsk/c1t0d0s3 text=header0
kfed read /dev/rdsk/c1t1d0s1 text=header1
vimdiff header0 header1
(... edit into a "correct" header for the faulty disk and save it as header1fix ...)
kfed merge /dev/rdsk/c1t1d0s1 text=header1fix
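If the faulty disk's header contains no usable information, add the disk to a new disk group and then drop it, so that its header carries the new disk group's information.
The key fields to repair are: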
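kfdhdb.dsknum: the disk's number within the disk group, counting from 0. The alert logs above already use this zero-based number (disk 0 is VOL10), so "disk 18" in the Oracle log corresponds to 18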
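kfdhdb.grpname: the disk group name; if the disk was dropped from a new disk group, change this back to the original disk group's name
kfdhdb.grpstmp.hi: the disk group timestamp; copy it from the header of a healthy disk
kfdhdb.grpstmp.lo: same as above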
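Before editing anything by hand in vimdiff, it can help to look at just these four fields. The sketch below filters two kfed text dumps down to the fields listed above and diffs them; the function names are hypothetical, and the only assumption about the dump format is that each field starts a line as "kfdhdb.<name>:", as in the kfed output quoted in this article.

```shell
#!/bin/sh
# Print only the header fields that identify disk-group membership.
kfed_key_fields() {
    grep -E '^kfdhdb\.(dsknum|grpname|grpstmp\.hi|grpstmp\.lo):' "$1"
}

# Diff those fields between a healthy disk's dump and the faulty disk's dump.
# Usage: kfed_compare_headers header0 header1
# Returns diff's status: 0 when the fields match, 1 when they differ.
kfed_compare_headers() {
    good_fields=$(mktemp)
    bad_fields=$(mktemp)
    kfed_key_fields "$1" > "$good_fields"
    kfed_key_fields "$2" > "$bad_fields"
    diff "$good_fields" "$bad_fields"
    status=$?
    rm -f "$good_fields" "$bad_fields"
    return "$status"
}
```

kfdhdb.dsknum is expected to differ between any two disks; a difference in grpname or the two grpstmp fields is what would point to a damaged header.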
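The customer's technician later reported, however, that the headers of VOL17, VOL18 and VOL19 on the failed system were all correct, so this method would not help.
Cloning the failed environment
I then proposed copying all disks of the failed system with dd and building a test environment from the copies, so that the investigation could continue on a clone of the failed system. On Wednesday I received the copied data, attached it to the test environment over iSCSI, ran oracleasm scandisks, and began testing on the simulated environment.
III. Solving the problem
Checking the headers of VOL17, VOL18 and VOL19 with kfed confirmed that all of them were normal. Retrying the earlier methods on this environment confirmed that none of them worked either. Another approach was needed.
Following the article http://blog.csdn.net/tianlesoftware/article/details/6740716 , I first located KFBTYP_LISTHEAD in the disk group and then KFBTYP_DISKDIR. The DISKDIR blocks hold information on each disk, and the state of VOL17 differed from that of every other disk: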
kfddde[0].entry.incarn: 4 ; 0x024: A=0 NUMM=0x1
All other disks (including VOL18 and VOL19) showed:
kfddde[0].entry.incarn: 1 ; 0x024: A=1 NUMM=0x0
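With many disks, scanning the DISKDIR dumps by eye for the odd entry is tedious. As a rough sketch, the function below lists the entry.incarn lines of a kfed text dump whose flag reads A=0, the value that set VOL17 apart here. The function name is hypothetical, and the pattern assumes the line layout shown in the two samples above.

```shell
#!/bin/sh
# List disk-directory entries whose incarnation flags show A=0.
# Usage: diskdir_entries_a0 diskdir.dump   (a text file produced by kfed read)
diskdir_entries_a0() {
    grep -E '^kfddde\[[0-9]+\]\.entry\.incarn:.*A=0( |$)' "$1"
}
```

An empty result means every entry in that dump has A=1, as the other disks did in this case.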
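My analysis at the time was that changing VOL17's state should bring VOL17 back to normal. I did not try that right away, however, but followed the same line of thought to the PST: rather than modify VOL17's state, why not find the PST and delete VOL17 from it?
What the PST is: Partner Status Table (Oracle's own term is Partnership and Status Table). Maintains info on disk-to-diskgroup membership.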
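According to http://blog.csdn.net/tianlesoftware/article/details/6743677 , the PST should be located at AU=1 of one of the disks. I checked every disk in the disk group; only VOL10 contained the PST, but the first block of AU=1 held nothing useful: its type was KFBTYP_PST_META. Drawing on the earlier search for DISKDIR, I went on to check AU=1, BLK=1, which was still KFBTYP_PST_META, and then AU=1, BLK=2, which turned out to be KFBTYP_PST_DTA. Its contents were very regular: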
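kfed addresses the block by aun and blkn, but it is useful to know where that block sits on the raw device, for example to confirm that a dd backup covers it. A minimal sketch of the arithmetic follows, assuming a 1 MiB allocation unit and 4 KiB metadata blocks; both sizes are assumptions about this disk group, not values stated in the article, and the function name is hypothetical.

```shell
#!/bin/sh
# Byte offset of metadata block BLKN inside allocation unit AUN.
# Usage: asm_block_offset AUN BLKN [AU_SIZE_BYTES] [BLOCK_SIZE_BYTES]
asm_block_offset() {
    aun=$1
    blkn=$2
    au_size=${3:-1048576}   # assumed: 1 MiB allocation unit
    blk_size=${4:-4096}     # assumed: 4 KiB metadata block
    echo $(( aun * au_size + blkn * blk_size ))
}
```

Under those assumptions, asm_block_offset 1 2 prints 1056768, so AU=1, BLK=2 starts 1 MiB plus 8 KiB into the disk.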
kfdpDtaE[0].status: 117440512 ; 0x000: V=1 R=1 W=1
kfdpDtaE[0].index: 0 ; 0x004: CURR=0x0 CURR=0x0 FORM=0x0 FORM=0x0
kfdpDtaE[0].partner[0]: 0 ; 0x008: 0x0000
kfdpDtaE[0].partner[1]: 0 ; 0x00a: 0x0000
kfdpDtaE[0].partner[2]: 0 ; 0x00c: 0x0000
......
kfdpDtaE[0].partner[19]: 0 ; 0x02e: 0x0000
kfdpDtaE[1].status: 117440512 ; 0x030: V=1 R=1 W=1
kfdpDtaE[1].index: 0 ; 0x034: CURR=0x0 CURR=0x0 FORM=0x0 FORM=0x0
kfdpDtaE[1].partner[0]: 0 ; 0x038: 0x0000
kfdpDtaE[1].partner[1]: 0 ; 0x03a: 0x0000
kfdpDtaE[1].partner[2]: 0 ; 0x03c: 0x0000
......
kfdpDtaE[1].partner[19]: 0 ; 0x05e: 0x0000
kfdpDtaE[2].status: 83886080 ; 0x060: V=1 R=1 W=1
......
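This pattern continued up to kfdpDtaE[18].status, from which point the status was 0. Mapping the entries to disks one by one: entries 0 to 15 correspond to the original 16 disks, entries 16 and 17 to the newly added VOL17 and VOL18, and VOL19 does not appear in the table because of the permission problem. I decided to try editing the table to remove VOL17 and VOL18 from the disk group: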
dd if=/dev/oracleasm/disks/VOL10 of=vol10.save bs=1048576 count=10
kfed read /dev/oracleasm/disks/VOL10 aun=1 blkn=2 text=pst.data
vi pst.data
(... set kfdpDtaE[16].status and kfdpDtaE[17].status to 0 and save as pst.update ...)
kfed merge /dev/oracleasm/disks/VOL10 aun=1 blkn=2 text=pst.update
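The manual vi step can also be scripted, which makes the change repeatable when moving from the cloned environment to production. The sketch below rewrites the status value of the chosen kfdpDtaE entries to 0 and leaves every other line untouched. The function name is hypothetical, and it is an assumption that kfed merge accepts the slightly changed spacing and ignores the stale flag text after the ";"; verify on a copy first, and keep the dd backup.

```shell
#!/bin/sh
# Set kfdpDtaE[N].status to 0 for each listed entry number N.
# Usage: pst_zero_status pst.data 16 17 > pst.update
pst_zero_status() {
    infile=$1
    shift
    awk -v idx="$*" '
        BEGIN {
            # Build the set of field names to rewrite, e.g. "kfdpDtaE[16].status:".
            n = split(idx, a, " ")
            for (i = 1; i <= n; i++) want["kfdpDtaE[" a[i] "].status:"] = 1
        }
        # Only the numeric value is replaced; the text after ";" is left as is.
        ($1 in want) { sub(/status:[ \t]*[0-9]+/, "status: 0") }
        { print }
    ' "$infile"
}
```

Diffing pst.data against pst.update before running kfed merge should show exactly two changed lines in this case.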
Mounting the disk group reported success just as before, so the log had to be checked:
Thu Apr 4 14:15:08 2013
SQL> alter diskgroup dgdata mount
Thu Apr 4 14:15:08 2013
NOTE: cache registered group DGDATA number=2 incarn=0x0c76f699
Thu Apr 4 14:15:08 2013
NOTE: Hbeat: instance first (grp 2)
Thu Apr 4 14:15:13 2013
NOTE: start heartbeating (grp 2)
Thu Apr 4 14:15:13 2013
NOTE: erasing incomplete header on grp 2 disk VOL17
NOTE: erasing incomplete header on grp 2 disk VOL18
NOTE: erasing incomplete header on grp 2 disk VOL19
NOTE: cache opening disk 0 of grp 2: VOL10 label:VOL10
NOTE: F1X0 found on disk 0 fcn 0.4276074
NOTE: cache opening disk 1 of grp 2: VOL11 label:VOL11
NOTE: cache opening disk 2 of grp 2: VOL12 label:VOL12
NOTE: cache opening disk 3 of grp 2: VOL13 label:VOL13
NOTE: cache opening disk 4 of grp 2: VOL14 label:VOL14
NOTE: cache opening disk 5 of grp 2: VOL3 label:VOL3
NOTE: cache opening disk 6 of grp 2: VOL4 label:VOL4
NOTE: cache opening disk 7 of grp 2: VOL5 label:VOL5
NOTE: cache opening disk 8 of grp 2: VOL6 label:VOL6
NOTE: cache opening disk 9 of grp 2: VOL7 label:VOL7
NOTE: cache opening disk 10 of grp 2: VOL8 label:VOL8
NOTE: cache opening disk 11 of grp 2: VOL9 label:VOL9
NOTE: cache opening disk 12 of grp 2: VOL1 label:VOL1
NOTE: cache opening disk 13 of grp 2: VOL2 label:VOL2
NOTE: cache opening disk 14 of grp 2: VOL15 label:VOL15
NOTE: cache opening disk 15 of grp 2: VOL16 label:VOL16
NOTE: cache mounting (first) group 2/0x0C76F699 (DGDATA)
NOTE: starting recovery of thread=1 ckpt=94.5829 group=2
NOTE: advancing ckpt for thread=1 ckpt=94.5830
NOTE: cache recovered group 2 to fcn 0.5174912
Thu Apr 4 14:15:13 2013
NOTE: opening chunk 1 at fcn 0.5174912 ABA
NOTE: seq=95 blk=5831
Thu Apr 4 14:15:13 2013
NOTE: cache mounting group 2/0x0C76F699 (DGDATA) succeeded
SUCCESS: diskgroup DGDATA was mounted
Thu Apr 4 14:15:13 2013
NOTE: recovering COD for group 2/0xc76f699 (DGDATA)
SUCCESS: completed COD recovery for group 2/0xc76f699 (DGDATA)
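There was a change! VOL17 and VOL18 had their headers erased just like VOL19, the disk group no longer reported that VOL17 needed to go offline, and it was no longer dismounted. The disk group state and the disk states then checked out as normal. I edited the pfile and started the database, which opened successfully. Force-adding the three new disks again succeeded, and everything ran stably.
Repairing the failure
Once the database was open in the simulated environment, I began exporting part of the business data with Data Pump. The next day several application developers came on site to check data consistency. Finally the production system was repaired with the same method used in the simulated environment; the production database opened normally and the business resumed.
Summary
This ASM disk group failure underlines the importance of disaster-recovery backups: to keep operator error or system failure from causing data loss, take backups in advance.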