
[Case Study] Handling a V7000U Outage



I. Background



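At present there are no more than 10 IBM Storwize V7000 Unified storage systems installed in China, so both technical documentation and experienced engineers are scarce. Against this background, 太阳城官网 recently resolved a V7000U outage, and we share the experience here.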


II. Fault Description


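On May 3, 2018, the customer's machine room lost power unexpectedly. After power was restored at 23:00, the V7000U storage did not return to normal operation, which affected the customer's core business. After receiving the service request, 太阳城官网 engineers responded promptly. The situation on site was as follows: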

1. The management software IP answered ping, but the GUI would not start and the controller management IP 10.183.2.74 was unreachable:

image001.png


image002.png


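2. The GUI service was checked and found not to be running:

1) NAS and SAN services were unavailable;

2) Both controllers in the control enclosure were in alarm, and the front panel showed an alarm;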

[root@7803626.mgmt002st001 ~]# lshealth
Host          Sensor               Status   Value
mgmt001st001  HOST_STATE           OK       OK
              SERVICE              ERROR    At least one service is not running
              CTDB                 ERROR    CTDBSTATE_STATE_UNHEALTHY
              GPFS                 WARNING  Running OK with warnings
              SCM                  WARNING  Running OK with warnings
              MGMTNODE_REPL_STATE  OK       OK
              TB_STATE             ERROR    GENERAL_SSH_ERROR
              NETWORK              OK       Network interfaces are online
mgmt002st001  HOST_STATE           OK       OK
              SERVICE              WARNING  Running OK with warnings
              CTDB                 ERROR    CTDBSTATE_STATE_UNHEALTHY
              GPFS                 WARNING  Running OK with warnings
              SCM                  WARNING  Running OK with warnings
              TB_STATE             ERROR    GENERAL_SSH_ERROR
              NETWORK              OK       Network interfaces are online
V7000         CONNECTION           ERROR    Failed to open the SSH channel
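
When a health listing like the one above runs to many rows, it can help to filter it down to the sensors in ERROR state. The following is a minimal Python sketch, not an IBM tool. It assumes that every sensor row carries one of OK, WARNING, or ERROR as its status column and that the host name is printed only on the first row of each host. The function names and the shortened sample text are illustrative.

```python
# Minimal sketch: parse lshealth-style text into (host, sensor, status, value) rows.
# Assumption: each sensor row contains OK / WARNING / ERROR as its status column,
# and the host name appears only on the first row of each host.
STATUSES = {"OK", "WARNING", "ERROR"}


def parse_lshealth(text):
    rows = []
    host = None
    for line in text.splitlines():
        tokens = line.split()
        # Index of the first status keyword; None for prompts, headers, blank lines.
        idx = next((i for i, tok in enumerate(tokens) if tok in STATUSES), None)
        if idx not in (1, 2):
            continue
        if idx == 2:  # the row starts with a host name
            host = tokens[0]
        value = " ".join(tokens[idx + 1:])
        rows.append((host, tokens[idx - 1], tokens[idx], value))
    return rows


def errors(rows):
    # Keep only the (host, sensor) pairs whose status is ERROR.
    return [(host, sensor) for host, sensor, status, _ in rows if status == "ERROR"]


# Shortened excerpt of the listing above, used only to exercise the parser.
SAMPLE = """\
Host Sensor Status Value
mgmt001st001 HOST_STATE OK OK
SERVICE ERROR At least one service is not running
CTDB ERROR CTDBSTATE_STATE_UNHEALTHY
GPFS WARNING Running OK with warnings
mgmt002st001 HOST_STATE OK OK
TB_STATE ERROR GENERAL_SSH_ERROR
V7000 CONNECTION ERROR Failed to open the SSH channel
"""

if __name__ == "__main__":
    for host, sensor in errors(parse_lshealth(SAMPLE)):
        print(host, sensor)
```

In this incident the filtered list points at the SSH connection to the V7000 and at CTDB, which is consistent with the storage controllers being down rather than the file modules themselves.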


3. The node status and cluster service status were checked:

[root@7803626.mgmt002st001 ~]# mmlsmount all
Device not ready.
mmremote: Command was unable to determine whether file system ADDOMAIN is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system QYwenhua is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system RuiLongData is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system SafeFile is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system SuiCheKa is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system XZshare is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system itshare is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system systembak is mounted.
mmlsmount: Command failed. Examine previous error messages to determine cause.

[root@7803626.mgmt002st001 ~]# mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         7803626.ibm
  GPFS cluster id:           12402640704854046074
  GPFS UID domain:           7803626.ibm
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    mgmt001st001
  Secondary server:  mgmt002st001

 Node  Daemon node name  IP address  Admin node name  Designation
------------------------------------------------------------------
   1   mgmt001st001      172.31.8.2  mgmt001st001     quorum-manager-ctdb
   2   mgmt002st001      172.31.8.3  mgmt002st001     quorum-manager-ctdb
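
To record which file systems were affected, the names can be pulled out of the mmlsmount error messages shown above. This is a small Python sketch that relies only on the wording of those messages; the function name and the shortened sample are illustrative, and the pattern is an assumption that would need adjusting if the message text differs on another release.

```python
import re

# Matches the mmremote message wording seen in the transcript above.
PATTERN = re.compile(r"whether file system (\S+) is mounted")


def undetermined_filesystems(text):
    """Return the file system names whose mount state could not be determined."""
    return PATTERN.findall(text)


# Shortened excerpt of the transcript, used only to exercise the function.
MMLSMOUNT_SAMPLE = """\
Device not ready.
mmremote: Command was unable to determine whether file system ADDOMAIN is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system QYwenhua is mounted.
Device not ready.
mmremote: Command was unable to determine whether file system systembak is mounted.
mmlsmount: Command failed. Examine previous error messages to determine cause.
"""

if __name__ == "__main__":
    print(undetermined_filesystems(MMLSMOUNT_SAMPLE))
```

Applied to the full transcript, the same function would list all eight file systems, which matches the symptom that every NAS share was unavailable.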


4. The V7000U canister nodes were checked: the nodes were in alarm and the front panel of the control enclosure showed an alarm, as shown:

image003.jpg
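
Symptom 1 above is a reminder that an address answering ping says nothing about whether the services behind it are up. A quick way to separate the two is to test the service ports directly. The sketch below uses only the Python standard library; the helper names are illustrative, and the port numbers in the usage note (22 for SSH, 443 for an HTTPS GUI) are assumptions, not values taken from this system.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host, ports, timeout=3.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

For example, `check_services("10.183.2.74", [22, 443])` would report each of the two assumed ports as open or closed, independent of whether the address answers ICMP echo.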



III. Troubleshooting



The engineers put considerable effort into tracking down the cause, and the investigation eventually turned a corner:

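1. After collecting and analyzing the logs and symptoms described above, we concluded that both V7000U controllers had failed and were unavailable, which left the storage unusable;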

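2. We logged in to the storage controllers to inspect the fault. Because the management IP 10.183.2.74 was unusable, we connected a laptop directly to the controllers with a network cable and found that the nodes reported error 673, which indicates a battery problem;

Direct-connect IPs: 192.168.70.121 / 192.168.70.122

Username: ****** Password: ******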

image004.png


image005.png


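3. We power-cycled the control enclosure, but the error persisted. We then tested every port on the controllers: the fibre links showed light and the Ethernet LEDs were normal, yet the management port 10.183.2.74 still could not be pinged;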

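4. We asked the company to dispatch spare parts. A courier hand-carried them by air and landed at the airport at 22:50 on May 4; at 23:20 both controller batteries were replaced, and the system came back up 10 minutes later;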

5. We started the management services, the NAS heads, and the remaining services;

6. At 0:10 on May 5, services were back to normal.
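
The timestamps in the steps above give a rough picture of where the time went. The short Python calculation below uses only the times stated in this write-up; the variable names are illustrative, and the actual moment the outage began (before power returned) is not known, so it is left out.

```python
from datetime import datetime

# Timestamps as stated in the write-up (local time, 2018).
power_restored = datetime(2018, 5, 3, 23, 0)
parts_arrived = datetime(2018, 5, 4, 22, 50)
batteries_replaced = datetime(2018, 5, 4, 23, 20)
service_restored = datetime(2018, 5, 5, 0, 10)


def hours_minutes(delta):
    """Convert a timedelta to an (hours, minutes) tuple."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    print("power restored -> service restored:",
          hours_minutes(service_restored - power_restored))
    print("power restored -> parts arrived:",
          hours_minutes(parts_arrived - power_restored))
    print("batteries replaced -> service restored:",
          hours_minutes(service_restored - batteries_replaced))
```

Of the 25 hours 10 minutes between power restoration and service recovery, 23 hours 50 minutes had passed before the spare batteries arrived, while the repair itself took 50 minutes from battery replacement to restored service.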


IV. Lessons Learned


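After one day and two nights of continuous work, and with the support of management on all sides, the engineers kept checking, testing, and analyzing despite the shortage of documentation and expertise until they found the root cause. The incident was fully resolved, and the customer's core applications and main services returned to normal.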

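1. This failure was caused by aged batteries: after power-on, neither battery could charge normally, so the battery state was abnormal and the controllers could not enter normal service;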

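2. In general, for many kinds of IT equipment (such as midrange servers and storage), a component failure (for example a battery or a power supply) while the system is powered on does not cause an outage, and the part can be replaced online following the normal fault-handling procedure. If a component fails during power-on, however, some products are designed for safety reasons to block startup, and the failed part must be replaced, even if it is a redundant one, before power-on is allowed to continue;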

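3. For equipment that has been in service for a long time, the inrush current during power-off and power-on operations raises the risk of hardware damage. The failure rate during these operations is much higher than in steady running, so they deserve particular care.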


For more information, visit the official 太阳城官网 website: www.antute.com.cn
