
Sun Cluster 3.2: a strange cluster problem

Date: 2010-09-25

Source: Internet

A very strange problem: two Fujitsu M5000 servers running Solaris 10, with Fujitsu's system patches applied. After installing and configuring Sun Cluster (with the latest patches installed), the pmfd process keeps dying and one of the nodes then reboots. The system log contains only limited information, and no process core dump is produced. Very odd; I have no idea where the problem is.
Sep 25 10:57:46 trs00mlcprc01 kernel: cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Sep 25 10:57:46 trs00mlcprc01 kernel: unix: [ID 836849 kern.notice]
panic[cpu0]/thread=2a10007fca0:
Sep 25 10:57:46 trs00mlcprc01 kernel: unix: [ID 562397 kern.notice] Failfast: Aborting zone "global" (zone ID 0) because "pmfd" died 35 seconds ago.

Author: 淡然的紫色   Posted: 2010-09-25

Check whether there is a core file generated by pmfd under /, i.e. a file whose name starts with core:

file core*
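
If nothing turns up directly under /, a broader search may help. A sketch; the -mtime window is an assumption, so adjust it to match when the panic happened:

# find / -xdev -name 'core*' -mtime -1 2>/dev/null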

Author: doging   Posted: 2010-09-25

Reply to doging


    My system's core dump directory is configured as /var/core, but there is nothing in that directory.

Author: 淡然的紫色   Posted: 2010-09-25

Reply to 淡然的紫色


System (kernel) core files are located under /var/crash/xxx (usually named after the host).

Process core files are generally located in the process's working directory, e.g. "/".
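
To confirm where each kind of core ends up on your system, you can print the current crash dump and per-process core configuration; a minimal sketch using standard Solaris commands:

# dumpadm
# coreadm
# ls -l /var/crash/`uname -n`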

Author: doging   Posted: 2010-09-25

Article ID : 1020514.1
Article Type : Problem Resolutions (SURE)
Last reviewed : 2010-06-03
Audience : PUBLIC
Copyright Notice: Copyright © 2010, Oracle Corporation and/or its affiliates.
Sun[TM] Cluster 3.X: Cluster node paniced with rgmd, rpc.fed, pmfd died some seconds ago message


--------------------------------------------------------------------------------

Symptoms
This document provides the basic steps to resolve the following failfast panics.

Failfast: Aborting because "rgmd" died 30 seconds ago.
Failfast: Aborting because "rpc.fed" died 30 seconds ago.
Failfast: Aborting because "rpc.pmfd" died 30 seconds ago.
Failfast: Aborting because "clexecd" died 30 seconds ago.
Failfast: Aborting because "globalrgmd" died 30 seconds ago


Resolution
How to troubleshoot the problem


Why does it happen?
The panic message indicates that the cluster-specific daemon named in the message has died. The panic is a recovery action taken by the cluster's failfast mechanism when it detects a critical problem. Because these daemons are critical processes that cannot be restarted, the cluster shuts the node down with a failfast panic.



Troubleshooting steps
To find the root cause of the problem, you need to find out why the cluster-specific daemon shown in the messages died. The following steps show how to identify the root cause.

Check the /var/adm/messages system log file for messages indicating that memory resources may have been limited, such as in the following example. If such messages appear before the panic messages, the root cause is probably memory exhaustion: when a system runs out of memory, a process may dump core and die. If you find messages indicating a lack of memory resources, you will need to find out why the system ran short of memory and fix that to avoid this panic.

Apr  2 18:05:13 sun-server1 cl_runtime: [ID 661778 kern.warning] WARNING: clcomm: memory low: freemem 0xfb

Another indication is to check for messages reporting that swap space was limited, such as in the following example.

Apr  2 18:05:03 sun-server1 tmpfs: [ID 518458 kern.warning] WARNING: /tmp: File system full, swap space limit exceeded
Apr  2 18:05:10 sun-server1 Cluster.PMF.pmfd: [ID 837760 daemon.error] monitored processes forked failed(errno=12)
Apr  2 18:05:12 sun-server1 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 25825 (in.ftpd)
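
A quick way to scan the system log for both kinds of indicator is a single grep over /var/adm/messages; a sketch, with the match strings taken from the examples in this document:

# egrep -i "memory low|swap space|File system full" /var/adm/messages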


For additional information, you can view the messages logged just before the panic from a kernel core file using mdb. Check those messages as well as the /var/adm/messages file.

# cd /var/crash/`uname -n`
# echo "::msgbuf -v" | mdb -k unix.0 vmcore.0

As some bugs causing this panic were fixed in the Core or Core/Sys Admin patches, check whether your system still has an old patch installed. Check the README of the latest patch to see whether any relevant bugs were fixed between the revision installed on your machine and the latest one (a quick way to check the installed revision is sketched after the list). The following is the list of relevant bugs causing this panic and their patches:

Patch ID: 126105 Sun Cluster 3.2: CORE patch for Solaris 9
Patch ID: 126106 Sun Cluster 3.2: CORE patch for Solaris 10
Patch ID: 126107 Sun Cluster 3.2: CORE patch for Solaris 10_x86




Bug ID: 6784007 Running scstat(1M) causes memory to be leaked in rgmd
Bug ID: 6747452 RU failed- Failfast: Aborting because "globalrgmd" died 30 seconds ago. rgmd core dump
Bug ID: 6620504 Failfast: Aborting zone "global" (zone ID 0) because "rgmd" died 30 seconds ago, in cthon testing
Bug ID: 6739317 during ha_frmwk_fi, Failfast: Aborting zone "global" (zone ID 0) because "fed" died 35 seconds ago


Patch ID: 117950 Sun Cluster 3.1: Core Patch for Solaris 8
Patch ID: 117949 Sun Cluster 3.1: Core Patch for Solaris 9
Patch ID: 117909 Sun Cluster 3.1_x86: Core Patch for Solaris 9_x86


Bug ID: 5035341 Failfast: Aborting because "clexecd" died 30 seconds ago
Bug ID: 5043407 RGM test panic "Failfast: Aborting because "rgmd" died 30 seconds ago"
Bug ID: 6438132 rgmd dumped core while resources were being disabled
Bug ID: 6460419 syntax error in scswitch kills rgmd
Bug ID: 6312828 cluster panics with 'rgmd died' panic when ld_preload set and scstat or scha_resource_get is used
Bug ID: 6192133 rgmd core dumped during functional tests on sc32/sc31u4 clusters
Bug ID: 6290248 rgmd dumped core while rs stop failed flag was cleared after disabling all resources in an RG


Patch ID: 120500 Sun Cluster 3.1: Core Patch for Solaris 10
Patch ID: 120501 Sun Cluster 3.1_x86: Core Patch for Solaris 10_x86


Bug ID: 6438132 rgmd dumped core while resources were being disabled
Bug ID: 6460419 syntax error in scswitch kills rgmd
Bug ID: 6312828 cluster panics with 'rgmd died' panic when ld_preload set and scstat or scha_resource_get is used
Bug ID: 6290248 rgmd dumped core while rs stop failed flag was cleared after disabling all resources in an RG


Patch ID: 110648 Sun Cluster 3.0: Core/Sys Admin Patch for Solaris 8
Patch ID: 112563 Sun Cluster 3.0: Core/Sys Admin Patch for Solaris 9


Bug ID: 4756973 rgmd uses idl object after failed idl call in scha control giveover: causes segv
Bug ID: 4690244 Failfast: Aborting because "rgmd" died 30 seconds ago
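
To see which revision of the relevant core patch a node is running, query the patch database; a sketch, using the Sun Cluster 3.2 on Solaris 10 patch from the list above as an example:

# showrev -p | grep 126106

Compare the revision reported against the latest revision available on SunSolve.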




The failfast panic generates a kernel core file; in general, however, that file does not help you find the reason a process died. In most cases, when this panic happens, the process died while dumping an application core, and that application core file is what will help you find the root cause. To collect application core files, use the coreadm command so that core files are uniquely named and stored in a consistent place. Run the following commands on each cluster node.

# mkdir -p /var/cores
# coreadm -g /var/cores/%f.%n.%p.%t.core \
-e global \
-e global-setid \
-e log \
-d process \
-d proc-setid

For more details, have a look at Solution 202274: Using coreadm to Generate Core Files in Solaris[TM] 8 Operating System and Later.
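
To confirm the new settings took effect, run coreadm with no arguments to print the current configuration; you can also force a test core from a throwaway process. A sketch; the sleep process is just a disposable target:

# coreadm
# sleep 300 &
# kill -SEGV $!
# ls /var/cores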

If you have an application core file from a cluster-specific daemon, you may want to analyze it. To analyze the core file, you can start with Solution 206410: How to correctly debug an application core file. For a quick analysis, use the pstack command. It gives you the stack trace from the core file, which can be used to search for existing bugs on SunSolve[SM]. It is also a good idea to give your Sun[TM] Services representative the pstack output for further analysis.

# /usr/bin/pstack /var/cores/rgmd.halab1.7699.1242026038.core
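
If you need more than the stack, the same core can be opened with mdb; a sketch, reusing the example core file name above. ::status summarizes the process and the fatal signal, and $C prints the stack with frame pointers:

# mdb /var/cores/rgmd.halab1.7699.1242026038.core
> ::status
> $C
> $q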


Product
Sun Cluster 3.0
Sun Cluster 3.1
Solaris Cluster 3.2

Author: calcm   Posted: 2010-09-25

Reply to calcm


    I read through that document on SunSolve, but it still didn't solve my problem. I did find a bug report there that looks very similar to my symptoms, but it doesn't say how to resolve it. Ugh!!!

Bug ID: 4834069
Synopsis: rpc.pmfd dies, panicing node after 35 secs
Category: suncluster    Subcategory: pmf    State: closed    Priority:
Responsible Manager:     Responsible Engineer:
Description: Repeated loss of pmfd for no apparent reason. When pmfd dies, the node
panics after a 30-second grace period with the message:
Mar 10 13:17:20 kbcsms unix: [ID 562397 kern.notice] Failfast: Aborting because "pmfd" died 35 seconds ago.

There does not appear to be any resource contention (proc slots, memory/swap)
which could explain these deaths. There have been no core files either.

This is a two-node cluster; the problem only occurs on kbcsms.

We configured and ran rpc.pmfd in debug mode, but there does not appear to
be any concrete lead in the debug info. The following debug line is
recorded just before the panic and reboot.

Mar 13 10:33:32 kbcsms Cluster.PMF.pmfd: [ID 668555 daemon.debug] exec

-------

Note that the above line was not the last one before the panic.  From the
messages file, we can see that the node didn't panic until 10:35:48:

Mar 13 10:35:48 kbcsms unix: [ID 562397 kern.notice] Failfast: Aborting because "pmfd" died 35 seconds ago.

In the pmf debug output file, we can see other messages from pmf and
libsecurity following the exec, before the panic:

Mar 13 10:33:32 kbcsms Cluster.PMF.pmfd: [ID 668555 daemon.debug] exec
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 681573 daemon.debug] entering security_svc_reg.
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 147686 daemon.debug] entering security_svc_init.
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 347755 daemon.debug] security_svc_init: loopbackset is:
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 695139 daemon.debug] 'ticotsord'
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 695139 daemon.debug] 'ticots'
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 577222 daemon.debug] End of loopback set.
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 719136 daemon.debug] inited = 1.
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:47 kbcsms Cluster.PMF.pmfd: [ID 635902 daemon.debug] regcnt=3 and didnegotiatecnt=3
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 746762 daemon.debug] pmfproc_null_1_svc called
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 4
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 746762 daemon.debug] pmfproc_null_1_svc called
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 471215 daemon.debug] STATUS
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 456214 daemon.debug] Entering security_svc_authenticate
Mar 13 10:33:49 kbcsms sec_type=0 and flavor is=1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 206602 daemon.debug] in AUTH_SYS
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 387944 daemon.debug] entering check_authsys_security.
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 324347 daemon.debug] 1
Mar 13 10:33:49 kbcsms Cluster.PMF.pmfd: [ID 392924 daemon.debug] Caller's uid from authsys_parms is 0.

[email protected] 2003-03-21
Work Around: none
Integrated in releases:
Duplicate of:
Patch ID:
See Also: 4948129
Summary:

Author: 淡然的紫色   Posted: 2010-09-25

Reply to calcm


    My gut feeling is still that the Fujitsu patches are the problem.
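
One way to test that theory, given that only one of the two nodes reboots: capture the patch list on each node and compare them. A sketch; the output path is arbitrary:

# showrev -p | sort > /tmp/patches.`uname -n`

Copy the two files to one node and diff them; a Fujitsu patch present on only one node, or installed at a different revision, would be a candidate.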

Author: 淡然的紫色   Posted: 2010-09-25
