Forum Discussion

Sun-Ray
Level 3
11 years ago

VCS Cluster not starting.

Hi

I am facing a problem while trying to start VCS.

From the log:

==============================================================

tail /var/VRTSvcs/log/engine_A.log

    2014/01/13 21:39:14 VCS NOTICE V-16-1-11050 VCS engine version=5.1
    2014/01/13 21:39:14 VCS NOTICE V-16-1-11051 VCS engine join version=5.1.00.0
    2014/01/13 21:39:14 VCS NOTICE V-16-1-11052 VCS engine pstamp=Veritas-5.1-10/06/09-14:37:00
    2014/01/13 21:39:14 VCS INFO V-16-1-10196 Cluster logger started
    2014/01/13 21:39:14 VCS NOTICE V-16-1-10114 Opening GAB library
    2014/01/13 21:39:14 VCS NOTICE V-16-1-10619 'HAD' starting on: nsscls01
    2014/01/13 21:39:16 VCS INFO V-16-1-10125 GAB timeout set to 30000 ms
    2014/01/13 21:39:16 VCS NOTICE V-16-1-11057 GAB registration monitoring timeout set to 200000 ms
    2014/01/13 21:39:16 VCS NOTICE V-16-1-11059 GAB registration monitoring action set to log system message
    2014/01/13 21:39:31 VCS CRITICAL V-16-1-11306 Did not receive cluster membership, manual intervention may be needed for seeding

=============================================================================================

root@nsscls01# hastatus -sum
VCS ERROR V-16-1-10600 Cannot connect to VCS engine
VCS WARNING V-16-1-11046 Local system not available

 

Please advise how I can start VCS.

  • Hello,

    Clearly the issue is at the LLT layer. As you can see from the lltstat output you posted, node 1 reports node 2's LLT links as down, and node 2 reports node 1's LLT links as down. There are a couple of possibilities:

    1. The physical connection itself has issues. You can use tools like dlpiping or lltping to check the status of the LLT links. These tools are helpful because LLT works at the MAC layer. Alternatively, you can plumb test IPs on both sides and ping between them: for example, plumb 1.1.1.1 on nxge1 on node 1 and 1.1.1.2 on nxge1 on node 2, then ping to confirm connectivity.

    Link for dlpiping

    http://sfdoccentral.symantec.com/sf/5.0MP3/aix/manpages/vcs/man1/dlpiping.html

     

    2. If connectivity is fine, try to start all the components manually on node 1:

    # /etc/init.d/llt start

    # /etc/init.d/gab start

    I have observed that LLT status is sometimes not shown correctly unless GAB has started correctly. Once these are started, check the "gabconfig -a" output again. If GAB starts and shows membership with the other node, you will need to start I/O fencing:

    # /etc/init.d/vxfen start

    After this you should be able to run "hastart" to start VCS.
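
    The "check gabconfig -a before going further" step can be gated on the reported membership. This is a minimal sketch, not an official tool: gab_port_members is a hypothetical helper name, it assumes node IDs below 10 (so each character of the membership string stands for one node), and it only parses text fed to it on stdin.

```shell
# Print the node IDs that are members of a given GAB port.
# Reads `gabconfig -a` output on stdin; prints nothing if the port is absent.
# Assumption: node IDs are below 10, so ";" marks a missing node and
# each digit is one member (for example ";1" means only node 1).
gab_port_members() {
    awk -v port="$1" '
        $1 == "Port" && $2 == port && $5 == "membership" {
            n = length($6)
            for (i = 1; i <= n; i++) {
                c = substr($6, i, 1)
                if (c ~ /[0-9]/) printf "%s%s", (out++ ? " " : ""), c
            }
            print ""
        }'
}
```

    On a live node this would be used as "gabconfig -a | gab_port_members a". Port a is GAB itself, port b is I/O fencing and port h is HAD, so seeing both node IDs on port a is the point at which starting vxfen and then running hastart makes sense.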

     

    G

    • adeyint
      Level 1

      Hello Gaurav,

      Thanks, your solution saved the day in my environment, where I had a similar issue.

      Regards,

       

      Adeyinka Taiwo.

  • Hello,

    Can you post more details? How many nodes does this cluster have? Is VCS already running on the other nodes, or is this the first node on which you are trying to start VCS? What is the operating system version?

    Also paste the output of the following from all the nodes:

    # gabconfig -a

    # cat /etc/gabtab

     

    At first glance, it appears that GAB on this node is not able to communicate with GAB on the other nodes, and until GAB has started successfully you can't start HAD (VCS).

    LLT (heartbeat) sits below GAB, so you might want to check LLT as well to see whether all the heartbeat links are working and connected. The output below gives a view of LLT status:

    # lltstat -vvn
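
    When there are many links, the DOWN ones can be picked out of that output mechanically. This is a rough sketch: llt_down_links is a hypothetical helper name, and it assumes the layout seen in this thread (a node line with ID, name and state, followed by one line per link).

```shell
# Print "node link" for every LLT link reported as DOWN.
# Reads `lltstat -nvv` output on stdin.
llt_down_links() {
    awk '
        { sub(/^[ \t]*\*/, "") }                      # drop the "*" that marks the local node
        $1 ~ /^[0-9]+$/ && NF >= 3 { node = $2; next } # node line: ID, name, state
        $2 == "DOWN" && node != "" { print node, $1 }  # link line: name, status
    '
}
```

    On a live node this would be used as "lltstat -nvv | llt_down_links". Empty output means no link is reported DOWN.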

     

    G

  •  2014/01/13 21:39:31 VCS CRITICAL V-16-1-11306  Did not receive cluster membership, manual intervention may be needed for seeding

    As per Gaurav's excellent post, you first need to check whether LLT and GAB are running.

    Cluster comms must first of all be established before VCS (had) will start.

    Is this a new cluster? 
    Or a cluster that has worked fine previously?

  • As Gaurav and Marianne have said, there is an issue with LLT or GAB.

    By default, VCS will not start until all nodes are up and can communicate with each other via the heartbeats. Most likely another node is not up or there is a problem with LLT communication, so run "lltstat -nvv" to check that no node has links in the DOWN state. It could also be that LLT is OK and GAB is not running on one of the nodes, but this is less likely, as GAB is started at boot by the O/S running the "/etc/gabtab" file (unless you have manually disabled GAB startup).
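
    The "all nodes must be up" condition comes from the -n value in /etc/gabtab: GAB seeds only once that many nodes are communicating. A minimal sketch of comparing that value with what LLT sees, using hypothetical helper names gabtab_seed_count and llt_open_nodes, and assuming the gabtab and lltstat formats posted in this thread:

```shell
# Print the seed count N from a gabtab line such as "/sbin/gabconfig -c -n2".
# Reads the contents of /etc/gabtab on stdin.
gabtab_seed_count() {
    sed -n 's/.*gabconfig.*-n *\([0-9][0-9]*\).*/\1/p'
}

# Print how many nodes LLT reports in the OPEN state.
# Reads `lltstat -nvv` output on stdin.
llt_open_nodes() {
    awk '
        { sub(/^[ \t]*\*/, "") }                 # drop the "*" that marks the local node
        $1 ~ /^[0-9]+$/ && $3 == "OPEN" { n++ }  # node line: ID, name, state
        END { print n + 0 }
    '
}
```

    With the output posted in this thread, the seed count is 2 while nsscls01 sees only 1 node in the OPEN state, which is consistent with the "Did not receive cluster membership" message in the engine log.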

    Mike 

  • root@nsscls02#  hastatus -sum

    -- SYSTEM STATE
    -- System               State                Frozen

    A  nsscls01              UNKNOWN              0
    A  nsscls02              RUNNING              0


    ==============================================================
    root@nsscls01# lltstat -nvv
    LLT node information:
        Node                 State    Link  Status  Address
       * 0 nsscls01          OPEN
                                      nxge1   UP      08:00:28:16:39:46
                                      nxge2   UP      08:00:28:23:AA:8A
         1 nsscls02          CONNWAIT
                                      nxge1   DOWN
                                      nxge2   DOWN

    root@nsscls02# lltstat -nvv
    LLT node information:
        Node                 State    Link  Status  Address
         0 nsscls01          CONNWAIT
                                      nxge1   DOWN
                                      nxge2   DOWN
       * 1 nsscls02          OPEN
                                      nxge1   UP      08:00:23:0E:29:74
                                      nxge2   UP      08:00:23:15:CD:85

    ===============================================================
    root@nsscls01# gabconfig -a
    GAB Port Memberships
    ===============================================================


    root@nsscls02# gabconfig -a

    GAB Port Memberships
    ================================================================
    Port a gen  1c77f01 membership ;1
    Port b gen  1c77f04 membership ;1
    Port h gen  1c77f03 membership ;1


    In /etc/gabtab:
    /sbin/gabconfig -c -n2

     

    Please advise.


  • Hi

    I am not sure whether anybody can see my posts. I have posted the output twice, but it is not visible to me, so I am posting it once more. If you can see it, please respond with a resolution.

     


    root@nsscls02#  hastatus -sum

    -- SYSTEM STATE
    -- System               State                Frozen

    A  nsscls01              UNKNOWN              0
    A  nsscls02              RUNNING              0


    ==============================================================
    root@nsscls01# lltstat -nvv
    LLT node information:
        Node                 State    Link  Status  Address
       * 0 nsscls01          OPEN
                                      nxge1   UP      08:00:28:16:39:46
                                      nxge2   UP      08:00:28:23:AA:8A
         1 nsscls02          CONNWAIT
                                      nxge1   DOWN
                                      nxge2   DOWN

    root@nsscls02# lltstat -nvv
    LLT node information:
        Node                 State    Link  Status  Address
         0 nsscls01          CONNWAIT
                                      nxge1   DOWN
                                      nxge2   DOWN
       * 1 nsscls02          OPEN
                                      nxge1   UP      08:00:23:0E:29:74
                                      nxge2   UP      08:00:23:15:CD:85

    ===============================================================
    root@nsscls01# gabconfig -a
    GAB Port Memberships
    ===============================================================


    root@nsscls02# gabconfig -a

    GAB Port Memberships
    ================================================================
    Port a gen  1c77f01 membership ;1
    Port b gen  1c77f04 membership ;1
    Port h gen  1c77f03 membership ;1


    In /etc/gabtab:
    /sbin/gabconfig -c -n2

  • I am able to see your posts in a "red dotted line square". Not sure why they appear like this; I will check with the admins.

    G

  • To test heartbeat network you can temporarily plumb some IPs on the interfaces:

    Choose an interface, say nxge1, plumb 1.1.1.1 netmask 255.255.255.0 on one node and 1.1.1.2 netmask 255.255.255.0 on the other, and then check that each node can ping the other's address.
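
    As a sketch of what that looks like in practice, assuming Solaris-style ifconfig syntax (a guess based on the nxge interface names; adjust for your OS). The hypothetical helper heartbeat_test_cmds only prints the commands for one node so they can be reviewed before anything is run by hand:

```shell
# Print the commands for a temporary heartbeat-link ping test on one node.
# Arguments: interface, this node's test IP, the peer node's test IP.
# Assumption: Solaris-style ifconfig; nothing is executed here.
heartbeat_test_cmds() {
    nic=$1; my_ip=$2; peer_ip=$3
    printf 'ifconfig %s plumb\n' "$nic"
    printf 'ifconfig %s %s netmask 255.255.255.0 up\n' "$nic" "$my_ip"
    printf 'ping %s\n' "$peer_ip"
    printf 'ifconfig %s unplumb\n' "$nic"
}
```

    For example, "heartbeat_test_cmds nxge1 1.1.1.1 1.1.1.2" lists the steps for the first node and "heartbeat_test_cmds nxge1 1.1.1.2 1.1.1.1" for the second. The first two commands need to be done on both nodes before pinging, and the final unplumb removes the temporary address once the test is over.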

    Mike

  • Hi Gaurav,

     

    I checked the LLT connectivity and it was okay. I stopped and started LLT on the problem node, then manually started the remaining processes and started the cluster. Now everything looks green.

     

    root@nsscls02#  hastatus -sum

    -- SYSTEM STATE
    -- System               State                Frozen

    A  nsscls01              RUNNING              0
    A  nsscls02              RUNNING              0

    ==============================================================

    Thanks to all for your support. :)

     

    Cheers ....