TORQUE + Maui problem: jobs stay queued and never run

2024-03-24 10:38


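After a power outage forced the server to shut down and reboot, the PBS system stopped working properly: submitted jobs showed state Q and simply never ran.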


I restarted the pbs_server, pbs_mom, and trqauthd processes several times to no avail. Then I happened to run the showq command, which reported:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ERROR:    cannot send request to server logon:42559 (server may not be running)
ERROR:    cannot request service (status)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I then found the answer here: http://www.supercluster.org/pipermail/torqueusers/2011-June/012986.html
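The key point of the thread is that showq is a Maui command, so the error above points at the maui scheduler rather than at the TORQUE daemons. A minimal sketch for checking which of the daemons involved are actually running after a reboot is shown here; the helper name `check_daemons` is made up for illustration, and it assumes `pgrep` is available and that the processes carry the names used in this post.

```shell
#!/bin/sh
# check_daemons NAME...: report whether a process with exactly that name exists.
# (Hypothetical helper, not part of TORQUE or Maui.)
check_daemons() {
    for d in "$@"; do
        if pgrep -x "$d" >/dev/null 2>&1; then
            echo "$d: running"
        else
            echo "$d: NOT running"
        fi
    done
}

# pbs_server, pbs_mom and trqauthd belong to TORQUE; maui is the scheduler.
check_daemons pbs_server pbs_mom trqauthd maui
```

If maui is reported as not running, start it the way your installation provides (an init script or the maui binary itself; the exact path depends on the install and is not given in the source), then run showq again.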

=================================================================================

[torqueusers] still problem with FQDN and torque


Mahmood Naderan nt_mahmood at yahoo.com 
Wed Jun 8 23:07:30 MDT 2011
>that is maui, not torque. does maui run?


You are right. I didn't start maui.
it is now ok. Thanks for your help :)


// Naderan *Mahmood;




----- Original Message -----
From: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
To: Mahmood Naderan <nt_mahmood at yahoo.com>
Cc: torque cluster <torqueusers at supercluster.org>
Sent: Thursday, June 9, 2011 12:48 AM
Subject: Re: [torqueusers] still problem with FQDN and torque


On Wed, Jun 8, 2011 at 1:18 PM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:
>>did you restart the servers?
>
> Even I reboot the server but I get
>
> root@srv:/var/spool/maui# /etc/init.d/pbs_mom restart
>  * Restarting TORQUE mom pbs_mom                                                       [ OK ]
>
> root@srv:/var/spool/maui# showq
> ERROR:    cannot send request to server srv:42559 (server may not be running)
> ERROR:    cannot request service (status)


that is maui, not torque. does maui run?




>>depends on your torque configuration.
>
> Exactly what?


for example: /var/spool/torque/server_name


axel.


>
>
> // Naderan *Mahmood;
>
>
> ----- Original Message -----
> From: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
> To: Mahmood Naderan <nt_mahmood at yahoo.com>
> Cc: torque cluster <torqueusers at supercluster.org>
> Sent: Wednesday, June 8, 2011 8:29 PM
> Subject: Re: [torqueusers] still problem with FQDN and torque
>
> On Wed, Jun 8, 2011 at 11:18 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:
>>
>> How about changing srv.domain.com
>>
>> to srv-ex.domain.com ? I think it is much easier. Do you agree with that?
>
> yes.
>
>> I change that however  the server_log still says
>>
>> 06/08/2011 19:43:52;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect from 192.168.1.1:1023 (address not trusted - check entry in server_priv/nodes)
>
> did you restart the servers?
>
>
>> It is very strange for me and I can not understand.
>>
>> root@srv:/var/spool/pbs/server_logs# ping srv
>> PING srv (192.168.1.1) 56(84) bytes of data.
>> 64 bytes from srv (192.168.1.1): icmp_seq=1 ttl=64 time=0.092 ms
>>
>>
>> As you can see the IP address of srv is 192.168.1.1
>> also the server_priv/nodes contains
>>
>> srv np=14
>>
>> however I am really confused why it says "192.168.1.1 address not trusted" !!!!
>
> depends on your torque configuration.
> again, did you restart after changing /etc/hosts and DNS entries?
>
> axel.
>
>> Please assume
>>  192.168.1.1     srv
>>  196.215.64.105 srv-ex.domain.com
>>
>> // Naderan *Mahmood;
>>
>>
>> ----- Original Message -----
>> From: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
>> To: Mahmood Naderan <nt_mahmood at yahoo.com>
>> Cc: torque cluster <torqueusers at supercluster.org>
>> Sent: Wednesday, June 8, 2011 7:25 PM
>> Subject: Re: [torqueusers] still problem with FQDN and torque
>>
>> On Wed, Jun 8, 2011 at 9:11 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:
>>> root@srv:~# cat /etc/hosts
>>> 127.0.0.1       localhost.localdomain localhost
>>> 192.168.1.1     srv-internal
>>> 196.215.64.105 srv.domain.com
>>>
>>>
>>> root@srv:~# cat /var/spool/pbs/server_priv/nodes
>>> srv-internal np=14
>>> ...
>>>
>>>
>>> root@srv:~# cat /var/spool/pbs/server_name
>>> srv-internal
>>>
>>> root@srv:~# /etc/init.d/pbs_mom restart
>>>  * Restarting TORQUE mom pbs_mom                                                       [ OK ]
>>>
>>> root@srv:~# showq
>>> ERROR:    cannot connect to 'srv' port 42559
>>
>>>
>>> Still it is looking for srv.....
>>
>> sure. did you change your configuration to
>> connect to srv-internal?
>> you have to do this and the corresponding
>> changes in the hosts files on _all_ compute
>> nodes, mom config files, pbs_server and so on.
>> you want the batch system communication only
>> on the internal network, so you have to use
>> the proper names to get it to use the right
>> ip numbers. elementary.
>>
>> axel.
>>
>>>
>>> // Naderan *Mahmood;
>>>
>>>
>>> ----- Original Message -----
>>> From: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
>>> To: Mahmood Naderan <nt_mahmood at yahoo.com>
>>> Cc: torque cluster <torqueusers at supercluster.org>
>>> Sent: Wednesday, June 8, 2011 5:30 PM
>>> Subject: Re: [torqueusers] still problem with FQDN and torque
>>>
>>> On Wed, Jun 8, 2011 at 8:50 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:
>>>>>have you tried using
>>>>>something like this instead?
>>>>>192.168.1.1 srv-internal
>>>>
>>>> You mean the hostname for the invalid ip should be different from valid ip?
>>>>
>>>>
>>>>  127.0.0.1           localhost.localdomain localhost
>>>>  192.168.1.1        srv-internal
>>>>  196.215.64.105 srv.domain.com
>>>>
>>>> you mean that?
>>>
>>> yes.
>>>
>>>>
>>>> // Naderan *Mahmood;
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Axel Kohlmeyer <akohlmey at cmm.chem.upenn.edu>
>>>> To: Mahmood Naderan <nt_mahmood at yahoo.com>; Torque Users Mailing List <torqueusers at supercluster.org>
>>>> Cc:
>>>> Sent: Wednesday, June 8, 2011 4:58 PM
>>>> Subject: Re: [torqueusers] still problem with FQDN and torque
>>>>
>>>> On Wed, Jun 8, 2011 at 6:09 AM, Mahmood Naderan <nt_mahmood at yahoo.com> wrote:
>>>>> Dear all
>>>>>
>>>>> I have asked about FQDN before while torque is running however somethings are still vague that causes the server node to be "down".
>>>>>
>>>>> The server has two NIC. One is connected to internet with valid IP and the other is connected to a local switch with invalid IP. The content of /etc/hosts looks like
>>>>>
>>>>> 127.0.0.1           localhost.localdomain localhost
>>>>> 192.168.1.1        srv
>>>>> 196.215.64.105  srv.domain.com
>>>>
>>>> i think it is a bad idea to have "srv" for the internal name
>>>> and "srv.<something>" for the external one. this will defeat
>>>> attempts to canonicalize hostnames. have you tried using
>>>> something like this instead?
>>>>
>>>> 192.168.1.1 srv-internal
>>>>
>>>> axel.
>>>>
>>>>>
>>>>> some information:
>>>>>
>>>>> mahmood@srv:~$ cat /var/spool/pbs/server_name
>>>>> srv
>>>>>
>>>>>
>>>>> root@srv:~# cat /var/spool/pbs/server_priv/nodes
>>>>> srv np=14
>>>>> ...
>>>>>
>>>>>
>>>>> mahmood@srv:~$ pbsnodes -a srv
>>>>> srv
>>>>>      state = down
>>>>>      np = 14
>>>>>      ntype = cluster
>>>>>      status = rectime=1307513072,varattr=,jobs=26463.srv,state=free,netload=359025735787,gres=,loadave=0.34,ncpus=16,physmem=33013828kb,availmem=11987350430564kb,totmem=12345261654628kb,idletime=8,nusers=1,nsessions=4,sessions=21685 21758 22420 27547,uname=Linux srv 2.6.32-24-server #39-Ubuntu SMP Wed Jul 28 06:21:40 UTC 2010 x86_64,opsys=linux
>>>>>      mom_service_port = 15002
>>>>>      mom_manager_port = 15003
>>>>>      gpus = 0
>>>>>
>>>>>
>>>>>
>>>>> My question is why the srv is "down". What is wrong?
>>>>>
>>>>> // Naderan *Mahmood;
>>>>> _______________________________________________
>>>>> torqueusers mailing list
>>>>> torqueusers at supercluster.org
>>>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Axel Kohlmeyer    akohlmey at gmail.com
>>>> http://sites.google.com/site/akohlmey/
>>>>
>>>> Institute for Computational Molecular Science
>>>> Temple University, Philadelphia PA, USA.
>>>>
>>>>






-- 
Dr. Axel Kohlmeyer    akohlmey at gmail.com
http://sites.google.com/site/akohlmey/


Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
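
The advice in the thread about using consistent host names can be checked with a rough sketch like the one below. It prints the names found in the server_name and server_priv/nodes files and the address each one resolves to, so it is easy to see whether they point at the internal network. The helper names `node_names` and `resolve_all` are hypothetical, the `TORQUE_HOME` default of /var/spool/pbs is taken from the thread (other installs use a different directory, for example /var/spool/torque), and reading server_priv usually requires root.

```shell
#!/bin/sh
# Assumption: TORQUE keeps its files under /var/spool/pbs, as in the thread.
TORQUE_HOME="${TORQUE_HOME:-/var/spool/pbs}"

# node_names FILE: print the first field of every non-empty, non-comment line.
node_names() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# resolve_all: read names on stdin and print "name -> address" for each one.
resolve_all() {
    while read -r name; do
        addr=$(getent hosts "$name" | awk '{ print $1; exit }')
        echo "$name -> ${addr:-UNRESOLVED}"
    done
}

for f in "$TORQUE_HOME/server_name" "$TORQUE_HOME/server_priv/nodes"; do
    if [ -r "$f" ]; then
        echo "== $f"
        node_names "$f" | resolve_all
    else
        echo "== $f (not readable, skipped)"
    fi
done
```

Running the same check on every compute node should give the same addresses; as the thread points out, the daemons need a restart after /etc/hosts or DNS entries change.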






http://www.chinasem.cn/article/841347
