[Bug] [Module Name] Bug title: YARN goes down #576
Comments
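You did not use DDP within the scope
JMX monitoring has not been added, so the status may not be monitored and an error is shown.
I copied jmx over from 3.3.3, but prometheus_config.yml is empty. HDFS is monitored normally; YARN is not.
The ranger-hdfs-plugin directory at the same level as jmx was also copied over. As far as I can tell, HDFS can upload files normally and the MapReduce example also runs.
Check whether JMX is configured in your yarn-env.sh, then check whether there is a nodemanager configuration under configs in your Prometheus. If both are present, check whether your YARN processes were newly started or are leftover processes from a previous installation.
Thanks.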
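The first two checks suggested in the reply above could be scripted roughly as follows. This is only a sketch: the function names `lines_mentioning` and `configs_mentioning` are hypothetical, and a plain substring match on "jmx" or "nodemanager" is a heuristic, not how DataSophon itself validates its configuration.

```python
from pathlib import Path


def lines_mentioning(path, keyword):
    """Return (line_number, text) for non-comment lines that contain keyword (case-insensitive)."""
    hits = []
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and shell comments so a commented-out JMX option does not count.
        if not stripped or stripped.startswith("#"):
            continue
        if keyword.lower() in stripped.lower():
            hits.append((number, stripped))
    return hits


def configs_mentioning(config_dir, keyword):
    """Return the names of files under config_dir whose content mentions keyword (case-insensitive)."""
    found = []
    for file in sorted(Path(config_dir).rglob("*")):
        if not file.is_file():
            continue
        content = file.read_text(encoding="utf-8", errors="replace")
        if keyword.lower() in content.lower():
            found.append(file.name)
    return found
```

For example, `lines_mentioning("/datasophon/hadoop-3.3.6/etc/hadoop/yarn-env.sh", "jmx")` returning an empty list would suggest that no JMX option is active. That path is an assumption, combining the install directory visible in the logs with the standard Hadoop `etc/hadoop` layout. The Prometheus `configs` directory has to be supplied by the reader, since its location is not given in this thread.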
Search before asking
What happened
The YARN cluster goes down a short while after it starts.
What you expected to happen
I am not sure whether something was missed when adapting the 3.3.6 package.
How to reproduce
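1.2.1 branch, with the Hadoop 3.3.6 package downloaded from the official website. In total I did the following:
2. HDFS installs and runs normally.
3. The YARN cluster goes down a short while after it starts.
The logs show no errors.
After each restart, the log shows that the previous shutdown was a kill -15.
For example:
2024-07-12 15:58:57,029 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added node ddp3:45454 cluster capacity: <memory:12144, vCores:6>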
2024-07-12 16:02:34,726 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: RECEIVED SIGNAL 15: SIGTERM
2024-07-12 16:02:34,733 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2024-07-12 16:02:34,737 INFO org.eclipse.jetty.server.handler.ContextHandler: Stopped o.e.j.w.WebAppContext@516592b1{cluster,/,null,STOPPED}{jar:file:/datasophon/hadoop-3.3.6/share/hadoop/yarn/hadoop-yarn-common-3.3.6.jar!/webapps/cluster}
2024-07-12 16:02:34,742 INFO org.eclipse.jetty.server.AbstractConnector: Stopped ServerConnector@464a4442{HTTP/1.1, (http/1.1)}{ddp4:8088}
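The `RECEIVED SIGNAL 15: SIGTERM` line indicates that the ResourceManager was asked to stop from outside the process (as with `kill -15`), not that it crashed on its own. To see when this happened across a longer log, the excerpt above could be scanned with something like the following. It is a minimal sketch that assumes every log line follows the `timestamp,millis LEVEL logger: message` layout shown here; the name `find_signals` is hypothetical.

```python
import re
from datetime import datetime

# Prefix as seen in the excerpt: "2024-07-12 16:02:34,726 ERROR <logger>: <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) (\w+) (\S+): (.*)$")
SIGNAL_RE = re.compile(r"RECEIVED SIGNAL (\d+): (\w+)")


def find_signals(lines):
    """Return (timestamp, signal_number, signal_name) for every 'RECEIVED SIGNAL' log line."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # continuation lines (e.g. stack traces) have no timestamp prefix
        stamp, millis, _level, _logger, message = match.groups()
        signal = SIGNAL_RE.search(message)
        if signal:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            when = when.replace(microsecond=int(millis) * 1000)
            events.append((when, int(signal.group(1)), signal.group(2)))
    return events
```

Passing the open ResourceManager log file to `find_signals` gives the time of each externally triggered shutdown. Those times can then be compared with whatever else was running on the host at that moment, for example a monitoring agent restarting the service.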
ps -ef shows that the NN and NM processes are still running, and the YARN service status can still be queried from the command line:
[hdfs@ddp4 datasophon]$ yarn node -list -all
2024-07-12 16:52:47,106 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
ddp4:45454 RUNNING ddp4:8042 0
ddp1:45454 RUNNING ddp1:8042 0
ddp3:45454 RUNNING ddp3:8042 0
[hdfs@ddp4 datasophon]$ yarn rmadmin -getAllServiceState
ddp1:8033 standby
ddp4:8033 active
[hdfs@ddp4 datasophon]$ ping ddp1
PING ddp1 (xxxx) 56(84) bytes of data.
64 bytes from ddp1 (xxxx): icmp_seq=1 ttl=64 time=16.6 ms
64 bytes from ddp1 (xxxx): icmp_seq=2 ttl=64 time=8.33 ms
^C
--- ddp1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 8.337/12.510/16.684/4.174 ms
[hdfs@ddp4 datasophon]$ ping ddp3
PING ddp3 (1xxxx) 56(84) bytes of data.
64 bytes from ddp3 (xxxx): icmp_seq=1 ttl=64 time=1.72 ms
64 bytes from ddp3 (xxxx): icmp_seq=2 ttl=64 time=0.540 ms
Every tab of the YARN management page on port 8088 shows an error.
Anything else
No response
Version
main
Are you willing to submit PR?
Code of Conduct