What is HA?
HA: High Availability.
Although HDFS keeps multiple replicas of data, the NameNode itself is a single point of failure. In a cluster with only one NameNode, the whole cluster is unusable from the moment that node fails until it is restarted.
Enabling HDFS HA places multiple NameNodes in Active/Standby roles on different nodes. When the Active NameNode fails, a Standby NameNode can be switched to Active quickly. Only the Active NameNode serves read and write requests.
Environment:
- CentOS 7.6.1810 Minimal
- NAT network mode (VM)
- JDK 1.8
- Hadoop 3.2.0
- Zookeeper 3.4.13
Cluster plan (3 nodes):
Host | NameNode | DataNode | ResourceManager | NodeManager | Zookeeper | JournalNode | ZKFC |
---|---|---|---|---|---|---|---|
master | √ | √ | | √ | √ | √ | √ |
master2 | √ | √ | | √ | √ | √ | √ |
slave1 | | √ | √ | √ | √ | √ | |
Incidental configuration
- After installing CentOS 7 the resolution felt too high for a small screen, so change it:
vi /boot/grub2/grub.cfg
(CentOS 7)
To get an 800x600x32 display mode, append vga=0x340 to the end of the linux16 /vmlinuz-x.xx.x line; it takes effect after a reboot. Setting vga=ask instead will prompt for a display mode at boot.
Do not modify the linux16 /vmlinuz-0-rescue line.
Available display modes: (screenshot omitted)
System basics
Since this is a Minimal install, some of the commands used below may need to be installed manually.
Disable the firewall
- Check the firewall status:
Firewalld: service firewalld status or firewall-cmd --state
Iptables: service iptables status
- Disable the firewall at boot
CentOS 6, Iptables: chkconfig iptables off
CentOS 7, Firewalld: systemctl disable firewalld
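For CentOS 7, the full sequence, assuming firewalld is the active firewall:

```bash
# Stop the running service, then keep it from starting at boot
systemctl stop firewalld
systemctl disable firewalld
# Verify: should print "not running"
firewall-cmd --state
```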
Network
Assign each host a static IP.
Host | IP |
---|---|
master | 192.168.222.128 |
master2 | 192.168.222.129 |
slave1 | 192.168.222.130 |
- Enable the NIC
The NIC does not come up automatically after installation.
- List devices: ip addr
- Bring it up: ifup ens33 (by default it obtains an IP via DHCP)
- Install net-tools
A collection of basic network utilities, including the common ifconfig and netstat commands, handy for inspecting and configuring the network: yum install -y net-tools
Set the NIC to start at boot and assign a static IP
Edit the config file directly: vi /etc/sysconfig/network-scripts/ifcfg-ens33

```
BOOTPROTO=static   # protocol to use: static
...
ONBOOT=yes         # bring the NIC up at boot
# IP, gateway and DNS
IPADDR=192.168.222.128
NETMASK=255.255.255.0
GATEWAY=192.168.222.2
DNS1=8.8.8.8
DNS2=4.4.4.4
```

BOOTPROTO must be changed to static, otherwise the manually configured static IP may not take effect.
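After saving the file, restart the network service so the changes take effect (assuming the stock CentOS 7 network scripts are in use):

```bash
# Apply the new NIC configuration
systemctl restart network
# Confirm ens33 now holds the static address
ip addr show ens33
```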
Hostname and domain mapping
- Set the hostname
Set the three machines to master, master2, and slave1 respectively.
E.g.: hostnamectl set-hostname master
(this command takes effect permanently, no reboot needed)
- Edit the hosts file
I hit a problem here, since solved: Zookeeper starts, but its status check says it is probably not running (see below).
vi /etc/hosts
```
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.222.128 master
192.168.222.129 master2
192.168.222.130 slave1
```
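A quick sanity check that the mappings work, assuming all three hosts are already up:

```bash
# Each hostname should resolve to its static IP and reply
ping -c 1 master2
ping -c 1 slave1
```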
SSH passwordless login
Run on each node:
- Generate a key pair
ssh-keygen -t rsa
- Copy the public key to each node
ssh-copy-id master
ssh-copy-id master2
ssh-copy-id slave1
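To confirm the setup, each node should now reach the others without a password prompt, for example:

```bash
# Should print the remote hostname without asking for a password
ssh master2 hostname
ssh slave1 hostname
```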
* Hadoop user
* Separating out a dedicated user for operating Hadoop can make management and maintenance easier, but may introduce a series of permission issues. Optional. (Run as root, on all 3 machines.)
- Create the hadoop group
groupadd hadoop
- Create the hadoop user and add it to the hadoop group
useradd -g hadoop hadoop
- Set the hadoop user's password
passwd hadoop
- Change the group of /opt to the hadoop group: chgrp hadoop /opt
* add the -R flag to also change its subdirectories and files
- Grant the group write permission on /opt: chmod g+w /opt
* add the -R flag to also change its subdirectories and files
Install the JDK
- Best to use JDK 1.8; later versions changed and need extra settings before HDFS will start successfully.
Copy the archive from the host machine to the VM
Upload via SecureCRT's SFTP: press Alt + P to open it, then drag the file in to start the transfer.
Extract the archive
Extract the JDK to /opt: tar -xzvf ~/jdk-8u201-linux-x64.tar.gz -C /opt.
Configure $PATH
Add to /etc/profile:

```bash
# Java
export JAVA_HOME=/opt/jdk1.8.0_201
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin
```

Log in again, or apply the change with source:
source /etc/profile
Verify that the JDK installed successfully
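For example:

```bash
# Both should report version 1.8.0_201
java -version
javac -version
```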
Distribute files
- /opt
master2: scp -r /opt master2:/
slave1: scp -r /opt slave1:/
- /etc/profile and /etc/hosts
scp /etc/profile master2:/etc
scp /etc/hosts master2:/etc
Send them to slave1 the same way.
The steps below are performed as the hadoop user
Open a new connection and log in as the hadoop user.
Set up SSH passwordless login for the hadoop user on all 3 nodes, as shown below.
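This is the same procedure as before, just repeated as the hadoop user (a sketch, assuming the default key path):

```bash
# Run on every node as the hadoop user
ssh-keygen -t rsa
ssh-copy-id master
ssh-copy-id master2
ssh-copy-id slave1
```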
Upload the Hadoop and Zookeeper archives
They can also be downloaded directly in the VM with wget:
- Hadoop 2.9.2 (Tsinghua mirror):
wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
- Zookeeper 3.4.14:
Upload the files to the VM
SFTP upload:
Edit the hadoop user's profile
/home/hadoop/.bashrc (or set it all in /etc/profile instead); add:

```bash
# -- HADOOP ENVIRONMENT VARIABLES START -- #
## Hadoop -v3.2.0
export HADOOP_HOME=/opt/hadoop-3.2.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# Zookeeper -v3.4.13
export ZK_HOME=/opt/zookeeper-3.4.13
export PATH=$PATH:$ZK_HOME/bin
# -- HADOOP ENVIRONMENT VARIABLES FINISH -- #
```

Log in again, or apply with source ~/.bashrc.
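A quick check that the variables took effect:

```bash
# Should print the two install paths
echo $HADOOP_HOME $ZK_HOME
# Should print the Hadoop version banner
hadoop version
```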
Configure Hadoop
Hadoop's 6 config files ($HADOOP_HOME/etc/hadoop):
Component | Config files |
---|---|
HDFS | hadoop-env.sh, core-site.xml, hdfs-site.xml, workers |
MapReduce | mapred-site.xml |
Yarn | yarn-site.xml |
Zookeeper's config file: $ZK_HOME/conf/zoo.cfg
Reference:
Switch to the config file directory: cd $HADOOP_HOME/etc/hadoop
HDFS
hadoop-env.sh
Uncomment (remove the leading #) JAVA_HOME and HADOOP_HOME and fill in the corresponding paths.
At minimum JAVA_HOME must be specified.
In the 2.9.2 hadoop-env.sh the default value is ${JAVA_HOME}; left unchanged, it worked fine once the other files were configured:

```bash
# The java implementation to use.
export JAVA_HOME=${JAVA_HOME}
```

Below is the hadoop-3.2.0 hadoop-env.sh, which adds a HADOOP_HOME entry; fill that in while you are at it:

```bash
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/opt/jdk1.8.0_201
# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
export HADOOP_HOME=/opt/hadoop-3.2.0
```

core-site.xml
If you copy this directly, check whether any of the directory values need changing for your setup.

```xml
<configuration>
    <!-- The HDFS nameservice; any name will do -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://ha-cluster</value>
    </property>
    <!-- Directory where Hadoop stores metadata files -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop-3.2.0/tmp</value>
    </property>
    <!-- Buffer size for stream files, in KB -->
    <property>
        <name>io.file.buffer.size</name>
        <value>4096</value>
    </property>
    <!-- Zookeeper quorum addresses, used for automatic failover -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>master:2181,master2:2181,slave1:2181</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/home/hadoop/HA/data/journalnode</value>
    </property>
    <!-- Raise the IPC retry count to 100 and the retry interval to 10000 ms, so a
         slow-starting journalnode does not make the namenode fail to start -->
    <property>
        <name>ipc.client.connect.max.retries</name>
        <value>100</value>
    </property>
    <property>
        <name>ipc.client.connect.retry.interval</name>
        <value>10000</value>
    </property>
</configuration>
```

hdfs-site.xml
```xml
<configuration>
    <!-- The HDFS nameservices ID; must match core-site.xml -->
    <property>
        <name>dfs.nameservices</name>
        <value>ha-cluster</value>
    </property>
    <!-- The NameNode IDs under ha-cluster (any names) -->
    <property>
        <name>dfs.ha.namenodes.ha-cluster</name>
        <value>nn1,nn2</value>
    </property>
    <!-- RPC addresses of the NameNodes -->
    <property>
        <name>dfs.namenode.rpc-address.ha-cluster.nn1</name>
        <value>master:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ha-cluster.nn2</name>
        <value>master2:8020</value>
    </property>
    <!-- HTTP addresses of the NameNodes -->
    <property>
        <name>dfs.namenode.http-address.ha-cluster.nn1</name>
        <value>master:9870</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ha-cluster.nn2</name>
        <value>master2:9870</value>
    </property>
    <!-- Where the NameNode metadata is stored on the JournalNodes -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://master:8485;master2:8485;slave1:8485/ha-cluster</value>
    </property>
    <!-- Java class HDFS clients use to contact the Active NameNode; also used by the failover implementation -->
    <property>
        <name>dfs.client.failover.proxy.provider.ha-cluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <!-- List of scripts or Java classes used to fence the Active NameNode during a failover; separate multiple methods with newlines -->
    <!-- sshfence SSHes to the Active NameNode and kills the process -->
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
    </property>
    <!-- Comma-separated list of SSH private key files; sshfence needs passwordless login to the other NameNode node -->
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hadoop/.ssh/id_rsa</value>
    </property>
    <!-- Optional: use a non-standard user or port for SSH -->
    <!--<property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence(hadoop:22)</value>
    </property>-->
    <!-- Optional: SSH connect timeout, in milliseconds -->
    <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
    </property>
    <!-- Path where the JournalNode daemon stores its local state -->
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/home/hadoop/HA/data/jn_local</value>
    </property>
    <!-- Enable automatic NameNode failover -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <!-- Set the replication factor to 2 -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
```

workers (formerly slaves)
The scripts in $HADOOP_HOME/sbin, as well as hdfs, use the hostnames listed in this file to start the corresponding daemons on those nodes.

```
master
master2
slave1
```
MapReduce
- mapred-site.xml

```xml
<configuration>
    <!-- Use yarn as the MapReduce framework -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- Memory for a map task; default 1G -->
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>230</value>
    </property>
    <!-- Memory for a reduce task; default 1G -->
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>460</value>
    </property>
    <!-- JVM heap for a map task's process; default -Xmx200M -->
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx184m</value>
    </property>
    <!-- JVM heap for a reduce task's process; default -Xmx200M -->
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx368m</value>
    </property>
    <!-- Memory for the MR AppMaster; default 1536M -->
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>460</value>
    </property>
    <!-- JVM heap for the MR AppMaster; default -Xmx1024m -->
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx368m</value>
    </property>
</configuration>
```
Yarn
- yarn-site.xml

```xml
<configuration>
    <!-- Address of the RM -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>slave1</value>
    </property>
    <!-- ZK quorum addresses -->
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>master:2181,master2:2181,slave1:2181</value>
    </property>
    <!-- Shuffle -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Minimum container allocation in the RM; default 1G -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>230</value>
    </property>
    <!-- Maximum container allocation in the RM; default 8G -->
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>700</value>
    </property>
    <!-- Physical memory available to the NodeManager; default 8G -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>700</value>
    </property>
    <!-- Whether the virtual-memory check is enabled -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
```
Zookeeper
- Copy the sample config and rename it zoo.cfg
cd $ZK_HOME/conf; cp zoo_sample.cfg zoo.cfg
- Edit the config: vi zoo.cfg
Change dataDir and dataLogDir, and add the server entries:

```
dataDir=/home/hadoop/HA/data/zookeeper
dataLogDir=/home/hadoop/HA/logs/zookeeper
...
server.1=master:2888:3888
server.2=master2:2888:3888
server.3=slave1:2888:3888
```

- Create the dataDir from zoo.cfg, and in it create a myid file whose content on the three nodes is 1, 2, and 3 respectively, matching the server.X entries in zoo.cfg.
For example, on master:
mkdir -p /home/hadoop/HA/data/zookeeper
echo 1 > /home/hadoop/HA/data/zookeeper/myid
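The equivalent commands on the other two nodes; only the myid value differs:

```bash
# On master2
mkdir -p /home/hadoop/HA/data/zookeeper
echo 2 > /home/hadoop/HA/data/zookeeper/myid
# On slave1
mkdir -p /home/hadoop/HA/data/zookeeper
echo 3 > /home/hadoop/HA/data/zookeeper/myid
```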
Send the files
All configuration is done; send the needed files and directories to the other nodes:
- Zookeeper (under /opt)
- Hadoop (under /opt)
- /home/hadoop/HA
- /home/hadoop/.bashrc (the hadoop user's profile)
For example:
scp -r /opt/zookeeper-3.4.13 master2:/opt
scp /etc/profile slave1:/etc
Cluster startup
The steps must be followed strictly in order.
In Hadoop 3, commands can be used instead of running the script files directly:
hdfs --workers --daemon => hadoop-daemons.sh
hdfs --daemon => hadoop-daemon.sh
- …
- Start JournalNode on all nodes
hadoop-daemons.sh start journalnode, note the s script: -daemons.sh.
Hadoop 3: hdfs --workers --daemon start journalnode
* To start a single node: use the non-s script, or: hdfs --daemon start journalnode
Run jps to check whether the node's JVM processes include JournalNode (if not, it probably failed to start; check what error the logs under $HADOOP_HOME/logs report).
- Format the Active NameNode (master)
hdfs namenode -format
- Start the NameNode daemon (on the Active NameNode node, master)
hadoop-daemon.sh start namenode, note the non-s script: -daemon.sh.
Hadoop 3: hdfs --daemon start namenode
- On the Standby NameNode node (master2), copy the Active NameNode's (master's) metadata
hdfs namenode -bootstrapStandby
If it succeeds you should see output confirming the copy (screenshot omitted).
- Start the NameNode daemon on the Standby NameNode node
hadoop-daemon.sh start namenode
- Start the Zookeeper service on every node
zkServer.sh start (run once on each node)
Normally the Zookeeper cluster consists of one Leader and multiple Followers; jps will show a QuorumPeerMain process.
- Start the DataNode daemons
hadoop-daemons.sh start datanode
Hadoop 3: hdfs --workers --daemon start datanode
- Format the Zookeeper Failover Controller on either NameNode
hdfs zkfc -formatZK
- Start DFS on the Active NameNode
start-dfs.sh
- Check the NameNode states
hdfs haadmin -getServiceState nn1
- Check each NameNode in a browser
<ip>:<port>; the configured HTTP port is 9870, so visit: 192.168.222.128:9870
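With everything up, a quick failover sanity check (a sketch, assuming nn1 on master is currently Active and the ZKFCs are running):

```bash
# Confirm the starting states
hdfs haadmin -getServiceState nn1   # expect: active
hdfs haadmin -getServiceState nn2   # expect: standby
# On master, stop the Active NameNode
hdfs --daemon stop namenode
# The ZKFC should promote nn2 within seconds
hdfs haadmin -getServiceState nn2   # expect: active
# Bring nn1 back; it rejoins as Standby
hdfs --daemon start namenode
```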
Problems encountered
Zookeeper is started, but its status check says it is probably not running

```
hadoop@master ~> zkServer.sh status
```

Check the log, zookeeper.out (this file is created in whatever directory the ZK script was run from):

```
hadoop@master ~> tail -20 zookeeper.out
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:610)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:838)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:957)
2019-04-01 19:23:36,364 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@184] - Resolved hostname: master2 to address: master2/192.168.222.129
2019-04-01 19:23:36,365 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@584] - Cannot open channel to 3 at election address slave1/192.168.222.130:3888
java.net.ConnectException: 拒绝连接 (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:610)
at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:838)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:957)
2019-04-01 19:23:36,366 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@184] - Resolved hostname: slave1 to address: slave1/192.168.222.130
2019-04-01 19:23:36,366 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@847] - Notification time out: 60000
```

Problem: connection refused.
The firewalls were checked and are all off.
Searched Baidu for: Cannot open channel to 3 at election address slave1/192.168.222.130:3888 java.net.ConnectException: Connection refused. No luck.
Searched Baidu for: Zookeeper connection refused, and found posts saying it is a problem with the /etc/hosts file: the 127.0.0.1 line needs commenting out. The initial hosts file content: (screenshot omitted)
Instead of commenting out the whole line as the posts suggested, I only removed the host's own name from the end of that line in each hosts file. After restarting Zookeeper everything was normal:
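To illustrate the fix (a hypothetical before/after for master, assuming the installer had appended the hostname to the loopback line):

```
# Before: the trailing "master" makes the hostname resolve to 127.0.0.1
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 master
# After: only the trailing hostname removed, the rest of the line kept
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
```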
Startup error: org.apache.hadoop.ipc.Client: Retrying connect to server
Problem description:
- With HA configured, the NameNode would not start properly after startup, and the NameNode process disappeared within a short time; it had presumably crashed.
Reference:
Three ways to solve it:
Raise the IPC parameters in core-site.xml
Increase how long the namenode will keep trying to connect to the journalnodes:

```xml
<!-- Raise the IPC retry count to 100 and the retry interval to 10000 ms, so a
     slow-starting journalnode does not make the namenode fail to start -->
<property>
    <name>ipc.client.connect.max.retries</name>
    <value>100</value>
</property>
<property>
    <name>ipc.client.connect.retry.interval</name>
    <value>10000</value>
</property>
```

Start the JournalNodes first, then DFS
hadoop-daemons.sh start journalnode
start-dfs.sh
- Start the cluster directly, and start the NameNode again by hand after it crashes
start-dfs.sh
After the NameNode has crashed: hadoop-daemon.sh start namenode
后,集群两个 NameNode 都是 Standby 状态
参考:CSDN
解决:需要先启动 Zookeeper 集群,再启动 DFS。
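For reference, a startup order that avoids both this and the previous problem, distilled from the steps above:

```bash
# 1. Zookeeper, on every node
zkServer.sh start
# 2. JournalNodes, on every node (Hadoop 3 form)
hdfs --workers --daemon start journalnode
# 3. NameNodes, DataNodes and ZKFCs
start-dfs.sh
# 4. Verify the NameNode states
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```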
logs:

```
2019-04-11 15:09:22,402 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6002 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:23,408 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7008 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:24,415 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8015 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:25,416 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9016 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:26,420 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10021 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:26,806 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.222.128:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-04-11 15:09:26,826 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave1/192.168.222.130:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-04-11 15:09:26,841 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master2/192.168.222.129:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-04-11 15:09:27,429 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11029 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:28,430 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12030 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:29,433 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13033 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:30,443 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14043 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:31,449 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15050 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:32,460 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16060 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:33,463 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17063 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:34,469 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18069 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:35,471 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19071 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2019-04-11 15:09:36,403 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.222.128:8485, 192.168.222.129:8485, 192.168.222.130:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:473)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1590)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1614)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:700)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:322)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1052)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:666)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:728)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:932)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1673)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1741)
2019-04-11 15:09:36,406 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: No edit log streams selected.
2019-04-11 15:09:36,406 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Planning to load image: FSImageFile(file=/opt/hadoop-2.9.2/tmp/dfs/name/current/fsimage_0000000000000000179, cpktTxId=0000000000000000179)
2019-04-11 15:09:36,537 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode: Loading 2 INodes.
2019-04-11 15:09:36,657 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Loaded FSImage in 0 seconds.
2019-04-11 15:09:36,657 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded image for txid 179 from /opt/hadoop-2.9.2/tmp/dfs/name/current/fsimage_0000000000000000179
2019-04-11 15:09:36,678 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2019-04-11 15:09:36,678 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 21156 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1021)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:261)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1569)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1081)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:666)
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:728)
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:932)
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1673)
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1741)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 21156
2019-04-11 15:09:36,678 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 0 entries 0 lookups
2019-04-11 15:09:36,679 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 21157 msecs
2019-04-11 15:09:36,815 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.222.128:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-04-11 15:09:36,861 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: slave1/192.168.222.130:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
2019-04-11 15:09:36,862 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master2/192.168.222.129:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=100, sleepTime=10000 MILLISECONDS)
```

Notes
- 2019-4-6
The sequence of operations:
Boot;
Start all Zookeepers: normal;
Start HDFS: start-dfs.sh, which also looked normal;
NameNode states:
nn1 (master): Standby
nn2 (master2): Active
There is one warning in nn2's namenode log; all other logs are normal.
hadoop-hadoop-namenode-master2.log:

```
...
2019-04-06 17:12:37,856 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:469)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:484)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
...
```

For reference
- nn1

```
2019-04-02 08:12:53,523 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.222.128:8485, 192.168.222.129:8485, 192.168.222.130:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.222.129:8485: Call From master/192.168.222.128 to master2:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.222.130:8485: Call From master/192.168.222.128 to slave1:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.222.128:8485: Call From master/192.168.222.128 to master:8485 failed on connection exception: java.net.ConnectException: 拒绝连接; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:286)
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:485)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1673)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1706)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1685)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:703)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:325)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1099)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:716)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:635)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:697)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:940)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:913)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1646)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1713)
```

- nn2

```
2019-04-02 03:08:07,600 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Exception from remote name node RemoteNameNodeInfo [nnId=nn1, ipcAddress=master/192.168.222.128:8020, httpAddress=http://master:9870], try next.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby. Visit https://s.apache.org/sbnn-error
at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:88)
at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1954)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1442)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4716)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1293)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:148)
at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:14726)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy16.rollEditLog(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:152)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:365)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:362)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:504)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```

When formatting ZK
```
===============================================
```