How to Locate and Repair Corrupted HDFS Blocks

Start with the help output of the hdfs fsck command:

[hadoop@hadoop ~]$ hdfs fsck

Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks | -replicaDetails | -upgradedomains]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
<path> start checking from this path
-move move corrupted files to /lost+found
-delete delete corrupted files
-files print out files being checked
-openforwrite print out files opened for write
-includeSnapshots include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks print out list of missing blocks and files they belong to
-files -blocks print out block report
-files -blocks -locations print out locations for every block
-files -blocks -racks print out network topology for data-node locations
-files -blocks -replicaDetails print out each replica details
-files -blocks -upgradedomains print out upgrade domains for every block
-storagepolicies print out storage policy summary for the blocks
-blockId print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Please Note:
1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status
2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.

Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
command [genericOptions] [commandOptions]

1. Symptom:

After a power outage, the HDFS service is unhealthy or reports corrupt blocks.

2. Check the health of the HDFS filesystem

hdfs fsck /
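
The full report is long, and usually only the verdict on its last line matters. A minimal sketch for pulling that verdict out, assuming the closing line has the form "The filesystem under path '...' is HEALTHY" or "... is CORRUPT"; the function name fsck_status is made up for illustration:

```shell
# Hypothetical helper: read fsck output on stdin and print HEALTHY or CORRUPT,
# taken from the closing "The filesystem under path ... is ..." line.
# Exits with status 2 if no such line is found.
fsck_status() {
    awk '/^The filesystem under path .* is (HEALTHY|CORRUPT)/ { status = $NF }
         END { if (status == "") exit 2; print status }'
}

# Usage (needs a running cluster):
#   hdfs fsck / | fsck_status
```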

3. List the corrupt blocks: hdfs fsck / -list-corruptfileblocks

Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1075229920 /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd
blk_1075229921 /hbase/data/JYDW/WMS_PO_ITEMS/c96cb6bfef12795181c966a8fc4ef91d/0/cf44ae0411824708bf6a894554e19780
The filesystem under path '/' has 2 CORRUPT files
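
The listing above is easy to post-process when there are many corrupt files. A minimal sketch, assuming every entry has the form "blk_<digits> <path>" as in this sample; the function name corrupt_blocks is made up for illustration:

```shell
# Hypothetical helper: read `hdfs fsck <path> -list-corruptfileblocks` output
# on stdin and print "<block id><TAB><file path>" for every corrupt block.
# Header and summary lines are skipped because they do not start with blk_.
corrupt_blocks() {
    awk '$1 ~ /^blk_[0-9]+$/ && NF >= 2 {
        path = $0
        sub(/^[ \t]*blk_[0-9]+[ \t]+/, "", path)   # keep the path even if it contains spaces
        printf "%s\t%s\n", $1, path
    }'
}

# Usage (needs a running cluster):
#   hdfs fsck / -list-corruptfileblocks | corrupt_blocks
```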

4. Analysis

MySQL -> big data platform
The data originates in MySQL, so it is enough to reload this table from MySQL into HDFS.

5. Which blocks of the file are stored on which machines? The goal is to delete the Linux block files under /dfs/dn/….. by hand.

hadoop36:hdfs:/var/lib/hadoop-hdfs:>

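-files      prints per-file information
-blocks     prints block information; only takes effect together with -files
-locations  prints the IP addresses of the DataNodes holding each block; only takes effect together with -blocks
-racks      prints rack locations; per the usage above, it is given together with -files -blocks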

For the corrupt file, no block locations are shown, so the block files cannot be deleted by hand:

hdfs fsck /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd -files  -locations -blocks  -racks
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&locations=1&blocks=1&files=1&path=%2Fhbase%2Fdata%2FJYDW%2FWMS_PO_ITEMS%2Fc71f5f49535e0728ca72fd1ad0166597%2F0%2Ff4d3d97bb3f64820b24cd9b4a1af5cdd
FSCK started by hdfs (auth:SIMPLE) from /192.168.1.100 for path /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd at Sat Jan 20 15:46:55 CST 2018
/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd 2934 bytes, 1 block(s):
/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd: CORRUPT blockpool BP-1437036909-192.168.1.100-1509097205664 block blk_1075229920
MISSING 1 blocks of total size 2934 B

1. BP-1437036909-192.168.1.100-1509097205664:blk_1075229920_1492007 len=2934 MISSING!

Status: CORRUPT
Total size: 2934 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 2934 B)

------

UNDER MIN REPL'D BLOCKS: 1 (100.0 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 1
MISSING BLOCKS: 1
MISSING SIZE: 2934 B
CORRUPT BLOCKS: 1

------

Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 1
Missing replicas: 0
Number of data-nodes: 12
Number of racks: 1
FSCK ended at Sat Jan 20 15:46:55 CST 2018 in 0 milliseconds

The filesystem under path '/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd' is CORRUPT
hadoop36:hdfs:/var/lib/hadoop-hdfs:>

A healthy file, by contrast, does show where its blocks are stored:

hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs fsck /hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 -files  -locations -blocks  -racks
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&files=1&locations=1&blocks=1&racks=1&path=%2Fhbase%2Fdata%2FJYDW%2FWMS_TO%2F011dea9ae46dae6c1f1f3a24a75af100%2F0%2F1d60f56773984e4cac614a8b5f7e93a6
FSCK started by hdfs (auth:SIMPLE) from /192.168.1.100 for path /hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 at Sat Jan 20 15:58:25 CST 2018
/hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 1697 bytes, 1 block(s): OK

1. BP-1437036909-192.168.1.100-1509097205664:blk_1075227504_1489591 len=1697 Live_repl=3 [/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010]

blk_1075227504_1489591 len=1697 Live_repl=3
[/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010]
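
To find the machines to log in to, the replica list can be pulled out of that block line. A minimal sketch, assuming the bracketed "/rack/ip:port" layout of the sample above (other Hadoop versions print more detail inside the brackets and would need a different parser); the function name block_locations is made up for illustration:

```shell
# Hypothetical helper: read `hdfs fsck <file> -files -blocks -locations -racks`
# output on stdin and print "<block> <datanode ip:port>" once per replica.
# A MISSING block has no bracketed list, so it produces no output.
block_locations() {
    awk '/blk_[0-9]+_[0-9]+ len=/ && index($0, "[") > 0 {
        blk = $0
        sub(/^.*:blk_/, "blk_", blk)      # drop the numbering and block pool prefix
        sub(/ .*$/, "", blk)              # keep only blk_<id>_<genstamp>
        start = index($0, "[")
        end = index($0, "]")
        locs = substr($0, start + 1, end - start - 1)
        n = split(locs, parts, /, */)
        for (i = 1; i <= n; i++) {
            sub(/^.*\//, "", parts[i])    # strip the rack prefix such as /default/
            print blk, parts[i]
        }
    }'
}

# Usage (needs a running cluster):
#   hdfs fsck /some/file -files -blocks -locations -racks | block_locations
```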

In the end we took the blunt route: delete the corrupt files, then reload the data from the business system.
hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs fsck / -delete

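7. Suppose the data exists only on HDFS [the file has no other source; some of its replicas are intact while others are corrupt]
7.1 hdfs dfs -ls /xxxx
    hdfs dfs -get /xxxx ./    download an intact copy to the local Linux filesystem
    hdfs dfs -rm /xxxx        delete the existing file, including its corrupt replicas
    hdfs dfs -put xxxx /      upload the intact copy; HDFS then restores the 3 replicas automatically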
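
The get / rm / put sequence above can be wrapped so that the delete never runs unless the download succeeded. This is only a sketch under stated assumptions: the function name reupload_file and the temporary directory are illustrative choices, not part of the original procedure.

```shell
# Hypothetical helper wrapping the three steps above.
# Usage: reupload_file /path/on/hdfs
reupload_file() {
    src=$1
    tmpdir=$(mktemp -d) || return 1
    local_copy="$tmpdir/$(basename "$src")"

    # 1. Download; this only succeeds if at least one replica is still readable.
    hdfs dfs -get "$src" "$local_copy" || { echo "get failed, nothing deleted" >&2; return 1; }
    # 2. Remove the HDFS file, and with it the corrupt replicas.
    hdfs dfs -rm "$src" || return 1
    # 3. Upload the good copy to the same path; HDFS replicates it again.
    hdfs dfs -put "$local_copy" "$src" || return 1

    # Clean up only on success, so a failed upload never loses the local copy.
    rm -rf "$tmpdir"
}
```

If step 2 or 3 fails, the downloaded copy is left in the temporary directory on purpose.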

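Note:

Losing a small amount of log data does not matter.
If the lost files are business data, such as order data, the loss must be reported.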

Manually repairing corrupt blocks [hdfs debug]

The hdfs help output does not list debug, but the hdfs debug subcommand does exist. Keep it in mind:
hdfs debug recoverLease -path <HDFS file path> -retries 10

Automatic repair

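After a block is corrupted, the DataNode does not notice the damage until it runs its next directory scan.
Directory scans run every 6 hours:
dfs.datanode.directoryscan.interval : 21600 (seconds)
The block is not recovered until the DataNode sends its next block report to the NameNode.
Block reports are also sent every 6 hours:
dfs.blockreport.intervalMsec : 21600000 (milliseconds)
Only when the NameNode receives the block report does it start the recovery.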
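
The two settings use different units, which is easy to misread. A quick arithmetic check, using only the values quoted above (the variable names are illustrative):

```shell
# Convert the two intervals quoted above into hours.
scan_interval_s=21600          # dfs.datanode.directoryscan.interval, in seconds
report_interval_ms=21600000    # dfs.blockreport.intervalMsec, in milliseconds

scan_hours=$(( scan_interval_s / 3600 ))
report_hours=$(( report_interval_ms / 1000 / 3600 ))

echo "directory scan every ${scan_hours}h, block report every ${report_hours}h"
# prints: directory scan every 6h, block report every 6h
```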

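Note: manual repair is also possible, but it requires first deleting the corrupt block by hand.
Remember: delete the corrupt block file and its meta file on the DataNode, not the HDFS file.
Alternatively, get the file first, delete it from HDFS, and then upload it again.
Do not use hdfs fsck / -delete for this: it deletes the corrupt files themselves, so the data is lost. Run it only if losing the data is acceptable, or if you are confident the data can be reloaded into HDFS from another source.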
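
To delete the right files on a DataNode, they first have to be found. A minimal sketch, assuming the usual on-disk naming where block data is stored as blk_<id> and its checksum file as blk_<id>_<generation stamp>.meta; the function name find_block_files is made up, and the data directory (for example the /dfs/dn mentioned earlier) must be passed in:

```shell
# Hypothetical helper: list the block file and its .meta file for one block ID
# under a DataNode data directory. Run it on the DataNode holding the replica.
# Usage: find_block_files <datanode data dir> <block id, e.g. blk_1075229920>
find_block_files() {
    data_dir=$1
    blk=$2
    # Exact name for the data file, blk_<id>_*.meta for the checksum file,
    # so that longer block IDs sharing the same prefix are not matched.
    find "$data_dir" -type f \( -name "$blk" -o -name "${blk}_*.meta" \) | sort
}
```

Review the printed paths before removing anything; the helper only lists files.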

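Title: How to Locate and Repair Corrupted HDFS Blocks

Author: skygzx

Published: 2019-04-10, 10:21

Last updated: 2019-04-10, 10:31

Original link: http://yoursite.com/2019/04/10/如何确定block损坏的位置和修复/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and author when reposting.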
