Stopping Spark from reading and generating CRC and _SUCCESS files
While working with Apache Spark, especially when reading and writing DataFrames or Datasets to the filesystem or when performing copyFromLocal and copyToLocal, you will notice that Hadoop generates hidden .crc checksum files, and Spark adds a _SUCCESS marker file, alongside the actual output.
I searched a lot for a way to solve this issue and found a bug filed for it: https://issues.apache.org/jira/browse/HADOOP-7199. The bug is still unresolved, so I tried a few configs, and the ones below solved the problem for me.
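To make the problem concrete, here is a minimal sketch of what I mean (the output path is made up, and spark is a plain local SparkSession like the one built at the end of this post); on a local file:// destination Hadoop's ChecksumFileSystem kicks in and the hidden files show up next to the data:
// Writing even a tiny DataFrame/Dataset produces more than just the part files
spark.range(5).write.csv("/tmp/crc-demo-output")
// Listing /tmp/crc-demo-output then typically shows, besides the part-*.csv files:
//   _SUCCESS, ._SUCCESS.crc and .part-*.csv.crc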
The Hadoop filesystem checksums the data it stores and keeps replicas so that it can recover from any corruption that occurs during the write or read process. When a client detects an error while reading a block, it reports the bad block and the datanode it was reading from to the namenode before throwing a ChecksumException.
It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem, before using the open() method to read a file. The same effect is possible from the shell by using the -ignoreCrc option with the -get or the equivalent -copyToLocal command.
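In code, that read-side sequence looks roughly like this (the path is only an example, and the plain Configuration() assumes whatever core-site.xml is on the classpath):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
fs.setVerifyChecksum(false)  // must be called before open()
val in = fs.open(new Path("/tmp/crc-demo-output/part-00000"))  // example path
// ... consume the stream ...
in.close()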
Once you have the FileSystem object, you can call setVerifyChecksum(false) on it. This solved the issue for me on the write side.
// Reuse the Hadoop configuration that Spark is already carrying
val conf = spark.sparkContext.hadoopConfiguration
// Get the FileSystem for that configuration and turn off checksum verification
val fs = org.apache.hadoop.fs.FileSystem.get(conf)
fs.setVerifyChecksum(false)
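If the .crc files still show up on writes, FileSystem also has a write-side counterpart; I did not need it for my case, so treat this as an extra option rather than part of the fix above, but setWriteChecksum(false) is what tells a ChecksumFileSystem (such as the local filesystem) not to emit .crc files when writing:
// Extra option: stop ChecksumFileSystem-backed filesystems (e.g. file://) from writing .crc files
fs.setWriteChecksum(false)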
The other solution is to add the config parameter below while creating the SparkSession. This solves the issue on the read side.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("test-app")
  .master("local")
  .config("dfs.client.read.shortcircuit.skip.checksum", "true")
  .getOrCreate()
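The _SUCCESS marker from the title comes from Hadoop's FileOutputCommitter rather than from the checksum machinery. I did not need this for the original issue, but to my knowledge it can be turned off with the mapreduce.fileoutputcommitter.marksuccessfuljobs setting, for example:
// Disable the _SUCCESS marker written by FileOutputCommitter
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")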
Hope this helps.