Databricks Zip File Validation

In many IoT systems, feeds arrive as zip files, and sometimes those files are empty. We want to confirm that each zip contains a file before we start processing it, because processing involves moving the file between file systems, running a shell command, and more.

In this example, we have loaded the zip files into Azure Data Lake Gen2. Next, we read through the folder and determine whether any zip files are empty. Since PySpark doesn’t natively support zip files, we must validate another way (i.e. without using TRY). We also don’t want to unzip every file just to check for a matching CSV file. Instead, we process each zip file as a binary file.

We have two test files: MyBigData.zip, which contains a CSV file, and RunningOnEmpty.zip, which is empty. Let’s read them and see what the difference is.

PySpark

goodpath = "/mnt/Logfiles/Test/MyBigData.zip"
rdd = sc.binaryFiles(goodpath)
print(rdd.collect())

Output (first line)

[('dbfs:/mnt/Logfiles/Test/MyBigData.zip', b'PK\x03\x04\n\x00\x00\x00\x00\x00wp\xffN\x161~\x84\x1b\x00\x00\x00\x1b\x00\x00\x00\r\x00\x00\x00MyBigData.csvTitle\r\n"Running on Empt

PySpark

badpath = "/mnt/Logfiles/Test/RunningOnEmpty.zip"
rdd = sc.binaryFiles(badpath)
print(rdd.collect())

Output

[(‘dbfs:/mnt/Logfiles/Test/RunningOnEmpty.zip’, b’PK\x05\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')]

As you can see, MyBigData.zip contains the name of the CSV file as well as the beginning of the file's contents. RunningOnEmpty.zip contains no file names or data, only a short run of header bytes. We can now use this difference to test for an empty zip file.
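To see where those bytes come from, here is a minimal sketch that builds both kinds of archive in memory with Python's standard zipfile module. It is not part of the original notebook; the helper name make_zip and the sample contents are made up for illustration.

```python
import io
import zipfile

def make_zip(members):
    """Build a zip archive in memory from a {name: text} dict (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()

empty_zip = make_zip({})
full_zip = make_zip({"MyBigData.csv": "Title\r\n"})

# An archive with no members is only the end-of-central-directory record,
# which starts with the signature PK\x05\x06
print(len(empty_zip), empty_zip[:4])

# An archive with members starts with a local file header (PK\x03\x04),
# and member names are stored uncompressed
print(full_zip[:4], b"MyBigData.csv" in full_zip)
```

The empty archive here matches the RunningOnEmpty.zip output above: the four signature bytes followed by 18 zero bytes.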

We will search for 'csv' inside the zip file. Because binaryFiles returns the file contents as bytes, the search term must also be bytes, so we look for b'csv'. The final code is below.

path = "/mnt/Logfiles/Test"
filelist = dbutils.fs.ls(path)

for x in filelist:
    # Read each file as a (path, bytes) pair
    rdd = sc.binaryFiles(x.path)
    for s in rdd.collect():
        # bytes.find returns -1 when b'csv' is absent
        if s[1].find(b'csv') < 0:
            print("empty file", x.path)

Output

empty file dbfs:/mnt/Logfiles/Test/RunningOnEmpty.zip

This will print a list of all empty zip files. For production use, you can write out to an error log or invoke additional processing.
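The byte search is case-sensitive and could, in principle, match the letters 'csv' inside compressed data. A stricter alternative, sketched below, parses the bytes with Python's standard zipfile module and lists the archive members. This is an assumption-laden variation, not the method used above: the function names csv_members and find_zips_without_csv are hypothetical, and the Spark helper assumes every zip is small enough to be read whole on an executor.

```python
import io
import zipfile

def csv_members(data):
    """Return the names of .csv members in a zip passed as bytes.

    Returns an empty list if there are none or if the bytes are not a valid zip.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return [n for n in zf.namelist() if n.lower().endswith(".csv")]
    except zipfile.BadZipFile:
        return []

def find_zips_without_csv(sc, folder):
    """Hypothetical helper: paths of zip files under folder with no CSV member.

    sc is the SparkContext that the notebook provides.
    """
    return (
        sc.binaryFiles(folder + "/*.zip")
        .filter(lambda kv: not csv_members(kv[1]))  # kv is (path, bytes)
        .keys()
        .collect()
    )
```

In a notebook this might be called as find_zips_without_csv(sc, "/mnt/Logfiles/Test"). The check runs on the executors, so only the paths of the failing files are collected back to the driver rather than the full file contents.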