Databricks Zip File Validation

In many IoT systems, you will receive feeds in zip format, and sometimes they are empty. We want to validate that a zip contains a file before we start processing it, because processing involves moving the file across file systems, running shell commands, and more.

In this example, we have loaded the zip files into Azure Data Lake Storage Gen2. Next, we read through the folder and determine whether any of the zip files are empty. Since PySpark doesn't natively support zip files, we must validate another way, without simply wrapping an unzip attempt in a try/except. We also don't want to unzip all the files and check for matching CSV files. The approach we will use instead is to process each zip file as a binary file.
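To see why binary inspection works, here is a minimal local sketch (no Spark or Databricks required; the entry name and contents are illustrative) that builds a zip containing one CSV entry and an empty zip using Python's standard zipfile module, then compares their leading bytes:

```python
import io
import zipfile

# In-memory zip containing one CSV entry (name and contents are illustrative).
good_buf = io.BytesIO()
with zipfile.ZipFile(good_buf, "w") as zf:
    zf.writestr("MyBigData.csv", "Title\r\n")
good_bytes = good_buf.getvalue()

# In-memory zip with no entries at all.
empty_buf = io.BytesIO()
with zipfile.ZipFile(empty_buf, "w"):
    pass
empty_bytes = empty_buf.getvalue()

# A zip with entries begins with the local file header signature PK\x03\x04.
# An empty zip is only the end-of-central-directory record, which begins
# with PK\x05\x06 -- the same signatures you will see in the Spark output below.
print(good_bytes[:4])   # b'PK\x03\x04'
print(empty_bytes[:4])  # b'PK\x05\x06'
```

The entry name is stored uncompressed in the local file header, which is why the CSV file name is visible in the raw bytes.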

We have two test files: one zip that contains a CSV file, and one that is empty. Let's read them and see what the difference is.


rdd = sc.binaryFiles(goodpath)

Output (first line)

[('dbfs:/mnt/Logfiles/Test/', b'PK\x03\x04\n\x00\x00\x00\x00\x00wp\xffN\x161~\x84\x1b\x00\x00\x00\x1b\x00\x00\x00\r\x00\x00\x00MyBigData.csvTitle\r\n"Running on Empt


rdd = sc.binaryFiles(badpath)

Output (first line)

[('dbfs:/mnt/Logfiles/Test/', b'PK\x05\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')]

As you can see, the good zip contains the name of the CSV file inside it as well as the beginning contents of that file. The empty zip does not contain any zipped files or data. We can now use this difference to test for an empty zip file.

We will search for 'csv' inside the zip file's binary content. Because binaryFiles returns the content as bytes, we search with the byte string b'csv' rather than a regular string. The final code is below.
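A quick local check of this test logic, using an in-memory zip built with the standard zipfile module (the entry name is illustrative):

```python
import io
import zipfile

def is_empty_zip(data: bytes) -> bool:
    # bytes.find returns -1 when the pattern is absent, so a negative
    # result means no 'csv' byte sequence appears anywhere in the archive.
    return data.find(b"csv") < 0

# Zip containing a CSV entry (illustrative name and contents).
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("MyBigData.csv", "Title\r\n")
print(is_empty_zip(good.getvalue()))   # False

# Zip with no entries.
bad = io.BytesIO()
zipfile.ZipFile(bad, "w").close()
print(is_empty_zip(bad.getvalue()))    # True
```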


# filelist is the folder listing, e.g. filelist = dbutils.fs.ls(folderpath)
for x in filelist:
    rdd = sc.binaryFiles(x.path)
    for s in rdd.collect():
        if s[1].find(b'csv') < 0:
            print("empty file", x.path)

Output
empty file dbfs:/mnt/Logfiles/Test/

This will print a list of all empty zip files. For production use, you can write out to an error log or invoke additional processing.
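For production, one option is to factor the check into a function that returns the empty paths, so it can be unit-tested without a cluster. This is a sketch under the assumption that you feed it (path, content_bytes) pairs, which is the shape sc.binaryFiles(...).collect() returns; the paths below are illustrative:

```python
import io
import zipfile

def find_empty_zips(files):
    """Return the paths whose zip content contains no 'csv' entry.

    `files` is an iterable of (path, content_bytes) pairs, matching the
    records produced by sc.binaryFiles(...).collect().
    """
    empty = []
    for path, data in files:
        if data.find(b"csv") < 0:
            empty.append(path)
    return empty

# Local demonstration with in-memory zips (paths and names are illustrative).
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("MyBigData.csv", "Title\r\n")
bad = io.BytesIO()
zipfile.ZipFile(bad, "w").close()

files = [("dbfs:/mnt/Logfiles/Test/good.zip", good.getvalue()),
         ("dbfs:/mnt/Logfiles/Test/bad.zip", bad.getvalue())]
print(find_empty_zips(files))  # ['dbfs:/mnt/Logfiles/Test/bad.zip']
```

The returned list can then be written to an error log or handed to a retry/quarantine step instead of being printed.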