Use HDFS API to read Azure Blob files in Databricks

Databricks provides a wrapper file system API named DBFS (Databricks File System) to perform any file-level operation such as read, write, move, delete, rename, etc. However, sometimes we may need to read the underlying file system objects directly without using the DBFS wrapper APIs. To do so, we can use HDFS APIs available through py4j gateway server.
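For contrast, the DBFS wrapper route is typically a one-liner. A minimal sketch, assuming a Databricks notebook where `dbutils` is predefined (the path below is a placeholder):

```python
# DBFS wrapper route (for contrast): dbutils resolves the underlying
# file system for us. Only runs inside a Databricks notebook, where
# `dbutils` is predefined; the wasbs path is a placeholder.
first_kb = dbutils.fs.head(
    "wasbs://<azure-container-name>@<azure-account-name>.blob.core.windows.net/<azure-blob-file-path>/",
    1024)  # read up to the first 1 KB of the file as a string
print(first_kb)
```

The rest of this post shows how to bypass this wrapper and call the Hadoop FileSystem API directly.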

Using the HDFS API to read Azure Blob objects in Databricks – Python

To access Azure Blob objects through the HDFS APIs in Databricks, we first need to get a FileSystem object via the SparkContext. We then pass the Azure container and account access key details as configuration values, which the FileSystem API uses internally to access the Azure Blob object. To read a file, we open an input stream, wrap it in a BufferedReader, and loop through each line of the file. Once the file is completely read, we close the reader. Note that here we are not using the dbutils class provided by Databricks as a DBFS wrapper for file-level operations.

Below is the sample code in Python:

#Access the Hadoop Path class and configuration through the py4j gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

#Set the configuration values to provide account name and access keys to the FileSystem
conf.set(
  "fs.azure.account.key.<azure-account-name>.blob.core.windows.net",
  "<azure-account-access-key>")

#Instantiate the FileSystem to get access to an Azure Blob file directly
fs = Path('wasbs://<azure-container-name>@<azure-account-name>.blob.core.windows.net/<azure-blob-file-path>/').getFileSystem(conf)

#To read the file, open a stream of data using the FileSystem.open method
istream = fs.open(Path('wasbs://<azure-container-name>@<azure-account-name>.blob.core.windows.net/<azure-blob-file-path>/'))

#Wrap the stream in a BufferedReader via an InputStreamReader
reader = sc._gateway.jvm.java.io.BufferedReader(sc._gateway.jvm.java.io.InputStreamReader(istream))

#Loop through the lines of the file and print each one
while True:
  thisLine = reader.readLine()
  if thisLine is not None:
    print(thisLine)
  else:
    break

#Finally, close the reader, which also closes the underlying stream
reader.close()
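The read-until-None loop above works because Java's BufferedReader.readLine returns null (seen as None through py4j) at end of stream. A pure-Python sketch of the same pattern, using an in-memory io.StringIO as a stand-in for the remote blob stream (Python's readline signals end of file with an empty string instead of None):

```python
import io

def read_all_lines(reader):
    """Collect lines until readline() signals end of stream."""
    lines = []
    while True:
        this_line = reader.readline()
        if this_line == "":  # Python readline returns "" at EOF, unlike Java's null
            break
        lines.append(this_line.rstrip("\n"))
    return lines

# Stand-in for the blob stream: an in-memory buffer with two lines
buffer = io.StringIO("first line\nsecond line\n")
print(read_all_lines(buffer))  # → ['first line', 'second line']
```

The same loop shape applies whether the lines come from a local buffer or from the HDFS stream opened above.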

Use HDFS API to read Azure Blob files in Databricks – Flow

Thanks for reading. Please share your inputs in the comments section.
