In this post, we will learn how to download a file from DBFS, i.e. the Databricks File System, to the local machine. DBFS is the file system that Databricks uses to store its files. It is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. To demonstrate the download, we will use the Azure Databricks platform; however, a similar approach works for Databricks hosted in AWS or any other cloud platform. We will use the Databricks CLI – the Databricks command line interface – to download a sample file from DBFS to our local machine. This method is especially helpful when you have to download a big file from DBFS.
How to Download a file from Databricks to the Local Machine
In order to use the Databricks CLI, we first need to install it on our local machine. The Databricks CLI provides useful commands grouped by their primary endpoints. This command line interface is built on top of the Databricks REST APIs. We can use it to run data processing tasks, execute file system-level commands, and more. The Databricks CLI is an open-source project hosted on GitHub.
As a prerequisite, the Databricks CLI needs a Python installation, and the Python version must be compatible with the CLI: Python 3.6 or above in the case of Python 3, or Python 2.7.9 or above in the case of Python 2. Below are the steps we will cover in this demo.
- Install the Databricks CLI
- Verify the Databricks CLI installation
- Create a Databricks access token
- Configure the Databricks CLI
- Download a file from DBFS using Databricks CLI
- Upload a file from local to the Databricks file system
Let’s discuss each step mentioned above in detail now.
1. Install Databricks CLI
pip is the package installer for Python, and we can use it to install the Databricks CLI from a terminal window. Execute the below pip command to install the Databricks CLI utility.
pip install databricks-cli
As a result, the above pip command installs the Databricks command line interface on our machine. Once it is installed, we can use this utility as an interface to list, view, and download Databricks files.
2. Verify Databricks CLI installation
Once the installation is complete, we can use the below command in a terminal window to verify the installation. This command will print the databricks-cli version in the terminal window.
databricks --version
3. Create a Databricks access token
To access the Databricks files, we need to create an access token in our Databricks workspace. We need the below pieces of information to configure the CLI utility.
- Host name
- Personal access token
The hostname is the workspace URL that we use to access the Databricks platform from a browser. It is formatted like https://<databricks-instance-name>.cloud.databricks.com.
Next, to create a personal access token for authentication purposes, firstly, click on the username dropdown located at the top right corner of the Databricks workspace. Secondly, choose User Settings and open the Access tokens tab. Thirdly, click on the Generate new token button. Finally, enter a name for the token, set its expiration lifetime, and then click on the Generate button. Do not forget to copy the token value, as Databricks does not allow viewing it again once it is saved.
4. Configure Databricks CLI
In the next step, we need to configure the Databricks CLI using the personal access token we just created. To start the token-based configuration of the Databricks CLI, run the below command.
databricks configure --token
It will prompt for the hostname and the token, respectively. Provide these values and hit the Enter key.
Hostname -> https://<databricks-instance-name>.cloud.databricks.com
Token -> <your-token-key>
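Once configured, the CLI stores these values in a .databrickscfg file in the home directory. Assuming the default profile, the file should look roughly like this (with your own host and token values):

```ini
[DEFAULT]
host = https://<databricks-instance-name>.cloud.databricks.com
token = <your-token-key>
```

You can edit this file directly, or add extra named profiles to it, instead of re-running the configure command.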
5. Download a file from DBFS using Databricks CLI
Finally, we can execute the file system’s cp command to download a file from the Databricks File system to our local machine. This is the same as the UNIX cp command except for the databricks fs prefix. The syntax of the command is databricks fs cp <source> <destination>.
databricks fs cp dbfs:/FileStore/jars/sample-jar-file.jar /Users/admin/Downloads
Depending on the file size and the network speed, it will take some time to download the file from DBFS to the local system.
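Under the hood, this download maps onto the DBFS REST API read endpoint, which returns the file as base64-encoded chunks of at most 1 MB per call. The Python sketch below illustrates that flow, assuming the host and token are supplied via environment variables; the helper function names are our own, for illustration only, and not part of any official Databricks SDK.

```python
import base64
import json
import os
import urllib.parse
import urllib.request

# Placeholders: set DATABRICKS_HOST and DATABRICKS_TOKEN before using this.
HOST = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
TOKEN = os.environ.get("DATABRICKS_TOKEN", "example-token")
CHUNK = 1024 * 1024  # the read endpoint returns at most 1 MB per call

def assemble_chunks(b64_chunks):
    """Decode a sequence of base64-encoded chunks into one bytes object."""
    return b"".join(base64.b64decode(c) for c in b64_chunks)

def dbfs_read(path, offset, length):
    """One call to /api/2.0/dbfs/read; returns (bytes_read, base64_data)."""
    query = urllib.parse.urlencode({"path": path, "offset": offset, "length": length})
    req = urllib.request.Request(
        f"{HOST}/api/2.0/dbfs/read?{query}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["bytes_read"], body["data"]

def download(dbfs_path, local_path):
    """Fetch a DBFS file chunk by chunk and write it to a local file."""
    chunks, offset = [], 0
    while True:
        n, data = dbfs_read(dbfs_path, offset, CHUNK)
        if n == 0:  # no more bytes to read
            break
        chunks.append(data)
        offset += n
    with open(local_path, "wb") as out:
        out.write(assemble_chunks(chunks))
```

This also explains why large downloads take a while: the file travels as many small base64-encoded responses rather than one raw stream.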
6. Upload a file from local to DBFS
Similarly, to upload a file from our local machine to the Databricks File System, we can use the databricks fs cp command by switching the source and destination placeholders in the above command.
databricks fs cp /Users/admin/Downloads/sample-jar-file.jar dbfs:/FileStore/jars/
Upload speed will also depend on the file size and the network speed.
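The upload direction uses the DBFS streaming upload endpoints: create opens a handle, add-block appends base64-encoded blocks of at most 1 MB each, and close finalizes the file. A rough Python sketch of that sequence, again with placeholder host and token and illustrative helper names of our own:

```python
import base64
import json
import os
import urllib.request

# Placeholders: set DATABRICKS_HOST and DATABRICKS_TOKEN before using this.
HOST = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
TOKEN = os.environ.get("DATABRICKS_TOKEN", "example-token")
BLOCK = 1024 * 1024  # add-block accepts at most 1 MB of data per call

def split_blocks(data, size=BLOCK):
    """Split raw bytes into base64-encoded blocks no larger than size."""
    return [base64.b64encode(data[i:i + size]).decode()
            for i in range(0, len(data), size)]

def dbfs_post(endpoint, payload):
    """POST a JSON payload to a DBFS endpoint and return the parsed response."""
    req = urllib.request.Request(
        f"{HOST}/api/2.0/dbfs/{endpoint}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def upload(local_path, dbfs_path, overwrite=True):
    """Stream a local file to DBFS via create / add-block / close."""
    with open(local_path, "rb") as f:
        data = f.read()
    handle = dbfs_post("create", {"path": dbfs_path,
                                  "overwrite": overwrite})["handle"]
    for block in split_blocks(data):
        dbfs_post("add-block", {"handle": handle, "data": block})
    dbfs_post("close", {"handle": handle})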
We can also explore some other commands that are available out of the box in the Databricks CLI utility. For example, to list the files under a DBFS path, we can use the fs ls command.
databricks fs ls <dbfs-path>
This command lists all the files for the given path. We can also experiment with the other useful commands provided by the Databricks CLI utility.
Thanks for reading. Please share your inputs in the comment section.