In this post, “Quick guide to Bash commands for Big Data Analysis”, we are going to explore some basic Bash/Linux commands that are very useful in data analysis. Bash is the command line interpreter (shell) of the GNU operating system, a free UNIX-like OS, and it typically runs in a terminal window. It reads the commands typed by the user, interprets them, and asks the kernel to run the corresponding programs. If we want to execute a batch of Bash commands in one go, we can put them in a text file, conventionally saved with a .sh extension, and then run that file as a script.
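As a minimal sketch of the batch idea, here is what such a script could look like. The file name summary.sh, the scratch directory and the sample data are all hypothetical and only serve the illustration.

```shell
#!/usr/bin/env bash
# summary.sh: a hypothetical script that runs a small batch of commands.

# Create a scratch directory and a tiny sample file to work on.
work_dir=$(mktemp -d)
printf '1|line number 1\n2|line number 2\n3|line number 3\n' > "$work_dir/sample.txt"

# Each command runs in order, exactly as if typed at the prompt.
line_count=$(wc -l < "$work_dir/sample.txt")
first_line=$(head -n 1 "$work_dir/sample.txt")

echo "Lines: $line_count"
echo "First line: $first_line"

# Remove the scratch directory when done.
rm -r "$work_dir"
```

Assuming the file is saved as summary.sh, it can be run with: bash summary.sh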
Since Hadoop was developed primarily for Linux, mostly Linux-based machines are used in production environments. Therefore, in order to interact with Hadoop clusters, we need a good understanding of Bash commands. Bash is the default command line interpreter on most Linux distributions and can be very useful for exploring text files located on a Hadoop cluster.
Let’s have a look at the basic bash commands which are useful for a Big Data developer or for a Data analyst.
Bash commands and their uses:
ls command
ls command is used to list the contents of a directory. For example:
ls /d/BashTest/
ls -l /d/BashTest/
-l option is used to list the contents of a directory in long format, which includes the file type, permissions, number of hard links, owner, group, size and last-modified date along with the file name.
ls -lh /d/BashTest/
Adding the -h option to -l prints the sizes in a human readable format (e.g. 10K, 100M, 1G, etc.).
ls -lS /d/BashTest/
Adding the -S option to -l sorts the contents by file size, largest first.
pwd command
This is an abbreviation for “print working directory”. It is used to print the full path of the current working directory.
pwd
mkdir command
This is an abbreviation for “make directory” and it is used to create a new directory on the file system. The command below will create a new folder named “Test” inside D:\BashTest.
mkdir /d/BashTest/Test
mv command
mv command is used to rename a file or directory, or to move it to a different location. If the destination is a new name in the same directory, the file is renamed. If the destination is a different directory, the file is moved there. The syntax is:
mv old_location new_location
So, if we have a file 01.txt located at D:\BashTest and need to move it to a new location at D:\BashTest\Test, we can use below command.
mv /d/BashTest/01.txt /d/BashTest/Test
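To make the difference between renaming and moving concrete, here is a small sketch. It works in a scratch directory created with mktemp instead of D:\BashTest, and the file name 02.txt is hypothetical.

```shell
# Scratch directory with one file and one subdirectory.
work_dir=$(mktemp -d)
mkdir "$work_dir/Test"
touch "$work_dir/01.txt"

# Rename: same directory, new file name.
mv "$work_dir/01.txt" "$work_dir/02.txt"

# Move: the destination is an existing directory, so the file keeps its name.
mv "$work_dir/02.txt" "$work_dir/Test/"

# List the destination folder to confirm the result.
ls "$work_dir/Test"
```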
rm command
rm command is used to remove files or directories. If we need to delete a file, we can use rm command as:
rm /d/BashTest/Test/01.txt
-r option is used to remove directories and their contents recursively.
rm -r /d/BashTest/Test
-i option is used to delete files in interactive mode, which asks for confirmation before each removal. The command below will remove all the text files from the D:\BashTest\Test folder in interactive mode.
rm -i /d/BashTest/Test/*.txt
cd command
cd command is used to change the current working directory.
To change the current directory to the root directory:
cd /
To change the current directory to the parent directory:
cd ..
To change the current directory to the home directory:
cd ~
cp command
cp command is used to create a copy of files and directories.
cp /d/BashTest/TestFile.txt /d/BashTest/TestFolder/
-r option is used to copy directories recursively.
cp -r /d/BashTest/Test /d/BashTest/TestFolder/
cat command
cat command is used to print the contents of a file to the standard output (the command line window). We can also use it to concatenate several text files, or to append text files to an existing document.
cat /d/BashTest/TestFile.txt
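The concatenate and append uses of cat could look like the sketch below. The file names a.txt, b.txt and combined.txt are hypothetical, and the example relies on the > and >> redirection operators.

```shell
# Two small input files in a scratch directory.
work_dir=$(mktemp -d)
printf 'first\n' > "$work_dir/a.txt"
printf 'second\n' > "$work_dir/b.txt"

# Concatenate both files into a new file.
cat "$work_dir/a.txt" "$work_dir/b.txt" > "$work_dir/combined.txt"

# Append one file to the end of the existing file.
cat "$work_dir/a.txt" >> "$work_dir/combined.txt"

# Print the result: first, second, first (one per line).
cat "$work_dir/combined.txt"
```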
cut command
cut command is used to extract sections from each line of the input files and write the result to the standard output. The sections can be selected by field, by character or by byte. When selecting by field, the fields are separated by a delimiter, which is a tab character by default.
So, if we have a pipe(|) delimited file (as displayed in cat command’s output) and we need to extract the first column from it, we can use below command:
cut -d '|' -f1 /d/BashTest/TestFile.txt
Note: It will not change the original input file.
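As a further sketch, cut can also select several fields at once or a range of characters. The three-column sample file here is hypothetical.

```shell
# A small pipe-delimited sample file.
work_dir=$(mktemp -d)
printf '1|alpha|10\n2|beta|20\n' > "$work_dir/sample.txt"

# Fields 1 and 3; the delimiter is kept between the selected fields.
fields=$(cut -d '|' -f 1,3 "$work_dir/sample.txt")

# The first three characters of each line.
chars=$(cut -c 1-3 "$work_dir/sample.txt")

echo "$fields"
echo "$chars"
```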
grep command
grep command is used to find the lines of the input files that match a given regular expression pattern and write them to the standard output. It is an abbreviation for “global regular expression print”. So, if we need to extract all the lines which contain 'line number 1000', we can use below command:
grep 'line number 1000' /d/BashTest/TestFile.txt
Note: It will not change the original input file.
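A few commonly used grep options are sketched below on a hypothetical three-line sample: -i ignores case, -c counts the matching lines, and -v inverts the match.

```shell
# Sample file with mixed-case lines.
work_dir=$(mktemp -d)
printf 'line number 1\nLine Number 2\nline number 10\n' > "$work_dir/sample.txt"

# -i ignores case, -c prints the number of matching lines instead of the lines.
match_count=$(grep -i -c 'line number 1' "$work_dir/sample.txt")

# -v prints the lines that do NOT match the (case-sensitive) pattern.
non_matching=$(grep -v 'line number' "$work_dir/sample.txt")

echo "$match_count"
echo "$non_matching"
```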
head command
head command is used to write the starting lines of a text file to the standard output. By default, it outputs the first 10 lines of the input file. The syntax is:
head -n N file
where N is the required number of lines.
head -n 10 /d/BashTest/TestFile.txt
tail command
tail command is used to write the lines from the end of a text file to the standard output. By default, it outputs the last 10 lines of the input file. The syntax is:
tail -n N file
where N is the required number of lines.
tail -n 10 /d/BashTest/TestFile.txt
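head and tail can also be combined to pull out a range of lines from the middle of a file. This is a minimal sketch on a hypothetical file that holds the numbers 1 to 20, one per line.

```shell
# A 20-line sample file.
work_dir=$(mktemp -d)
seq 1 20 > "$work_dir/numbers.txt"

# First three and last two lines.
first_three=$(head -n 3 "$work_dir/numbers.txt")
last_two=$(tail -n 2 "$work_dir/numbers.txt")

# Lines 5 to 7: take the first 7 lines, then the last 3 of those.
middle=$(head -n 7 "$work_dir/numbers.txt" | tail -n 3)

echo "$middle"
```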
touch command
touch command is used to update the last access and/or modification time of a file or directory. We can also use it to create an empty file.
touch /d/BashTest/TestFile.txt
If we want to create a new empty file, we can use a non existing file name instead of an existing file name.
touch /d/BashTest/NewTestFile.txt
tr command
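tr command is used to translate (replace) or delete specific characters read from the standard input and to write the result to the standard output. It works character by character, not on whole strings: each character in the first set is replaced by the character at the same position in the second set. So, if we want to convert all the lowercase letters in the output of cat command to uppercase, we can use below command:
cat /d/BashTest/TestFile.txt | tr 'a-z' 'A-Z'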
-d option is used to delete characters instead of translating them. If we want to delete all the spaces in the output, we can chain the output of the cat command with tr command as below:
cat /d/BashTest/TestFile.txt | tr -d ' '
Note: This command will not change the original input file.
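Since tr maps single characters rather than whole strings, it helps to see what it does with a short input. The sketch below uses a hypothetical one-line text. The last command uses sed, which is not covered in this post, only to show one way to replace a whole string.

```shell
text='line number 1000'

# Translate every lowercase letter to uppercase.
upper=$(printf '%s\n' "$text" | tr 'a-z' 'A-Z')

# Delete every space.
no_spaces=$(printf '%s\n' "$text" | tr -d ' ')

# Character mapping: EVERY l becomes L and EVERY n becomes N.
mapped=$(printf '%s\n' "$text" | tr 'ln' 'LN')

# Whole-string replacement needs a different tool, for example sed.
replaced=$(printf '%s\n' "$text" | sed 's/line number/Line Number/')

echo "$upper"      # LINE NUMBER 1000
echo "$no_spaces"  # linenumber1000
echo "$mapped"     # LiNe Number 1000
echo "$replaced"   # Line Number 1000
```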
wc command
wc command is used to print number of lines, words and bytes for each input file. It is an abbreviation for “word count”.
wc /d/BashTest/TestFile.txt
-c option is used to print only the number of bytes. To count characters instead, we can use the -m option.
wc -c /d/BashTest/TestFile.txt
-l option is used to print only the number of lines.
wc -l /d/BashTest/TestFile.txt
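Here is a small sketch of the individual counts on a hypothetical two-line file. Reading the file through < keeps the file name out of the output, which is handy when the count is stored in a variable.

```shell
# Two lines, three words, 14 bytes (including the two newline characters).
work_dir=$(mktemp -d)
printf 'one two\nthree\n' > "$work_dir/sample.txt"

lines=$(wc -l < "$work_dir/sample.txt")
words=$(wc -w < "$work_dir/sample.txt")
bytes=$(wc -c < "$work_dir/sample.txt")

echo "lines=$lines words=$words bytes=$bytes"
```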
sort command
sort command is used to sort the lines of a text file and write the result to the standard output.
sort /d/BashTest/TestFile.txt
-r option is used to sort the output in the reverse order.
sort -r /d/BashTest/TestFile.txt
-k option is used to sort the content by column (field) number. By default, fields are separated by whitespace, so for a pipe-delimited file we also pass the delimiter with the -t option. Here, we are sorting the file based on the first column only (-k 1,1).
sort -t '|' -k 1,1 /d/BashTest/TestFile.txt
-n option is used to compare values numerically instead of as text. Below, we are sorting the content based on first column’s numerical value and in reverse order.
sort -t '|' -k 1,1nr /d/BashTest/TestFile.txt
Note: This command will not change the original input file.
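The effect of the key options is easier to see on a tiny example. This sketch assumes a hypothetical pipe-delimited file with a numeric first column and a text second column.

```shell
# Sample file: number|name
work_dir=$(mktemp -d)
printf '10|gamma\n9|alpha\n100|beta\n' > "$work_dir/sample.txt"

# -t sets the field delimiter; -k 1,1n sorts on field 1 only, numerically.
ascending=$(sort -t '|' -k 1,1n "$work_dir/sample.txt")

# Adding r to the key reverses the order for that key.
descending=$(sort -t '|' -k 1,1nr "$work_dir/sample.txt")

# Sort on the second field as plain text.
by_name=$(sort -t '|' -k 2,2 "$work_dir/sample.txt")

echo "$ascending"
echo "$descending"
echo "$by_name"
```

Without the n flag, the first column would be compared as text, so 100 would be placed before 9.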
vim command
vim is a text editor whose name stands for “vi improved”. It can be used to edit an existing file or to create a new one, by passing the file name to the vim command.
du command
du command is used to display the disk space used by the given files or by a directory and its subdirectories.
du /d/BashTest/
-h option is used to get the file size in human readable format:
du -h /d/BashTest/
df command
df command is used to display the total, used and available disk space of the mounted file systems.
df
-h option is used to display the sizes in human readable format:
df -h
man command
man command is used to display the manual pages of a command, for example: man ls. On a Windows machine where man is not available, we can use the --help option to get the command documentation. For example: cd --help.
more command
more command is used to display the contents of a text file one screen at a time. We can chain more command to the output of other commands in order to display the results one screen at a time.
less command
less command is similar to more, but it has extended capabilities, such as scrolling both forward and backward through the file.
ps command
ps command is used to get the information about the currently running processes with their process identification numbers.
top command
top command is used to display an ordered list of the running processes, sorted by user-specified criteria (CPU usage by default). The list is refreshed periodically.
kill command
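kill command is used to terminate a process by sending it a signal (SIGTERM by default). It takes the process ID, which we can get from ps or top.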
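The process commands can be tried out safely on a throwaway process. In this sketch, sleep 60 stands in for a hypothetical long-running job, and the trailing & starts it in the background.

```shell
# Start a long-running command in the background.
sleep 60 &
pid=$!   # $! holds the process ID of the most recent background job

# Show only that process.
ps -p "$pid"

# Ask the process to terminate (kill sends SIGTERM by default).
kill "$pid"

# Collect its exit status: 143 means 128 + 15, i.e. ended by SIGTERM.
exit_status=0
wait "$pid" 2>/dev/null || exit_status=$?
echo "Exit status: $exit_status"
```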
Bash special symbolic prompt operators for Big Data Analysis
Bash also provides some special symbolic prompt operators which can be very handy in data analysis.
pipe (|) prompt operator
| operator is used to pass the output of the first command as the input of the second command. It is very useful for command chaining. Below, the output of cat is piped to tr, which converts the lowercase letters to uppercase.
cat /d/BashTest/TestFile.txt | tr 'a-z' 'A-Z'
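Pipes become really useful when more than two commands are chained. The sketch below, on a hypothetical pipe-delimited sample, extracts a column, sorts it and keeps the first two values, and then counts the matching lines of a search.

```shell
# Sample file: id|name
work_dir=$(mktemp -d)
printf '1|beta\n2|alpha\n3|gamma\n4|alpha\n' > "$work_dir/sample.txt"

# Extract the second column, sort it, and keep the first two lines.
top_two=$(cut -d '|' -f 2 "$work_dir/sample.txt" | sort | head -n 2)

# Count the lines that contain 'alpha'.
alpha_count=$(grep 'alpha' "$work_dir/sample.txt" | wc -l)

echo "$top_two"
echo "$alpha_count"
```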
double pipe (||) prompt operator
|| is used when we want to execute the second command only if the first command fails. The second command is never executed if the first command succeeds.
> prompt operator
> is used to redirect the standard output to a file. If the file already exists, it is overwritten; otherwise, a new file is created. So, if we want to write the result of ls command to a new text file we can use:
ls /d/BashTest/ > /d/BashTest/NewTextFile.txt
>> prompt operator
>> is used to append the standard output to the end of a file. If the file does not exist, a new one is created. So, if we want to append the output of the ls command to an existing text file named “NewTextFile.txt”, we can use:
ls /d/BashTest/ >> /d/BashTest/NewTextFile.txt
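The difference between overwriting and appending can be checked with a short sketch. The file name out.txt is hypothetical.

```shell
work_dir=$(mktemp -d)
out="$work_dir/out.txt"

echo 'first run' > "$out"     # creates the file
echo 'second run' > "$out"    # overwrites the previous content
echo 'third run' >> "$out"    # appends to the existing content

# The file now holds: second run, third run (one per line).
content=$(cat "$out")
echo "$content"
```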
& prompt operator
& is used to run a process in the background.
&& prompt operator
&& is used to execute the second command only if the execution of the first command succeeded.
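The two conditional operators, && and ||, can be sketched as follows. The directory and file names are hypothetical.

```shell
work_dir=$(mktemp -d)

# &&: the file is written only if mkdir succeeded.
mkdir "$work_dir/data" && echo 'created' > "$work_dir/data/status.txt"

# ||: the fallback message is printed only because grep finds no match.
result=$(grep -q 'missing' "$work_dir/data/status.txt" || echo 'pattern not found')

echo "$result"
```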
You can also refer to this link if you want to understand these commands in more detail.
https://www.gnu.org/software/bash/manual/bash.html
Thanks for reading. Please share your thoughts in the comment section.