Blog Archives - Page 5 of 22

Get minimum value from multiple columns in SQL Server

2 Comments / SQL Server / Gopal Krishna Ranjan / Jun 27, 2021 / sql tips

This post discusses how to extract the minimum value from multiple columns in SQL Server. For example, suppose we have a table that stores the temperatures of multiple cities, with each city's temperature data in a separate column. However, we have to select the minimum temperature value across all […]
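
As a rough illustration of the idea, the sketch below unpivots the city columns with `UNION ALL` and then takes `MIN` per row. It uses Python's built-in `sqlite3` module as a stand-in for SQL Server, and the table name, column names, and readings are hypothetical. This is one portable way to do it, not necessarily the approach the post takes.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table: one temperature column per city.
conn.execute(
    "CREATE TABLE city_temperature ("
    "reading_date TEXT PRIMARY KEY, delhi REAL, mumbai REAL, pune REAL)"
)
conn.executemany(
    "INSERT INTO city_temperature VALUES (?, ?, ?, ?)",
    [
        ("2021-06-25", 38.5, 31.0, 29.5),
        ("2021-06-26", 36.0, 30.5, 32.0),
    ],
)

# Unpivot the three city columns into rows, then aggregate per date.
query = """
SELECT reading_date, MIN(temperature) AS min_temperature
FROM (
    SELECT reading_date, delhi AS temperature FROM city_temperature
    UNION ALL
    SELECT reading_date, mumbai FROM city_temperature
    UNION ALL
    SELECT reading_date, pune FROM city_temperature
) AS unpivoted
GROUP BY reading_date
ORDER BY reading_date
"""
min_by_date = conn.execute(query).fetchall()
print(min_by_date)
```

The inner query is plain ANSI SQL, so the same shape should carry over to SQL Server with little change.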

Show full column content in Spark

Leave a Comment / Hadoop, Spark / Gopal Krishna Ranjan / May 28, 2021 / big data processing, pyspark, python

This post briefly explains how to display the full contents of DataFrame columns in Apache Spark. By default, Spark truncates a column value when displaying it if the value is longer than 20 characters. However, sometimes we need to display the full values rather than the truncated data. Having truncated data might not be useful in

Spark read file with special characters using PySpark

1 Comment / Hadoop, Spark / Gopal Krishna Ranjan / May 24, 2021 / big data processing, pyspark, python

Suppose we have a CSV file that contains some non-English characters (Spanish, Japanese, etc.) and we want to read this file into a Spark DataFrame. If we read this file without using the right character encoding, we will end up with some junk characters (like �) in the DataFrame. So, the files
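
The effect is easy to reproduce without Spark. The following sketch, in plain Python, decodes the same bytes once with the wrong encoding and once with the right one. The sample row and the choice of ISO-8859-1 are assumptions made for illustration.

```python
# Hypothetical CSV content, written by a tool that used ISO-8859-1 (Latin-1).
raw_bytes = "id,name\n1,Señor\n".encode("iso-8859-1")

# Decoding as UTF-8 fails on the byte for "ñ"; errors="replace" substitutes
# the Unicode replacement character (U+FFFD), which renders as the junk "�".
garbled = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the file was actually written in recovers the text.
correct = raw_bytes.decode("iso-8859-1")

print(repr(garbled))
print(repr(correct))
```

In PySpark, the CSV reader accepts an `encoding` option that serves the same purpose as the explicit `decode` call here.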

Read CSV file with Newline character in PySpark

Leave a Comment / Hadoop, Spark / Gopal Krishna Ranjan / May 14, 2021 / big data processing, pyspark, python

Apache Spark is a big data cluster computing framework that can run standalone, on Hadoop, Kubernetes, or Mesos clusters, or in the cloud. We can read data from and write data to various data sources using Spark. For example, we can use CSV (comma-separated values) and TSV (tab-separated values) files as input sources to a Spark application.
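
To see why a newline inside a field needs special handling, here is a small sketch using Python's standard `csv` module rather than Spark. The sample data is hypothetical. A naive split on line breaks cuts the quoted field in two, while a quote-aware parser keeps it as one record.

```python
import csv
import io

# Hypothetical CSV text: the second record has a newline inside a quoted field.
data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Naive approach: treating every line break as a record boundary.
naive_lines = data.splitlines()

# Quote-aware approach: the csv module keeps the quoted newline in the field.
rows = list(csv.reader(io.StringIO(data)))

print(len(naive_lines), "physical lines")
print(len(rows), "logical records")
print(rows[1])
```

PySpark's CSV reader has a `multiLine` option for the same situation; it is off by default, which is why such files appear to break into extra rows.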

Sort By, Order By, Distribute By, and Cluster By in Hive

Leave a Comment / Hadoop, Hive / Gopal Krishna Ranjan / May 3, 2021 / big data processing, Hadoop, HiveQL

This post briefly discusses the differences and similarities between Sort By, Order By, Distribute By, and Cluster By in Hive queries. This is one of the most important questions asked in big data/Hadoop interviews. The Sort By, Order By, Distribute By, and Cluster By clauses are available in the Hive query language and
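
The four clauses can be modelled in a few lines of plain Python, treating each reducer as a list. This is only a toy simulation under stated assumptions: the function names are hypothetical, and `key % num_reducers` stands in for Hive's hash partitioning.

```python
from collections import defaultdict


def distribute_by(rows, key, num_reducers):
    """Route rows with the same key to the same reducer; no ordering implied."""
    reducers = defaultdict(list)
    for row in rows:
        # Assumption: modulo on an integer key stands in for Hive's hash.
        reducers[key(row) % num_reducers].append(row)
    return [reducers[i] for i in range(num_reducers)]


def sort_by(partitions, key):
    """Sort rows within each reducer only; there is no global order."""
    return [sorted(partition, key=key) for partition in partitions]


def cluster_by(rows, key, num_reducers):
    """Distribute By and Sort By on the same key."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


def order_by(rows, key):
    """Total order over all rows, as if a single reducer saw everything."""
    return sorted(rows, key=key)


rows = [5, 2, 8, 1, 4, 7]
identity = lambda value: value

distributed = distribute_by(rows, identity, 2)
clustered = cluster_by(rows, identity, 2)
ordered = order_by(rows, identity)

print(distributed)
print(clustered)
print(ordered)
```

The point of the comparison is that only Order By yields one globally sorted result; the others trade that guarantee for parallelism across reducers.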

Grant UPDATE and SELECT on specific columns in a table – SQL Server

Leave a Comment / SQL Server / Gopal Krishna Ranjan / Apr 7, 2021 / query design, sql tips

This post briefly explains how to grant UPDATE and SELECT permissions on specific columns of a table in SQL Server without using a view. This partial vertical access control strategy helps us manage permissions directly at the table level. It is always good to set the access permissions at the

Get consecutive available seats in a row using SQL query

Leave a Comment / SQL Server / Gopal Krishna Ranjan / Apr 3, 2021 / query design, sql tips

This post briefly explains how to get consecutive available seats in a row using a SQL query, for a multiplex cinema theatre that stores its data in a SQL Server database. In other words, we need to write a query to get n consecutive available seats for the multiplex seat booking application. However, for this
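
One way to frame the problem is a self-join: for each available seat, count the available seats in the window of n seats starting there, and keep the starts where the count equals n. The sketch below runs that query through Python's built-in `sqlite3` module as a stand-in for SQL Server. The schema and data are hypothetical, and the post itself may use a different technique.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one record per seat, with an availability flag.
conn.execute(
    "CREATE TABLE seats (row_id TEXT, seat_no INTEGER, is_available INTEGER)"
)
conn.executemany(
    "INSERT INTO seats VALUES (?, ?, ?)",
    [("A", 1, 1), ("A", 2, 0), ("A", 3, 1), ("A", 4, 1), ("A", 5, 1), ("A", 6, 1)],
)

n = 3  # number of consecutive seats requested

# For each available seat a, count available seats b in [a, a + n - 1]
# of the same row; a full window means n consecutive free seats.
query = """
SELECT a.row_id, a.seat_no AS first_seat
FROM seats AS a
JOIN seats AS b
  ON b.row_id = a.row_id
 AND b.seat_no BETWEEN a.seat_no AND a.seat_no + ? - 1
 AND b.is_available = 1
WHERE a.is_available = 1
GROUP BY a.row_id, a.seat_no
HAVING COUNT(*) = ?
ORDER BY a.row_id, a.seat_no
"""
starts = conn.execute(query, (n, n)).fetchall()
print(starts)
```

Each result is the first seat of a block of n free seats, so overlapping blocks (seats 3 to 5 and 4 to 6 here) are both reported.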

Create pair plots using scatter_matrix method in pandas

Leave a Comment / Analytics/ML, Data Analysis, Data Science, Machine Learning, Python / Gopal Krishna Ranjan / Mar 31, 2021 / data analysis, data preprocessing, data science - step by step, EDA, machine learning - step by step

Exploratory data analysis is a very important step in a data science project. It helps us visualize the data and identify hidden trends that might not be visible from summary statistics alone. We can use the matplotlib and seaborn libraries to create stunning visuals in Python. However, the pandas.plotting module of the

Plot ECDF in Python

Leave a Comment / Analytics/ML, Data Analysis, Data Science, Machine Learning, Python / Gopal Krishna Ranjan / Feb 28, 2021 / data analysis, data preprocessing, data science - step by step, EDA, machine learning - step by step

We know that EDA (Exploratory Data Analysis) is the process of organizing, plotting, and summarizing data to find trends, patterns, and outliers using statistical and visual methods. We have already discussed various methods of performing EDA on an underlying dataset, along with their pros and cons. The ECDF plot is another visual method of performing
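
For reference, the points of an ECDF (empirical cumulative distribution function) can be computed with nothing but the standard library: sort the observations, then pair the i-th smallest value with the fraction i/n. The function name and sample values below are hypothetical, and the plotting step is left to whichever library is in use.

```python
def ecdf(values):
    """Return the sorted values and the cumulative fraction at each one."""
    xs = sorted(values)
    n = len(xs)
    # The i-th smallest observation (1-based) has cumulative fraction i / n.
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


xs, ys = ecdf([3.2, 1.5, 2.8, 1.5])
print(list(zip(xs, ys)))
```

Plotting `xs` against `ys` as unconnected markers or as a step line gives the ECDF plot.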

Interactive Data Analysis with HANA using Jupyter Notebook/Jupyter Lab

Leave a Comment / Data Analysis, Python / Gopal Krishna Ranjan / Jan 31, 2021 / data analysis, hana, jupyter notebook, python, python use case sql

We have discussed how we can use Jupyter Lab/Jupyter Notebook to do interactive data analysis with SQL Server. Jupyter Notebook is a very powerful and useful tool for any data analyst or data scientist. Jupyter Lab is the next-generation tool for Jupyter Notebooks. It provides an interface where we can
