Big Data/Cloud

Continuous Integration and Continuous Deployment (CI/CD) – SQL Server Database project dependency – Part 3

Previously, we created Azure DevOps Continuous Integration (CI) and Continuous Deployment (CD) pipelines to deploy a SQL Server database project independently. However, in an enterprise data warehouse environment, databases mostly depend on other databases, because in a DWH we access objects from multiple databases (like accessing the staging layer objects into the […]


Continuous Integration and Continuous Deployment (CI/CD) – SQL Server Database CD – Part 2

In the previous post, we created a CI (Continuous Integration) pipeline for a SQL Server database project. For this demo, as in the previous demo, we will be using a SQL Server instance running on an on-prem machine along with a locally installed Azure DevOps Server. Please note that we are not using the […]


Continuous Integration and Continuous Deployment (CI/CD) – SQL Server Database CI – Part 1

In this post, we are going to discuss how we can enable continuous integration and continuous deployment for a SQL Server database project using Azure DevOps Server. For this demo, we will be using a SQL Server instance running on-prem along with a locally installed Azure DevOps Server. Continuous integration and continuous delivery (in short […]


Data compression in Hive – An Introduction to Hadoop Data Compression

Data compression is a technique that encodes the original data in such a way that it can be represented with fewer bits on the disk, which reduces the size of the data files. We know that the Hadoop framework is meant for large-scale data […]
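To make the idea concrete, below is a minimal PySpark sketch of writing compressed output, the same concept that Hive settings such as hive.exec.compress.output control; the Snappy codec and the output path are illustrative assumptions, not taken from the post.

```python
from pyspark.sql import SparkSession

# A minimal sketch: compressing data written from Spark, analogous to
# enabling compressed output in Hive. Codec and path are assumptions.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("compression-demo")
         .getOrCreate())

df = spark.range(1000).toDF("id")

# Snappy trades some compression ratio for speed, a common choice in
# the Hadoop ecosystem; gzip would compress more but decode slower.
(df.write.mode("overwrite")
   .option("compression", "snappy")
   .parquet("/tmp/demo_compressed"))
```

Smaller files on disk mean less I/O per query, at the cost of some CPU spent compressing and decompressing.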


Read and write data to SQL Server from Spark using pyspark

Apache Spark is a very powerful general-purpose distributed computing framework. It provides different kinds of data abstractions, such as RDDs, DataFrames, and Datasets, on top of a distributed collection of data. Spark is a highly scalable big data processing engine that can scale from a single node to clusters of thousands of nodes. To follow this exercise, […]
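As a rough sketch of what the exercise walks through, the snippet below reads a table from SQL Server over JDBC and writes a DataFrame back; the host, database, table names, credentials, and driver version are placeholders, and it assumes the Microsoft JDBC driver (mssql-jdbc) is available to Spark.

```python
from pyspark.sql import SparkSession

# Placeholder connection details; adjust for your environment.
spark = (SparkSession.builder
         .appName("sqlserver-demo")
         .config("spark.jars.packages",
                 "com.microsoft.sqlserver:mssql-jdbc:12.4.2.jre8")
         .getOrCreate())

jdbc_url = "jdbc:sqlserver://localhost:1433;databaseName=DemoDB;encrypt=false"
props = {"user": "demo_user",
         "password": "demo_password",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# Read a SQL Server table into a DataFrame.
df = spark.read.jdbc(url=jdbc_url, table="dbo.SourceTable", properties=props)

# Write the (possibly transformed) DataFrame back to another table.
df.write.mode("append").jdbc(url=jdbc_url, table="dbo.TargetTable",
                             properties=props)
```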


Install Spark on Windows (Local machine) with PySpark – Step by Step

Apache Spark is a general-purpose big data processing engine. It is a very powerful cluster computing framework that can scale from a single node to clusters of thousands of nodes. It can run on clusters managed by Hadoop YARN, Apache Mesos, or Spark’s own standalone cluster manager. To read more on the Spark big data processing framework, […]
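Once the installation steps are done, a short PySpark smoke test is a quick way to confirm the local setup works; this sketch assumes pyspark was installed (for example via pip) and that winutils.exe is configured on Windows, as a local Spark setup typically requires.

```python
from pyspark.sql import SparkSession

# Start a local session using all available cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("smoke-test")
         .getOrCreate())

print(spark.version)  # confirms which Spark version is running

# A tiny DataFrame proves the session can execute jobs end to end.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
```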


RDD, DataFrame, and DataSet – Introduction to Spark Data Abstraction

Apache Spark is a general-purpose distributed computing engine used for Big Data processing, both batch and stream processing. It provides high-level APIs like Spark SQL, Spark Streaming, MLlib, and GraphX to allow interaction with the core functionalities of Apache Spark. Spark also facilitates several core data abstractions on top of the distributed collection of […]
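Here is a small PySpark sketch contrasting the two abstractions available from Python (the typed Dataset API exists only in Scala and Java); the sample data is made up.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("abstractions-demo")
         .getOrCreate())

# RDD: a low-level distributed collection with functional transformations.
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])
adults = rdd.filter(lambda pair: pair[1] >= 30)
print(adults.collect())

# DataFrame: schema-aware rows, optimized by the Catalyst query planner.
df = spark.createDataFrame(rdd, ["name", "age"])
df.filter(df.age >= 30).show()
```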


Big Data processing using Apache Spark – Introduction

What is Spark? Apache Spark is an open-source, general-purpose distributed cluster computing framework and a unified computing engine for big data processing. Spark is designed for lightning-fast cluster computing. An application can run up to 100 times faster than Hadoop MapReduce by using Spark’s in-memory cluster computing. Also, […]


Understanding Map join in Hive

Apache Hive is a big data query engine used to read, transform, and write large datasets in a distributed environment. It has a SQL-like syntax that gets translated into MapReduce jobs in order to execute on Hadoop clusters. In the Hadoop ecosystem, we use Hive for batch processing to extract, transform, and […]
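Hive’s map join keeps the smaller table in memory so the join happens map-side, without shuffling the large table; the PySpark broadcast join below illustrates the same idea with hypothetical tables.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .master("local[*]")
         .appName("mapjoin-demo")
         .getOrCreate())

# Hypothetical small dimension table and large fact table.
small = spark.createDataFrame([(1, "US"), (2, "UK")],
                              ["country_id", "country"])
large = spark.createDataFrame([(101, 1), (102, 2), (103, 1)],
                              ["order_id", "country_id"])

# Broadcasting the small side ships it to every executor, so the large
# table is joined in place, the Spark analogue of Hive's map join.
large.join(broadcast(small), "country_id").show()
```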


Partitioning and Bucketing in Hive

In this article, we will discuss two important concepts in Hive: partitioning and bucketing. These are used to improve query performance, and it is important to understand them so that you can apply them efficiently. So let’s start with partitioning. Partitioning in Hive: partitioning is a technique used to enhance query performance in […]
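As a sketch of the two techniques, the snippet below writes a table from PySpark that is both partitioned and bucketed; the table and column names are made up, and note that Spark’s bucketing layout is not byte-compatible with Hive’s, though the concept is the same.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("partition-bucket-demo")
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "2024-01-01", "US"), (2, "2024-01-01", "UK"), (3, "2024-01-02", "US")],
    ["id", "order_date", "country"])

# Partitioning: one directory per distinct country value, so queries
# filtering on country skip whole directories.
# Bucketing: rows hashed on id into a fixed number of files, which can
# avoid shuffles for joins and aggregations on id.
(df.write.mode("overwrite")
   .partitionBy("country")
   .bucketBy(4, "id")
   .sortBy("id")
   .saveAsTable("orders_bucketed"))
```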
