How to handle duplicate records while inserting data in Databricks

Have you ever faced a situation where records keep getting duplicated when you insert new data into an existing table in Databricks?

If yes, then this blog is for you.

Let’s start with a simple use case: inserting Parquet data from a folder in the data lake into a Delta table using Databricks.

Follow the steps below to get this done.

Step 1: Read the Source data

Open a Databricks notebook and read the source data in Parquet format, then register it as a temporary view.

Syntax: 
Source = spark.read.parquet("/mnt/Ization/Duplicate/Sample/*.parquet")
Source.createOrReplaceTempView("Source")
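
If the incoming files can themselves contain duplicate rows, it is worth dropping those up front: the MERGE in Step 3 will fail if more than one source row matches the same target row. A minimal sketch, assuming the same Source DataFrame as above:

# Remove exact duplicate rows within the source before registering the view;
# MERGE raises an error when multiple source rows match one target row.
Source = Source.dropDuplicates()
Source.createOrReplaceTempView("Source")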

Step 2: Create the Target table using DELTA

Syntax:
CREATE TABLE IF NOT EXISTS Target(
EmpID  STRING
,Name  STRING
,Division STRING
,DOB STRING
,Address STRING
)
USING DELTA
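
If your notebook's default language is Python, the same DDL can be run through spark.sql; a minimal sketch:

# spark.sql executes any Spark SQL statement from a Python cell.
spark.sql("""
CREATE TABLE IF NOT EXISTS Target (
  EmpID STRING, Name STRING, Division STRING, DOB STRING, Address STRING
) USING DELTA
""")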

Step 3: Insert records from the Source view into the Target table using a MERGE statement.

The ON clause compares every column, so a row that already exists in Target takes the WHEN MATCHED branch (an update that changes nothing) instead of being inserted a second time. This is what prevents duplicates.

Syntax:
MERGE INTO Target T
USING Source S
ON T.EmpID = S.EmpID
AND T.Name  = S.Name
AND T.Division  = S.Division
AND T.DOB = S.DOB
AND T.Address = S.Address
WHEN MATCHED THEN UPDATE 
SET
T.EmpID = S.EmpID
,T.Name  = S.Name
,T.Division  = S.Division
,T.DOB = S.DOB
,T.Address = S.Address
WHEN NOT MATCHED THEN INSERT
(
EmpID
,Name
,Division
,DOB
,Address
)
VALUES
(
S.EmpID
,S.Name
,S.Division
,S.DOB
,S.Address
)
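
If you prefer to stay in PySpark, the same merge can be written with the Delta Lake Python API. This is a sketch of the equivalent call, assuming the Target table from Step 2 and the Source DataFrame from Step 1:

from delta.tables import DeltaTable

# Match on every column, mirroring the ON clause of the SQL MERGE above.
cols = ["EmpID", "Name", "Division", "DOB", "Address"]
condition = " AND ".join(f"T.{c} = S.{c}" for c in cols)

(DeltaTable.forName(spark, "Target").alias("T")
    .merge(Source.alias("S"), condition)
    .whenMatchedUpdateAll()      # effectively a no-op: matched rows are identical on every column
    .whenNotMatchedInsertAll()   # insert only rows not already in Target
    .execute())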

Step 4: Check the Target table for duplicates.

Syntax:
SELECT 
EmpID, Name, Division, DOB, Address, COUNT(*)
FROM Target
GROUP BY
EmpID, Name, Division, DOB, Address
HAVING COUNT(*)>1

If the query returns no results, there are no duplicates in the table.
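
You can also confirm what the merge actually did. Delta records operation metrics for every MERGE, so the most recent table version shows how many rows were inserted versus updated; a quick check, assuming a Python cell:

# The latest history entry for Target includes operationMetrics such as
# numTargetRowsInserted and numTargetRowsUpdated.
spark.sql("DESCRIBE HISTORY Target") \
    .select("version", "operation", "operationMetrics") \
    .show(5, truncate=False)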

Thank you for reading and Happy Learning.

Ready to elevate your skills? Click here for my Python book and here for my Machine Learning book on Kindle.
