Have you ever faced a situation where records keep getting duplicated when you insert new data into an existing table in Databricks?
If so, this blog is for you.
Let’s start with a simple use case: inserting Parquet data from a folder in a data lake into a Delta table using Databricks.
Follow the steps below to get this done.
Step 1: Read the Source data
Open a Databricks notebook and read the source data in Parquet format, then register it as a temporary view.
Syntax:
Source = spark.read.parquet("/mnt/Ization/Duplicate/Sample/*.parquet")
Source.createOrReplaceTempView("Source")

Step 2: Create the Target table using DELTA
Syntax:
CREATE TABLE IF NOT EXISTS Target (
  EmpID STRING,
  Name STRING,
  Division STRING,
  DOB STRING,
  Address STRING
) USING DELTA

Step 3: Insert records from the Source view into the Target table using a MERGE statement. Because the match condition covers every column, a row identical to one already in the table updates it in place (effectively a no-op) instead of being inserted again, and only genuinely new rows are inserted. This is what prevents duplicates.
Syntax:
MERGE INTO Target T
USING Source S
  ON T.EmpID = S.EmpID
 AND T.Name = S.Name
 AND T.Division = S.Division
 AND T.DOB = S.DOB
 AND T.Address = S.Address
WHEN MATCHED THEN UPDATE SET
  T.EmpID = S.EmpID,
  T.Name = S.Name,
  T.Division = S.Division,
  T.DOB = S.DOB,
  T.Address = S.Address
WHEN NOT MATCHED THEN INSERT (EmpID, Name, Division, DOB, Address)
VALUES (S.EmpID, S.Name, S.Division, S.DOB, S.Address)
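To see why this never produces duplicates, the upsert semantics of MERGE can be sketched in plain Python (a minimal illustration of the logic only, not Spark code; the sample rows are made up): each incoming row is matched against the existing rows on all columns, matches are updated in place, and non-matches are appended.

```python
# Hypothetical sketch of MERGE upsert semantics.
# Each row is a tuple (EmpID, Name, Division, DOB, Address); the whole
# tuple acts as the match key, mirroring the ON clause above.

def merge(target, source):
    """Upsert source rows into target, keyed on the entire row."""
    existing = {row: i for i, row in enumerate(target)}
    for row in source:
        if row in existing:
            # WHEN MATCHED: update in place (a no-op here, because the
            # match key is the full row).
            target[existing[row]] = row
        else:
            # WHEN NOT MATCHED: insert the new row.
            target.append(row)
            existing[row] = len(target) - 1
    return target

target = [("1", "Asha", "HR", "1990-01-01", "Pune")]
source = [
    ("1", "Asha", "HR", "1990-01-01", "Pune"),   # already present
    ("2", "Ravi", "IT", "1988-05-12", "Mumbai"),  # new
]
merge(target, source)
# Re-running the same load adds nothing, because every row now matches.
merge(target, source)
```

Running the merge a second time with the same source leaves the target unchanged, which is exactly the idempotent behaviour the MERGE statement gives you over a plain INSERT.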

Step 4: Check the Target table for duplicates.
Syntax:
SELECT EmpID, Name, Division, DOB, Address, COUNT(*)
FROM Target
GROUP BY EmpID, Name, Division, DOB, Address
HAVING COUNT(*) > 1

Getting ‘Query returned no results’ means there are no duplicate rows in the table.
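The duplicate check itself boils down to counting identical rows and keeping only the groups seen more than once, which can be mirrored in plain Python (a hypothetical stand-alone sketch with made-up rows, not Spark code):

```python
from collections import Counter

# Sketch of SELECT ..., COUNT(*) ... GROUP BY ... HAVING COUNT(*) > 1:
# count identical rows and keep only those appearing more than once.

def find_duplicates(rows):
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}

rows = [
    ("1", "Asha", "HR", "1990-01-01", "Pune"),
    ("2", "Ravi", "IT", "1988-05-12", "Mumbai"),
    ("1", "Asha", "HR", "1990-01-01", "Pune"),  # duplicate
]
dupes = find_duplicates(rows)
```

An empty result from this check corresponds to the ‘Query returned no results’ outcome in Step 4.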
Thank you for reading and Happy Learning.