Data models should be filtered and normalized. This is particularly important if datasets are sourced from high data volume repositories such as enterprise data warehouses.
How to handle duplicate records while inserting data in Databricks
Have you ever faced a challenge where records keep getting duplicated when you are inserting some new data into an existing table in Databricks? If yes, then this blog is for you. Let’s start with a simple use case: Inserting parquet data from one folder in Datalake to a Delta table using Databricks. Follow the... Continue Reading →
How to Copy Files from SharePoint to Datalake using Azure Data factory
Copying files from SharePoint to Datalake or any other target location is one task you cannot ignore as a Data Engineer. Someday someone for sure is going to ask you to do that. So, how can you achieve that? Suppose we need to copy an excel file from SharePoint to Datalake Gen 2. You need... Continue Reading →
How to read .xlsx file in Databricks using Pandas
Step 1: In order to read .xlsx file, you need to have the library openpyxl installed in the Databricks cluster. Steps to install library openpyxl to Databircks cluster: Step 1: Select the Databricks cluster where you want to install the library. Step 2: Click on Libraries. Step 3: Click on Install New. Step 4: Select PyPI. Step 5: Put openpyxl in the text box under Package... Continue Reading →
How to Create Login and Password in SQL Server
This post is to help you create a new login and password for the users who want to use the data in SQL database or Warehouse for Analysis and Reporting.
How to read .csv and .xlsx file in Databricks
How to read .xlsx file: Step 1: In order to read .xlsx file, you need to have the library com.crealytics:spark-excel_2.11:0.12.2 installed in the Databricks cluster. Steps to install library com.crealytics:spark-excel_2.11:0.12.2 to Databircks cluster: Step 1: Select the Databricks cluster where you want to install the library. Step 2: Click on Libraries. Step 3: Click on... Continue Reading →