Essential Tools to Automate, Clean, and Move Data Efficiently with Azure SQL Data Warehouse
As a data engineer, your job isn’t just about moving data. It’s about doing it reliably, efficiently, and repeatably, especially when working with cloud data platforms like Azure SQL Data Warehouse (now Azure Synapse Analytics). Python is one of the best tools for automating workflows, cleaning data, and interacting with Azure SQL DW seamlessly.
In this post, we’ll cover 5 essential Python scripts every data engineer should know, with a focus on working with Azure SQL Data Warehouse.
1. Data Extraction from APIs
Start by pulling data from external APIs to bring your source data into your workflow.
import requests
import pandas as pd

# Call the API with an authorization header
url = "https://api.example.com/data"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail fast on HTTP errors

# Flatten the JSON results into a DataFrame and save for the next step
data = response.json()
df = pd.DataFrame(data["results"])
df.to_csv("api_data.csv", index=False)
Use case: Automate pulling data from third-party APIs for processing and loading.
2. Automated Data Cleaning Script
Prepare raw data for efficient loading by cleaning and standardizing it.
import pandas as pd
df = pd.read_csv("raw_data.csv")
# Clean column names
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]
# Handle missing values
df.fillna({"revenue": 0, "region": "Unknown"}, inplace=True)
# Convert date strings to datetime
df["date"] = pd.to_datetime(df["date"])
df.to_csv("clean_data.csv", index=False)
Use case: Clean data before loading it into Azure SQL DW.
3. Load Data to Azure SQL Data Warehouse
You can use Python’s pyodbc or SQLAlchemy libraries to connect to Azure SQL DW and load data efficiently.
Using pyodbc and pandas to load a cleaned CSV:
import pandas as pd
import pyodbc
# Read cleaned data
df = pd.read_csv("clean_data.csv")
# Azure SQL DW connection details
server = 'your_server.database.windows.net'
database = 'your_database'
username = 'your_username'
password = 'your_password'
driver = '{ODBC Driver 17 for SQL Server}'
conn_str = f'DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}'
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
# Create the table if it doesn't exist (T-SQL has no CREATE TABLE IF NOT EXISTS)
cursor.execute("""
IF OBJECT_ID('sales_data') IS NULL
CREATE TABLE sales_data (
    date DATE,
    product VARCHAR(50),
    revenue FLOAT,
    region VARCHAR(50)
)
""")
conn.commit()

# Insert data row by row (for large datasets, consider a bulk insert)
for index, row in df.iterrows():
    cursor.execute("""
        INSERT INTO sales_data (date, product, revenue, region)
        VALUES (?, ?, ?, ?)
    """, row['date'], row['product'], row['revenue'], row['region'])
conn.commit()

cursor.close()
conn.close()
Note: Row-by-row inserts are slow at scale. For large datasets, consider pyodbc’s fast_executemany (sketched below) or Azure Data Factory for optimized bulk loading.
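Here is a minimal sketch of a faster batch insert using pyodbc’s fast_executemany flag. It assumes the same connection string, sales_data table, and clean_data.csv used above; pandas’ to_sql with an SQLAlchemy engine is another common option.

import pandas as pd
import pyodbc

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your_server.database.windows.net;"
    "DATABASE=your_database;"
    "UID=your_username;PWD=your_password"
)

df = pd.read_csv("clean_data.csv")
# Convert the DataFrame to a list of plain tuples for executemany
records = list(df[["date", "product", "revenue", "region"]].itertuples(index=False, name=None))

conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameters in bulk instead of one round trip per row
cursor.executemany(
    "INSERT INTO sales_data (date, product, revenue, region) VALUES (?, ?, ?, ?)",
    records,
)
conn.commit()
cursor.close()
conn.close()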
4. File Watcher and Mover Script
Monitor an Azure Blob Storage container or local directory to trigger loads.
Here is a local folder watcher example (a Blob Storage variant is sketched after the use case below):
import os
import time
import shutil
watch_folder = "/data/incoming"
processed_folder = "/data/processed"
while True:
    files = os.listdir(watch_folder)
    for file in files:
        if file.endswith(".csv"):
            src = os.path.join(watch_folder, file)
            dest = os.path.join(processed_folder, file)
            shutil.move(src, dest)
            print(f"Moved file: {file}")
    time.sleep(10)
Use case: Trigger pipelines when new data files arrive.
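For cloud storage, the same pattern can run against a Blob Storage container with the azure-storage-blob package. This is a minimal sketch: the connection string and the "incoming"/"processed" container names are placeholders, and since Blob Storage has no move operation it copies then deletes (in production you would wait for the copy to complete before deleting).

import time
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("your_storage_connection_string")
incoming = service.get_container_client("incoming")    # placeholder container names
processed = service.get_container_client("processed")

while True:
    for blob in incoming.list_blobs():
        if blob.name.endswith(".csv"):
            source = incoming.get_blob_client(blob.name)
            # No native "move" in Blob Storage: copy to the processed container, then delete
            processed.get_blob_client(blob.name).start_copy_from_url(source.url)
            source.delete_blob()
            print(f"Moved blob: {blob.name}")
    time.sleep(10)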
5. Data Quality Validator
Validate your data before loading it into Azure SQL DW.
import pandas as pd
df = pd.read_csv("clean_data.csv")
# Null value check
null_summary = df.isnull().sum()
# Duplicate check
duplicates = df.duplicated().sum()
# Schema check
expected_cols = ["date", "product", "revenue", "region"]
missing_cols = [col for col in expected_cols if col not in df.columns]
print("Null values:\n", null_summary)
print("Duplicates:", duplicates)
print("Missing columns:", missing_cols)
Use case: Ensure data quality before ingestion.
Bonus Tips
- Use Azure Blob Storage as a staging area for large files, then load into Azure SQL DW using PolyBase or the COPY statement for speed (see the sketch after this list).
- Secure your connection using Azure Active Directory Authentication or Managed Identity where possible.
- Use Azure Data Factory or Synapse Pipelines for orchestrating these Python scripts alongside cloud-native activities.
- Use Azure Key Vault to manage credentials securely (also shown in the sketch below).
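To make the staging and credentials tips concrete, here is a minimal sketch that pulls the SQL password from Azure Key Vault and runs a COPY statement against a staged CSV. The vault name, secret name, storage account, container, and table are all placeholders, and the Managed Identity credential assumes the SQL DW/Synapse instance has been granted access to the storage account.

import pyodbc
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Fetch the database password from Key Vault instead of hard-coding it
secret_client = SecretClient(
    vault_url="https://your-vault.vault.azure.net",
    credential=DefaultAzureCredential(),
)
password = secret_client.get_secret("sql-password").value  # placeholder secret name

conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your_server.database.windows.net;"
    "DATABASE=your_database;"
    f"UID=your_username;PWD={password}"
)

# Load the staged file from Blob Storage in one bulk operation
copy_sql = """
COPY INTO dbo.sales_data
FROM 'https://yourstorageaccount.blob.core.windows.net/staging/clean_data.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)
"""

conn = pyodbc.connect(conn_str)
conn.execute(copy_sql)
conn.commit()
conn.close()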
Conclusion: Python + Azure SQL Data Warehouse for Modern Data Engineering
Python scripts give you the flexibility to automate, clean, validate, and load data efficiently into Azure SQL Data Warehouse. Master these essential scripts, and you’ll build reliable ETL pipelines ready for modern cloud data platforms.