Running Apache Airflow as a Specific User: A Security Best Practice

In data engineering, security is paramount. One common best practice is to run services like Apache Airflow under their own non-privileged system user. If a vulnerability in the service is exploited, the attacker gains only the permissions of that user, not those of root or of a developer's account, which limits how much damage can be done.

Why Run Airflow as a Specific User?

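Running Airflow as a dedicated user, separate from the user who develops the code, applies the principle of least privilege. Tasks can read and write only what that account is allowed to touch, a buggy or compromised DAG cannot modify a developer's files, and anything the service does is easy to tell apart from a person's actions in logs and file ownership.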

How to Implement This Strategy

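In outline, the steps are: create a dedicated system account (for example, one named airflow) that cannot log in interactively, make that account the owner of the Airflow home directory, and start the Airflow scheduler and webserver under that account rather than under your own.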
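
Once Airflow is started under its own account, a small startup check can confirm that it really is running that way. The sketch below is a minimal, standard-library illustration and not part of Airflow itself: the account name airflow and the helper names current_username and service_user_problems are assumptions made for this example.

```python
import os
import pwd
from typing import List


def current_username() -> str:
    """Return the login name for the effective UID of this process."""
    uid = os.geteuid()
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # Some containers run with a UID that has no passwd entry.
        return str(uid)


def service_user_problems(expected: str, actual: str, uid: int) -> List[str]:
    """List the reasons this process should not be running the service."""
    problems = []
    if uid == 0:
        problems.append("running as root")
    if actual != expected:
        problems.append(f"expected user {expected!r} but running as {actual!r}")
    return problems


# "airflow" is an assumed account name; substitute your own.
for problem in service_user_problems("airflow", current_username(), os.geteuid()):
    print("warning:", problem)
```

An empty list means the process is running as the expected non-root account. A real deployment might exit instead of printing a warning.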

Connecting MongoDB to MariaDB using Apache Airflow

Now, let’s discuss how to move data from MongoDB to MariaDB. Here’s a basic script that reads data from MongoDB and writes it to a MariaDB table. It is plain Python, so it can be run on its own or called from an Airflow task:

Python code:

from datetime import date

import mariadb
from pymongo import MongoClient

# MongoDB connection
client = MongoClient('mongodb://localhost:27017/')
db = client['your_database']
collection = db['your_collection']

# MariaDB connection (replace the placeholder credentials with your own)
conn = mariadb.connect(user='airflow', password='password', database='your_database')
cur = conn.cursor()

# Create table if not exists.
# The final MAXVALUE partition catches any ds on or after 2024-03-01;
# without it, inserting a later date fails because no partition matches.
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_linkedin_post (
        post_url VARCHAR(255),
        post_id BIGINT,
        post_date DATE,
        ds DATE,
        PRIMARY KEY (post_id, ds)
    ) PARTITION BY RANGE COLUMNS(ds) (
        PARTITION p0 VALUES LESS THAN ('2024-01-01'),
        PARTITION p1 VALUES LESS THAN ('2024-02-01'),
        PARTITION p2 VALUES LESS THAN ('2024-03-01'),
        PARTITION pmax VALUES LESS THAN (MAXVALUE)
    );
""")
conn.commit()

# Snapshot date for this run, computed once so every row gets the same ds
ds = date.today()

# Process MongoDB documents
for doc in collection.find({}):
    # Extract attributes
    post_url = doc['postUrl']
    post_id = doc['postId']
    post_date = doc['postDate']

    # Insert, or update the row that already has this (post_id, ds) key.
    # ds is part of the key, so it does not need to be updated.
    cur.execute("""
        INSERT INTO dim_linkedin_post (post_url, post_id, post_date, ds)
        VALUES (%s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
        post_url = VALUES(post_url),
        post_date = VALUES(post_date)
    """, (post_url, post_id, post_date, ds))

conn.commit()

# Release both connections
cur.close()
conn.close()
client.close()


This script reads documents from a MongoDB collection, extracts the postUrl, postId, and postDate attributes, and inserts them into a MariaDB table called dim_linkedin_post, stamping each row with the current date as ds. The ds column is used to partition the table by date. If a record with the same post_id and ds already exists, the script updates that record instead of inserting a duplicate.
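
The partition list in the table definition is written by hand, one month at a time. As a sketch of how such a list could be generated instead, the helper below builds monthly boundaries plus a catch-all partition for every later date. The function names month_starts and partition_clauses are hypothetical, and the p0, p1, ... naming simply mirrors the table definition above.

```python
from datetime import date
from typing import List


def month_starts(first: date, count: int) -> List[date]:
    """Return `count` consecutive first-of-month dates, starting at first's month."""
    year, month = first.year, first.month
    starts = []
    for _ in range(count):
        starts.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return starts


def partition_clauses(first: date, count: int) -> str:
    """Build the partition list for PARTITION BY RANGE COLUMNS(ds).

    Each partition holds rows whose ds is strictly below its boundary.
    The last MAXVALUE partition accepts every later date.
    """
    lines = [
        f"PARTITION p{i} VALUES LESS THAN ('{boundary.isoformat()}')"
        for i, boundary in enumerate(month_starts(first, count))
    ]
    lines.append("PARTITION pmax VALUES LESS THAN (MAXVALUE)")
    return ",\n".join(lines)


print(partition_clauses(date(2024, 1, 1), 3))
```

The returned text can be placed between the parentheses that follow PARTITION BY RANGE COLUMNS(ds). Because the boundaries come from date objects rather than user input, formatting them directly into the SQL string is safe here.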

Remember, security is a journey, not a destination. Always stay informed about the latest best practices and continually review and update your security measures as necessary. Happy coding! 😊