Running Apache Airflow as a Specific User: A Security Best Practice
In the world of data engineering, security is paramount. One common best practice is to run services like Apache Airflow under their own non-privileged system user. This strategy helps to improve system security by limiting the potential impact of any security vulnerabilities that might be present in the service.
Why Run Airflow as a Specific User?
Running Airflow as a specific user, separate from the user who is developing the code, provides several benefits:
Security: If a malicious actor gains access to the Airflow service, they would only have the same permissions as the airflow user, which are typically very limited.
Isolation: Running Airflow as a separate user helps to prevent unintentional interference between different parts of your system.
Auditability: It’s easier to track which actions were performed by the Airflow service if it’s run as a separate user.
How to Implement This Strategy
Here’s a step-by-step guide on how to implement this strategy:
Create the airflow user: This can typically be done using the useradd command on Unix-based systems.
Set up your development environment: As user_who_is_developing, write your code and set up your Airflow pipelines. You can use an Integrated Development Environment (IDE) like Visual Studio Code (VS Code) for this.
Run Airflow commands as the airflow user: If user_who_is_developing has sudo privileges, you can use the sudo command to execute Airflow commands as the airflow user. For example, to start the Airflow webserver, you would use the command sudo -u airflow airflow webserver.
Set up VS Code to run commands as the airflow user: You can configure VS Code so that its integrated terminal opens a shell as the airflow user. This can typically be done in the .vscode/settings.json file in your workspace.
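For the VS Code step, one approach is a terminal profile in .vscode/settings.json that opens the integrated terminal through sudo as the airflow user. The profile name here is an assumption, and this relies on user_who_is_developing having the sudo rights described above:

```json
{
  "terminal.integrated.profiles.linux": {
    "airflow": {
      "path": "sudo",
      "args": ["-u", "airflow", "-i"]
    }
  },
  "terminal.integrated.defaultProfile.linux": "airflow"
}
```

With this in place, each new integrated terminal starts a login shell as airflow, so commands such as airflow webserver run under the service account rather than your own.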
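The first and third steps above can be sketched as shell commands. This is a sketch rather than a hardened setup: the GNU useradd flags and the /opt/airflow home directory are assumptions for a Debian/Ubuntu-style system, and creating the user requires root.

```shell
# Create a dedicated, non-login system user for Airflow (requires root).
# Home directory and shell path are assumptions for a Debian/Ubuntu-style system.
if id airflow >/dev/null 2>&1; then
    echo "user 'airflow' already exists"
elif [ "$(id -u)" -eq 0 ]; then
    useradd --system --create-home --home-dir /opt/airflow \
            --shell /usr/sbin/nologin airflow
else
    echo "re-run as root to create the 'airflow' user"
fi

# Once the user exists, run Airflow commands under it via sudo, e.g.:
# sudo -u airflow airflow webserver
```

Giving the service account a nologin shell means nobody can log in as airflow interactively; sudo -u still works because it invokes the command directly.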
Connecting MongoDB to MariaDB using Apache Airflow
Now, let’s discuss how to connect MongoDB to MariaDB. Here’s a basic standalone script that reads data from MongoDB and writes it to a MariaDB table; in an Airflow deployment, this logic would typically run inside a task:
Python code:
from pymongo import MongoClient
import mariadb
from datetime import datetime

# MongoDB connection
client = MongoClient('mongodb://localhost:27017/')
db = client['your_database']
collection = db['your_collection']

# MariaDB connection
conn = mariadb.connect(user='airflow', password='password', database='your_database')
cur = conn.cursor()

# Create the partitioned table if it does not exist.
# The MAXVALUE catch-all partition matters: without it, inserting a ds of
# 2024-03-01 or later fails with "Table has no partition for value".
cur.execute("""
CREATE TABLE IF NOT EXISTS dim_linkedin_post (
    post_url VARCHAR(255),
    post_id BIGINT,
    post_date DATE,
    ds DATE,
    PRIMARY KEY (post_id, ds)
) PARTITION BY RANGE COLUMNS(ds) (
    PARTITION p0 VALUES LESS THAN ('2024-01-01'),
    PARTITION p1 VALUES LESS THAN ('2024-02-01'),
    PARTITION p2 VALUES LESS THAN ('2024-03-01'),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
)
""")
conn.commit()

# Process MongoDB documents
for doc in collection.find({}):
    # Extract attributes
    post_url = doc['postUrl']
    post_id = doc['postId']
    post_date = doc['postDate']
    ds = datetime.now().date()  # current date, used as the partition key

    # Insert or update in MariaDB (upsert on the (post_id, ds) primary key;
    # ds itself need not be updated since it is part of the key)
    cur.execute("""
        INSERT INTO dim_linkedin_post (post_url, post_id, post_date, ds)
        VALUES (%s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE
            post_url = VALUES(post_url),
            post_date = VALUES(post_date)
    """, (post_url, post_id, post_date, ds))

# Commit once after the loop, then clean up connections
conn.commit()
cur.close()
conn.close()
client.close()
This script reads documents from a MongoDB collection, extracts the post_url, post_id, and post_date attributes, sets ds to the current date, and upserts each row into a MariaDB table called dim_linkedin_post. The ds column partitions the data by date; if a row with the same post_id and ds already exists, it is updated in place.
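The per-document extraction in the script can also be factored into a small, easily tested helper. This is a sketch under the same assumptions as the script (the postUrl/postId/postDate field names); documents missing a required field are skipped rather than aborting the whole load:

```python
from datetime import date, datetime

def doc_to_row(doc, ds=None):
    """Map a MongoDB document to a (post_url, post_id, post_date, ds) tuple.

    Returns None when a required field is missing or post_id is not an
    integer, so callers can skip bad documents instead of crashing mid-load.
    """
    try:
        post_url = doc["postUrl"]
        post_id = int(doc["postId"])
        post_date = doc["postDate"]
    except (KeyError, TypeError, ValueError):
        return None
    if ds is None:
        ds = datetime.now().date()  # default partition key: today's date
    return (post_url, post_id, post_date, ds)
```

The main loop then becomes row = doc_to_row(doc) followed by the same upsert, executed only when row is not None.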
Remember, security is a journey, not a destination. Always stay informed about the latest best practices and continually review and update your security measures as necessary. Happy coding! 😊