When working with continuously generated data, especially from sources like satellite feeds or automated sensors, manually setting directory paths for each upload session quickly becomes cumbersome. In this blog post, I'll walk you through how to automate selecting the most recently created directory for file uploads (in my case, EMWIN text files for a specific region), using a Bash script running on a Raspberry Pi that collects data from the GOES-16 satellite.
Background
For this project, we have a Raspberry Pi configured to receive data from the GOES-16 satellite. The data is stored locally in directories named after each day (e.g., 2025-02-01). Because the directories follow UTC, a new one is created for the next day at 18:00 CST, which corresponds to 00:00 UTC. This setup ensures that data is segmented by the date of collection.
However, a challenge arises when trying to automate the upload of these files to an AWS S3 bucket for further processing or backup. We need the script to automatically identify the latest directory created by the data collection process and use it for uploads, ensuring that no manual adjustments are needed when the date changes.
Solution: Dynamic Directory Selection
Step 1: Script Setup
The script starts by defining the base directory where new daily directories are created:
BASE_DIR="/home/pi/goes16/emwinTEXT/emwin/"
Step 2: Find the Latest Directory
To find the most recently created directory, which will contain the latest data, we use the ls command combined with sorting options:
LATEST_DIR=$(ls -dt $BASE_DIR*/ | head -1)
This command lists directories sorted by modification time (-t), with the most recent first, and head -1 picks the top directory from this list.
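Parsing ls output works fine here because the daily directories have simple names, but if you'd rather avoid it, a roughly equivalent approach using GNU find is sketched below. This is my own alternative, not part of the original script:
# Alternative sketch (not the original script): pick the newest directory
# without parsing ls output. Assumes GNU find (for -printf) and directory
# names without embedded newlines; the trailing slash mirrors the ls output.
LATEST_DIR=$(find "$BASE_DIR" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %p/\n' | sort -rn | head -1 | cut -d' ' -f2-)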
Step 3: Set the Text Directory and S3 Bucket Path
With the latest directory identified, the script sets the TEXT_DIR to this path and formats the S3 bucket path to mirror the directory structure on the local system:
TEXT_DIR=$LATEST_DIR
S3_BUCKET_PATH="s3://your_s3_bucket/kmob-texts/$(basename $LATEST_DIR)/"
Step 4: File Handling and Upload
The script checks for the existence of a log file to track uploaded files. If it doesn't exist, it's created; otherwise, the script logs that it's already present:
LOG_FILE="/home/pi/kmob_upload_log_$(basename $LATEST_DIR).txt"
if [ ! -f "$LOG_FILE" ]; then
touch "$LOG_FILE"
chmod 664 "$LOG_FILE"
echo "Created log file at $LOG_FILE"
else
echo "Log file already exists."
fi
Using find, the script searches for files within the TEXT_DIR that match the pattern *mobal*.txt (case-insensitive), indicating they are ready for upload:
find "$TEXT_DIR" -type f -iname "*mobal*.txt" -print | while read FILE; do
...
aws s3 cp "$FILE" "$S3_BUCKET_PATH$REL_PATH";
...
done
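The elided lines are where the script works out each file's path relative to TEXT_DIR and consults the log to avoid re-uploading. As a rough illustration only, here is one way that loop body could look; the REL_PATH calculation and the log check are my assumptions, not the original code:
# Sketch only: REL_PATH and the log check are assumed, not taken from the original script.
find "$TEXT_DIR" -type f -iname "*mobal*.txt" -print | while read -r FILE; do
    # Path relative to TEXT_DIR, so the S3 layout mirrors the local one
    REL_PATH="${FILE#$TEXT_DIR}"
    # Skip files already recorded in the upload log
    if grep -qxF "$REL_PATH" "$LOG_FILE"; then
        continue
    fi
    # Upload, and record the file in the log only if the copy succeeded
    if aws s3 cp "$FILE" "$S3_BUCKET_PATH$REL_PATH"; then
        echo "$REL_PATH" >> "$LOG_FILE"
    fi
done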
Conclusion
This automated approach ensures that data from the latest collection day is always uploaded without manual intervention, despite the directory changing daily at 18:00 CST due to the UTC offset. By dynamically selecting the most recent directory, the script effectively handles day transitions and maintains a consistent and up-to-date backup or processing pipeline.
This method not only simplifies the management of time-sensitive data but also ensures that data uploads are as current as possible, which is crucial for timely data analysis and decision-making.
Final Thoughts
Adapting this script to your specific needs might require minor modifications, especially if your directory structure or naming conventions differ. However, the principles demonstrated here should provide a solid foundation for automating data handling in similar scenarios, making your data processing workflows more efficient and reliable.