Use Atlas Data Federation and Atlas Scheduled Triggers to copy data from an Atlas cluster to an AWS S3 bucket in Apache Parquet format. Parquet is a columnar format suited for analytics and machine learning workloads that expect data as files rather than documents. Run copies on a recurring schedule to offload analytics queries from your operational cluster.
About this Task
The tutorial uses a delta approach, which means each Trigger run copies documents from the past 60 seconds. An alternative is a full snapshot, which copies the entire collection each time. The right approach depends on your data volume and the requirements of downstream consumers.
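The difference between the two approaches comes down to whether the pipeline filters on time before it writes. The following is a minimal sketch, not part of the tutorial's Trigger code: `buildSelectionStages` is a hypothetical helper, and it assumes each document has a `time` field like the test documents created later in this tutorial.

```javascript
// Hypothetical helper (not an Atlas API): returns the pipeline stages that
// decide which documents a copy run reads.
function buildSelectionStages(mode, now = new Date(), windowMs = 60 * 1000) {
  if (mode === "snapshot") {
    // Full snapshot: no filter, so every document is copied on each run.
    return [];
  }
  if (mode === "delta") {
    // Delta: only documents whose `time` falls within the last windowMs.
    const start = new Date(now.getTime() - windowMs);
    return [{ $match: { time: { $gt: start, $lt: now } } }];
  }
  throw new Error(`Unknown mode: ${mode}`);
}
```

With the delta approach, each run writes a small file and reads little from the cluster. A snapshot rereads the whole collection, so its cost grows with collection size.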
The maxFileSize and maxRowGroupSize values in this tutorial are optimized for testing, not production. For production workloads, review the $out stage options and adjust file sizes and partitioning based on your query patterns.
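One way to make these values easy to adjust is to build the $out stage from named settings rather than hard-coding it. This is a sketch only: `buildS3OutStage` is a hypothetical helper, and the sizes shown are the tutorial's test values, not production recommendations.

```javascript
// Hypothetical helper (not an Atlas API): assembles the $out stage from named
// settings so that file-size tuning lives in one place.
function buildS3OutStage({ bucket, region, filename, maxFileSize, maxRowGroupSize }) {
  return {
    $out: {
      s3: {
        bucket,
        region,
        filename,
        format: { name: "parquet", maxFileSize, maxRowGroupSize }
      }
    }
  };
}

// The tutorial's test-oriented values. Revisit them before production use.
const testSettings = {
  bucket: "YOUR_S3_BUCKET_NAME",
  region: "YOUR_AWS_REGION",
  filename: "events",
  maxFileSize: "10GB",
  maxRowGroupSize: "100MB"
};
const outStage = buildS3OutStage(testSettings);
```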
Before you Begin
Before you start this tutorial, complete the following tasks:
Create an Atlas account with a cluster that has the data you want to copy. To get started, see Create a Cluster.
Create an AWS account with privileges to create IAM roles and S3 buckets. To configure the required permissions for Atlas Data Federation, see Deploy a Federated Database Instance Data Store.
Install and configure the AWS CLI.
Steps
Deploy a Federated Database Instance with S3 and Atlas data stores.
A federated database instance consolidates multiple data sources into a single queryable interface. In this tutorial, you connect your S3 bucket and your Atlas cluster as data stores in the same federated database instance. Connecting both data stores lets the copy Trigger read from the cluster and write to S3.
Deploy a federated database instance with an S3 data store. To learn how, see Deploy a Federated Database Instance Data Store. When you configure the S3 data store, grant the IAM role Read and write access to the bucket so that Atlas Data Federation can write Parquet files.
Add your Atlas cluster as a second data store in the federated database instance.
After you complete these steps, note the name of your federated database instance service. You need this name in a later step.
Create a Scheduled Trigger to insert test documents.
Create a Scheduled Trigger that inserts a new document into your cluster every minute. This generates test data so you can verify that the copy Trigger works.
In Atlas, go to the Triggers page.
If it's not already displayed, select the organization that contains your project from the Organizations menu in the navigation bar.
If it's not already displayed, select your project from the Projects menu in the navigation bar.
In the sidebar, click Triggers under the Streaming Data heading.
The Triggers page displays.
Click Add Trigger.
Select Scheduled as the Trigger Type.
In Trigger Details, set the following configuration:
Setting          Value
Trigger Name     Create_Event_Every_Min_Trigger
Schedule Type    Basic
Interval         Every 1 minute
Event Type       Function

In the Function section, select + New Function and enter the following code. Replace the placeholder values with the names of your Atlas service, database, and collection.
exports = async function () {
  // Look up the linked Atlas cluster service by name.
  const mongodb = context.services.get("NAME_OF_YOUR_ATLAS_SERVICE");
  const db = mongodb.db("NAME_OF_YOUR_DATABASE");
  const events = db.collection("NAME_OF_YOUR_COLLECTION");

  // insertOne() returns a Promise, so wait for the result before serializing it.
  const result = await events.insertOne({
    time: new Date(),
    aNumber: Math.random() * 100,
    type: "event"
  });

  return JSON.stringify(result);
};

Click Save.
After the Trigger runs, confirm that new documents appear in your cluster collection every minute.
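If you prefer to check from code rather than the Atlas UI, a count of recent test documents is one option. The sketch below is an assumption-laden example, not part of the tutorial: `recentEventsQuery` and `countRecentEvents` are hypothetical helpers, and the code assumes the collection handle exposes a `count(query)` method that returns a Promise, as collection handles in Atlas Functions are expected to.

```javascript
// Hypothetical helper: builds a filter for test documents inserted in the
// last `minutes` minutes, based on the `type` and `time` fields the insert
// Trigger writes.
function recentEventsQuery(minutes, now = new Date()) {
  const since = new Date(now.getTime() - minutes * 60 * 1000);
  return { type: "event", time: { $gte: since } };
}

// Hypothetical helper: counts matching documents. `collection` is assumed to
// be a collection handle with a Promise-returning count(query) method.
async function countRecentEvents(collection, minutes, now = new Date()) {
  return collection.count(recentEventsQuery(minutes, now));
}

// Possible usage inside an Atlas Function (placeholders as in the tutorial):
//   const coll = context.services.get("NAME_OF_YOUR_ATLAS_SERVICE")
//     .db("NAME_OF_YOUR_DATABASE").collection("NAME_OF_YOUR_COLLECTION");
//   return countRecentEvents(coll, 5);
```

With the insert Trigger running every minute, a five-minute window should report roughly five documents once the Trigger has been active that long.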
Create a Scheduled Trigger to copy data to S3.
Create a Scheduled Trigger that runs an aggregation pipeline using the $out stage to copy recent documents from your cluster to your S3 bucket in Parquet format every minute.
On the Triggers page, click Add Trigger.
Select Scheduled as the Trigger Type.
In Trigger Details, set the following configuration:
Setting          Value
Trigger Name     Copy_Events_To_S3_Trigger
Schedule Type    Basic
Interval         Every 1 minute
Event Type       Function

In the Function section, select + New Function and enter the following code. Replace the placeholder values with the names of your federated database instance service, virtual database, virtual collection, S3 bucket, and AWS region.
exports = function () {
  // Look up the federated database instance service by name.
  const service = context.services.get("NAME_OF_YOUR_FEDERATED_DATA_SERVICE");
  const db = service.db("NAME_OF_YOUR_VIRTUAL_DATABASE");
  const events = db.collection("NAME_OF_YOUR_VIRTUAL_COLLECTION");

  // Capture the current time once so both window bounds use the same instant.
  const now = Date.now();

  const pipeline = [
    // Select documents from the past 60 seconds.
    {
      $match: {
        "time": {
          $gt: new Date(now - 60 * 1000),
          $lt: new Date(now)
        }
      }
    },
    // Write the matched documents to S3 in Parquet format.
    {
      "$out": {
        "s3": {
          "bucket": "YOUR_S3_BUCKET_NAME",
          "region": "YOUR_AWS_REGION",
          "filename": "events",
          "format": {
            "name": "parquet",
            "maxFileSize": "10GB",
            "maxRowGroupSize": "100MB"
          }
        }
      }
    }
  ];

  return events.aggregate(pipeline);
};

Click Save.
After the Trigger runs, confirm that a Parquet file named events appears in your S3 bucket.
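The copy Trigger measures its 60-second window back from the moment the function runs. If a run starts slightly late or early, consecutive windows can overlap or leave small gaps. One possible variation, shown here as a sketch and not as the tutorial's method, is to align each window to whole-minute boundaries. `previousMinuteWindow` and `buildAlignedMatchStage` are hypothetical helpers, and the approach still assumes the Trigger fires once in every calendar minute.

```javascript
const MINUTE_MS = 60 * 1000;

// Hypothetical helper: returns the most recent complete minute before `now`.
function previousMinuteWindow(now = new Date()) {
  // Round down to the start of the current minute; that is the window's end.
  const end = new Date(Math.floor(now.getTime() / MINUTE_MS) * MINUTE_MS);
  const start = new Date(end.getTime() - MINUTE_MS);
  return { start, end };
}

// Hypothetical helper: builds a $match stage for a half-open window
// [start, end), so a document on a boundary belongs to exactly one window.
function buildAlignedMatchStage(window) {
  return { $match: { time: { $gte: window.start, $lt: window.end } } };
}
```

Because each window ends where the next begins, two runs in adjacent minutes select disjoint sets of documents regardless of the exact second at which each run starts.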