Overview
Atlas Data Federation supports AWS S3 buckets as federated database instance stores. You must define mappings in your federated database instance to your AWS S3 bucket to run queries against your data.
Configuration File Format
To define a federated database instance store for an AWS S3 bucket, you can specify the configuration parameters in JSON. The configuration contains the AWS data store and maps it to virtual collections that you can query.
The JSON configuration for data in an AWS S3 bucket uses the following fields:
    {
      "stores" : [
        {
          "name" : "<string>",
          "provider": "<string>",
          "region" : "<string>",
          "bucket" : "<string>",
          "additionalStorageClasses" : ["<string>"],
          "prefix" : "<string>",
          "includeTags": <boolean>,
          "delimiter": "<string>",
          "public": <boolean>
        }
      ],
      "databases" : [
        {
          "name" : "<string>",
          "collections" : [
            {
              "name" : "<string>",
              "dataSources" : [
                {
                  "storeName" : "<string>",
                  "path" : "<string>",
                  "defaultFormat" : "<string>",
                  "provenanceFieldName": "<string>",
                  "omitAttributes": true | false
                }
              ]
            }
          ],
          "maxWildcardCollections" : <integer>,
          "views" : [
            {
              "name" : "<string>",
              "source" : "<string>",
              "pipeline" : "<string>"
            }
          ]
        }
      ]
    }
The JSON configuration for an AWS S3 data store contains two top-level objects: `stores` and `databases`.
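As a rough illustration of the two top-level objects, the following Python sketch assembles a skeleton configuration as a dictionary and serializes it to JSON. The helper name `build_storage_config` and the sample values (`myStore`, `my-bucket`, `myDb`) are hypothetical; only the field names come from the template above.

```python
import json


def build_storage_config(stores, databases):
    """Return a configuration with the two top-level objects.

    Hypothetical helper for illustration; it is not part of any Atlas API.
    """
    return {"stores": list(stores), "databases": list(databases)}


# Sample values are placeholders, not a real bucket or database.
config = build_storage_config(
    stores=[
        {
            "name": "myStore",
            "provider": "s3",
            "region": "us-east-1",
            "bucket": "my-bucket",
        }
    ],
    databases=[{"name": "myDb", "collections": []}],
)

print(json.dumps(config, indent=2))
```

Building the configuration as a dictionary first makes it easy to check it programmatically before saving it as JSON.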
stores
The `stores` object defines each data store associated with the federated database instance. This store captures files in an AWS S3 bucket. Data Federation can only access data stores defined within the `stores` object.
The `stores` object contains the following fields:
    "stores" : [
      {
        "name" : "<string>",
        "provider" : "<string>",
        "region" : "<string>",
        "bucket" : "<string>",
        "additionalStorageClasses" : ["<string>"],
        "prefix" : "<string>",
        "delimiter" : "<string>",
        "includeTags": <boolean>,
        "public": <boolean>
      }
    ]
The following table describes the fields in the `stores` object:
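Field | Type | Necessity | Description |
---|---|---|---|
`stores` | array | required | Array of objects where each object represents a data store to associate with the federated database instance. The store captures files in an AWS S3 bucket. Atlas Data Federation can only access data stores defined in the `stores` object. |
`stores.[n].name` | string | required | Name of the federated database instance store. The `databases.[n].collections.[n].dataSources.[n].storeName` field references this value as part of the mapping configuration. |
`stores.[n].provider` | string | required | Defines where the data is stored. Value must be `s3` for an AWS S3 bucket. |
`stores.[n].region` | string | required | Name of the AWS region in which the AWS S3 bucket is hosted. For a list of valid region names, see Amazon Web Services (AWS). |
`stores.[n].bucket` | string | required | Name of the AWS S3 bucket. Must exactly match the name of an AWS S3 bucket which Atlas Data Federation can access with the configured AWS IAM credentials. |
`stores.[n].additionalStorageClasses` | array | optional | Array of AWS S3 storage classes. Atlas Data Federation includes the files in these storage classes in the query results. Valid values are `INTELLIGENT_TIERING` and `STANDARD_IA`. IMPORTANT: Files in the Standard storage class are supported by default. |
`stores.[n].prefix` | string | optional | Adds a prefix to search paths for files in the AWS S3 bucket. Atlas Data Federation prepends the value of `prefix` to the `databases.[n].collections.[n].dataSources.[n].path` to create the full path for files to ingest. If omitted, Atlas Data Federation searches all files from the root of the AWS S3 bucket. |
`stores.[n].delimiter` | string | optional | Sets a delimiter that separates path segments in the federated database instance store. Data Federation uses the delimiter to efficiently traverse AWS S3 buckets with a hierarchical directory structure. You can specify any character supported by the AWS S3 object keys as the delimiter. For example, you can specify an underscore (`_`). If omitted, defaults to `/`. |
`stores.[n].includeTags` | boolean | optional | Determines whether or not to use AWS S3 tags on the files in the given path as additional partition attributes. Valid values are `true` and `false`. If omitted, defaults to `false`. If set to `true`, Atlas Data Federation adds the AWS S3 tags as additional partition attributes and adds a top-level element for each tag to every document from the tagged files. WARNING: If set to `true`, Atlas Data Federation makes extra calls to AWS S3 to get the tags, which might impact performance. |
`stores.[n].public` | boolean | optional | Specifies whether the bucket is public. If set to `true`, Atlas Data Federation doesn't use the configured AWS IAM role to access the AWS S3 bucket. If set to `false`, the configured AWS IAM role must include permissions to access the bucket, even if the bucket is public. If omitted, defaults to `false`. |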
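The `prefix` and `delimiter` descriptions above can be sketched in Python. This is a minimal sketch, assuming that prepending the prefix is plain string concatenation, which is consistent with the example configuration later on this page but is not confirmed as the service's exact joining rule. The function names `full_search_path` and `path_segments` are hypothetical.

```python
def full_search_path(prefix: str, path: str) -> str:
    """Prepend the store prefix to a data source path.

    Assumption: plain string concatenation. The service's exact
    joining rules may differ.
    """
    return f"{prefix}{path}"


def path_segments(full_path: str, delimiter: str = "/") -> list:
    """Split a path into its non-empty segments using the store delimiter."""
    return [segment for segment in full_path.split(delimiter) if segment]


print(full_search_path("/metrics", "/hardware/{date date}"))
print(path_segments("/metrics/hardware/1564671291998.json"))
```

Passing a different `delimiter`, such as an underscore, splits keys that use that character to separate path segments.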
databases
The `databases` object defines the mapping between each federated database instance store defined in `stores` and MongoDB collections in the databases.
The `databases` object contains the following fields:
    "databases" : [
      {
        "name" : "<string>",
        "collections" : [
          {
            "name" : "<string>",
            "dataSources" : [
              {
                "storeName" : "<string>",
                "defaultFormat" : "<string>",
                "path" : "<string>",
                "provenanceFieldName": "<string>",
                "omitAttributes": <boolean>
              }
            ]
          }
        ],
        "maxWildcardCollections" : <integer>,
        "views" : [
          {
            "name" : "<string>",
            "source" : "<string>",
            "pipeline" : "<string>"
          }
        ]
      }
    ]
The following table describes the fields in the `databases` object:
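Field | Type | Necessity | Description |
---|---|---|---|
`databases` | array | required | Array of objects where each object represents a database, its collections, and, optionally, any views on the collections. Each database can have multiple `collections` and `views` objects. |
`databases.[n].name` | string | required | Name of the database to which Atlas Data Federation maps the data contained in the data store. |
`databases.[n].collections` | array | required | Array of objects where each object represents a collection and data sources that map to a `stores` federated database instance store. |
`databases.[n].collections.[n].name` | string | required | Name of the collection to which Atlas Data Federation maps the data contained in each `databases.[n].collections.[n].dataSources.[n].storeName`. You can generate collection names dynamically from file paths by specifying `*` for the collection name and the `collectionName()` function in the `path` field. |
`databases.[n].collections.[n].dataSources` | array | required | Array of objects where each object represents a `stores` federated database instance store to map with the collection. |
`databases.[n].collections.[n].dataSources.[n].storeName` | string | required | Name of a federated database instance store to map to the collection. Must match the `name` of an object in the `stores` array. |
`databases.[n].collections.[n].dataSources.[n].path` | string | required | Controls how Atlas Data Federation searches for and parses files in the `storeName` before mapping them to a collection. Atlas Data Federation prepends the `stores.[n].prefix` to the `path` to build the full path to search within. A `path` of `/` captures all files and folders from the prefix path. Appending the `*` wildcard character to the path includes all files and folders from that point in the path. See Define Path for S3 Data for more information. When specifying the `path`, specify the data type for the partition attribute, ensure that the partition attribute type matches the data type to parse, and use the delimiter specified in `delimiter`. When specifying attributes of the same type, do any of the following: add a constant separator between the attributes, or use regular expressions to describe the search pattern. |
`databases.[n].collections.[n].dataSources.[n].defaultFormat` | string | optional | Default format that Data Federation assumes if it encounters a file without an extension while searching the `storeName`. The following values are valid for the `defaultFormat` field: `.json`, `.json.gz`, `.bson`, `.bson.gz`, `.avro`, `.avro.gz`, `.orc`, `.tsv`, `.tsv.gz`, `.csv`, `.csv.gz`, `.parquet`. IMPORTANT: If your file format is CSV or TSV, you must include a header row in your data. If omitted, Data Federation attempts to detect the file type by processing a few bytes of the file. |
`databases.[n].collections.[n].dataSources.[n].provenanceFieldName` | string | optional | Name for the field that includes the provenance of the documents in the results. If you specify this setting in the storage configuration, Atlas Data Federation returns the following fields for each document in the result: the provider, the AWS region, the name of the AWS S3 bucket, the key (path to the document), and the date and time the document was last modified. You can't configure this setting using the Visual Editor in the Atlas UI. |
`databases.[n].collections.[n].dataSources.[n].omitAttributes` | boolean | optional | Flag that specifies whether to omit the attributes (key and value pairs) that Atlas Data Federation adds to documents in the collection. You can specify one of the following values: `false` to add the attributes, or `true` to omit the attributes. If omitted, defaults to `false` and Atlas Data Federation adds the attributes. For example, if a `path` defines an attribute parsed from the filename, Atlas Data Federation adds that attribute to each document when `omitAttributes` is `false` and doesn't add it when `omitAttributes` is `true`. |
`databases.[n].maxWildcardCollections` | integer | optional | Maximum number of wildcard `*` collections in the database. Each wildcard collection can have only one data source. Value can be between 1 and 1000, inclusive. If omitted, defaults to 100. |
`databases.[n].views` | array | optional | Array of objects where each object represents an aggregation pipeline on a collection. To learn more about views, see Views. |
`databases.[n].views.[n].name` | string | required | Name of the view. |
`databases.[n].views.[n].source` | string | required | Name of the source collection for the view. If you want to create a view with a $sql stage, you must omit this field as the SQL statement will specify the source collection. |
`databases.[n].views.[n].pipeline` | array | optional | Array of aggregation pipeline stage(s) to apply to the `source` collection. |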
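Because each data source refers to a store by name, a configuration can be checked for dangling references before it is saved. The following Python sketch shows one way to do that under the field names from the template above. The function name `undefined_store_names` and the sample values are hypothetical, and this check is not an Atlas feature.

```python
def undefined_store_names(config: dict) -> list:
    """Return every storeName that does not match a name in the stores array."""
    defined = {store["name"] for store in config.get("stores", [])}
    missing = []
    for database in config.get("databases", []):
        for collection in database.get("collections", []):
            for source in collection.get("dataSources", []):
                if source["storeName"] not in defined:
                    missing.append(source["storeName"])
    return missing


# Sample configuration with one valid and one dangling reference.
sample = {
    "stores": [{"name": "storeA", "provider": "s3"}],
    "databases": [
        {
            "name": "db",
            "collections": [
                {"name": "ok", "dataSources": [{"storeName": "storeA", "path": "/"}]},
                {"name": "bad", "dataSources": [{"storeName": "storeB", "path": "/"}]},
            ],
        }
    ],
}

print(undefined_store_names(sample))
```

An empty result means every `storeName` refers to a store defined in the `stores` array.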
Example Configuration for S3 Data Store
Consider an S3 bucket `datacenter-alpha` containing data collected from a datacenter:
    |--metrics
       |--hardware
The `/metrics/hardware` path stores JSON files with metrics derived from the datacenter hardware, where each filename is the UNIX timestamp in milliseconds of the 24-hour period covered by that file:

    /hardware/1564671291998.json
The following configuration:
- Defines a federated database instance store on the `datacenter-alpha` S3 bucket in the `us-east-1` AWS region. The federated database instance store is restricted to data files in the `metrics` folder path.
- Maps files from the `hardware` folder to a MongoDB database `datacenter-alpha-metrics` and collection `hardware`. The configuration mapping includes parsing logic for capturing the timestamp implied in the filename.
    {
      "stores" : [
        {
          "name" : "datacenter-alpha",
          "provider" : "s3",
          "region" : "us-east-1",
          "bucket" : "datacenter-alpha",
          "additionalStorageClasses" : [
            "STANDARD_IA"
          ],
          "prefix" : "/metrics",
          "delimiter" : "/"
        }
      ],
      "databases" : [
        {
          "name" : "datacenter-alpha-metrics",
          "collections" : [
            {
              "name" : "hardware",
              "dataSources" : [
                {
                  "storeName" : "datacenter-alpha",
                  "path" : "/hardware/{date date}"
                }
              ]
            }
          ]
        }
      ]
    }
Atlas Data Federation parses the S3 bucket `datacenter-alpha` and processes all files under `/metrics/hardware/`. The `collections` object uses the path parsing syntax to map the filename to the `date` field, which is an ISO-8601 date, in each document. If a matching `date` field does not exist in a document, Atlas Data Federation adds it.
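To make the filename-to-field mapping concrete, the following Python sketch derives a date from a filename like the one in this example and adds it to a document that lacks a `date` field. It assumes the filename stem is a UNIX timestamp in milliseconds, as described above. It only illustrates the idea and does not reproduce how Atlas Data Federation parses paths. The function names and the `cpu` field are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from pathlib import PurePosixPath

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_from_filename(key: str) -> datetime:
    """Read the filename stem as a UNIX timestamp in milliseconds (UTC)."""
    stem = PurePosixPath(key).stem  # "1564671291998" for ".../1564671291998.json"
    return EPOCH + timedelta(milliseconds=int(stem))


def with_date_field(document: dict, key: str) -> dict:
    """Return a copy of the document with a date field added if it is missing."""
    enriched = dict(document)
    enriched.setdefault("date", date_from_filename(key))
    return enriched


# The "cpu" field is a made-up sample metric.
doc = with_date_field({"cpu": 0.42}, "/metrics/hardware/1564671291998.json")
print(doc["date"].isoformat())  # 2019-08-01T14:54:51.998000+00:00
```

The sketch leaves an existing `date` field untouched, which is an assumption; the text above only states what happens when the field is missing.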
Users connected to the federated database instance can use the MongoDB Query Language and supported aggregations to analyze data in the AWS S3 bucket through the `datacenter-alpha-metrics.hardware` collection.