What is AWS DataSync

AWS DataSync is a secure, encrypted service that automates and accelerates the movement of data between on-premises storage and AWS storage services such as Amazon S3 or Amazon Elastic File System (EFS). DataSync can also transfer data between AWS storage services within an AWS account.

You can connect DataSync to NFS (Network File System) shares, SMB (Server Message Block) shares, Hadoop Distributed File System (HDFS) clusters, self-managed object storage and AWS Snowcone, and move data to Amazon S3, Amazon EFS, Amazon FSx for Windows File Server and Amazon FSx for Lustre file systems.


There are numerous reasons why you would deploy DataSync to move on-premises data to AWS infrastructure or to move data between AWS services.

If you intend to move data permanently to AWS, DataSync lets you migrate file and object data quickly. Data in transit is secured using in-flight encryption and end-to-end data validation.

You can also securely replicate your on-premises data into AWS storage for offsite backup and high availability, choosing whichever S3 storage class suits your budget.

If you have data you don’t need fast access to but want to archive at very low cost, DataSync can send it to the Amazon S3 Glacier storage classes, which have the lowest storage costs of the AWS storage offerings.

If you have hybrid workflows where multiple disparate systems need access to the same data, DataSync can seamlessly move data between multiple on-premises file systems and AWS.

DataSync can also be used to move or sync data between AWS storage services like S3, EFS or FSx, in the same or different accounts, so you can archive, replicate or share application data in a secure and scalable way.


DataSync makes it easier and faster to transfer terabytes of data to and from AWS storage services.

This is achieved using an AWS-designed transfer protocol, independent of the storage protocols at either end, which accelerates the movement of data. As well as handling in-transit encryption, the DataSync protocol calculates the optimal volume, timing and data types to transmit over the network. DataSync can also perform incremental transfers, in-line compression and real-time checksum validation as data is moved.

When you connect a local DataSync agent to cloud-based storage destinations, the connection is multi-threaded, which helps maximise the performance of the transfer.

DataSync handles the transfer process, so you do not have to write and optimize your own copy scripts or deploy and fine-tune commercial data transfer tools. The built-in monitoring verifies the integrity of moved files and objects and employs automatic retry mechanisms, so that what arrives at the destination storage matches the original.
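To give a feel for the kind of hand-rolled copy script this replaces, here is a minimal Python sketch that copies a single file, verifies it with a checksum and retries on a mismatch. It is only an illustration of the idea and not how DataSync works internally; the function names, the SHA-256 choice and the retry count are all assumptions made for the example.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(source, destination, max_attempts=3):
    """Copy source to destination, retrying until the checksums match.

    Returns the number of attempts used. Hypothetical helper for
    illustration only; a real script would also need to handle
    directories, permissions, partial transfers and so on.
    """
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(source, destination)
        if sha256_of(destination) == expected:
            return attempt
    raise IOError(f"checksum mismatch after {max_attempts} attempts")
```

Even this toy version leaves out scheduling, parallelism, throttling and metadata handling, which is the work a managed service takes off your hands.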

DataSync can consume up to 10 Gbps of network bandwidth, which is great if you need to move lots of data in a hurry, but not so great if you need that bandwidth for other workloads. DataSync has a number of granular controls to help optimise bandwidth consumption, including throttling transfer bandwidth during the times of day when network throughput is required for other work.
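The DataSync API expresses a bandwidth limit in bytes per second (the `BytesPerSecond` task option), so a limit quoted in Mbps needs converting first. A minimal sketch of that conversion follows; the helper name and the 200 Mbps figure are assumptions for illustration.

```python
def bandwidth_limit_options(limit_mbps=None):
    """Build a DataSync task Options fragment that caps throughput.

    limit_mbps is in megabits per second. With None, the option is
    left out so the task keeps its default behaviour.
    Hypothetical helper; only the BytesPerSecond key is DataSync's.
    """
    if limit_mbps is None:
        return {}
    if limit_mbps <= 0:
        raise ValueError("limit_mbps must be positive")
    # 1 megabit = 1,000,000 bits, and there are 8 bits in a byte.
    return {"BytesPerSecond": int(limit_mbps * 1_000_000 / 8)}


# Cap a task at 200 Mbps during business hours.
business_hours = bandwidth_limit_options(200)
```

With boto3, a dict like this could be passed as `Options` when creating a task with `create_task`, or to `update_task_execution` to adjust a transfer that is already running.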

As you would expect, DataSync has a built-in scheduler which allows you to set up periodic data transfer tasks, including incremental copying of files that have changed in your storage system. Tasks can be scheduled from the console or programmatically using the AWS CLI.

Scheduling options include hourly, daily or weekly.
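When a schedule is set programmatically, it is supplied as a rate or cron expression in the task's `Schedule` setting. The sketch below maps the three frequencies above to example expressions; the 02:00 UTC run time, the Sunday choice and the helper name are assumptions for illustration.

```python
# Example schedule expressions in the AWS cron/rate syntax.
# Cron fields: minutes hours day-of-month month day-of-week year.
SCHEDULE_EXPRESSIONS = {
    "hourly": "rate(1 hour)",
    "daily": "cron(0 2 * * ? *)",     # every day at 02:00 UTC
    "weekly": "cron(0 2 ? * SUN *)",  # every Sunday at 02:00 UTC
}


def schedule_for(frequency):
    """Return a Schedule dict for a DataSync task (hypothetical helper)."""
    try:
        expression = SCHEDULE_EXPRESSIONS[frequency]
    except KeyError:
        raise ValueError(f"unsupported frequency: {frequency!r}")
    return {"ScheduleExpression": expression}
```

The returned dict could then be passed as the `Schedule` argument to boto3's `create_task` or `update_task`.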

Data transferred using DataSync is encrypted in transit using TLS. DataSync also supports default encryption for S3 buckets, Amazon EFS encryption of data at rest, and Amazon FSx encryption of data both in transit and at rest.

DataSync will retain the metadata and file permissions for files and objects copied between supported AWS destinations and host file systems.

Native AWS security is fully integrated with DataSync which simplifies data movement from a security, monitoring and audit perspective. In addition to the native integration with the AWS file storage options, DataSync supports VPC endpoints using PrivateLink so you can move data directly into a VPC and use IAM to securely control DataSync access.

Monitoring DataSync

You can use Amazon CloudWatch to monitor transfers that are in progress and to review the history of completed transfers. The logs show the times and files transferred and the status of data integrity verification. CloudWatch Events are triggered upon transfer completion, which you can use to launch automations that depend on the transfer being completed.
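As a rough sketch of the completion trigger, the Python below builds an event pattern that matches successful task executions and shows a small handler that picks the execution ARN out of such an event. The field values (`aws.datasync`, `DataSync Task Execution State Change`, `State`) reflect my understanding of the DataSync event format and should be checked against a real event in your account; the function names are hypothetical.

```python
import json


def completion_event_pattern(states=("SUCCESS",)):
    """Return a JSON event pattern for DataSync task execution events."""
    return json.dumps({
        "source": ["aws.datasync"],
        "detail-type": ["DataSync Task Execution State Change"],
        "detail": {"State": list(states)},
    })


def completed_execution_arn(event):
    """Return the task execution ARN if the event reports success.

    Returns None for any other event. Assumes the execution ARN is
    the first entry in the event's "resources" list.
    """
    if event.get("source") != "aws.datasync":
        return None
    if event.get("detail", {}).get("State") != "SUCCESS":
        return None
    resources = event.get("resources", [])
    return resources[0] if resources else None
```

The pattern string is the sort of value you would attach to a rule, and the handler is the sort of logic a Lambda function behind that rule might run before kicking off the next step.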

DataSync Pricing

DataSync pricing is easy to understand. There is a flat fee per GB of data you transfer. The price is roughly US$0.0125 per GB transferred, but may vary depending on the region in which the data is located.
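To put that rate in context, here is a small Python sketch that estimates the DataSync fee for a given volume of data. It uses the approximate rate quoted above as its default and covers only the per-GB DataSync fee, not any storage or request charges on the source or destination; the function name is an assumption.

```python
def estimate_transfer_cost(gigabytes, price_per_gb=0.0125):
    """Estimate the DataSync per-GB fee in US dollars.

    price_per_gb defaults to the approximate rate quoted in this
    article; check the current price for your region before relying
    on the result.
    """
    if gigabytes < 0:
        raise ValueError("gigabytes must not be negative")
    return round(gigabytes * price_per_gb, 2)


# Moving 10 TB (10,240 GB) at the quoted rate.
ten_terabytes = estimate_transfer_cost(10 * 1024)
```

At the quoted rate, the 10 TB example works out to 10,240 × 0.0125 = US$128.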

How to deploy AWS DataSync

Starting with AWS DataSync is reasonably straightforward.

Moving Data from on-premise to AWS

First of all, you need to deploy a DataSync agent in your on-premises network and specify the file system or storage system you want to connect to, which will need to be one of the following:

A file system using the NFS or SMB protocols.
A self-managed object storage API endpoint.
A Hadoop cluster HDFS configuration.


You then select a destination, which can be:

Amazon S3 (any tier)
Amazon Elastic File System (EFS)
Amazon FSx for Windows File Server
Amazon FSx for Lustre
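The source and destination steps above can also be scripted. The sketch below assumes the agent is already deployed and activated, the source is an NFS share and the destination is an S3 bucket. It takes a boto3 DataSync client as a parameter (for example `boto3.client("datasync")`); the function name, task name and all argument values are placeholders.

```python
def create_nfs_to_s3_task(client, agent_arn, nfs_host, nfs_path,
                          bucket_arn, bucket_role_arn,
                          task_name="onprem-to-s3"):
    """Create source and destination locations plus a task.

    client is expected to be a boto3 DataSync client. Returns the
    ARN of the new task. Hypothetical helper for illustration.
    """
    # Source: the on-premises NFS share, reached through the agent.
    source = client.create_location_nfs(
        ServerHostname=nfs_host,
        Subdirectory=nfs_path,
        OnPremConfig={"AgentArns": [agent_arn]},
    )
    # Destination: an S3 bucket, accessed via an IAM role that
    # DataSync is allowed to assume.
    destination = client.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": bucket_role_arn},
    )
    # The task ties the two locations together.
    task = client.create_task(
        SourceLocationArn=source["LocationArn"],
        DestinationLocationArn=destination["LocationArn"],
        Name=task_name,
    )
    return task["TaskArn"]
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real account.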

Moving Data between AWS storage services in the same account

No service agent is required when you are moving data within your AWS account using DataSync. You simply set up your source and destination using the AWS DataSync console or API and start the data transfer.
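For the API route, a minimal Python sketch might look like the following. It assumes the source and destination locations already exist in DataSync and that you have their ARNs, and it takes a boto3 DataSync client as a parameter; the function name and task name are placeholders.

```python
def start_transfer(client, source_location_arn, destination_location_arn,
                   task_name="aws-to-aws-copy"):
    """Create a task between two existing locations and start it.

    client is expected to be a boto3 DataSync client. No agent is
    involved because both locations are AWS storage services.
    Returns the ARN of the running task execution.
    """
    task = client.create_task(
        SourceLocationArn=source_location_arn,
        DestinationLocationArn=destination_location_arn,
        Name=task_name,
    )
    execution = client.start_task_execution(TaskArn=task["TaskArn"])
    return execution["TaskExecutionArn"]
```

The returned execution ARN is what you would use to follow progress, for example with `describe_task_execution`.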



If you are building on AWS, you can automatically diagram your cloud infrastructure using Hava. When you connect an AWS account to Hava, several diagrams are auto generated including VPC infrastructure, security groups and ports, best practice compliance and containerized workloads.

Once created, your diagrams are automatically kept up to date without having to manually intervene. Hava polls your AWS config, detects changes and updates your diagrams automatically, so you always have up to date visualizations of your AWS infrastructure on hand.

(Hava also supports GCP and Azure)

You can try Hava for free.
