AWS Data Ingestion Methods
After reading this Medium post, you should have a good idea of how to choose an AWS service for data ingestion. Spend a few minutes on the flow diagrams, which help in understanding the data flow at each step.
Don’t forget to read the conclusion!
How AWS helps in data ingestion
AWS offers services and capabilities to quickly and easily ingest multiple types of data: real-time streaming data, bulk data assets from on-premises storage platforms, and data generated and processed by legacy on-premises platforms such as mainframes and data warehouses.
This post covers three AWS services for data ingestion:
- Amazon Kinesis Firehose
- AWS Snowball
- AWS Storage Gateway
Amazon Kinesis Firehose
- Amazon Kinesis Firehose is a fully managed service for delivering real-time streaming data directly to Amazon S3.
- Kinesis Firehose automatically scales to match the volume and throughput of streaming data, and requires no ongoing administration.
- Kinesis Firehose can also be configured to transform streaming data before it’s stored in Amazon S3. Its transformation capabilities include compression, encryption, data batching, and Lambda functions.
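To make this concrete, here is a minimal boto3 sketch that creates a Firehose delivery stream writing to S3. The stream name, bucket ARN, and IAM role ARN are hypothetical placeholders, and the buffering and compression settings are just one reasonable choice.

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# All names and ARNs below are placeholders for illustration.
firehose.create_delivery_stream(
    DeliveryStreamName="my-ingest-stream",
    DeliveryStreamType="DirectPut",  # producers write straight to the stream
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-data-lake-bucket",
        # Buffer up to 64 MB or 5 minutes, whichever comes first, so that
        # many small records land in S3 as one larger object.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)

# A producer then sends records to the stream:
firehose.put_record(
    DeliveryStreamName="my-ingest-stream",
    Record={"Data": b'{"event": "login", "user": "alice"}\n'},
)
```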
Note: Kinesis Firehose can concatenate multiple incoming records, and then deliver them to Amazon S3 as a single S3 object. This is an important capability because it reduces Amazon S3 transaction costs and transactions per second load.
Kinesis Firehose can invoke Lambda functions to transform incoming source data and deliver it to Amazon S3. Common transformations include converting Apache log and syslog formats to standardized JSON or CSV.
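A transformation Lambda follows Firehose's record contract: it receives a batch of base64-encoded records and must return each record with its recordId, a result status, and the re-encoded data. Below is a minimal sketch of an Apache common-log-to-JSON transform; the field parsing is deliberately simplified and the output field names are my own.

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data transformation: Apache common-log lines -> JSON."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        parts = line.split()
        try:
            # e.g. 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
            transformed = {
                "client_ip": parts[0],
                "method": parts[5].lstrip('"'),
                "path": parts[6],
                "status": int(parts[8]),
            }
            data = base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8"))
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": data.decode("utf-8")})
        except (IndexError, ValueError):
            # Malformed lines are flagged so Firehose can retry or divert them.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
    return {"records": output}
```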
AWS Snowball
Snowball is a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud. Using Snowball addresses common challenges with large-scale data transfers, including high network costs, long transfer times, and security concerns. Use it to migrate bulk data from on-premises storage platforms and Hadoop clusters to S3 buckets.
Follow the steps below:
- Create a job in the AWS Management Console for data transfer using Snowball (a scripted sketch of this step follows the list).
- A Snowball appliance will be automatically shipped to your address.
- After the Snowball arrives, connect it to your local network.
- Install the Snowball client on your on-premises data source.
- Use the Snowball client to select and transfer the file directories to the Snowball device.
- Ship the device back to AWS.
- Once AWS receives the device, the data is transferred from the Snowball device to your S3 bucket and stored as S3 objects in their original/native format.
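The job-creation step can also be scripted with boto3 instead of the console. A minimal sketch, assuming you have already registered a shipping address (create_address) and an IAM role that allows the import; every ID and ARN below is a placeholder.

```python
import boto3

snowball = boto3.client("snowball", region_name="us-east-1")

response = snowball.create_job(
    JobType="IMPORT",  # importing on-premises data into AWS
    Resources={
        "S3Resources": [
            {"BucketArn": "arn:aws:s3:::my-data-lake-bucket"}  # destination bucket
        ]
    },
    AddressId="ADID1234ab-1234-abcd-1234-1234abcd1234",  # from create_address
    RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",
    ShippingOption="SECOND_DAY",
    SnowballCapacityPreference="T80",
)

# Track the appliance and transfer status with this job ID.
print(response["JobId"])
```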
Note: The Snowball client uses 256-bit AES encryption. Encryption keys are never shipped with the Snowball device, so the data transfer process is highly secure.
AWS Storage Gateway
Integrate legacy on-premises data processing platforms with Amazon S3 data lakes using AWS Storage Gateway. The gateway exposes an NFS mount point to which applications write files.
- Files written to this mount point are converted to objects stored in Amazon S3 in their original format.
- Integrate applications and platforms that don't have native Amazon S3 capabilities, such as on-premises lab equipment, mainframe computers, databases, and data warehouses, with Amazon S3.
Note: This also allows data transfer from an on-premises Hadoop cluster to an S3 bucket.
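Once a file gateway appliance is activated, an NFS file share backed by an S3 bucket can be created with boto3. A minimal sketch; the gateway ARN, IAM role, bucket, and client CIDR are placeholders.

```python
import boto3

sgw = boto3.client("storagegateway", region_name="us-east-1")

share = sgw.create_nfs_file_share(
    ClientToken="unique-idempotency-token-123",
    GatewayARN="arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-12345678",
    Role="arn:aws:iam::123456789012:role/storage-gateway-s3-role",
    LocationARN="arn:aws:s3:::my-data-lake-bucket",  # backing S3 bucket
    ClientList=["10.0.0.0/24"],  # on-premises hosts allowed to mount the share
    DefaultStorageClass="S3_STANDARD",
)

print(share["FileShareARN"])
```

On-premises hosts in the allowed client list can then mount the share with a standard NFS mount (for example, mount -t nfs -o nolock,hard gateway-ip:/my-data-lake-bucket /mnt/s3), and every file written to the mount point appears as an object in the bucket.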
Useful link
https://www.youtube.com/watch?v=QaCfOatTIDA
Conclusion
After reading all this, everyone is left with one question:
Which one should you prefer for your business requirements?
The simple answer is: it depends.
- When you have real-time streaming data that you would like to transform, encrypt, or compress on the fly, your preferred choice should be Amazon Kinesis Firehose.
- When you have a large amount of data, on the order of petabytes, transferring it over the network would consume bandwidth and cost you a lot; in that case, you should go for AWS Snowball.
- When you would like to transfer data to Amazon S3 or FSx using the SMB or NFS protocol, you can create a storage gateway, join it to your Active Directory domain, and mount the storage gateway endpoint on an existing on-premises virtual machine.