Amazon Web Services provides Simple Storage Service, widely known as S3. Amazon S3 provides highly scalable, accessible, reliable object storage through web interfaces. The platform offers flexibility in data management for cost optimization and access control along with comprehensive security and compliance capabilities.
Most AWS users have worked with S3. Costs rise as the data grows or the team scales up, and at larger scales a mistake can become expensive. Often we only discover a better way of doing things after making those mistakes.
Here are ten tips that will help you avoid expensive mistakes.
What is Amazon S3? Top 10 Things to Be Careful About
1. Move Data into and out of S3 faster
Uploading and downloading files with S3 takes time. If you move files frequently, there is a good chance you can improve engineering productivity significantly by speeding up transfers. S3 is highly scalable, and with a big enough pipe or enough instances you can achieve high throughput. Nevertheless, the hidden factors listed below can become bottlenecks.
Regions and Connectivity: When moving data between servers in different locations, throughput depends on the size of the pipe between the source and S3. Typically, if your EC2 instance and your S3 bucket are in different regions, you will suffer from bandwidth issues. More surprisingly, transfer speeds within a single region can also differ: Oregon (a newer region), for example, can be faster than Virginia. If your servers are in a different location, consider AWS Direct Connect ports or S3 Transfer Acceleration to improve bandwidth.
Instance types: Choose EC2 instance types based on your network bandwidth requirements. AWS publishes instance comparisons that help with this choice.
Concurrency level of object transfer: This determines the overall throughput when moving many objects. Each S3 operation carries latency, which adds up if you handle many objects one at a time. The AWS SDKs and libraries can open concurrent connections to S3 from a single instance to allow parallelism.
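The effect of concurrency on many-object transfers can be sketched as follows. This is a minimal illustration rather than a real S3 client: transfer_one is a hypothetical stand-in that only sleeps to simulate per-request latency, and the key names and worker count are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def transfer_one(key: str) -> str:
    """Hypothetical stand-in for a single S3 upload or download.

    The sleep simulates per-request latency; a real version would call
    an SDK transfer function here instead.
    """
    time.sleep(0.05)
    return key


def transfer_all(keys: list[str], max_workers: int = 8) -> list[str]:
    """Transfer many objects concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transfer_one, keys))


# Illustrative key names (assumptions, not from a real bucket).
keys = [f"logs/part-{i:04d}.gz" for i in range(40)]
start = time.perf_counter()
done = transfer_all(keys, max_workers=8)
elapsed = time.perf_counter() - start
print(f"transferred {len(done)} objects in {elapsed:.2f}s")
```

Because the work is dominated by waiting on the network, threads are usually sufficient here. The right worker count depends on instance type and object size, so it is worth measuring rather than guessing.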
2. Assess Data and its lifecycles upfront
Before you put something in S3, there are a few important points to consider.
Assessing Data lifecycles: Large datasets tend to expire after some time. Some objects are needed only for a short period, until they are processed. It is unlikely that you want to keep raw, unprocessed logs or archives forever. The underlying tip is to think through what is expected to happen to the data over time.
Data organization based on its lifecycles: Many S3 users pay little attention to data lifecycles and end up mixing short-lived files with ones that have a longer life. This incurs significant technical debt around data organization.
Manage data lifecycles: S3 provides an object tagging feature that lets you categorize storage and apply lifecycle policies to it. If you want to delete or archive some data after a period, tag the objects and let a lifecycle rule act on the tag.
Compression Schemes: Large datasets can be compressed to save on S3 storage cost and bandwidth. Choose the compression format with the tools that will read the data in mind.
Object mutability: Generally, the approach is to store objects that are never modified and are only deleted when needed. However, mutable objects become necessary at times. In such cases, consider S3 versioning or organizing objects by version.
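Tag-based lifecycle management can be sketched as a small configuration builder. The dictionary shape below follows the lifecycle configuration structure of S3's PutBucketLifecycleConfiguration API as I understand it, so check it against the current AWS documentation before relying on it. The rule IDs, the "lifecycle" tag key, its values, and the day counts are all hypothetical.

```python
import json


def lifecycle_rule(rule_id, tag_key, tag_value,
                   archive_after_days=None, expire_after_days=None):
    """Build one lifecycle rule that applies to objects carrying a tag."""
    rule = {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
    }
    if archive_after_days is not None:
        # Move matching objects to Glacier after the given number of days.
        rule["Transitions"] = [
            {"Days": archive_after_days, "StorageClass": "GLACIER"}
        ]
    if expire_after_days is not None:
        # Delete matching objects after the given number of days.
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Hypothetical policy: raw logs expire quickly, reports are archived.
lifecycle_config = {
    "Rules": [
        lifecycle_rule("expire-raw-logs", "lifecycle", "raw-logs",
                       expire_after_days=30),
        lifecycle_rule("archive-reports", "lifecycle", "reports",
                       archive_after_days=90, expire_after_days=730),
    ]
}

print(json.dumps(lifecycle_config, indent=2))
```

Keeping the policy as data makes it easy to review alongside the code that tags objects, so short-lived and long-lived files do not get mixed silently.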
3. Understand data access, encryption, and compliance requirements
The data you store in S3 may be subject to access control and specific compliance requirements. Before moving data into S3, ask yourself the following questions:
Are there people who should not be able to read or modify this data?
Are the access rules likely to change in the future?
Is there a need for data encryption (for example, because customers have been promised that their data is secured)? If yes, how will the encryption keys be managed?
Does the data contain personal information about users or customers?
Do you have PCI, HIPAA, SOX, or EU Safe Harbor compliance requirements?
Businesses often hold sensitive data. Handling it calls for a documented procedure for storage, encryption, and access control. One way to do this in S3 is to categorize the data by its different needs.
4. Structure data well for faster S3 operations
Latency on S3 operations also depends on key names. Older AWS guidance was that once a workload exceeded roughly 100 requests per second, similar key prefixes could become a bottleneck. S3 now handles much higher per-prefix request rates, but naming schemes remain relevant for very high-volume operations. As an example, more variability in the initial characters of the key names allows an even distribution across multiple index partitions.
Structuring the data well and thinking about it upfront is very important when dealing with millions of objects. Sane tagging and well-organized data make parallelism possible; without them, crawling through millions of objects is extremely slow.
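One way to add variability to the start of key names is to prepend a short hash of the key. The sketch below is only an illustration of that idea: the hashed_key helper, the four-character prefix length, and the example key are assumptions, and whether you need this at all depends on your request rates.

```python
import hashlib


def hashed_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so keys spread across prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


# Sequential names share a long common prefix before hashing.
original = "logs/2024-01-01/part-0001.gz"
print(hashed_key(original))
```

The hash is derived from the key itself, so a reader can always recompute the full name. The trade-off is that listing objects in date order by prefix is no longer possible.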
5. Save money with S3 classes
S3 offers a range of storage classes based on how frequently you access the data. Once the data lifecycle is set, there are three options besides Standard storage for keeping your data in S3.
Reduced Redundancy Storage: It provides lower levels of redundancy than S3’s Standard storage. Durability is also lower (99.99%, only four nines), which means that at scale you should expect to lose some data. For non-critical data whose value is mostly statistical, this is a reasonable trade-off.
S3’s Infrequent Access: A cheaper storage option for data that is accessed less frequently but still needs rapid access at times. A suitable example is logs that you might want to look at later.
Glacier: Ideal for archives, with much cheaper storage. Retrieval is costly, and access is slow.
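To make the durability trade-off concrete, the expected number of objects lost per year is roughly the object count multiplied by (1 - durability). The sketch below works this out for the 99.99% figure above and, for comparison, for the eleven nines (99.999999999%) that AWS publishes as the design target for Standard storage. The object count of ten million is an assumption chosen for illustration.

```python
from decimal import Decimal


def expected_annual_loss(object_count: int, durability: str) -> Decimal:
    """Expected objects lost per year = count * (1 - annual durability).

    Decimal avoids floating-point noise with values this close to 1.
    """
    return object_count * (Decimal(1) - Decimal(durability))


objects = 10_000_000  # hypothetical bucket size

print("Reduced Redundancy:", expected_annual_loss(objects, "0.9999"))
print("Standard:", expected_annual_loss(objects, "0.99999999999"))
```

With ten million objects, four nines works out to about 1,000 objects lost per year on average, while eleven nines works out to about 0.0001. That is fine for data you can regenerate and unacceptable for data you cannot.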
6. Far-sighted S3 data organization
It always pays to be cognizant of your data. When organizing data into different buckets, think along the following axes:
– Sensitivity: Who should have access to the data?
– Compliance: What controls and processes are necessary?
– Lifecycle: What expires, and when?
– Realm: What is the data used for? Is it internal or external?
– Visibility: Do you want to track data usage?
The first three have already been discussed. As for realm, the underlying idea is to think of your data in terms of the processes it belongs to. If the data relates to the development process, it should be categorized and stored in a development bucket, so that no one confuses it with, or misplaces it in, the production bucket.
Visibility concerns tracking data usage. AWS offers several mechanisms to generate and view usage reports, depending on how you want to analyze them. If your data is organized into meaningful buckets, tallying usage by prefix or bucket becomes much easier and more useful.
7. Do not hardcode S3 locations
At times you may want to deploy multiple production or staging environments. Or, in the future, you may want to move all your objects to a different S3 location. If your code is tied to deployment details, these changes become cumbersome. Similarly, auditing which data a given piece of code accesses becomes painful.
If you follow Tip 6, code that is decoupled from the S3 location will also help with test releases and integration testing.
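A minimal sketch of decoupling code from the S3 location is to read the bucket and prefix from configuration instead of writing them into the code. Everything named here is hypothetical: the S3Location class, the APP_S3_BUCKET and APP_S3_PREFIX variable names, and the example bucket names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Location:
    """A bucket plus an optional key prefix, resolved at deploy time."""
    bucket: str
    prefix: str = ""

    def key_for(self, name: str) -> str:
        prefix = self.prefix.strip("/")
        return f"{prefix}/{name}" if prefix else name

    def uri_for(self, name: str) -> str:
        return f"s3://{self.bucket}/{self.key_for(name)}"


def location_from_env(environ=None) -> S3Location:
    """Read the location from environment-style settings (names assumed)."""
    env = os.environ if environ is None else environ
    return S3Location(
        bucket=env["APP_S3_BUCKET"],
        prefix=env.get("APP_S3_PREFIX", ""),
    )


# The same code serves staging and production; only the settings differ.
staging = location_from_env(
    {"APP_S3_BUCKET": "example-staging-data", "APP_S3_PREFIX": "reports/"}
)
print(staging.uri_for("2024/summary.csv"))
```

With every access going through one small object, moving to a different bucket is a configuration change, and an audit of what the code touches starts from a single place.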
8. Use local S3-compatible tools for testing or as production alternatives
Various tools compatible with the S3 API are available to help with testing or with migrating to local storage. Common examples for small test deployments are S3Proxy (Java) and FakeS3 (Ruby). These tools make it faster and easier to test S3-dependent code in isolation.
Many enterprises deploy AWS-compatible cloud components in their private clouds. Eucalyptus and OpenStack are two examples.
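Pointing the same code at a local S3-compatible server usually comes down to overriding the endpoint. The sketch below only builds the client settings and does not import an SDK, so it runs anywhere. The S3_ENDPOINT_URL variable name and the localhost address are assumptions. The resulting dictionary is meant to be passed to a client constructor that accepts an endpoint_url argument, as boto3's client constructor does.

```python
import os


def s3_client_kwargs(environ=None) -> dict:
    """Return client settings, adding a custom endpoint only when one is set."""
    env = os.environ if environ is None else environ
    kwargs = {}
    endpoint = env.get("S3_ENDPOINT_URL")  # hypothetical variable name
    if endpoint:
        # Local emulator or private cloud instead of the AWS endpoint.
        kwargs["endpoint_url"] = endpoint
    return kwargs


# In tests: point at a local emulator (address is an assumption).
print(s3_client_kwargs({"S3_ENDPOINT_URL": "http://localhost:8080"}))
# In production: no override, so the SDK default endpoint is used.
print(s3_client_kwargs({}))
```

Leaving the override unset in production keeps the default behavior, so the test-only setting cannot leak into a real deployment by accident.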
9. Look for newer tools to mount S3 as a file system
Using S3 as a file system is possible but can be complex. One solution that has been around for a long time is S3FS-FUSE. It lets you mount S3 as a regular filesystem, but it is not a robust solution and has drawbacks in performance and file operations. More recent implementations such as Goofys and RioFS improve on S3FS. ObjectiveFS is a commercial solution that offers many filesystem features. If you are looking for filesystem backup solutions, zbackup, rclone, and borg are open source backup and sync tools.
10. Drop S3 if another solution is better
Depending on your use case, an alternative to S3 may be cheaper or a better fit. As discussed in Tip 5, Glacier is a great choice for cheaper storage. EBS and EFS can be more suitable for random-access data but cost more than S3. EBS with regular snapshots is a good choice for a filesystem abstraction in AWS. An EBS volume can generally be attached to only one EC2 instance at a time, whereas EFS, since its release, can be mounted by thousands of EC2 instances if the budget allows.
Lastly, if you do not want to store data in AWS, the other promising options include Google Cloud Storage, Azure Blob Storage, EMC Atmos and Rackspace Cloud Files.