AWS for Deep Learning: Terms to Explore
With cloud deep learning, you can request as many GPU machines as you need and scale up or down on demand. Amazon Web Services (AWS) provides an extensive ecosystem of services that support deep learning applications, such as SageMaker and Deep Learning Containers.
Any deep learning project requires three essential resources — storage, compute, and networking.
STORAGE
Amazon Simple Storage Service (S3): S3 stores massive amounts of data for DL projects and forms the base for data ingestion, ETL, ad hoc data querying, and data wrangling. Data analysis and visualization tools can also connect to S3 to make sense of the data before it is fed into a DL project.
Amazon Elastic Block Store (EBS): During training, data is copied from S3 to EBS volumes attached to the training machines in Amazon EC2. This reduces data-access latency during training.
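The S3-to-EBS staging step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a prescribed AWS workflow: the function name stage_prefix, the bucket, the prefix, and the mount path are all hypothetical. The S3 client is passed in as an argument, and the sketch relies on only two boto3 S3 client calls, get_paginator("list_objects_v2") and download_file.

```python
import os


def stage_prefix(s3_client, bucket, prefix, dest_dir):
    """Copy every object under an S3 prefix into a local directory.

    dest_dir is assumed to sit on an EBS volume mounted on the training
    instance (a hypothetical path such as /mnt/training-data).
    Returns the list of local file paths that were written.
    """
    staged = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip zero-byte "folder" placeholder keys
            # Keep the directory layout below the prefix.
            relative = key[len(prefix):].lstrip("/") or os.path.basename(key)
            local_path = os.path.join(dest_dir, relative)
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3_client.download_file(bucket, key, local_path)
            staged.append(local_path)
    return staged
```

With boto3 installed and credentials configured, a call might look like stage_prefix(boto3.client("s3"), "my-bucket", "datasets/train/", "/mnt/training-data"), where the bucket, prefix, and path are placeholders. Training code would then read from the local directory instead of fetching each file from S3.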
Amazon Elastic File System (EFS): Amazon EFS is well suited to large-scale batch processing and to cases where multiple training jobs need access to the same data. It lets developers and data scientists access large amounts of data directly from their workstation or a code repository, with storage that grows and shrinks automatically and no network file shares to manage.
Amazon FSx for Lustre: FSx for Lustre is another high-performance file system suitable for compute-intensive workloads like deep learning.
COMPUTE
DL models require millions of matrix and vector operations. GPU hardware parallelizes them with its many cores and improves performance.
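To get a feel for the operation counts involved, the sketch below estimates the multiply-add work in a small stack of fully connected layers. The layer sizes and batch size are made-up illustrative numbers and the function names are hypothetical. The only formula used is that multiplying an (m x k) matrix by a (k x n) matrix takes m*k*n multiplications and roughly as many additions.

```python
def matmul_flops(m, k, n):
    """Approximate floating-point ops to multiply an (m x k) by a (k x n) matrix.

    Each of the m*n outputs needs k multiplications and k - 1 additions;
    the usual convention rounds this to 2*k operations per output.
    """
    return 2 * m * k * n


def mlp_forward_flops(batch_size, layer_sizes):
    """Sum the matmul costs of one forward pass through dense layers.

    layer_sizes lists the width of each layer, input first.
    Bias additions and activation functions are ignored.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += matmul_flops(batch_size, fan_in, fan_out)
    return total


# Illustrative (made-up) network: 784 inputs, two hidden layers, 10 outputs.
flops = mlp_forward_flops(batch_size=64, layer_sizes=[784, 512, 256, 10])
print(f"{flops:,} floating-point operations per forward pass")
```

Even this toy network needs tens of millions of operations for a single forward pass on one batch. Every output element of a matrix product can be computed independently, which is why the work maps well onto the many cores of a GPU.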
Amazon has gone through four generations of GPU instances. The latest generation at the time of writing, P4, was released in November 2020.
EC2 stands for Elastic Compute Cloud. It is a web service that provides resizable compute capacity, in the form of virtual machines, in the cloud.
DEEP LEARNING SERVICES
Beyond offering the building blocks for deep learning applications, Amazon also offers end-to-end deep learning solutions. We’ll cover four options.
Amazon SageMaker: Amazon SageMaker is a fully managed machine learning service that enables data scientists and developers to create and train machine learning models, including deep learning architectures, and deploy them into a hosted production environment.
SageMaker provides an integrated Jupyter notebook, allowing data scientists to access data sources easily without needing to manage server infrastructure. It makes it easy to run common ML and DL algorithms, pre-optimized to run in a distributed environment.
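As a rough illustration of what "fully managed" means in practice, the sketch below assembles the request that the boto3 SageMaker client's create_training_job call expects. It is a sketch under assumptions rather than a recommended setup: the helper name build_training_job_request, the default instance type and sizes, the channel name, and every ARN, URI, and job name in the usage note are placeholders.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.p3.2xlarge",
                               instance_count=1, volume_gb=100,
                               max_runtime_s=3600):
    """Build keyword arguments for a SageMaker create_training_job call.

    Defaults are illustrative assumptions, not recommendations.
    """
    for uri in (train_s3_uri, output_s3_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            # "File" mode copies the dataset onto the instance's
            # storage volume before training starts.
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_gb,
        },
        # Caps the job's runtime so a stuck job cannot run indefinitely.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
    }
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("sagemaker").create_training_job(**request). SageMaker then provisions the instances, runs the training container, writes the model artifacts to the output location, and shuts the instances down.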
AWS Deep Learning AMI (DLAMI): AWS DLAMI is a custom EC2 machine image that can be used with multiple instance types, like GPU and CPU. Developers and data scientists can use it to instantly set up a pre-configured DL environment on Amazon, including CUDA, cuDNN, and popular frameworks like PyTorch.
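After launching an instance from a pre-configured image, a quick sanity check can confirm that the expected frameworks are importable and that a GPU is visible. The sketch below is one possible check, not an official DLAMI tool. The function name check_environment and the default framework list are assumptions, and the only framework call used is PyTorch's torch.cuda.is_available().

```python
import importlib
import importlib.util


def check_environment(frameworks=("torch", "tensorflow")):
    """Report which frameworks are importable and whether PyTorch sees a GPU.

    Returns a dict mapping each framework name to True/False, plus a
    "cuda_visible_to_torch" entry.
    """
    report = {}
    for name in frameworks:
        # find_spec locates a top-level package without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    report["cuda_visible_to_torch"] = False
    if report.get("torch"):
        torch = importlib.import_module("torch")
        report["cuda_visible_to_torch"] = bool(torch.cuda.is_available())
    return report
```

Printing check_environment() on a GPU instance with a working CUDA setup should show the CUDA entry as True. On a CPU-only instance, or on a machine without PyTorch, it stays False.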
AWS Deep Learning Containers: AWS Deep Learning Containers are Docker images that provide a complete, pre-installed deep learning development environment. They come with frameworks such as TensorFlow and PyTorch, and can be deployed on SageMaker or on Amazon container services like EKS and ECS.
Amazon Elastic Inference: Elastic Inference attaches GPU-powered inference acceleration to regular Amazon EC2 and SageMaker instances. It can provide significant cost savings by letting you run deep learning inference on regular compute instances, which are significantly cheaper than GPU instances.