
Use Pipe mode with CSV datasets for faster training on AWS

Viewing 3 posts: 1-3 (of 3 total)
  • Author
    Posts
  • #1336

    aluck
    Participant

    Amazon SageMaker built-in algorithms now support Pipe mode for fetching datasets in CSV format from Amazon Simple Storage Service (S3) into Amazon SageMaker while training machine learning (ML) models.

    With Pipe input mode, the data is streamed directly to the algorithm container while model training is in progress. This is unlike File mode, which downloads the data to the local Amazon Elastic Block Store (EBS) volume before training starts. With Pipe mode, your training jobs start faster, use significantly less disk space, and finish sooner, which reduces your overall cost to train machine learning models. In some of our internal benchmarks that trained a regression model with the Amazon SageMaker Linear Learner algorithm on a 3.9 GB CSV dataset, the overall training time was reduced by up to 40 percent by using Pipe mode instead of File mode.
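
    As a rough illustration of where the input mode is chosen, here is a minimal sketch that builds a request for the SageMaker CreateTrainingJob API with Pipe mode and a CSV channel. The helper name build_training_request, the channel name "train", the instance settings, and every value passed in the usage note are placeholders I made up; none of them come from the posts. Algorithm hyperparameters (which Linear Learner and the other built-in algorithms need) are left out.

```python
def build_training_request(job_name, image_uri, role_arn,
                           train_s3_uri, output_s3_uri, input_mode="Pipe"):
    """Build a CreateTrainingJob request dict (hypothetical helper).

    Only the request structure is shown; nothing is sent to AWS here.
    """
    if input_mode not in ("Pipe", "File"):
        raise ValueError("input_mode must be 'Pipe' or 'File'")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,        # placeholder algorithm image
            "TrainingInputMode": input_mode,   # "Pipe" streams, "File" downloads first
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",        # assumed channel name
                "ContentType": "text/csv",     # CSV input
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {                    # illustrative sizes only
            "InstanceType": "ml.c5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

    With boto3 installed and credentials configured, a dict like this could be passed as boto3.client("sagemaker").create_training_job(**request), after adding the HyperParameters your algorithm requires. Switching input_mode to "File" is the only change needed to compare the two modes on the same dataset.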

    #1337

    aluck
    Participant

    The following Amazon SageMaker built-in algorithms now have full support for training with datasets in CSV format using Pipe input mode:

    Principal Component Analysis (PCA)
    K-Means Clustering
    K-Nearest Neighbors
    Linear Learner (Classification and Regression)
    Neural Topic Model (NTM)
    Random Cut Forest

    #1338

    aluck
    Participant

    Amazon SageMaker supports two mechanisms for transferring training data: File mode and Pipe mode. In File mode, the training data is downloaded first to an encrypted EBS volume attached to the training instance prior to commencing the training. However, in Pipe mode the input data is streamed directly to the training algorithm while it is running. This continuous streaming of data enables a few significant advantages. First, the startup time of a training job becomes independent of the size of the input data, resulting in much quicker startup, especially while training on gigabyte- and petabyte-scale datasets. Furthermore, you don’t have to pay for a large disk volume to download large datasets. Finally, if your training algorithm is I/O-bound, the highly concurrent, high-throughput reading mechanism employed by Pipe mode can significantly speed up your model training.
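
    To make the streaming idea concrete, here is a small sketch of how a custom training script might consume CSV rows incrementally instead of loading a downloaded file. It is not how the built-in algorithms are implemented. The function names are hypothetical, and an in-memory io.StringIO stands in for the stream; in a Pipe mode container the data is exposed as a named pipe, which I am assuming would be opened at a path such as /opt/ml/input/data/train_0 for a channel called "train".

```python
import csv
import io


def stream_csv_batches(stream, batch_size):
    """Yield batches of numeric rows from a text stream, one batch at a time.

    Only batch_size rows are held in memory, so the total dataset size
    does not affect memory or disk usage.
    """
    batch = []
    for row in csv.reader(stream):
        if not row:                 # skip blank lines
            continue
        batch.append([float(value) for value in row])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                       # final, possibly smaller, batch
        yield batch


def mean_of_first_column(stream, batch_size=2):
    """Toy 'training' pass: compute a running mean while streaming."""
    total, count = 0.0, 0
    for batch in stream_csv_batches(stream, batch_size):
        total += sum(row[0] for row in batch)
        count += len(batch)
    return total / count if count else 0.0


# Stand-in for the pipe; a real script would pass an opened FIFO instead.
sample = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
print(mean_of_first_column(sample))
```

    Because the generator processes each batch as soon as it arrives, work can begin before the whole dataset has been transferred, which is the property that makes job startup time independent of input size.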
