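How retry handling is implemented in the AWS CLI and AWS SDKs

The AWS CLI and AWS SDKs retry API calls using an exponential backoff algorithm. This article traces through the source code to see what that actually involves.

Retry handling, including the criteria for retrying and the exponential backoff behavior, can differ depending on the CLI or SDK language and on the service.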

AWS CLI or Python SDK

!!! Some of the content below is now out of date. Settings such as the number of attempts can now be configured. For details, see the documentation below. !!!

AWS CLI retries

Default retry settings

https://github.com/boto/botocore/blob/1.16.1/botocore/data/_retry.json#L91-L113 The default is an exponential backoff algorithm with a randomized base, with max_attempts set to 5.

  "retry": {
    "__default__": {
      "max_attempts": 5,
      "delay": {
        "type": "exponential",
        "base": "rand",
        "growth_factor": 2
      },
      "policies": {
          "general_socket_errors": {"$ref": "general_socket_errors"},
          "general_server_error": {"$ref": "general_server_error"},
          "bad_gateway": {"$ref": "bad_gateway"},
          "service_unavailable": {"$ref": "service_unavailable"},
          "gateway_timeout": {"$ref": "gateway_timeout"},
          "limit_exceeded": {"$ref": "limit_exceeded"},
          "throttling_exception": {"$ref": "throttling_exception"},
          "throttled_exception": {"$ref": "throttled_exception"},
          "request_throttled_exception": {"$ref": "request_throttled_exception"},
          "throttling": {"$ref": "throttling"},
          "too_many_requests": {"$ref": "too_many_requests"},
          "throughput_exceeded": {"$ref": "throughput_exceeded"}
      }
    },

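To change the retry behavior, you either edit the JSON or force it through the client's private attributes, as in the issue below.

https://github.com/boto/botocore/issues/882

# boto3_config is assumed to be defined elsewhere
client = boto3.client('ec2', region_name='us-west-2', config=boto3_config)
client.meta.events._unique_id_handlers['retry-config-ec2']['handler']._checker.__dict__['_max_attempts'] = 20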

Next, let's check how the retry configuration is invoked.

https://github.com/boto/botocore/blob/develop/botocore/client.py create_client calls _load_service_model.

def create_client(self, service_name, region_name, is_secure=True,
                  endpoint_url=None, verify=None,
                  credentials=None, scoped_config=None,
                  api_version=None,
                  client_config=None):
    service_model = self._load_service_model(service_name, api_version)
    cls = self._create_client_class(service_name, service_model)
    endpoint_bridge = ClientEndpointBridge(
        self._endpoint_resolver, scoped_config, client_config,
        service_signing_name=service_model.metadata.get('signingName'))
    client_args = self._get_client_args(
        service_model, region_name, is_secure, endpoint_url,
        verify, credentials, scoped_config, client_config, endpoint_bridge)
    service_client = cls(**client_args)
    self._register_s3_events(
        service_client, endpoint_bridge, endpoint_url, client_config,
        scoped_config)
    return service_client

_load_service_model calls _register_retries.

def _load_service_model(self, service_name, api_version=None):
    json_model = self._loader.load_service_model(service_name, 'service-2',
                                                 api_version=api_version)
    service_model = ServiceModel(json_model, service_name=service_name)
    self._register_retries(service_model)
    return service_model

_register_retries calls load_data.

def _register_retries(self, service_model):
    endpoint_prefix = service_model.endpoint_prefix

    # First, we load the entire retry config for all services,
    # then pull out just the information we need.
    original_config = self._loader.load_data('_retry')
    if not original_config:
        return

    retry_config = self._retry_config_translator.build_retry_config(
        endpoint_prefix, original_config.get('retry', {}),
        original_config.get('definitions', {}))

    logger.debug("Registering retry handlers for service: %s",
                 service_model.service_name)
    handler = self._retry_handler_factory.create_retry_handler(
        retry_config, endpoint_prefix)
    unique_id = 'retry-config-%s' % endpoint_prefix
    self._event_emitter.register('needs-retry.%s' % endpoint_prefix,
                                 handler, unique_id=unique_id)

https://github.com/boto/botocore/blob/develop/botocore/loaders.py

This is where the JSON data is loaded.

def load_data(self, name):
    """Load data given a data path.

    This is a low level method that will search through the various
    search paths until it's able to load a value.  This is typically
    only needed to load *non* model files (such as _endpoints and
    _retry).  If you need to load model files, you should prefer
    ``load_service_model``.

    :type name: str
    :param name: The data path, i.e ``ec2/2015-03-01/service-2``.

    :return: The loaded data.  If no data could be found then
        a DataNotFoundError is raised.
    """
    for possible_path in self._potential_locations(name):
        found = self.file_loader.load_file(possible_path)
        if found is not None:
            return found
    # We didn't find anything that matched on any path.
    raise DataNotFoundError(data_path=name)

The retry behavior is assembled and returned by the create_retry_handler() function. https://github.com/boto/botocore/blob/develop/botocore/retryhandler.py#L72-L77

def create_retry_handler(config, operation_name=None):
    checker = create_checker_from_retry_config(
        config, operation_name=operation_name)
    action = create_retry_action_from_config(
        config, operation_name=operation_name)
    return RetryHandler(checker=checker, action=action)

def create_retry_action_from_config(config, operation_name=None):
    # The spec has the possibility of supporting per policy
    # actions, but right now, we assume this comes from the
    # default section, which means that delay functions apply
    # for every policy in the retry config (per service).
    delay_config = config['__default__']['delay']
    if delay_config['type'] == 'exponential':
        return create_exponential_delay_function(
            base=delay_config['base'],
            growth_factor=delay_config['growth_factor'])


def create_exponential_delay_function(base, growth_factor):
    """Create an exponential delay function based on the attempts.
    This is used so that you only have to pass it the attempts
    parameter to calculate the delay.
    """
    return functools.partial(
        delay_exponential, base=base, growth_factor=growth_factor)
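How the checker and the action fit together can be sketched as follows. This is a simplified illustration, not botocore's code: SimpleRetryHandler, status_checker, and fixed_exponential are hypothetical names, and the real handler receives the full response and any caught exception rather than a bare status code.

```python
import functools


def fixed_exponential(base, growth_factor, attempts):
    """Delay in seconds: base * growth_factor ** (attempts - 1)."""
    return base * (growth_factor ** (attempts - 1))


def status_checker(attempts, status_code, max_attempts=5):
    """Hypothetical checker: retry on 5xx while attempts remain."""
    return attempts < max_attempts and status_code >= 500


class SimpleRetryHandler:
    """Pairs a checker (should we retry?) with an action (how long to wait)."""

    def __init__(self, checker, action):
        self._checker = checker
        self._action = action

    def __call__(self, attempts, status_code):
        if self._checker(attempts, status_code):
            # The action only needs the attempt number, because base and
            # growth_factor were bound in advance with functools.partial.
            return self._action(attempts=attempts)
        return None  # None means "do not retry"


handler = SimpleRetryHandler(
    checker=status_checker,
    action=functools.partial(fixed_exponential, base=0.5, growth_factor=2),
)
```

Calling handler(attempts, status_code) returns either the number of seconds to sleep before the next attempt or None when the request should not be retried.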

https://github.com/boto/botocore/blob/develop/botocore/retryhandler.py#L39-L58

By default base is 'rand', so base is taken from random.random() and the retry interval is calculated as base * (growth_factor ** (attempts - 1)).

def delay_exponential(base, growth_factor, attempts):
    """Calculate time to sleep based on exponential function.
    The format is::
        base * growth_factor ^ (attempts - 1)
    If ``base`` is set to 'rand' then a random number between
    0 and 1 will be used as the base.
    Base must be greater than 0, otherwise a ValueError will be
    raised.
    """
    if base == 'rand':
        base = random.random()
    elif base <= 0:
        raise ValueError("The 'base' param must be greater than 0, "
                         "got: %s" % base)
    time_to_sleep = base * (growth_factor ** (attempts - 1)) 
    return time_to_sleep
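To get a feel for the numbers, the formula can be evaluated with the default settings (base 'rand', growth_factor 2). The following is a standalone sketch that re-implements the formula instead of importing botocore; the rng parameter and the max_delay helper are hypothetical additions for illustration only.

```python
import random


def delay_exponential(base, growth_factor, attempts, rng=random):
    """Standalone copy of the formula: base * growth_factor ** (attempts - 1)."""
    if base == 'rand':
        base = rng.random()  # uniform in [0.0, 1.0)
    elif base <= 0:
        raise ValueError("The 'base' param must be greater than 0, "
                         "got: %s" % base)
    return base * (growth_factor ** (attempts - 1))


def max_delay(growth_factor, attempts):
    """Upper bound of the delay when base is 'rand' (base is below 1)."""
    return growth_factor ** (attempts - 1)


for attempts in range(1, 5):
    delay = delay_exponential('rand', 2, attempts)
    print("attempt %d: sleep %.3fs (upper bound %ds)"
          % (attempts, delay, max_delay(2, attempts)))
```

With a random base in [0, 1), the sleep stays below 1, 2, 4, and 8 seconds for attempts 1 through 4, so each successive wait can be at most twice as long as the previous upper bound.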

Java SDK

The default ClientConfiguration object in the Java SDK

The default retry policy is the one specified by PredefinedRetryPolicies.DEFAULT, so the settings are as follows.

The maximum number of retries is 3.
The retry conditions are specified by DEFAULT_RETRY_CONDITION: HTTP status code 500 or 503, a 400 error caused by throttling, a clock skew error, or an I/O error.
The wait between retries is specified by DEFAULT_BACKOFF_STRATEGY, which uses exponential backoff (BASE_DELAY 100 ms, MAX_BACKOFF_IN_MILLISECONDS 20 seconds).
Timeouts come from ClientConfiguration: DEFAULT_CONNECTION_TIMEOUT (10 seconds) and DEFAULT_SOCKET_TIMEOUT (50 seconds).

public class PredefinedRetryPolicies {

    // (omitted)

    /* SDK default */

    /** SDK default max retry count **/
    public static final int DEFAULT_MAX_ERROR_RETRY = 3;

    /**
     * SDK default retry policy (except for AmazonDynamoDBClient,
     * whose constructor will replace the DEFAULT with DYNAMODB_DEFAULT.)
     */
    public static final RetryPolicy DEFAULT;

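In the code below, DEFAULT_RETRY_CONDITION holds the object that decides whether a request should be retried, and DEFAULT_BACKOFF_STRATEGY holds the object that decides how the retry is carried out, that is, how long to wait before each retry.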

/**
 * The SDK default retry condition, which checks for various conditions in
 * the following order:
 *
 * - Never retry on requests with non-repeatable content;
 * - Retry on client exceptions caused by IOException;
 * - Retry on service exceptions that are either 500 internal server
 *   errors, 503 service unavailable errors, service throttling errors or
 *   clock skew errors.
 */
public static final RetryPolicy.RetryCondition DEFAULT_RETRY_CONDITION = new SDKDefaultRetryCondition();

/**
 * The SDK default back-off strategy, which increases exponentially up to a max amount of delay. It also applies a larger
 * scale factor upon service throttling exception.
 */
public static final RetryPolicy.BackoffStrategy DEFAULT_BACKOFF_STRATEGY =
        new PredefinedBackoffStrategies.SDKDefaultBackoffStrategy();

Based on these settings, getDefaultRetryPolicy() is defined to return a RetryPolicy object.

/**
 * Returns the SDK default retry policy. This policy will honor the
 * maxErrorRetry set in ClientConfiguration.
 *
 * @see ClientConfiguration#setMaxErrorRetry(int)
 */
public static RetryPolicy getDefaultRetryPolicy() {
    return new RetryPolicy(DEFAULT_RETRY_CONDITION,
                           DEFAULT_BACKOFF_STRATEGY,
                           DEFAULT_MAX_ERROR_RETRY,
                           true);
}

getDefaultRetryPolicy() is called in the static initializer below, and its result is stored in DEFAULT. This DEFAULT RetryPolicy is then referenced from other classes and determines the retry behavior.

static {
    DEFAULT = getDefaultRetryPolicy();
    DYNAMODB_DEFAULT = getDynamoDBDefaultRetryPolicy();
}

The ClientConfiguration class

ClientConfiguration is the class that lets you override defaults such as the retry settings and customize the values on the client side.

https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/ClientConfiguration.java

The initial values are set as follows.

@NotThreadSafe
public class ClientConfiguration {

    /** The default timeout for creating new connections. */
    public static final int DEFAULT_CONNECTION_TIMEOUT = 10 * 1000;

    /** The default timeout for reading from a connected socket. */
    public static final int DEFAULT_SOCKET_TIMEOUT = 50 * 1000;

    /**
     * The default timeout for a request. This is disabled by default.
     */
    public static final int DEFAULT_REQUEST_TIMEOUT = 0;

    /**
     * The default timeout for a request. This is disabled by default.
     */
    public static final int DEFAULT_CLIENT_EXECUTION_TIMEOUT = 0;

    /** The default max connection pool size. */
    public static final int DEFAULT_MAX_CONNECTIONS = 50;

Here the DEFAULT value of the PredefinedRetryPolicies class shown earlier is assigned and stored in the retryPolicy field.

public static final RetryPolicy DEFAULT_RETRY_POLICY = PredefinedRetryPolicies.DEFAULT;

/** The retry policy upon failed requests. **/
private RetryPolicy retryPolicy = DEFAULT_RETRY_POLICY;

When the client supplies its own ClientConfiguration, the settings are overwritten here.

public ClientConfiguration(ClientConfiguration other) {
    this.connectionTimeout = other.connectionTimeout;
    this.maxConnections = other.maxConnections;
    this.maxErrorRetry = other.maxErrorRetry;
    this.retryPolicy = other.retryPolicy;

From the above, retries are performed even when no ClientConfiguration has been set explicitly.

The decision whether to retry is made in the shouldRetry() method of the SDKDefaultRetryCondition class, which lives in the same file as the PredefinedRetryPolicies class.

@Override
public boolean shouldRetry(AmazonWebServiceRequest originalRequest,
                           AmazonClientException exception,
                           int retriesAttempted) {
    // Always retry on client exceptions caused by IOException
    if (exception.getCause() instanceof IOException) return true;

    // Only retry on a subset of service exceptions
    if (exception instanceof AmazonServiceException) {
        AmazonServiceException ase = (AmazonServiceException)exception;

        /*
         * For 500 internal server errors and 503 service
         * unavailable errors, we want to retry, but we need to use
         * an exponential back-off strategy so that we don't overload
         * a server with a flood of retries.
         */
        if (RetryUtils.isRetryableServiceException(ase)) return true;

        /*
         * Throttling is reported as a 400 error from newer services. To try
         * and smooth out an occasional throttling error, we'll pause and
         * retry, hoping that the pause is long enough for the request to
         * get through the next time.
         */
        if (RetryUtils.isThrottlingException(ase)) return true;

        /*
         * Clock skew exception. If it is then we will get the time offset
         * between the device time and the server time to set the clock skew
         * and then retry the request.
         */
        if (RetryUtils.isClockSkewError(ase)) return true;
    }

    return false;
}

This is where the SDK decides whether to retry, but you can make retries more reliable by also implementing your own retry handling in the application.
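An application-side retry wrapper of the kind mentioned above might look like the following. It is a minimal sketch written in Python for brevity rather than Java; RetryableError and call_with_retries are hypothetical names, and the 0.1 second base and 20 second cap simply mirror the Java SDK defaults quoted earlier.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. throttling)."""


def call_with_retries(func, max_retries=3, base_delay=0.1, max_delay=20.0,
                      sleep=time.sleep):
    """Call func(); on RetryableError wait with full jitter and try again."""
    for retries_attempted in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if retries_attempted == max_retries:
                raise  # retries exhausted, surface the last error
            # Exponential ceiling, capped at max_delay, then full jitter.
            ceiling = min(max_delay, base_delay * (2 ** retries_attempted))
            sleep(random.uniform(0, ceiling))
```

The sleep function is injectable so the wrapper can be exercised in tests without actually waiting. Errors that are not RetryableError propagate immediately.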

The code below defines which algorithm the default retry handling uses. When the request was not throttled it uses Full Jitter backoff, and when it was throttled it uses Equal Jitter backoff.

https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/retry/PredefinedBackoffStrategies.java

/**
 * A private class that implements the default back-off strategy.
 **/
static class SDKDefaultBackoffStrategy extends V2CompatibleBackoffStrategyAdapter {

    private final BackoffStrategy fullJitterBackoffStrategy;
    private final BackoffStrategy equalJitterBackoffStrategy;

    SDKDefaultBackoffStrategy() {
        fullJitterBackoffStrategy = new PredefinedBackoffStrategies.FullJitterBackoffStrategy(
                SDK_DEFAULT_BASE_DELAY, SDK_DEFAULT_MAX_BACKOFF_IN_MILLISECONDS);
        equalJitterBackoffStrategy = new PredefinedBackoffStrategies.EqualJitterBackoffStrategy(
                SDK_DEFAULT_THROTTLED_BASE_DELAY, SDK_DEFAULT_MAX_BACKOFF_IN_MILLISECONDS);
    }

    SDKDefaultBackoffStrategy(final int baseDelay, final int throttledBaseDelay, final int maxBackoff) {
        fullJitterBackoffStrategy = new PredefinedBackoffStrategies.FullJitterBackoffStrategy(
                baseDelay, maxBackoff);
        equalJitterBackoffStrategy = new PredefinedBackoffStrategies.EqualJitterBackoffStrategy(
                throttledBaseDelay, maxBackoff);
    }

    @Override
    public long computeDelayBeforeNextRetry(RetryPolicyContext context) {
        /*
         * We use the full jitter scheme for non-throttled exceptions and the
         * equal jitter scheme for throttled exceptions.  This gives a preference
         * to quicker response and larger retry distribution for service errors
         * and guarantees a minimum delay for throttled exceptions.
         */
        if (RetryUtils.isThrottlingException(context.exception())) {
            return equalJitterBackoffStrategy.computeDelayBeforeNextRetry(context);
        } else {
            return fullJitterBackoffStrategy.computeDelayBeforeNextRetry(context);
        }
    }
}
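The selection between the two strategies can be sketched as follows. This is an illustration in Python, not a port of the SDK's Java code: the function names are hypothetical, the jitter formulas follow the general Full Jitter and Equal Jitter definitions rather than the SDK's exact arithmetic, and the 500 ms throttled base delay is an assumed value that this article does not state.

```python
import random

BASE_DELAY_MS = 100            # SDK_DEFAULT_BASE_DELAY, as quoted above
THROTTLED_BASE_DELAY_MS = 500  # assumed value; not stated in this article
MAX_BACKOFF_MS = 20 * 1000     # SDK_DEFAULT_MAX_BACKOFF_IN_MILLISECONDS


def exponential_ceiling(retries_attempted, base_delay_ms, max_backoff_ms):
    """Exponentially growing delay, capped at max_backoff_ms."""
    return min(max_backoff_ms, base_delay_ms * (2 ** retries_attempted))


def full_jitter(retries_attempted, base_delay_ms, max_backoff_ms):
    """Random delay anywhere between 0 and the exponential ceiling."""
    ceiling = exponential_ceiling(retries_attempted, base_delay_ms, max_backoff_ms)
    return random.uniform(0, ceiling)


def equal_jitter(retries_attempted, base_delay_ms, max_backoff_ms):
    """Half of the ceiling is fixed, the other half is random."""
    ceiling = exponential_ceiling(retries_attempted, base_delay_ms, max_backoff_ms)
    return ceiling / 2 + random.uniform(0, ceiling / 2)


def compute_delay_before_next_retry(retries_attempted, throttled):
    """Equal jitter for throttled requests, full jitter otherwise."""
    if throttled:
        return equal_jitter(retries_attempted, THROTTLED_BASE_DELAY_MS, MAX_BACKOFF_MS)
    return full_jitter(retries_attempted, BASE_DELAY_MS, MAX_BACKOFF_MS)
```

Under these assumptions a throttled request always waits at least half of its ceiling, while a non-throttled request may be retried almost immediately.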

JavaScript SDK

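In the JavaScript SDK:

The number of retries is 3 (10 for DynamoDB).
A request is retried when the status code is any 5XX error or 429, or when a timeout occurs.

Retry handling in the JavaScript SDK is, for the most part, defined in service.js directly under lib/.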

Retry count

https://github.com/aws/aws-sdk-js/blob/73d1c78f21793206e9db0b54161b64db9ab54ff2/lib/service.js

AWS.Service defines defaultRetryCount as 3 by default.

  defaultRetryCount: 3,

For DynamoDB, however, the default is defined as 10.

  defaultRetryCount: 10,

The numRetries() function returns the retry count that is actually used.

  /**
   * How many times a failed request should be retried before giving up.
   * the defaultRetryCount can be overriden by service classes.
   *
   * @api private
   */
  numRetries: function numRetries() {
    if (this.config.maxRetries !== undefined) {
      return this.config.maxRetries;
    } else {
      return this.defaultRetryCount;
    }
  },

Retry interval

The retryDelays() function calls AWS.util.calculateRetryDelay().

  /**
   * @api private
   */
  retryDelays: function retryDelays(retryCount) {
    return AWS.util.calculateRetryDelay(retryCount, this.config.retryDelayOptions);
  },

Let's look at what AWS.util.calculateRetryDelay() does, starting from its configuration.

https://github.com/aws/aws-sdk-js/blob/9f1237b605f60d70753f5d1c7ac7bffe4d4430d5/lib/config.d.ts#L139-L150

The abstract class ConfigurationOptions has a retryDelayOptions property.

    /**
     * Returns A set of options to configure the retry delay on retryable errors.
     */
    retryDelayOptions?: RetryDelayOptions

RetryDelayOptions lets you set base, the base time for the exponential backoff (in milliseconds, 100 ms by default), and, if you want to customize the backoff algorithm, a customBackoff function that defines it.

export interface RetryDelayOptions {
    /**
     * The base number of milliseconds to use in the exponential backoff for operation retries.
     * Defaults to 100 ms.
     */
    base?: number
    /**
     * A custom function that accepts a retry count and returns the amount of time to delay in milliseconds.
     * The base option will be ignored if this option is supplied.
     */
    customBackoff?: (retryCount: number) => number
}

https://github.com/aws/aws-sdk-js/blob/2b6bcbdec1f274fe931640c1b61ece999aae7a19/lib/util.js#L833-L852

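The maximum number of retries follows whatever the caller defines via options.maxRetries; if nothing is set, it defaults to 0. The retry interval is calculated by calculateRetryDelay().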

  /**
   * @api private
   */
  handleRequestWithRetries: function handleRequestWithRetries(httpRequest, options, cb) {
    if (!options) options = {};
    var http = AWS.HttpClient.getInstance();
    var httpOptions = options.httpOptions || {};
    var retryCount = 0;

    var errCallback = function(err) {
      var maxRetries = options.maxRetries || 0;
      if (err && err.code === 'TimeoutError') err.retryable = true;
      if (err && err.retryable && retryCount < maxRetries) {
        retryCount++;
        var delay = util.calculateRetryDelay(retryCount, options.retryDelayOptions);
        setTimeout(sendRequest, delay + (err.retryAfter || 0));
      } else {
        cb(err);
      }
    };

The wait time is the value set in retryDelayOptions.base, multiplied by 2 raised to the power of the retry count, multiplied by a random number between 0 and 1.

  /**
   * @api private
   */
  calculateRetryDelay: function calculateRetryDelay(retryCount, retryDelayOptions) {
    if (!retryDelayOptions) retryDelayOptions = {};
    var customBackoff = retryDelayOptions.customBackoff || null;
    if (typeof customBackoff === 'function') {
      return customBackoff(retryCount);
    }
    var base = typeof retryDelayOptions.base === 'number' ? retryDelayOptions.base : 100;
    var delay = Math.random() * (Math.pow(2, retryCount) * base);
    return delay;
  },
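The same calculation can be sketched in Python to see the range of delays it produces. This is an illustrative port, not part of the SDK; calculate_retry_delay and its keyword arguments are hypothetical names that stand in for retryDelayOptions.base and retryDelayOptions.customBackoff.

```python
import random


def calculate_retry_delay(retry_count, base=100, custom_backoff=None):
    """Delay in milliseconds: random() * 2 ** retry_count * base."""
    if callable(custom_backoff):
        # A custom function takes precedence and base is ignored.
        return custom_backoff(retry_count)
    return random.random() * (2 ** retry_count) * base


for retry_count in range(1, 4):
    upper = (2 ** retry_count) * 100
    print("retry %d: %.1f ms (below %d ms)"
          % (retry_count, calculate_retry_delay(retry_count), upper))
```

With the default base of 100 ms, the delay for retry counts 1, 2, and 3 falls below 200, 400, and 800 ms respectively. Note that there is no cap in this formula, so the upper bound keeps doubling with the retry count.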

Retry criteria

A request is retried when the status code is a 5XX error or a 429 error, or when a timeout occurs.

            var err = util.error(new Error(),
              { retryable: statusCode >= 500 || statusCode === 429 }
            );

 if (err && err.code === 'TimeoutError') err.retryable = true;

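The wait passed to setTimeout before the next attempt is the delay calculated by calculateRetryDelay() plus the number of seconds given in the retry-after header, converted to milliseconds.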

 setTimeout(sendRequest, delay + (err.retryAfter || 0));

 var retryAfter = parseInt(httpResponse.headers['retry-after'], 10) * 1000 || 0;
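The header handling can be sketched in Python as follows. The function names are hypothetical, and the parsing is a simplification: unlike JavaScript's parseInt, Python's int() rejects values such as "2.5" or "120abc", which this sketch treats as 0.

```python
def retry_after_ms(headers):
    """Convert a retry-after header given in seconds to milliseconds.

    Missing or non-numeric values (including HTTP dates) yield 0.
    """
    try:
        return int(headers.get('retry-after', '')) * 1000
    except (ValueError, TypeError):
        return 0


def wait_before_retry_ms(backoff_delay_ms, headers):
    """Total wait: the backoff delay plus any server-requested delay."""
    return backoff_delay_ms + retry_after_ms(headers)
```

For example, a backoff delay of 150 ms combined with a retry-after value of 2 gives a total wait of 2150 ms.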

How the wait time is determined in exponential backoff algorithms

Plain exponential backoff

sleep = min(cap, base * 2 ** attempt)

Full Jitter

sleep = random_between(0, min(cap, base * 2 ** attempt))

Equal Jitter

temp = min(cap, base * 2 ** attempt)
sleep = temp / 2 + random_between(0, temp / 2)

https://www.awsarchitectureblog.com/2015/03/backoff.html
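The three formulas differ only in how much of the capped exponential value is randomized, which is easiest to see by comparing the minimum and maximum sleep each one allows. The sketch below computes those bounds; backoff_bounds is a hypothetical helper, and base=100 and cap=2000 are arbitrary illustrative values.

```python
def backoff_bounds(attempt, base, cap):
    """Return the (min, max) sleep of each scheme at the given attempt."""
    temp = min(cap, base * 2 ** attempt)
    return {
        'exponential': (temp, temp),       # no randomness
        'full_jitter': (0, temp),          # random_between(0, temp)
        'equal_jitter': (temp / 2, temp),  # temp / 2 + random_between(0, temp / 2)
    }


for attempt in range(6):
    print(attempt, backoff_bounds(attempt, base=100, cap=2000))
```

Full Jitter spreads retries over the widest range, while Equal Jitter keeps a guaranteed minimum wait of half the capped value.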


Last updated 1 month ago