Streaming large files between S3 and GCS (Python)


Streaming files between S3 and GCS can be a pain, especially for large files (1GB+). Your first choice should be Google Storage Transfer Service or gsutil: Storage Transfer Service is the right tool for moving large amounts of data between hyperscalers, while gsutil works well for transfers under 1TB. If neither fits, you can fall back to custom code in Python or another language.

Note: Both services mentioned above work pretty well, so you don’t need to reinvent the wheel. A reliable streaming solution takes time to mature, especially for large files.

A few examples of gsutil

# Help is your friend here
gsutil --help

# List s3 files
gsutil ls s3://path/to/your/s3/bucket

# List GCS files
gsutil ls gs://path/to/your/gcs/bucket

# Copy S3 files to GCS
gsutil cp s3://source/path/to/s3/file.gz gs://destination/bucket/in/gcs/with/name.gz

There are still cases where gsutil or Transfer Service will fall short. Both tools need list permissions on the root bucket, so if your file resides inside a sub-directory and you don’t have those bucket-level permissions, they will fail with errors like the following. For example, with s3://my_bucket/foo/bar.gz, if you don’t have list permissions on my_bucket, you won’t be able to use these tools to download `bar.gz`.

AccessDeniedException: 403 SignatureDoesNotMatch

I hope Google fixes these issues. Some relevant issues can be found here and here. Drop a message if you think the situation has improved :).
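Before giving up on the managed tools, it is worth confirming that your credentials can at least reach the object itself, even without list permissions on the bucket. A minimal check with boto3 (the bucket and key are placeholders from the example above):

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='<YOUR S3 KEY>',
    aws_secret_access_key='<YOUR S3 SECRET>',
)

# head_object needs s3:GetObject on the key rather than s3:ListBucket on the bucket,
# so it can succeed in exactly the situation where gsutil's listing fails.
response = s3.head_object(Bucket='my_bucket', Key='foo/bar.gz')
print(response['ContentLength'])  # object size in bytes

If this call succeeds while gsutil still errors out, the object-level permissions are fine and a custom transfer like the one below will work.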

So either you get yourself permissions on the root bucket, or you do a custom transfer. For a custom transfer, the following can help you.

To stream files between S3 and GCS using Python, you will need the following modules installed:

pip install boto3
pip install smart_open[s3]
pip install smart_open[gcs]
pip install hurry.filesize

smart-open is a Pythonic way to open files residing in hyperscalers like AWS and GCS.

hurry.filesize is just a helper to nicely print out the sizes of files.
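For example, size() turns a raw byte count into a short human-readable string; this is what the transfer script below uses for its progress output (outputs shown assume the default "traditional" 1024-based system):

from hurry.filesize import size

print(size(4096))               # '4K'
print(size(256 * 1024 * 1024))  # '256M'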

import boto3
from smart_open import open
from hurry.filesize import size

# AWS Credentials
session = boto3.Session(
    aws_access_key_id='<YOUR S3 KEY>',
    aws_secret_access_key='<YOUR S3 SECRET>',
)

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

CHUNK_SIZE = 256 * 1024 * 1024  # 256MB
PART_SIZE = 256 * 1024 * 1024   # 256MB

source_s3_url = 's3://path/to/s3/file.gz'
destination_gcp_url = 'gs://path/to/gcs/file.gz'

chunk_index = 0
print('Starting the sink')

with open(destination_gcp_url, 'wb', transport_params={'min_part_size': PART_SIZE}) as gcp_sink:
    with open(source_s3_url, 'rb', transport_params={'session': session}, ignore_ext=True) as s3_source:
        for piece in read_in_chunks(s3_source, CHUNK_SIZE):
            print('Read: ' + size(chunk_index * CHUNK_SIZE) + ' (' + str(chunk_index) + ')')
            gcp_sink.write(piece)
            chunk_index = chunk_index + 1

print('done')

The above code is simple: you open a file in GCS with mode wb (write binary) and the source file in S3 with mode rb (read binary), then read 256MB chunks from S3 and write them to GCS, repeating until the transfer is done.

For Google credentials, expose an environment variable and make sure it is available when you run this script.

# See for more info: https://cloud.google.com/docs/authentication/getting-started
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service/account/credentials"
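If you would rather not depend on the environment variable, newer smart_open releases also let you hand the GCS transport a pre-built google.cloud.storage.Client via transport_params (a sketch; check your installed version's documentation for the exact parameter names, and note that the key path and project are placeholders):

from google.cloud import storage
from google.oauth2 import service_account
from smart_open import open

PART_SIZE = 256 * 1024 * 1024  # 256MB, same as in the script above

# Load the service account key explicitly instead of via GOOGLE_APPLICATION_CREDENTIALS.
credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service/account/credentials.json'
)
gcs_client = storage.Client(project='<YOUR GCP PROJECT>', credentials=credentials)

with open('gs://path/to/gcs/file.gz', 'wb',
          transport_params={'client': gcs_client, 'min_part_size': PART_SIZE}) as gcp_sink:
    gcp_sink.write(b'...')  # write chunks here as in the main script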

One interesting parameter is ignore_ext=True. smart-open comes with decompression capabilities: if you are reading a .gz file, it can decompress the chunks automatically, so you end up with a decompressed file at the destination. If you don’t want decompression (i.e. you want a byte-for-byte copy of the .gz), set the parameter to True; if you do want decompression, leave it set to False.
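A quick way to see the difference (a sketch; the URL is a placeholder, the boto3 session transport_params are omitted for brevity and should be added as in the main script, and newer smart_open releases replace ignore_ext with a compression argument, so check which one your installed version expects):

from smart_open import open

s3_url = 's3://path/to/s3/file.gz'  # placeholder

# Raw copy: the bytes stay gzip-compressed, which is what the transfer script wants.
with open(s3_url, 'rb', ignore_ext=True) as raw:
    print(raw.read(2))  # b'\x1f\x8b' -- the gzip magic bytes, if the object really is gzip

# Transparent decompression: smart_open gunzips on the fly based on the .gz extension.
with open(s3_url, 'rb') as decompressed:
    print(decompressed.read(64))  # plain, decompressed bytes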
