Amazon S3 object storage is the industry standard for inexpensive storage of large files. All major cloud vendors offer S3-like services with comparable access patterns, features, and pricing.
S3-style object storage forms the backbone of modern cloud storage, offering scalable, high-performance storage for a wide range of use cases such as backups, disaster recovery, big data analytics, and archiving.
This blog post shows how to access Google Cloud's S3 alternative, Google Cloud Storage, with a simple Python script that stores and retrieves a file.
The idea is to use Google Cloud Storage to export data from an observability platform for offline AI/ML training.
Install the Google Cloud CLI
As a first step, install the Google Cloud CLI. We will use it to create a credentials file that our Python script needs in order to access Google Cloud Storage.
Read about how to download and install the Google Cloud CLI.
The Python to Google Storage import/export script
Now we will implement a simple Python script for exporting data to and importing data from Google Cloud Storage.
Refer to the Google Cloud Storage API help page for the full details of the available API.
The Python script below loads the ADC credentials JSON file and offers two functions: one for uploading given data into a Google Cloud Storage bucket, and another for loading the data back from that bucket:
# Imports the Google Cloud client library
from google.cloud import storage
import os

# Locate your Google Cloud ADC credential JSON file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = ".../gcp/application_default_credentials.json"

def upload(project, bucket, blobname, data):
    print("upload")
    # Initialize the client by specifying your Google GCP project id
    storage_client = storage.Client(project=project)
    # Open the existing bucket and write the data into the given blob
    bucket = storage_client.bucket(bucket)
    blob = bucket.blob(blobname)
    with blob.open("w") as f:
        f.write(data)
    print(f"Data stored in bucket: {bucket.name}.")

def load(project, bucket, blobname):
    print("load")
    # Initialize the client by specifying your Google GCP project id
    storage_client = storage.Client(project=project)
    # Open the existing bucket and read the blob's content back
    bucket = storage_client.bucket(bucket)
    blob = bucket.blob(blobname)
    with blob.open("r") as f:
        print(f.read())

upload("myplayground-3", "train-data-22342343242", "train.csv", "test, test, test")
load("myplayground-3", "train-data-22342343242", "train.csv")
You can also download the Python file from GitHub.
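The script above streams short strings through blob.open(). For larger training files it may be more convenient to transfer whole files instead. Below is a minimal sketch, reusing the same placeholder project, bucket, and file names as above, that uses the client library's upload_from_filename and download_to_filename methods:

from google.cloud import storage

def upload_file(project, bucket_name, blobname, local_path):
    # Upload a local file (e.g. a large CSV training set) into the bucket
    storage_client = storage.Client(project=project)
    blob = storage_client.bucket(bucket_name).blob(blobname)
    blob.upload_from_filename(local_path)

def download_file(project, bucket_name, blobname, local_path):
    # Download the blob back into a local file
    storage_client = storage.Client(project=project)
    blob = storage_client.bucket(bucket_name).blob(blobname)
    blob.download_to_filename(local_path)

upload_file("myplayground-3", "train-data-22342343242", "train.csv", "./train.csv")
download_file("myplayground-3", "train-data-22342343242", "train.csv", "./train_copy.csv")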
If we run the script without credentials configured, we get the following error informing us that the necessary authorization file is missing:
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application.
You can easily check whether a local authorization file is already configured by printing the ‘GOOGLE_APPLICATION_CREDENTIALS’ environment variable:
print(os.environ['GOOGLE_APPLICATION_CREDENTIALS'])
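Note that accessing os.environ directly raises a KeyError if the variable is not set. A minimal, safer check using os.environ.get could look like this:

import os

# Read the variable without raising a KeyError when it is missing
creds = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
if creds:
    print(f"Using credentials file: {creds}")
else:
    print("GOOGLE_APPLICATION_CREDENTIALS is not set")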
In Python you can set the environment variable within your Python program to the GCP credentials file:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/path/to/file.json"
Create the ADC credential JSON file
To run the script, we first need to generate the ADC credential JSON file using the Google Cloud CLI's authorization mechanism.
Create the credentials file
Create your own GCP ADC credentials file for your local development environment by using credentials associated with your Google Account.
- Install and initialize the gcloud CLI
- Run the following gcloud command to create your credential file:
gcloud auth application-default login
Your browser opens and shows the Google Cloud login screen. After a successful login, the credentials are created and stored in a local JSON file.
See the gcloud command line process below:
./gcloud auth application-default login
Your browser has been opened to visit:
https://accounts.google.com/oauth2/auth?………&code_challenge_method=S256
Credentials saved to file: [/…/.config/gcloud/application_default_credentials.json]
These credentials will be used by any library that requests Application Default Credentials (ADC).
Quota project “playground” was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.
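Once the credentials file has been created, you can verify that Application Default Credentials resolve correctly before running the import/export script. Here is a minimal sketch using google.auth.default() from the google-auth package, which is installed as a dependency of the storage client:

import google.auth

# Resolve Application Default Credentials; raises DefaultCredentialsError if none are found
credentials, project = google.auth.default()
print(f"ADC resolved for project: {project}")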
Summary
Google Cloud Storage offers a convenient way to persist and share large amounts of data at a moderate price. It's a great fit for keeping your AI/ML training data or storing a trained AI/ML model.
Using the dedicated language clients, such as the Python client library, it's simple to upload and download your training sets and to work with Google Cloud Storage as the data backend.