Data Transfer from AWS Redshift to BigQuery
Databases are like Pokémon. Gotta pass data between ’em all!
There are many methods of sending data to and from BigQuery. Some are even documented!
In this post I’ll share a simple method that I use to copy tables from Redshift into BigQuery using only SQL and command line tools.
Data Synchronization
While it makes little sense “on paper” to maintain identical copies of the same dataset in separate databases, that is exactly the situation I’ve found myself in.
It’s easy for organizations to become deeply invested in specific technologies. This is true of programming languages like Java, Python or C++, and it’s just as true of cloud providers like DigitalOcean, GCP or AWS.
In attempting to transition between two such tech providers, I found myself creating components that allow my pipelines to synchronize data between Redshift and BigQuery. Here’s how they work:
- Unload data from Redshift table to s3.
- Copy those CSVs into GCP’s cloud storage.
- Delete data from BigQuery table.
- Upload CSVs from GCP cloud storage into BigQuery table.
This is all done programmatically and is fairly trivial to run (mostly via an Airflow bash operator in my case), but the devil is definitely in the details here.
All it takes is one poorly escaped record and the whole thing can come crashing down. That’s why I think it could be valuable to share the code that I’ve been using to reliably copy over tables with millions of rows.
Copying a table to BigQuery from Redshift
Here is the full process, broken down into three steps:
1. Install and configure the AWS and GCP command line tools. This will involve setting up IAM users with CLI access credentials.
2. Unload my_table_name from Redshift to s3.
3. Run a bash script for the Google Cloud Storage / BigQuery upload.
Wherever they appear, you will need to replace the placeholders (e.g. <name>, <project>, <date_range_where_clause>) with your own values.
In the sections below, I’ll walk through each step and provide more details. I did mention that the devil is in the details, right? ;)
Step 1
Self-explanatory. In practice, you may only find out whether your setup is working once you run through the first 6 commands in step 3.
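For reference, a minimal setup might look something like this, where key.json and <project> stand in for your own service-account key file and project ID:

# Configure the AWS CLI (prompts for the IAM user's access key, secret key and region).
aws configure

# Point the gcloud CLI at a GCP service account and project.
gcloud auth activate-service-account --key-file=key.json
gcloud config set project <project>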
Step 2
You’ll need to plug in your AWS credentials (or pass unload credentials some other way) and set the date range for the table. I usually unload a “lookback period” date range (e.g. the last 5 days) to update BigQuery with fresh data in my daily pipelines, although some of my tables need to be copied over completely.
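The unload itself is a single SQL statement. A minimal sketch, run through psql against the Redshift cluster, might look like this; the connection details, bucket and credentials are placeholders, and options like ESCAPE, ADDQUOTES or GZIP will depend on your data:

# PGPASSWORD (or a .pgpass file) supplies the Redshift password.
psql "host=<redshift_host> port=5439 dbname=<db> user=<user>" <<'SQL'
UNLOAD ('SELECT * FROM my_table_name WHERE <date_range_where_clause>')
TO 's3://<bucket>/<name>/'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
DELIMITER AS '\t'
ALLOWOVERWRITE
PARALLEL ON;
SQL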
Note the specified separation character: \t
Step 3
This is where 90% of the work is done, so we can walk through it line by line:
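Here is a minimal sketch of such a script; <bucket>, <gcs_bucket>, <dataset>, <name>, <project> and key.json are placeholders you’ll need to swap for your own values, and the exact flags are a starting point rather than a drop-in solution. The numbered comments correspond to the walkthrough that follows.

# 1. Copy the unloaded CSVs from s3 to a local temp folder.
aws s3 cp s3://<bucket>/<name>/ /tmp/<name>/ --recursive
# 2-4. Set the gcloud account and project and activate the service account (skip if already done).
gcloud config set account <service_account_email>
gcloud config set project <project>
gcloud auth activate-service-account --key-file=key.json
# 5. Copy the CSVs from the local temp folder to Google Cloud Storage.
gsutil -m cp /tmp/<name>/* gs://<gcs_bucket>/<name>/
# 6. Create the dataset in BigQuery (--force ignores the error if it already exists).
bq mk --force <dataset>
# 7. Create the table with the schema in my_table_schema.json.
bq mk --force --table <dataset>.<name> my_table_schema.json
# 8. Delete rows from the target table (the table must already exist).
bq query --use_legacy_sql=false 'DELETE FROM <dataset>.<name> WHERE <date_range_where_clause>'
# 9-12. Load the CSVs from Google Cloud Storage into BigQuery, with \t as the field delimiter.
bq load --source_format=CSV --field_delimiter='\t' <dataset>.<name> 'gs://<gcs_bucket>/<name>/*'
# 13-14. Delete the Google Cloud Storage and local CSVs.
gsutil -m rm 'gs://<gcs_bucket>/<name>/*'
rm -rf /tmp/<name>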
#1. Copy unloaded CSV from s3 to local temp folder.
#2–4. Set gcloud account, project and service account (can skip if already done).
#5. Copy CSV from local temp folder to Google Storage.
#6. Create dataset in BigQuery (if not exists).
#7. Create table in BigQuery with the schema specification in my_table_schema.json, e.g.
[
{
"mode": "NULLABLE",
"name": "vegetable",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "score",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "score_date",
"type": "DATE"
}
]
#8. Delete rows from target table in BigQuery if needed. The table must already exist.
#9–12. Load the CSV files from Google Storage into BigQuery, setting the separation character as \t
#13–14. Delete Google Cloud and local CSV files.
Conclusion
I shared the process that I use to copy records from Redshift tables to BigQuery tables. It can be run periodically to keep tables in the two databases in sync, or just once to copy a table over.
As always, thanks for reading. Now get back to your code! Your projects are missing you ;)