How can you use DuckDB with data stored in S3?
Everyone has heard at least something about DuckDB over the last several weeks, and I'm no exception. So I decided to give it a try and explore its capabilities for reading, writing, and creating views in S3.
For those who haven't heard about DuckDB yet, long story short: DuckDB is an in-process SQL database management system that lets you decouple storage and compute. You can store your data in blob storage, create views over that data, and save those views as a database file in the same blob storage. Because views are evaluated when they are read, your database reflects updates to the underlying data. Moreover, you can use your laptop as the compute to query and process the data.
Sounds interesting, doesn't it? Let's dive deeper into how you can use it.
Prerequisites
There are two prerequisites for querying data from S3 with DuckDB in Python:
Install the AWS CLI; you can do that by following the official documentation.
Install DuckDB using the following command:
pip install duckdb
Use case 1: Read data from S3 and process it using SQL
You can read .parquet and .json files from S3 and use SQL queries to aggregate, filter, and process them. You can use the code snippet below to experiment with DuckDB.
import duckdb
# Create a connection and pass AWS credentials
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 support (autoloaded in recent versions)
con.execute("CALL load_aws_credentials()")
# Read one parquet file from S3 as a table
con.sql("""
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM read_parquet('s3://bucket/file.parquet')
""")
# Read several files at once with a glob pattern
con.sql("""
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM read_parquet('s3://bucket/*.parquet')
""")
# Read JSON files from S3 as a table
con.sql("""
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM read_json('s3://bucket/file.json')
""")
After running this code you will get a table named my_table. You can keep running queries against my_table, and each time the result is nicely printed. For example, I explored a dataset of Amazon bestsellers:
┌──────────────────┬───────┬──────────────────────┬────────┐
│       time       │ Price │       Product        │ Stars  │
│     varchar      │ float │       varchar        │ double │
├──────────────────┼───────┼──────────────────────┼────────┤
│ 2020-07-31 00:38 │ 180.0 │ Echo Dot (3ª Geraç…  │  NULL  │
│ 2020-07-31 00:38 │  24.0 │ Kindle 10a. geraçã…  │  NULL  │
│ 2020-07-31 00:38 │ 305.0 │ Kindle Paperwhite …  │  NULL  │
│ 2020-07-31 00:38 │ 234.0 │ Echo Show 5 - Smar…  │  NULL  │
│ 2020-07-31 00:38 │  15.0 │ Capa Nupro para Ki…  │  NULL  │
└──────────────────┴───────┴──────────────────────┴────────┘
⚠️ DuckDB doesn't support reading very large files in one go, so with a large file you can get an error about exceeding 16777216 bytes.
Use case 2: Create views for your data and share them
You can create views from your data and share them as .db files. For example, you may want to create a view with only the top 10 best-selling products with a maximum price of $100.
import duckdb
# Connect to the database you will create
con = duckdb.connect("best_sellers.db")
# Choose which data you want to share
con.sql("""
CREATE VIEW top_10 AS
SELECT Product, Stars
FROM read_parquet('s3://bucket/*.parquet')
WHERE Price <= 100
""")
con.close()
This code saves a best_sellers.db file in the current directory. You can upload it to your S3 bucket, and other people will be able to open your view of the data using:
import duckdb
con = duckdb.connect()
con.sql("""ATTACH 's3://bucket/best_sellers.db' AS top_10 (READ_ONLY)""")
⚠️ You cannot read databases and views from private S3 buckets, only public ones.
Thank you for reading, let's chat 💬
💬 Have you heard about DuckDB before reading this post?
💬 Have you tried to use it?
💬 Which features do you think are essential to add?
I love hearing from readers 🫶🏻 Please feel free to drop comments, questions, and opinions below 👇🏻