How can you use DuckDB with data stored in S3?
Everyone has heard at least something about DuckDB over the last several weeks, and I'm no exception. So I decided to give it a try and explore its capabilities for reading, writing, and creating views in S3.
For those who haven't heard about DuckDB yet, long story short: DuckDB is an in-process SQL database management system that lets you decouple storage and compute. You can store your data in blob storage, create views over that data, and save those views as a database file in the same blob storage. Because views are evaluated when they are read, your database reflects updates to the underlying data. Moreover, you can use your laptop as the compute to query and process the data.
Sounds interesting, doesn't it? Let's dive deeper into how you can use it.
Prerequisites
There are two prerequisites for querying data from S3 with DuckDB in Python:
Install the AWS CLI; you can do that by following the official documentation.
Install DuckDB using the following command:
pip install duckdb
Use case 1: Read data from S3 and process it using SQL
You can read .parquet and .json files from S3 and use SQL queries to aggregate, filter, and process them. You can use the code snippet below to experiment with DuckDB.
import duckdb
# Create a connection and pass AWS credentials
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3 support (autoloaded in recent versions)
con.execute("CALL load_aws_credentials()")
# Read one parquet file from S3 as a table
con.sql("""
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM read_parquet('s3://bucket/file.parquet')
""")
# Read several files at once with a glob pattern
con.sql("""
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM read_parquet('s3://bucket/*.parquet')
""")
# Read JSON files from S3 as a table
con.sql("""
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM read_json('s3://bucket/file.json')
""")
After running this code you will get a table named my_table. You can keep running queries against my_table, and each time the result is nicely printed. For example, I explored a dataset of Amazon bestsellers:
┌──────────────────┬───────┬──────────────────────┬────────┐
│       time       │ Price │       Product        │ Stars  │
│     varchar      │ float │       varchar        │ double │
├──────────────────┼───────┼──────────────────────┼────────┤
│ 2020-07-31 00:38 │ 180.0 │ Echo Dot (3ª Geraç…  │  NULL  │
│ 2020-07-31 00:38 │  24.0 │ Kindle 10a. geraçã…  │  NULL  │
│ 2020-07-31 00:38 │ 305.0 │ Kindle Paperwhite …  │  NULL  │
│ 2020-07-31 00:38 │ 234.0 │ Echo Show 5 - Smar…  │  NULL  │
│ 2020-07-31 00:38 │  15.0 │ Capa Nupro para Ki…  │  NULL  │
└──────────────────┴───────┴──────────────────────┴────────┘
⚠️ DuckDB doesn't support reading very large files in one go, so with a large file you can get an error about exceeding 16777216 bytes.
Use case 2: Create views for your data and share them
You can create views from your data and share them as .db files. For example, you may want to create a view with only the top 10 best-selling products with a maximum price of $100.
import duckdb
# Connect to the database you will create
con = duckdb.connect("best_sellers.db")
# Choose which data you want to share
con.sql("""
CREATE VIEW top_10 AS
SELECT Product, Stars
FROM read_parquet('s3://bucket/*.parquet')
WHERE Price <= 100
""")
con.close()
This code saves a best_sellers.db file in the current directory. You can upload it to your S3 bucket, and other people will be able to open your view of the data using:
import duckdb
con = duckdb.connect()
con.sql("""ATTACH 's3://bucket/best_sellers.db' AS top_10 (READ_ONLY)""")
⚠️ You cannot read databases and views from private S3 buckets, only public ones.
Thank you for reading, let's chat 💬
💬 Have you heard about DuckDB before reading this post?
💬 Have you tried to use it?
💬 Which features do you think are essential to add?
I love hearing from readers 🫶🏻 Please feel free to drop comments, questions, and opinions below 👇🏻