🐾Intro to the world of Delta Lake🐾
🌐 Delta Lake is one of the most popular terms in the data world, and you can hear people talking about it all the time. In case you haven’t explored it yet, here is an intro to the world of Delta Lake🔍✨
How Delta Lake works:
🔹Delta Lake stores data in Parquet format. Each table lives in its own directory (a key prefix in a bucket on object stores), and data can be partitioned using sub-prefixes.
🔹It keeps a transaction log that records every action applied to the table. Log records are stored as .json files in the _delta_log subdirectory and named with sequential, zero-padded numerical IDs.
🔹Log records are periodically compacted into checkpoints, which are stored in .parquet format. Redundant actions are dropped from checkpoints for optimisation, e.g. an add action followed by a remove action for the same file.
🔹The _last_checkpoint file records the version of the most recent checkpoint, so readers can find it without listing the whole log.
🔹The VACUUM process removes stale or unused files, e.g. data files that are no longer referenced by the table or files left behind by failed transactions.
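To make the log mechanics above more concrete, here is a minimal sketch in plain Python. It is a toy model, not the real Delta Lake implementation: the helper names (log_file_name, write_commit, replay, build_checkpoint, demo) are hypothetical, the actions carry only a path, and the checkpoint is a plain list rather than a Parquet file. It shows how zero-padded JSON log records can be replayed in order to find the live data files, and how a checkpoint can skip a file that was added and later removed.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def log_file_name(version: int) -> str:
    # Log records are named with the version number, zero-padded to 20 digits.
    return f"{version:020d}.json"


def write_commit(log_dir: Path, version: int, actions: list[dict]) -> None:
    # One JSON action per line, as in the real log (fields heavily simplified here).
    path = log_dir / log_file_name(version)
    path.write_text("\n".join(json.dumps(a) for a in actions))


def replay(log_dir: Path) -> set[str]:
    # Apply add/remove actions in version order to get the live data files.
    # Zero-padded names sort lexicographically in the same order as numerically.
    live: set[str] = set()
    for path in sorted(log_dir.glob("*.json")):
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


def build_checkpoint(live: set[str]) -> list[dict]:
    # Toy checkpoint: only add actions for files that are still live, so an
    # add followed by a remove of the same file leaves no trace.
    return [{"add": {"path": p}} for p in sorted(live)]


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp) / "_delta_log"
        log_dir.mkdir()
        write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
        write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
        write_commit(log_dir, 2, [
            {"remove": {"path": "part-0.parquet"}},
            {"add": {"path": "part-2.parquet"}},
        ])
        live = replay(log_dir)
        checkpoint = build_checkpoint(live)
        # Pointer to the latest checkpoint, so readers need not list the log.
        (log_dir / "_last_checkpoint").write_text(json.dumps({"version": 2}))
        pointer = json.loads((log_dir / "_last_checkpoint").read_text())
        return {"live": live, "checkpoint": checkpoint, "pointer": pointer}


if __name__ == "__main__":
    print(demo())
```

In this toy run part-0.parquet is added in version 0 and removed in version 2, so it does not appear in the checkpoint at all. Its data file would then be a candidate for VACUUM.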
Why use Delta Lake:
✅ It lets you query point-in-time snapshots or roll back erroneous updates.
✅ A streaming job can write small objects into the table and compact them into larger objects later using the OPTIMIZE command. This gives you efficient streaming I/O.
✅ Objects in a Delta table are immutable, so you can safely cache them.
✅ It automatically optimizes the size of objects in a table and the clustering of data records without impacting running queries.
✅ It can continue reading old Parquet files without rewriting them if a table’s schema changes.
✅ Audit logging based on the transaction log.
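Several of the benefits above (point-in-time snapshots, rollbacks, audit logging) fall out of the same transaction log. Here is a small sketch of that idea, again as a toy model in plain Python with hypothetical names (LOG, snapshot_at, rollback_to, audit_trail) and made-up users. It is not Delta Lake's API, only an illustration of why an append-only log makes these features cheap.

```python
from __future__ import annotations

# Hypothetical in-memory log: the list index is the table version.
LOG = [
    {"commitInfo": {"operation": "WRITE", "user": "alice"},
     "actions": [("add", "part-0.parquet")]},
    {"commitInfo": {"operation": "WRITE", "user": "bob"},
     "actions": [("add", "part-1.parquet")]},
    {"commitInfo": {"operation": "DELETE", "user": "bob"},
     "actions": [("remove", "part-0.parquet")]},
]


def snapshot_at(log: list[dict], version: int) -> set[str]:
    # Point-in-time snapshot: replay the log only up to the requested version.
    live: set[str] = set()
    for commit in log[: version + 1]:
        for kind, path in commit["actions"]:
            if kind == "add":
                live.add(path)
            else:
                live.discard(path)
    return live


def rollback_to(log: list[dict], version: int) -> list[dict]:
    # Rollback: append a new commit that restores an older file set.
    # History is never rewritten, so the bad update stays visible for audit.
    target = snapshot_at(log, version)
    current = snapshot_at(log, len(log) - 1)
    actions = [("remove", p) for p in sorted(current - target)]
    actions += [("add", p) for p in sorted(target - current)]
    commit = {"commitInfo": {"operation": "RESTORE", "user": "admin"},
              "actions": actions}
    return log + [commit]


def audit_trail(log: list[dict]) -> list[tuple]:
    # Audit log: who did what, in which version.
    return [(v, c["commitInfo"]["operation"], c["commitInfo"]["user"])
            for v, c in enumerate(log)]


if __name__ == "__main__":
    restored = rollback_to(LOG, 1)
    print(snapshot_at(restored, len(restored) - 1))
    for entry in audit_trail(restored):
        print(entry)
```

One caveat this sketch glosses over: a snapshot or rollback only works while the old data files still exist, so files that VACUUM has already deleted cannot be brought back this way.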
🤓If you want to learn more about Delta Lake, I can recommend this white paper:
Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
Thank you for reading, let’s chat 💬
💬 Have you faced any challenges while migrating to Delta Lake?
💬 For those using Delta Lake, what improvements have you observed?
💬 Any tips you would recommend to others considering using Delta Lake?
I love hearing from readers 🫶🏻 Please feel free to drop comments, questions, and opinions below👇🏻