Over the past two years, we have seen significant support emerging for Apache Iceberg, a table format originally developed by Netflix that was open-sourced as an Apache incubator project in 2018 and graduated from the incubator program in 2020. Ability to “time travel” across the table to view data at a given point in timeĬhoosing which table format to use is an important decision because it can enable or limit the features available.Faster performance due to better filtering or partitioning.Therefore, applying a table format to the data can offer a number of advantages, such as: Moreover, the table metadata can be saved in a way that offers more fine-grained partitioning. Instead of applying a schema when the data is read, clients already know the schema before the query is run. Table formats explicitly define a table, its metadata, and the files that compose the table. This is where applying table formats to data becomes extremely useful. All of a sudden, what seemed like an easy-to-implement data architecture becomes more difficult than expected. Each query engine must have its own view of how to query the files. Additionally, files by themselves do not make it easy to change schemas of a table, or to “time travel” over it. Typical blob storage systems in the cloud lack the information required to show relationships between files or how they correspond to a table, making the job of query engines that much harder. But for very large volumes of data, generic cloud storage also presents challenges and limitations in how that data can be accessed, managed, and used. The cloud has allowed data teams to collect vast quantities of data and store it at reasonable cost, opening the door to new analytics use cases that leverage data lakes, data mesh, and other modern architectures.
0 Comments
Leave a Reply. |