Open table formats (OTFs) are open-source, standard table formats for working with very large datasets in a performant way. They provide a layer of abstraction on top of data lakes and bring database-like features to them. OTFs enable multiple data applications to work on the same data in a transactionally consistent manner.
Organizations can leverage OTFs to enhance their data processing capabilities, ensuring data is accessible and meaningful. Benefits of open table formats include:
- Compatibility
- Cost-effectiveness
- Efficiency
- Flexibility
- Governance
- Interoperability
- Security
These benefits make OTFs versatile choices for companies operating in multifaceted, data-intensive environments.
Why use an open table format?
In data engineering, the choice of data storage and management solutions is central to the success of data-driven initiatives. Open table formats offer a compelling array of benefits that address many of the challenges faced by data professionals today. One of the primary advantages of using an OTF is its ability to streamline data management processes. This includes simplifying data ingestion, storage, and access across diverse data ecosystems. By employing open table formats, organizations can reduce complexity, improve data quality, and accelerate time to insight, enhancing decision-making processes and operational efficiency.
Another significant benefit of open table formats is their support for schema evolution and multi-tenancy. As data structures evolve over time, the ability to adapt without extensive rework or downtime is invaluable. Furthermore, by facilitating multi-tenancy, OTFs enable organizations to efficiently manage data from multiple sources or departments within a single framework. This not only optimizes resource utilization but also ensures data security and governance are maintained at a high standard.
Lastly, the open-source nature of many open table formats fosters a collaborative environment where innovations and improvements are continuously integrated. This aspect ensures that organizations using OTFs benefit from the latest advancements in data management technology. Open-source formats are supported by a vast community of developers and data professionals who contribute to their development, stability, and security. This collective effort results in robust, cutting-edge solutions that can adapt to the ever-changing landscape of data technology. By choosing an open table format, businesses align themselves with a dynamic, forward-thinking approach to data management that is both scalable and sustainable.
Open table format features
Open table formats are engineered to enhance data management capabilities significantly. One of the cornerstone features of these formats is support for full create, read, update, and delete (CRUD) operations. This comprehensive functionality allows for flexible data manipulation and ensures that data lakes and warehouses can be updated in real time, reflecting the most current state of information. The ability to perform updates and deletes sets open table formats apart from traditional file-based storage systems, where such operations are cumbersome and inefficient.
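The update and delete support described above typically relies on a copy-on-write pattern: data files are immutable, so a delete rewrites only the affected files and swaps them into a new table snapshot. The following is a minimal conceptual sketch in plain Python, with lists standing in for data files; the function names and structures are illustrative, not a real OTF API.

```python
# Copy-on-write sketch: updates and deletes never modify a data file in
# place; they rewrite only the affected "files" and produce a new snapshot.
# All names here are illustrative, not a real table-format API.

def delete_rows(data_files, predicate):
    """Return a new snapshot (list of immutable 'files') with matching rows removed."""
    new_files = []
    for rows in data_files:
        kept = [r for r in rows if not predicate(r)]
        if kept != rows:          # file was touched: emit a rewritten copy
            new_files.append(kept)
        else:                     # file untouched: reuse it as-is
            new_files.append(rows)
    return new_files

# Each inner list stands in for one immutable Parquet file.
snapshot_1 = [[{"id": 1, "status": "active"}, {"id": 2, "status": "closed"}],
              [{"id": 3, "status": "active"}]]
snapshot_2 = delete_rows(snapshot_1, lambda r: r["status"] == "closed")
```

Note that `snapshot_1` is left intact, which is what makes features like time travel possible in real formats.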
Performance and scalability are other notable features that open table formats bring to the table. These formats are designed to excel in environments where data volumes are massive and continue to grow. They employ various optimization techniques, such as indexing, partitioning, and caching, to expedite data retrieval and processing. This not only improves query performance but also ensures that the system can scale horizontally to accommodate increasing data loads without a significant degradation in performance. As a result, organizations can manage their data ecosystems more effectively, making data-driven insights more accessible and actionable.
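Partitioning, one of the optimization techniques mentioned above, speeds up queries by letting the engine skip whole files whose partition values cannot match a filter. Here is a hedged sketch of that pruning step, assuming a hypothetical metadata layout that maps partition values to file names:

```python
# Sketch of partition pruning: table metadata records a partition value
# for each data file, so a scan can skip files that cannot match the
# query filter. The layout below is illustrative, not a real OTF format.

files_by_partition = {
    "date=2024-01-01": ["part-000.parquet", "part-001.parquet"],
    "date=2024-01-02": ["part-002.parquet"],
    "date=2024-01-03": ["part-003.parquet", "part-004.parquet"],
}

def prune(files_by_partition, wanted_dates):
    """Return only the files a scan for the given dates must actually read."""
    selected = []
    for partition, files in files_by_partition.items():
        date = partition.split("=", 1)[1]
        if date in wanted_dates:
            selected.extend(files)
    return selected

scan = prune(files_by_partition, {"2024-01-02"})  # reads 1 of 5 files
```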
Transactional support with ACID compliance is another key feature of open table formats. This ensures that all data transactions are processed reliably, maintaining data integrity and consistency across the board. ACID compliance is particularly important in scenarios where multiple transactions occur simultaneously or when the system needs to recover from partial failures. OTFs guarantee that each transaction is completed successfully or fully rolled back, providing an essential level of data reliability and trustworthiness for critical business operations. This feature is instrumental in supporting complex data workflows and ensuring that data lakes and warehouses can serve as a single source of truth for organizations.
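A common way OTFs achieve the atomicity described above is optimistic concurrency: readers resolve the table through a single "current snapshot" pointer, and a commit is one atomic swap of that pointer, rejected if another writer committed first. The sketch below illustrates the idea with a hypothetical `Table` class; it is not how Iceberg or Delta Lake are actually implemented internally.

```python
# Sketch of atomic commits via a snapshot pointer with an optimistic
# concurrency check. Hypothetical structure for illustration only.

class Table:
    def __init__(self):
        self.snapshots = {0: []}     # snapshot id -> list of data files
        self.current = 0             # the pointer readers follow

    def commit(self, expected, new_files):
        """Optimistic commit: fails if another writer committed first."""
        if self.current != expected:
            raise RuntimeError("concurrent commit detected; retry")
        new_id = self.current + 1
        self.snapshots[new_id] = self.snapshots[self.current] + new_files
        self.current = new_id        # the atomic pointer swap

t = Table()
t.commit(expected=0, new_files=["a.parquet"])   # succeeds
```

A second writer still holding `expected=0` would now fail and have to retry against the new snapshot, which is how partial or conflicting writes are kept out of the table.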
Main types of open table formats
Apache Iceberg and Delta Lake are among the most prominent formats, offering advanced solutions for managing large-scale data lakes and ensuring data integrity.
Apache Iceberg focuses on enhancing data reliability and scalability in data lakes. It offers robust schema evolution capabilities, allowing for seamless modifications to data structures without disrupting existing data or queries. Iceberg's table format is engineered to improve query performance, making it easier to handle complex analytical workloads. Its compatibility with various computing engines—including Apache Spark, Apache Flink, and Presto—further enhances its versatility.
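The schema evolution described above is typically metadata-only: adding a column updates the schema recorded in table metadata, and readers fill the new column with NULL for files written before the change, so no data is rewritten. A minimal sketch of that read-side behavior, with illustrative names rather than a real Iceberg API:

```python
# Sketch of metadata-only schema evolution: the schema changes, the old
# data files do not. Readers project old rows to the current schema and
# default absent columns to None. Illustrative names only.

schema_v1 = ["id", "name"]
schema_v2 = schema_v1 + ["email"]          # no data files are rewritten

old_file = [{"id": 1, "name": "a"}]        # written under schema_v1

def read_with_schema(rows, schema):
    """Project rows to the current schema, defaulting absent columns to None."""
    return [{col: row.get(col) for col in schema} for row in rows]

rows = read_with_schema(old_file, schema_v2)
```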
Delta Lake introduces a transactional storage layer that brings ACID transactions to Apache Spark and big data workloads. Delta Lake's ability to ensure data integrity, even in the face of concurrent reads and writes, makes it a powerful tool for data engineers. Its support for schema enforcement and time travel (the ability to query previous versions of the data) provides additional layers of data management and analysis capabilities.
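Time travel falls out of the snapshot model naturally: because every commit produces a new immutable version, querying "as of" an earlier point is just reading an older entry in the version log. The following is a hedged sketch of that idea in plain Python, not Delta Lake's actual API:

```python
# Sketch of time travel over an append-only version log. Each commit
# records the full table state as a new version. Illustrative only.

version_log = []                    # append-only list of snapshots

def commit(rows):
    previous = version_log[-1] if version_log else []
    version_log.append(previous + rows)

def read(version=None):
    """Read the latest snapshot, or an earlier one by version number."""
    if version is None:
        version = len(version_log) - 1
    return version_log[version]

commit([{"id": 1}])
commit([{"id": 2}])
latest = read()              # sees both rows
as_of_v0 = read(version=0)   # sees only the first commit
```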
The choice of one type over another may depend on specific use cases and requirements. For instance, organizations focused on scalability and complex analytics could find Apache Iceberg most suitable. Delta Lake, with its strong emphasis on ACID transactions and data integrity, could be the preferred choice for applications where consistency and reliability are paramount. The decision ultimately hinges on aligning the format's strengths with the organization's data strategy and operational needs.
Common open data table architectures
The architecture of open data tables is central to how data is stored, accessed, and managed within an organization's data ecosystem. These architectures are designed to optimize data processing and ensure seamless integration with existing data management tools and frameworks. A common architecture layers the open table format atop a cloud object store, such as Amazon Simple Storage Service (S3), Microsoft Azure Data Lake Storage Gen2, or Google Cloud Storage. This setup allows for the efficient handling of vast amounts of data while leveraging the scalability and durability of object storage services.
Another key aspect of open data table architectures is the use of metadata to manage data files. Metadata—which includes data file information like schema details, partitioning information, and change logs—is utilized in optimizing data access and query performance. By maintaining a centralized metadata store, open table formats can efficiently track changes to the data, support schema evolution, and enable features like time travel and incremental processing. These OTF capabilities can enable new workloads, such as AI use cases and model training.
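Incremental processing, mentioned above, is a direct payoff of that metadata layer: because the change log records which files each snapshot added, a downstream job can read only what is new since its last checkpoint instead of rescanning the table. A conceptual sketch, with an illustrative change-log structure rather than any real format's layout:

```python
# Sketch of metadata-driven incremental processing: the change log maps
# each snapshot id to the data files it added, so a consumer reads only
# files committed after its checkpoint. Illustrative structure only.

change_log = {
    1: {"added": ["f1.parquet"]},
    2: {"added": ["f2.parquet", "f3.parquet"]},
    3: {"added": ["f4.parquet"]},
}

def files_since(change_log, checkpoint):
    """Return the data files committed after the given snapshot id."""
    new = []
    for snapshot_id in sorted(change_log):
        if snapshot_id > checkpoint:
            new.extend(change_log[snapshot_id]["added"])
    return new

to_process = files_since(change_log, checkpoint=1)
```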
Frequently asked questions
How do table formats streamline data lakes?
Table formats enhance the efficiency and effectiveness of data lakes. By providing a structured approach to data storage and management, open table formats introduce a layer of organization that is often missing in traditional data lakes. This enables more efficient data querying and analysis, as data is stored in a manner optimized for access patterns and query performance.
One of the key ways table formats streamline data lakes is by balancing flexibility with data quality. Data lakes have traditionally relied on schema-on-read, accommodating data from various sources with different formats and structures without an up-front schema definition. Open table formats preserve that flexibility while adding schema validation at write time, which checks incoming records against the table schema and reduces the likelihood of errors and anomalies in the data. As a result, data engineers and analysts can focus on deriving insights from the data rather than spending time on cleanup and transformation tasks.
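Write-time schema enforcement can be pictured as a validation gate in front of every write: rows that do not match the table schema are rejected before they ever land in a data file. A minimal sketch, using a hypothetical `validate` helper rather than any real OTF API:

```python
# Sketch of write-time schema enforcement: a write is rejected if any
# row is missing a column or carries a wrong type, so bad records never
# reach the table. Hypothetical helper for illustration only.

SCHEMA = {"id": int, "amount": float}

def validate(rows, schema=SCHEMA):
    """Raise ValueError if any row violates the table schema."""
    for row in rows:
        for col, col_type in schema.items():
            if col not in row:
                raise ValueError(f"missing column: {col}")
            if not isinstance(row[col], col_type):
                raise ValueError(f"bad type for {col}: {row[col]!r}")
    return rows

validate([{"id": 1, "amount": 9.99}])        # passes the gate
```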
Table formats also introduce transactional support and ACID compliance to data lakes, ensuring data integrity and consistency. This is particularly important in environments where data is frequently updated or where multiple users access and modify the data concurrently. By supporting atomic transactions, open table formats ensure that data lakes can serve as a reliable source of truth for the organization, facilitating accurate and timely decision-making. Additionally, features like incremental processing and time travel enhance the flexibility of data lakes, allowing organizations to track changes over time and access historical data as needed. These capabilities make open table formats an indispensable tool for optimizing data lake operations and unlocking the full potential of data assets.
How should I choose an open table format?
Three common open table formats in the industry today have reached broad functional parity: Apache Iceberg, Linux Foundation Delta Lake, and Apache Hudi. Their ecosystems, developers, and contributor communities differ, so it may make sense to choose an OTF based on the ecosystem available and supported for your use cases and the specific requirements of your workloads. All three OTFs support ACID transactions and versioning, schema evolution, and time travel, and all three can handle complex query workloads with high performance and writes from many concurrent users.
The most open and connected ecosystem for Trusted AI
Teradata provides an open ecosystem for OTFs, catalogs, and cloud service providers (CSPs) in multi-cloud and multi-data lake environments.
This unique, open, and connected approach to supporting OTFs enables cross-read, cross-write, and cross-query of data stored in Apache Iceberg and Delta Lake tables using open catalogs such as Amazon Web Services (AWS) Glue, Hive Metastore, or Unity Catalog.
This future-ready approach allows enterprises to employ a truly modern data strategy, with unmatched agility and flexibility to deliver Trusted AI at scale—all without the need to move, replicate, or transform data.