Elephant Below the Waterline: Hadoop Dependencies in Iceberg

2024-11-08 • 10 minute read • David Handermann

Background

Apache Iceberg is an open source project that provides an interoperable specification for scalable analytic tables. With support for standard SQL commands across storage systems, file formats, and solution providers, Iceberg has gained widespread adoption. Building on logical separation between metadata tracking and content storage, Iceberg Catalogs provide a flexible abstraction for managing table information. Apache Polaris provides a REST catalog service for Iceberg, enabling mediated access to large datasets from an array of programming languages. As an open table format, Iceberg also supports several file formats, including Apache Avro, Apache ORC, and Apache Parquet. In contrast to row-oriented Avro files, the column-oriented structure of Parquet has made it a leading solution for Iceberg data files.

Introduction

With documented integrations across numerous commercial and open source engines, Iceberg has emerged as a powerful solution for storing and accessing structured data at scale. With a Java API built on minimal dependencies, the iceberg-api module serves as the foundation for an extensive ecosystem. The modular structure of the Iceberg source repository highlights an architectural strategy that considers the impact of integrating with disparate services. Despite building on a modular design, however, the popularity of Parquet for Iceberg file storage brings along significant historical baggage from Apache Hadoop.

Legacy libraries and technical debt from earlier design decisions now present security and maintainability concerns for Iceberg solution providers. The runtime requirements of Iceberg support for Parquet illustrate the danger of unmaintained dependencies. At the same time, the level of interest and investment in Iceberg provides an opportunity to decouple both Parquet and Iceberg from Hadoop libraries. The current relationship between Iceberg, Parquet, and Hadoop presents a case study in dependency management and a pointer for potential improvements.

Dependency Review

As a large modular project, Iceberg includes a number of required libraries for various integrations. The HadoopCatalog represents a direct link to the Hadoop Distributed File System, but it is one of several Catalog implementations. Although packaged in the iceberg-core library, HadoopCatalog does not present the tight coupling concerns that exist at the level of file formats.

The structural connection between Iceberg and Hadoop is both clear and convoluted. The basic relationship between Iceberg and Parquet is clear in the iceberg-parquet module through the parquet-avro library. The link between Parquet and Hadoop is equally clear: parquet-avro declares multiple Hadoop dependencies, including hadoop-common and hadoop-client.

The confusion comes in the form of dependency scopes and dependency exclusions. Although untangling these dependency relationships is not straightforward, it is a useful exercise for both current integration and future improvements to the Iceberg ecosystem.

Iceberg Dependencies on Parquet and Hadoop

Reviewing the relationship graph from the perspective of the iceberg-parquet library with the Maven Dependency Plugin does not provide the full picture, because Parquet libraries declare their Hadoop dependencies with the provided scope. Maven does not pass provided-scope dependencies on to consuming projects, and this behavior is essential to understanding the runtime implications of integrating with Iceberg and Parquet. Without Hadoop dependencies declared and included at runtime, attempting to read or write Parquet files will fail with class definition errors.

The complete relationship between Iceberg and Hadoop through Parquet can be represented as a list with Iceberg at the beginning and Hadoop dependencies at the end.

iceberg-parquet
- parquet-avro
  - hadoop-client
  - hadoop-common
  - parquet-hadoop
    - hadoop-annotations
    - hadoop-client
    - hadoop-common

It is important to note that the Hadoop libraries listed are direct dependencies of both the parquet-avro and parquet-hadoop libraries. This direct relationship between parquet-avro and Hadoop is apparent in classes such as AvroReadSupport and others in the org.apache.parquet.avro package.

Although Hadoop libraries are transitive dependencies of the iceberg-parquet library, multiple Iceberg classes such as ParquetIO have direct imports of the Hadoop Configuration class. The ParquetIO class also reflects the historical connection to Hadoop in methods responsible for converting between types of input and output files. This direct link between Iceberg and Hadoop highlights the potential for transitive dependencies to become direct dependencies. This danger is present in any large project, but the nature of Hadoop configuration and stream components makes it particularly prone to such issues.

Parquet Dependencies on Hadoop

Moving from direct Iceberg references to Parquet itself, the coupling to Hadoop becomes clearer. The parquet-avro library depends on several Hadoop libraries, declaring the provided scope for each library. The parquet-hadoop library includes the same Hadoop dependencies plus hadoop-annotations and also sets the provided scope for each library. Because of the provided scope, however, declaring a dependency on either of these Parquet libraries does not surface the runtime requirements for Hadoop. The same is true when declaring a dependency on Iceberg libraries for Parquet. As a result, integration developers must introduce direct dependencies on Hadoop themselves.
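The effect of the provided scope can be illustrated with a small model of transitive resolution. The following is a minimal sketch, not Maven code: the Dep class and the resolve method are hypothetical names, the graph contains only the libraries named in this article, and the compile scopes assigned to the Parquet libraries are assumptions for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ProvidedScopeSketch {

    /** Minimal stand-in for a declared dependency; hypothetical, not a Maven API. */
    static final class Dep {
        final String artifact;
        final String scope;
        final List<Dep> dependencies;

        Dep(String artifact, String scope, Dep... dependencies) {
            this.artifact = artifact;
            this.scope = scope;
            this.dependencies = List.of(dependencies);
        }
    }

    /** Artifacts a consumer receives after declaring a dependency on root. */
    static Set<String> resolve(Dep root) {
        Set<String> resolved = new LinkedHashSet<>();
        collect(root, resolved);
        return resolved;
    }

    private static void collect(Dep dep, Set<String> resolved) {
        if (!resolved.add(dep.artifact)) {
            return; // already visited
        }
        for (Dep child : dep.dependencies) {
            // Provided and test dependencies of a dependency are not passed on to consumers
            if (child.scope.equals("provided") || child.scope.equals("test")) {
                continue;
            }
            collect(child, resolved);
        }
    }

    /** Graph as described in the article; the compile scopes are assumptions. */
    static Dep icebergParquet() {
        Dep parquetHadoop = new Dep("parquet-hadoop", "compile",
                new Dep("hadoop-common", "provided"),
                new Dep("hadoop-client", "provided"),
                new Dep("hadoop-annotations", "provided"));
        Dep parquetAvro = new Dep("parquet-avro", "compile",
                parquetHadoop,
                new Dep("hadoop-common", "provided"),
                new Dep("hadoop-client", "provided"));
        return new Dep("iceberg-parquet", "compile", parquetAvro);
    }
}
```

Calling ProvidedScopeSketch.resolve(ProvidedScopeSketch.icebergParquet()) yields iceberg-parquet, parquet-avro, and parquet-hadoop, with no Hadoop artifacts in the result. Real Maven resolution involves additional rules such as version mediation and exclusions, which this sketch leaves out.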

Before proceeding to evaluate the Hadoop dependencies, it is important to review the dependency declarations within the Parquet modules. The Parquet libraries define specific exclusions for each of their Hadoop dependencies. These exclusions cover the SLF4J logging framework and the reload4j replacement for log4j version 1. This means that projects depending on Parquet can also exclude these logging libraries. The rest of the Hadoop dependency structure, however, requires additional analysis.

Hadoop Libraries and Dependencies

The Hadoop dependency landscape reflects its historical roots dating back to 2006. Although its continued use points to its powerful influence on data engineering, it also showcases the ripple effect of poor dependency management on modern projects. The most prominent example is the hadoop-common library, which is a direct dependency of Parquet and also a shared dependency of the other Hadoop libraries.

Hadoop Common Dependencies

Version 3.4.1 of the Hadoop Common library has over 80 direct and transitive dependencies.

These dependencies include both Gson and Jackson for JSON processing, as well as Jettison, which supports reading JSON using the Java streaming interface for XML.

Hadoop Common dependencies include two copies of Google Guava through the official version and the hadoop-shaded-guava version. The Guava library is over 3 MB, making the duplication particularly impactful. Iceberg has its own shaded version of Guava packaged as iceberg-bundled-guava unrelated to the Hadoop versions.

Several more concerning dependencies include Jetty Server 9.4 and Commons Collections 3.2.2. Maintainers declared end of community support for Jetty 9 in June 2022.

The Apache Commons project released Commons Collections 3.2.2 in November 2015 before moving development to version 4. The Checkmarx Developer Hub reports a community security vulnerability for Commons Collections 3.2.2 related to a potential StackOverflowError for self-referencing operations. The Commons Collections dependency is noteworthy because it was not replaced with version 4 until September 2024 for HADOOP-15760 in Hadoop 3.4.2.

Beyond duplicative and dated dependencies, Hadoop Common includes libraries for Apache Curator, Apache Kerby, Apache ZooKeeper, Eclipse Jersey, and Netty. The wide array of network services listed is notable in the context of Iceberg and Parquet because none of these libraries relate to file operations.

Although not an exhaustive enumeration of dependencies in Hadoop Common, the foregoing summary illustrates several significant issues with lack of maintenance and lack of constraints in Hadoop libraries. It also points to options for excluding unused dependencies in current projects or potential refactoring of the Hadoop Common library.

Hadoop Client Dependencies

Hadoop Common brings along the largest set of additional dependencies, but the hadoop-client and hadoop-mapreduce-client-core libraries push the number of direct and transitive dependencies over 100. Both libraries bring in dependencies on Hadoop YARN, which requires Jetty 9.4 WebSocket libraries. Jetty has released several major versions in recent years, with Jetty 12 released in August 2023. As indicated by the major version changes, however, Jetty 12 includes a number of significant changes, including Java 17 as the minimum required version.

The Hadoop MapReduce Client module also has a dependency on the complete set of Netty libraries, which brings along components for several unused network services like Redis and Memcached. The colocation of MapReduce client components with stream processing classes is another point at which Parquet, and thus Iceberg, carry the weight of unnecessary transitive dependencies.

Suboptimal Solutions

The tight coupling between Parquet and Hadoop has prompted a number of change requests and dependency exclusion strategies. For services not tethered to the Hadoop ecosystem, selective inclusion of transitive dependencies is a common tactical solution. Reviewing existing requests and responses underscores the issues with the current state of dependency relationships.

Requested Changes

Both Iceberg and Parquet have received multiple requests to decouple the relationship to Hadoop libraries. The level of detail varies, but multiple requests share the same theme over several years.

GitHub issues for Iceberg related to Hadoop dependencies include several associated with Apache Flink and general support for writing Parquet files.

- Issue 10180 for writing Parquet files with Hadoop Configuration
- Issue 7332 for making Hadoop an optional dependency with Flink
- Issue 3117 for decoupling Hadoop libraries from Flink integration

GitHub issues for Parquet include a number of open and resolved items for moving away from direct links to Hadoop.

- Issue 1497 for reading and writing Parquet files without Hadoop
- Issue 2473 for integrating with Parquet without Hadoop
- Issue 2938 for writing Parquet files without Hadoop
- Issue 3028 for implementing a Java NIO class without Hadoop

The level of interest in support for Parquet without Hadoop is evident in both Iceberg and Parquet communities. Changes in recent versions of Parquet include incremental progress in this direction, and comments on open issues indicate ongoing work.

Shading Hadoop Dependencies

As mentioned in some issue discussions, the hadoop-client-api library is one attempt to package common Hadoop components in a single JAR without transitive dependencies. However, the JAR shades dependencies with a combined size of 20 MB.

Shading simplifies the dependency tree, but it removes the ability to address vulnerabilities in transitive libraries. Dependency scanning software that goes beyond artifact name and version resolution still surfaces issues with shaded dependencies, making the single JAR approach insufficient.
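A scanner that looks inside an archive can find relocated classes by their entry paths rather than by artifact coordinates. The following is a minimal sketch of that idea using the standard java.util.jar API. The ShadedPackageScanner class is a hypothetical helper, and the relocation prefix passed by the caller, such as org/apache/hadoop/shaded/, is an assumption that should be checked against the actual JAR contents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

final class ShadedPackageScanner {

    /** Reads the entry names of a JAR file. */
    static List<String> entryNames(Path jar) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream().map(JarEntry::getName).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Groups class entries found under a relocation prefix by their
     * first `depth` package segments, for example com.google.common.
     */
    static SortedSet<String> relocatedPackages(Collection<String> entryNames, String prefix, int depth) {
        SortedSet<String> packages = new TreeSet<>();
        for (String name : entryNames) {
            if (!name.startsWith(prefix) || !name.endsWith(".class")) {
                continue;
            }
            String[] segments = name.substring(prefix.length()).split("/");
            // The last segment is the class file, so only the preceding segments are packages
            int count = Math.min(depth, segments.length - 1);
            if (count > 0) {
                packages.add(String.join(".", Arrays.copyOf(segments, count)));
            }
        }
        return packages;
    }
}
```

Passing the result of entryNames for a shaded JAR to relocatedPackages with a depth of 3 produces a short list of bundled library roots, which can then be compared against known vulnerability reports for the original libraries.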

Selective Inclusion

Based on the current state of Parquet and its links to Hadoop, selective dependency inclusion presents a viable workaround in some cases. Excluding dependencies can be dangerous because it impacts runtime behavior and requires careful review when upgrading direct or transitive versions. Unit testing can help provide a level of protection against unexpected behavior, but there is no substitute for exercising code in the context of runtime class loading.
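One way to exercise runtime class loading before deployment is to load the required classes from the trimmed set of JARs in an isolated class loader. The following is a minimal sketch under stated assumptions: TrimmedClasspathCheck is a hypothetical helper, and the list of class names to verify, for example org.apache.hadoop.conf.Configuration, has to come from the code paths a project actually uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

final class TrimmedClasspathCheck {

    /** Returns the class names that cannot be loaded from the given JARs alone. */
    static List<String> unresolved(List<URL> jars, List<String> classNames) {
        List<String> missing = new ArrayList<>();
        // The platform loader is the parent, so the application classpath cannot mask a gap
        try (URLClassLoader loader = new URLClassLoader(
                jars.toArray(new URL[0]), ClassLoader.getPlatformClassLoader())) {
            for (String name : classNames) {
                try {
                    // initialize=true runs static initializers, which can surface
                    // missing classes that the requested class references
                    Class.forName(name, true, loader);
                } catch (ClassNotFoundException | LinkageError e) {
                    missing.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return missing;
    }
}
```

An empty result only shows that the named classes load and initialize. Classes referenced lazily inside method bodies are resolved when those methods run, so a check like this complements integration tests that read and write real Parquet files rather than replacing them.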

Blake Smith published an article titled "How to use Parquet Java without Hadoop" that provides a detailed walkthrough of the Parquet code and Maven configuration required. As outlined in the example Maven configuration, the transitive dependencies of Hadoop that Parquet requires at runtime are limited. In addition to Commons Collections 3.2.2, the Woodstox XML libraries and the shaded version of Google Guava are the only requirements aside from the direct Hadoop dependencies. As the article describes, this is a significant reduction from the over 100 transitive dependencies linked to Hadoop.

This strategy to support direct Parquet integration can also apply to Iceberg. With the relationship between iceberg-parquet and parquet-avro not including Hadoop dependencies due to the provided scope, projects integrating with Iceberg and Parquet can declare a limited set of Hadoop dependencies with similar exclusions. In the context of Iceberg and Parquet, direct Hadoop dependencies can be limited to hadoop-common with all transitive dependencies excluded.

Although hadoop-common has a direct dependency on version 5.4.0 of woodstox-core, released in October 2022, version 7.0.0 is compatible enough that projects can depend on the more recent release instead. The Hadoop Configuration class that Parquet requires has a single reference to one method in the UnmodifiableMap class of Commons Collections 3, making it possible to eliminate the Commons Collections dependency using a placeholder class.
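A placeholder of this kind could look like the following minimal sketch. It assumes that the referenced method is the static decorate factory of UnmodifiableMap and that an unmodifiable view from java.util.Collections is an adequate substitute; both points should be verified against the Hadoop version in use. A real placeholder would need to be declared public in the org.apache.commons.collections.map package. The sketch omits the package declaration and the public modifier only so that it compiles as a standalone file.

```java
import java.util.Collections;
import java.util.Map;

// Placeholder sketch. In a real project, declare this class as public in the
// package org.apache.commons.collections.map so Hadoop classes resolve it by name.
final class UnmodifiableMap {

    private UnmodifiableMap() {
        // static factory only
    }

    // Assumed to match the erased signature of the Commons Collections 3 method:
    // a static method taking a Map and returning a Map
    static <K, V> Map<K, V> decorate(Map<K, V> map) {
        return Collections.unmodifiableMap(map);
    }
}
```

Because the placeholder shares a fully qualified class name with the original, it must never be on the same classpath as the real Commons Collections 3 library, and any Hadoop upgrade warrants checking that no additional Commons Collections references have appeared.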

The selective inclusion strategy is not trivial, but it is an option for projects looking to integrate Iceberg and Parquet without bundling excessive dependencies from Hadoop.

Conclusion

Creating and integrating with large-scale systems often involves building with available materials. New open source solutions benefit from the work of past projects, both directly and indirectly. The presence of existing options for reading and writing Parquet files is a prime example of the positives and negatives of reusing available libraries. With that background, every project is responsible for evaluating the tradeoffs of using current capabilities or building something new. The importance of dependency management cannot be overstated. For projects like Iceberg with significant and growing adoption, the sustainability of the ecosystem depends not only on direct contributions but also on maintaining healthy dependencies.