ExceptionFactory

Producing content that a reasonable developer might want to read

How Apache NiFi 2 Integrates Apache Iceberg

NiFi Iceberg Java

2026-01-26 • 7 minute read • David Handermann

Background

Apache Iceberg has experienced widespread adoption as a unifying storage solution for structured data at scale. With a standard specification for Catalogs, Tables, and Schemas, Iceberg enables declarative access to analytic datasets using pluggable implementations and formats. The Iceberg REST Catalog Specification provides an important integration layer, abstracting table metadata operations using a common vocabulary of HTTP methods. Apache Parquet consistently ranks among the most popular file formats for Iceberg Tables, with other current and emerging alternatives available. A large number of open source and commercial products integrate with Iceberg.

Introduction

Apache NiFi 2.7.0 introduced the PutIcebergRecord Processor to support writing structured Records to Iceberg Tables. Although NiFi 1 supported similar capabilities through the PutIceberg Processor, the initial release of Apache NiFi 2 removed the Processor due to its tight coupling to Apache Hadoop and Apache Hive dependencies in Iceberg Catalog integrations. The redesigned integration approach aligns Apache NiFi 2 Controller Service interfaces with Apache Iceberg interfaces. The initial implementation supports Iceberg REST Catalogs and Parquet file formatting with several storage services, enabling multiple integration options with a path for future extension.

Structured Implementation

Iceberg support in NiFi 2 consists of multiple API and implementation bundles. This structure provides integration boundaries that follow the pattern of the Iceberg modules themselves. Although the modular architecture requires some additional configuration steps across multiple Controller Services, it sets the foundation for both pluggable configuration and extensible implementation patterns.

Controller Service Interfaces

Forming the contract for extension, several Controller Services define the methods required for configuring and running the PutIcebergRecord Processor. The nifi-iceberg-service-api-nar bundles the following Controller Service interfaces:

- IcebergCatalog
- IcebergFileIOProvider
- IcebergWriter

Defined in the nifi-iceberg-service-api module, these Controller Services limit dependency references to iceberg-api, maintaining alignment with the distinction between Iceberg API and implementation modules.

Iceberg Catalog Controller Service

The IcebergCatalog Controller Service interface abstracts access to the central Catalog interface, which supports standard Table operations. Based on this abstraction, Apache NiFi implementations can support Iceberg REST Catalogs as well as other open source or commercial catalogs. Controller Services that implement the IcebergCatalog interface abstract the configuration properties for the Catalog, maintaining the boundary between access to the Catalog and other operations.
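The service boundary described above can be sketched in a few lines of Java. The names other than IcebergCatalog are illustrative stand-ins, not the actual NiFi or Iceberg types: the real interface would expose org.apache.iceberg.catalog.Catalog rather than the minimal stand-in used here.

```java
// Hypothetical sketch of the Controller Service boundary: configuration stays
// inside the service, while consumers see only a Catalog-like handle.
// "Catalog" is a stand-in for org.apache.iceberg.catalog.Catalog.
interface Catalog {
    boolean tableExists(String namespace, String table);
}

// Mirrors the role of the IcebergCatalog Controller Service interface:
// implementations own properties such as the Catalog URI and credentials.
interface IcebergCatalogService {
    Catalog getCatalog();
}

// Illustrative in-memory implementation standing in for a concrete service
// such as RESTIcebergCatalog.
final class InMemoryCatalogService implements IcebergCatalogService {
    private final java.util.Set<String> tables;

    InMemoryCatalogService(java.util.Set<String> tables) {
        this.tables = tables;
    }

    @Override
    public Catalog getCatalog() {
        return (namespace, table) -> tables.contains(namespace + "." + table);
    }
}
```

The key design point is that callers never see connection or credential properties, only the Catalog handle, which keeps Catalog access separate from other operations.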

Iceberg FileIO Provider Controller Service

As the name implies, the IcebergFileIOProvider Controller Service interface provides instances of the Iceberg FileIO, which supports reading and writing formatted files using concrete object storage services. Implementations of FileIO can be configured with credentials supplied from an Iceberg Catalog. Storage services from major cloud service providers often require libraries with extensive dependency trees, so encapsulating access to and configuration of FileIO instances through the IcebergFileIOProvider interface enables class loading isolation for different services.
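The provider pattern can be sketched as follows, again with stand-in names: "FileIO" here is a placeholder for org.apache.iceberg.io.FileIO, and the property-driven factory method illustrates how Catalog-supplied credentials could flow into a storage client, not the exact NiFi signature.

```java
import java.util.Map;

// "FileIO" is a stand-in for org.apache.iceberg.io.FileIO, reduced to reads.
interface FileIO {
    byte[] read(String location);
}

// Mirrors the role of IcebergFileIOProvider: hand out FileIO instances
// configured from properties that the Catalog may supply, such as vended
// credentials for a storage service.
interface IcebergFileIOProviderService {
    FileIO createFileIO(Map<String, String> properties);
}

// Illustrative in-memory provider; a real implementation would construct an
// Azure or S3 client using the supplied credential properties.
final class InMemoryFileIOProvider implements IcebergFileIOProviderService {
    private final Map<String, byte[]> objects;

    InMemoryFileIOProvider(Map<String, byte[]> objects) {
        this.objects = objects;
    }

    @Override
    public FileIO createFileIO(Map<String, String> properties) {
        // Credential properties are accepted but unused in this sketch.
        return objects::get;
    }
}
```

Because each concrete provider lives in its own NAR, the heavy cloud SDK dependencies behind createFileIO stay isolated from the rest of the flow.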

Iceberg Writer

The IcebergWriter Controller Service interface supplies instances of the IcebergRowWriter, which handles serializing individual Iceberg Record objects into one or more Iceberg DataFile objects according to supported file formats such as Parquet or Avro.

The NiFi IcebergRowWriter interface follows the pattern of the Iceberg TaskWriter. However, because the Iceberg TaskWriter is part of the iceberg-core library rather than iceberg-api, redefining the interface in the NiFi module maintains alignment with the Iceberg API and avoids the additional dependencies present in the Iceberg Core library.
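The TaskWriter-style pattern can be illustrated with a minimal sketch: records are written one at a time, and completed data files are collected when the writer finishes. The class and method names are hypothetical, and record counts stand in for the List of DataFile objects a real implementation would return.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the row-writer pattern: serialize records one at a
// time, rolling over to a new data file at a size threshold, and report the
// completed files on completion.
final class SketchRowWriter {
    private final long targetRecordsPerFile;
    private final List<Long> completedFileSizes = new ArrayList<>();
    private long currentCount;

    SketchRowWriter(long targetRecordsPerFile) {
        this.targetRecordsPerFile = targetRecordsPerFile;
    }

    void write(Object record) {
        currentCount++;
        if (currentCount == targetRecordsPerFile) {
            rollFile();
        }
    }

    // Returns the record count of each completed data file, standing in for
    // the DataFile objects a TaskWriter-style implementation would return.
    List<Long> complete() {
        if (currentCount > 0) {
            rollFile();
        }
        return completedFileSizes;
    }

    private void rollFile() {
        completedFileSizes.add(currentCount);
        currentCount = 0;
    }
}
```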

Controller Service Implementations

Building against Controller Service interfaces allows NiFi components to follow the structure of Iceberg implementation modules. Initial capabilities released in NiFi 2.7 focus on support for the most common integration patterns. These service implementations do not preclude introducing additional modules, but prioritize standard features over comprehensive options.

REST Iceberg Catalog

The RESTIcebergCatalog is a concrete implementation of the IcebergCatalog Controller Service that supports the Iceberg REST Catalog Specification. The nifi-iceberg-rest-catalog-nar bundles the RESTIcebergCatalog along with Iceberg client libraries. The Controller Service class uses the Iceberg RESTClient class from the iceberg-core library and supports a number of configuration properties.

For authentication to Iceberg REST Catalogs, the NiFi Controller Service supports either Bearer Authentication with a static token or OAuth 2.0 using the Client Credentials Grant Type with a configurable Authorization Server. These authentication strategies align with supported methods in Apache Polaris, Apache Gravitino, and other implementations. OAuth 2.0 authentication also supports configurable Access Token Scopes for minimizing privileges associated with provisioned credentials.
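The Client Credentials Grant Type described above reduces to a form-encoded POST against the configured Authorization Server. The following sketch shows the shape of that request using the Java standard library; the endpoint, credentials, and scope values are placeholders, not defaults from the NiFi Controller Service.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch of an OAuth 2.0 Client Credentials token request: a form-encoded
// POST with an optional scope parameter to minimize token privileges.
final class ClientCredentialsRequest {
    static String formBody(String clientId, String clientSecret, String scope) {
        return "grant_type=client_credentials"
                + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
                + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8)
                + "&scope=" + URLEncoder.encode(scope, StandardCharsets.UTF_8);
    }

    static HttpRequest build(URI tokenEndpoint, String body) {
        return HttpRequest.newBuilder(tokenEndpoint)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

The Authorization Server responds with an access token that the catalog client then presents as a Bearer token on subsequent REST Catalog requests.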

As the central access point for Iceberg Table operations, the RESTIcebergCatalog has a configuration property for the File IO Provider that supports pluggable storage service implementations. For storage services that require authentication, the Access Delegation Strategy supports requesting Vended Credentials from the configured REST Catalog. Selecting the Vended Credentials strategy enables the Controller Service client to send the X-Iceberg-Access-Delegation HTTP request header. REST Catalogs that support the vended-credentials header value return scoped credentials as part of the Iceberg Table metadata, providing a simplified and secure configuration strategy.
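At the HTTP level, the Vended Credentials strategy amounts to adding one request header when loading table metadata. The sketch below builds such a request against the load-table path from the Iceberg REST Catalog Specification; the catalog URL is illustrative, and a real client would also attach authentication headers.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of a load-table request with access delegation: the
// X-Iceberg-Access-Delegation header signals that the client wants the REST
// Catalog to return scoped storage credentials with the table metadata.
final class LoadTableRequest {
    static HttpRequest vendedCredentials(URI catalogBase, String namespace, String table) {
        URI uri = catalogBase.resolve("v1/namespaces/" + namespace + "/tables/" + table);
        return HttpRequest.newBuilder(uri)
                .header("X-Iceberg-Access-Delegation", "vended-credentials")
                .GET()
                .build();
    }
}
```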

Azure Data Lake Storage FileIO Provider

The ADLSIcebergFileIOProvider integrates Iceberg metadata and record persistence with Azure Data Lake Storage using the iceberg-azure library. The nifi-iceberg-azure-nar bundles the Azure SDK for Data Lake Storage and associated dependencies, with specific transitive dependency exclusions for unused features. The ADLSIcebergFileIOProvider supports a configurable Authentication Strategy that defaults to Vended Credentials for authentication provided by the configured Iceberg REST Catalog. The Controller Service also supports configuring a static token using Azure shared access signatures.

S3 Iceberg FileIO Provider

The S3IcebergFileIOProvider supports storing Iceberg metadata and records in Amazon S3. The iceberg-aws library provides the implementation based on version 2 of the AWS SDK, and the nifi-iceberg-aws-nar bundles the required dependencies. The NiFi module uses selective dependency exclusions to minimize the number of transitive dependencies bundled in the NAR, reducing the archive size while supporting required features.

The Authentication Strategy property of the S3IcebergFileIOProvider controls several dependent properties. The default value of Vended Credentials avoids the need to configure other authentication properties. The optional Basic Credentials and Session Credentials options support configuration of static Access Key ID and Secret Access Key properties, along with the Client Region of the S3 bucket location.

Parquet Iceberg Writer

The ParquetIcebergWriter provides support for serializing Iceberg Records as Apache Parquet data files. The Controller Service implementation does not include any direct configuration properties in the initial version, instead relying on Catalog Table definitions to supply settings to the Parquet Writer. Abstracting support for Iceberg File Formats to Controller Services extends the configuration surface, but provides the isolated class loading and modular dependency structure needed for pluggable composability.

The nifi-iceberg-parquet-writer-nar contains the ParquetIcebergWriter class along with the minimum set of dependencies required. The Apache Parquet libraries for Java have an extensive dependency tree due to tight coupling to Apache Hadoop, which required careful selection and exclusion of libraries linked through iceberg-parquet. The Maven configuration for nifi-iceberg-parquet-writer defines the limited inclusion of Apache Hadoop Common with minimal transitive dependencies.

Processor Configuration

Bringing these Controller Service interfaces together, the PutIcebergRecord Processor requires a NiFi Record Reader for input FlowFiles, an Iceberg Writer for serializing Iceberg Records, and an Iceberg Catalog for managing table operations. The Namespace and Table Name properties configure the destination Iceberg Table and support FlowFile attributes for resolving different destinations in a shared Processor. Following the pattern of most Processors, PutIcebergRecord defines standard success and failure relationships. The Processor also increments counters for Data Files Processed and Records Processed to support basic behavior tracking.
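Attribute-based destination resolution can be sketched with a minimal resolver that handles only the simple ${attribute} form of NiFi Expression Language. This is an illustration of the routing behavior, not the actual NiFi Expression Language implementation, which supports a much richer syntax.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of attribute-driven destination routing: Namespace and
// Table Name property values may reference FlowFile attributes, so one
// Processor can write to different tables. Only the plain ${attribute}
// expression form is handled here.
final class DestinationResolver {
    private static final Pattern ATTRIBUTE = Pattern.compile("\\$\\{([^}]+)}");

    static String resolve(String property, Map<String, String> attributes) {
        Matcher matcher = ATTRIBUTE.matcher(property);
        StringBuilder resolved = new StringBuilder();
        while (matcher.find()) {
            String value = attributes.getOrDefault(matcher.group(1), "");
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }
}
```

With this pattern, a FlowFile carrying an iceberg.namespace attribute of "analytics" and a literal Table Name of "events" would resolve to the analytics.events table.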

Bundled in the nifi-iceberg-processors-nar, PutIcebergRecord implements transparent batching of multiple FlowFiles, depending on the size and number of FlowFiles queued. Optimal Iceberg persistence depends on larger file sizes, so batching larger numbers of records from multiple FlowFiles during initial ingest provides a better starting point for subsequent adjustments. NiFi Processors such as MergeContent and MergeRecord should be configured in front of PutIcebergRecord to optimize Iceberg data file sizes.

Conclusion

Integration with Apache Iceberg continues among numerous streaming and structured data services, highlighting the power of a common API with multiple points of extension. Apache NiFi 2 support for Iceberg does not cover every potential use case, but the standard interfaces and alignment with Iceberg API surfaces set a solid foundation for integration and extension. Component design often presents tradeoffs between simplicity and configurability, for both engineers and end users. NiFi support for Iceberg reflects some of the complexities inherent in Iceberg itself. With the flexibility of the NiFi framework, new features can be more or less opinionated. By following the modular pattern of the Iceberg project, the NiFi 2 integration incorporates current capabilities and sets a path for future enhancements.