How Apache NiFi 2 Integrates Apache Iceberg
Background
Apache Iceberg has experienced widespread adoption as a unifying storage solution for structured data at scale. With a standard specification for Catalogs, Tables, and Schemas, Iceberg enables declarative access to analytic datasets using pluggable implementations and formats. The Iceberg REST Catalog Specification provides an important integration layer, abstracting table metadata operations using a common vocabulary of HTTP methods. Apache Parquet consistently ranks among the most popular file formats for Iceberg Tables, with other current and emerging alternatives available. A large number of open source and commercial products provide integration with Iceberg.
Introduction
Apache NiFi 2.7.0 introduced the PutIcebergRecord Processor to support writing structured Records to Iceberg Tables. Although NiFi 1 supported similar capabilities through the PutIceberg Processor, the initial release of Apache NiFi 2 removed the Processor based on its tight coupling to Apache Hadoop and Apache Hive dependencies in Iceberg Catalog integrations. The redesigned integration approach aligns Apache NiFi 2 Controller Service interfaces with Apache Iceberg interfaces. The initial implementation supports Iceberg REST Catalogs and Parquet file formatting with several storage services, enabling multiple integration options with a path for future extension.
Structured Implementation
Iceberg support in NiFi 2 consists of multiple API and implementation bundles. This structure provides integration boundaries that follow the pattern of the Iceberg modules themselves. Although the modular architecture requires some additional configuration steps across multiple Controller Services, it sets the foundation for both pluggable configuration and extensible implementation patterns.
Controller Service Interfaces
Forming the contract for extension, several Controller Services define the methods required for configuring and running the PutIcebergRecord Processor. The nifi-iceberg-service-api-nar bundles the following Controller Service interfaces:

- IcebergCatalog
- IcebergFileIOProvider
- IcebergWriter

Defined in the nifi-iceberg-service-api module, these Controller Services limit dependency references to iceberg-api to maintain alignment with Iceberg API and implementation distinctions.
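As a sketch of this dependency boundary, the Maven configuration for an API module of this kind might declare iceberg-api as its only Iceberg dependency (the version is managed elsewhere and omitted here):

```xml
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-api</artifactId>
</dependency>
```

Keeping implementation libraries such as iceberg-core out of the API module means that each concrete NAR can bundle its own implementation dependencies without leaking them across service boundaries.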
Iceberg Catalog Controller Service
The IcebergCatalog Controller Service interface abstracts access to the central Catalog interface, which supports standard Table operations. Based on this abstraction, Apache NiFi implementations can support Iceberg REST Catalogs as well as other open source or commercial catalogs. Controller Services that implement the IcebergCatalog interface abstract the configuration properties for the Catalog, maintaining the boundary between access to the Catalog and other operations.
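For reference, the core of the iceberg-api Catalog interface looks roughly like the following abridged excerpt (signatures simplified and overloads omitted; the Iceberg Javadoc is authoritative):

```java
public interface Catalog {
    List<TableIdentifier> listTables(Namespace namespace);
    Table loadTable(TableIdentifier identifier);
    Table createTable(TableIdentifier identifier, Schema schema);
    boolean dropTable(TableIdentifier identifier);
    void renameTable(TableIdentifier from, TableIdentifier to);
}
```

These table-level operations are what the NiFi abstraction exposes, regardless of whether the backing catalog is a REST service or another implementation.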
Iceberg FileIO Provider Controller Service
As the name implies, the IcebergFileIOProvider Controller Service interface provides instances of the Iceberg FileIO, which supports reading and writing formatted files using concrete object storage services. Implementations of FileIO can be configured with credentials supplied from an Iceberg Catalog. Storage services from major cloud service providers often require libraries with extensive dependency trees, so encapsulating access to and configuration of FileIO instances through the IcebergFileIOProvider interface enables class loading isolation for different services.
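The FileIO interface itself is small, which is what makes storage services pluggable. An abridged excerpt of the iceberg-api definition (simplified; default methods omitted):

```java
public interface FileIO {
    InputFile newInputFile(String path);
    OutputFile newOutputFile(String path);
    void deleteFile(String path);
}
```

Everything storage-specific, including authentication and SDK dependencies, lives behind these three operations, so a NAR per storage service keeps each SDK's dependency tree isolated.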
Iceberg Writer
The IcebergWriter Controller Service interface supplies instances of the IcebergRowWriter, which handles serializing individual Iceberg Record objects into one or more Iceberg DataFile objects according to supported file formats, such as Parquet or Avro.

The NiFi IcebergRowWriter interface follows the pattern of the Iceberg TaskWriter. However, with the Iceberg TaskWriter being part of the iceberg-core library instead of iceberg-api, redefining the interface in the NiFi module maintains alignment with the Iceberg API and avoids additional dependencies present in the Iceberg Core library.
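The write-then-complete pattern can be illustrated with a minimal self-contained sketch. The names below are hypothetical, not the NiFi or Iceberg API: a row writer accepts rows one at a time, rolls to a new data file at a size threshold, and returns the accumulated files on completion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the TaskWriter-style pattern: buffer rows, then
// complete() to obtain the data files produced. Not the actual NiFi interface.
interface RowWriter<T> {
    void write(T row);          // serialize or buffer a single row
    List<String> complete();    // finish writing and return data file names
}

// Toy implementation that starts a new "file" every maxRowsPerFile rows.
class RollingRowWriter implements RowWriter<String> {
    private final int maxRowsPerFile;
    private final List<String> files = new ArrayList<>();
    private int rowsInCurrentFile = 0;

    RollingRowWriter(int maxRowsPerFile) {
        this.maxRowsPerFile = maxRowsPerFile;
    }

    @Override
    public void write(String row) {
        if (rowsInCurrentFile == 0) {
            files.add("data-" + files.size() + ".parquet");
        }
        rowsInCurrentFile = (rowsInCurrentFile + 1) % maxRowsPerFile;
    }

    @Override
    public List<String> complete() {
        return files;
    }
}

public class RowWriterSketch {
    public static void main(String[] args) {
        RowWriter<String> writer = new RollingRowWriter(2);
        for (int i = 0; i < 5; i++) {
            writer.write("row-" + i);
        }
        // 5 rows at 2 rows per file produce 3 data files
        System.out.println(writer.complete().size()); // prints 3
    }
}
```

The real implementations delegate serialization to a format-specific writer and return Iceberg DataFile metadata rather than file names, but the lifecycle is the same: write each Record, then complete the writer to commit the resulting files.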
Controller Service Implementations
Building against Controller Service interfaces allows NiFi components to follow the structure of Iceberg implementation modules. Initial capabilities released in NiFi 2.7 focus on support for the most common integration patterns. These service implementations do not preclude introducing additional modules, but prioritize standard features over comprehensive options.
REST Iceberg Catalog
The RESTIcebergCatalog is a concrete implementation of the IcebergCatalog Controller Service that supports the Iceberg REST Catalog Specification. The nifi-iceberg-rest-catalog-nar bundles the RESTIcebergCatalog along with Iceberg client libraries. The Controller Service class uses the Iceberg RESTClient class from the iceberg-core library and supports a number of configuration properties.
For authentication to Iceberg REST Catalogs, the NiFi Controller Service supports either Bearer Authentication with a static token or OAuth 2.0 using the Client Credentials Grant Type with a configurable Authorization Server. These authentication strategies align with supported methods in Apache Polaris, Apache Gravitino, and other implementations. OAuth 2.0 authentication also supports configurable Access Token Scopes for minimizing privileges associated with provisioned credentials.
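As a sketch, a Client Credentials request to a configured Authorization Server follows the standard OAuth 2.0 form; the host, endpoint path, credentials, and scope value below are placeholders:

```
POST /oauth2/token HTTP/1.1
Host: authorization-server.example.com
Content-Type: application/x-www-form-urlencoded

grant_type=client_credentials&client_id=CLIENT_ID&client_secret=CLIENT_SECRET&scope=catalog
```

The Authorization Server returns an access token that the Controller Service then presents as a Bearer token on subsequent REST Catalog requests.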
As the central access point for Iceberg Table operations, the RESTIcebergCatalog has a configuration property for the File IO Provider that supports pluggable storage service implementations. For storage services that require authentication, the Access Delegation Strategy supports requesting Vended Credentials from the configured REST Catalog. Selecting the Vended Credentials strategy enables the Controller Service client to send the X-Iceberg-Access-Delegation HTTP request header. REST Catalogs that support the vended-credentials header value return scoped credentials as part of the Iceberg Table metadata, providing a simplified and secure configuration strategy.
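A load-table request under this strategy looks roughly as follows, per the Iceberg REST Catalog Specification; the host, namespace, and table names are placeholders, while the header name and value come from the specification:

```
GET /v1/namespaces/sales/tables/orders HTTP/1.1
Host: rest-catalog.example.com
Authorization: Bearer ACCESS_TOKEN
X-Iceberg-Access-Delegation: vended-credentials
```

A catalog that honors the header includes short-lived storage credentials in the table metadata response, which the FileIO Provider can then use without any statically configured storage secrets.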
Azure Data Lake Storage FileIO Provider
The ADLSIcebergFileIOProvider integrates Iceberg metadata and record persistence with Azure Data Lake Storage using the iceberg-azure library. The nifi-iceberg-azure-nar bundles the Azure SDK for Data Lake Storage and associated dependencies, with specific transitive dependency exclusions for unused features. The ADLSIcebergFileIOProvider supports a configurable Authentication Strategy that defaults to Vended Credentials for authentication provided by the configured Iceberg REST Catalog. The Controller Service also supports configuring a static token using Azure shared access signatures.
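As an illustrative configuration sketch (property names paraphrased from the description above rather than copied from the component documentation), the default strategy requires no storage credentials at all:

```
ADLSIcebergFileIOProvider
  Authentication Strategy: Vended Credentials   # default: credentials supplied by the REST Catalog
```

The alternative strategy replaces vended credentials with a statically configured shared access signature token, trading the simplicity of delegation for direct control over storage authorization.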
S3 Iceberg FileIO Provider
The S3IcebergFileIOProvider supports storing Iceberg metadata and records in Amazon S3. The iceberg-aws library provides the implementation based on version 2 of the AWS SDK and the nifi-iceberg-aws-nar bundles the required dependencies. The NiFi module uses selective dependency exclusions to minimize the number of transitive dependencies bundled in the NAR, reducing the archive size while supporting required features.
The Authentication Strategy property of the S3IcebergFileIOProvider controls several dependent properties. The default value of Vended Credentials avoids the need to configure other authentication properties. The optional Basic Credentials and Session Credentials options support configuration of static Access Key ID and Secret Access Key properties, along with the Client Region of the S3 bucket location.
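An illustrative sketch of a static-credential configuration, using the property names described above with placeholder values:

```
S3IcebergFileIOProvider
  Authentication Strategy: Basic Credentials
  Access Key ID: EXAMPLE_ACCESS_KEY_ID     # placeholder
  Secret Access Key: EXAMPLE_SECRET_KEY    # placeholder
  Client Region: us-east-1                 # region of the S3 bucket
```

With the default Vended Credentials strategy, none of the dependent properties are required, since the REST Catalog supplies scoped credentials at table load time.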
Parquet Iceberg Writer
The ParquetIcebergWriter provides support for serializing Iceberg Records as Apache Parquet data files. The Controller Service implementation does not include any direct configuration properties in the initial version, instead relying on Catalog Table definitions to supply settings to the Parquet Writer. Abstracting support for Iceberg File Formats to Controller Services extends the configuration surface, but provides the isolated class loading and modular dependency structure needed for pluggable composability.
The nifi-iceberg-parquet-writer-nar contains the ParquetIcebergWriter class along with the minimum set of dependencies required. The Apache Parquet libraries for Java have an extensive dependency tree due to tight coupling to Apache Hadoop, which required careful selection and exclusion of libraries linked through iceberg-parquet. The Maven configuration for nifi-iceberg-parquet-writer defines the limited inclusion of Apache Hadoop Common with minimal transitive dependencies.
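The Maven mechanism for this kind of trimming is the dependency exclusion. The following fragment is a simplified illustration of the pattern, not the exact exclusions in the NiFi build:

```xml
<dependency>
    <groupId>org.apache.iceberg</groupId>
    <artifactId>iceberg-parquet</artifactId>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

Excluding the transitive Hadoop artifacts wholesale and then declaring a direct, minimal dependency on Hadoop Common keeps the NAR small while retaining the classes that the Parquet writer actually uses.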
Processor Configuration
Bringing these Controller Service interfaces together, the PutIcebergRecord Processor requires a NiFi Record Reader for input FlowFiles, an Iceberg Writer for serializing Iceberg Records, and an Iceberg Catalog for managing table operations. The Namespace and Table Name properties configure the destination Iceberg Table and support FlowFile attributes for resolving different destinations in a shared Processor. Following the pattern of most Processors, PutIcebergRecord defines standard success and failure relationships. The Processor also increments counters for Data Files Processed and Records Processed to support basic behavior tracking.
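A configuration sketch of the Processor described above, with property names paraphrased from this description; the Record Reader service and the attribute names in the Expression Language references are placeholder choices:

```
PutIcebergRecord
  Record Reader: JsonTreeReader            # any configured Record Reader service
  Iceberg Writer: ParquetIcebergWriter
  Iceberg Catalog: RESTIcebergCatalog
  Namespace: ${iceberg.namespace}          # resolved from FlowFile attributes
  Table Name: ${iceberg.table}
```

Using attribute-driven Namespace and Table Name values allows one Processor instance to route Records from many sources to different destination Tables.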
Bundled in the nifi-iceberg-processors-nar, PutIcebergRecord implements transparent batching of multiple FlowFiles, depending on the size and number of FlowFiles queued. Optimal Iceberg persistence depends on larger file sizes, so batching larger numbers of records from multiple FlowFiles during initial ingest provides a better starting point for subsequent adjustments. NiFi Processors such as MergeContent and MergeRecord should be configured in front of PutIcebergRecord to optimize Iceberg data file sizes.
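An upstream MergeRecord configuration for this purpose might look like the following sketch; the Reader and Writer services and the threshold values are illustrative, not recommendations:

```
MergeRecord
  Record Reader: JsonTreeReader
  Record Writer: JsonRecordSetWriter
  Minimum Number of Records: 100000        # batch small FlowFiles together
  Maximum Number of Records: 1000000
  Max Bin Age: 5 min                       # flush partial batches after a bounded delay
```

Tuning the minimum batch size against the maximum bin age trades ingest latency for larger, more efficient Iceberg data files.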
Conclusion
Integration with Apache Iceberg continues among numerous streaming and structured data services, highlighting the power of a common API with multiple points of extension. Apache NiFi 2 support for Iceberg does not cover every potential use case, but the standard interfaces and alignment with Iceberg API surfaces set a solid foundation for integration and extension. Component design often presents tradeoffs between simplicity and configurability, for both engineers and end users. NiFi support for Iceberg reflects some of the complexities inherent in Iceberg itself. With the flexibility of the NiFi framework, new features can be more or less opinionated. By following the modular pattern of the Iceberg project, the NiFi 2 integration incorporates current capabilities and sets a path for future enhancements.