This article describes the basics of Apache Flume and illustrates how to quickly set up Flume agents for collecting fast-moving data streams and pushing the data into Hadoop’s filesystem. By the time we’re finished, you should be able to configure and launch a Flume agent and understand how multi-hop and fan-out flows are easily constructed from multiple agents.
Anatomy of a Flume Agent
Flume deploys as one or more agents, each contained within its own instance of the Java Virtual Machine (JVM). Agents consist of three pluggable components: sources, sinks, and channels. An agent must have at least one of each in order to run. Sources collect incoming data as events. Sinks write events out, and channels provide a queue to connect the source and sink. (Figure 1.)
Figure 1: Flume Agents consist of sources, channels, and sinks.
Put simply, Flume sources listen for and consume events. Events can range from newline-terminated strings in
stdout to HTTP
POSTs and RPC calls — it all depends on what sources the agent is configured to use. Flume agents may have more than one source, but must have at least one. Sources require a name and a type; the type then dictates additional configuration parameters.
On consuming an event, Flume sources write the event to a channel. Importantly, sources write to their channels as transactions. By dealing in events and transactions, Flume agents maintain end-to-end flow reliability. Events are not dropped inside a Flume agent unless the channel is explicitly allowed to discard them due to a full queue.
Channels are the mechanism by which Flume agents transfer events from their sources to their sinks. Events written to the channel by a source are not removed from the channel until a sink removes that event in a transaction. This allows Flume sinks to retry writes in the event of a failure in the external repository (such as HDFS or an outgoing network connection). For example, if the network between a Flume agent and a Hadoop cluster goes down, the channel will keep all events queued until the sink can correctly write to the cluster and close its transactions with the channel.
Channels are typically of two types: in-memory queues and durable disk-backed queues. In-memory channels provide high throughput but no recovery if an agent fails. File or database-backed channels, on the other hand, are durable. They support full recovery and event replay in the case of agent failure.
Sinks provide Flume agents pluggable output capability — if you need to write to a new type storage, just write a Java class that implements the necessary classes. Like sources, sinks correspond to a type of output: writes to HDFS or HBase, remote procedure calls to other agents, or any number of other external repositories. Sinks remove events from the channel in transactions and write them to output. Transactions close when the event is successfully written, ensuring that all events are committed to their final destination.
Setting up a Simple Agent for HDFS
A simple one-source, one-sink Flume agent can be configured with just a single configuration file. In this example, I’ll create a text file named
sample_agent.conf — it looks a lot like a Java properties file. At the top of the file, I configure the agent’s name and the names of its source, sink, and channel.
hdfs-agent.sinks = hdfs-write
This defines an agent named
hdfs-agent and the names of the sources, sinks, and channels; keep the name in mind, because we’ll need it to start the agent. Multiple sources, sinks, and channels can be defined on these lines as a whitespace-delimited list of names. In this case, the source is named
netcat-collect, the sink
hdfs-write, and the channel is named
memory-channel. The names are indicative of what I’m setting up: events collected via
netcat will be written to HDFS and I will use a memory-only queue for transactions.
Next, I configure the source. I use a
netcat source, as it provides a simple means of interactively testing the agent. A
netcat source requires a type as well as an address and port to which it should bind. The
netcat source will listen on localhost on port 11111; messages to
netcat will be consumed by the source as events.
With the source defined, I’ll configure the sink to write to HDFS. HDFS sinks support a number of options, but by default, the HDFS sink writes Hadoop
SequenceFiles. In this example, I’ll specify the sink write raw textfiles to HDFS so they can be easily inspected; I’ll also set a roll interval, which forces Flume to commit writes to HDFS every 30 seconds. File rolls can be configured based on time, size, or a combination of the two. The file rolls are particularly important in environments for which HDFS does not support appending to files.
Finally, I’ll configure a memory-backed channel to transfer events from source to sink and connect them together. Keep in mind that if I exceed the channel capacity, Flume will drop events. If I need durability in a file or JDBC, channel should be used instead.
With the configuration complete, I start the Flume agent from a terminal:
flume-ng agent -f /path/to/sample_agent.conf -n hdfs-agent
In the Flume agent’s logs, I look for indication that the source, sink, and channel have successfully started. For example:
In a separate terminal, connect to the agent via
netcat and enter a series of messages.
> nc localhost 11111
cat, I’ll find the messages from
More Advanced Deployments
Regardless of source, direct writers to HDFS are too simple to be suitable for many deployments: Application servers may reside in the cloud while clusters are on-premise, many streams of data may need to be consolidated, or events may need to be filtered during transmission. Fortunately, Flume easily enables reliable multi-hop event transmission. Fan-in and fan-out patterns are readily supported via multiple sources and channel options. Additionally, Flume provides the notion of interceptors, which allow the decoration and filtering of events in flight.
Flume provides multi-hop deployments via Apache Avro-serialized RPC calls. For a given hop, the sending agent implements an Avro sink directed to a host and port where the receiving agent is listening. The receiver implements an Avro source bound to the designated host-port combination. Reliability is ensured by Flume’s transaction model. The sink on the sending agent does not close its transaction until receipt is acknowledged by the receiver. Similarly, the receiver does not acknowledge receipt until the incoming event has been committed to its channel.
Figure 2: Multihop event flows are constructed using RPCs between Avro sources and sinks.
Fan-In and Fan-Out
Fan-in is a common case for Flume agents. Agents may be run on many data collectors (such as application servers) in a large deployment, while only one or two writers to a remote Hadoop cluster are required to handle the total event throughput. In this case, the Flume topology is simple to configure. Each agent at a data collector implements the appropriate source and an Avro sink. All Avro sinks point to the host and port of the Flume agent charged with writing to the Hadoop cluster. The agent at the Hadoop cluster simply configures an Avro source on the designated host and port. Incoming events are consolidated automatically and are written to the configured sink.
Fan-out topologies are enabled via Flume’s source selectors. Selectors can be replicating — sending all events to multiple channels — or multiplexing. Multiplexed sources can be partitioned by mappings defined on events via interceptors. For example, a replicating selector may be appropriate when events need to be sent to HDFS and to flat log files or a database. Multiplexed selectors are useful when different mappings should be directed to different writers; for example, data destined for partitioned Hive tables may be best handled via multiplexing.
hdfs-agent.sources.netcat-collect.selector.type = replicating
hdfs-agent.sources.r1.channels = mchannel1 mchannel2
Flume provides a robust system of interceptors for in-flight modification of events. Some interceptors serve to decorate data with metadata useful in multiplexing the data or processing it after it has been written to the sink. Common decorations include timestamps, hostnames, and static headers. It’s a great way to keep track of when your data arrived and from where it came.
More interestingly, interceptors can be used to selectively filter or decorate events. The Regex Filtering Interceptor allows events to be dropped if they match the provided regular expression. Similarly, the Regex Extractor Interceptor decorates event headers according to a regular expression. This is useful if incoming events require multiplexing, but static definitions are too inflexible.
hdfs-agent.sources.netcat-collect.interceptors = filt_int
There are lots of ways to acquire Big Data with which to fill up a Hadoop cluster, but many of those data sources arrive as fast-moving streams of data. Fortunately, the Hadoop ecosystem contains a component specifically designed for transporting and writing these streams: Apache Flume. Flume provides a robust, self-contained application which ensures reliable transportation of streaming data. Flume agents are easy to configure, requiring only a property file and an agent name. Moreover, Flume’s simple source-channel-sink design allows us to build complicated flows using only a set of Flume agents. So, while we don’t often address the process of acquiring Big Data for our Hadoop clusters, doing so is as easy and fun as taking a log ride.