How to Ingest Data from Kafka into ClickHouse
Learn how to ingest data from a Kafka topic into ClickHouse using the Kafka table engine, materialized views, and MergeTree tables.
Overview
This article walks through the process of sending data from a Kafka topic to a ClickHouse table. We’ll use the Wiki recent changes feed, which provides a stream of events that represent changes made to various Wikimedia properties. The steps include:
- How to setup Kafka on Ubuntu
- Ingest a stream of data into a Kakfa topic
- Create a ClickHouse table that subscribes to the topic
1. Setup Kafka on Ubuntu
- Create an Ubuntu ec2 instance and SSH on to it:
- Install Kafka (based on the instructions here: https://www.linode.com/docs/guides/how-to-install-apache-kafka-on-ubuntu/):
- Start ZooKeeper:
- Open a new console and launch Kafka:
- Open a third console and create a topic named wikimedia:
- You can verify it was created successfully by:
2. Ingest the Wikimedia Stream into Kafka
- We need some utilities first:
- The data is sent to Kafka using a clever curl command that grabs the latest Wikimedia events, parses out the JSON data and sends that to the Kafka topic:
- You can "describe" the topic:
- Let's verify everything is working by consuming some events:
- Hit Ctrl+c to kill the previous command.
3. Ingest the Data into ClickHouse
- Here is what the incoming data looks like:
- We will need the Kafka table engine to pull the data from the Kafka topic:
- For some reason the Kafka table engine seems to take the public ec2 URL and convert it to the private DNS name, so I had to add that to my local /etc/hostsfile:
- You can read from a Kafka table, you just have to enable a setting:
The rows should come back nicely parsed based on the columns defined in the wikiQueue table:
- We need a MergeTree table to store these incoming events:
- Let's define a materialized view that gets triggered when an insert occurs on the Kafka table and sends the data to our rawEvents table:
- You should start seeing data going into rawEvents almost immediately:
- Let's view some of the rows:
- Let's see what types of events are coming in:
Let's define a materialized view chained to our current materialized view. We will keep track of some aggregated stats per minute:
- We will need -Merge functions to view the results:
 · 6 min read
