Are you looking to get started with Apache Kafka for building streaming data pipelines and applications? If so, you'll need to download and install Kafka before you can start producing and consuming data.
In this guide, I'll provide detailed steps for installing Kafka on both Windows and Linux. I'll also share my insights as a data analyst on what you need to know to run Kafka for the first time.
By the end, you'll have Kafka running locally so you can start developing streaming applications with confidence. Let's get started!
An Introduction to Apache Kafka
But before we dive into installing Kafka, let's do a quick overview of what Kafka is and why it's become so popular.
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies including Uber, Netflix, and Spotify. It acts as a real-time, fault-tolerant messaging system that can process massive volumes of data with low latency.
Kafka provides these core capabilities:
- Publish and subscribe to streams of events or messages
- Store streams of events durably and reliably
- Process streams as they occur in real-time
[Figure: Kafka's publish-subscribe architecture. Image source: Real Python]
Some common use cases that Kafka excels at include:
- Messaging – Kafka can replace traditional message brokers like RabbitMQ
- Activity Tracking – Collect user activity events from a website or app
- Metrics Collection – Gather system and application metrics for monitoring
- Log Aggregation – Collect logs from many servers into a central place
- Stream Processing – Analyze real-time data streams to take action
- Data Integration – Reliably move data between systems
Companies like Netflix and Spotify use Kafka to process billions of events per day for real-time analytics and data pipelines.
Kafka provides high throughput, low latency, and scalability for event streaming. Some key features include:
- Distributed and partitioned – Kafka runs in a cluster and partitions topic data across brokers and disks. This allows for scalability as the system grows.
- Fault tolerant – Data is replicated to prevent data loss. Kafka can sustain node failures and automatically recover and rebalance partitions.
- Durability – Events are written to disk for durability and replayed when needed. This prevents data loss.
- High performance – Kafka handles millions of events per second with very low end-to-end latency.
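To make the partitioning idea more concrete, here is a minimal sketch of how a record key could be mapped to a partition. The function name choose_partition is my own, and CRC32 is only a stand-in hash: Kafka's default partitioner uses a murmur2 hash, so the actual partition numbers will differ. The point is only that the same key always lands on the same partition.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is a stand-in here; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    # The same key is always routed to the same partition.
    for key in (b"user-1", b"user-2", b"user-1"):
        print(key.decode(), "->", choose_partition(key, 3))
```

Because events with the same key stay on one partition, their relative order is preserved, while different keys spread the load across brokers.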
With Kafka, you can build real-time data pipelines that move data reliably between systems. And Kafka Streams enables you to build applications that react to data streams in real time.
Now that you understand Kafka's capabilities, let's go through the steps to download and install it.
Step-by-Step Guide to Installing Kafka on Windows
Before installing Kafka, you need Java 8 or higher on your machine. Kafka is written in Java and Scala, so it runs on the Java Virtual Machine.
Here are the detailed steps to install Kafka on a Windows OS:
1. Install Java JDK
To check if Java is already installed, open a command prompt and type:
java -version
This will print the Java version if it's installed:
java version "1.8.0_102"
Java(TM) SE Runtime Environment (build 1.8.0_102-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.102-b14, mixed mode)
If you get an error like 'java' is not recognized, then Java is not installed yet.
To install Java, go to AdoptOpenJDK.net (the project has since moved to Eclipse Adoptium at adoptium.net) and download the latest HotSpot JDK 8 or 11 build for Windows x64.
Run the executable installer and follow the prompts to install Java. Choose the default options.
Once installed, verify the install worked by running java -version again in your command prompt. You should see the Java version print.
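If you would rather check the Java version from a script, the sketch below parses the output of java -version and returns the major version. The helper names parse_java_major and installed_java_major are my own, and the regular expression assumes the usual version "..." format shown above; other vendors may format the line differently.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_102).
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major() -> int:
    """Run `java -version` and return the major version.

    Raises FileNotFoundError if java is not on the PATH.
    """
    # `java -version` writes its output to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

Calling installed_java_major() on a machine with the Java build shown above should give 8, which meets the Java 8 or higher requirement.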
2. Download the Kafka Binary Release
Go to the Apache Kafka downloads page and select a stable binary release. This guide uses Kafka 2.5.0. The ZooKeeper steps below apply to Kafka 3.x and earlier, because Kafka 4.0 removed ZooKeeper.
Download the binary file in tgz format, for example kafka_2.12-2.5.0.tgz.
Save the download to a directory on your Windows machine, such as C:\kafka.
3. Extract the Downloaded Binary
The Kafka binary download is in compressed Tar/Gzip format and needs to be extracted to install Kafka.
To extract the files, use an archive tool like 7-Zip or WinRAR.
For example, open 7-Zip and navigate to the folder where you downloaded the Kafka tarball.
Then right-click the file and choose "Extract Here" to extract the files in place. With 7-Zip, the first pass may only unpack the .tgz into a .tar file. If so, run "Extract Here" again on the .tar file.
This will extract the kafka_2.12-2.5.0 directory containing the Kafka runtime, libs, config files and scripts.
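If you don't have an archive tool handy, the extraction can also be done with Python's standard tarfile module. This is only a sketch: the function name extract_kafka is my own, and the paths in the usage note are the example locations used in this guide.

```python
import tarfile
from pathlib import Path


def extract_kafka(archive: str, dest: str) -> Path:
    """Extract a .tgz archive into dest and return its top-level directory."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        if not names:
            raise ValueError("archive is empty")
        # Kafka releases unpack into a single top-level folder.
        top_level = names[0].split("/")[0]
        tar.extractall(dest_path)
    return dest_path / top_level
```

For example, extract_kafka(r"C:\kafka\kafka_2.12-2.5.0.tgz", r"C:\kafka") would return the path of the extracted kafka_2.12-2.5.0 folder.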
4. Start the ZooKeeper Server
This Kafka release relies on Apache ZooKeeper for coordination and configuration management of the Kafka cluster.
So we first need to start a ZooKeeper server instance.
Open a new command prompt and change to the extracted Kafka directory:
cd C:\kafka\kafka_2.12-2.5.0
Then run this command to launch ZooKeeper:
bin\windows\zookeeper-server-start.bat config\zookeeper.properties
This will start a ZooKeeper server running in the foreground. Leave this command prompt open.
5. Start the Kafka Broker Service
Now we can start up the Kafka broker service that handles event streaming.
Open a new command prompt again and change to the Kafka directory:
cd C:\kafka\kafka_2.12-2.5.0
Run this command to launch a Kafka broker:
bin\windows\kafka-server-start.bat config\server.properties
This will start Kafka running in the foreground. Kafka and ZooKeeper are now fully running!
You now have an environment to start developing Kafka producers and consumers on your local Windows machine.
When you are done using Kafka, you can stop the services by pressing Ctrl + C in each command prompt to terminate the processes.
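While both services are running, a quick way to confirm they are listening is to try a TCP connection to each port. The sketch below assumes the default ports: 9092 for the Kafka broker and 2181 for ZooKeeper (the clientPort in the bundled zookeeper.properties). The helper name port_is_open is my own, and the same check works on Linux.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Default ports are assumed; adjust if you changed the config files.
    for name, port in (("ZooKeeper", 2181), ("Kafka", 9092)):
        state = "listening" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening there, so treat it as a first sanity check and not as proof that the broker is healthy.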
Now let's go through how to install Kafka on Linux machines.
Step-by-Step Guide to Installing Kafka on Linux
Kafka is written in Java and Scala, so it runs equally well on Linux and Unix systems like Ubuntu, Debian, CentOS, and RHEL.
Here is how to install Kafka on a Linux OS like Ubuntu or Debian:
1. Install Latest Java JDK
First check if you already have Java installed:
java -version
If Java is not installed, install your distribution's default OpenJDK Java Development Kit:
sudo apt-get install default-jdk
Then verify with java -version that Java was installed properly.
2. Download Kafka Binary Release
Go to https://kafka.apache.org/downloads and download a stable Kafka binary release, for example kafka_2.12-2.5.0.tgz. The binary archive is platform-independent, so it is the same file you would use on Windows.
Save the download to your home directory or wherever you wish to install Kafka.
3. Extract the Downloaded Binary
The Kafka binary download is compressed, so we need to extract the files first.
Open a terminal and cd to the directory where you downloaded Kafka.
Then extract with this command:
tar -xzf kafka_2.12-2.5.0.tgz
This will extract the files into a directory named kafka_2.12-2.5.0. Change into it:
cd kafka_2.12-2.5.0
4. Start ZooKeeper
Start a ZooKeeper server, which is required for Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
5. Start Kafka Server
Open a new terminal tab or window.
Change to the Kafka directory, then start the broker:
bin/kafka-server-start.sh config/server.properties
Kafka and ZooKeeper are now up and running!
You can now start building producers and consumers to stream data.
When finished, you can stop Kafka and ZooKeeper by pressing Ctrl + C in their respective terminal windows.
And that's it! By now you should have Kafka installed and running on either Windows or Linux.
Helpful Tips for Common Issues
Here are some tips for resolving common problems people run into:
- Firewall blocking access – Kafka's default port is 9092. Make sure your firewall allows connections on this port.
- Need to allow port in Windows Firewall – When starting Kafka on Windows, you may need to allow port 9092 in Windows Firewall for external connections.
- Java version mismatch – The error java.lang.UnsupportedClassVersionError means the Java runtime is older than the version the classes were compiled for. Double check you are using Java 8 or 11.
- ZooKeeper connection issues – The Kafka server needs to connect to its ZooKeeper cluster on startup. Ensure the zookeeper.connect value in server.properties is correct.
- Permissions error – The user running Kafka may not have write access to its logs directory. Set permissions on the Kafka folder to allow read/write access.
- Cannot start Kafka more than once – Each broker needs its own port and data directory. You cannot run multiple Kafka instances on the same ports and data folders.
- Cannot access Kafka from other machines – Clients connect using the address the broker advertises, which defaults to the machine's hostname. If other machines cannot reach it, update the listeners and advertised.listeners values in server.properties.
- Blocks on startup – Kafka requires ZooKeeper to be running before it can start up. Always start ZooKeeper first, ensure it is up, then start Kafka.
Paying attention to the error messages and logs will help troubleshoot issues. The Troubleshooting Kafka guide provides more tips.
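Several of the tips above come down to values in server.properties. As a rough aid, the sketch below reads the active (uncommented) settings so you can see at a glance what zookeeper.connect, listeners, and log.dirs are set to. The function name read_properties is my own, and the parser is deliberately simplified: it handles plain key=value lines only, not the full Java properties syntax.

```python
def read_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def show_broker_settings(path: str) -> None:
    """Print the settings that most often cause startup problems."""
    with open(path, encoding="utf-8") as handle:
        props = read_properties(handle.read())
    for key in ("zookeeper.connect", "listeners", "advertised.listeners", "log.dirs"):
        # A missing key means the broker falls back to its built-in default.
        print(f"{key} = {props.get(key, '(not set, default applies)')}")
```

From the Kafka directory, show_broker_settings("config/server.properties") prints the four values, or a note that the default applies where a setting is commented out.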
Where to Go After Installing Kafka
Once you confirm that Kafka is up and running, here are some suggestions on next steps:
- Play with Kafka command line tools to create topics, produce and consume test events
- Build your own Kafka producers and consumers in your language of choice to move data
- Start developing stream processing applications with the Kafka Streams API
- Containerize your Kafka brokers with Docker for simplified configuration
- Create Kubernetes StatefulSets to manage scalable Kafka clusters
- Use Kafka Connect connectors to get data into Kafka from other systems
- Monitor Kafka metrics and utilization with tools like Prometheus and Grafana
- Tune Kafka configuration settings for performance
- Check out managed Kafka services from Confluent and AWS (such as MSK) if you don't want to self-manage
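As a starting point for the first suggestion, here is a small sketch that builds the command line for creating a test topic with the kafka-topics script that ships in the Kafka bin folder. The function name topic_create_command and the topic name are my own, and it assumes a single local broker on localhost:9092, which is why partitions and replication factor are both 1.

```python
import os


def topic_create_command(kafka_home: str, topic: str, windows: bool = False) -> list:
    """Build the argument list for creating a topic on a local broker."""
    if windows:
        script = os.path.join(kafka_home, "bin", "windows", "kafka-topics.bat")
    else:
        script = os.path.join(kafka_home, "bin", "kafka-topics.sh")
    return [
        script,
        "--create",
        "--topic", topic,
        # Assumes the broker started earlier is listening on its default port.
        "--bootstrap-server", "localhost:9092",
        "--partitions", "1",
        "--replication-factor", "1",
    ]


if __name__ == "__main__":
    # Print the command so it can be copied into a terminal.
    print(" ".join(topic_create_command(".", "test-events")))
```

To run it directly from Python instead of copying it, you could pass the list to subprocess.run with check=True while the broker is up.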
The Kafka Quickstart guide provides an excellent introduction to start using Kafka for streaming data pipelines.
For a hands-on tutorial, I recommend trying the Kafka Python examples to write your first producers and consumers.
I hope you found this guide helpful for installing Kafka and getting started streaming data. Let me know in the comments if you have any other questions!