Introduction
Kafka originated at LinkedIn as a message queue for storing online and offline logs.
Many existing message queue services provide strong delivery guarantees and assume messages are consumed immediately by default (a poor fit for offline processing). Strong delivery guarantees are not essential for LinkedIn's logs, so Kafka trades some reliability for higher performance; combined with a distributed cluster design that lets messages accumulate in the system, this allows Kafka to support both offline and online log processing.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
Core concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Core API usage
- The Producer API allows an application to publish a stream of records to one or more Kafka topics.
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
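As a concrete sketch of the first two of these APIs, the snippet below publishes one record and then consumes it with the standard Java client; the topic name "test", the localhost broker address, and the group id "demo-group" are illustrative assumptions.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer API: publish a (key, value) record to topic "test"
Properties p = new Properties();
p.put("bootstrap.servers", "localhost:9092");
p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
    producer.send(new ProducerRecord<>("test", "key", "value"));
}

// Consumer API: subscribe to "test" and process the records
Properties c = new Properties();
c.put("bootstrap.servers", "localhost:9092");
c.put("group.id", "demo-group");
c.put("auto.offset.reset", "earliest");   // start from the beginning of the topic
c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
    consumer.subscribe(Collections.singletonList("test"));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
    for (ConsumerRecord<String, String> r : records)
        System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
}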
In Kafka the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. Clients exist for many languages (the focus here is the Java client).
For other languages, Kafka follows the redis/memcached model, which seems to work better at supporting a rich ecosystem of high-quality clients.
Quick start
Kafka uses ZooKeeper to coordinate the cluster, so start a ZooKeeper server before the Kafka broker, as shown below.
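From the standard quickstart (the property files below are the defaults shipped with the Kafka distribution):
// Start a single-node ZooKeeper instance, then the Kafka server
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties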
// Create a topic named "test" with one partition and a replication factor of 1 (a single copy, no redundancy)
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
// Run the producer and then type a few messages into the console to send to the server.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
// Kafka also has a command line consumer that will dump out messages to standard output.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is essentially a pipe model: the producer sends messages to the server, and the consumer prints the consumed messages to the screen.
For many systems, instead of writing custom integration code you can use Kafka Connect to import or export data.
It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. In this quickstart we’ll see how to run Kafka Connect with simple connectors that import data from a file to a Kafka topic and export data from a Kafka topic to a file.
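From the standard quickstart, a standalone Connect worker can be started with a file source and a file sink connector; the three property files below ship with the Kafka distribution.
// Run Kafka Connect in standalone mode with the sample file connectors
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties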
Use Kafka Streams to process data
Kafka Streams is a client library for building mission-critical real-time applications and microservices, where the input and/or output data is stored in Kafka clusters. Kafka Streams combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology to make these applications highly scalable, elastic, fault-tolerant, distributed, and much more.
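A minimal sketch of a Streams application, modeled on the Pipe example from the Kafka documentation: it forwards every record from an input topic to an output topic. The topic names and application id are illustrative.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-pipe");      // also used as the consumer group id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output"); // copy every input record to the output topic

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();                                                     // runs until streams.close()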
Design motivation
- It would have to have high throughput to support high-volume event streams such as real-time log aggregation.
- It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.
- It would also have to handle low-latency delivery to cover more traditional messaging use cases.
- We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.
- Finally, in cases where the stream is fed into other data systems for serving, we knew the system would have to be able to guarantee fault-tolerance in the presence of machine failures.
Supporting these uses led us to a design with a number of unique elements, more akin to a database log than a traditional messaging system. We will outline some elements of the design in the following sections.
Persistence
Kafka relies heavily on the filesystem for storing and caching messages. The crucial point is that random writes are much slower than sequential writes.
Furthermore, we are building on top of the JVM, and anyone who has spent any time with Java memory usage knows two things:
- The memory overhead of objects is very high, often doubling the size of the data stored (or worse).
- Java garbage collection becomes increasingly fiddly and slow as the in-heap data increases.
This suggests a design which is very simple: rather than maintain as much as possible in-memory and flush it all out to the filesystem in a panic when we run out of space, we invert that. All data is immediately written to a persistent log on the filesystem without necessarily flushing to disk. In effect this just means that it is transferred into the kernel’s pagecache.
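As a toy illustration of this inversion (not Kafka code; the file name is made up): an append to a file returns as soon as the bytes reach the kernel's pagecache, and nothing reaches physical disk unless a flush is forced.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

try (FileChannel log = FileChannel.open(Paths.get("demo.log"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
    // write() completes once the bytes are in the OS pagecache; no disk flush is forced here
    log.write(ByteBuffer.wrap("a log record\n".getBytes(StandardCharsets.UTF_8)));
    // log.force(true) would flush to physical media; Kafka leaves this to the OS and its flush policy
}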
// A minimal producer example using the legacy (pre-0.9) producer API from these notes
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
Properties props = new Properties();                      // java.util.Properties
props.put("metadata.broker.list", "localhost:9092");      // broker address, as in the quickstart
props.put("serializer.class", "kafka.serializer.StringEncoder");
ProducerConfig config = new ProducerConfig(props);
Producer<String, String> producer = new Producer<String, String>(config);  // create the producer
KeyedMessage<String, String> data = new KeyedMessage<String, String>("test", "a message");
producer.send(data);