Discovering Vector

TL/DR

Logstash + Rust = Vector ?

We can actually describe Vector as a Logstash built in Rust, with all the language’s advantages for free (high performance, cross-compilation), but the flaws of a still young product (few modules, no known references). Note that the documentation seems well done, and the first existing modules allow you to manage a lot of simple monitoring use-case. As we can agree that Logstash is not the best product in the world, let’s hope that Vector will find its place in the community in the coming months. In addition, Vector natively offers modules equivalent to the Logstash’s one, which means that the migration will not be complicated!

And so what ? We have time, right ? #LockedDown

Vector could be defined as an high-performance observability data router that makes transforming, collecting, and sending events (logs & metrics) easy.

Concept

Basically, it’s an ETL based on the following concepts:

Source (aka. E / Extract)

Reading raw data from the source. For example, we could read log into a file, listen a Kafka topic or get StatsD metrics

Transform (aka. T / Transform)

Transform raw data, or complete data stream. For example, we could filter entries or parse a log using a regular expression

Sink (aka. L / Load)

Destination for events. Each module’s transmission method is dictated by the downstream service it is interacting with (ie. individual events, bulk or stream). For example, we could save raw data into Amazon S3, indexing them into Elasticsearch or expose to Prometheus

Features

Fast

Built in Rust, Vector is fast and memory-efficient, all without runtime or garbage collector

One only tool, from source to destination

Vector is designed to be used by everyone, whatever the context, by offering several deployment strategies:

Daemon

In this case, it serves as serves as an light-weight agent by running in the background, in its own process, for collecting all data for that host.

Sidecar

Here, it serves also an an agent, but we will have one process by service.

Service

In ths case, Vector is a separate service designed to receive data from an upstream source and fan-out to one or more destinations.

By using and/or combining theses strategies, we can define several architecture topologies to collect our data.

Distributed

In this topology, each Vector instance will directly send data to downstream services. It’s the simplest topology, and it will easily scale with our architecture. However, it can impact local performance or lead to data losses.

Centralized

Here, each agent will send data to a dedicated centralized Vector instance, which will responsible to do the most expensive operations. So, it’s more efficient for client nodes, but a dedicated centralized service as a SPOF which could lead to data losses.

Stream based

Variant of the previous topology, in which we will add a broker upstream of the centralized service in order to remove the SPOF. This topology is the most scalable and reliable, but also the most complex and expensive.

Easy deployment

Built with Rust, Vector cross-compiles to a single static binary without any runtime.

Well, but does it really works ?

I will be inspired by a previous blog post : An ELK stack from scratch, with Docker

Proof of concept architecture

In this case, we will use:

Elasticsearch, search engine which provide full text search & analytics,
Kibana, which provide an UI for exploring data, and create interactive dashboards
Vector, as central service, to transform events and sending them to Elasticsearch,
Kafka, as an upstream broker
Vector, as an agent, to ingest raw source data and sending them to Kafka

So here, we are under a Stream based topology

Services and interactions are described in a docker-compose.yml file:

      
        version: "3.7"
      
        services:
      
          zookeeper:
      
            image: confluentinc/cp-zookeeper:5.4.0
      
            hostname: zookeeper
      
            container_name: zookeeper
      
            ports:
      
              - "2181:2181"
      
            environment:
      
              ZOOKEEPER_CLIENT_PORT: 2181
      
              ZOOKEEPER_TICK_TIME: 2000
      
          kafka:
      
            image: confluentinc/cp-enterprise-kafka:5.4.0
      
            hostname: kafka
      
            container_name: kafka
      
            depends_on:
      
              - zookeeper
      
            ports:
      
              - "29092:29092"
      
              - "9092:9092"
      
            environment:
      
              KAFKA_BROKER_ID: 1
      
              KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      
              KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      
              KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      
              KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      
              KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      
          elasticsearch:
      
            image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
      
            container_name: elastic
      
            environment:
      
              - ES_JAVA_OPTS=-Xms1g -Xmx1g
      
              - discovery.type=single-node
      
              - network.host=_site_, _local_
      
            ulimits:
      
              memlock:
      
                soft: -1
      
                hard: -1
      
            ports:
      
              - 9200:9200
      
              - 9300:9300
      
          vector:
      
            image: timberio/vector:0.8.0-alpine
      
            container_name: vector
      
            ports:
      
              - 8888:8888
      
            volumes: 
      
              - $PWD/vector.toml:/etc/vector/vector.toml:ro
      
            depends_on:
      
              - kafka
      
              - elasticsearch
      
          kibana:
      
            image: docker.elastic.co/kibana/kibana:7.6.2
      
            container_name: kibana
      
            ports:
      
              - 5601:5601
      
            depends_on:
      
              - elasticsearch
      
          webapp:
      
            build: ./webapp/
      
            container_name: webapp
      
            ports:
      
              - 80:80
      
              - 9999:9999

view raw docker-compose.yml hosted with ❤ by GitHub

The Vector central service is configured as below:

Reading events from Kafka
JSON Parsing from events send by Vector agent
Grok Parsing (same as Logstash Grok format) from raw log line
Indexing into Elasticsearch

      
        # Set global options
      
        data_dir = "/var/lib/vector"
      
        [sources.from_broker]
      
          type = "kafka"
      
          bootstrap_servers = "kafka:29092"
      
          group_id = "vector-consumer"
      
          topics = ["events"]
      
        [transforms.json_parser]
      
          type   = "json_parser"
      
          inputs = ["from_broker"]
      
          drop_field = true
      
          field = "message"
      
        [transforms.log_parser]
      
          type   = "grok_parser"
      
          inputs = ["json_parser"]
      
          pattern = '%{IPORHOST:client} - %{USERNAME:user} \[%{HTTPDATE:timestamp}\] \"%{WORD:verb} %{NOTSPACE:path} HTTP/%{NUMBER}\" %{INT:status} %{NUMBER:bytes} \"%{DATA:referer}\" \"%{DATA:user_agent}\"'
      
          types.status = "int"
      
          types.bytes = "int"
      
          types.timestamp = "timestamp|%d/%b/%Y:%H:%M:%S %z"
      
        [sinks.to_indexer]
      
          type = "elasticsearch"
      
          inputs = ["log_parser"]
      
          healthcheck = false
      
          host = "http://elasticsearch:9200"
      
        [[tests]]
      
          name = "test_log_parser"
      
          [[tests.inputs]]
      
            insert_at = "json_parser"
      
            type = "raw"
      
            value = '172.21.0.1 - - [28/Feb/2020:12:38:46 +0000] "GET /path/to/a HTTP/1.1" 200 46459 "http://localhost/path/to/b" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36" "-"'
      
          [[tests.outputs]]
      
            extract_from = "log_parser"
      
            [[tests.outputs.conditions]]
      
              type = "check_fields"
      
              "client.equals" = "172.21.0.1"
      
              "user.equals" = "-"
      
              "timestamp.equals"= "2020-02-28T12:38:46Z"
      
              "verb.equals" = "GET"
      
              "path.equals" = "/path/to/a"
      
              "status.equals" = 200
      
              "bytes.equals" = 46459
      
              "referer.equals" = "http://localhost/path/to/b"
      
              "user_agent.equals" = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"

view raw vector.toml hosted with ❤ by GitHub

Fun fact, we can unit testing our configuration, as we can see in the [[tests]] section.

Note that each configuration step is based on at least one previous step.

On our webapp side, we will have an Vector agent configured as below:

Reading logs from file
Sending them to Kafka

      
        # Set global options
      
        data_dir = "/var/lib/vector"
      
        [sources.from_file]
      
          type = "file"
      
          include = ["/var/log/nginx/*.log"]
      
        [sinks.to_broker]
      
          type = "kafka"
      
          inputs = ["from_file"]
      
          bootstrap_servers = "kafka:29092"
      
          topic = "events"
      
          encoding = "json"

view raw agent_vector.toml hosted with ❤ by GitHub

Complete projet is available on github discovering_vector

Now, I can start all my services with docker-compose:

 docker-compose build
 docker-compose up

Then, you should be able to access the web app (http://localhost:80, in my case) Web Application example (source: https://github.com/sbilly/joli-admin)

After few minutes browsing, you can go to Kibana UI. (in my case, http://localhost:5601), then click on Management tab, then Kibana > Index Patterns

Adding vector-* index pattern

Here we go ! A vector-YYYY.MM.DD index should be created with my application logs. From there, I will be able to create my searchs, visualizations, dashboards or canvas in Kibana, and use all theses informations.

To conclude, it’s actually quite easy to use Vector as a substitute for Logstash/Beats in an Elastic stack, and it works. Remains to see if performance gains are real, and if the project can resist in the future and become a real alternative for the community. Until then, even very young, this project is full of promises and good ideas (unit tests, multi-topologies, …), and so deserves that we take a look!

Published on April 15, 2020

← A la découverte de Vector

	version: "3.7"

	services:

	zookeeper:
	image: confluentinc/cp-zookeeper:5.4.0
	hostname: zookeeper
	container_name: zookeeper
	ports:
	- "2181:2181"
	environment:
	ZOOKEEPER_CLIENT_PORT: 2181
	ZOOKEEPER_TICK_TIME: 2000

	kafka:
	image: confluentinc/cp-enterprise-kafka:5.4.0
	hostname: kafka
	container_name: kafka
	depends_on:
	- zookeeper
	ports:
	- "29092:29092"
	- "9092:9092"
	environment:
	KAFKA_BROKER_ID: 1
	KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
	KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
	KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
	KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
	KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0

	elasticsearch:
	image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
	container_name: elastic
	environment:
	- ES_JAVA_OPTS=-Xms1g -Xmx1g
	- discovery.type=single-node
	- network.host=_site_, _local_
	ulimits:
	memlock:
	soft: -1
	hard: -1
	ports:
	- 9200:9200
	- 9300:9300

	vector:
	image: timberio/vector:0.8.0-alpine
	container_name: vector
	ports:
	- 8888:8888
	volumes:
	- $PWD/vector.toml:/etc/vector/vector.toml:ro
	depends_on:
	- kafka
	- elasticsearch

	kibana:
	image: docker.elastic.co/kibana/kibana:7.6.2
	container_name: kibana
	ports:
	- 5601:5601
	depends_on:
	- elasticsearch

	webapp:
	build: ./webapp/
	container_name: webapp
	ports:
	- 80:80
	- 9999:9999

	# Set global options
	data_dir = "/var/lib/vector"

	[sources.from_broker]
	type = "kafka"
	bootstrap_servers = "kafka:29092"
	group_id = "vector-consumer"
	topics = ["events"]

	[transforms.json_parser]
	type = "json_parser"
	inputs = ["from_broker"]
	drop_field = true
	field = "message"

	[transforms.log_parser]
	type = "grok_parser"
	inputs = ["json_parser"]
	pattern = '%{IPORHOST:client} - %{USERNAME:user} \[%{HTTPDATE:timestamp}\] \"%{WORD:verb} %{NOTSPACE:path} HTTP/%{NUMBER}\" %{INT:status} %{NUMBER:bytes} \"%{DATA:referer}\" \"%{DATA:user_agent}\"'
	types.status = "int"
	types.bytes = "int"
	types.timestamp = "timestamp\|%d/%b/%Y:%H:%M:%S %z"

	[sinks.to_indexer]
	type = "elasticsearch"
	inputs = ["log_parser"]
	healthcheck = false
	host = "http://elasticsearch:9200"

	[[tests]]
	name = "test_log_parser"

	[[tests.inputs]]
	insert_at = "json_parser"
	type = "raw"
	value = '172.21.0.1 - - [28/Feb/2020:12:38:46 +0000] "GET /path/to/a HTTP/1.1" 200 46459 "http://localhost/path/to/b" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36" "-"'

	[[tests.outputs]]
	extract_from = "log_parser"

	[[tests.outputs.conditions]]
	type = "check_fields"
	"client.equals" = "172.21.0.1"
	"user.equals" = "-"
	"timestamp.equals"= "2020-02-28T12:38:46Z"
	"verb.equals" = "GET"
	"path.equals" = "/path/to/a"
	"status.equals" = 200
	"bytes.equals" = 46459
	"referer.equals" = "http://localhost/path/to/b"
	"user_agent.equals" = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"

	# Set global options
	data_dir = "/var/lib/vector"

	[sources.from_file]
	type = "file"
	include = ["/var/log/nginx/*.log"]

	[sinks.to_broker]
	type = "kafka"
	inputs = ["from_file"]
	bootstrap_servers = "kafka:29092"
	topic = "events"
	encoding = "json"