SQR-068: Sasquatch: beyond the EFD

  • Angelo Fausti

Latest Revision: 2023-03-08

1 Abstract

Sasquatch is a service for recording, displaying, and alerting on Rubin Observatory’s telemetry data and scalar metrics.

It is a unification of the Engineering and Facilities Database (EFD) [2] and SQuaSH [1] under the same deployment.

The new features include a REST API for sending data to Sasquatch and two-way data replication between the Summit and USDF. This way, observatory telemetry produced at the Summit and metrics computed at the USDF or Summit are always available locally.

Sasquatch can be easily extended to record other time-series data such as camera diagnostic metrics, rapid analysis metrics, and scheduler events.

With this third generation, we took the opportunity to rebrand the service as Sasquatch. It should be understood as the service that manages the EFD and other time-series databases.

Sasquatch is currently deployed at the Summit, USDF, the Tucson test stand, and the Base test stand through Phalanx.

2 Overview

Sasquatch is based on InfluxDB, an open-source time-series database optimized for efficient storage and analysis of time-series data, and on Apache Kafka, which is used as a write-ahead log to InfluxDB and for data replication between sites.

We are taking the opportunity to migrate to InfluxDB OSS 2.x, which uses Flux for querying and data analysis in addition to InfluxQL. This version also has a new task engine to process time-series data with Flux, and a new Python client.

Apache Kafka is now deployed with Strimzi, a Kubernetes operator that manages the Kafka resources. Strimzi also includes resources to manage Mirror Maker 2, used for data replication, which is an improvement compared to the previous deployment (see [3]). In addition to the Strimzi components, we also deploy the Confluent REST proxy, used in Sasquatch for connecting HTTP-based clients with Kafka.

Figure 1 shows a diagram of the Sasquatch architecture highlighting the new functionalities: two-way replication between the Summit and USDF; multiple InfluxDB databases; Flux Tasks; and a REST API based on the Confluent REST proxy.

Figure 1: Sasquatch architecture overview (image: _images/sasquatch_overview.svg).

3 Sending data to Sasquatch

There are two main mechanisms for sending data to Sasquatch. One is based on the SAL Kafka Producers (ts_salkafka) and the other is based on the Confluent REST proxy.

ts_salkafka is currently used with Sasquatch at the Summit and test stands to forward DDS messages to Kafka. Once DDS is replaced by Kafka, the CSCs will write directly to Kafka and ts_salkafka will no longer be necessary [4].

3.1 Confluent REST proxy

The Confluent REST proxy provides a REST interface for connecting HTTP-based clients with Kafka.

With the REST proxy, a client can produce messages to or consume messages from Kafka topics using HTTP requests, which simplifies the integration with Sasquatch. In particular, the REST proxy integrates well with the Schema Registry, so this mechanism supports sending Avro messages with a schema.
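As a rough sketch of what such a request could look like, the following Python builds (but does not send) a produce request in the REST proxy v2 Avro format. The base URL, topic name, and schema are hypothetical placeholders chosen for illustration; they are not Sasquatch's actual configuration.

```python
import json
import urllib.request

# Hypothetical values for illustration only; the real Sasquatch URL,
# topic names, and schemas are not specified in this note.
REST_PROXY_URL = "https://sasquatch.example.org/rest-proxy"
TOPIC = "example.metrics.demo"

# An Avro record schema describing one scalar metric measurement.
VALUE_SCHEMA = {
    "type": "record",
    "name": "demo",
    "namespace": "example.metrics",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "value", "type": "double"},
    ],
}


def build_produce_request(base_url, topic, schema, records):
    """Build a REST proxy v2 request that produces Avro records to a topic."""
    body = {
        # The v2 API expects the schema as a JSON-encoded string.
        "value_schema": json.dumps(schema),
        "records": [{"value": record} for record in records],
    }
    return urllib.request.Request(
        url=f"{base_url}/topics/{topic}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/vnd.kafka.avro.v2+json",
            "Accept": "application/vnd.kafka.v2+json",
        },
        method="POST",
    )


request = build_produce_request(
    REST_PROXY_URL,
    TOPIC,
    VALUE_SCHEMA,
    [{"timestamp": 1678233600000, "value": 0.42}],
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Building the request separately from sending it keeps the payload easy to inspect and test before any authentication or network details are decided.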

Once the data lands in Kafka, an InfluxDB Sink connector is responsible for consuming the Kafka topic and writing the data to a bucket in InfluxDB.
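Conceptually, the sink turns each Kafka record into an InfluxDB point. The sketch below illustrates that mapping by rendering a record as a line of InfluxDB line protocol. It is not the connector's code: the function names, measurement, tag, and fields are hypothetical, and only the common escaping rules are covered.

```python
def _escape(text, chars):
    """Backslash-escape each of the given special characters."""
    for char in chars:
        text = text.replace(char, "\\" + char)
    return text


def _format_field_value(value):
    """Format a Python value as a line protocol field value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted, with backslashes and quotes escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, fields, timestamp_ns, tags=None):
    """Render one record as a line of InfluxDB line protocol."""
    head = _escape(measurement, ", ")
    for key, value in sorted((tags or {}).items()):
        head += f",{_escape(key, ',= ')}={_escape(str(value), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field_value(value)}"
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# A hypothetical record, as it might arrive from a Kafka topic.
line = to_line_protocol(
    "example.telemetry.demo",
    {"temperature": 12.5, "enabled": True, "count": 3},
    1678233600000000000,
    tags={"site": "summit"},
)
print(line)
```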

In addition, everything that is sent to Kafka can be replicated to other sites, and we can also persist data into other formats like Parquet using off-the-shelf Kafka connectors.

4 Two-way replication between Summit and USDF

In the current EFD implementation, data replication between the Summit and USDF is done through the Kafka Mirror Maker 2 connector (MM2) [3].

The EFD replication service allows for one-way replication (or active/standby replication) from the Summit to the USDF. We have measured sub-second latency for high-throughput topics in the MTM1M3 subsystem in this setup.

In Sasquatch, two-way replication (or active/active replication) is now required. With two-way replication, metrics computed at USDF (e.g. from Prompt Processing) and sent to the USDF instance of Sasquatch can be replicated to the Summit.

In addition to the instance of MM2 configured at USDF to replicate Observatory telemetry, events and metrics from the Summit, Sasquatch adds a second instance of MM2 at the Summit.

The Kafka Topics to be replicated are listed in the MM2 configuration on each Kafka cluster.

Two-way replication requires Kafka Topic renaming. Usually, in this scenario, the Kafka Topic at the destination cluster is prefixed with the name of the source cluster. That helps to identify its origin and avoid replicating it back to the source cluster.

Consequently, topic schemas at the destination cluster need to be renamed, adding some complexity compared to the one-way replication scenario.
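The renaming can be sketched as follows, assuming MM2's default replication policy (source cluster alias plus a "." separator) and the Schema Registry's default TopicNameStrategy for subject names. The cluster aliases and the topic name are hypothetical.

```python
SEPARATOR = "."  # separator used by MM2's default replication policy

# Hypothetical cluster aliases for the two sites.
CLUSTER_ALIASES = ["summit", "usdf"]


def remote_topic(source_alias, topic):
    """Name of a replicated topic at the destination cluster."""
    return f"{source_alias}{SEPARATOR}{topic}"


def is_replica(topic, cluster_aliases):
    """True if the topic name carries a known source-cluster prefix.

    Such topics originated elsewhere and should not be replicated again,
    which is what prevents replication loops between the two sites.
    """
    return any(topic.startswith(alias + SEPARATOR) for alias in cluster_aliases)


def value_subject(topic):
    """Schema Registry subject for a topic's value schema.

    Assumes the default TopicNameStrategy ("<topic>-value").
    """
    return f"{topic}-value"


local = "lsst.example.metric"          # hypothetical topic produced at USDF
replica = remote_topic("usdf", local)  # its name once replicated to the Summit

print(replica, value_subject(replica))
```

Under these assumptions, the schema registered under the subject for the local topic at the source has to be available under the subject for the prefixed topic at the destination, which is the extra step one-way replication without renaming does not need.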

5 Storing telemetry, metrics and events into multiple databases

In InfluxDB OSS 2.x, a database or bucket is a named location where time-series data is stored.

By using multiple buckets we can specify different retention policies, time precision, access control and backup strategies. InfluxDB OSS 2.x provides a buckets API to programmatically interact with buckets.
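A minimal sketch of creating buckets with different retention periods through the InfluxDB 2.x HTTP API is shown below. The requests are only built, not sent. The server URL, token, organization ID, bucket names, and retention periods are all hypothetical and do not reflect decisions made for Sasquatch.

```python
import json
import urllib.request

SECONDS_PER_DAY = 24 * 60 * 60


def build_create_bucket_request(base_url, token, org_id, name, retention_days):
    """Build a POST /api/v2/buckets request with an expiry retention rule."""
    body = {
        "orgID": org_id,
        "name": name,
        "retentionRules": [
            {"type": "expire", "everySeconds": retention_days * SECONDS_PER_DAY}
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/api/v2/buckets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical buckets and retention periods, for illustration only.
RETENTION_DAYS = {"telemetry": 30, "events": 365}

bucket_requests = [
    build_create_bucket_request(
        "https://influxdb.example.org", "example-token", "0123456789abcdef",
        name, days,
    )
    for name, days in RETENTION_DAYS.items()
]
```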

In the original EFD implementation, telemetry and events from the Observatory are recorded into a single InfluxDB database. In Sasquatch, when migrating to InfluxDB OSS 2.x, we are planning on storing telemetry and events into separate buckets. In particular, because the time difference between events is not regular, they need to be stored with higher time precision than telemetry and metrics to avoid overlapping data.
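The overlap problem can be illustrated with a small sketch. In InfluxDB, two points in the same series with the same timestamp are treated as one point, so truncating timestamps to a coarser precision can silently merge events. The timestamps below are made up for the example.

```python
# Nanoseconds per unit for each write precision.
PRECISION_DIVISOR_NS = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}


def truncate(timestamp_ns, precision):
    """Truncate a nanosecond timestamp to the given write precision."""
    divisor = PRECISION_DIVISOR_NS[precision]
    return timestamp_ns // divisor * divisor


def stored_points(timestamps_ns, precision):
    """Number of distinct points left in one series after truncation.

    Points that share a timestamp within a series overwrite each other,
    so only distinct truncated timestamps survive.
    """
    return len({truncate(ts, precision) for ts in timestamps_ns})


# Three hypothetical events in the same series, emitted 0.4 ms apart.
events_ns = [
    1_678_233_600_000_100_000,
    1_678_233_600_000_500_000,
    1_678_233_600_000_900_000,
]

for precision in ("s", "ms", "us", "ns"):
    print(precision, stored_points(events_ns, precision))
```

In this example the three events fall within the same millisecond, so second or millisecond precision would keep only one of them, while microsecond or nanosecond precision keeps all three.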

6 Flux Tasks

InfluxDB OSS 2.x provides a new task engine that replaces Continuous Queries and Kapacitor used in InfluxDB OSS 1.x.

An InfluxDB task is a scheduled Flux script that takes an input data stream, transforms or analyzes it, and performs some action.

In most cases, the transformed data can be stored into a new InfluxDB bucket, or sent to other destinations using Flux output functions. An example is sending a notification to Slack, or triggering some computation using the Flux http.post() function.

InfluxDB OSS 2.x also provides a tasks API to programmatically interact with tasks.
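To make the idea of a task concrete, the sketch below renders a Flux script for a simple downsampling task from Python. The task name, bucket names, measurement, and intervals are hypothetical; the resulting script is the kind of text that could be registered through the tasks API or the InfluxDB UI.

```python
def render_downsampling_task(
    name, source_bucket, target_bucket, measurement, every="1h", window="5m"
):
    """Render a Flux task that averages a measurement into a second bucket.

    The task runs once per `every` interval, reads the data written since
    the previous run, averages it over `window`-long windows, and writes
    the result to `target_bucket`.
    """
    return f'''option task = {{name: "{name}", every: {every}}}

from(bucket: "{source_bucket}")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "{measurement}")
    |> aggregateWindow(every: {window}, fn: mean, createEmpty: false)
    |> to(bucket: "{target_bucket}")
'''


# Hypothetical names, for illustration only.
flux_script = render_downsampling_task(
    name="downsample_demo",
    source_bucket="telemetry",
    target_bucket="telemetry_5m",
    measurement="example.telemetry.demo",
)
print(flux_script)
```

Replacing the final `to()` call with a function such as `http.post()` would turn the same pattern into a notification or trigger, as described above.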

7 Implementation phases

This section describes the Sasquatch implementation phases.

7.1 Phase 1 - Replace EFD deployments

  1. Add Sasquatch to Phalanx.

  2. Enable Chronograf authentication through Gafaelfawr.

  3. Replace Confluent Kafka with Strimzi Kafka.

  4. Automate Strimzi Kafka image builds adding the InfluxDB Sink, Mirror Maker 2, and S3 connectors.

  5. Deploy Sasquatch at IDF Dev.

  6. Deploy Sasquatch at TTS (Pillan cluster).

  7. Add csc and kafka-producer subcharts to Sasquatch for end-to-end testing.

  8. Add SASL configuration to ts_salkafka.

  9. Test connectors and integration with CSCs.

  10. Integrate news feeds with rsp_broadcast.

  11. Implement external listeners in Strimzi Kafka.

  12. Migrate Sasquatch monitoring to monitoring.lsst.codes.

  13. Deploy Sasquatch at USDF (SLAC).

  14. Migrate EFD data from the Summit to the Sasquatch instance at USDF.

  15. Deploy Sasquatch at the Summit (Yagan cluster).

  16. Migrate EFD data from the efd-temp-k3s.cp.lsst.org server to Sasquatch at the Summit.

  17. Implement data replication between Sasquatch at the Summit and USDF with Strimzi Kafka.

  18. Deploy Sasquatch at the BTS (Manke cluster).

7.2 Phase 2 - Replace the SQuaSH deployment

  1. Implement Confluent REST proxy as a replacement for the SQuaSH API in Sasquatch.

  2. Implement a Butler data store for Sasquatch.

  3. Implement two-way replication in Sasquatch.

  4. Migrate SQuaSH data to Sasquatch at USDF.

7.2.1 Related goals

  1. Remove squash and influxdb-demo clusters on Google.

7.3 Phase 3 - Migration to InfluxDB OSS 2.x

  1. Add InfluxDB OSS 2.x to Sasquatch deployment.

  2. Connect Chronograf with InfluxDB OSS 2.x (requires DBRP mapping).

  3. Replace InfluxDB Sink connector with Telegraf Kafka Consumer so it works with InfluxDB OSS 2.x.

  4. Migrate EFD database to 2.x format (TTS, BTS, Summit, USDF).

  5. Exercise InfluxDB OSS 2.x backup/restore tools.

  6. Migrate Kapacitor alerts to Flux tasks.

  7. Migrate Chronograf 1.x annotations (_chronograf database) to InfluxDB 2.x.

  8. Upgrade EFD client to use the InfluxDB OSS 2.x Python client.

References

[1]

[SQR-009]. Angelo Fausti. The SQuaSH metrics dashboard. 2020. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-009.lsst.io/

[2]

[SQR-034]. Angelo Fausti. EFD Operations. 2021. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-034.lsst.io/

[3]

[SQR-050]. Angelo Fausti. The EFD replication service. 2021. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-050.lsst.io/

[4]

[TSTN-033]. Russell Owen. Exploring Kafka for Telescope Control. 2022. Vera C. Rubin Observatory. URL: https://tstn-033.lsst.io/