Cluster Monitoring with the CockroachDB Operator


Despite CockroachDB's various built-in safeguards against failure, it is critical to actively monitor the overall health and performance of a cluster running in production and to create alerting rules that promptly send notifications when there are events that require investigation or intervention.

Note:

The CockroachDB operator is in Preview.

Configure Prometheus

Every node of a CockroachDB cluster exports granular timeseries metrics formatted for easy integration with Prometheus, an open source tool for storing, aggregating, and querying timeseries data. This section shows you how to orchestrate Prometheus as part of your Kubernetes cluster and pull these metrics into Prometheus for external monitoring.
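
For a quick look at the raw exposition format, you can port-forward to one of the CockroachDB pods and fetch this endpoint directly. This is a sketch: substitute one of your pod names, and use http:// instead if the cluster runs in insecure mode.

# Forward the HTTP port of one CockroachDB pod (placeholder name).
kubectl port-forward pod/<cockroachdb-pod> 8080

# In a second shell, fetch the Prometheus-formatted metrics.
curl --insecure https://localhost:8080/_status/vars | head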

This guidance is based on CoreOS's Prometheus Operator, which allows a Prometheus instance to be managed using built-in Kubernetes concepts.

Note:

If you're on Hosted GKE, before starting, make sure the email address associated with your Google Cloud account is part of the cluster-admin RBAC group, as shown in Deploy CockroachDB with Kubernetes.

  1. From your local workstation, edit the cockroachdb service to add the prometheus: cockroachdb label:

    kubectl label svc cockroachdb prometheus=cockroachdb
    
    service/cockroachdb labeled
    

    This ensures that only the cockroachdb service (not the cockroachdb-public service) is monitored by a Prometheus job.
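
    To confirm that the label was applied, you can list the service with its labels:

    kubectl get svc cockroachdb --show-labels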

  2. Determine the latest version of CoreOS's Prometheus Operator and run the following to download and apply the latest bundle.yaml definition file:

    Note:

    Be sure to specify the latest CoreOS Prometheus Operator version in the following command, in place of this example's use of version v0.82.0.

    kubectl apply \
    -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.82.0/bundle.yaml \
    --server-side
    
    customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com serverside-applied
    customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com serverside-applied
    customresourcedefinition.apiextensions.k8s.io/probes.monitoring.coreos.com serverside-applied
    customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com serverside-applied
    customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com serverside-applied
    customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com serverside-applied
    customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com serverside-applied
    clusterrolebinding.rbac.authorization.k8s.io/prometheus-operator serverside-applied
    clusterrole.rbac.authorization.k8s.io/prometheus-operator serverside-applied
    deployment.apps/prometheus-operator serverside-applied
    serviceaccount/prometheus-operator serverside-applied
    service/prometheus-operator serverside-applied
    
  3. Confirm that the prometheus-operator has started:

    kubectl get deploy prometheus-operator
    
    NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
    prometheus-operator   1/1     1            1           27s
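
    If the deployment is not yet available, you can block until it finishes rolling out:

    kubectl rollout status deploy prometheus-operator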
    
  4. Download our Prometheus manifest:

    curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/prometheus.yaml
    
  5. Apply the Prometheus manifest. This creates the various objects necessary to run a Prometheus instance:

    kubectl apply -f prometheus.yaml
    
    serviceaccount/prometheus created
    clusterrole.rbac.authorization.k8s.io/prometheus created
    clusterrolebinding.rbac.authorization.k8s.io/prometheus created
    servicemonitor.monitoring.coreos.com/cockroachdb created
    prometheus.monitoring.coreos.com/cockroachdb created
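
    The Prometheus Operator then creates the prometheus-cockroachdb-0 pod that the next step port-forwards to; confirm that it is running before continuing:

    kubectl get pod prometheus-cockroachdb-0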
    
  6. Access the Prometheus UI locally and verify that CockroachDB is feeding data into Prometheus:

    1. Port-forward from your local machine to the pod running Prometheus:

      kubectl port-forward prometheus-cockroachdb-0 9090
      
    2. Go to http://localhost:9090 in your browser.

    3. To verify that each CockroachDB node is connected to Prometheus, go to Status > Targets. Each CockroachDB node should be listed as a target.

    4. To verify that data is being collected, go to Graph, enter the sys_uptime metric in the field, click Execute, and then click the Graph tab. A graph of the metric should be displayed.

    Note:

    Prometheus auto-completes CockroachDB time series metrics for you, but if you want to see a full listing, with descriptions, port-forward as described in Access the DB Console and then point your browser to http://localhost:8080/_status/vars.

For more details on using the Prometheus UI, see the official Prometheus documentation.

Configure Alertmanager

Active monitoring helps you spot problems early, but it is also essential to send notifications when there are events that require investigation or intervention. This section shows you how to use Alertmanager and CockroachDB's starter alerting rules to do this.

  1. Download our alertmanager-config.yaml configuration file:

    curl -O https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/alertmanager-config.yaml
    
  2. Edit the alertmanager-config.yaml file to specify the desired receivers for notifications. Initially, the file contains a placeholder webhook.
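
    As an illustration only, a Slack receiver might look like the following sketch; the webhook URL and channel name are hypothetical placeholders:

    route:
      receiver: slack-alerts
    receivers:
      - name: slack-alerts
        slack_configs:
          - api_url: https://hooks.slack.com/services/<your-webhook-id>
            channel: '#cockroachdb-alerts'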

  3. Add this configuration to the Kubernetes cluster as a secret, renaming it to alertmanager.yaml and labeling it to make it easier to find:

    kubectl create secret generic alertmanager-cockroachdb \
    --from-file=alertmanager.yaml=alertmanager-config.yaml
    
    secret/alertmanager-cockroachdb created
    
    kubectl label secret alertmanager-cockroachdb app=cockroachdb
    
    secret/alertmanager-cockroachdb labeled
    
    Warning:

    The name of the secret, alertmanager-cockroachdb, must match the name used in the alertmanager.yaml file. If they differ, the Alertmanager instance will start without configuration, and nothing will happen.
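
    To double-check the configuration that Alertmanager will load, you can decode the stored secret:

    kubectl get secret alertmanager-cockroachdb -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode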

  4. Use our alertmanager.yaml file to create the various objects necessary to run an Alertmanager instance, including a ClusterIP service so that Prometheus can forward alerts:

    kubectl apply \
    -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/alertmanager.yaml
    
    alertmanager.monitoring.coreos.com/cockroachdb created
    service/alertmanager-cockroachdb created
    
  5. Verify that Alertmanager is running:

    1. Port-forward from your local machine to the pod running Alertmanager:

      kubectl port-forward alertmanager-cockroachdb-0 9093
      
    2. Go to http://localhost:9093 in your browser. The Alertmanager UI should load.

  6. Ensure that the Alertmanagers are visible to Prometheus by opening http://localhost:9090/status. The Alertmanager endpoint should be listed.

  7. Add CockroachDB's starter alerting rules:

    kubectl apply \
    -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/prometheus/alert-rules.yaml
    
    prometheusrule.monitoring.coreos.com/prometheus-cockroachdb-rules created
    
  8. Ensure that the rules are visible to Prometheus by opening http://localhost:9090/rules. The CockroachDB alerting rules should be listed.

  9. Verify that the TestAlertManager example alert is firing by opening http://localhost:9090/alerts.

  10. To remove the example alert:

    1. Use the kubectl edit command to open the rules for editing:

      kubectl edit prometheusrules prometheus-cockroachdb-rules
      
    2. Remove the dummy.rules block and save the file:

      - name: rules/dummy.rules
        rules:
        - alert: TestAlertManager
          expr: vector(1)
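
      If you later want to alert on conditions of your own, you can add a rule block to the same file. The following is a hypothetical sketch; the job label value depends on how your ServiceMonitor names the scrape job:

      - name: rules/custom.rules
        rules:
        - alert: InstanceDown
          expr: up{job="cockroachdb"} == 0
          for: 5m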
      

Monitor the operator

The CockroachDB operator automatically exposes Prometheus-style metrics that you can monitor to observe its operations.

Metrics can be collected from the operator via HTTP requests (port 8080 by default) against the /metrics endpoint. The response will describe the current node metrics, for example:

...
# HELP node_decommissioning Whether a CockroachDB node is decommissioning.
# TYPE node_decommissioning gauge
node_decommissioning{node="cockroachdb-nvq2l"} 0
node_decommissioning{node="cockroachdb-pmp45"} 0
node_decommissioning{node="cockroachdb-q6784"} 0
node_decommissioning{node="cockroachdb-r4wz8"} 0
...
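
To inspect these metrics directly, you can port-forward to the operator and query the endpoint. This is a sketch: the deployment name and namespace are placeholders that depend on how the operator was installed.

# Substitute your operator deployment name and namespace.
kubectl port-forward deploy/<operator-deployment> 8080 -n <operator-namespace>

# In a second shell, fetch the operator metrics.
curl http://localhost:8080/metrics | grep node_decommissioning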

Configure logging

You can use the operator to configure the CockroachDB logging system. This allows you to output logs to configurable log sinks such as file or network logging destinations.

The logging configuration is defined in a ConfigMap object, using a key named logs.yaml. For example:

apiVersion: v1
data:
  logs.yaml: |
    sinks:
      file-groups:
        dev:
          channels: DEV
          filter: WARNING
kind: ConfigMap
metadata:
  name: logconfig
  namespace: cockroach-ns

The above configuration overrides the default logging configuration and saves low-level logs from the DEV channel, at severity WARNING and above, to disk for troubleshooting.

The ConfigMap name must match the value of cockroachdb.crdbCluster.loggingConfigMapName in the values file used to deploy the cluster:

cockroachdb:
  crdbCluster:
    loggingConfigMapName: logconfig

By default, the operator also modifies the default logging configuration with the following:

sinks:
  stderr:
    channels: {INFO: "HEALTH, OPS", WARNING: "STORAGE, DEV"}
    redact: true

This directs logging events in the listed channels to the standard error stream, with sensitive information redacted.

Example: Configuring a troubleshooting log file on pods

In this example, CockroachDB has already been deployed on a Kubernetes cluster. Override the default logging configuration to output DEV logs to a cockroach-dev.log file.

  1. Create a ConfigMap named logconfig. Note that namespace is set to the cockroach-ns namespace:

    apiVersion: v1
    data:
      logs.yaml: |
        sinks:
          file-groups:
            dev:
              channels: DEV
              filter: WARNING
    kind: ConfigMap
    metadata:
      name: logconfig
      namespace: cockroach-ns
    

    For simplicity, also name the YAML file logconfig.yaml.

    Note:

    The ConfigMap key is not related to the ConfigMap name or YAML filename, and must be named logs.yaml.

    This configuration outputs DEV logs with severity WARNING and above to a cockroach-dev.log file on each pod.

  2. Apply the ConfigMap to the cluster:

    kubectl apply -f logconfig.yaml
    
    configmap/logconfig created
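
    Optionally, verify the stored configuration:

    kubectl get configmap logconfig -n cockroach-ns -o yaml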
    
  3. Set loggingConfigMapName in the values file to the name of the ConfigMap:

    cockroachdb:
      crdbCluster:
        loggingConfigMapName: logconfig
    
  4. Apply the new settings to the cluster:

    helm upgrade --reuse-values $CRDBCLUSTER ./cockroachdb-parent/charts/cockroachdb --values ./cockroachdb-parent/charts/cockroachdb/values.yaml -n $NAMESPACE
    

    The changes will be rolled out to each pod.
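
    To watch the pods restart with the new configuration, run the following (press Ctrl+C to stop watching):

    kubectl get pods -n $NAMESPACE -w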

  5. List the log files available on a pod:

    kubectl exec cockroachdb-2 -- ls cockroach-data/logs
    
    cockroach-dev.cockroachdb-2.unknownuser.2022-05-02T19_03_03Z.000001.log
    cockroach-dev.log
    cockroach-health.cockroachdb-2.unknownuser.2022-05-02T18_53_01Z.000001.log
    cockroach-health.log
    cockroach-pebble.cockroachdb-2.unknownuser.2022-05-02T18_52_48Z.000001.log
    cockroach-pebble.log
    cockroach-stderr.cockroachdb-2.unknownuser.2022-05-02T18_52_48Z.000001.log
    cockroach-stderr.cockroachdb-2.unknownuser.2022-05-02T19_03_03Z.000001.log
    cockroach-stderr.cockroachdb-2.unknownuser.2022-05-02T20_04_03Z.000001.log
    cockroach-stderr.log
    cockroach.cockroachdb-2.unknownuser.2022-05-02T18_52_48Z.000001.log
    cockroach.log
    ...
    
  6. View a specific log file:

    kubectl exec cockroachdb-2 -- cat cockroach-data/logs/cockroach-dev.log
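
    To follow the log as new events are written, tail it instead:

    kubectl exec cockroachdb-2 -- tail -f cockroach-data/logs/cockroach-dev.log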
    