Kubernetes Integration Guide
To configure the Vega Kubernetes integration with your AWS environment, you need to provide Vega with a Cost and Usage Report (CUR) that has split cost allocation data enabled. Contact Vega support if you need assistance changing your CUR export.
Vega CUR configuration guide · AWS Docs: Split cost allocation
The most up-to-date code, issues, and documentation can be found at the Vega Kubernetes Metrics Agent GitHub project.
1. Overview
The Vega Platform provides seamless integration for collecting detailed data and metrics from Kubernetes clusters through the Vega Kubernetes Metrics Agent. This versatile agent collects, processes, and transmits a wide range of metrics to the Vega Platform, giving users deep insights into their Kubernetes environments.
Architecture
The agent uses a collector-based architecture where:
- Each collector specializes in gathering specific metrics (nodes, pods, etc.)
- Metrics are collected in parallel with configurable concurrency
- Data is securely uploaded to S3 storage using pre-signed URLs
- Support for both metrics API and direct Kubelet collection with automatic fallback
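The collector pattern described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the agent's actual code: class and function names such as `NodeCollector` are hypothetical. It shows collectors running in parallel under a bounded worker pool, with each collector falling back to a secondary source if its primary call fails.

```python
import concurrent.futures

class Collector:
    """Base collector: try the primary source, fall back to the secondary."""
    name = "base"

    def collect_primary(self):
        raise NotImplementedError

    def collect_fallback(self):
        raise NotImplementedError

    def collect(self):
        try:
            return self.collect_primary()      # e.g. the metrics.k8s.io API
        except Exception:
            return self.collect_fallback()     # e.g. direct Kubelet collection

class NodeCollector(Collector):
    name = "nodes"

    def collect_primary(self):
        raise RuntimeError("metrics API unavailable")  # simulate an outage

    def collect_fallback(self):
        return {"source": "kubelet", "nodes": 3}

def run_collectors(collectors, max_concurrency=8):
    # Collectors run in parallel, bounded by a configurable worker pool.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(c.collect): c.name for c in collectors}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

results = run_collectors([NodeCollector()])
print(results)  # {'nodes': {'source': 'kubelet', 'nodes': 3}}
```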
Key Features
- Comprehensive Metrics Collection:
- Node Metrics: Detailed metrics for each node including capacity, allocatable resources, usage, and hardware details
- Pod Metrics: Metrics for all pods covering resource requests, limits, usage, and container status
- Cluster Metrics: Aggregated cluster-wide metrics with cloud provider detection
- Persistent Volume Metrics: Metrics for persistent volumes, claims, and storage classes
- Namespace Metrics: Resource quotas, limit ranges, and detailed usage
- Workload Metrics: Deployments, stateful sets, daemon sets, jobs, and cron jobs
- Networking Metrics: Services, ingresses, and network policies
- Orchestration Metrics: HPAs, replication controllers, and replica sets
- Multiple Collection Methods:
- Metrics API Integration: Primary collection through the metrics.k8s.io API
- Kubelet Direct Collection: Fallback mechanism for detailed node metrics
- Automatic Failover: Graceful degradation if primary collection method fails
- Advanced Configuration:
- API Rate Limiting: Configurable QPS, Burst, and Timeout settings
- Concurrency Control: Parallel collection with throttling
- Customizable Parameters: Extensive configuration options through environment variables and flags
- Operational Features:
- Health Check: Simple HTTP endpoint for liveness monitoring
- Secure Authentication: Bearer token authentication for API access
- Cloud Provider Detection: Automatic identification of AWS (EKS), Azure (AKS), GCP (GKE)
- Agent Check-in: Optional capability to report agent status
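One way cloud provider detection can work is by inspecting each node's `spec.providerID`, whose prefix identifies the cloud (`aws://`, `azure://`, `gce://`). This is an assumption about the mechanism, sketched for illustration; the agent may use additional signals.

```python
# Map the providerID prefix to the managed Kubernetes flavor named above.
def detect_provider(provider_id: str) -> str:
    prefixes = {
        "aws://": "AWS (EKS)",
        "azure://": "Azure (AKS)",
        "gce://": "GCP (GKE)",
    }
    for prefix, name in prefixes.items():
        if provider_id.startswith(prefix):
            return name
    return "unknown"

print(detect_provider("aws:///us-west-2a/i-0abc123"))  # AWS (EKS)
```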
2. Prerequisites
System Requirements
- Kubernetes cluster version 1.30 or higher
- Helm v3.0 or higher
- Outbound access to:
- Container image repository (public.ecr.aws/c0f8b9o4/vegacloud)
- api.vegacloud.io (port 443) for pre-signed URL retrieval
- vegametricsocean.s3.us-west-2.amazonaws.com (port 443) for uploading data
Access Requirements
- Kubernetes Administrator privileges
- RBAC permissions for:
- Creating deployments
- Creating cluster roles
- Reading metrics
- API access credentials (clientId and clientSecret)
Pre-Installation Checks
- Verify cluster access:
kubectl cluster-info
kubectl auth can-i create deployment
kubectl auth can-i create clusterrole
- Verify Helm installation:
helm version
- Check network connectivity to the Vega API endpoint:
curl -v https://api.vegacloud.io
3. Installation
Quick Start
- Add the Vega Helm repository:
helm repo add vegacloud https://vegacloud.github.io/charts/
helm repo update
- Verify repository addition:
helm search repo vegacloud
- Install the agent:
helm install vega-metrics vegacloud/vega-metrics-agent \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name"
4. Configuration
Helm Chart Parameters
Parameter | Description | Default | Required |
---|---|---|---|
vega.clientId | Client ID for authentication with Vega Cloud API | "" | Yes |
vega.clientSecret | Client Secret for authentication with Vega Cloud API | "" | Yes |
vega.orgSlug | Your Vega Cloud Organization slug | "" | Yes |
vega.clusterName | Unique name for your Kubernetes cluster | "" | Yes |
apiRateLimiting.qps | Kubernetes API requests per second | 100 | |
apiRateLimiting.burst | Max burst of API requests allowed | 100 | |
apiRateLimiting.timeout | API request timeout in seconds | 10 | |
maxConcurrency | Max concurrent collector operations | 8 | |
resources.requests.memory | Memory request for the agent | "2Gi" | |
resources.requests.cpu | CPU request for the agent | "500m" | |
resources.limits.memory | Memory limit for the agent | "4Gi" | |
resources.limits.cpu | CPU limit for the agent | "1000m" | |
replicaCount | Number of agent replicas to run | 1 | |
image.repository | Agent container image repository | public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent | |
image.tag | Agent container image tag | 1.1.4 | |
image.pullPolicy | Container image pull policy | Always | |
nodeSelector | Node labels for pod assignment | {} | |
affinity | Pod affinity/anti-affinity rules | {} | |
tolerations | Pod tolerations for scheduling | [] | |
env | Additional environment variables to set in container | {} | |
If you do not specify the API rate limiting or concurrency parameters, the agent uses its built-in defaults, which are optimized for most use cases. For complex configurations such as `affinity` and `tolerations`, using a custom values file (`-f values.yaml`) is recommended over multiple `--set` flags.
Vega Metrics Agent Configuration
The Vega Metrics Agent can be configured using either environment variables or command-line flags. Below are the key configuration parameters:
Environment Variable | Description | Default | Required |
---|---|---|---|
VEGA_CLIENT_ID | Client ID for authentication | | Yes |
VEGA_CLIENT_SECRET | Client secret for authentication | | Yes |
VEGA_CLUSTER_NAME | Name of the Kubernetes cluster | | Yes |
VEGA_ORG_SLUG | Your Vega Cloud Organization slug | | Yes |
VEGA_POLL_INTERVAL | Interval for polling metrics | 60m | No |
VEGA_UPLOAD_REGION | AWS region for S3 uploads | us-west-2 | No |
LOG_LEVEL | Log level (DEBUG, INFO, WARN, ERROR) | INFO | No |
VEGA_INSECURE | Use insecure connections | false | No |
VEGA_WORK_DIR | Working directory for temporary files | /tmp | No |
VEGA_COLLECTION_RETRY_LIMIT | Retry limit for metric collection | 3 | No |
VEGA_BEARER_TOKEN_PATH | Path to the bearer token file | /var/run/secrets/kubernetes.io/serviceaccount/token | No |
VEGA_NAMESPACE | Kubernetes namespace for agent deployment | vegacloud | No |
VEGA_QPS | API rate limiter for requests per second | 100 | No |
VEGA_BURST | API rate limiter burst allowance | 100 | No |
VEGA_TIMEOUT | Timeout for API requests | 10s | No |
VEGA_MAX_CONCURRENCY | Maximum number of concurrent collectors | 8 | No |
Additional parameters for local testing and debugging include:
- `AGENT_ID`: Unique identifier for the agent
- `SHOULD_AGENT_CHECK_IN`: Determines if the agent should check in with the metrics server
- `START_COLLECTION_NOW`: Start metric collection immediately
- `SAVE_LOCAL`: Save metrics locally
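A sketch of how the environment-variable configuration in the table above could be read, using the documented defaults. The parsing shown here is illustrative; the agent's actual configuration loading may differ.

```python
import os

def load_config(env=None):
    """Read agent settings from environment variables with documented defaults."""
    env = os.environ if env is None else env

    def require(key):
        value = env.get(key)
        if not value:
            raise ValueError(f"{key} is required")
        return value

    return {
        "client_id": require("VEGA_CLIENT_ID"),
        "client_secret": require("VEGA_CLIENT_SECRET"),
        "cluster_name": require("VEGA_CLUSTER_NAME"),
        "org_slug": require("VEGA_ORG_SLUG"),
        "poll_interval": env.get("VEGA_POLL_INTERVAL", "60m"),
        "upload_region": env.get("VEGA_UPLOAD_REGION", "us-west-2"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "qps": int(env.get("VEGA_QPS", "100")),
        "burst": int(env.get("VEGA_BURST", "100")),
        "max_concurrency": int(env.get("VEGA_MAX_CONCURRENCY", "8")),
        "retry_limit": int(env.get("VEGA_COLLECTION_RETRY_LIMIT", "3")),
    }

cfg = load_config({"VEGA_CLIENT_ID": "id", "VEGA_CLIENT_SECRET": "secret",
                   "VEGA_CLUSTER_NAME": "cluster", "VEGA_ORG_SLUG": "org"})
print(cfg["qps"], cfg["poll_interval"])  # 100 60m
```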
API Rate Limiting Configuration
The agent provides the following parameters to control the rate of API requests to the Kubernetes API server:
- QPS (Queries Per Second): Controls the sustainable rate of requests to the Kubernetes API. Default is 100 QPS.
- Burst: Sets the maximum burst of requests allowed beyond the QPS rate. Default is 100 requests.
- Timeout: Sets the timeout for individual API requests. Default is 10 seconds.
These settings can be adjusted based on your cluster size and API server capacity. For larger clusters or environments with high API server load, you may need to tune these values to prevent overwhelming the Kubernetes API server.
Example environment variable configuration:
VEGA_QPS=200
VEGA_BURST=300
VEGA_TIMEOUT=15s
Example command line configuration:
--qps=200 --burst=300 --timeout=15s
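Conceptually, QPS and Burst interact like a token bucket: the bucket holds at most Burst tokens and refills at QPS tokens per second, so short spikes up to Burst are allowed while the sustained rate stays at QPS. The sketch below illustrates that interaction only; the agent relies on the Kubernetes client's built-in rate limiter, not this code.

```python
class TokenBucket:
    """Minimal token bucket: capacity = burst, refill rate = qps per second."""

    def __init__(self, qps: float, burst: int):
        self.qps, self.burst = qps, burst
        self.tokens = float(burst)

    def allow(self, elapsed: float = 0.0) -> bool:
        # Refill based on elapsed seconds, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(qps=200, burst=300)
# 300 back-to-back requests succeed (the burst); the 301st is throttled.
results = [bucket.allow() for _ in range(301)]
print(results.count(True))  # 300
```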
Adjusting Concurrency
To control the maximum number of concurrent metric collection operations:
helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set maxConcurrency=12
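The effect of a maxConcurrency-style cap can be illustrated with a semaphore: no matter how many collection tasks are queued, only that many run at once. This is a conceptual sketch, not the agent's implementation.

```python
import threading
import time

def run_with_limit(num_tasks: int, max_concurrency: int) -> int:
    """Run num_tasks dummy workers, returning the observed peak concurrency."""
    sem = threading.Semaphore(max_concurrency)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        with sem:                      # at most max_concurrency workers inside
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)           # simulate one collection call
            with lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

peak = run_with_limit(num_tasks=50, max_concurrency=12)
print(peak <= 12)  # True: never more than 12 workers at once
```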
Using a Custom Values File
Create a file named `custom-values.yaml` with your desired configuration:
vega:
clientId: "YOUR_CLIENT_ID"
clientSecret: "YOUR_CLIENT_SECRET"
orgSlug: "YOUR_ORG_SLUG"
clusterName: "YOUR_CLUSTER_NAME"
apiRateLimiting:
qps: 200
burst: 300
timeout: 15
maxConcurrency: 12
resources:
requests:
memory: "4Gi"
cpu: "750m"
limits:
memory: "8Gi"
cpu: "1500m"
Then install using:
helm install vega-metrics-agent ./charts/vega-metrics-agent -f custom-values.yaml
Configuring Scheduling Constraints
You can control where the agent pods are scheduled using `nodeSelector`, `affinity`, and `tolerations`.
Using `nodeSelector`:
To schedule the agent only on nodes with specific labels (e.g., disktype=ssd):
helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set vega.clientId=YOUR_CLIENT_ID \
--set vega.clientSecret=YOUR_CLIENT_SECRET \
--set vega.orgSlug=YOUR_ORG_SLUG \
--set vega.clusterName=YOUR_CLUSTER_NAME \
--set nodeSelector.disktype=ssd
Using `affinity`:
Affinity rules provide more advanced scheduling control. For complex affinity rules, it's recommended to use a custom values file. Here's an example snippet for `custom-values.yaml`:
# custom-values.yaml
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- metrics-agent
topologyKey: "kubernetes.io/hostname"
Using `tolerations`:
To allow the agent pods to be scheduled on nodes with specific taints:
helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set vega.clientId=YOUR_CLIENT_ID \
--set vega.clientSecret=YOUR_CLIENT_SECRET \
--set vega.orgSlug=YOUR_ORG_SLUG \
--set vega.clusterName=YOUR_CLUSTER_NAME \
--set tolerations[0].key="example-key" \
--set tolerations[0].operator="Exists" \
--set tolerations[0].effect="NoSchedule"
For multiple or complex tolerations, using a custom values file is cleaner:
# custom-values.yaml
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
- key: "key2"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 3600
5. Verification & Monitoring
Installation Verification
- Check pod status:
kubectl get pods -n vegacloud -l app=metrics-agent
- View agent logs:
kubectl logs -f deployment/metrics-agent -n vegacloud
- Verify metrics collection:
kubectl top nodes
kubectl top pods -A
- Check agent connectivity:
kubectl exec -it deployment/metrics-agent -n vegacloud -- curl -v https://api.vegacloud.io/health
Health Monitoring
- Monitor agent health:
kubectl describe pod -l app=metrics-agent -n vegacloud
- Check resource usage:
kubectl top pod -l app=metrics-agent -n vegacloud
- View recent events:
kubectl get events -n vegacloud --field-selector involvedObject.name=vega-metrics
6. Maintenance
Upgrades
Option 1: Upgrade via Helm Chart
# Update repository
helm repo update
# Upgrade with existing values
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--reuse-values
Option 2: Update Container Image
- Check available container versions:
docker pull public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent
docker images public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent
- Update the deployment with new image:
kubectl set image deployment/metrics-agent \
vega-metrics-agent=public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent:latest \
-n vegacloud
- Monitor the rolling update:
kubectl rollout status deployment/metrics-agent -n vegacloud
Note: Replace `:latest` with a specific version tag for better version control.
Version-Specific Helm Upgrade
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--version 1.2.3 \
--namespace vegacloud \
--reuse-values
Backup and Recovery
- Export current configuration:
helm get values vega-metrics -n vegacloud > vega-metrics-backup.yaml
- Export secrets:
kubectl get secret -n vegacloud vega-metrics-agent-secret -o yaml > vega-metrics-agent-secret-backup.yaml
Uninstallation
helm uninstall vega-metrics -n vegacloud
kubectl delete namespace vegacloud # Optional: removes the namespace
7. Troubleshooting
Common Issues
- Pod in CrashLoopBackOff
- Check logs:
kubectl logs -f deployment/metrics-agent -n vegacloud
- Verify credentials:
kubectl get secret vega-metrics-agent-secret -n vegacloud
- Check resource limits:
kubectl describe pod -l app=metrics-agent -n vegacloud
- Connection Issues
- Verify network policies allow outbound traffic
- Check if proxy configuration is needed
- Ensure correct orgSlug is configured
- Verify DNS resolution:
kubectl run test-dns --image=busybox:1.28 --rm -it --restart=Never -- nslookup api.vegacloud.io
- Authentication Failures
- Verify clientId and clientSecret are correct
- Check secret creation:
kubectl describe secret vega-metrics-agent-secret -n vegacloud
- Validate API access:
curl -v -H "Authorization: Bearer $TOKEN" https://api.vegacloud.io/health
Debugging Steps
- Enable debug logging:
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--set env.LOG_LEVEL=DEBUG \
--namespace vegacloud
- Check RBAC permissions:
kubectl auth can-i --list --as system:serviceaccount:vegacloud:vega-metrics-agent
- Verify network connectivity:
kubectl run test-net --image=busybox:1.28 --rm -it --restart=Never -- wget -q -O- https://api.vegacloud.io/health
8. Reference
Resource Recommendations
Based on the number of nodes in your cluster, here are the suggested CPU and memory requirements for the agent:
Nodes | CPU Request | CPU Limit | Mem Request | Mem Limit |
---|---|---|---|---|
< 50 | 200m | 500m | 512Mi | 1Gi |
50–100 | 300m | 700m | 1Gi | 2Gi |
100–250 | 500m | 1000m | 2Gi | 4Gi |
250–500 | 750m | 1500m | 4Gi | 8Gi |
500–1000 | 1000m | 2000m | 8Gi | 12Gi |
1000+ | 1500m | 3000m | 12Gi | 16Gi |
To apply these recommendations:
helm install vega-metrics vegacloud/vega-metrics-agent \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name" \
--set resources.requests.cpu="500m" \
--set resources.requests.memory="2Gi" \
--set resources.limits.cpu="1000m" \
--set resources.limits.memory="4Gi"
- CPU requests should be set to allow guaranteed minimum CPU resources. The metrics agent doesn't need high CPU most of the time but benefits from having a consistent baseline.
- CPU limits should be set moderately higher than requests to allow for metric collection spikes. Setting CPU limits too low can cause throttling that may interrupt metrics collection.
- Memory requests should be set to accommodate the baseline memory footprint plus overhead for metrics processing.
- Memory limits should be set higher than requests to prevent OOM (Out of Memory) kills during peak collection periods.
These values should be adjusted based on your specific monitoring needs, collection frequency, and the total number of metrics being collected. For clusters with high pod density or custom metric collection, increase these values accordingly.
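The sizing table above can be turned into a small lookup helper when you template values files for many clusters. The values are copied from the table; the choice of where a boundary node count (e.g. exactly 100) falls is an assumption, since the table's ranges overlap at their endpoints.

```python
def recommended_resources(node_count: int) -> dict:
    """Return the table's recommended requests/limits for a given cluster size."""
    tiers = [
        (50,   {"cpu_req": "200m",  "cpu_lim": "500m",  "mem_req": "512Mi", "mem_lim": "1Gi"}),
        (100,  {"cpu_req": "300m",  "cpu_lim": "700m",  "mem_req": "1Gi",   "mem_lim": "2Gi"}),
        (250,  {"cpu_req": "500m",  "cpu_lim": "1000m", "mem_req": "2Gi",   "mem_lim": "4Gi"}),
        (500,  {"cpu_req": "750m",  "cpu_lim": "1500m", "mem_req": "4Gi",   "mem_lim": "8Gi"}),
        (1000, {"cpu_req": "1000m", "cpu_lim": "2000m", "mem_req": "8Gi",   "mem_lim": "12Gi"}),
    ]
    for upper_bound, rec in tiers:
        if node_count < upper_bound:
            return rec
    # 1000+ nodes
    return {"cpu_req": "1500m", "cpu_lim": "3000m", "mem_req": "12Gi", "mem_lim": "16Gi"}

print(recommended_resources(120)["mem_req"])  # 2Gi
```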
Common Commands
Operational Commands
# Check agent status
kubectl get pods -n vegacloud -l app=metrics-agent
# View agent configuration
helm get values vega-metrics -n vegacloud
# Force agent restart
kubectl rollout restart deployment vega-metrics -n vegacloud
# Scale agent
kubectl scale deployment vega-metrics -n vegacloud --replicas=2
Debugging Commands
# Check agent permissions
kubectl auth can-i --list --as system:serviceaccount:vegacloud:vega-metrics-agent
# View agent events
kubectl get events -n vegacloud --sort-by='.lastTimestamp'
# Check resource usage
kubectl top pod -l app=metrics-agent -n vegacloud
# View detailed pod information
kubectl describe pod -l app=metrics-agent -n vegacloud