Kubernetes Integration Guide

warning

To configure the Vega Kubernetes Integration with your AWS environment, you must provide Vega with a Cost and Usage Report (CUR) that has Split cost allocation data enabled. Contact Vega support if you need assistance updating your CUR export.

Vega CUR configuration guide · AWS Docs: Split cost allocation

tip

The most up-to-date code, issues, and documentation can be found at the Vega Kubernetes Metrics Agent GitHub project.

1. Overview

The Vega Platform provides seamless integration for collecting detailed data and metrics from Kubernetes clusters through the Vega Kubernetes Metrics Agent. This versatile agent collects, processes, and transmits a wide range of metrics to the Vega Platform, giving users deep insights into their Kubernetes environments.

Architecture

The agent uses a collector-based architecture where:

  • Each collector specializes in gathering specific metrics (nodes, pods, etc.)
  • Metrics are collected in parallel with configurable concurrency
  • Data is securely uploaded to S3 storage using pre-signed URLs
  • Support for both metrics API and direct Kubelet collection with automatic fallback

Key Features

  • Comprehensive Metrics Collection:

    • Node Metrics: Detailed metrics for each node including capacity, allocatable resources, usage, and hardware details
    • Pod Metrics: Metrics for all pods covering resource requests, limits, usage, and container status
    • Cluster Metrics: Aggregated cluster-wide metrics with cloud provider detection
    • Persistent Volume Metrics: Metrics for persistent volumes, claims, and storage classes
    • Namespace Metrics: Resource quotas, limit ranges, and detailed usage
    • Workload Metrics: Deployments, stateful sets, daemon sets, jobs, and cron jobs
    • Networking Metrics: Services, ingresses, and network policies
    • Orchestration Metrics: HPAs, replication controllers, and replica sets
  • Multiple Collection Methods:

    • Metrics API Integration: Primary collection through the metrics.k8s.io API
    • Kubelet Direct Collection: Fallback mechanism for detailed node metrics
    • Automatic Failover: Graceful degradation if primary collection method fails
  • Advanced Configuration:

    • API Rate Limiting: Configurable QPS, Burst, and Timeout settings
    • Concurrency Control: Parallel collection with throttling
    • Customizable Parameters: Extensive configuration options through environment variables and flags
  • Operational Features:

    • Health Check: Simple HTTP endpoint for liveness monitoring
    • Secure Authentication: Bearer token authentication for API access
    • Cloud Provider Detection: Automatic identification of AWS (EKS), Azure (AKS), GCP (GKE)
    • Agent Check-in: Optional capability to report agent status

2. Prerequisites

System Requirements

  • Kubernetes cluster version 1.30 or higher
  • Helm v3.0 or higher
  • Outbound access to:
    • Container image repository (public.ecr.aws/c0f8b9o4/vegacloud)
    • api.vegacloud.io (port 443) for pre-signed URL retrieval
    • vegametricsocean.s3.us-west-2.amazonaws.com (port 443) for uploading data

Access Requirements

  • Kubernetes Administrator privileges
  • RBAC permissions for:
    • Creating deployments
    • Creating cluster roles
    • Reading metrics
  • API access credentials (clientId and clientSecret)

Pre-Installation Checks

  1. Verify cluster access:
kubectl cluster-info
kubectl auth can-i create deployment
kubectl auth can-i create clusterrole
  2. Verify Helm installation:
helm version
  3. Check general network connectivity:
curl -v https://google.com
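
To specifically confirm that the Vega endpoints listed under System Requirements are reachable, you can also run the checks below (a quick sketch; adjust for any proxy in your environment). A completed TLS handshake is sufficient here; an HTTP error status such as 403 still indicates the endpoint is reachable.

curl -v https://api.vegacloud.io
curl -v https://vegametricsocean.s3.us-west-2.amazonaws.com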

3. Installation

Quick Start

  1. Add the Vega Helm repository:
helm repo add vegacloud https://vegacloud.github.io/charts/
helm repo update
  2. Verify repository addition:
helm search repo vegacloud
  3. Install the agent:
helm install vega-metrics vegacloud/vega-metrics-agent \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name"
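
The command above installs into whichever namespace your kubectl context currently targets. The rest of this guide assumes the release runs in the vegacloud namespace (the agent's default namespace, per VEGA_NAMESPACE below). A namespaced install is sketched here; check whether your chart values already manage the namespace before using it:

helm install vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--create-namespace \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name"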

4. Configuration

Helm Chart Parameters

| Parameter | Description | Default | Required |
|---|---|---|---|
| vega.clientId | Client ID for authentication with Vega Cloud API | "" | Yes |
| vega.clientSecret | Client Secret for authentication with Vega Cloud API | "" | Yes |
| vega.orgSlug | Your Vega Cloud Organization slug | "" | Yes |
| vega.clusterName | Unique name for your Kubernetes cluster | "" | Yes |
| apiRateLimiting.qps | Kubernetes API requests per second | 100 | No |
| apiRateLimiting.burst | Max burst of API requests allowed | 100 | No |
| apiRateLimiting.timeout | API request timeout in seconds | 10 | No |
| maxConcurrency | Max concurrent collector operations | 8 | No |
| resources.requests.memory | Memory request for the agent | "2Gi" | No |
| resources.requests.cpu | CPU request for the agent | "500m" | No |
| resources.limits.memory | Memory limit for the agent | "4Gi" | No |
| resources.limits.cpu | CPU limit for the agent | "1000m" | No |
| replicaCount | Number of agent replicas to run | 1 | No |
| image.repository | Agent container image repository | public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent | No |
| image.tag | Agent container image tag | 1.1.4 | No |
| image.pullPolicy | Container image pull policy | Always | No |
| nodeSelector | Node labels for pod assignment | {} | No |
| affinity | Pod affinity/anti-affinity rules | {} | No |
| tolerations | Pod tolerations for scheduling | [] | No |
| env | Additional environment variables to set in container | {} | No |

note

If you do not specify the API rate limiting or concurrency parameters, the agent will use its built-in defaults, which are optimized for most use cases. For complex configurations like affinity and tolerations, using a custom values file (-f values.yaml) is recommended over multiple --set flags.

Vega Metrics Agent Configuration

The Vega Metrics Agent can be configured using either environment variables or command-line flags. Below are the key configuration parameters:

| Environment Variable | Description | Default | Required |
|---|---|---|---|
| VEGA_CLIENT_ID | Client ID for authentication | | Yes |
| VEGA_CLIENT_SECRET | Client secret for authentication | | Yes |
| VEGA_CLUSTER_NAME | Name of the Kubernetes cluster | | Yes |
| VEGA_ORG_SLUG | Your Vega Cloud Organization slug | | Yes |
| VEGA_POLL_INTERVAL | Interval for polling metrics | 60m | No |
| VEGA_UPLOAD_REGION | AWS region for S3 uploads | us-west-2 | No |
| LOG_LEVEL | Log level (DEBUG, INFO, WARN, ERROR) | INFO | No |
| VEGA_INSECURE | Use insecure connections | false | No |
| VEGA_WORK_DIR | Working directory for temporary files | /tmp | No |
| VEGA_COLLECTION_RETRY_LIMIT | Retry limit for metric collection | 3 | No |
| VEGA_BEARER_TOKEN_PATH | Path to the bearer token file | /var/run/secrets/kubernetes.io/serviceaccount/token | No |
| VEGA_NAMESPACE | Kubernetes namespace for agent deployment | vegacloud | No |
| VEGA_QPS | API rate limiter for requests per second | 100 | No |
| VEGA_BURST | API rate limiter burst allowance | 100 | No |
| VEGA_TIMEOUT | Timeout for API requests | 10s | No |
| VEGA_MAX_CONCURRENCY | Maximum number of concurrent collectors | 8 | No |
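
Most of these variables are derived from the Helm values above, but additional ones can be passed through the chart's env parameter (see the Helm table). As a sketch, assuming the chart injects env entries directly into the container spec (the debug-logging example in the Troubleshooting section uses the same mechanism):

helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--reuse-values \
--set env.LOG_LEVEL=DEBUG \
--set env.VEGA_POLL_INTERVAL=30m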

Additional parameters for local testing and debugging include:

  • AGENT_ID: Unique identifier for the agent
  • SHOULD_AGENT_CHECK_IN: Determines if the agent should check in with the metrics server
  • START_COLLECTION_NOW: Start metric collection immediately
  • SAVE_LOCAL: Save metrics locally

API Rate Limiting Configuration

The agent provides the following parameters to control the rate of API requests to the Kubernetes API server:

  • QPS (Queries Per Second): Controls the sustainable rate of requests to the Kubernetes API. Default is 100 QPS.
  • Burst: Sets the maximum burst of requests allowed beyond the QPS rate. Default is 100 requests.
  • Timeout: Sets the timeout for individual API requests. Default is 10 seconds.

These settings can be adjusted based on your cluster size and API server capacity. For larger clusters or environments with high API server load, you may need to tune these values to prevent overwhelming the Kubernetes API server.

Example environment variable configuration:

VEGA_QPS=200
VEGA_BURST=300
VEGA_TIMEOUT=15s

Example command line configuration:

--qps=200 --burst=300 --timeout=15s

Adjusting Concurrency

To control the maximum number of concurrent metric collection operations:

helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set maxConcurrency=12

Using a Custom Values File

Create a file named custom-values.yaml with your desired configuration:

vega:
  clientId: "YOUR_CLIENT_ID"
  clientSecret: "YOUR_CLIENT_SECRET"
  orgSlug: "YOUR_ORG_SLUG"
  clusterName: "YOUR_CLUSTER_NAME"

apiRateLimiting:
  qps: 200
  burst: 300
  timeout: 15

maxConcurrency: 12

resources:
  requests:
    memory: "4Gi"
    cpu: "750m"
  limits:
    memory: "8Gi"
    cpu: "1500m"

Then install using:

helm install vega-metrics-agent ./charts/vega-metrics-agent -f custom-values.yaml

Configuring Scheduling Constraints

You can control where the agent pods are scheduled using nodeSelector, affinity, and tolerations.

Using nodeSelector:

To schedule the agent only on nodes with specific labels (e.g., disktype=ssd):

helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set vega.clientId=YOUR_CLIENT_ID \
--set vega.clientSecret=YOUR_CLIENT_SECRET \
--set vega.orgSlug=YOUR_ORG_SLUG \
--set vega.clusterName=YOUR_CLUSTER_NAME \
--set nodeSelector.disktype=ssd

Using affinity:

Affinity rules provide more advanced scheduling control. For complex affinity rules, it's recommended to use a custom values file. Here's an example snippet for custom-values.yaml:

# custom-values.yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - metrics-agent
        topologyKey: "kubernetes.io/hostname"

Using tolerations:

To allow the agent pods to be scheduled on nodes with specific taints:

helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set vega.clientId=YOUR_CLIENT_ID \
--set vega.clientSecret=YOUR_CLIENT_SECRET \
--set vega.orgSlug=YOUR_ORG_SLUG \
--set vega.clusterName=YOUR_CLUSTER_NAME \
--set tolerations[0].key="example-key" \
--set tolerations[0].operator="Exists" \
--set tolerations[0].effect="NoSchedule"

For multiple or complex tolerations, using a custom values file is cleaner:

# custom-values.yaml
tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  - key: "key2"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600

5. Verification & Monitoring

Installation Verification

  1. Check pod status:
kubectl get pods -n vegacloud -l app=metrics-agent
  2. View agent logs:
kubectl logs -f deployment/metrics-agent -n vegacloud
  3. Verify metrics collection:
kubectl top nodes
kubectl top pods -A
  4. Check agent connectivity:
kubectl exec -it deployment/metrics-agent -n vegacloud -- curl -v https://api.vegacloud.io/health

Health Monitoring

  1. Monitor agent health:
kubectl describe pod -l app=metrics-agent -n vegacloud
  2. Check resource usage:
kubectl top pod -l app=metrics-agent -n vegacloud
  3. View recent events:
kubectl get events -n vegacloud --field-selector involvedObject.name=vega-metrics
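
The agent also exposes a simple HTTP health endpoint that backs its liveness probe (see Key Features). To inspect how the probe is configured on your deployment, you can query the pod template (a sketch; the deployment name may differ in your install):

kubectl get deployment metrics-agent -n vegacloud \
-o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'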

6. Maintenance

Upgrades

Option 1: Upgrade via Helm Chart

# Update repository
helm repo update

# Upgrade with existing values
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--reuse-values

Option 2: Update Container Image

  1. Check available container versions:
docker pull public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent
docker images public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent
  2. Update the deployment with the new image:
kubectl set image deployment/metrics-agent \
vega-metrics-agent=public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent:latest \
-n vegacloud
  3. Monitor the rolling update:
kubectl rollout status deployment/metrics-agent -n vegacloud

Note: Replace :latest with a specific version tag for better version control.

Version-Specific Helm Upgrade

helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--version 1.2.3 \
--namespace vegacloud \
--reuse-values

Backup and Recovery

  1. Export current configuration:
helm get values vega-metrics -n vegacloud > vega-metrics-backup.yaml
  2. Export secrets:
kubectl get secret -n vegacloud vega-metrics-agent-secret -o yaml > vega-metrics-agent-secret-backup.yaml
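
To restore after an accidental uninstall, you can reinstall the chart from the saved values (a sketch; both backup files contain your credentials, so store them securely):

helm upgrade --install vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--create-namespace \
-f vega-metrics-backup.yaml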

Uninstallation

helm uninstall vega-metrics -n vegacloud
kubectl delete namespace vegacloud # Optional: removes the namespace
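
To confirm that the release and its resources have been removed:

helm list -n vegacloud
kubectl get all -n vegacloud

Both commands should report nothing left once the uninstall is complete.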

7. Troubleshooting

Common Issues

  1. Pod in CrashLoopBackOff

    • Check logs: kubectl logs -f deployment/metrics-agent -n vegacloud
    • Verify credentials: kubectl get secret vega-metrics-agent-secret -n vegacloud
    • Check resource limits: kubectl describe pod -l app=metrics-agent -n vegacloud
  2. Connection Issues

    • Verify network policies allow outbound traffic
    • Check if proxy configuration is needed
    • Ensure correct orgSlug is configured
    • Verify DNS resolution: kubectl run test-dns --image=busybox:1.28 --rm -it --restart=Never -- nslookup api.vegacloud.io
  3. Authentication Failures

    • Verify clientId and clientSecret are correct
    • Check secret creation: kubectl describe secret vega-metrics-agent-secret -n vegacloud
    • Validate API access: curl -v -H "Authorization: Bearer $TOKEN" https://api.vegacloud.io/health

Debugging Steps

  1. Enable debug logging:
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--reuse-values \
--set env.LOG_LEVEL=DEBUG \
--namespace vegacloud
  2. Check RBAC permissions:
kubectl auth can-i --list --as system:serviceaccount:vegacloud:vega-metrics-agent
  3. Verify network connectivity:
kubectl run test-net --image=busybox:1.28 --rm -it --restart=Never -- wget -q -O- https://api.vegacloud.io/health
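
You can also confirm which configuration the running agent actually received by dumping its environment (a sketch; this assumes the image ships a standard env binary, and the output will include your client secret, so treat it as sensitive):

kubectl exec deployment/metrics-agent -n vegacloud -- env | grep -E 'VEGA|LOG_LEVEL'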

8. Reference

Resource Recommendations

Based on the number of nodes in your cluster, here are the suggested CPU and memory requirements for the agent:

| Nodes | CPU Request | CPU Limit | Mem Request | Mem Limit |
|---|---|---|---|---|
| < 50 | 200m | 500m | 512Mi | 1Gi |
| 50–100 | 300m | 700m | 1Gi | 2Gi |
| 100–250 | 500m | 1000m | 2Gi | 4Gi |
| 250–500 | 750m | 1500m | 4Gi | 8Gi |
| 500–1000 | 1000m | 2000m | 8Gi | 12Gi |
| 1000+ | 1500m | 3000m | 12Gi | 16Gi |

To apply these recommendations (this example corresponds to the 100–250 node tier):

helm install vega-metrics vegacloud/vega-metrics-agent \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name" \
--set resources.requests.cpu="500m" \
--set resources.requests.memory="2Gi" \
--set resources.limits.cpu="1000m" \
--set resources.limits.memory="4Gi"

Tips on resource allocation:
  • CPU requests should be set to allow guaranteed minimum CPU resources. The metrics agent doesn't need high CPU most of the time but benefits from having a consistent baseline.
  • CPU limits should be set moderately higher than requests to allow for metric collection spikes. Setting CPU limits too low can cause throttling that may interrupt metrics collection.
  • Memory requests should be set to accommodate the baseline memory footprint plus overhead for metrics processing.
  • Memory limits should be set higher than requests to prevent OOM (Out of Memory) kills during peak collection periods.

These values should be adjusted based on your specific monitoring needs, collection frequency, and the total number of metrics being collected. For clusters with high pod density or custom metric collection, increase these values accordingly.
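
If you prefer the values-file approach recommended earlier, the same sizing can be applied to an existing release; for example, the 250–500 node tier from the table above (a sketch, using an arbitrary file name):

cat <<'EOF' > sizing-values.yaml
resources:
  requests:
    cpu: "750m"
    memory: "4Gi"
  limits:
    cpu: "1500m"
    memory: "8Gi"
EOF

helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--reuse-values \
-f sizing-values.yaml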

Common Commands

Operational Commands

# Check agent status
kubectl get pods -n vegacloud -l app=metrics-agent

# View agent configuration
helm get values vega-metrics -n vegacloud

# Force agent restart
kubectl rollout restart deployment vega-metrics -n vegacloud

# Scale agent
kubectl scale deployment vega-metrics -n vegacloud --replicas=2

Debugging Commands

# Check agent permissions
kubectl auth can-i --list --as system:serviceaccount:vegacloud:vega-metrics-agent

# View agent events
kubectl get events -n vegacloud --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pod -l app=metrics-agent -n vegacloud

# View detailed pod information
kubectl describe pod -l app=metrics-agent -n vegacloud