Kubernetes Integration Guide
To configure the Vega Kubernetes integration with your AWS environment, you need to provide Vega with a Cost and Usage Report (CUR) that has split cost allocation data enabled. Contact Vega support if you need assistance changing your CUR export.
Vega CUR configuration guide · AWS Docs: Split cost allocation
The most up-to-date code, issues, and documentation can be found at the Vega Kubernetes Metrics Agent GitHub project.
1. Overview
The Vega Platform provides seamless integration for collecting detailed data and metrics from Kubernetes clusters through the Vega Kubernetes Metrics Agent. This versatile agent collects, processes, and transmits a wide range of metrics to the Vega Platform, giving users deep insights into their Kubernetes environments.
Architecture
The agent uses a collector-based architecture where:
- Each collector specializes in gathering specific metrics (nodes, pods, etc.)
- Metrics are collected in parallel with configurable concurrency
- Data is securely uploaded to S3 storage using pre-signed URLs
- Support for both metrics API and direct Kubelet collection with automatic fallback
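The collector pattern described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the agent's actual code: class and function names such as `NodeCollector` are hypothetical. It shows collectors running in parallel under a bounded worker pool, with each collector falling back to a secondary source if its primary call fails.

```python
import concurrent.futures

class Collector:
    """Base collector: try the primary source, fall back to the secondary."""
    name = "base"

    def collect_primary(self):
        raise NotImplementedError

    def collect_fallback(self):
        raise NotImplementedError

    def collect(self):
        try:
            return self.collect_primary()      # e.g. the metrics.k8s.io API
        except Exception:
            return self.collect_fallback()     # e.g. direct Kubelet collection

class NodeCollector(Collector):
    name = "nodes"

    def collect_primary(self):
        raise RuntimeError("metrics API unavailable")  # simulate an outage

    def collect_fallback(self):
        return {"source": "kubelet", "nodes": 3}

def run_collectors(collectors, max_concurrency=8):
    # Collectors run in parallel, bounded by a configurable worker pool.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(c.collect): c.name for c in collectors}
        return {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

results = run_collectors([NodeCollector()])
print(results)  # {'nodes': {'source': 'kubelet', 'nodes': 3}}
```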
Key Features
- Comprehensive Metrics Collection:
- Node Metrics: Detailed metrics for each node including capacity, allocatable resources, usage, and hardware details
- Pod Metrics: Metrics for all pods covering resource requests, limits, usage, and container status
- Cluster Metrics: Aggregated cluster-wide metrics with cloud provider detection
- Persistent Volume Metrics: Metrics for persistent volumes, claims, and storage classes
- Namespace Metrics: Resource quotas, limit ranges, and detailed usage
- Workload Metrics: Deployments, stateful sets, daemon sets, jobs, and cron jobs
- Networking Metrics: Services, ingresses, and network policies
- Orchestration Metrics: HPAs, replication controllers, and replica sets
- Multiple Collection Methods:
- Metrics API Integration: Primary collection through the metrics.k8s.io API
- Kubelet Direct Collection: Fallback mechanism for detailed node metrics
- Automatic Failover: Graceful degradation if primary collection method fails
- Advanced Configuration:
- API Rate Limiting: Configurable QPS, Burst, and Timeout settings
- Concurrency Control: Parallel collection with throttling
- Customizable Parameters: Extensive configuration options through environment variables and flags
- Operational Features:
- Health Check: Simple HTTP endpoint for liveness monitoring
- Secure Authentication: Bearer token authentication for API access
- Cloud Provider Detection: Automatic identification of AWS (EKS), Azure (AKS), GCP (GKE)
- Agent Check-in: Optional capability to report agent status
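One way cloud provider detection can work is by inspecting each node's `spec.providerID`, whose prefix identifies the cloud (`aws://`, `azure://`, `gce://`). This is an assumption about the mechanism, sketched for illustration; the agent may use additional signals.

```python
# Map the providerID prefix to the managed Kubernetes flavor named above.
def detect_provider(provider_id: str) -> str:
    prefixes = {
        "aws://": "AWS (EKS)",
        "azure://": "Azure (AKS)",
        "gce://": "GCP (GKE)",
    }
    for prefix, name in prefixes.items():
        if provider_id.startswith(prefix):
            return name
    return "unknown"

print(detect_provider("aws:///us-west-2a/i-0abc123"))  # AWS (EKS)
```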
2. Prerequisites
System Requirements
- Kubernetes cluster version 1.30 or higher
- Helm v3.0 or higher
- Outbound access to:
- Container image repository (public.ecr.aws/c0f8b9o4/vegacloud)
- api.vegacloud.io (port 443) for pre-signed URL retrieval
- vegametricsocean.s3.us-west-2.amazonaws.com (port 443) for uploading data
Access Requirements
- Kubernetes Administrator privileges
- RBAC permissions for:
- Creating deployments
- Creating cluster roles
- Reading metrics
- API access credentials (clientId and clientSecret)
Pre-Installation Checks
- Verify cluster access:
kubectl cluster-info
kubectl auth can-i create deployment
kubectl auth can-i create clusterrole
- Verify Helm installation:
helm version
- Check network connectivity to the Vega API endpoint:
curl -v https://api.vegacloud.io
3. Installation
Quick Start
- Add the Vega Helm repository:
helm repo add vegacloud https://vegacloud.github.io/charts/
helm repo update
- Verify repository addition:
helm search repo vegacloud
- Install the agent:
helm install vega-metrics vegacloud/vega-metrics-agent \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name"
4. Configuration
Helm Chart Parameters
Parameter | Description | Default | Required |
---|---|---|---|
vega.clientId | Client ID for authentication with Vega Cloud API | "" | Yes |
vega.clientSecret | Client Secret for authentication with Vega Cloud API | "" | Yes |
vega.orgSlug | Your Vega Cloud Organization slug | "" | Yes |
vega.clusterName | Unique name for your Kubernetes cluster | "" | Yes |
apiRateLimiting.qps | Kubernetes API requests per second | 100 | |
apiRateLimiting.burst | Max burst of API requests allowed | 100 | |
apiRateLimiting.timeout | API request timeout in seconds | 10 | |
maxConcurrency | Max concurrent collector operations | 8 | |
resources.requests.memory | Memory request for the agent | "2Gi" | |
resources.requests.cpu | CPU request for the agent | "500m" | |
resources.limits.memory | Memory limit for the agent | "4Gi" | |
resources.limits.cpu | CPU limit for the agent | "1000m" | |
replicaCount | Number of agent replicas to run | 1 | |
image.repository | Agent container image repository | public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent | |
image.tag | Agent container image tag | 1.1.4 | |
image.pullPolicy | Container image pull policy | Always | |
nodeSelector | Node labels for pod assignment | {} | |
affinity | Pod affinity/anti-affinity rules | {} | |
tolerations | Pod tolerations for scheduling | [] | |
env | Additional environment variables to set in container | {} | |
If you do not specify the API rate limiting or concurrency parameters, the agent uses its built-in defaults, which are optimized for most use cases. For complex configurations such as `affinity` and `tolerations`, using a custom values file (`-f values.yaml`) is recommended over multiple `--set` flags.
Vega Metrics Agent Configuration
The Vega Metrics Agent can be configured using either environment variables or command-line flags. Below are the key configuration parameters:
Environment Variable | Description | Default | Required |
---|---|---|---|
VEGA_CLIENT_ID | Client ID for authentication | | Yes |
VEGA_CLIENT_SECRET | Client secret for authentication | | Yes |
VEGA_CLUSTER_NAME | Name of the Kubernetes cluster | | Yes |
VEGA_ORG_SLUG | Your Vega Cloud Organization slug | | Yes |
VEGA_POLL_INTERVAL | Interval for polling metrics | 60m | No |
VEGA_UPLOAD_REGION | AWS region for S3 uploads | us-west-2 | No |
LOG_LEVEL | Log level (DEBUG, INFO, WARN, ERROR) | INFO | No |
VEGA_INSECURE | Use insecure connections | false | No |
VEGA_WORK_DIR | Working directory for temporary files | /tmp | No |
VEGA_COLLECTION_RETRY_LIMIT | Retry limit for metric collection | 3 | No |
VEGA_BEARER_TOKEN_PATH | Path to the bearer token file | /var/run/secrets/kubernetes.io/serviceaccount/token | No |
VEGA_NAMESPACE | Kubernetes namespace for agent deployment | vegacloud | No |
VEGA_QPS | API rate limiter for requests per second | 100 | No |
VEGA_BURST | API rate limiter burst allowance | 100 | No |
VEGA_TIMEOUT | Timeout for API requests | 10s | No |
VEGA_MAX_CONCURRENCY | Maximum number of concurrent collectors | 8 | No |
Additional parameters for local testing and debugging include:
- `AGENT_ID`: Unique identifier for the agent
- `SHOULD_AGENT_CHECK_IN`: Determines if the agent should check in with the metrics server
- `START_COLLECTION_NOW`: Start metric collection immediately
- `SAVE_LOCAL`: Save metrics locally
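A sketch of how the environment-variable configuration in the table above could be read, using the documented defaults. The parsing shown here is illustrative; the agent's actual configuration loading may differ.

```python
import os

def load_config(env=None):
    """Read agent settings from environment variables with documented defaults."""
    env = os.environ if env is None else env

    def require(key):
        value = env.get(key)
        if not value:
            raise ValueError(f"{key} is required")
        return value

    return {
        "client_id": require("VEGA_CLIENT_ID"),
        "client_secret": require("VEGA_CLIENT_SECRET"),
        "cluster_name": require("VEGA_CLUSTER_NAME"),
        "org_slug": require("VEGA_ORG_SLUG"),
        "poll_interval": env.get("VEGA_POLL_INTERVAL", "60m"),
        "upload_region": env.get("VEGA_UPLOAD_REGION", "us-west-2"),
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "qps": int(env.get("VEGA_QPS", "100")),
        "burst": int(env.get("VEGA_BURST", "100")),
        "max_concurrency": int(env.get("VEGA_MAX_CONCURRENCY", "8")),
        "retry_limit": int(env.get("VEGA_COLLECTION_RETRY_LIMIT", "3")),
    }

cfg = load_config({"VEGA_CLIENT_ID": "id", "VEGA_CLIENT_SECRET": "secret",
                   "VEGA_CLUSTER_NAME": "cluster", "VEGA_ORG_SLUG": "org"})
print(cfg["qps"], cfg["poll_interval"])  # 100 60m
```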
API Rate Limiting Configuration
The agent provides the following parameters to control the rate of API requests to the Kubernetes API server:
- QPS (Queries Per Second): Controls the sustainable rate of requests to the Kubernetes API. Default is 100 QPS.
- Burst: Sets the maximum burst of requests allowed beyond the QPS rate. Default is 100 requests.
- Timeout: Sets the timeout for individual API requests. Default is 10 seconds.
These settings can be adjusted based on your cluster size and API server capacity. For larger clusters or environments with high API server load, you may need to tune these values to prevent overwhelming the Kubernetes API server.
Example environment variable configuration:
VEGA_QPS=200
VEGA_BURST=300
VEGA_TIMEOUT=15s
Example command line configuration:
--qps=200 --burst=300 --timeout=15s
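Conceptually, QPS and Burst interact like a token bucket: the bucket holds at most Burst tokens and refills at QPS tokens per second, so short spikes up to Burst are allowed while the sustained rate stays at QPS. The sketch below illustrates that interaction only; the agent relies on the Kubernetes client's built-in rate limiter, not this code.

```python
class TokenBucket:
    """Minimal token bucket: capacity = burst, refill rate = qps per second."""

    def __init__(self, qps: float, burst: int):
        self.qps, self.burst = qps, burst
        self.tokens = float(burst)

    def allow(self, elapsed: float = 0.0) -> bool:
        # Refill based on elapsed seconds, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(qps=200, burst=300)
# 300 back-to-back requests succeed (the burst); the 301st is throttled.
results = [bucket.allow() for _ in range(301)]
print(results.count(True))  # 300
```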
Adjusting Concurrency
To control the maximum number of concurrent metric collection operations:
helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set maxConcurrency=12
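The effect of a maxConcurrency-style cap can be illustrated with a semaphore: no matter how many collection tasks are queued, only that many run at once. This is a conceptual sketch, not the agent's implementation.

```python
import threading
import time

def run_with_limit(num_tasks: int, max_concurrency: int) -> int:
    """Run num_tasks dummy workers, returning the observed peak concurrency."""
    sem = threading.Semaphore(max_concurrency)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker():
        nonlocal active, peak
        with sem:                      # at most max_concurrency workers inside
            with lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)           # simulate one collection call
            with lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

peak = run_with_limit(num_tasks=50, max_concurrency=12)
print(peak <= 12)  # True: never more than 12 workers at once
```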
Using a Custom Values File
Create a file named `custom-values.yaml` with your desired configuration:
vega:
clientId: "YOUR_CLIENT_ID"
clientSecret: "YOUR_CLIENT_SECRET"
orgSlug: "YOUR_ORG_SLUG"
clusterName: "YOUR_CLUSTER_NAME"
apiRateLimiting:
qps: 200
burst: 300
timeout: 15
maxConcurrency: 12
resources:
requests:
memory: "4Gi"
cpu: "750m"
limits:
memory: "8Gi"
cpu: "1500m"
Then install using:
helm install vega-metrics-agent ./charts/vega-metrics-agent -f custom-values.yaml
Configuring Scheduling Constraints
You can control where the agent pods are scheduled using `nodeSelector`, `affinity`, and `tolerations`.
Using `nodeSelector`:
To schedule the agent only on nodes with specific labels (e.g., disktype=ssd):
helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set vega.clientId=YOUR_CLIENT_ID \
--set vega.clientSecret=YOUR_CLIENT_SECRET \
--set vega.orgSlug=YOUR_ORG_SLUG \
--set vega.clusterName=YOUR_CLUSTER_NAME \
--set nodeSelector.disktype=ssd
Using `affinity`:
Affinity rules provide more advanced scheduling control. For complex affinity rules, it's recommended to use a custom values file. Here's an example snippet for `custom-values.yaml`:
# custom-values.yaml
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- metrics-agent
topologyKey: "kubernetes.io/hostname"
Using `tolerations`:
To allow the agent pods to be scheduled on nodes with specific taints:
helm install vega-metrics-agent ./charts/vega-metrics-agent \
--set vega.clientId=YOUR_CLIENT_ID \
--set vega.clientSecret=YOUR_CLIENT_SECRET \
--set vega.orgSlug=YOUR_ORG_SLUG \
--set vega.clusterName=YOUR_CLUSTER_NAME \
--set tolerations[0].key="example-key" \
--set tolerations[0].operator="Exists" \
--set tolerations[0].effect="NoSchedule"
For multiple or complex tolerations, using a custom values file is cleaner:
# custom-values.yaml
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
- key: "key2"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 3600
5. Verification & Monitoring
Installation Verification
- Check pod status:
kubectl get pods -n vegacloud -l app=metrics-agent
- View agent logs:
kubectl logs -f deployment/metrics-agent -n vegacloud
- Verify metrics collection:
kubectl top nodes
kubectl top pods -A
- Check agent connectivity:
kubectl exec -it deployment/metrics-agent -n vegacloud -- curl -v https://api.vegacloud.io/health
Health Monitoring
- Monitor agent health:
kubectl describe pod -l app=metrics-agent -n vegacloud
- Check resource usage:
kubectl top pod -l app=metrics-agent -n vegacloud
- View recent events:
kubectl get events -n vegacloud --field-selector involvedObject.name=vega-metrics
6. Maintenance
Upgrades
Option 1: Upgrade via Helm Chart
# Update repository
helm repo update
# Upgrade with existing values
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--namespace vegacloud \
--reuse-values
Option 2: Update Container Image
- Check available container versions:
docker pull public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent
docker images public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent
- Update the deployment with new image:
kubectl set image deployment/metrics-agent \
vega-metrics-agent=public.ecr.aws/c0f8b9o4/vegacloud/vega-metrics-agent:latest \
-n vegacloud
- Monitor the rolling update:
kubectl rollout status deployment/metrics-agent -n vegacloud
Note: Replace `:latest` with a specific version tag for better version control.
Version-Specific Helm Upgrade
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--version 1.2.3 \
--namespace vegacloud \
--reuse-values
Backup and Recovery
- Export current configuration:
helm get values vega-metrics -n vegacloud > vega-metrics-backup.yaml
- Export secrets:
kubectl get secret -n vegacloud vega-metrics-agent-secret -o yaml > vega-metrics-agent-secret-backup.yaml
Uninstallation
helm uninstall vega-metrics -n vegacloud
kubectl delete namespace vegacloud # Optional: removes the namespace
7. Troubleshooting
Common Issues
- Pod in CrashLoopBackOff
- Check logs:
kubectl logs -f deployment/metrics-agent -n vegacloud
- Verify credentials:
kubectl get secret vega-metrics-agent-secret -n vegacloud
- Check resource limits:
kubectl describe pod -l app=metrics-agent -n vegacloud
- Connection Issues
- Verify network policies allow outbound traffic
- Check if proxy configuration is needed
- Ensure correct orgSlug is configured
- Verify DNS resolution:
kubectl run test-dns --image=busybox:1.28 --rm -it --restart=Never -- nslookup api.vegacloud.io
- Authentication Failures
- Verify clientId and clientSecret are correct
- Check secret creation:
kubectl describe secret vega-metrics-agent-secret -n vegacloud
- Validate API access:
curl -v -H "Authorization: Bearer $TOKEN" https://api.vegacloud.io/health
Debugging Steps
- Enable debug logging:
helm upgrade vega-metrics vegacloud/vega-metrics-agent \
--set env.LOG_LEVEL=DEBUG \
--namespace vegacloud
- Check RBAC permissions:
kubectl auth can-i --list --as system:serviceaccount:vegacloud:vega-metrics-agent
- Verify network connectivity:
kubectl run test-net --image=busybox:1.28 --rm -it --restart=Never -- wget -q -O- https://api.vegacloud.io/health
8. Reference
Resource Recommendations
Based on the number of nodes in your cluster, here are the suggested CPU and memory requirements for the agent:
Nodes | CPU Request | CPU Limit | Mem Request | Mem Limit |
---|---|---|---|---|
< 50 | 200m | 500m | 512Mi | 1Gi |
50–100 | 300m | 700m | 1Gi | 2Gi |
100–250 | 500m | 1000m | 2Gi | 4Gi |
250–500 | 750m | 1500m | 4Gi | 8Gi |
500–1000 | 1000m | 2000m | 8Gi | 12Gi |
1000+ | 1500m | 3000m | 12Gi | 16Gi |
To apply these recommendations:
helm install vega-metrics vegacloud/vega-metrics-agent \
--set vega.clientId="your-client-id" \
--set vega.clientSecret="your-client-secret" \
--set vega.orgSlug="your-org-slug" \
--set vega.clusterName="your-cluster-name" \
--set resources.requests.cpu="500m" \
--set resources.requests.memory="2Gi" \
--set resources.limits.cpu="1000m" \
--set resources.limits.memory="4Gi"
- CPU requests should be set to allow guaranteed minimum CPU resources. The metrics agent doesn't need high CPU most of the time but benefits from having a consistent baseline.
- CPU limits should be set moderately higher than requests to allow for metric collection spikes. Setting CPU limits too low can cause throttling that may interrupt metrics collection.
- Memory requests should be set to accommodate the baseline memory footprint plus overhead for metrics processing.
- Memory limits should be set higher than requests to prevent OOM (Out of Memory) kills during peak collection periods.
These values should be adjusted based on your specific monitoring needs, collection frequency, and the total number of metrics being collected. For clusters with high pod density or custom metric collection, increase these values accordingly.
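The sizing table above can be turned into a small lookup helper when you template values files for many clusters. The values are copied from the table; the choice of where a boundary node count (e.g. exactly 100) falls is an assumption, since the table's ranges overlap at their endpoints.

```python
def recommended_resources(node_count: int) -> dict:
    """Return the table's recommended requests/limits for a given cluster size."""
    tiers = [
        (50,   {"cpu_req": "200m",  "cpu_lim": "500m",  "mem_req": "512Mi", "mem_lim": "1Gi"}),
        (100,  {"cpu_req": "300m",  "cpu_lim": "700m",  "mem_req": "1Gi",   "mem_lim": "2Gi"}),
        (250,  {"cpu_req": "500m",  "cpu_lim": "1000m", "mem_req": "2Gi",   "mem_lim": "4Gi"}),
        (500,  {"cpu_req": "750m",  "cpu_lim": "1500m", "mem_req": "4Gi",   "mem_lim": "8Gi"}),
        (1000, {"cpu_req": "1000m", "cpu_lim": "2000m", "mem_req": "8Gi",   "mem_lim": "12Gi"}),
    ]
    for upper_bound, rec in tiers:
        if node_count < upper_bound:
            return rec
    # 1000+ nodes
    return {"cpu_req": "1500m", "cpu_lim": "3000m", "mem_req": "12Gi", "mem_lim": "16Gi"}

print(recommended_resources(120)["mem_req"])  # 2Gi
```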
Common Commands
Operational Commands
# Check agent status
kubectl get pods -n vegacloud -l app=metrics-agent
# View agent configuration
helm get values vega-metrics -n vegacloud
# Force agent restart
kubectl rollout restart deployment vega-metrics -n vegacloud
# Scale agent
kubectl scale deployment vega-metrics -n vegacloud --replicas=2
Debugging Commands
# Check agent permissions
kubectl auth can-i --list --as system:serviceaccount:vegacloud:vega-metrics-agent
# View agent events
kubectl get events -n vegacloud --sort-by='.lastTimestamp'
# Check resource usage
kubectl top pod -l app=metrics-agent -n vegacloud
# View detailed pod information
kubectl describe pod -l app=metrics-agent -n vegacloud