
Over the last few months I've continued working with the great folks at Solace, testing the various flavours of their event mesh platform on vSphere 7 and Tanzu. It's been a great experience and continues to be so, but I wanted to share some recent 'fun' I've had working with the combination of OVA deployments and Kubernetes deployments in my lab.

The Setup

vSphere & ESXi

My lab is a collection of SuperMicro D300 servers set up as a single cluster with vSAN (1TB) and traditional iSCSI storage. I wasn't able to stretch to 10GbE for vSAN so it is doing just fine on a 1GbE copper link, but outside of that everything is on the HCL and shows nice green ticks across the board.

Tanzu Kubernetes Grid

I'm using vSphere 7u1 with Tanzu to give me rapid creation (and deletion) of Kubernetes clusters. Once I'd jumped through a few hoops with the HAProxy configuration (VLAN setup and networking in general isn't a speciality of mine, or much of an interest if I'm honest; Cormac's blog was just what I needed to get this working), everything has been running perfectly for the last few months... or so I thought.

Solace Deployment

Solace is packaged in a number of form factors depending on how you want to run your event mesh and your level of comfort with the underlying packaging. In the past a vSphere OVA was a common choice. Moving to Kubernetes, Solace can either be delivered as a single docker image manually deployed into the K8S cluster, or deployment and configuration can be managed from Solace's SaaS portal (Solace Cloud), which permits deployment and configuration of an entire worldwide event mesh from one place. This global deploy & configure model requires the Solace agent to be pushed to any K8S cluster you might want to adopt in the future.

It all sounds easy enough... now for the complication.

For my lab testing the deployment was easy enough:

  • Spin up 2 Tanzu K8S clusters. 
  • Deploy the helm chart for Solace on each cluster
  • Deploy the OVA onto the vSphere Cluster
  • Spin up a final Tanzu Cluster for the Solace Cloud Agent
  • Deploy the config for the Solace Cloud Agent and remotely configure. 

Spinning up TKG K8S Clusters.

With Workload Management set up in vSphere, creating the Tanzu clusters was quick and easy. I'm a fan of keeping all my configuration in code form (and hidden safely away in GitHub), which lets me make sure each cluster is identical apart from the name.

My Solace-01 & Solace-02 cluster yaml is very simple.
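The file itself isn't reproduced here, but as a rough sketch (reconstructed from the cluster spec dumped later in this post, using solace-03 as the example name, so treat the exact values as illustrative) it looks something like this:

apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: solace-03            # the name is the only thing that changes between clusters
  namespace: homelab
spec:
  distribution:
    version: v1.18
  topology:
    controlPlane:
      class: best-effort-large
      count: 1
      storageClass: vsan-default-storage-policy
    workers:
      class: best-effort-large
      count: 3
      storageClass: vsan-default-storage-policy
  settings:
    storage:
      classes:
      - vsan-default-storage-policy
      defaultClass: vsan-default-storage-policy   # defaultClass has to be set here, at the supervisor level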

With this config I'm able to log in to the supervisor cluster and create my guest clusters. It is worth noting that setting the default PVC storage class is an operation at the supervisor level. If you do not set up the defaultClass in the cluster definition you will not be able to set it from inside the guest cluster. In fact, if you do set a default storage class inside the guest cluster it will appear to be configured, then during the k8s reconciliation loop it will become unset again, causing much confusion.

Log in to the supervisor cluster and apply the config:

kubectl vsphere login --server=192.168.0.61 --insecure-skip-tls-verify --vsphere-username robbie@lab.shadowtech.io

Once logged into the supervisor I can check any existing Tanzu Kubernetes Clusters, or look up values for the other parameters in my yaml.

e.g.

(⎈ |homelab:homelab)robbie@Cores ~ % kubectl get tkc

NAME           CONTROL PLANE   WORKER   DISTRIBUTION                      AGE     PHASE
solace-03      1               3        v1.18.15+vmware.1-tkg.1.600e412   14h     running
solace-04      1               3        v1.18.15+vmware.1-tkg.1.600e412   3h15m   running
solace-cloud   1               3        v1.18.15+vmware.1-tkg.1.600e412   174m    running

kubectl get virtualmachineclass

^^ Lists the virtual machine classes you can deploy.

kubectl describe virtualmachineclass guaranteed-small

^^ Find out about each machine class.

kubectl get tanzukubernetescluster

^^ Gets the provisioned TKG clusters.

kubectl describe tanzukubernetescluster

^^ Discover the deployment state of each cluster.

To create my new cluster from my yaml definition:

kubectl create -f solace-03.yaml
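The cluster takes a few minutes to provision, so I keep an eye on it from the supervisor until the phase reports running, e.g.:

kubectl get tanzukubernetescluster -n homelab solace-03 -w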

With the cluster deployed I like to extract the kubeconfig file from the supervisor cluster and store it as part of my lab configuration (it gets encrypted and pushed to GitHub alongside everything else in a private repo).

The supervisor cluster holds the kubeconfigs for all the deployed clusters:

kubectl get secrets | grep kubeconfig

solace-03-kubeconfig      Opaque   1   14h
solace-04-kubeconfig      Opaque   1   3h26m
solace-cloud-kubeconfig   Opaque   1   3h4m

Grab the config and decode it (note the capital '-D' on macOS versus the lowercase '-d' on Linux):

kubectl get secret solace-03-kubeconfig -o jsonpath='{.data.value}' | base64 -D > ./kubeconfigs/solace-03.kubeconfig
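With the kubeconfig extracted I can talk to the guest cluster directly without going back through the vSphere login, for example:

kubectl --kubeconfig=./kubeconfigs/solace-03.kubeconfig get nodes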

Helm deploy for the prod1k setup

helm install solace-03 --set solace.size=prod1k,solace.usernameAdminPassword=VMware solacecharts/pubsubplus

helm install solace-04 --set solace.size=prod1k,solace.usernameAdminPassword=VMware solacecharts/pubsubplus
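Each release spins up a pubsubplus StatefulSet fronted by a LoadBalancer service; assuming the chart's default naming, a quick sanity check that the pods are up and HAProxy has handed the service an external IP looks something like:

kubectl get pods | grep solace-03-pubsubplus

kubectl get svc | grep solace-03-pubsubplus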

OVA deployment - simply deploy the downloaded OVA through the vSphere console. It is of course possible to deploy this using govc; I chose to use the GUI for a change.
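For reference, a govc-based deploy would look roughly like the below (credentials, datastore and OVA filename are illustrative placeholders, not my real values):

export GOVC_URL='administrator@vsphere.local:password@vcenter.lab.local'
export GOVC_INSECURE=1
govc import.ova -ds=vsanDatastore -name=solace-pubsubplus ./pubsubplus-eval.ova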

With these deployments running happily I worked with Michael@Solace to configure the event mesh, and this is where things became confusing.

"It should be working ?!?!"

OVA & K8S

Configuring the Solace event mesh cluster, I noticed something strange. It was possible to configure the event mesh successfully provided the configuration was initiated by the OVA deployment; if either of the Kubernetes deployments initiated the event mesh config, the connection to the OVA would fail.

OVA -> Solace-03 - Connection & Config Successful

Solace-03 -> OVA - Connection Failed -> Config Failed

Solace-03 -> Solace-04 - Connection & Config Successful

Solace-04 -> OVA - Connection Failed -> Config Failed

Solace-04 -> Solace-03 - Connection & Config Successful

The broker pods themselves were up and healthy throughout:

kubectl get pods --namespace default --show-labels -w | grep solace-03-pubsubplus

Given the Kubernetes workloads are behind an HAProxy LB on a different subnet & VLAN, I assumed there must be something blocking the route between my 192.168.0.x network (the management network where the OVA lives) and my 192.168.120.x network, where the Kubernetes workloads have their LB entry point. Checking the network configurations, neither vSphere nor my lab gateway were blocking any traffic. I could ping servers on either side of the network, and I could use curl to connect to the OVA's various service ports and get a response.
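The sort of checks that all came back fine from a VM on either network (the OVA's address and port below are illustrative placeholders):

ping -c 3 192.168.0.1                # management network gateway - responds
ping -c 3 192.168.0.40               # the OVA's management IP (placeholder) - responds
curl -v http://192.168.0.40:8080/    # one of the OVA's service ports (placeholder) - answers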

So VM (192.168.0.x) to VM (192.168.120.x) communication works perfectly, and I can connect to and test services running on the K8S LB and even the cluster & node IPs. Time to get inside Kubernetes and look at the networking... not where I really wanted to be.

Dropping onto a running Solace pod I could investigate the networking:

kubectl exec -n solace -it solace-01-pubsubplus-0 -- bash

From inside this pod I wasn't able to ping any services on the 192.168.0.x network; everything on 192.168.120.x was fine, but the route to the 192.168.0.1 gateway wasn't working (even more confusing because 192.168.0.1 and 192.168.120.1 are the same machine). But progress: the K8S network routes weren't working correctly.
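Roughly what that looked like from inside the pod (reconstructed from memory, and assuming ping is present in the broker image):

ping -c 3 192.168.120.1    # workload network gateway - responds fine
ping -c 3 192.168.0.1      # management network gateway - nothing, despite being the same physical box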

Turning to Google I dived straight into learning more about kube-proxy and k8s networking, and speaking with others on Slack, because to be perfectly honest I was confused and networking isn't where I like to spend my time. At this point one of the awesome VMware engineers helpfully jumped onto a Zoom call with me to help diagnose the problem (sometimes a fresh perspective helps).

I walked him through my deployments and showed him the configs... after a brief pause he asked me to dump the cluster configuration for my solace-03 cluster from the supervisor cluster, instead of looking at my hand-crafted file, and take a look at the spec.

kubectl get -o yaml tanzukubernetesclusters -n homelab solace-03

apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"run.tanzu.vmware.com/v1alpha1","kind":"TanzuKubernetesCluster","metadata":{"annotations":{},"name":"solace-03","namespace":"homelab"},"spec":{"distribution":{"version":"v1.18"},"settings":{"network":{"cni":{"name":"antrea"},"pods":{"cidrBlocks":["192.168.0.0/12"]},"serviceDomain":"cluster.local","services":{"cidrBlocks":["10.96.0.0/12"]}},"storage":{"classes":["vsan-default-storage-policy"],"defaultClass":"vsan-default-storage-policy"}},"topology":{"controlPlane":{"class":"best-effort-large","count":1,"storageClass":"vsan-default-storage-policy"},"workers":{"class":"best-effort-large","count":3,"storageClass":"vsan-default-storage-policy"}}}}
  creationTimestamp: "2021-03-23T22:34:11Z"
  finalizers:
  - tanzukubernetescluster.run.tanzu.vmware.com
  generation: 1
  managedFields:
  - apiVersion: run.tanzu.vmware.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
spec:
  distribution:
    fullVersion: v1.18.15+vmware.1-tkg.1.600e412
    version: v1.18
  settings:
    network:
      cni:
        name: antrea
      pods:
        cidrBlocks:
        - 192.168.0.0/12
      serviceDomain: cluster.local
      services:
        cidrBlocks:
        - 10.96.0.0/12
    storage:
      classes:
      - vsan-default-storage-policy
      defaultClass: vsan-default-storage-policy
  topology:
    controlPlane:
      class: best-effort-large
      count: 1
      storageClass: vsan-default-storage-policy
    workers:
      class: best-effort-large
      count: 3
      storageClass: vsan-default-storage-policy
status:
  addons:
    authsvc:
      name: authsvc
      status: applied
      version: 0.1-67-gb9aa0d3
    cloudprovider:
      name: vmware-guest-cluster
      status: applied
      version: 0.1-87-gb6bb261
    cni:
      name: antrea
      status: applied
      version: v0.9.2_vmware.2
    csi:
      name: pvcsi
      status: applied
      version: v0.0.1.alpha+vmware.81-3f4bb9e
    dns:
      name: CoreDNS
      status: applied
      version: v1.6.7_vmware.8
    proxy:
      name: kube-proxy
      status: applied
      version: 1.18.15+vmware.1
    psp:
      name: defaultpsp
      status: applied
      version: v1.18.15+vmware.1-tkg.1.600e412
  clusterApiStatus:
    apiEndpoints:
    - host: 192.168.120.174
      port: 6443
    phase: Provisioned
  nodeStatus:
    solace-03-control-plane-68n2j: ready
    solace-03-workers-7n9hm-589f6f9c59-4zgft: ready
    solace-03-workers-7n9hm-589f6f9c59-6x96x: ready
    solace-03-workers-7n9hm-589f6f9c59-qs9xl: ready
  phase: running
  vmStatus:
    solace-03-control-plane-68n2j: ready
    solace-03-workers-7n9hm-589f6f9c59-4zgft: ready
    solace-03-workers-7n9hm-589f6f9c59-6x96x: ready
    solace-03-workers-7n9hm-589f6f9c59-qs9xl: ready

Within minutes of looking at this yaml the problem jumped out: the default CIDR block for the pod network was 192.168.0.0/12, which overlaps my lab network of 192.168.0.x/24, thereby causing the routing problem. Because I hadn't explicitly described the pod CIDR, TKG had picked up the 'sensible' default, which 'luckily' conflicted. I quickly edited my Solace TKG cluster configurations to include a different pod CIDR, deleted the old clusters and redeployed.
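The change is just a case of spelling the network settings out explicitly in the cluster yaml; something along these lines, where the replacement pod CIDR is only an example and anything that doesn't overlap the lab's 192.168.x.x ranges will do:

  settings:
    network:
      cni:
        name: antrea
      pods:
        cidrBlocks:
        - 172.20.0.0/16      # explicit pod CIDR instead of the conflicting 192.168.0.0/12 default
      services:
        cidrBlocks:
        - 10.96.0.0/12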

I redeployed Solace from the Helm charts and everything worked as expected.

A future blog will show the Solace event mesh running across K8S, VM + the K8S cloud agent, with message flows from Java and C++. I think this post is long enough already.

Lessons Learned

  • Be explicit when you describe your K8S configurations, both for cluster config and application configuration. 'Sensible defaults' will often work, but they vary from K8S platform to K8S platform, causing difficult-to-debug problems. If I hadn't had full control of my environment it would have been very difficult to eliminate all the variables.
  • It's usually the network -> start there 🙂
  • Talk your deployment through with someone who didn't work on it with you. Fresh eyes and ears always help. They challenge your documentation and help you find the assumptions you made because you were too close to the rock face.
  • And finally... it's always the network. I need to spend more time over there.

Thanks for reading,
