Earlier this week I was continuing to work with Solace, testing out their new Cloud Provisioner for the Solace broker on TKG, and ran into an 'interesting' issue. For some reason all the persistent volumes mounted into the primary app container by Kubernetes had incorrect permissions set, and the running process didn't have permission to write to them... which of course caused the deployment to fail.
Upon further testing this was also happening in the simplest of use cases - busybox with a PV (https://github.com/RobbieJVMW/Kubernetes-PV-Test). In fact anything that wasn't running as privileged no longer had permission to write to its PVs. After first checking I hadn't made some crazy error like specifying read-only (I hadn't), Solace Dev and I started poking around. Did I mention that I love OSS? Without too much effort we found an issue similar to what we had experienced.
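A minimal repro along those lines looks something like this (a sketch, not taken from the linked repo; names, image, and sizes are illustrative - the key bits are fsGroup and a non-root user):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-test-claim
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pv-test
spec:
  # Non-privileged pod: the volume must be chown'd to fsGroup
  # by Kubernetes for the write below to succeed
  securityContext:
    runAsUser: 1000
    fsGroup: 1000
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "touch /data/ok && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pv-test-claim
```

If fsGroup isn't being honoured by the CSI driver, the `touch` fails with a permission error and the pod crash-loops.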
https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/370 - sure enough, this was the issue: fsGroup wasn't being honoured by the driver, so anything using a security context was going to have a problem. Francois@SolaceDev spotted a change to how CSI drivers work in Kubernetes 1.19 with regard to fsGroupPolicy in the CSIDriver spec.
and the pieces fell into place... I had just upgraded my TKG cluster to TKG 1.2 (k8s 1.19), and the Solace images were using fsGroups and a security context. In k8s 1.19 the default behaviour changed: if fsGroupPolicy is left at its default (ReadWriteOnceWithFSType), these conditions must be met in order for fsGroup to be applied:
* AccessType MUST be ReadWriteOnce
* The fsType MUST be specified
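For reference, the field in question lives on the CSIDriver object; this is roughly what it looks like in 1.19 (a sketch using the vSphere driver's name, not its actual shipped manifest):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.vsphere.vmware.com
spec:
  # ReadWriteOnceWithFSType is the default when unset: fsGroup is only
  # applied when the access mode is ReadWriteOnce AND fsType is defined
  fsGroupPolicy: ReadWriteOnceWithFSType
```

Setting the policy to `File` instead would tell Kubernetes to always apply fsGroup, but that's a driver-level decision, not something you'd normally override on someone else's driver.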
Checking back on my default StorageClass, fsType wasn't defined (which would make it null). Was the fix as simple as adding an additional parameter to my StorageClass?
Defining the previously optional parameter in my vSAN StorageClass (XFS in my case, since the Solace agent is optimised for that filesystem) fixed my permission error.
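For anyone hitting the same thing, the fix amounts to something like this (the StorageClass name and storage policy are illustrative; `csi.storage.k8s.io/fstype` is the standard CSI provisioning parameter):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsan-xfs
provisioner: csi.vsphere.vmware.com
parameters:
  # Previously optional; in k8s 1.19 it must be set for
  # fsGroup to be applied under the default fsGroupPolicy
  csi.storage.k8s.io/fstype: xfs
```

Note that StorageClass parameters are immutable, so in practice you create a new class like this and point your PVCs at it rather than editing the existing one.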
I updated the GitHub issue with the findings and the workaround / documentation fix, and a quick conversation with one of the VMware OSS storage driver developers via Slack confirmed both the issue and the workaround; the dev committed to updating the documentation to make the requirement very clear.
( The VMware Storage Driver Docs are published via gitbooks and not something you can contribute to directly or I would have made the change myself ).
So folks... 'weird' problem discovered, diagnosed, work-around developed, and developer confirmed within 24 hours (timezones were involved). Not something that could have been achieved using closed source... I'm sure I would still be at the 'send us some more logs' phase.
So if you update your K8S version and weird things happen... check the release notes very closely. Small 'optional' configurations can change the course of your day.
Now everything is working nicely again .. I can get the Solace Cloud agent re-deployed and record some more videos.