VMware Cloud Foundation bringup troubleshooting

Recently I was playing around with the bringup of a VMware Cloud Foundation environment in my lab (version 3.9). I had done this before and it all went fine, but this time around I ran into some issues. The first issue had to do with the step in the bringup where the licenses get uploaded to the SDDC Manager. The second issue popped up while setting the vCenter login message. Normally you would fix the issue, reset the ESXi hosts and start over. Since I was only testing in a lab, I figured why not troubleshoot and try to solve this so I could continue the bringup.

Disclaimer

Alright, the things that follow in this blog post are almost certainly NOT SUPPORTED by VMware, unless maybe support is telling you to do this. I just want to make clear that I work for VMware, and had this been a production bringup at a customer I would have raised a support ticket or started over. The things I tried were purely to learn more about the inner workings of VMware Cloud Foundation and because I’m sometimes too stubborn to give up if I feel I’m close to solving the puzzle 🙂 Also, this involves making changes to the embedded PostgreSQL database on the Cloud Builder appliance.

That being said, continue at your own risk.

First issue

The first issue I encountered was at the step called “Update SDDC Manager Inventory with License Details”. Looking at the Cloud Builder appliance logs located in /var/log/vmware/vcf/bringup/, I noticed something was not right with updating the SDDC Manager with the NSX license.
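For anyone who wants to narrow down the logs a bit faster, here is a minimal Python sketch that scans a log directory for suspicious lines. It assumes the log files end in .log and that failures contain the text ERROR; both are assumptions on my part, not something taken from the Cloud Builder documentation, and the function name is purely illustrative.

```python
from pathlib import Path


def find_error_lines(log_dir, needle="ERROR"):
    """Return (file name, line number, text) for every line containing needle."""
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log has odd bytes in it
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Directory taken from the post; the *.log pattern is an assumption.
    for name, number, text in find_error_lines("/var/log/vmware/vcf/bringup"):
        print(f"{name}:{number}: {text}")
```

Changing the needle to something like the name of the failing step is usually a quicker way in than reading the whole file.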

When I double checked, I noticed that I had entered the wrong license key in the VMware Cloud Foundation bringup Excel sheet. So now what: revert, update the license and try again? Nope, I wanted to figure out how I could change this in the current bringup.

PostgreSQL database

After some digging around in the Cloud Builder file system I could not find any leads so I turned my attention to the embedded PostgreSQL database.

To log in to the PostgreSQL database on the Cloud Builder appliance follow these steps:

Log in to the Cloud Builder appliance as admin
Run the following command: sudo psql -U postgres -d bringup -h localhost
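If you prefer one-off queries over an interactive session, the same connection settings can be wrapped in a small helper. This is only a sketch: the function name is made up, and it simply builds the psql argument list using the standard -x (expanded output) and -c (run one command) options.

```python
import shlex


def psql_command(sql, database="bringup", user="postgres", host="localhost"):
    """Build the argument list for a one-off psql query with expanded output."""
    return [
        "sudo", "psql",
        "-U", user,
        "-d", database,
        "-h", host,
        "-x",        # expanded display: one column per line
        "-c", sql,   # run a single command and exit
    ]


if __name__ == "__main__":
    # \dt is the psql meta-command that lists the tables in the database
    argv = psql_command(r"\dt")
    # shlex.join quotes the arguments so the line can be pasted into a shell
    print(shlex.join(argv))
```

The expanded display is what produces the one-column-per-line record layout used in the rest of this post.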

There were only a handful of tables to investigate and I quickly found the table “ResourceConfig”. Using the statement SELECT * FROM "ResourceConfig"; I found all sorts of records that seem to hold configuration data.
One record looked like this:

id               | 9836fb93-bcf5-4b33-9fa5-6e61657171fd
creationTime     | 1575899188518
dalVersion       | 1.0
modificationTime | 1575899188518
status           | Active
version          | 6.4.5-13282012
bringupId        | 616b48d1-f3d7-40a2-819c-7a40f37b649d
payload          | {"nsxManagerConfigId":"9836fb93-bcf5-4b33-9fa5-6e61657171fd","nsxManagerId":"wve-vcf-nsxm-01","vCenterId":"37f1b86b-f8b2-46ad-89b8-106c9d6bcb2c","bringupId":"616b48d1-f3d7-40a2-819c-7a40f37b649d","status":"Active","version":"6.4.5-13282012","ovfDatacenter":"dc-01","ovfDataStore":"vSAN-mgmt","ovfCluster":"cl-mgmt-01","ovfVMName":"wve-vcf-nsxm-01","ovfNetwork":"SDDC-DPortGroup-Mgmt","ovfUrl":"/mnt/iso/sddc-foundation-bundle-3.9.0.0-14866160/nsx_ova/VMware-NSX-Manager-6.4.5-13282012.ova","dlfFilePath":"DLF/license-nsx-621a-d1-2-eoem-c2-201307","licenseFile":"xxxxx-xxxxx-xxxxx-xxxxx-xxxxx","VSMCLIAdminCredentials":{"username":"admin","password":"xxxx"},"VSMCLIPrivilegedCredentials":{"username":"admin","password":"xxxx"},"ovfVSMHostname":"wve-vcf-nsxm-01.wve.net","ipAddress":{"address":"172.16.27.x","dhcpEnabled":false,"netmask":"255.255.255.0","gateway":"172.16.27.x","cidr":"24"},"nsxManagerTlsSettings":"TLSv1.2","licenseType":"PURCHASED"}
payloadType      | NSX_MANAGER_CONFIG

Notice that the payload has a field called “licenseFile”. I decided to update this record with the correct license key using the following SQL statement.

UPDATE "ResourceConfig"
SET "payload"='{"nsxManagerConfigId":"9836fb93-bcf5-4b33-9fa5-6e61657171fd","nsxManagerId":"wve-vcf-nsxm-01","vCenterId":"37f1b86b-f8b2-46ad-89b8-106c9d6bcb2c","bringupId":"616b48d1-f3d7-40a2-819c-7a40f37b649d","status":"Active","version":"6.4.5-13282012","ovfDatacenter":"dc-01","ovfDataStore":"vSAN-mgmt","ovfCluster":"cl-mgmt-01","ovfVMName":"wve-vcf-nsxm-01","ovfNetwork":"SDDC-DPortGroup-Mgmt","ovfUrl":"/mnt/iso/sddc-foundation-bundle-3.9.0.0-14866160/nsx_ova/VMware-NSX-Manager-6.4.5-13282012.ova","dlfFilePath":"DLF/license-nsx-621a-d1-2-eoem-c2-201307","licenseFile":"yyyyy-yyyyy-yyyyy-yyyyy-yyyyy","VSMCLIAdminCredentials":{"username":"admin","password":"xxxx"},"VSMCLIPrivilegedCredentials":{"username":"admin","password":"xxxx"},"ovfVSMHostname":"wve-vcf-nsxm-01.wve.net","ipAddress":{"address":"172.16.27.x","dhcpEnabled":false,"netmask":"255.255.255.0","gateway":"172.16.27.x","cidr":"24"},"nsxManagerTlsSettings":"TLSv1.2","licenseType":"PURCHASED"}'
WHERE "id"='9836fb93-bcf5-4b33-9fa5-6e61657171fd';

Note that I only changed the license key; the rest I kept exactly the same.
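Retyping a payload this long by hand is an easy way to introduce a typo. As an alternative, here is a small Python sketch that parses the payload, changes only the licenseFile field and prints the UPDATE statement. The function name is hypothetical, and it assumes the payload is valid JSON on a single line (so without the line wrapping a terminal may add).

```python
import json


def build_license_update(record_id, payload_text, new_key):
    """Return an UPDATE statement that changes only the licenseFile field."""
    payload = json.loads(payload_text)
    if "licenseFile" not in payload:
        raise KeyError("payload has no licenseFile field")
    payload["licenseFile"] = new_key
    # compact separators match the no-whitespace style of the stored payload
    new_text = json.dumps(payload, separators=(",", ":"))
    # double any single quotes so the value is a valid SQL string literal
    literal = new_text.replace("'", "''")
    safe_id = record_id.replace("'", "''")
    return (
        'UPDATE "ResourceConfig" '
        f"SET \"payload\"='{literal}' "
        f"WHERE \"id\"='{safe_id}';"
    )


if __name__ == "__main__":
    # Shortened sample payload for illustration only, not the full record.
    sample = '{"nsxManagerId":"wve-vcf-nsxm-01","licenseFile":"xxxxx","licenseType":"PURCHASED"}'
    print(build_license_update("9836fb93-bcf5-4b33-9fa5-6e61657171fd", sample, "yyyyy"))
```

Because Python dictionaries keep their insertion order, the fields come out in the same order they went in, which makes it easy to compare the old and new payload before running anything.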

When I now clicked the retry button on the Cloud Builder web interface the bringup continued.

Side effects

So far the only side effect I noticed is that the original ‘wrong’ license got uploaded to vCenter and was sitting there unassigned. I just removed it and no harm was done. The correct license was there and assigned, and it all looked fine.

Second issue

The second issue I ran into happened a lot later on in the process. This was really a bummer since I was so close to having the bringup completed. The step it failed on was “Configure vCenter Login Message and Message of the Day”. This was not so important for me in this scenario. And again I thought I could fix this pretty quickly in the Cloud Builder database.

From the log files on the Cloud Builder appliance I saw that the API call to set the vCenter login message gave an error stating ‘CONFIGURE_VCENTER_LOGIN_MESSAGE_FAILED Failed to configure vCenter login message’ and ‘Received message is too long: 1433299817’.
So I figured, that’s odd, why would this message be too long? You would think this is tested. So I went looking for the text that was getting set.
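A number this large is unlikely to be a real message length. One quick sanity check is to look at its four bytes as text, since a "length" like this can simply be the start of some unexpected text that was read as a number. The sketch below does only that; it is a check under that assumption, not a confirmed diagnosis of what went wrong here, and the function name is made up for illustration.

```python
def length_as_text(value):
    """Interpret a 32-bit length as four big-endian bytes and show them as text."""
    raw = value.to_bytes(4, byteorder="big")
    # non-printable bytes are shown as '.' so the result is always readable
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    print(length_as_text(1433299817))
```

For 1433299817 this prints the four printable characters Unki, which at least suggests the value may not be a genuine length.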

Skipping steps

After some digging around I found (in the database) the actual message that should be set by the bringup process. I checked the vCenter login message and it was not there. I set the message manually and did not get an error that the message was too long. This means that what the API is trying to do should work.
At this point I wasn’t sure what to do. The login message and message of the day had been set correctly (partly manually), but retrying the bringup kept failing. If only there were a way to mark this step as completed successfully.

More PostgreSQL

I went back to the PostgreSQL database and started to look for places to mark this step as completed. The first thing I checked was the table I found for the previous issue but there was nothing there I could change that would help.

Next I turned to the table called “processing_task”. This indeed has a record with the status “COMPLETED_WITH_FAILURE”. So I tried to change this to a status that the previous records also have, “POSTVALIDATION_COMPLETED_WITH_SUCCESS”.

This made total sense to me but when I reopened the Cloud Builder web interface I saw massive red bars and error messages. Again, proof you should not mess around in the database 🙂 Time to undo the changes and look for something else.

This time I looked into the table “processing_context”. Here we also find records with success and failure messages. For instance this one:

-[ RECORD 1 ]----------------+--------------------------------------------------------------------------------
id                           | 7f000001-6eea-1f2b-816e-eae94b58008e
execution_order              | 49
dal_version                  | 1.0
execution_errors             | null
execution_id                 | 616b48d1-f3d7-40a2-819c-7a40f37b649d
meta                         | null
next_processing_state        | S49
previous_processing_state    | S48
processed_resource_type      | EVO
processing_state_description |
processing_state_name        | _VcenterMessagesConfigurations_VsphereContractPlugin_ConfigureVcenterMessages_3
recipe_version               | 1.0
ref_id                       |
sddc_id                      | dc-01-workflowspec-ems
status                       | COMPLETED_WITH_FAILURE

Running the following statement marked this record as completed successfully:

UPDATE "processing_context"
SET "status"='COMPLETED_WITH_SUCCESS'
WHERE "id"='7f000001-6eea-1f2b-816e-eae94b58008e';

After setting this and reloading the Cloud Builder web interface I got no errors. I hit the retry button and on it went. The step was actually still marked as failed in the end, but the bringup continued.
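Since my first attempt in “processing_task” had to be undone, it helps to have the revert statement ready before changing anything. Here is a minimal Python sketch that generates both statements for a status change. The function name is hypothetical, and the extra check on the current status is my own addition so the statement does nothing if the record is not in the state you expect.

```python
def status_update_pair(record_id, old_status, new_status, table="processing_context"):
    """Return (apply, revert) UPDATE statements for one record's status."""
    def quote(value):
        # double single quotes so the value is a valid SQL string literal
        return "'" + value.replace("'", "''") + "'"

    def statement(from_status, to_status):
        # the extra status check makes the statement a no-op if the row
        # is not in the state we expect
        return (
            f'UPDATE "{table}" SET "status"={quote(to_status)} '
            f'WHERE "id"={quote(record_id)} AND "status"={quote(from_status)};'
        )

    return statement(old_status, new_status), statement(new_status, old_status)


if __name__ == "__main__":
    apply_sql, revert_sql = status_update_pair(
        "7f000001-6eea-1f2b-816e-eae94b58008e",
        "COMPLETED_WITH_FAILURE",
        "COMPLETED_WITH_SUCCESS",
    )
    print(apply_sql)
    print(revert_sql)
```

Keeping the revert statement next to the change makes it much less painful to back out when the web interface starts showing red bars.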

Side effects

So far I have not seen any side effects after my efforts to get through the bringup process while hacking my way through the database. After the bringup my VMware Cloud Foundation instance is still running fine. I also added an additional host and this went fine.

Conclusion

Even though this would be completely unsupported in a real world scenario (see the disclaimer), I found it very entertaining and educational to dig around in the Cloud Builder database trying to move the bringup forward. I learned how the bringup works in the back end and picked up some additional PostgreSQL knowledge along the way. Also keep in mind that these are just two issues that showed up from a long list of steps. Next time it might be something else which requires a different approach. But let me say this: given the state VMware Cloud Foundation is in currently, it will probably just work as long as you have satisfied all prerequisites.
