Freedom to Choose… Freedom to Play… Freedom to Cloud…

I just returned from a week in New Orleans at the Nutanix .Next conference, where I was fortunate to represent eGroup as a partner as well as being part of the Nutanix Technical Champions group.

In addition to being a conference attendee, my co-worker Dave Strum and I co-presented with one of our customers on the benefits of deploying Nutanix on Cisco UCS hardware, lessons learned, and future plans.  It was fun and definitely not like your typical presentation.

There are a lot of blog posts and content around the .Next conference news (plug for Dave here), and the Nutanix roadmap continues to dazzle and amaze people (ok, me especially) with simplicity, functionality, and yes, Freedom.  The keyword here is Freedom.

This post isn’t about recapping the .Next conference; I’ll let my peers and friends handle that.  This post is about Freedom…


I recently had the opportunity to deploy 12 Nutanix nodes for a customer across 2 sites (Primary and DR), 6 of which were NX-3055-G5 nodes with dual NVIDIA M60 GPU cards installed, dedicated to running the customer’s Horizon View desktop VMs. This was my first experience doing a Nutanix deployment using the NVIDIA GPU cards with VMware, and thankfully there is plenty of documentation out there on the process.

The Nutanix deployment with GPU cards installed is no different than without: you still go thru the process of imaging the nodes with Foundation, just as you would without GPU cards. In this case, each site was configured with 2 Nutanix clusters: one for server VMs and a second specific to VDI. The VDI cluster was configured as a 3-node cluster using the NX-3055-G5 nodes, running Horizon View 7.2.0 specifically.

I’ll touch on some details of the M60 card below, then get into a few places where I had issues with the deployment and how I fixed them, and finally cover some host/VM configuration and validation commands.
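As a quick preview, two general-purpose checks I find handy for validating the GPU side of a host (these assume the NVIDIA vGPU manager VIB is already installed, and aren’t specific to this deployment):

esxcli software vib list | grep -i nvidia    # confirm the NVIDIA vGPU manager VIB is installed
nvidia-smi                                   # confirm the driver sees the physical GPUs

Both are run from an SSH session on the ESXi host. Each M60 card carries two physical GPUs, so a dual-M60 node should show four GPUs in the nvidia-smi output.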


Well, it’s that time again… 2017 has come and gone, and sometimes I just don’t know where all the time went and what I was able to accomplish.

I’m happy to say that for the 2nd year in a row I’m part of a great group of people in the IT industry, those of us pushing the value of Nutanix and their simple, effective and scalable HyperConverged solutions.

Pretty cool, in the large world of IT, to be part of this small group of folks in the #NutanixNTC family, and especially to be joined on this journey by another eGroup member, Dave Strum (http://vthistle.com).

Thank you Nutanix for giving us an amazing platform to help our customers along on their journey, and I cannot wait to see what’s in store for 2018!

To read the full post about the 2018 Nutanix Technology Champions, follow the link below.

http://next.nutanix.com/t5/Nutanix-Connect-Blog/Welcome-to-the-2018-Nutanix-Technology-Champions-NTC/ba-p/26328

This week I had the pleasure of deploying 2 more Nutanix blocks on behalf of one of our partners, who is now starting to highly recommend Nutanix for their customers’ deployments of critical systems.

The installation was pretty vanilla: 3 NX-1065-G5 nodes at the Primary site and a matching set at the DR site.  For the VMware components, we went with the vCenter 6.5 appliance (I love the stability and speed of the 6.5 appliance, by the way), and for the ESXi hosts we went with 6.5 (build 4887370).

The install went great, super fast and easy as is always the case with Nutanix deployments, and we were off and rolling toward the customer rollout.

After running the command ncc health_checks run_all post-install (using ncc version 3.0.4-b0379d15), I noticed that the results were calling out 3 hosts for having disabled services.

Detailed information for esx_check_services:
Node 10.xx.xx.xx: 
WARN: Services disabled on ESXi host:
 -- sfcbd-watchdog
Node 10.xx.xx.xx: 
WARN: Services disabled on ESXi host:
 -- sfcbd-watchdog
Node 10.xx.xx.xx: 
WARN: Services disabled on ESXi host:
 -- sfcbd-watchdog

After doing some research on why sfcbd-watchdog wasn’t starting, and trying to start it manually, I came across this KB article from VMware, which details that this is expected behavior starting in ESXi 6.5.

Wondering if the NCC code just hadn’t been updated for this specific change from VMware, I checked the Nutanix Knowledge Base and came across this link, which details that services identified by the ncc health_checks hypervisor_checks esx_check_services check should be enabled.

Ok, so that makes sense… ESXi 6.5 has been out long enough to assume that the ncc scripts have been updated to accommodate the 6.5 changes.  So, time to get the service re-enabled and check ncc again.

To enable the service on an ESXi 6.5 host, use the command esxcli system wbem set --enable true (be sure to use double hyphens!).  Per VMware, if a 3rd-party CIM provider is installed, sfcbd and openwsman should start automatically.  Just to be safe, I also ran /etc/init.d/sfcbd-watchdog start followed by /etc/init.d/sfcbd-watchdog status to make sure my services started.
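For reference, here’s the full sequence I ran on each host from an SSH session:

esxcli system wbem set --enable true
/etc/init.d/sfcbd-watchdog start
/etc/init.d/sfcbd-watchdog status

The status command should report sfcbd as running before you re-run the NCC check.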

So let’s see what we get now after running the ncc checks again, using the command ncc health_checks hypervisor_checks esx_check_services to simplify my results.

Results look much better: no more warnings about disabled services on the ESXi hosts.

Running : health_checks hypervisor_checks esx_check_services
[==================================================] 100%
/health_checks/hypervisor_checks/esx_check_services                  [ PASS ]
+-----------------+
| State   | Count |
+-----------------+
| Pass    | 1     |
| Total   | 1     |
+-----------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log

Good to know that VMware purposefully disabled this service, and it’s easy to put that in a checklist for future deployments.   I do wish, though, that since Foundation is taking care of the ESXi install and customization, Nutanix would add those CLI commands to the routine so the services get started, if they truly are needed.

Hope this helps if you run into this same issue!

3.5 hours down to 6.5 minutes…

Recently I went thru a project to get Zerto Replication up and running for an Emergency Dispatch customer who was moving away from RecoverPoint and SRM in an effort to simplify and consolidate their DR runbooks.

As part of this project, we created multiple VPGs to match up with their software solutions, protecting around 5TB of total VM space.  The smaller VPGs consisted of small groupings of VMs, most of which ranged between 250 and 500GB of provisioned storage.  The 5th VPG was a large one, consisting of a heavily utilized production SQL Server and a reporting SQL Server, with around 3.2TB of provisioned storage.

Environment

The customer’s VMware environment consists of 4 ESXi hosts, a VNX 5300 array, and a pair of Nexus 9Ks at each site.  Connecting the sites together is a private 300Mbps circuit that handles the replication traffic as well as VoIP traffic.

The makeup of the VPGs was pretty standard, with recovery volumes, failover IPs, etc. all configured.

Failover

After replication fully synchronized between the Primary and DR data centers, we performed a Live Failover of each of the VPGs.  The 4 smaller VPGs failed over successfully within minutes, with the IPs being reconfigured as expected.

When we finally went to fail over the final large VPG, the failover itself went smoothly and the VMs were brought online on the DR side, but we then suffered thru almost 3 hours where the Zerto status for the VPG was ‘Promoting’, and the VMs were accessible but horribly slow.  So slow that users were told to go back to paper for data capture.

Troubleshooting

After the failover was complete, we commenced troubleshooting why the promotion took so long to complete for this single VPG.

After some emails back and forth with a former co-worker who now works for Zerto (thanks Wes!), he pointed me to a Zerto document covering best practices for protecting Microsoft SQL Server, found here.  Going thru this document had me go back and look at the configuration of the SQL Servers being protected.   The Zerto document highlighted using Temp Data Disks specifically for the Windows page file and the SQL TempDB files.

Looking into the configuration of the SQL Servers, I noticed that the primary SQL Server VM was configured with a separate vmdk for the page file and an additional vmdk provisioned for the SQL TempDB files, and I confirmed in SQL that the TempDB files were indeed located on that vmdk.
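If you need to do the same check, one quick option is to query sys.master_files for the tempdb file paths.  This is a generic sketch, assuming sqlcmd is installed on the server and a trusted connection works for you (not the exact command from this project):

sqlcmd -S localhost -E -Q "SELECT name, physical_name FROM sys.master_files WHERE database_id = DB_ID('tempdb');"

The physical_name column shows the drive and path of each TempDB file, which you can then match up against the guest drive letter backed by the dedicated vmdk.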

In addition, we also looked at the VPG journal history, which was set to the default 24 hours.  This meant the VPG was storing a large amount of history in the journal, not only for the SQL Server data files but also for the TempDB files.  Add in the reporting SQL Server, and the amount of replicated data being processed grew even further.

After going thru the Zerto SQL best practices document, a lot of things started to make sense as to why the VPG promotion took so long.  First off, we had a very large amount of data being retained in the journal history.  Every transaction that hit the TempDB files had to be placed in the journal, and upon failover, when TempDB is recreated, all that journal history had to be pushed back into the VM’s vmdk to bring it back into synchronization.

Solution

We switched the page file and SQL TempDB vmdks in the Zerto VPG to use the Temp Data Disk option.  Since both the page file and SQL TempDB contents are recreated upon reboot, using the Temp Data Disk option means Zerto doesn’t track any data in the journal for those disks after the initial synchronization.

After ensuring that the VPG was fully synchronized, we performed a failover from the DR facility back to the Primary facility.  Each of the 4 small VPGs was failed back over, and then we prepared for the final large VPG.  Due to the nature of this customer’s business, it’s critical to have fairly precise estimates on downtime windows, and knowing we had gone thru a very lengthy initial failover, we couldn’t guarantee how long this one would take.

Once we kicked off the final VPG failover, the process moved smoothly along until the Promoting stage, and then moved right thru it.  This failover attempt went from 3.5 hours down to 6.5 minutes in total.  6.5 minutes!

Needless to say, the customer’s expectations of the Zerto solution were met; it proved its worth with the ease of the failover.

Takeaways

This was a great scenario where going back to basics and best practices makes all the difference in the world.  Vendors put out best practices documentation for a reason, and this is one scenario where I was personally happy to pick up some extra knowledge on SQL replication, as it wasn’t something I had historically done within Zerto.  Great results from this project: moving away from RecoverPoint and SRM over to Zerto, and getting the customer a DR replication solution that now allows them to do DR testing on a regular schedule with full confidence.

Cisco seems to be having a rough go of it lately with bugs that act as a time bomb for certain hardware and software, following up on the clock signal component issue that plagued a large number of product lines (and, in Cisco’s defense, affected more than just Cisco; other vendors were affected too).  I’m still waiting to find out when my Meraki MX84 will be replaced for that one 🙂

Yesterday, Cisco released another Field Notice as well as a blog post, this time affecting a good number of ASA code versions dating back to 9.1.x, as well as certain Firepower versions.  The Field Notice states that all appliances are affected, so this is not a hardware issue like the clock signal component, but a software bug.  After around 213 days of uptime, the appliance will simply stop passing network traffic.

No workarounds are listed (Cisco has stated updated versions will be available in the coming weeks) other than performing a planned reload of the ASA or FTD appliance.  So plan accordingly for a reboot so you don’t get stuck in an unexpected outage situation.

To find out how long your appliance has been online, run the command show version | grep up.
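The line you’re looking for in the output will look something like this (hostname and uptime below are illustrative, not from a real device):

ciscoasa up 213 days 4 hours

If you’re anywhere near that ~213 day mark, get the reload on the calendar sooner rather than later.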

Here’s the list of affected ASA and Firepower code versions.  The Field Notice is also associated with Cisco bug CSCvd78303.

Products Affected
CISCO FIREPOWER 6.1.0.1
CISCO FIREPOWER 6.1.0.2
CISCO FIREPOWER 6.2.0
ASA 9.1.7.11
ASA 9.1.7.12
ASA 9.1.7.13
ASA 9.1.7.15
ASA 9.1.7.9
ASA 9.2.4.15
ASA 9.2.4.17
ASA 9.2.4.18
ASA 9.4.3.11
ASA 9.4.3.12
ASA 9.4.3.6
ASA 9.4.3.8
ASA 9.4.4
ASA 9.4.4.2
ASA 9.5.3
ASA 9.5.3.1
ASA 9.5.3.2
ASA 9.5.3.6
ASA 9.6.2.1
ASA 9.6.2.11
ASA 9.6.2.13
ASA 9.6.2.2
ASA 9.6.2.3
ASA 9.6.2.4
ASA 9.6.2.7
ASA 9.6.3
ASA 9.7.1
ASA 9.7.1.2

eGroup – Together, We make IT happen.  Our slogan, our tagline.  IT is what we do, together.  From our Operations team, to Inside Sales, to Sales, to Pre-Sales, to Delivery.  We do IT, together.

We announced today that eGroup has been named to the CRN Tech Elite 250 list for 2017, which “honors an exclusive group of North American IT solution providers that have earned the highest number of advanced technical certifications from leading technology vendors, scaled to their company size”.  Read the full press release here.

It’s amazing to work at a place that focuses on the team, making sure that from project to project, customer to customer, our strategy, customer focus, and delivery revolve around the team, whether it’s part of our Microsoft strategy, our Data Center and Hyperconverged strategy, or our Managed Services.

So to my eGroup co-workers, thank you for bringing your A game to work every day!

To the guys on the Data Center Architecture team (Pete, Dave, Joe, Adam and Steven), let’s keep raising the bar, bringing amazing solutions and results to our customers.


We’re just two months into 2017, and it already feels like the year is flying by!

The 2017 vExpert announcements came out today – you can read it here, and for the 3rd year in a row (well, really 2.5 since I forgot to renew at the end of 2015 and made it in the 2nd half of 2016) I’m honored to be included in the list of some very smart people.

Back in December, I was also included in the Nutanix Technical Champions list found here, and again I was very honored to make that list.

So what does it all mean?

Well, first off, getting on these lists isn’t just going to a class and/or passing a certification.  It’s not just doing some training on a subject and proving knowledge during a project.  It’s not just helping out your friends and coworkers when they need help or have questions.

It’s all of the above.  It’s taking the time to share one’s knowledge and give back to the community that has provided so much learning and education.  It’s collaborating on solutions and participating in beta programs that help shape upcoming software.

I’ve been really fortunate to have had a very fun and educational career, from cutting my teeth with Citrix platforms, to Oracle databases and Unix platforms, to Virtualization, Networking, Storage, the list goes on.

What I’ve gotten from all those different learning opportunities was somebody smarter than me always showing me the ropes: how to do something one way, why to do it another, and how to think about the ultimate outcome of the solutions we deploy.

Being able to give back, even if it’s just a quick answer to a question, spending time with somebody on GoToMeeting, or doing some hands-on workshops or training, is incredibly meaningful to me.  The opportunities I’ve had to present at conferences, from the regional VMUG in Charlotte on vRealize, to the upcoming session on Nutanix at TriCon, to the eGroup take on the Modern Office, give me the chance to share some of the knowledge I’ve gained and hopefully get a few people excited about the topic and ready to explore some new opportunities.

So to me, being included on these lists with some very smart people doesn’t validate how smart I am, but it validates to me the value of knowledge sharing – thru blogs, Twitter, Slack, LinkedIn, etc.  More to learn, more to share!

Thank You!

So to all the folks who join me on the #vExpert and #NTC lists in 2017, thank you for sharing your knowledge with me and for showing me different ways to do different things.  I hope that I’ve been able to give back as much as I’ve received.

For the people that I work with at eGroup, thank you for the amazing company we have, and for making every day a fun and exciting one.  Every day is a learning experience, one I do not take lightly.

And finally, to my wife Michelle: you put up with a crazy travel schedule, a garage full of servers, and a husband who sometimes doesn’t know when to put work away.  Thank you!
