How Effective IT Problem Solving Helped Save the VA.gov Launch

News | June 25, 2019 | IT Strategic Communication
The new VA.gov website

Each month, more than 10 million Veterans and stakeholders visit Department of Veterans Affairs (VA) websites to access the information, tools, and services they need. But most have no idea of the planning and problem-solving that go on behind the scenes to keep those websites available for an organization as immense as VA, the nation’s second-largest federal agency. As we mark the six-month milestone of the relaunch of VA’s public-facing website, VA.gov, we look back at how VA’s team of experts identified and fixed a mysterious problem that threatened the launch.

After months of painstaking work and testing, just as VA was preparing to flip the switch to direct online customers to the redesigned and streamlined VA.gov, the team working on the project noticed a problem. The last routine tests showed that attempts to reach some web pages and files were taking much longer than normal to complete—meaning that if a large number of requests began lagging, it could quickly deteriorate to a VA.gov-wide crash. Knowing that millions of Veterans were depending on the new site to access health care and benefits information and services, VA technicians sprang into action.

This recap of how the VA incident response team found and solved the problem demonstrates how VA staff working together, asking good questions, and looking for patterns and anomalies are keys to delivering technology solutions that translate into positive outcomes for Veterans. It is an example of what VA does behind the scenes every day to keep VA systems running, and of how, at times, we accelerate this problem-solving and partner across VA to provide swift intervention ahead of new product launches.

DAYS BEFORE THE LAUNCH

Upon identifying the presence of a significant, unknown issue, the VA team working on the VA.gov launch, led by the Digital Service at VA, immediately went into action. The consolidated VA team executed its incident response plan, designated an incident commander, and convened engineers from across the relevant teams at VA to debug the issue. Simultaneously, the VA team initiated a comprehensive monitoring system to identify patterns and rule out multiple theories.

Initial monitoring focused on a set of servers known as reverse proxies, through which all traffic to VA.gov was flowing. These reverse proxies had been put in place several weeks before the launch to route Veteran website visitors from the legacy, on-premises infrastructure serving the old VA.gov homepage to the new, cloud-based servers that would host the redesigned VA.gov.

Using the monitoring tools already installed on the servers, the team made three key observations. First, roughly half of the reverse proxies remained in a healthy state, with no latency (slowness) issues. Second, the other half would begin operating at a slightly slower pace, then their latency would spike to as high as several minutes per request, and eventually they would stall. Third, it often appeared that a request for a larger file would trigger the latency spike, even though transferring a single such file should not cause this type of behavior. Significant website latency means users experience response times slow enough to make the site virtually unusable, an outcome VA could not risk.

monitoring waveform
Some reverse proxies were healthy (lines on the bottom) while others were experiencing significant latency, which would sometimes spike (lines on the top).

The VA Team considered many theories and questions, but none led to solutions. Were all the reverse proxy virtual servers running the same software? Yes. Was one group running in a different physical environment? No. Were there any clear differences between the reverse proxy instances? No.

Refusing to let this challenge derail the launch, the team hunkered down for an aggressive, systematic hunt for a solution. On a hunch, VA team member Patrick Vinograd began studying the IP addresses of the healthy and slow reverse proxies. Strangely enough, all the problematic servers had odd-numbered IP addresses. Since whether a server’s IP address is odd or even should have no bearing on how fast it works, this pattern piqued Patrick’s interest.
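The kind of pattern check Patrick ran can be sketched in a few lines. The IP addresses and latency figures below are hypothetical, used only to illustrate grouping servers by the parity of the last octet of their address:

```python
# Hypothetical data: reverse-proxy IPs mapped to observed request latency (seconds).
latencies = {
    "10.0.1.11": 94.0,   # slow
    "10.0.1.12": 0.2,    # healthy
    "10.0.1.13": 120.5,  # slow
    "10.0.1.14": 0.3,    # healthy
}

def last_octet_is_odd(ip: str) -> bool:
    """Return True if the final octet of an IPv4 address is odd."""
    return int(ip.rsplit(".", 1)[1]) % 2 == 1

# Split the observed latencies by the parity of each server's address.
odd = [lat for ip, lat in latencies.items() if last_octet_is_odd(ip)]
even = [lat for ip, lat in latencies.items() if not last_octet_is_odd(ip)]

print(f"odd-IP mean latency:  {sum(odd) / len(odd):.1f}s")
print(f"even-IP mean latency: {sum(even) / len(even):.1f}s")
```

If the two groups separate this cleanly, the parity correlation is unlikely to be coincidence, which is exactly what made it worth testing.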

To determine whether this pattern was just a coincidence or something more, Patrick and the team immediately began testing the hypothesis by removing reverse proxies from the server pool and replacing them with new ones. Because the team was using cloud-native infrastructure—a byproduct of VA’s Digital Transformation and its pivot to the cloud—removing and adding these virtual servers could be accomplished easily and did not require any end-user downtime.

As the new servers came to life and began routing traffic, the even/odd pattern held up—the team was on to something. Soon, the core issue came into view. The incident response team engaged the VA network team to further isolate the issue, and, working together, they quickly cracked the code.

The combined team discovered a failing fiber optic network link between the former VA.gov web servers and the hardware that distributes the traffic. Nearly every network request on that connection was becoming corrupted by this failing piece of equipment, necessitating a “retry” attempt. Requests for large files triggered a storm of network retries from which the web server would never recover.
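One way to see why large files were hit hardest is a back-of-the-envelope model (illustrative numbers, not VA measurements): if each packet on a failing link is corrupted with some probability and every corrupted packet must be resent, the expected number of transmissions per packet is 1 / (1 − p). Larger files mean more packets, so retries compound until a transfer stalls:

```python
# Simplified retry model: a failing link corrupts each packet with
# probability `corruption_prob`, and each corrupted packet is retried
# until it gets through. Expected transmissions per packet: 1 / (1 - p).
def expected_transmissions(packets: int, corruption_prob: float) -> float:
    """Expected total transmissions needed to deliver `packets` packets."""
    return packets / (1.0 - corruption_prob)

small_file = expected_transmissions(packets=10, corruption_prob=0.5)
large_file = expected_transmissions(packets=10_000, corruption_prob=0.5)
print(small_file, large_file)  # 20.0 20000.0
```

At a high corruption rate, a small page limps through while a large file generates thousands of retransmissions, matching the "storm of network retries" the team observed.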

So how did a failing fiber optic network link seemingly arbitrarily impede the flow of data from odd-numbered IP addresses? Critically, a load balancer—a tool used to reduce network congestion by spreading traffic across all available network paths—had been configured to send traffic from even-numbered IP addresses through one link and traffic from odd-numbered IP addresses through another. The link carrying the odd-numbered traffic was the failing one.
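The load balancer’s path-selection rule can be sketched as follows. This is an illustrative simplification, not VA’s actual configuration, and the link names are hypothetical:

```python
# Illustrative sketch: a load balancer splits traffic across two fiber
# links based on the parity of the source IP's last octet.
LINKS = {"even": "link-A", "odd": "link-B"}  # hypothetical link names

def choose_link(source_ip: str) -> str:
    """Pick a network path from the parity of the IPv4 last octet."""
    parity = "odd" if int(source_ip.rsplit(".", 1)[1]) % 2 else "even"
    return LINKS[parity]
```

Under a scheme like this, a single failing link degrades exactly half the source addresses, which is why the odd/even pattern the team spotted pointed straight at the network layer.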

After the network team quickly removed and replaced the server connection from the “odd” side of the link, the problem was solved.

network component
A failing piece of hardware known as a small form-factor pluggable (SFP) transceiver (example shown above) was determined to be the cause of the problem that almost prevented the launch of the new VA.gov.

This story illustrates some valuable lessons for all of us in OIT. First, by having a comprehensive set of performance monitoring tools in place, the team was able to quickly identify the existence of a problem. These tools also allowed the team to effectively diagnose the problem, test hypotheses, and rule out causes based on data. Second, by operating with a combined development and operations (DevOps) team, the VA.gov team was able to more effectively troubleshoot the problem and take action, thanks to the teamwork and effective communication a combined DevOps environment fosters. Finally, by using cloud architecture that enabled automated deployments, the team was able to fix the problem much more quickly than would have been possible if VA were running physical servers in an on-premises environment.

We are extremely proud of the VA.gov DevOps team and the entire OIT staff who were a part of this effort—an effort that directly impacted the millions of Veterans and stakeholders who rely on the site.

After six months, the new VA.gov website is earning positive reviews from Veterans. The site and the efforts of all those who supported its launch remain a testament to VA’s commitment to modernization and to improving the customer service that underpins the care and services our Veterans have earned.

Veterans: check out and bookmark the new VA.gov website to quickly learn about, apply for, track, and manage your VA benefits and services.

Page last updated on July 22, 2019