24 mins
During a failover, the databases stalled connections to the application servers. The application servers became stuck waiting for those connections to recover, resulting in a system outage. Additional health monitoring was put in place to prevent this from recurring. We also worked with the AWS team to account for an updated version of their failover process.
8 mins
A database failover issue, similar to the one described below, caused the application servers to become unresponsive. We escalated and released an update on Nov 4th, 2016 in an attempt to handle this process properly.
A database failover encountered an unforeseen issue when one instance failed and another was promoted. While the original database was still attempting to recover, it began accepting connections, and the application servers incorrectly read the machine as being available. This caused the servers to become unresponsive while waiting for timeout thresholds to expire. The servers did not recover and had to undergo full host replacement.
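The root cause here is a liveness check that equates "accepts connections" with "ready to serve." Below is a minimal sketch of a readiness probe that also runs a real query and bounds how long it waits; the `connect` callable, probe query, and timeout values are assumptions for illustration, not TrackVia's actual monitoring.

```python
import time

# Hypothetical readiness probe: a successful TCP connect alone is not
# proof the database can serve traffic (the failed primary in this
# incident accepted connections while still recovering). The probe
# must also run a real query, and the caller must bound its wait.

PROBE_TIMEOUT = 2.0   # seconds per attempt; assumed value
PROBE_QUERY = "SELECT 1"

def database_is_ready(connect, deadline=10.0):
    """Return True only if a probe query succeeds before the deadline.

    `connect` is any callable returning a DB-API connection; the
    driver and credentials are outside the scope of this sketch.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        try:
            conn = connect(connect_timeout=PROBE_TIMEOUT)
            try:
                cur = conn.cursor()
                cur.execute(PROBE_QUERY)   # proves queries run, not just TCP
                cur.fetchone()
                return True
            finally:
                conn.close()
        except Exception:
            time.sleep(0.5)   # brief backoff, then retry until deadline
    return False              # treat the instance as down and fail over
```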
App Server Timeouts
The TrackVia API experienced a high volume of malformed requests which left the app servers unable to process them. The servers went into infinite loops processing the bad requests, resulting in denied connections for all clients. TrackVia deployed a software update on July 1, 2016 to prevent this from occurring again.
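One way to harden a server against this class of problem is to validate payloads before entering any loop whose trip count depends on client input, so a malformed request is rejected rather than spun on. The sketch below is illustrative only; the field names and limits are assumptions, not TrackVia's API schema.

```python
# Hypothetical request guard: reject malformed payloads up front and
# bound every client-driven loop so a bad request cannot spin a worker
# forever. MAX_ITEMS and the "records" field are assumed for this sketch.

MAX_ITEMS = 10_000   # assumed upper bound on records per request

class BadRequest(ValueError):
    pass

def process_request(payload: dict) -> list:
    items = payload.get("records")
    if not isinstance(items, list):
        raise BadRequest("'records' must be a list")   # reject, don't loop
    if len(items) > MAX_ITEMS:
        raise BadRequest(f"too many records (max {MAX_ITEMS})")
    results = []
    for i, item in enumerate(items):        # trip count is now bounded
        if not isinstance(item, dict):
            raise BadRequest(f"record {i} is not an object")
        results.append(item)                # real handling would go here
    return results
```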
Metric Collection System Failure
Hardware failed on a collection system, resulting in an outage. The TrackVia application servers did not handle the failure gracefully and began emitting errors in the system logs. The application was accepting connections but was unable to complete transactions without the collection service.
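A common way to make an application tolerate a metrics outage is to treat metric emission as best-effort: a short timeout and a caught exception, so the transaction completes even when the collector is unreachable. The `emit` callable and timeout below are hypothetical stand-ins, not TrackVia's actual pipeline.

```python
import logging

log = logging.getLogger("metrics")

# Hypothetical degradation wrapper: metric emission should never sit on
# the critical path of a transaction. If the collection service is down,
# log the drop and move on.

def record_metric(emit, name, value, timeout=0.25):
    """Best-effort metric write; failures are logged, never raised."""
    try:
        emit(name, value, timeout=timeout)   # short timeout keeps latency bounded
    except Exception:
        log.warning("metric %s dropped; collector unreachable", name)

def save_record(db, record, emit):
    db.save(record)                          # the transaction itself
    record_metric(emit, "records.saved", 1)  # cannot block or fail the save
```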
Account Locks / Server Reboot
Account-level locking errors were causing some customers to experience delayed response times and timeouts. The TrackVia team was able to unlock individual accounts but continued to see the rate of lock errors increase, so a full system reboot was chosen as the fastest way to clear all locking errors. During the reboot, the application servers were not successfully removed from the routing servers, which resulted in the routers queuing incoming requests.
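The queuing problem stems from rebooting a backend that the routers still consider live. A typical safeguard is to drain first: deregister from the router, wait for in-flight requests to finish, restart, and re-register only after a health check passes. The `router` and `server` objects below are stand-ins for illustration, not real APIs.

```python
import time

# Hypothetical drain-before-reboot sequence: stop new traffic before the
# server stops, or the router keeps queuing requests at a dead backend.

DRAIN_DEADLINE = 30.0   # seconds to wait for in-flight requests; assumed

def reboot_safely(router, server):
    router.deregister(server)            # stop new traffic first
    start = time.monotonic()
    while server.inflight_requests() > 0:
        if time.monotonic() - start > DRAIN_DEADLINE:
            break                        # stop waiting and proceed anyway
        time.sleep(0.5)
    server.restart()
    while not server.healthy():          # re-register only once healthy
        time.sleep(1.0)
    router.register(server)
```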
The API was used improperly, generating enough load on the system to briefly stall login and navigation. We have since trained the end user on proper use and have plans in place to prevent this from occurring again.
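One standard safeguard against this kind of load is per-client rate limiting; a token-bucket sketch follows. The rate and burst values are assumptions for illustration, not TrackVia's actual limits.

```python
import time

# Hypothetical per-client token bucket: each client key gets its own
# bucket, and requests that find it empty are rejected (e.g. HTTP 429).

class TokenBucket:
    def __init__(self, rate=10.0, burst=20):
        self.rate = rate             # tokens added per second; assumed
        self.tokens = float(burst)   # current allowance
        self.burst = float(burst)    # maximum allowance
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # caller should reject the request
```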
We experienced network connectivity and DNS resolution issues affecting some instances within a VPC in the US-WEST-2 Region of Amazon Web Services (details at AWS STATUS SIREN). This affected saving records and sending emails. We made temporary updates on our end to mitigate the connectivity issues.
A small portion of accounts saw issues loading the Table Editor page. In some scenarios a browser refresh was also required before a newly created Form appeared in the navigation menus.