I know the server size was increased and this improved things greatly, but perhaps query the rider tracking details/statistics from a second database, preserving performance on the scanning page.
Bit more nuanced than that. There are the web servers and the DB server. The web servers scale horizontally, the DB vertically. There's also a clustered memcached layer in front of the DB, so most of the time public tracking data was being pulled straight from the memcached servers' memory and not going near the DB server at all. Then we have time-limited caching rules on the origin servers, caching in a CDN, caching at your ISP, and finally caching in the browser, all working to reduce load.
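To make that concrete, the read path for public tracking data is essentially the classic cache-aside pattern. This is only a rough sketch of the idea, not our actual code; the memcached endpoint, key names, TTL and query function are all illustrative assumptions:

```python
# Rough sketch of the cache-aside pattern described above. Endpoint, key names,
# TTL and the DB query are illustrative assumptions, not the LEL code.
from pymemcache.client.base import Client

cache = Client(("memcached.internal", 11211))  # hypothetical clustered memcached endpoint
TTL_SECONDS = 60  # short, time-limited cache so tracking data stays reasonably fresh

def query_db_for_tracking(rider_id: str) -> bytes:
    # Placeholder for the real DB query; only reached on a cache miss.
    raise NotImplementedError

def get_rider_tracking(rider_id: str) -> bytes:
    key = f"tracking:{rider_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                       # served from memcached memory, DB never touched
    data = query_db_for_tracking(rider_id)  # cache miss: hit the DB server
    cache.set(key, data, expire=TTL_SECONDS)
    return data
```

The same short-TTL idea repeats at each layer (origin, CDN, ISP, browser), so only a small fraction of requests ever reach the DB.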
There's a however coming, as you might have guessed. Cloud servers are priced for a baseline performance of CPU, memory etc. with burst capability. That's how they (and you) keep the costs down and cram so much into so little. So what does this mean?
As long as your server is running at or below the baseline CPU you gain CPU credits. If you are running above the baseline CPU you've paid for, you use credits. The most credits you can bank is the equivalent of 100% CPU for an hour. So if you need a burst of CPU above the baseline for a short period it works great, but on balance you need your server CPU to be at or just below the baseline most of the time, and you choose and pay for one of their server sizes to meet that. With me so far?
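A toy calculation makes the trade-off clearer. The baseline, cap and rates below are simplified illustrative numbers, not AWS's exact figures:

```python
# Back-of-envelope model of the CPU credit mechanism described above.
# Baseline, cap and the 1-credit-per-minute rate are simplifying assumptions.
BASELINE_PCT = 20    # baseline CPU you've actually paid for
CREDIT_CAP = 60.0    # banked credits capped at roughly "100% CPU for an hour"

def step_credit_balance(balance: float, cpu_pct: float, minutes: float = 1.0) -> float:
    """Earn credits below the baseline, spend them above it, clamp to [0, cap]."""
    delta = (BASELINE_PCT - cpu_pct) / 100.0 * minutes   # ~1 credit = 1 minute at 100%
    return max(0.0, min(CREDIT_CAP, balance + delta))

# An hour at 80% CPU eats a 30-credit balance well before the hour is up:
balance = 30.0
for _ in range(60):
    balance = step_credit_balance(balance, cpu_pct=80)
print(balance)   # once this hits zero the instance is throttled back to the baseline
```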
For the web servers it's easy: you scale horizontally, in and out, as demand dictates. I.e. you add and remove servers automatically, based on rules you've set, to match demand. No outage required, and it's all done automatically.
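As a flavour of what those rules can look like, here's a minimal target-tracking scaling policy via boto3. The group name and target value are hypothetical, not the actual LEL settings:

```python
# Minimal sketch of an auto-scaling rule of the kind described above (boto3).
# The Auto Scaling group name and the 50% target are hypothetical values.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="lel-web-servers",          # hypothetical web server group
    PolicyName="hold-average-cpu-near-target",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,   # add/remove web servers to keep average CPU around 50%
    },
)
```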
The DB server is trickier. You can generally scale down the DB server without an outage, but scaling up a DB server (in AWS at least) requires an outage of the server and therefore the entire website. So it's not an easy decision to make mid-event, but it was necessary. We asked controls to take their scanning offline while I did the upgrade and we had the outage (10 mins). Then scanning was brought back online and the scans came through to the website automatically. The DB server had exhausted its CPU credits and was throttling the requests; it wasn't an I/O bottleneck of any sort. When the DB was not being throttled it had millisecond response times. It lost its credits very slowly, and I was hoping it'd recover them and I wouldn't need that outage. Alas. Had we not relaxed the rate limit on public tracking we would not have required the DB server upgrade, but we did for a better user experience. The cost was the upgrade becoming necessary later in the event.
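For the curious, the scale-up itself is a single API call on AWS RDS; the instance identifier and target class below are hypothetical, and it's the ApplyImmediately restart that causes the kind of outage described above:

```python
# Hedged sketch of an RDS DB scale-up (boto3). Identifier and instance class
# are hypothetical, not the actual LEL database settings.
import boto3

rds = boto3.client("rds")

rds.modify_db_instance(
    DBInstanceIdentifier="lel-tracking-db",   # hypothetical DB instance name
    DBInstanceClass="db.m4.xlarge",           # illustrative larger instance class
    ApplyImmediately=True,                    # applies now: the DB restarts, hence the outage
)
```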
DB replication with a read-only replica: when you buy that in the cloud you actually get a lot less for your money than just having a suitably sized DB server for the load during the actual event, and it's not necessary for LEL. I did look at it, but it'd be a colossal waste of money that could be spent elsewhere in the event. Sorry there's no food for the riders, Phil spent it all on the website.
Before LEL 2017 we didn't really have a clear idea of the size and scale of what we needed in terms of servers etc., or how big the demand would be at different times in the build-up and during the event to drive that. The company that hosted the LEL 2013 website kept that pretty much to themselves. As you would. Plus the interest and demand this time were several orders of magnitude more than 2013, based on what information we do have to compare.
Cloud computing is great and allowed us to deal with unbelievable interest and demand on tracking during the event. But it can also very quickly take all your money if you're not careful. You have to strike a balance. We've got a very good handle now on what we need for that balance and how much it costs to deliver. (Well, the money man will when I send him my latest expense claim.)
Most of the time I could monitor and manage the servers via my iPad. A couple of times I had to connect directly to the servers though. Control/school firewalls meant going home to do that, as they blocked any Internet traffic that wasn't through a browser.
One last thing on AWS costs. You have fully pre-paid pricing, part pre-paid pricing, on-demand pricing, and spot pricing. Spot pricing can be the cheapest, but there's no guarantee you'll get the server capacity when you need it. Pre-paid is cheapest for guaranteed capacity, but you pre-pay for a size for a fixed period. On-demand is the most flexible but the most expensive by a long way. So there are many things to juggle to strike a good balance between stability and performance of the website and costs to LEL, without costs getting out of control. We are also subject to currency fluctuations as AWS prices things in US dollars. Unlimited budget, the things I could do..
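To show the shape of that juggling act (and nothing more), here's a toy monthly cost comparison; every rate below is a made-up illustrative number in USD, not an AWS quote:

```python
# Toy comparison of the pricing models mentioned above. All hourly rates are
# illustrative assumptions, not real AWS prices.
HOURS_PER_MONTH = 730

rates = {
    "on-demand": 0.10,   # most flexible, most expensive
    "reserved":  0.06,   # pre-paid for a fixed term, guaranteed capacity
    "spot":      0.03,   # cheapest, but capacity isn't guaranteed
}

for name, per_hour in rates.items():
    print(f"{name:>10}: ${per_hour * HOURS_PER_MONTH:,.2f} per server per month")
```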
You also have to go back to the demand during the event and blink. By Wednesday we'd reached a sustained 31,000 browser sessions and 125,000 requests per minute. We peaked at approx. 103,500 concurrent browser sessions connected to the website and 250,000 requests per minute for tracking data. I think we held up pretty well, even if I needed to undertake some active management a couple of times.
I'd like to see the website elements that needed some active management this time black-boxed, so that someone with an IT operations background could look after them via a browser interface in future.
Right, back to as you were. I intend to ride LEL 2021.