For a large chunk of my career as an all-hats ISP admin, one of my duties was maintaining the inbound and outbound mail system that tens of thousands of users relied on daily. Most users had multiple accounts, so the total count of email accounts on the system was about 60,000. Across those accounts, the average inbound volume was about one million emails per day. Granted, much of this was spam, and fighting that spam is where the realization of just how much processor time we were wasting came front and center. It was a problem I had to solve. For a bit of background, by the time I started working for this ISP, the domain name was ten years old. For anyone who knows email admin, a 10-year-old ISP email domain is ripe for spamming, and spam they did.
At the front of the inbound system, the edge as it is more commonly known, or IMR (Inbound Mail Router) in mail admin lingo, were 4 RBLs (Realtime Block Lists). And that is where the problem started, though it took me a while to trace it back there.
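For anyone who has not been under the hood of one: an RBL check is just a DNS lookup. The connecting IP is reversed octet by octet and queried under the list's zone; any answer means the IP is listed, and NXDOMAIN means it is not. Here is a minimal sketch in Perl, using the Net::DNS module and Spamhaus's public zen zone purely as a generic example, not necessarily what our edge ran:

```perl
#!/usr/bin/perl
# Illustration of a single RBL check: reverse the IP's octets and look it
# up under the blocklist's DNS zone. The zone name and the 127.0.0.2
# DNSBL test address are examples, not our production configuration.
use strict;
use warnings;
use Net::DNS;

my $ip       = '127.0.0.2';                        # standard DNSBL test address
my $reversed = join '.', reverse split /\./, $ip;  # 2.0.0.127
my $resolver = Net::DNS::Resolver->new;

# An A record answer means "listed"; no answer (NXDOMAIN) means "not listed".
my $reply = $resolver->query("$reversed.zen.spamhaus.org", 'A');
print $reply ? "$ip is listed\n" : "$ip is not listed\n";
```

Every one of those lookups leaves the box and waits on someone else's DNS servers, which is where the trouble below came from.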
For months, I had catted, and grepped, and tailed, and tailed with grep, and grep | grep, and top, and ps, and, well, every other Linux command I could throw at it. But no matter how far I twisted these tools or how many commands I piped together, no answer was forthcoming. And then I found the ELK stack: Elasticsearch, Logstash, Kibana. This post is not about the ELK stack directly; it was simply the new tool in the arsenal that let us get full use out of the millions of log entries we collected daily.
Once the Logstash filters were built, we had a few days' worth of data in Elasticsearch, and I had put together some awesome Kibana dashboards, the truth hit like a truck. I had decided to grok the logs all the way down to every last piece I could pull out of them. I had total inbound traffic, inbound traffic per domain, outbound traffic, outbound traffic per domain, what was stopped at the edge, what was stopped at the scoring server, and what finally made it to the mailstore. And, as if by magic (well, it wasn't magic, it was my skill with the ELK stack), there stood our answer, bright and clear like a Rocky Mountain lake on a spring day. The RBLs were the problem.
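To give a sense of what that grokking looked like in practice, here is a minimal Logstash filter sketch. The pattern below is a hypothetical one for a single Postfix-style "connect from" line; the real filter set covered every type of log line the mail system produced:

```
filter {
  grok {
    # hypothetical pattern for one Postfix smtpd "connect from" line
    match => {
      "message" => "%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:mailhost} postfix/smtpd\[%{POSINT:pid}\]: connect from %{DATA:client_hostname}\[%{IP:client_ip}\]"
    }
  }
}
```

Fields like client_ip are what made the per-domain and per-stage dashboards, and later the automation, possible.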
Through this log parsing and these dashboards, I found that the 4 RBLs we used added an average of 40ms per email to the processing time. For a one-off email, 40ms may not seem too bad, less than the blink of an eye. But the lookups held that processing thread until all of the RBL responses came back, and that thread could have been handling many other emails while it waited. Like all systems, this mail system had a limited number of threads, so as those threads sat blocked, the emails queued behind them waited too. And if things got really bad, the SMTP connection would only stay open so long, so at times the connection was closed and the sending server "might" try to send again later. This was unacceptable. I had to find a better way.
Let’s poke at the math for a minute. If we are processing about one million emails per day, and each of them spends 40ms waiting on the RBLs, that works out to about 11 hours per day of stalled threads doing nothing but waiting for RBL responses. Most may not even notice this, especially from the user side. But it would quickly spiral out of control at high-traffic times, and mail would bounce with "Could not connect to server." In some cases, the sender would then call their friend or business partner and say something like, "Your email system is bouncing these important emails." The user would call tech support, tech support would open a case with operations (me), and I would have to stop whatever I was working on and figure out what was going on. Unfortunately, in a lot of cases, my only response was, "It was bounced back to the remote end; they can wait to see if the remote server tries again or have their friend resend the email," and then hope on my side that it did not get bounced again.
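Spelled out, the 11-hour figure above comes from nothing more than:

```
1,000,000 emails/day x 0.040 s/email = 40,000 s
40,000 s / 3,600 s/hour ≈ 11.1 hours of blocked-thread time per day
```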
Our edge email servers had whitelisting functionality, but users could not access it; they could only change their scoring or whitelisting options on the scoring server or the final mail server. So the standard in our department was to modify the edge servers only for internal needs, not for every email bounce that a customer reported.
We needed the RBLs. There was no getting around that. The mail system barely qualified as N+1, so it always ran on the bleeding edge of what it was capable of. Without the RBLs, the system could not have come close to handling the amount of inbound mail we were dealing with. So, the system presented a challenge, and I accepted.
With my metrics from Elasticsearch in hand, I approached management and simply said, “We can fix this by mirroring these RBLs in house with an automated cleanup system.” They were relieved that we had a path forward and told me to go for it with all of the resources I needed. And to be truthful, the resources were quite simple: Elasticsearch, MariaDB, Perl, and rbldnsd.
Generally, it worked like this. Elasticsearch and Logstash were still collecting, splitting, and storing the mail logs. A Perl script launched from cron once per minute and queried Elasticsearch for a list of IP addresses, flagged by the external RBLs, that were not yet in our internal RBL. The script added those addresses to the rbldnsd zone data and recorded the same entries in MariaDB. During the same one-minute cycle, it queried MariaDB for IP addresses that had not hit our system within the last 2 hours and removed those from rbldnsd. More or less, the system was a self-cleaning RBL that fed off of the 4 external RBLs.
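For anyone who wants a more concrete picture, here is a rough sketch of that one-minute cycle. This is not the original script: the module choices (Search::Elasticsearch, DBI), the index, field, and table names, and the file path are all assumptions for illustration, and the whitelist handling is left out to keep it short.

```perl
#!/usr/bin/perl
# Hypothetical sketch of the per-minute cron job described above.
use strict;
use warnings;
use Search::Elasticsearch;
use DBI;

my $es  = Search::Elasticsearch->new( nodes => ['localhost:9200'] );
my $dbh = DBI->connect( 'dbi:mysql:database=rbl', 'rbluser', 'secret',
    { RaiseError => 1 } );

# 1. Pull the client IPs that the external RBLs rejected in the last minute.
my $results = $es->search(
    index => 'maillog-*',
    body  => {
        size  => 1000,
        query => {
            bool => {
                must => [
                    { term  => { action       => 'rbl_reject' } },
                    { range => { '@timestamp' => { gte => 'now-1m' } } },
                ],
            },
        },
    },
);

# 2. Record each IP in MariaDB; ip is the primary key, so repeat hits
#    simply refresh last_seen, which drives the cleanup below.
my $upsert = $dbh->prepare(
    'INSERT INTO blocked_ips (ip, last_seen) VALUES (?, NOW())
     ON DUPLICATE KEY UPDATE last_seen = NOW()'
);
$upsert->execute( $_->{_source}{client_ip} ) for @{ $results->{hits}{hits} };

# 3. Expire anything that has not hit the system in the last 2 hours.
$dbh->do('DELETE FROM blocked_ips WHERE last_seen < NOW() - INTERVAL 2 HOUR');

# 4. Rewrite the rbldnsd ip4set data file from what is left; rbldnsd
#    rechecks its data files on its own and reloads the change.
my $ips = $dbh->selectcol_arrayref('SELECT ip FROM blocked_ips');
open my $zone, '>', '/var/lib/rbldnsd/internal.rbl' or die "open: $!";
print {$zone} ":127.0.0.2:Listed by internal RBL mirror\n";  # default A/TXT
print {$zone} "$_\n" for @$ips;
close $zone;
```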
For ease of use, I built 3 simple web pages for this. One showed a list of all IPs currently blocked by our in-house RBL; this interface also had a button that our tech support team could click to request removal, with the warning that if the IP showed up on an external RBL again, it would be re-blocked. Then there was the master whitelist, which we of course used to whitelist all of our own servers, plus any other servers requested by tech support and vetted by us. Over the years, a few bad actors slipped through that whitelist. But it was as easy for us as admins to click “Remove Whitelist,” and that problem was solved within 1 minute.
Now to the time saved. We had to adjust the order in which the email system checked RBLs, making ours the first stop and the final stop if an IP was found there. So each new IP would come from an external RBL once, and from then on we had it mirrored in-house. The time to check incoming mail against this RBL was 1-2ms, versus the 40ms we had been dealing with before. Even at the high end of 2ms, our total processor time spent waiting on RBL lookups went from a high of 11 hours a day to an average of about 30 minutes. To be clear, that is not 30 minutes per email, but the combined time of stalled threads across the system. The mail system stabilized, and it ran that way for a decade. As far as I know, the entire mail system has since been replaced. But what a fun run it was.
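For completeness, the same back-of-the-napkin arithmetic with the new lookup times:

```
1,000,000 emails/day x 0.001 s/email = 1,000 s ≈ 17 minutes
1,000,000 emails/day x 0.002 s/email = 2,000 s ≈ 33 minutes
```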
I bring this to you, my readers, and lay out my whole path through it, as something to think about the next time you have a server or an app wasting what may seem like a minor amount of time. Add it up over counts of users or batches processed, and 40ms can become 11 hours really quickly.
-Parting Wisdom: Time is a fickle beast.
–Bryan