September 6, 2018
Six months ago, after years of upgrading our server capacity in the cloud, we finally decided to build our very own development (secondary) data farm – including an array of specialized AI processing hardware – completely from scratch.
It wasn’t easy. Along the way, we faced the challenges of finding the right supplier, securing a perfect physical location, mounting and installing the servers, and testing every component until all our servers were singing perfectly in tune.
But we expected those challenges going in – and the payoff has been even greater than we expected. Our data center now supports all our local operations in Turin, as well as some backup and failover from the main production system, which remains hosted on bare metal managed by a professional third-party infrastructure supplier.
Here’s part one of our exclusive inside look at the step-by-step process of building our data center.
First of all, why did we take on this massive project?
Like all machine learning companies, we live and breathe data.
As the volumes of data we crunch grow ever larger, our storage and processing capacity has to keep pace.
And while a lot of companies go the IaaS (Infrastructure as a Service) route and move their data and processing into the cloud, we knew that approach would never provide the security and speed we needed, at costs we were willing to shell out.
In short, our decision began with these four core needs:
This was priority number one. Every day, we deal with large confidential datasets – and our clients expect the privacy of their data to be safeguarded at all costs. To enhance the security of our clients’ information, and provide total transparency throughout the chain, we decided we needed to own the physical location and equipment where that data would be securely stored (encrypted).
When you run millions of processes per second, as we do, you’re not only worried about the server’s processing power, but also the bandwidth of the connection. If you’re sending and receiving a terabyte of data to a server on the other side of the world, latency becomes a serious concern – and the only way to eliminate that issue is to handle this type of processing (mainly development) locally.
What happens when a remote server goes down? We’ve all felt the pain of waiting in line for cloud application support, chewing our fingernails as valuable time goes to waste. But when you’re handling your processing in-house, fixing the problem is as simple as running down the hall. And if you can’t get the connection back up immediately, in a worst-case scenario, you can always grab an external drive and mount it on another server.
A virtual environment in the cloud would cost us 10 times as much as owning our own in-house servers – in other words, the payback period was well under one year. What’s more, owning the physical hardware lets us balance performance, storage and security, optimizing all those areas in a single stroke while quickly amortizing our initial investment.
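To make the payback reasoning concrete, here is a minimal sketch of the arithmetic. All figures are hypothetical placeholders (we are not sharing our actual costs here), and the function name `payback_months` is ours for illustration; the only thing taken from the text is the rough 10-to-1 ratio between cloud and in-house running costs.

```python
def payback_months(hardware_cost, monthly_cloud_cost, monthly_inhouse_cost):
    """Months until the monthly savings cover the upfront hardware cost."""
    monthly_saving = monthly_cloud_cost - monthly_inhouse_cost
    if monthly_saving <= 0:
        raise ValueError("in-house running costs must be lower than cloud costs")
    return hardware_cost / monthly_saving


# Hypothetical figures, chosen only to mirror the 10x ratio mentioned above.
months = payback_months(
    hardware_cost=45_000,        # one-off purchase (assumed)
    monthly_cloud_cost=10_000,   # equivalent cloud bill (assumed)
    monthly_inhouse_cost=1_000,  # power, connectivity, upkeep (assumed)
)
print(f"Payback period: {months:.1f} months")
```

With numbers in this range, the hardware pays for itself in a matter of months, which is the kind of result that made the decision easy for us.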
How did we approach the planning process?
Although modern servers are relatively straightforward in terms of installation, we needed quite a bit of preparation to get all our ducks in a row.
The first and biggest question we had to answer was, “How much computing power do we need?” To find out, we spent weeks talking with our data scientists, as well as our client team, to project the number of clients we’d be onboarding over the coming year. From there, we assessed how much storage and processing capacity we’d need to meet that demand.
Based on our previous experience with a pure cloud environment, where cores, memory and I/O capacity are the key factors, we estimated that we would need the following:
- 240 cores: data processing is very CPU intensive (especially database operations)!
- 2TB of RAM: not just a lot of CPU, but also RAM to support in-memory computation
- 100TB+ of disk space with different I/O capabilities, from SSD and SAS to SATA, for backup and storage purposes
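As a rough illustration of how totals like these translate into a server count, the sketch below divides the required cores and RAM by what a single machine offers and takes the larger result. The per-server figures (40 cores, 512 GB of RAM) and the `headroom` parameter are assumptions made for the example, not the specification of the machines we bought.

```python
import math
from dataclasses import dataclass


@dataclass
class Capacity:
    cores: int
    ram_gb: int


def servers_needed(required, per_server, headroom=0.0):
    """Servers needed to cover the requirement, plus optional spare headroom.

    headroom is a fraction, e.g. 0.25 reserves 25% extra capacity.
    """
    by_cores = math.ceil(required.cores * (1 + headroom) / per_server.cores)
    by_ram = math.ceil(required.ram_gb * (1 + headroom) / per_server.ram_gb)
    # The scarcer resource decides how many machines are needed.
    return max(by_cores, by_ram)


required = Capacity(cores=240, ram_gb=2048)   # totals from the list above
per_server = Capacity(cores=40, ram_gb=512)   # hypothetical server spec

print(servers_needed(required, per_server))
print(servers_needed(required, per_server, headroom=0.25))
```

In this made-up configuration the core count, not the memory, is the limiting factor, so adding RAM per server would not reduce the number of machines.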
Once we had a clear projection of demand on our servers, it was time to choose our specs. One key consideration was the need for servers with KVM (keyboard, video and mouse) capabilities, since we’d be accessing them from remote machines.
Dell’s remote access controller (iDRAC) gave us the ability to simplify administration and reduce cable clutter, while also reducing downtime. All that’s necessary to get up and running is to connect to a web page, which displays a virtual monitor.
Availability of spare parts was also a key consideration. The most easily available model that could quickly scale to our requirements was the Dell PowerEdge R910 server, which comes with a backplane for 16 2.5-inch disks, a powerful RAID controller and an internal battery. We also liked the Dell Compellent SC200 for storage, armed with the PERC H800 RAID controller card, so we went all-in with Dell for our core hardware.
For GPU-intensive development and model training (think machine learning tasks), we opted for stock hardware (a single CPU from the i7-7700 family and ASUS gaming motherboards) in rack-mounted cases large enough to host several Nvidia GeForce 1070-series video cards.
Now that we knew which Dell servers we wanted, we needed to find a reliable supplier.
Instead of purchasing from private sellers, we wanted to work with a dealer who came with solid referrals, offered a suitable warranty and a wide range of spare parts, and could supply servers rapidly and reliably. We also wanted the servers to come from a good source – ideally enterprise data centers – where they’d been well looked after, with a stable power supply throughout their lifetimes.
We researched various companies online, made a shortlist of three top contenders, then asked around a few major online forums for feedback, which whittled the list down to just two vendors. After a week of back-and-forth with these suppliers, we made our final selection – based not only on product quality, but also on the vibe we got from the seller. We gave seller #2 a try anyway, to test our fallback option should it be needed in the future.
The day the servers arrived, we knew we’d made the right call. All the internal components were in excellent condition, and we’d gotten our servers at a bargain price. Once our server farm was up and running, we’d soon be operating at a significantly lower cost than we’d been paying for our cloud servers.
How did we prepare and set up our server room?
Our next step was to find a perfect place to store our servers. We needed a location that was cool enough for our servers to be comfortable, secure enough to prevent break-ins, and wired for VLAN access, which we needed both for security and for management.
As it turned out, the ideal location was right under our noses – literally! Our basement provided an excellent environment for our servers. Once we’d run a couple of Cat-6 FTP cables to the basement and bought some uninterruptible power supply units, a lot of patch cords and all the other necessary accessories, we were ready to go.
The servers arrived within three weeks. They were delivered on several pallets, so it was quite easy to move them. Other parts, like the external storage unit, came in a huge cardboard box, which meant we had to be careful.
Before we installed any of the servers, we brought them into our office, inspected them thoroughly for defects, and configured them to be ready to run as soon as we booted them up. Then we carried them down to the basement one by one; after all, IT people need to work out more often!
Every server came with its own rail, so we began by mounting these, then pulled each server up onto its rack – which also turned out to be a more intense workout than we’d expected!
We followed a predetermined mounting sequence designed to maximize airflow, so that every server could draw in fresh air through its front panel.
Since we were concerned about maintaining healthy working conditions, we built and installed our own temperature and humidity sensors from scratch to monitor conditions in the rack environment.
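The kind of check such sensors feed into can be sketched in a few lines. This is an illustration only: the `Reading` type, the `check_reading` function and the threshold values are assumptions made for the example, not our actual sensor code or our real limits.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    temperature_c: float
    humidity_pct: float


def check_reading(reading, max_temp_c=27.0, min_humidity=20.0, max_humidity=80.0):
    """Return a list of alert messages; an empty list means all is well.

    The default thresholds are placeholders, not recommended values.
    """
    alerts = []
    if reading.temperature_c > max_temp_c:
        alerts.append(f"temperature too high: {reading.temperature_c:.1f} C")
    if reading.humidity_pct < min_humidity:
        alerts.append(f"humidity too low: {reading.humidity_pct:.0f}%")
    elif reading.humidity_pct > max_humidity:
        alerts.append(f"humidity too high: {reading.humidity_pct:.0f}%")
    return alerts


# Example: a warm, damp reading raises two alerts.
for message in check_reading(Reading(temperature_c=31.5, humidity_pct=85)):
    print(message)
```

Returning a list of messages rather than a single flag makes it easy to forward each problem to whatever monitoring system is already in place.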
Thankfully, our new servers booted up without a hitch. Now it was time for the serious test: running heavy-duty code on one of them, which handled it without breaking a sweat. It was so exciting to see the servers running exactly how we expected, after all the hard work!
The final step was to configure the power failure procedure. We monitor every server, and each one has a suitable power-off procedure to handle any extended power outage. To support this, we created an interface between the batteries and our monitoring system.
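The decision at the heart of such a procedure can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, its parameters and the five-minute safety margin are hypothetical, and a real setup would read the battery status from the UPS units rather than take it as arguments.

```python
def should_power_off(on_battery, runtime_left_min, shutdown_time_min,
                     safety_margin_min=5.0):
    """Decide whether a server should start a clean shutdown.

    on_battery        -- True while mains power is out (assumed UPS status)
    runtime_left_min  -- estimated battery runtime remaining, in minutes
    shutdown_time_min -- time the server needs to shut down cleanly
    safety_margin_min -- extra buffer; the default is a placeholder value
    """
    if not on_battery:
        return False
    # Shut down once the remaining runtime no longer comfortably
    # covers a clean shutdown.
    return runtime_left_min <= shutdown_time_min + safety_margin_min


# A brief outage with plenty of battery left: keep running.
print(should_power_off(on_battery=True, runtime_left_min=40, shutdown_time_min=3))
# An extended outage with the battery nearly drained: shut down now.
print(should_power_off(on_battery=True, runtime_left_min=6, shutdown_time_min=3))
```

Tying the decision to remaining runtime rather than to the outage itself means short power blips do not take any servers offline.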
In the end, we built our very first in-house data center in just three weeks. With all this high-powered hardware in place, we’re better equipped to serve our clients than ever before, while ensuring the speed, security and availability to meet ever-growing demand.
About the author
Ben Thomas is the Chief Writer and Brand Strategist at Evo, with a core focus on emerging technologies, Big Data, and the Internet of Things (IoT).
He loves to engage audiences about the frontiers of science, culture and technology — and the ways these all come together.