Site Reliability Engineering Manager
Arlington, VA 22201 | Employee
Our client's infrastructure enables mobile carriers, ISPs, and governments to provide internet access to their local and remote populations. The team's talent spans fields from semiconductor design, telecom core networks, and small cell production and deployment to hyperlocal rural regulatory and educational challenges. They are developing leading-edge technology to solve some of the world's largest problems, and having a lot of fun doing it!
If building the infrastructure to connect 2 million schools is something you would like to make happen, then joining our client may be a great personal and career move. They offer an intellectually challenging workplace and a fast-growing opportunity with a clear purpose. Come join the team that is making affordable communication ubiquitous on a global scale.
We are looking for an Engineering Manager for our client's Site Reliability Engineering (SRE) team to help launch and operate the satellite constellation. The world's largest satellite constellation will require a proportionally large data center infrastructure for command and control, orbit determination, and engineering analysis. This position requires deep knowledge of Linux system operation, security, virtualization, disaster recovery, networking, and cloud technologies. You will lead a highly skilled SRE team to design, implement, and operate Linux-based IT infrastructure, both in the cloud and on premises. The mission of the SRE team is to ensure production systems achieve 99.99% reliability in support of the world's largest satellite constellation.
Provide technical leadership and management for the satellite constellation IT operations center.
Design and implement Linux-based IT infrastructure, in the cloud and on premises.
Support the client's production systems with failover, load balancing, backup, security, log management, and monitoring services.
Maintain the 99.99% system SLA and serve as the final escalation point for production infrastructure issues and outages.
Automate the deployment of a complex system consisting of virtual machines and containers.
Monitor security and operational alerts, and take preventive or corrective action to resolve issues.
Ensure effective performance and 24x7 availability of the production IT systems.
Expect up to 10% travel, mostly US-based, with some travel to London.
Two Roads is an Equal Opportunity Employer: EEO/AA/M/F/Veteran/Disability
Expert knowledge of Linux system operation, configuration, troubleshooting, and automation.
Deep understanding of IT security best practices.
Strong Ansible, Python, and shell scripting skills.
Working knowledge of networking (TCP/IP, VPN, DNS, DHCP, SMTP).
Working knowledge of database administration (MySQL preferred).
Red Hat Certified System Administrator (RHCSA) is highly desired.
Excellent communication and documentation skills.