Site Reliability Engineer

Longmont, CO 80504

Posted: 05/16/2019
Employment Type: Contract to Hire
Industry: Engineering
Job Number: 80218
Summary
The Site Reliability Engineer (SRE) is a combination of a software engineer and a systems enthusiast who provides technical leadership to a growing team focused on applying software engineering practices to operations at scale. SREs focus on operational procedures, code fixes, and similar work that increases the automation, repeatability, and consistency of operational tasks. The successful candidate will have the breadth of knowledge to solve complex problems across the entire technology stack.
Responsibilities
  • Design and architect operational solutions for managing applications and infrastructure
  • Monitor and report on service level objectives for system-wide application and infrastructure services. Work with service and product owners to establish KPIs that identify trends and quantify whether we are improving at the site/system level
  • Define standards for configuration, monitoring, reliability, and performance
  • Participate actively and critically in retrospectives on incidents that had broad impact or are leading indicators of potential site issues
  • Provide deep troubleshooting for production issues
  • Engage with service owners on root cause analysis for service interruption recovery and create preventive measures
  • Analyze, interpret, and apply metrics
Background/Experience
  • Strong DevOps background
  • Experience with system engineering & SDLC
  • Preferred educational background: Bachelor's in Computer Science or Information Systems Management, or relevant work experience
  • Years of work experience required: 5 years
  • Ability to qualify for a TS/SCI clearance may be required
  • Experience applying requirements from NISPOM, DCID 6/3, ICD 503, NIST 800-53, and related US Government standards is a plus
  • Experience working in a large-scale enterprise with infrastructure spanning hundreds or thousands of servers and dozens of technologies
 
Technical Skills
  • Advanced knowledge of Unix/Linux systems; very comfortable at the command line
  • Proficient with at least one programming language (e.g., Python, Ruby, Java)
  • Familiarity with configuration management and remote execution tools (e.g., Ansible, Puppet)
  • Understanding of Docker and automated deployment via pipelines (Concourse or Jenkins)
  • Familiarity with distributed version control systems such as Git
  • Good understanding of networking fundamentals
  • In-depth understanding of web operations best practices
  • Familiarity with infrastructure as code and the AWS cloud platform
 
Non-technical skills
  • Effectively prioritize work and encourage best practices in others
  • The ability to “smell out” potential issues in the system (not just individual services) is important
  • A knack for troubleshooting tough problems: a high level of ownership and curiosity empowers this skill
  • Meticulous and cautious: identify and consider all risks and balance those with performing the task efficiently
  • Organized - able to document and communicate ongoing work tasks and projects
  • Positive, flexible, and personable – adaptive to change
  • "Make it happen" attitude
  • Comfortable giving, receiving, and implementing feedback in a highly collaborative environment
#DevOps
#SiteReliabilityEngineer
#TwoRoads
#2Roads
 
For more info, call Roby at 714.901.3804 or email RLane@2roads.com