Job Information
Infoblox CloudOps Site Reliability Engineer in Tacoma, Washington
Description
It's an exciting time to be at Infoblox. Named a Top 25 Cyber Security Company by The Software Report and one of Inc. magazine's Best Workplaces for 2020, Infoblox is the leader in cloud-first networking and security services. Our solutions empower organizations to take full advantage of the cloud to deliver network experiences that are inherently simple, scalable, and reliable for everyone. Infoblox customers are among the largest enterprises in the world and include 70% of the Fortune 500, and our success depends on bright, energetic, talented people who share a passion for building the next generation of networking technologies, and having fun along the way.

We are looking for a CloudOps Site Reliability Engineer to join our Incident Management Engineering team located in Tacoma, WA, or remote, reporting to the manager of Cloud Operations. In this role, you will be part of the Incident Management team responsible for the monitoring and support of Infoblox cloud-based services. You will monitor and maintain the infrastructure that runs our SaaS services and ensure these services are running at peak performance. You will also be responsible for maintaining the services and assisting in the automation that enables Infoblox services in the cloud.

You are the ideal candidate if you are a proactive, hands-on professional who picks up new technology quickly, has excellent interpersonal skills, and is driven to find solutions while collaborating across teams.
What you'll do:
Provide real-time monitoring, triage, and escalation of critical and major issues and incoming alarms within the environment
Participate in incident management calls and coordinate response, triage, recovery, and reporting of incidents
Stay actively engaged throughout service restoration and keep senior leadership informed of the activities being carried out
Expand and mature existing incident response processes and activities, including managing and administering the runbook
Partner with Engineering and NOC to prepare and present RCA reports for incidents, their impact, and resolution
Implement and utilize SRE-developed tools for incident response
Assist in the development of resilient and self-scaling systems
Lead complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility

What you'll bring:
Minimum 5 years of combined experience in DevOps, SRE, and/or incident management and monitoring tools
Hands-on experience with cloud architecture and deploying infrastructure in a cloud environment
Solid networking experience, such as TCP/IP, BGP routing, load balancing, and DNS
Experience with monitoring tools, such as Grafana, Loki, PagerDuty, AWS Lambda, etc.
Experience with Linux distributions, including CentOS, Ubuntu, and Amazon Linux
Experience with Amazon Web Services, including EC2, VPC, ELB, S3, RDS, CloudFormation, etc.
Experience with configuration management, such as Terraform, Chef, Puppet, Ansible, and/or Salt
Experience with monitoring tools and CI/CD toolchain, like Git, Jenkins, or Spinnaker
Experience with Python, Java, Golang, Kubernetes, Linux Containers, and Docker is preferred
Bachelor's degree in computer science, information security, computer engineering, or electrical engineering is required

What success looks like:
After six months, you will...
Provide real-time monitoring, triage, and escalation of critical and major issues and incoming alarms within the environment
Participate in incident management calls and coordinate response, triage, recovery, and reporting of incidents

After about a year, you will...
Partner with SRE/DevOps to resolve infrastructure maintenance tasks, internal access requests/issues, and management of monitoring and CI/CD tools
Use knowledge