Job Type: Full Time
Do you like to engineer large-scale, mission-critical solutions that are Always On? Do you like to contribute where cutting-edge software development, global-scale system architecture, and enterprise systems administration intersect? Are you curious about how things break, so that you can eliminate these failure modes? Are you driven to eliminate toil by designing or adopting better tools and automation techniques? If so, we would love to have you on our Production Engineering team.
The Site Reliability Engineer is a technical Subject Matter Expert who proactively drives the technical stability and performance of the applications in the global technology portfolio. They combine software and systems engineering to design solutions in physical, virtual, and cloud environments that automate fault detection, containment, and resolution without customer impact or human intervention. These solutions typically involve software development for metrics and event collection/correlation across distributed architectures, automation, monitoring, intelligent alerting, random fault injection, and self-healing.
Our Site Reliability Engineers have a full understanding of the hardware and software architecture of the applications within the end-to-end business flow and are responsible for guiding and implementing operational technologies in next-gen solutions while driving down current technical debt. Working in an Agile DevOps model with Architecture, Operations, Application Development, and Infrastructure engineers, they proactively develop reusable patterns and solutions that enhance the health and performance of our global platforms, and they identify and solve chronic technical issues. They ensure that the developed solutions address non-functional requirements including:
- Performance and Interoperability Requirements
- Application scalability/Capacity Management
- Standards, best practices and Compensating Controls
- Solution designs that are fit for purpose
- Logging, monitoring, intelligent alerting, self-healing
- High Availability, Disaster Recovery, Sustained Resiliency, Chaos Engineering
- Service and Operational Level Agreements
- Application Knowledge Support Artifacts, etc.
BS degree in Computer Science, Computer Engineering, or a similar technical field of study, or equivalent experience. Graduate-level engineering degree preferred.
Systematic, fact-based decision making and problem solving.
Strong curiosity and bias for proactive planning, action, ownership, learning, and continuous improvement.
Strong interpersonal skills and ability to cultivate relationships with all internal and external stakeholders, promoting diversity of perspectives, ideas, and cultures.
Ability to clearly articulate ideas and problem, solution, and business value descriptions that can be understood by a broad audience in a time-sensitive environment.
Software engineering/systems engineering experience
Networking (Security, Load Balancing, Network Routing Protocols, etc.)
Cloud-native applications: deployment, monitoring, and operations using Kubernetes, Prometheus, Fluentd, Slack, Elasticsearch, Grafana, Kibana, etc.
Relational and NoSQL databases: developing and managing operations leveraging key event streaming, messaging, and database services such as Cassandra, MQ/JMS/Kafka, Aurora, RDS, Cloud SQL, Bigtable, DynamoDB, MongoDB, Cloud Spanner, Kinesis, Cloud Pub/Sub, etc.