Job Type: Full Time
JPMorgan Chase (JPMC) is a leading global financial services firm and the largest bank in the United States, with total assets of $2.687 trillion. With an annual tech budget of $10B+, we have started investing significantly in building next-generation core infrastructure, cloud, big data, and AI/ML technology. Our goal is to accelerate the delivery and adoption of the Global Technology Vision and to enable the firm’s Global Technology teams to deliver faster and with greater impact for customers and clients.
We have a Site Reliability Engineer (SRE) position to help the JPMC AI/ML team with production support in the public cloud. In this role, you’ll work with AI/ML and cloud engineers to build the platform, pipeline, and monitoring systems, and to ensure the application landscape is designed to take full advantage of JPMC’s global cloud solution.
This role requires a wide variety of strengths and capabilities, including:
- Deep understanding of SRE philosophy, technologies, platforms and tools, SLA management, incident resolution, and automation
- Mastery of application, data and infrastructure architecture disciplines
- Command of architecture, design and business processes
- Keen understanding of financial control and budget management
- Expertise in working in partnership with colleagues throughout the firm, and in leading collaborative teams to achieve common goals
- Hands-on experience managing operations of large-scale, internet-centric production environments for application or infrastructure services serving tens to millions of end users.
- Prior experience in large-scale internet companies/technologies, where uptime and continuous availability were core to the business.
- Work with Architecture to design reusable patterns to deploy to applications, provide governance around adoption, and influence application development teams on roadmaps and designs.
- Identify and partner with Infrastructure teams and AD teams to implement automation opportunities to drive down toil and reduce technical debt.
- Apply standards of cloud compliance to application design to achieve reliability
- Understanding of networking and cloud technologies, for example security, load balancing, and network routing protocols.
- Implement SRE frameworks to support global multi-cloud environments, and ensure the highest level of SLA through operational excellence
- Provides failure analysis / root cause analysis when required
- Provides support to develop & improve the quality of technical engineering documentation
- Provides support to drive the maturity of the software development lifecycle
- Provides quality control of engineering deliverables
- Provides technical consultation to product management
- Performs deployment, administration, management, configuration, testing, and integration tasks related to the AI/ML platforms in a cloud environment
- Helps to develop new cloud engineering strategies and implementations for the firm
- Champion a DevOps model so that services are automated and elastic across all platforms
- Helps coach and mentor less experienced team members.
- Writes operational documentation and a knowledge base of known issues with solutions
- Participates in 24x7 SRE on-call rotations and escalation workflows.
- Bachelor's degree in Computer Science, Information Technology, or equivalent technical field
- 2+ years of enterprise cloud infrastructure experience (AWS, Azure, GCP) in a mission-critical environment
- Familiar with each step in the AI model development life cycle - data collection, model development, model training, model deployment and inference.
- Familiar with any of the AI/machine learning frameworks, statistical packages, and libraries: TensorFlow, Amazon Machine Learning, Apache Spark, PyTorch, scikit-learn, etc.
- In-depth OS experience (RHEL, Ubuntu, Windows Server) with strong debugging, troubleshooting, and problem-solving skills
- Experience building automation and tooling in a large enterprise environment, including engineering productivity tools such as CI/CD, Jenkins, and code coverage.
- Experience in site reliability engineering in one of the following languages: Python, Java, PowerShell, shell scripting or Go
- Hands-on experience with cloud-based technologies and tools, especially in deployment, monitoring and operations, such as Datadog, Prometheus, Splunk, Elasticsearch, Grafana
- Strong working knowledge of modern development technologies and tools such as Agile, CI/CD, Git, Terraform and Jenkins.
- Deep knowledge of Internet protocols and web services technologies such as HTTP, DNS, TCP/UDP, SOAP, JSON and REST
- Good understanding of networking protocols and cybersecurity best practices in cloud environments
- AWS certification is highly desirable