Job Information
Navy Federal Credit Union Cloud Site Reliability Engineer (Azure) – ETS Engineer IV in San Diego, California
Overview
The Site Reliability Engineer is a member of the Cloud Team and provides support for software development, operations, and maintenance while working with complex infrastructure to improve performance, visibility, stability, availability, and reliability using automated solutions. This role will provide Tier 3 support, either directly or by engaging with other stakeholders, for applications and platforms residing in the Cloud. The ideal candidate has hands-on experience with, and an understanding of, the software development lifecycle from inception to implementation. The successful candidate should understand how software is maintained and will be responsible for ensuring the reliability and speed of the software.
This position is eligible for the TalentQuest employee referral program. If an employee referred you for this job, please apply using the system-generated link that was sent to you.
Responsibilities
Set up and maintain Azure-native monitoring tools like Azure Monitor, Log Analytics, and Application Insights to oversee system performance, resource health, and workload behavior across AKS environments.
Build tailored dashboards that provide clear visualizations of key metrics and configure proactive alerting mechanisms to detect anomalies early and trigger appropriate responses.
Utilize Azure Sentinel to enhance security incident detection and response for AKS environments, maintaining compliance and minimizing risks.
Implement end-to-end observability practices by combining metrics, logs, and traces for comprehensive insights into containerized applications and their underlying infrastructure.
Design and maintain automation scripts using Python, PowerShell, or Bash to streamline repetitive tasks, such as automated scaling, backup processes, and system health checks.
Develop runbooks and automated workflows that trigger predefined remediation steps for commonly encountered issues, minimizing manual intervention and response time.
Create scripts that enable automatic system adjustments and recovery actions when performance thresholds are crossed or errors are detected.
Utilize tools such as Terraform or ARM templates to automate and manage the provisioning of cloud resources, ensuring consistency and repeatability.
Rapidly Diagnose Issues: Lead the identification and troubleshooting of issues impacting system performance, leveraging data from monitoring tools and logs for swift resolution.
Root Cause Analysis (RCA): Conduct thorough post-incident analyses to document root causes, identify areas for improvement, and implement preventive measures to reduce recurrence.
Runbook Maintenance: Keep incident response runbooks up to date with the latest information and best practices to ensure readiness and consistency during unexpected events.
Analyze Metrics and Performance Data: Continuously monitor key performance indicators (KPIs) across cloud resources and workloads to spot trends, potential bottlenecks, and opportunities for enhancement.
Propose and implement strategies to improve the cost-efficiency and performance of cloud services, such as right-sizing resources or enhancing load-balancing configurations.
Work closely with architecture and development teams to provide input on designing robust, scalable, and resilient cloud solutions.
Implement best practices for optimizing container performance within AKS clusters, ensuring optimal CPU and memory usage without compromising application availability.
Provide feedback and support to development teams to ensure applications are designed with reliability and scalability in mind.
Advocate for and help implement best practices in reliability, incident management, and proactive monitoring across teams.
Collaborate with security teams to identify and mitigate vulnerabilities in cloud infrastructure, integrating security monitoring and automated compliance checks.
Create comprehensive documentation covering monitoring configurations, incident response protocols, and remediation procedures to ensure team alignment and knowledge retention.
Contribute to the creation of internal training resources to help team members familiarize themselves with new tools, techniques, and processes.
Regularly share insights, lessons learned, and new approaches to improve the team’s response capabilities and the overall reliability of cloud services.
Regularly analyze usage data and performance metrics to identify opportunities for cost optimization, such as right-sizing virtual machines, optimizing storage solutions, and scheduling non-critical resources to shut down during off-peak hours.
Use Azure Cost Management + Billing to monitor expenses and track actual versus predicted costs.
Work with architecture teams to design solutions that maintain performance while minimizing costs, including the use of reserved instances and spot instances and the optimization of data transfer methods.
Develop automation scripts that dynamically manage resource allocation based on load, reducing unnecessary expenditure.
Proficiency in Service Level Objectives, Service Level Indicators, and error budgeting to balance system reliability with development velocity.
Expertise in chaos engineering practices to test and improve system resiliency under controlled conditions.
Deep knowledge of monitoring and observability tools, such as Prometheus, Grafana, and Azure Monitor.
Strong troubleshooting abilities for distributed systems with proficiency in identifying root causes.
Experience implementing incident management frameworks, ensuring smooth communication, documentation, and follow-up for service interruptions.
Qualifications
Bachelor's Degree in Information Technology or the equivalent combination of training, education, and experience.
Solid hands-on experience in a Site Reliability Engineer, DevOps Engineer, or similar role with a strong focus on Azure cloud services.
Technical Skills
Proficiency in scripting languages such as Python, PowerShell, or Bash.
Extensive experience with Azure monitoring tools like Azure Monitor, Log Analytics, Application Insights, and Azure Sentinel.
Familiarity with AKS and best practices for monitoring containerized applications.
Problem-Solving: Proven track record of effective troubleshooting and resolution of cloud infrastructure issues.
Automation Expertise: Hands-on experience creating automated solutions using IaC tools like Terraform or ARM templates.
Collaboration and Communication: Strong interpersonal skills to work effectively within cross-functional teams.
Desired Qualifications
Certifications: Azure certifications such as Microsoft Certified: Azure Administrator Associate or Azure Solutions Architect Expert.
Advanced Knowledge: Experience with Kusto Query Language (KQL) for in-depth data analysis and complex queries.
Security Acumen: Familiarity with integrating security best practices into monitoring and incident response.
Dynatrace experience a plus.
Knowledge and understanding of, and experience with, DevOps and Agile methodologies.
Experience in Microsoft Azure Technologies.
Experience with Tanzu Application/Container Services (TAS/TKS), previously Pivotal Cloud Foundry, or equivalent container-based platforms and products such as OpenShift, Azure Kubernetes Service, or Google Container Services.
Experience using ServiceNow ITOM and ITSM to create catalogs or to automate processes by integrating with other systems.
Knowledge and understanding of how software is built and managed.
Hours: Monday - Friday, 8:00AM - 4:30PM
Location: 820 Follin Lane, Vienna, VA 22180 | 5510 Heritage Oaks Drive, Pensacola, FL 32526 | 141 Security Drive, Winchester, VA 22602 | 9999 Willow Creek Road, San Diego, CA 92131 | 295 Bendix Road, Suite 250, Virginia Beach, VA 23452 | 11270 Saint Johns Industrial Parkway South, Jacksonville, FL 32246 | 9001 Airport Freeway, Suite 925, North Richland Hills, TX 76180 | 4 Concourse Parkway, #100, Sandy Springs, GA 30328
About Us
Navy Federal provides much more than a job. We provide a meaningful career experience, including a culture that is energized, engaged and committed; and fierce appreciation for our teams, who are rewarded with highly competitive pay and generous benefits and perks.
• Best Companies for Latinos to Work for 2024
• Computerworld® Best Places to Work in IT
• Forbes® 2024 America's Best Large Employers
• Forbes® 2023 The Best Employers for New Grads
• Fortune Best Workplaces for Millennials™ 2023
• Fortune Best Workplaces for Women™ 2023
• Fortune 100 Best Companies to Work For® 2024
• Military Times 2023 Best for Vets Employers
• Newsweek Most Loved Workplaces
• Ripplematch Campus Forward Award - Excellence in Early Career Hiring
• Yello and WayUp Top 100 Internship Programs
From Fortune. ©2024 Fortune Media IP Limited. All rights reserved. Used under license. Fortune and Fortune Media IP Limited are not affiliated with, and do not endorse products or services of, Navy Federal Credit Union.
Equal Employment Opportunity: Navy Federal values, celebrates, and enacts diversity in the workplace. Navy Federal takes affirmative action to employ and advance in employment qualified individuals with disabilities, disabled veterans, Armed Forces service medal veterans, recently separated veterans, and other protected veterans. EOE/AA/M/F/Veteran/Disability
Hybrid Workplace: Navy Federal Credit Union is a hybrid workplace, and details will be discussed during your interview process.
Disclaimers: Navy Federal reserves the right to fill this role at a higher/lower grade level based on business need. An assessment may be required to compete for this position. Job postings are subject to close early or extend out longer than the anticipated closing date at the hiring team’s discretion based on qualified applicant volume. Navy Federal Credit Union assesses market data to establish salary ranges that enable us to remain competitive. You are paid within the salary range, based on your experience, location, and market position.
Bank Secrecy Act: Remains cognizant of and adheres to Navy Federal policies and procedures, and regulations pertaining to the Bank Secrecy Act.
REQNUMBER: 22001