Site Reliability Engineer (Roseville, US)

As a Site Reliability Engineer, you will ensure AMI’s SaaS solutions maintain high availability, the best customer experience, and optimal system uptime. In this role, you will have the opportunity to work on a SaaS platform built with cutting-edge technology for the parking industry.

Primary Responsibilities/Accountabilities

  • Leverage your expertise in coding, algorithms, complex analysis, enterprise incident coordination, and large-scale system design to triage customer instances and platform issues, and tune resource usage.
  • Model SRE culture of intellectual curiosity, problem-solving, openness, collaboration, reasonable risk-taking, and big thinking in a self-directed environment.
  • Inform stakeholders of service level objectives and their impact on services and cost.
  • Perform root-cause analysis of complex scaling and performance problems involving multiple integrated systems and services, networks, hardware, and software.
  • Set standards for deployments at scale, infrastructure reliability, and scalability.
  • Influence engineering teams, with a customer focus, toward quick and constructive resolution of conflicts.
  • Manage service availability and scalability through process, tools, and automation.
  • Perform post-mortems and optimize incident response processes.
  • Lead incident response for production incidents; drive investigation, analysis, and troubleshooting to resolve production incidents and systematically drive down detection and mitigation times.
  • Bring a strong engineering focus to operations, putting your energy into incident prevention, automation frameworks, self-service infrastructure, logging and metrics, and operational scorecards.
  • Assist with CI/CD processes to improve cadence.
  • Identify or utilize existing tools for logging, monitoring, event management, notification, runbook automation, and root cause analysis.
  • Develop, communicate, and monitor standard processes to promote the long-term health of the platforms.
  • Participate in security compliance efforts, including drafting and/or reviewing IT policies.
  • Improve capacity planning, configuration management, and monitoring.
  • Occasional off-hours, on-call work required.
  • Additional duties as assigned.

Qualifications: (Skills, Abilities, Knowledge)

  • 2+ years of experience supporting internet-facing production services and distributed systems.
  • Passion for designing, building, managing, and documenting resilient applications and infrastructures at scale.
  • Bachelor’s degree or an equivalent combination of education and related work experience.
  • 2+ years of hands-on experience with performance monitoring and diagnostic tools.
  • Excellent written and verbal communication skills.
  • Advanced knowledge of Linux administration.
  • Extensive experience with Git.
  • Understanding of microservice architectures and the complexities surrounding deployments.
  • A foundational understanding of security best practices.
  • Exposure to programming languages such as C#, Java, Python, or Go.
  • Experience with scripting languages such as PowerShell, Bash, or Python.
  • Troubleshooting experience with Docker containers and Kubernetes.
  • Knowledge of best practices for running applications in containerized environments, including health checks and rolling update strategies.
  • Ability to read network packet captures and troubleshoot connectivity issues.
  • Knowledge of CI/CD pipeline implementation for applications and infrastructure.
  • Knowledge of Microsoft Azure, AWS, GCP, or similar cloud platforms; AWS experience preferred.
  • Experience using Terraform for infrastructure as code (IaC).