Company: Kloc Technologies

Site Reliability Engineer

Location: Pune, India
Workplace Type: Onsite
Shift: US Shift

About the Role

We are seeking an experienced Site Reliability Engineer to join our dynamic team in Pune. In this role, you will be instrumental in managing our multi-cloud infrastructure, focusing on AWS and Azure. You will be responsible for setting up and maintaining the infrastructure to support our cloud migration and future division expansion. This position offers a unique opportunity to work in a global environment, collaborate with Automotive and corporate IT teams, learn new skills, and shape the future direction of our infrastructure. The ideal candidate will have a strong background in cloud computing, infrastructure as code, and automation, with a proactive approach to problem-solving and performance optimization. You will be part of the Tech Ops / SRE Team, which operates in a sharing and learning culture to maintain continuous access to our products.

Key Responsibilities

  • Gather and analyze metrics from operating systems and applications to assist in performance tuning and fault finding.
  • Partner with development teams to improve services through rigorous testing and release procedures.
  • Participate in system design consulting, platform management, and capacity planning.
  • Create sustainable systems and services through automation.
  • Balance feature development speed and reliability with well-defined service-level objectives.
  • Manage day-to-day operations of AWS/Azure Infrastructure.
  • Build and document automation processes for Infrastructure as a Service (IaaS) and Infrastructure as Code (IaC).
  • Manage backup and patch management processes.
  • Support architecture planning, migration, and installation for new projects.
  • Lead the structural/architectural design of platforms, middleware, databases, and backups according to system requirements.
  • Conduct technology capacity planning by reviewing current and future requirements.
  • Plan and implement disaster recovery, including backup and recovery procedures.
  • Manage day-to-day operations by troubleshooting issues, conducting root cause analysis (RCA), and developing fixes.
  • Plan for and manage upgrades, migrations, maintenance, backups, installations, and configurations.
  • Review technical performance and implement changes that improve efficiency and fine-tune performance.
  • Develop shift rosters to ensure no disruption in the tower.
  • Create and update SOPs, Data Responsibility Matrices, operations manuals, and daily test plans.
  • Provide weekly status reports to client leadership and internal stakeholders.
  • Leverage technology to develop Service Improvement Plans (SIP) through automation.

Required Skills & Qualifications

  • Bachelor’s degree (or equivalent) in computer science or a related discipline, and at least 7 years of relevant experience.
  • Strong understanding and hands-on experience with EKS, including configuring, deploying, maintaining, troubleshooting, upgrading, and monitoring EKS on AWS.
  • Hands-on experience with CI/CD pipelines and DevOps tooling, including Git-based version control (GitLab preferred), pipeline design and maintenance, automated builds, testing, and deployments for cloud-native and containerized workloads.
  • Hands-on experience with Linux servers, Active Directory (AD), LDAP, DNS, network storage, and AWS services (EC2, FSx, Managed AD, Route 53, etc.).
  • Ability to script and automate with tools and languages such as PowerShell, Python, Ansible, Terraform, and Bash.
  • Familiarity with ITSM processes such as Incident, Problem, and Change Management (ServiceNow preferred).
  • Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
  • Strong interpersonal skills, analytical and problem-solving ability, along with strong written and verbal communication.
  • Ability to communicate ideas in both technical and non-technical ways.
  • A strong capacity for teamwork and a sense of ownership, with the ability to work independently and be self-driven.
  • Experience in cloud infrastructure consulting.