Job Description - Site Reliability Engineer
- At Softcom Limited, we’re passionate about building software that solves problems.
- As we expand our customer deployments, we are seeking an experienced SRE to deliver insights from massive-scale data in real time.
- Specifically, we are searching for someone who brings fresh ideas, demonstrates a unique and informed viewpoint, and enjoys collaborating with cross-functional teams to develop real-world solutions and positive user experiences at every interaction.
Objectives of this Role
- Run the production environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
- Provide primary operational support and engineering for multiple large distributed software applications
Daily and Monthly Responsibilities
- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives
Requirements
- Degree in Computer Science or a Technology-related field required.
- 3 years' experience working in software engineering teams as an SRE or DevOps engineer.
- Practical experience with computer operating systems such as MS Windows and UNIX/Linux.
- An overall understanding of scripting and programming languages such as JavaScript, Go, and Python.
- Experience architecting, deploying and scaling production workloads on AWS using services such as EC2, S3, EKS, VPC, IAM etc.
- Experience with containers and container orchestration tools such as Docker and Kubernetes.
- Experience with CI/CD tools such as Jenkins, Bitbucket pipelines, AWS CodeDeploy, AWS CodeBuild or similar.
- Experience with monitoring and observability tools such as the ELK stack, Prometheus, CloudWatch etc.
- Experience with incident management tools such as Opsgenie or PagerDuty.
- Experience automating infrastructure, testing, and deployments using tools like Terraform or CloudFormation, and the ability to explain the Infrastructure as Code paradigm.
- Good understanding of Chaos Engineering, even if you haven't implemented it yourself yet.
- Experience debugging complex problems.
- Good understanding of computer networking and messaging, especially between services.
- Hands-on experience using source control (Git).
- Experience with a variety of databases (MongoDB, PostgreSQL, MySQL).
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
- Excellent written and verbal communication skills and high level of personal integrity
- Innovative thinking and leadership with an ability to lead and motivate cross-functional, interdisciplinary teams
- Experience with contract and vendor negotiations and management including managed services.
- Specific experience in Agile (scaled) software development or other best-in-class development practices.
- Experience with Cloud computing/Elastic computing across virtualized environments.
- Knowledge of relevant IT security-related hardware, software and vendor solutions.
- Deep-thinking, analytical mind with the ability to quickly get to the root cause of issues.