120 DevOps Engineer jobs in South Africa

Site Reliability Engineer

Rosebank, Gauteng R250000 - R450000 Y Cartrack

Posted today

Job Description

We're a world-leading smart mobility SaaS tech company with over 2,000,000 active users. Our teams are collaborative, vibrant and fast-growing, and all team members are empowered with the freedom to influence our products and technology.

Are you curious, innovative and passionate? Do you take ownership, embrace challenges, and love problem-solving?

We're looking for a Site Reliability Engineer (SRE) who will enable us to build industry-disruptive tech products and revolutionize the way our customers use technology.

The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, performance, and scalability of Cartrack's Linux-based systems and services. This role combines software engineering with operations, focusing on automation, monitoring, and incident response. The position requires working in shifts and rotations to support 24/7 operations.

You want to

Maintain and improve the reliability, scalability, and performance of Cartrack's infrastructure and applications.
Implement automation for deployments, monitoring, and system management.
Troubleshoot production issues, perform root cause analysis, and implement permanent fixes.
Develop and manage monitoring, alerting, and incident response processes.
Work with development teams to design resilient and scalable systems.
Participate in on-call shifts and rotation schedules to manage incidents and ensure uptime.
Optimize system efficiency and cost-effectiveness in an open-source environment.

You have

Strong background in Linux/Unix system administration (open-source stack).
Familiarity with monitoring and logging tools (Prometheus, Grafana etc.).
Knowledge of networking, load balancing, and system security best practices.
Strong problem-solving and debugging skills in a production environment.
Proven experience in automation and scripting (Python, Bash, Go, or similar).
Ability to design and maintain automation frameworks for deployments, monitoring, and system recovery.
Hands-on experience with CI/CD pipelines and configuration management tools (e.g., GitLab CI, Ansible, Puppet, Terraform).
Experience building self-healing and auto-remediation solutions for production environments.

Nice to Have

Experience with containerization and orchestration (Docker, Kubernetes).
Exposure to microservices and service mesh environments.
Knowledge of database reliability and performance tuning (PostgreSQL).

Qualifications

Bachelor's degree in Computer Science, Information Systems, or equivalent practical experience.
3+ years of experience in SRE, DevOps, or related infrastructure/operations roles.
Ability to work flexible hours, including shift rotations and on-call duties.

Job Type: Full-time

Ability to commute/relocate:

Rosebank, Gauteng: Reliably commute or planning to relocate before starting work (Preferred)

Experience:

Linux: 4 years (Preferred)
SRE: 3 years (Preferred)
Network monitoring: 3 years (Preferred)

Work Location: In person

Site Reliability Engineer

R900000 - R1200000 Y Flash Group

Posted today

Job Description

Flash

2024/12/12 Western Cape

Job Reference Number: T169
Department: Technology
Business Unit:
Industry: Fintech
Job Type: Permanent
Positions Available: 3
Salary: Market Related

We are looking for an individual who is passionate about technology and experienced in developing and managing cutting-edge environment monitoring solutions, and in using software and automation to solve problems and manage production systems.

Job Description
RESPONSIBILITIES:

Master multiple scripting and programming languages to achieve advanced proficiency and deliver robust solutions.
Drive the design and implementation of sophisticated automation tools and processes for managing large-scale systems.
Lead critical incident responses with composure and efficiency, followed by thorough post-incident reviews to implement preventative measures.
Shape system architecture and design, bringing your vision and expertise to influence high-impact decisions.
Champion the creation and adherence to reliability standards, ensuring scalable and sustainable system operations.
Demonstrate strong strategic thinking and planning abilities to drive team and organizational success.
Exhibit exceptional leadership skills, with the capacity to influence key technical decisions and inspire cross-functional teams.
Possess mentorship and coaching expertise to nurture and develop junior and intermediate team members, fostering a collaborative and growth-oriented environment.

Job Requirements

MINIMUM REQUIREMENTS:

8-10 years of relevant experience in SRE, DevOps, or system engineering
Matric
Proficiency in scripting languages
Relevant certification such as Oracle, Cloud, or DevOps

TECHNICAL SKILLS:

Continuous delivery
Cloud skills & best practices
Observability (System and Application Performance Monitoring)
Infrastructure as code
Configuration management (Infrastructure as a Service)
Containers
Automation
Collaboration and Communication
Coding and Scripting
Azure DevOps
General systems uptimes
SLO (Service-level Objectives)
Latency
Incident and outage management
Change management
Capacity planning

Site Reliability Engineer

Mind Detect

Posted today

Job Description

Our ultra-modern, scaling payments platform client is seeking a Site Reliability Engineer (SRE) to join their world-class Engineering team, located in Cape Town (hybrid). Due to their unique market positioning and backing by world-leading payment companies, VCs and fintech platforms alike, they are set for high growth and expansion in the coming years.

As SRE, you will be responsible for the design, implementation, and maintenance of the technology infrastructure. You will strive to make these systems as performant, reliable, and scalable as possible, and resolve any issues that arise. The SRE reports to the Engineering Team Lead.

Given the fact that this is a younger company, the environment is highly dynamic and fast-paced. Your working mentality must be one of adaptability, resilience and passion. This is a fantastic company to work for with truly vast amounts of personal and professional upside.

Responsibilities

Secure, monitor and maintain product infrastructure and connectivity.
Work closely with engineering to improve development, deployment and release processes.
Build tooling to allow developers to monitor reliability of their services and ensure a level of reliability in new services.
Provide guidance and support to teams implementing SLOs.
Reduce the cost of failure by minimising problem discovery time and time to repair.
Automate engineering operations where possible.

Qualifications

Bachelor's Degree in Computer Science, Engineering, or other related field
Experience with operating a service-oriented architecture, container orchestration (Kubernetes, Nomad or similar) and cloud environments
4+ years professional experience in a similar role
Knowledge of network architecture and design patterns
Knowledge of disaster recovery mechanisms
Knowledge of secret management
Knowledge of various automation tools
Knowledge of coding and common programming languages
Experience in IT infrastructure monitoring and management

Bonus points for:

Experience in the FinTech and/or the financial services industry, particularly with PCI DSS frameworks
Experience with security automation and cloud security

Benefits

Equity in the business
Generous leave/solid work-life balance
Great remuneration package
Remote working
Plenty of perks
Strong professional development
An open, international and inclusive culture
Advanced equipment/technology

This position is open to people already eligible for work in South Africa

About us

We're a dedicated recruiter bringing together the brightest talent with organisations creating cutting-edge technology to change the world for the better.

We partner with technology providers at the forefront of meaningful innovation. And we're here for talented individuals who are passionate about using their skills to drive positive change.

Mind Detect provides exceptional recruitment services to businesses who are leading the way in Data, Machine Learning and AI-driven technologies throughout Europe, the US and Asia.

Site Reliability Engineer

R120000 - R180000 Y Robin AI

Posted today

Job Description

About Robin
Robin is on a mission to rebuild the legal industry — starting with making contracts simple for everyone. We are a pioneer in Legal AI, built on proprietary models, licensed data, and deep partnerships with Anthropic and AWS. Since 2019, we've expanded our footprint to 4 continents and have been supporting many of the world's most successful businesses, including GE, Pfizer, KPMG, and UBS.

What will you do as an SRE?
As an SRE at Robin AI, you'll help build and maintain our cloud infrastructure and applications that power our cutting-edge Legal AI platform. You'll collaborate with engineering teams to establish robust monitoring, incident response, and deployment strategies that ensure high availability and reliability of our proprietary models and services, maintaining optimal SLOs for our global customer base.

Your Day-to-day Responsibilities

You will be responsible for ensuring the Robin systems are highly available and scalable.
Standardise and implement observability practices in our service-based architecture through logging, traces, metrics and monitors
Design, deploy, and operate infrastructure to support Robin's product teams as we expand into new regions.
Add automation around manual operational tasks.
Collaborate with development team leads to optimise build, test, and deployment processes.
Participate in and improve our on-call and incident handling processes to ensure 24/7 system reliability.

Ideally, You Should Have The Following Qualifications

3+ years of experience in DevOps or Site Reliability Engineering roles
Proficiency in at least one backend programming language (We use Python)
Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed by Terraform
Comfortable troubleshooting across the full stack, starting from the browser, through the networking components, into the containerised applications and then onto data stores.
Knowledge of observability frameworks and tools (We use OpenTelemetry, CloudWatch and Datadog)
Excellent problem-solving and communication skills
Experience with AI/ML infrastructure deployments is a plus

What's In It For You

Salary: Competitive
Equity package: Generous equity scheme - everyone gets to be an owner of Robin AI
Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
Growth opportunities: We prioritise promotions for high performers and help you to progress your career.

What's it like working at Robin?
Our culture and values attract people who are creative, resourceful, and share our passion for excellence. At Robin, you're encouraged to push yourself and empowered to take risks. We support each other to think big, try new ideas, and navigate uncertainty. Whether you're at our headquarters or one of our worldwide offices, you'll find a world of opportunities to grow, thrive, and make a meaningful impact. See what life is like at Robin.

Diversity, Equity and Inclusion at Robin
We are committed to building one of the most diverse technology companies in the world. As of 2024, more than 30% of our employees come from ethnic minority backgrounds, and 51% of roles are held by women. We know that transforming the legal industry requires diverse perspectives, so we're creating an environment where innovation thrives through inclusion.

Robin operates a direct hiring model and any speculative CVs shared via agencies will be treated as a gift.

Site Reliability Engineer

R250000 - R750000 Y Deloitte

Posted today

Job Description

Site Reliability Engineer

Location: South Africa (Onsite)

Experience: Minimum 5 years (hands-on technical role)

Key Responsibilities

Ensure reliability, availability, and performance of critical banking systems.
Proactively monitor, troubleshoot, and resolve production issues.
Collaborate with development and operations teams to automate processes and improve system resilience.
Support and optimise payment, collection, and debit order platforms.

Technical Skills

Proven experience with ZA Collection, Debit Order, and Payment systems.
Strong Java (versions 8 and 17) and Spring Boot expertise.
Proficient in Oracle 19c PL/SQL and Microsoft SQL.

Non-Negotiable Requirements

Minimum 5 years in a hands-on SRE or similar technical role.
Banking sector experience (insurance experience not considered).

Preferred Skills

Mainframe experience (highly advantageous but not essential).
Strong preference for Azure cloud experience.
IBM VS and COBOL II programming skills are a plus.

Site Reliability Engineer

R900000 - R1200000 Y ExecutivePlacements - The JOB Portal

Posted today

Job Description

Site Reliability Engineer (Datadog)

Recruiter: Data Centrix
Job Ref: JHB /LD
Date posted: Tuesday, October 7, 2025
Location: Johannesburg, South Africa

SUMMARY:
Are you a Site Reliability Engineer with solid Datadog experience? Our client in the Warehousing and Logistics sector is looking to employ an Engineer to support the design, implementation, and optimization of Datadog monitoring solutions across infrastructure, applications and services.

POSITION INFO:
Qualifications and Experience:

Datadog Certified Fundamentals – Must have
Degree in Information Technology or Computer Science
Management of operations on virtualized and distributed infrastructures
Management of operations in environments with clustering, replication, and load balancing
ITIL Practitioner (V3) / ITIL Specialist (V4)
Windows Server: Advantage
1–3 years of experience working with a modern monitoring/observability tool, ideally Datadog (or alternatives like Prometheus, Grafana, New Relic, or Dynatrace).
Experience in:
Deploying and configuring monitoring agents
Creating dashboards and monitors
Parameterizing tags and labels for proper data correlation
Basic familiarity with cloud platforms (AWS, Azure or GCP) and container environments (Docker/Kubernetes)
Experience working with Centreon - Advantage
Strong interest in monitoring, DevOps, SRE, or cloud infrastructure
Knowledge of basic scripting (e.g., Bash, Python) is a plus

Duties:

Support the design, implementation, and optimization of Datadog monitoring solutions across infrastructure, applications, and services.
Work alongside DevOps, infrastructure, and application teams to ensure complete observability using custom dashboards, alerts, and tagging strategies.
Assist in the deployment and onboarding of new systems into the monitoring ecosystem.
Serve as the go-to person for building visualizations, improving signal-to-noise ratios in alerting, and aligning monitoring with business objectives.
Ideal for a young and motivated engineer looking to grow within observability and cloud-native monitoring.
Deploy and configure Datadog agents across various environments (cloud and on-prem).
Create and customize dashboards, monitors, and alerts for systems, services, containers, and applications.
Implement tagging strategies to organize, filter, and correlate metrics and logs effectively.
Integrate Datadog with various platforms (AWS, Azure, GCP, Kubernetes, Docker, etc.) to collect telemetry data.
Collaborate with developers, DevOps, and infrastructure teams to identify key business and system metrics to monitor.
Continuously tune and optimize monitors to reduce false positives and improve actionable alerting.
Document dashboards, alert logic, best practices, and knowledge for cross-team enablement.
Analyze incidents and outages post-mortem to identify monitoring gaps and enhance visibility.
Assist in evangelizing observability practices within the organization and contribute to monitoring as code efforts (e.g., Terraform for Datadog resources).
Stay up to date with new Datadog features and industry trends in observability and monitoring.

Site Reliability Engineer

R250000 - R600000 Y SprintHive - Intelligent Customer Onboarding

Posted today

Job Description

Junior Site Reliability Engineer - SprintHive

About SprintHive

SprintHive is a South African fintech enabling seamless end-to-end customer onboarding that drives conversion rates, prevents fraud, and reduces risk. Our automated solutions fully onboard new customers in under two minutes.

The Role

We're seeking a Junior Site Reliability Engineer eager to launch their career in infrastructure and platform engineering. You'll work directly with our CTO, receiving hands-on mentorship while contributing to production systems from day one. This fully remote position offers exceptional learning opportunities in a modern cloud-native environment.

What You'll Do

Learn & Build

Gain hands-on experience with infrastructure-as-code using Terraform
Work with Kubernetes and containerised applications
Learn cloud platforms (Google Cloud and AWS) through real projects
Develop automation scripts in Golang, Bash, or Python
Participate in code reviews to improve your skills

Operational Support

Assist in maintaining and improving our monitoring stack
Help debug production issues alongside senior team members
Document procedures and infrastructure components
Support change control processes for production deployments
Participate in on-call rotation (with full support and gradual responsibility increase)

Growth Opportunities

Own small to medium infrastructure projects with guidance
Contribute to security initiatives and incident response
Collaborate with development team on infrastructure needs
Progress toward independent project ownership

Requirements

Strong problem-solving mindset and eagerness to learn
Basic programming ability in any language
Fundamental understanding of Linux/Unix systems
Interest in cloud infrastructure and DevOps practices
Ability to work independently in a remote environment

Preferred Qualifications

Computer Science degree or relevant coursework (bootcamps, self-study welcome)
Personal projects demonstrating infrastructure or automation work
Familiarity with Git, containers, and cloud services
Contributions to open source projects

Our Tech Stack

Infrastructure: Kubernetes (GKE), Terraform, Kong API Gateway
Monitoring: Prometheus, Grafana, Elastic, Kibana, Mezmo, Falco
Languages: Kotlin, Python, JavaScript, Golang
Architecture: Microservices with Event Sourcing and CQRS
Database: MongoDB Atlas
CI/CD: Jenkins

Compensation & Benefits

Salary: Competitive market rate
21 days paid leave
Direct mentorship from experienced CTO
Flexible working hours
Fully remote position
High-quality hardware (MacBook Pro, 34" Dell monitor)
AI assistant subscriptions
Clear growth path to Senior SRE role

Intermediate Site Reliability Engineer - SprintHive

The Role

We're seeking an Intermediate Site Reliability Engineer ready to take ownership of significant infrastructure projects. Working directly with our CTO, you'll have the autonomy to drive improvements while continuing to grow your expertise. This fully remote position offers the perfect balance of independence and support.

What You'll Do

Infrastructure Ownership

Maintain and improve infrastructure using Terraform across GCP and AWS
Deploy and optimise applications on Kubernetes (GKE)
Ensure infrastructure automation and reproducibility
Own medium to large infrastructure projects independently
Debug and resolve complex production issues

Operational Excellence

Build automation tooling to eliminate manual processes
Improve monitoring and alerting systems
Write and execute production change plans
Maintain security best practices in all deployments
Participate confidently in on-call rotation

Collaboration & Growth

Contribute to infrastructure architecture discussions
Document systems and share knowledge with team
Partner with development team on infrastructure requirements
Participate in security incident response
Begin mentoring junior team members as we grow

Requirements

Must-Haves

2-4 years of infrastructure/DevOps/SRE experience
Hands-on experience with cloud platforms (GCP, AWS, or Azure)
Working knowledge of infrastructure-as-code tools
Container and orchestration experience (Docker, Kubernetes)
Proven ability to complete projects independently
Solid programming skills in at least one language (Python, Go, Bash)

Preferred Qualifications

Terraform experience
GCP/AWS certification
Experience with monitoring tools (Prometheus, Grafana, ELK)
Exposure to microservices architectures
Computer Science degree or equivalent

Our Tech Stack

Infrastructure: Kubernetes (GKE), Terraform, Kong API Gateway
Monitoring: Prometheus, Grafana, Elastic, Kibana, Mezmo, Falco
Languages: Kotlin, Python, JavaScript, Golang
Architecture: Microservices with Event Sourcing and CQRS
Database: MongoDB Atlas
CI/CD: Jenkins

Compensation & Benefits

Salary: Competitive market rate
21 days paid leave
Flexible working hours
Fully remote position
High-quality hardware (MacBook Pro, 34" Dell monitor)
AI assistant subscriptions
High project autonomy
Clear growth path to Senior role

Why This Role?

Perfect for engineers ready to move beyond junior tasks but not yet requiring senior-level compensation. You'll own real projects, make meaningful decisions, and grow rapidly in a small team environment.

Senior Site Reliability Engineer - SprintHive

The Role

We're seeking a Senior Site Reliability Engineer to co-architect our infrastructure future. As the second senior SRE working directly with our CTO, you'll have exceptional influence over platform strategy and technical direction. This fully remote position is ideal for an expert seeking impact without bureaucracy.

What You'll Do

Strategic Leadership

Co-design long-term infrastructure vision and roadmap
Lead complex, multi-quarter infrastructure transformations
Define SRE standards, practices, and tooling strategies
Make architectural decisions balancing scale, cost, and complexity
Evaluate and introduce new technologies

Technical Excellence

Architect sophisticated infrastructure solutions using Terraform
Design zero-downtime deployment strategies and disaster recovery plans
Lead Kubernetes platform optimisation for scale and efficiency
Drive security architecture including identity management (OAuth2)
Build advanced automation and self-healing systems

Team & Culture Building

Mentor team members and establish engineering excellence standards
Lead incident response and drive blameless post-mortem culture
Interface with executive team on infrastructure strategy
Own vendor relationships and technology evaluations
Help build and scale the SRE team as we grow

Requirements

Must-Haves

5+ years of SRE/DevOps experience in production environments
Expert-level knowledge of at least one major cloud provider
Advanced Terraform skills with large-scale infrastructure experience
Deep Kubernetes expertise including performance tuning and troubleshooting
Proven track record of leading infrastructure initiatives
Strong programming skills (Go, Python) for tooling development
Experience mentoring engineers and driving technical standards

Preferred Qualifications

Fintech or high-compliance environment experience
Multi-cloud architecture experience
Security certifications or demonstrated security expertise
Open source contributions to infrastructure tools
Experience scaling startups through rapid growth

Our Tech Stack

Infrastructure: Kubernetes (GKE), Terraform, Kong API Gateway
Monitoring: Prometheus, Grafana, Elastic, Kibana, Mezmo, Falco
Languages: Kotlin, Python, JavaScript, Golang
Architecture: Microservices with Event Sourcing and CQRS
Database: MongoDB Atlas
CI/CD: Jenkins

Compensation & Benefits

Salary: Competitive market rate
21 days paid leave
Flexible working hours
Fully remote position
Premium hardware setup (MacBook Pro, 34" Dell monitor)
AI assistant subscriptions
Strategic influence at executive level
Budget authority for infrastructure decisions
Opportunity to build and lead growing SRE team

Why This Role?

Rare opportunity to join as a founding SRE member with CTO-level partnership. You'll have the authority of a Head of Infrastructure without the bureaucracy, the technical challenges of a scale-up without legacy constraints, and the influence to build the team and culture from scratch.

Site Reliability Engineer

R600000 - R1200000 Y Electrum Payments

Posted today

Job Description

Electrum is the next-generation payments technology company that provides cloud-native software to optimise the processing of financial transactions. Since 2012, we have established ourselves as a respected payments technology partner through our deep expertise and track record in delivering trusted enterprise-grade payments solutions.

We've built a reputation in providing solutions for high-volume, low-value payment schemes and services that enable our clients to deliver to their customers at scale. We love that the projects we work on touch the lives of millions of South Africans daily, making a real difference.

We hire the best of the best and we offer great opportunities for personal growth and career progression.

Responsibilities

Site Reliability Engineers (SREs) are responsible for monitoring, automating, and improving the reliability, scalability, performance and availability of our services. SREs work on tasks such as preventing incidents, managing infrastructure reliability, building effective monitoring systems and ensuring smooth operations of cloud production systems.

Requirements

Service Reliability and Availability

Collaborate with teams to develop reliable, available, and scalable applications.
Work closely with the development team to understand, address, and prevent technical issues.
Participate in on-call rotations and manage critical incidents.
Develop and maintain incident response processes and alerting mechanisms.
Develop and maintain tools to monitor application and service SLIs and SLOs.

System Troubleshooting and Problem Resolution

Diagnose and resolve infrastructure and system-level issues, ensuring minimal downtime and swift problem resolution.
Respond to and investigate incidents related to infrastructure and applications, utilising diagnostic tools to track down and remediate issues.
Participate in on-call rotations to provide 24/7 operational support as necessary.

Observability and Automation

Utilise technologies to develop and maintain effective log management and monitoring solutions for internal and external customers.
Evaluate system health, identify performance bottlenecks and proactively optimise performance and cost-effectiveness.
Implement automation tools and frameworks for deployment, configuration, and monitoring processes.
Capacity management and planning for systems to ensure continued reliability.

Process Improvements

Offer recommendations and improvements to enhance performance, security, and scalability.
Evaluate and integrate emerging technologies, cloud services and automation tools to improve operational efficiency.
Drive cost-optimization initiatives by identifying opportunities for resource right-sizing, efficiency and other cost-saving measures.

Disaster Recovery

Design and implement disaster recovery strategies, including backup and restoration processes, to ensure business continuity.
Develop and update incident management procedures, ensuring effective incident response by providing technical solutions and implementing preventative measures.
Regularly assess system performance, identify irregularities, troubleshoot issues, and ensure high system availability. This includes performing or facilitating Disaster Recovery tests.

Requirements

Bachelor's degree in Computer Science, Information Technology, or related field preferred.
3+ years experience in an SRE or similar role.
Familiarity with AWS services like EC2, S3, RDS, Lambda, EKS and CloudWatch.
Demonstrable experience with observability tools like Elastic and Grafana.
Development skills advantageous.
Proficient troubleshooting and problem-solving skills.
Excellent collaboration, communication, and time management skills.
Attention to detail and ability to work effectively in a team environment.

Benefits

A good work-life balance is very important at Electrum. To help you manage your own time and energy, Electrum offers benefits such as:

Flexibility around core working hours (nature of flexibility is negotiated per role based on business need)
Daily cooked lunches and a stocked kitchen for the mid-day nibbles
Team socialising, getaways, and social outings

We have created a safe, transparent environment where we know mistakes happen, and that's okay. We even have a 3 step approach to dealing with them:

Tell everyone about it
Fix the mistake
Tell everyone about it

You are responsible for your actions – both the successes and the failures.

Site Reliability Engineer

R600000 - R1200000 Y Electrum

Posted today

Job Description

We hire the best of the best and we offer great opportunities for personal growth and career progression.

Responsibilities

Site Reliability Engineers (SREs) are responsible for monitoring, automating, and improving the reliability, scalability, performance and availability of our services. SREs work on tasks such as preventing incidents, managing infrastructure reliability, building effective monitoring systems and ensuring smooth operations of cloud production systems.

Requirements

Service Reliability and Availability

Collaborate with teams to develop reliable, available, and scalable applications.
Work closely with the development team to understand, address, and prevent technical issues
Participate in on-call rotations and manage critical incidents
Develop and maintain incident response processes and alerting mechanisms
Develop and maintain tools to monitor application and service SLIs and SLOs

System Troubleshooting and Problem Resolution

Diagnose and resolve infrastructure and system-level issues, ensuring minimal downtime and swift problem resolution
Respond to and investigate incidents related to infrastructure and applications, utilising diagnostic tools to track down and remediate issues
Participate in on-call rotations to provide 24/7 operational support as necessary

Observability and Automation

Utilise technologies to develop and maintain effective log management and monitoring solutions for internal and external customers
Evaluate system health, identify performance bottlenecks and proactively optimise performance and cost-effectiveness
Implement automation tools and frameworks for deployment, configuration, and monitoring processes
Capacity management and planning for systems to ensure continued reliability

Process Improvements

Offer recommendations and improvements to enhance performance, security, and scalability
Evaluate and integrate emerging technologies, cloud services and automation tools to improve operational efficiency
Drive cost-optimization initiatives by identifying opportunities for resource right-sizing, efficiency and other cost-saving measures

Disaster Recovery

Design and implement disaster recovery strategies, including backup and restoration processes, to ensure business continuity
Develop and update incident management procedures, ensuring effective incident response by providing technical solutions and implementing preventative measures
Regularly assess system performance, identify irregularities, troubleshoot issues, and ensure high system availability. This includes performing or facilitating Disaster Recovery tests

Requirements

Bachelor's degree in Computer Science, Information Technology, or related field preferred
3+ years' experience in an SRE or similar role
Familiarity with AWS services like EC2, S3, RDS, Lambda, EKS and CloudWatch
Demonstrable experience with observability tools like Elastic and Grafana
Development skills advantageous
Proficient troubleshooting and problem-solving skills
Excellent collaboration, communication, and time management skills
Attention to detail and ability to work effectively in a team environment

Benefits

A good work-life balance is very important at Electrum. To help you manage your own time and energy, Electrum offers benefits such as:

Flexibility around core working hours (nature of flexibility is negotiated per role based on business need)
Daily cooked lunches and a stocked kitchen for the mid-day nibbles
Team socialising, getaways, and social outings

We have created a safe, transparent environment where we know mistakes happen, and that's okay. We even have a 3-step approach to dealing with them:

Tell everyone about it
Fix the mistake
Tell everyone about it

You are responsible for your actions - both the successes and the failures.


Site Reliability Engineer

R250000 - R450000 Y Dariel

Posted today


Job Description

Site Reliability Engineer - Level 1

(MSP – Frontline Operations)

The Site Reliability Engineer (SRE) is the first responder for MSP customers, ensuring incident management, triage, and resolution within SLA-defined timelines.

As the front-line escalation point, this role diagnoses, troubleshoots, and resolves cloud infrastructure issues, escalating complex incidents to specialized teams (Senior SREs, SecOps, FinOps, or Cloud Engineering) when required.

Key Responsibilities

Act as the first responder for MSP client incidents, providing rapid troubleshooting, diagnosis, and resolution within SLA timelines.
Triage incoming issues to determine whether they can be resolved directly or require escalation to specialized teams.
Monitor cloud environments using AWS CloudWatch, Datadog, and FreshService to detect performance, security, and availability issues proactively.
Maintain incident records, documenting resolution steps, escalation timelines, and root causes.
Collaborate with Senior SREs and internal teams to ensure seamless incident resolution and escalation workflows.
Participate in post-incident reviews, contributing insights to improve operational efficiency and prevent future incidents.
Provide 24/7 on-call support for one full week per month, ensuring availability for high-priority incidents and urgent operational needs.

Required Skills & Experience

2+ years of experience in cloud operations, incident management, or a similar role, with hands-on AWS experience.
Strong knowledge of AWS core services (compute, networking, storage, IAM).
Hands-on experience with monitoring and logging tools (e.g., AWS CloudWatch, Datadog, ELK stack, Prometheus, Grafana).
Understanding of Kubernetes (ability to support containerized applications) is a plus.
Familiarity with incident management systems (e.g., FreshService) for tracking and responding to alerts.
Excellent troubleshooting and problem-solving skills, able to quickly determine root causes of issues.
Strong organizational and multitasking abilities, especially under pressure in fast-paced operational environments.
Good communication skills, ensuring clear documentation and effective collaboration with technical teams.
Familiarity with ITIL-based service management workflows is a plus.

Required Qualifications

AWS Certified SysOps Administrator OR AWS Solutions Architect Associate required (non-negotiable).
Terraform Associate Certification or equivalent hands-on experience (Preferred, but not required for front-line troubleshooting).
Experience working in an MSP or cloud operations team (Preferred).
Ability to work independently, prioritize incidents effectively, and escalate when necessary.
