120 Devops Engineers jobs in South Africa

Site Reliability Engineer

Rosebank, Gauteng R250000 - R450000 Y Cartrack

Posted today

Job Description

We're a world-leading smart mobility SaaS tech company with over 2,000,000 active users. Our teams are collaborative, vibrant and fast-growing, and all team members are empowered with the freedom to influence our products and technology.

Are you curious, innovative and passionate? Do you take ownership, embrace challenges, and love problem-solving?

We're looking for a Site Reliability Engineer (SRE) who will enable us to build industry-disruptive tech products and revolutionize the way our customers use technology.

The Site Reliability Engineer (SRE) will be responsible for ensuring the reliability, performance, and scalability of Cartrack's Linux-based systems and services. This role combines software engineering with operations, focusing on automation, monitoring, and incident response. The position requires working in shifts and rotations to support 24/7 operations.

You want to

  • Maintain and improve the reliability, scalability, and performance of Cartrack's infrastructure and applications.
  • Implement automation for deployments, monitoring, and system management.
  • Troubleshoot production issues, perform root cause analysis, and implement permanent fixes.
  • Develop and manage monitoring, alerting, and incident response processes.
  • Work with development teams to design resilient and scalable systems.
  • Participate in on-call shifts and rotation schedules to manage incidents and ensure uptime.
  • Optimize system efficiency and cost-effectiveness in an open-source environment.

You have

  • Strong background in Linux/Unix system administration (open-source stack).
  • Familiarity with monitoring and logging tools (Prometheus, Grafana, etc.).
  • Knowledge of networking, load balancing, and system security best practices.
  • Strong problem-solving and debugging skills in a production environment.
  • Proven experience in automation and scripting (Python, Bash, Go, or similar).
  • Ability to design and maintain automation frameworks for deployments, monitoring, and system recovery.
  • Hands-on experience with CI/CD pipelines and configuration management tools (e.g., GitLab CI, Ansible, Puppet, Terraform).
  • Experience building self-healing and auto-remediation solutions for production environments.

Nice to Have

  • Experience with containerization and orchestration (Docker, Kubernetes).
  • Exposure to microservices and service mesh environments.
  • Knowledge of database reliability and performance tuning (PostgreSQL).

Qualifications

  • Bachelor's degree in Computer Science, Information Systems, or equivalent practical experience.
  • 3+ years of experience in SRE, DevOps, or related infrastructure/operations roles.
  • Ability to work flexible hours, including shift rotations and on-call duties.

Job Type: Full-time

Ability to commute/relocate:

  • Rosebank, Gauteng: Reliably commute or planning to relocate before starting work (Preferred)

Experience:

  • Linux: 4 years (Preferred)
  • SRE: 3 years (Preferred)
  • Network monitoring: 3 years (Preferred)

Work Location: In person

Site Reliability Engineer

R900000 - R1200000 Y Flash Group

Posted today

Job Description

Flash

2024/12/12 Western Cape

Job Reference Number: T169

Department: Technology

Business Unit:

Industry: Fintech

Job Type: Permanent

Positions Available: 3

Salary: Market Related

We are looking for an individual who is passionate about technology and has experience in developing and managing cutting-edge environment monitoring solutions, as well as in using software and automation to solve problems and manage production systems.

Job Description
RESPONSIBILITIES:

  • Master multiple scripting and programming languages to achieve advanced proficiency and deliver robust solutions.
  • Drive the design and implementation of sophisticated automation tools and processes for managing large-scale systems.
  • Lead critical incident responses with composure and efficiency, followed by thorough post-incident reviews to implement preventative measures.
  • Shape system architecture and design, bringing your vision and expertise to influence high-impact decisions.
  • Champion the creation and adherence to reliability standards, ensuring scalable and sustainable system operations.
  • Demonstrate strong strategic thinking and planning abilities to drive team and organizational success.
  • Exhibit exceptional leadership skills, with the capacity to influence key technical decisions and inspire cross-functional teams.
  • Possess mentorship and coaching expertise to nurture and develop junior and intermediate team members, fostering a collaborative and growth-oriented environment.

Job Requirements

MINIMUM REQUIREMENTS:

  • 8-10 years' relevant experience in SRE, DevOps, or system engineering
  • Matric
  • Proficiency in scripting languages
  • Relevant certification such as Oracle, Cloud, DevOps

TECHNICAL SKILLS:

  • Continuous delivery
  • Cloud skills & best practices
  • Observability (System and Application Performance Monitoring)
  • Infrastructure as code
  • Configuration management (Infrastructure as a Service)
  • Containers
  • Automation
  • Collaboration and Communication
  • Coding and Scripting
  • Azure DevOps
  • General systems uptimes
  • SLO (Service-level Objectives)
  • Latency
  • Incident and outage management
  • Change management
  • Capacity planning

Site Reliability Engineer

Mind Detect

Posted today

Job Description

Our ultra-modern, scaling, payments platform client is seeking a Site Reliability Engineer (SRE) to join their world-class Engineering team, located in Cape Town (hybrid). Due to their unique market positioning and backing by world-leading payment companies, VCs and fintech platforms alike, they are set for high growth and expansion in the coming years.

As an SRE, you will be responsible for the design, implementation, and maintenance of the technology infrastructure, and you will strive to make such systems as performant, reliable, and scalable as possible, as well as resolving any issues that arise. The SRE reports to the Engineering Team Lead.

Given the fact that this is a younger company, the environment is highly dynamic and fast-paced. Your working mentality must be one of adaptability, resilience and passion. This is a fantastic company to work for with truly vast amounts of personal and professional upside.

Responsibilities

  • Secure, monitor and maintain product infrastructure and connectivity.
  • Work closely with engineering to improve development, deployment and release processes.
  • Build tooling to allow developers to monitor reliability of their services and ensure a level of reliability in new services.
  • Provide guidance and support to teams implementing SLOs.
  • Reduce the cost of failure by minimising problem discovery time and time to repair.
  • Automate engineering operations where possible.

Qualifications

  • Bachelor's Degree in Computer Science, Engineering, or other related field
  • Experience with operating a service-oriented architecture, container orchestration (Kubernetes, Nomad or similar) and cloud environments
  • 4+ years professional experience in a similar role
  • Knowledge of network architecture and design patterns
  • Knowledge of disaster recovery mechanisms
  • Knowledge of secret management
  • Knowledge of various automation tools
  • Knowledge of coding and common programming languages
  • Experience in IT infrastructure monitoring and management

Bonus points for:

  • Experience in the FinTech and/or the financial services industry, particularly with PCI DSS frameworks
  • Experience with security automation and cloud security

Benefits

  • Equity in the business
  • Generous leave/solid work-life balance
  • Great remuneration package
  • Remote working
  • Plenty of perks
  • Strong professional development
  • An open, international and inclusive culture
  • Advanced equipment/technology

--

This position is open to people already eligible for work in South Africa

--

About us

We're a dedicated recruiter bringing together the brightest talent with organisations creating cutting-edge technology to change the world for the better.

We partner with technology providers at the forefront of meaningful innovation. And we're here for talented individuals who are passionate about using their skills to drive positive change.

Mind Detect provides exceptional recruitment services to businesses who are leading the way in Data, Machine Learning and AI-driven technologies throughout Europe, the US and Asia.

Site Reliability Engineer

R120000 - R180000 Y Robin AI

Posted today

Job Description

About Robin
Robin is on a mission to rebuild the legal industry — starting with making contracts simple for everyone. We are a pioneer in Legal AI, built on proprietary models, licensed data, and deep partnerships with Anthropic and AWS. Since 2019, we've expanded our footprint to 4 continents and have been supporting many of the world's most successful businesses, including GE, Pfizer, KPMG, and UBS.

What will you do as an SRE?
As an SRE at Robin AI, you'll help build and maintain the cloud infrastructure and applications that power our cutting-edge Legal AI platform. You'll collaborate with engineering teams to establish robust monitoring, incident response, and deployment strategies that ensure high availability and reliability of our proprietary models and services, maintaining optimal SLOs for our global customer base.

Your Day-to-day Responsibilities

  • Ensure the Robin systems are highly available and scalable.
  • Standardise and implement observability practices in our service-based architecture through logging, traces, metrics and monitors.
  • Design, deploy, and operate infrastructure to support Robin's product teams as we expand into new regions.
  • Add automation around manual operational tasks.
  • Collaborate with development team leads to optimise build, test, and deployment processes.
  • Participate in and improve our on-call and incident handling processes to ensure 24/7 system reliability.

Ideally, You Should Have The Following Qualifications

  • 3+ years of experience in DevOps or Site Reliability Engineering roles
  • Proficiency in at least one backend programming language (We use Python)
  • Strong knowledge of AWS services (ECS, S3, RDS, Lambda, etc.), managed by Terraform
  • Comfortable troubleshooting across the full stack, starting from the browser, through the networking components, into the containerised applications and then onto data stores.
  • Knowledge of observability frameworks and tools (We use OpenTelemetry, Cloudwatch & DataDog)
  • Excellent problem-solving and communication skills
  • Experience with AI/ML infrastructure deployments is a plus

What's In It For You

  • Salary: Competitive
  • Equity package: Generous equity scheme - everyone gets to be an owner of Robin AI
  • Annual leave: 20 days PTO, in addition to the public holidays observed in South Africa.
  • Growth opportunities: We prioritise promotions for high performers and help you to progress your career.

What's it like working at Robin?
Our culture and values attract people who are creative, resourceful, and share our passion for excellence. At Robin, you're encouraged to push yourself and empowered to take risks. We support each other to think big, try new ideas, and navigate uncertainty. Whether you're at our headquarters or one of our worldwide offices, you'll find a world of opportunities to grow, thrive, and make a meaningful impact. See what life is like at Robin.

Diversity, Equity and Inclusion at Robin
We are committed to building one of the most diverse technology companies in the world. As of 2024, more than 30% of our employees come from ethnic minority backgrounds, and 51% of roles are held by women. We know that transforming the legal industry requires diverse perspectives, so we're creating an environment where innovation thrives through inclusion.

Robin operates a direct hiring model and any speculative CVs shared via agencies will be treated as a gift.

Site Reliability Engineer

R250000 - R750000 Y Deloitte

Posted today

Job Description

Site Reliability Engineer

Location: South Africa (Onsite)

Experience: Minimum 5 years (hands-on technical role)

Key Responsibilities

  • Ensure reliability, availability, and performance of critical banking systems.
  • Proactively monitor, troubleshoot, and resolve production issues.
  • Collaborate with development and operations teams to automate processes and improve system resilience.
  • Support and optimise payment, collection, and debit order platforms.

Technical Skills

  • Proven experience with ZA Collection, Debit Order, and Payment systems.
  • Strong Java (versions 8 and 17) and Spring Boot expertise.
  • Proficient in Oracle 19c PL/SQL and Microsoft SQL.

Non-Negotiable Requirements

  • Minimum 5 years in a hands-on SRE or similar technical role.
  • Banking sector experience (insurance experience not considered).

Preferred Skills

  • Mainframe experience (highly advantageous but not essential).
  • Strong preference for Azure cloud experience.
  • IBM VS and COBOL II programming skills are a plus.

Site Reliability Engineer

R900000 - R1200000 Y ExecutivePlacements - The JOB Portal

Posted today

Job Description

Site Reliability Engineer (Datadog)

Recruiter: Data Centrix

Job Ref: JHB /LD

Date posted: Tuesday, October 7, 2025

Location: Johannesburg, South Africa

SUMMARY:
Are you a Site Reliability Engineer with solid Datadog experience? Our client in the Warehousing and Logistics sector is looking to employ an Engineer to support the design, implementation, and optimization of Datadog monitoring solutions across infrastructure, applications and services.

POSITION INFO:
Qualifications and Experience:

  • Datadog Certified Fundamentals – Must have
  • Degree in Information Technology or Computer Science
  • Management of operations on virtualized and distributed infrastructures
  • Management of operations on environments with clustering, replication, and load balancers
  • ITIL Practitioner (V3) / ITIL Specialist (V4)
  • Windows Server: Advantage
  • 1–3 years of experience working with a modern monitoring/observability tool, ideally Datadog (or alternatives like Prometheus, Grafana, New Relic, or Dynatrace).
  • Experience in:
    • Deploying and configuring monitoring agents
    • Creating dashboards and monitors
    • Parameterizing tags and labels for proper data correlation
  • Basic familiarity with cloud platforms (AWS, Azure or GCP) and container environments (Docker/Kubernetes)
  • Experience working with Centreon - Advantage
  • Strong interest in monitoring, DevOps, SRE, or cloud infrastructure
  • Knowledge of basic scripting (e.g., Bash, Python) is a plus

Duties:

  • Support the design, implementation, and optimization of Datadog monitoring solutions across infrastructure, applications, and services.
  • Work alongside DevOps, infrastructure, and application teams to ensure complete observability using custom dashboards, alerts, and tagging strategies.
  • Assist in the deployment and onboarding of new systems into the monitoring ecosystem.
  • Serve as the go-to person for building visualizations, improving signal-to-noise ratios in alerting, and aligning monitoring with business objectives.
  • Ideal for a young and motivated engineer looking to grow within observability and cloud-native monitoring.
  • Deploy and configure Datadog agents across various environments (cloud and on-prem).
  • Create and customize dashboards, monitors, and alerts for systems, services, containers, and applications.
  • Implement tagging strategies to organize, filter, and correlate metrics and logs effectively.
  • Integrate Datadog with various platforms (AWS, Azure, GCP, Kubernetes, Docker, etc.) to collect telemetry data.
  • Collaborate with developers, DevOps, and infrastructure teams to identify key business and system metrics to monitor.
  • Continuously tune and optimize monitors to reduce false positives and improve actionable alerting.
  • Document dashboards, alert logic, best practices, and knowledge for cross-team enablement.
  • Analyze incidents and outages post-mortem to identify monitoring gaps and enhance visibility.
  • Assist in evangelizing observability practices within the organization and contribute to monitoring as code efforts (e.g., Terraform for Datadog resources).
  • Stay up to date with new Datadog features and industry trends in observability and monitoring.

Site Reliability Engineer

R250000 - R600000 Y SprintHive - Intelligent Customer Onboarding

Posted today

Job Description

Junior Site Reliability Engineer - SprintHive

About SprintHive

SprintHive is a South African fintech enabling seamless end-to-end customer onboarding that drives conversion rates, prevents fraud, and reduces risk. Our automated solutions fully onboard new customers in under two minutes.

The Role

We're seeking a Junior Site Reliability Engineer eager to launch their career in infrastructure and platform engineering. You'll work directly with our CTO, receiving hands-on mentorship while contributing to production systems from day one. This fully remote position offers exceptional learning opportunities in a modern cloud-native environment.

What You'll Do

Learn & Build

  • Gain hands-on experience with infrastructure-as-code using Terraform
  • Work with Kubernetes and containerised applications
  • Learn cloud platforms (Google Cloud and AWS) through real projects
  • Develop automation scripts in Golang, Bash, or Python
  • Participate in code reviews to improve your skills

Operational Support

  • Assist in maintaining and improving our monitoring stack
  • Help debug production issues alongside senior team members
  • Document procedures and infrastructure components
  • Support change control processes for production deployments
  • Participate in on-call rotation (with full support and gradual responsibility increase)

Growth Opportunities

  • Own small to medium infrastructure projects with guidance
  • Contribute to security initiatives and incident response
  • Collaborate with development team on infrastructure needs
  • Progress toward independent project ownership

Requirements

  • Strong problem-solving mindset and eagerness to learn
  • Basic programming ability in any language
  • Fundamental understanding of Linux/Unix systems
  • Interest in cloud infrastructure and DevOps practices
  • Ability to work independently in a remote environment

Preferred Qualifications

  • Computer Science degree or relevant coursework (bootcamps, self-study welcome)
  • Personal projects demonstrating infrastructure or automation work
  • Familiarity with Git, containers, and cloud services
  • Contributions to open source projects

Our Tech Stack

  • Infrastructure: Kubernetes (GKE), Terraform, Kong API Gateway
  • Monitoring: Prometheus, Grafana, Elastic, Kibana, Mezmo, Falco
  • Languages: Kotlin, Python, JavaScript, Golang
  • Architecture: Microservices with Event Sourcing and CQRS
  • Database: MongoDB Atlas
  • CI/CD: Jenkins

Compensation & Benefits

  • Salary: Competitive market rate
  • 21 days paid leave
  • Direct mentorship from experienced CTO
  • Flexible working hours
  • Fully remote position
  • High-quality hardware (MacBook Pro, 34" Dell monitor)
  • AI assistant subscriptions
  • Clear growth path to Senior SRE role

Intermediate Site Reliability Engineer - SprintHive

The Role

We're seeking an Intermediate Site Reliability Engineer ready to take ownership of significant infrastructure projects. Working directly with our CTO, you'll have the autonomy to drive improvements while continuing to grow your expertise. This fully remote position offers the perfect balance of independence and support.

What You'll Do

Infrastructure Ownership

  • Maintain and improve infrastructure using Terraform across GCP and AWS
  • Deploy and optimise applications on Kubernetes (GKE)
  • Ensure infrastructure automation and reproducibility
  • Own medium to large infrastructure projects independently
  • Debug and resolve complex production issues

Operational Excellence

  • Build automation tooling to eliminate manual processes
  • Improve monitoring and alerting systems
  • Write and execute production change plans
  • Maintain security best practices in all deployments
  • Participate confidently in on-call rotation

Collaboration & Growth

  • Contribute to infrastructure architecture discussions
  • Document systems and share knowledge with team
  • Partner with development team on infrastructure requirements
  • Participate in security incident response
  • Begin mentoring junior team members as we grow

Requirements

Must-Haves

  • 2-4 years of infrastructure/DevOps/SRE experience
  • Hands-on experience with cloud platforms (GCP, AWS, or Azure)
  • Working knowledge of infrastructure-as-code tools
  • Container and orchestration experience (Docker, Kubernetes)
  • Proven ability to complete projects independently
  • Solid programming skills in at least one language (Python, Go, Bash)

Preferred Qualifications

  • Terraform experience
  • GCP/AWS certification
  • Experience with monitoring tools (Prometheus, Grafana, ELK)
  • Exposure to microservices architectures
  • Computer Science degree or equivalent

Our Tech Stack

  • Infrastructure: Kubernetes (GKE), Terraform, Kong API Gateway
  • Monitoring: Prometheus, Grafana, Elastic, Kibana, Mezmo, Falco
  • Languages: Kotlin, Python, JavaScript, Golang
  • Architecture: Microservices with Event Sourcing and CQRS
  • Database: MongoDB Atlas
  • CI/CD: Jenkins

Compensation & Benefits

  • Salary: Competitive market rate
  • 21 days paid leave
  • Flexible working hours
  • Fully remote position
  • High-quality hardware (MacBook Pro, 34" Dell monitor)
  • AI assistant subscriptions
  • High project autonomy
  • Clear growth path to Senior role

Why This Role?

Perfect for engineers ready to move beyond junior tasks but not yet requiring senior-level compensation. You'll own real projects, make meaningful decisions, and grow rapidly in a small team environment.

Senior Site Reliability Engineer - SprintHive

The Role

We're seeking a Senior Site Reliability Engineer to co-architect our infrastructure future. As the second senior SRE working directly with our CTO, you'll have exceptional influence over platform strategy and technical direction. This fully remote position is ideal for an expert seeking impact without bureaucracy.

What You'll Do

Strategic Leadership

  • Co-design long-term infrastructure vision and roadmap
  • Lead complex, multi-quarter infrastructure transformations
  • Define SRE standards, practices, and tooling strategies
  • Make architectural decisions balancing scale, cost, and complexity
  • Evaluate and introduce new technologies

Technical Excellence

  • Architect sophisticated infrastructure solutions using Terraform
  • Design zero-downtime deployment strategies and disaster recovery plans
  • Lead Kubernetes platform optimisation for scale and efficiency
  • Drive security architecture including identity management (OAuth2)
  • Build advanced automation and self-healing systems

Team & Culture Building

  • Mentor team members and establish engineering excellence standards
  • Lead incident response and drive blameless post-mortem culture
  • Interface with executive team on infrastructure strategy
  • Own vendor relationships and technology evaluations
  • Help build and scale the SRE team as we grow

Requirements

Must-Haves

  • 5+ years of SRE/DevOps experience in production environments
  • Expert-level knowledge of at least one major cloud provider
  • Advanced Terraform skills with large-scale infrastructure experience
  • Deep Kubernetes expertise including performance tuning and troubleshooting
  • Proven track record of leading infrastructure initiatives
  • Strong programming skills (Go, Python) for tooling development
  • Experience mentoring engineers and driving technical standards

Preferred Qualifications

  • Fintech or high-compliance environment experience
  • Multi-cloud architecture experience
  • Security certifications or demonstrated security expertise
  • Open source contributions to infrastructure tools
  • Experience scaling startups through rapid growth

Our Tech Stack

  • Infrastructure: Kubernetes (GKE), Terraform, Kong API Gateway
  • Monitoring: Prometheus, Grafana, Elastic, Kibana, Mezmo, Falco
  • Languages: Kotlin, Python, JavaScript, Golang
  • Architecture: Microservices with Event Sourcing and CQRS
  • Database: MongoDB Atlas
  • CI/CD: Jenkins

Compensation & Benefits

  • Salary: Competitive market rate
  • 21 days paid leave
  • Flexible working hours
  • Fully remote position
  • Premium hardware setup (MacBook Pro, 34" Dell monitor)
  • AI assistant subscriptions
  • Strategic influence at executive level
  • Budget authority for infrastructure decisions
  • Opportunity to build and lead growing SRE team

Why This Role?

Rare opportunity to join as a founding SRE member with CTO-level partnership. You'll have the authority of a Head of Infrastructure without the bureaucracy, the technical challenges of a scale-up without legacy constraints, and the influence to build the team and culture from scratch.

Site Reliability Engineer

R600000 - R1200000 Y Electrum Payments

Posted today

Job Description

Electrum is the next-generation payments technology company that provides cloud-native software to optimise the processing of financial transactions. Since 2012, we have established ourselves as a respected payments technology partner through our deep expertise and track record in delivering trusted enterprise-grade payments solutions.

We've built a reputation in providing solutions for high-volume, low-value payment schemes and services that enable our clients to deliver to their customers at scale. We love that the projects we work on touch the lives of millions of South Africans daily, making a real difference.

We hire the best of the best and we offer great opportunities for personal growth and career progression.

Responsibilities

Site Reliability Engineers (SREs) are responsible for monitoring, automating, and improving the reliability, scalability, performance and availability of our services. SREs work on tasks such as preventing incidents, managing infrastructure reliability, building effective monitoring systems and ensuring smooth operations of cloud production systems.

Requirements

Service Reliability and Availability

  • Collaborate with teams to develop reliable, available, and scalable applications.
  • Work closely with the development team to understand, address, and prevent technical issues.
  • Participate in on-call rotations and manage critical incidents.
  • Develop and maintain incident response processes and alerting mechanisms.
  • Develop and maintain tools to monitor application and service SLIs and SLOs.

System Troubleshooting and Problem Resolution

  • Diagnose and resolve infrastructure and system-level issues, ensuring minimal downtime and swift problem resolution.
  • Respond to and investigate incidents related to infrastructure and applications, utilising diagnostic tools to track down and remediate issues.
  • Participate in on-call rotations to provide 24/7 operational support as necessary.

Observability and Automation

  • Utilise technologies to develop and maintain effective log management and monitoring solutions for internal and external customers.
  • Evaluate system health, identify performance bottlenecks and proactively optimise performance and cost-effectiveness.
  • Implement automation tools and frameworks for deployment, configuration, and monitoring processes.
  • Capacity management and planning for systems to ensure continued reliability.

Process Improvements

  • Offer recommendations and improvements to enhance performance, security, and scalability.
  • Evaluate and integrate emerging technologies, cloud services and automation tools to improve operational efficiency.
  • Drive cost-optimization initiatives by identifying opportunities for resource right-sizing, efficiency and other cost-saving measures.

Disaster Recovery

  • Design and implement disaster recovery strategies, including backup and restoration processes, to ensure business continuity.
  • Develop and update incident management procedures, ensuring effective incident response by providing technical solutions and implementing preventative measures.
  • Regularly assess system performance, identify irregularities, troubleshoot issues, and ensure high system availability. This includes performing or facilitating Disaster Recovery tests.

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or related field preferred.
  • 3+ years experience in an SRE or similar role.
  • Familiarity with AWS services like EC2, S3, RDS, Lambda, EKS and CloudWatch.
  • Demonstrable experience with observability tools like Elastic and Grafana.
  • Development skills advantageous.
  • Proficient troubleshooting and problem-solving skills.
  • Excellent collaboration, communication, and time management skills.
  • Attention to detail and ability to work effectively in a team environment.

Benefits

A good work-life balance is very important at Electrum. To help you manage your own time and energy, Electrum offers benefits such as:

  • Flexibility around core working hours (nature of flexibility is negotiated per role based on business need)

  • Daily cooked lunches and a stocked kitchen for the mid-day nibbles

  • Team socialising, getaways, and social outings

We have created a safe, transparent environment where we know mistakes happen, and that's okay. We even have a 3-step approach to dealing with them:

  1. Tell everyone about it
  2. Fix the mistake
  3. Tell everyone about it

You are responsible for your actions – both the successes and the failures.

Site Reliability Engineer

R600000 - R1200000 Y Electrum

Posted today

Job Description

Electrum is the next-generation payments technology company that provides cloud-native software to optimise the processing of financial transactions. Since 2012, we have established ourselves as a respected payments technology partner through our deep expertise and track record in delivering trusted enterprise-grade payments solutions.

We've built a reputation in providing solutions for high-volume, low-value payment schemes and services that enable our clients to deliver to their customers at scale. We love that the projects we work on touch the lives of millions of South Africans daily, making a real difference.

We hire the best of the best and we offer great opportunities for personal growth and career progression.

Responsibilities

Site Reliability Engineers (SREs) are responsible for monitoring, automating, and improving the reliability, scalability, performance and availability of our services. SREs work on tasks such as preventing incidents, managing infrastructure reliability, building effective monitoring systems and ensuring smooth operations of cloud production systems.

Requirements

Service Reliability and Availability

  • Collaborate with teams to develop reliable, available, and scalable applications.
  • Work closely with the development team to understand, address, and prevent technical issues
  • Participate in on-call rotations and manage critical incidents
  • Develop and maintain incident response processes and alerting mechanisms
  • Develop and maintain tools to monitor application and service SLIs and SLOs

System Troubleshooting and Problem Resolution

  • Diagnose and resolve infrastructure and system-level issues, ensuring minimal downtime and swift problem resolution
  • Respond to and investigate incidents related to infrastructure and applications, utilising diagnostic tools to track down and remediate issues
  • Participate in on-call rotations to provide 24/7 operational support as necessary

Observability and Automation

  • Utilise technologies to develop and maintain effective log management and monitoring solutions for internal and external customers
  • Evaluate system health, identify performance bottlenecks and proactively optimise performance and cost-effectiveness
  • Implement automation tools and frameworks for deployment, configuration, and monitoring processes
  • Capacity management and planning for systems to ensure continued reliability

Process Improvements

  • Offer recommendations and improvements to enhance performance, security, and scalability
  • Evaluate and integrate emerging technologies, cloud services and automation tools to improve operational efficiency
  • Drive cost-optimization initiatives by identifying opportunities for resource right-sizing, efficiency and other cost-saving measures

Disaster Recovery

  • Design and implement disaster recovery strategies, including backup and restoration processes, to ensure business continuity
  • Develop and update incident management procedures, ensuring effective incident response by providing technical solutions and implementing preventative measures
  • Regularly assess system performance, identify irregularities, troubleshoot issues, and ensure high system availability. This includes performing or facilitating Disaster Recovery tests

Requirements

  • Bachelor's degree in Computer Science, Information Technology, or related field preferred
  • 3+ years experience in an SRE or similar role
  • Familiarity with AWS services like EC2, S3, RDS, Lambda, EKS and CloudWatch
  • Demonstrable experience with observability tools like Elastic and Grafana
  • Development skills advantageous
  • Proficient troubleshooting and problem-solving skills
  • Excellent collaboration, communication, and time management skills
  • Attention to detail and ability to work effectively in a team environment

Benefits

A good work-life balance is very important at Electrum. To help you manage your own time and energy, Electrum offers benefits such as:

  • Flexibility around core working hours (nature of flexibility is negotiated per role based on business need)
  • Daily cooked lunches and a stocked kitchen for the mid-day nibbles
  • Team socialising, getaways, and social outings

We have created a safe, transparent environment where we know mistakes happen, and that's okay. We even have a 3-step approach to dealing with them:

  1. Tell everyone about it
  2. Fix the mistake
  3. Tell everyone about it

You are responsible for your actions - both the successes and the failures.

Site Reliability Engineer

R250000 - R450000 Y Dariel

Posted today

Job Description

Site Reliability Engineer - Level 1

(MSP – Frontline Operations)

The Site Reliability Engineer (SRE) is the first responder for MSP customers, ensuring incident management, triage, and resolution within SLA-defined timelines.

As the front-line escalation point, this role diagnoses, troubleshoots, and resolves cloud infrastructure issues, escalating complex incidents to specialized teams (Senior SREs, SecOps, FinOps, or Cloud Engineering) when required.

Key Responsibilities

  • Act as the first responder for MSP client incidents, providing rapid troubleshooting, diagnosis, and resolution within SLA timelines.
  • Triage incoming issues to determine whether they can be resolved directly or require escalation to specialized teams.
  • Monitor cloud environments using AWS CloudWatch, Datadog, and FreshService to detect performance, security, and availability issues proactively.
  • Maintain incident records, documenting resolution steps, escalation timelines, and root causes.
  • Collaborate with Senior SREs and internal teams to ensure seamless incident resolution and escalation workflows.
  • Participate in post-incident reviews, contributing insights to improve operational efficiency and prevent future incidents.
  • Provide 24/7 on-call support for one full week per month, ensuring availability for high-priority incidents and urgent operational needs.

Required Skills & Experience

  • 2+ years of experience in cloud operations, incident management, or a similar role, with hands-on AWS experience.
  • Strong knowledge of AWS core services (compute, networking, storage, IAM).
  • Hands-on experience with monitoring and logging tools (e.g., AWS CloudWatch, Datadog, ELK stack, Prometheus, Grafana).
  • Understanding of Kubernetes (ability to support containerized applications) is a plus.
  • Familiarity with incident management systems (e.g., FreshService) for tracking and responding to alerts.
  • Excellent troubleshooting and problem-solving skills, able to quickly determine root causes of issues.
  • Strong organizational and multitasking abilities, especially under pressure in fast-paced operational environments.
  • Good communication skills, ensuring clear documentation and effective collaboration with technical teams.
  • Familiarity with ITIL-based service management workflows is a plus.

Required Qualifications

  • AWS Certified SysOps Administrator OR AWS Solutions Architect Associate (Required, non-negotiable).
  • Terraform Associate Certification or equivalent hands-on experience (Preferred, but not required for front-line troubleshooting).
  • Experience working in an MSP or cloud operations team (Preferred).
  • Ability to work independently, prioritize incidents effectively, and escalate when necessary.