27 Service Reliability Engineering Head jobs in Johannesburg
Site Reliability Engineer
Posted 20 days ago
Job Description
Monitoring and Alerting: Implementing and maintaining monitoring systems to track system health and performance, alerting on symptoms rather than just outages.
Incident Response: Responding to and resolving production incidents, troubleshooting across the entire stack, and providing support for product teams.
Automation: Developing and implementing automation to streamline operational tasks, improve efficiency, and reduce manual effort.
Infrastructure Management: Managing and maintaining infrastructure, including platforms
Performance Optimization: Identifying and addressing performance bottlenecks, optimizing existing systems, and contributing to system design and capacity planning.
Collaboration: Working closely with development, operations, and other teams to ensure smooth deployments and efficient operations.
Continuous Improvement: Continuously improving systems and processes through post-incident reviews, documentation, and knowledge sharing.
Proactive Problem Solving: Identifying potential problems before they occur and developing solutions to prevent future issues.
Capacity Planning: Ensuring that systems can handle current and future demands.
Mentoring and Coaching: Sharing knowledge and providing guidance to junior engineers.
Skills and Qualifications:
- Strong understanding of system architecture, automation, and infrastructure tools.
- Proficiency in programming languages like Python, Go, or Java.
- Experience with cloud platforms like AWS, Azure, or GCP.
- Familiarity with containerization technologies like Docker and Kubernetes.
- Experience with monitoring and alerting tools like Prometheus, Grafana, or New Relic.
- Strong problem-solving and troubleshooting skills.
- Excellent communication and collaboration skills.
- Ability to work independently and as part of a team.
Site Reliability Engineer
Posted today
Job Description
Location: North Riding, Johannesburg, South Africa
Type: Full-time
Office: Hybrid, 3 days in office a week
ABOUT YOU
A committed and capable Site Reliability Engineer (SRE) to take ownership of the uptime, performance, and scalability of our production and development systems. You will be responsible for managing the hosting environments of our ERP, customer platforms, internal applications, databases, and websites, ensuring they are secure, available, and optimised across all stages of deployment. This position is based in Johannesburg, offers a competitive salary, and provides an opportunity to build the foundations of infrastructure excellence for one of South Africa’s most promising fintech ventures.
As a Site Reliability Engineer, you will be the guardian of our technical stability and infrastructure performance. You will manage and optimise hosting environments across production and development instances, covering platforms like Odoo ERP, WhatsApp chatbot systems, APIs, internal tools, external facing websites and reporting databases. Your work ensures that the systems powering over 50 000 Sales Force members and thousands of end users remain resilient, scalable, and secure.
You will collaborate with engineers, product managers, and business teams to design infrastructure strategies, improve observability, manage deployments, respond to incidents, and drive continuous improvement. This is a rare opportunity to shape the infrastructure blueprint of a high growth, impact focused business from the ground up.
- Infrastructure Management
- Security & Uptime
- Automation & CI/CD
- Collaboration with Engineers
ABOUT US
Who we are and what we do.
Asuer is a fintech company committed to making life simpler and more secure for African communities through innovative financial and technology solutions. We operate across insurance and telecommunications, with plans to expand into digital payments. Our focus is on removing barriers and helping people achieve their goals.
Born from the ongoing digital transformation of Botle Buhle Brands (BBB), one of Africa’s leading direct-selling businesses, Asuer has grown into an independent company centred on financial inclusion and accessible technology. Everything we build is guided by our core values: Impact, Innovation, and Integrity.
- Managing and monitoring the infrastructure of our ERP systems, applications, APIs, and databases.
- Ensuring high availability and scalability of production environments and development pipelines.
- Administering cloud environments including deployments, rollbacks, and updates.
- Establishing and maintaining CI/CD workflows for rapid and safe deployments.
- Setting up monitoring, logging, and alerting systems to track system health and performance.
- Investigating and resolving production incidents in a timely and thorough manner.
- Implementing backup, recovery, and failover processes to ensure data integrity.
- Improving observability and reporting across environments and services.
- Hardening infrastructure security and enforcing access controls and best practices.
- Supporting development teams with staging, test, and release environments.
- Automating routine tasks to improve system efficiency and reduce human error.
- Experience managing Linux-based production environments, preferably on Ubuntu
- Strong proficiency in cloud hosting platforms such as AWS or Google Cloud
- Solid understanding of containerisation using Docker and orchestration tools
- Experience with CI/CD tools and pipeline automation
- Familiarity with infrastructure as code tools such as Terraform or Ansible
- Comfortable working with PostgreSQL and database administration best practices
- Networking, DNS, and load balancing
- Monitoring and alerting using tools like Grafana, Prometheus, or cloud native solutions
- Understanding of secure deployment practices including firewalls, SSL, and API rate limiting
- Set up and manage reliable and scalable hosting environments
- Diagnose and resolve incidents efficiently with minimal downtime
- Collaborate with software teams to enable faster and safer deployments
- Document infrastructure processes and maintain infrastructure knowledge bases
- Implement DevOps and SRE practices tailored to a fast moving startup context
- Build processes that are robust and scale as the company grows
- Balance performance, security, and simplicity in all infrastructure decisions
- Odoo hosting and maintenance workflows
- Hosting ERP systems, databases, and API driven platforms
- Securing web infrastructure and access credentials
- Optimising costs and performance in cloud environments
- Scripting and automation using Bash, Python, or similar
- Logging and system observability tools
- Fast recovery planning and disaster mitigation
- A tertiary qualification in Computer Science, Information Technology, or a related field
- Minimum of 3 years of experience in a systems administration, DevOps, or SRE role
- Strong problem solving, troubleshooting, and communication skills
- Proficiency in English reading, writing, and speaking
A BIT MORE ABOUT US
At Asuer, you’ll join a mission with real meaning, where your work empowers thousands of people across Africa. You’ll collaborate with smart, curious teammates who move fast and build with purpose, without the drag of legacy systems. We offer competitive pay, a flexible environment, and the autonomy to shape systems from the ground up. This is a place for real growth, where you scale products that matter and make a tangible impact every day.
Site Reliability Engineer
Posted 9 days ago
Job Description
We are looking for a proactive and detail-oriented Site Reliability Engineer to help bridge the gap between software development and IT operations, ensuring our systems are not only fast and reliable, but continuously improving.
As a Site Reliability Engineer (SRE), you’ll play a key role in ensuring the scalability, performance, and uptime of systems that support the company's global digital ecosystem.
Requirements:
- IT Degree and/or relevant qualifications
- 10+ years of experience in an SRE, DevOps engineer, or similar role, preferably in a technology-driven environment
- Strong understanding of networking fundamentals
- Skilled with AWS
- Proficiency in at least one programming language: Python, Go, or JavaScript/TypeScript
- Understanding of containerization (Docker) and orchestration principles
- Experience with monitoring and alerting systems
- Understanding of CI/CD principles
- Version control with Git
- Any additional responsibilities assigned in the Agile Working Model (AWM) Charter
- Advanced Kubernetes knowledge and certification (CKA/CKAD)
- Experience with the complete Grafana stack (Grafana, Loki, Tempo)
- Proficiency with GitOps tools (Flux, ArgoCD)
- Advanced programming skills in Go or TypeScript
- Knowledge of Terraform
- Database experience with PostgreSQL, MongoDB
The reference number for this position is GZ60580. This is a contract position based in Midrand / Centurion / semi-remote, offering a cost-to-company rate of R750 per hour, negotiable on experience and ability. Contact Garth to discuss this and other opportunities.
Are you ready for a change of scenery? e-Merge IT Recruitment is a specialist niche recruitment agency. We offer our candidates options so that we can successfully place the right developers with the right companies in the right roles. Check out the e-Merge website for more great positions.
Do you have a friend who is a developer or technology specialist? We pay cash for successful referrals!
Senior Site Reliability Engineer
Posted 9 days ago
Job Description
Our client in the manufacturing industry specialises in building premium vehicles; they engineer the future of mobility.
They are currently searching for a Senior Site Reliability Engineer. You will play a critical role in ensuring the availability, performance, and resilience of the company's digital platforms and connected services across the globe.
If you're passionate about automation, cloud infrastructure, and high-scale systems, join us in shaping what’s next — on the road and in technology.
Requirements:
- At least six years’ worth of experience using C# or similar MS technologies
- Experience in testing (manual or automated testing)
- Agile working experience advantageous
- Infrastructure Management (Cloud and on-Prem)
- High level of skill in Kubernetes, automation, and infrastructure
- Solid understanding of infrastructure as code principles and practical experience with Terraform or similar tools.
- Hands on experience with Docker, containerisation and microservices architecture
- Solid understanding of monitoring and alerting practices (tools such as Grafana, Prometheus, and Elasticsearch), including the ability to develop new application metrics.
- Any additional responsibilities assigned in the Agile Working Model (AWM) Charter
- Familiarity with CI/CD concepts and experience with GitOps tools like ArgoCD
- Experience with Unix/Linux operating systems internals and administration or in-depth knowledge of the Unix networking stack
- Problem Management and Incident Management – Proactive and Reactive
- Defect Management
- Change Management
- Optimise application performance.
- Strong understanding and troubleshooting skills of distributed services.
- Service Delivery Management
- Excellent communication skills and team-oriented work behaviour in a distributed team
- Software development background (C# experience)
- Strong ability to understand and interpret Business needs and requirements with the ability to move concepts through to proposal and finally successful implementation
- Confluence / Jira, DevOps
The reference number for this position is GZ60549. This is a contract position based in Midrand / Centurion / semi-remote, offering a contract rate of R650 per hour, negotiable on experience and ability. Contact Garth to discuss this and other opportunities.
Engineer, Site Reliability
Posted 6 days ago
Job Description
Business Segment: Business & Commercial Banking
Location: ZA, GP, Johannesburg, 3 Simmonds Street
Responsible for the resilience of Group Information Technology across the entire ecosystem of the bank by improving the availability, reliability, and performance of business-critical, customer-facing systems, whilst building sustainable capability. This complex task is delivered in conjunction with the CIO and CTO communities.
Qualifications
Type of Qualification: Post Graduate Degree
Field of Study: Information Studies
Type of Qualification: Post Graduate Degree
Field of Study: Information Technology
Experience Required
Software Engineering
Technology
8-10 years
Experience as a software engineer or operations engineer, using large-scale production systems and technologies. Experience in designing and executing small- to medium-scale systems automation projects with strong autonomy. Broad experience in translating business and functional requirements into technical specifications. Experience in engaging with delivery partners, both internal and external to the organisation, with a focus on optimising partner performance.
More than 10 years
Experience in transformational projects with a strong technology platform component, demonstrating the realisation of business objectives and affecting client experience. Experience in working with cross-functional business stakeholder groups to facilitate ideation and solution design, ensuring that initiatives have client and business relevance. Experience in ensuring the commercial viability of solutions and creating value for clients, shareholders, and the business.
Additional Information
- Adopting Practical Approaches
- Articulating Information
- Checking Things
- Developing Expertise
- Documenting Facts
- Examining Information
- Interpreting Data
- Managing Tasks
- Producing Output
- Taking Action
- Team Working
- Benefits Management
- IT Applications
- IT Systems
- Technical Analysis
- Use of Build and Test Automation
- Use of Version Control
- Splunk, Appdynamics, Dynatrace
- Python
- IaC (AWS CDK or Terraform)
Please note: All our recruitment processes comply with the applicable local laws and regulations.
We will never ask for money or any form of payment as part of our recruitment process. If you experience this, please contact our Fraud line on +27 800222050
Engineer, Site Reliability
Posted today
Job Description
Business Segment: Business & Commercial Banking
Location: ZA, GP, Johannesburg, 3 Simmonds Street
Responsible for the resilience of Group Information Technology across the entire ecosystem of the bank by improving the availability, reliability, and performance of business-critical, customer-facing systems, whilst building sustainable capability. This complex task is delivered in conjunction with the CIO and CTO communities.
Qualifications
Type of Qualification: Post Graduate Degree
Field of Study: Information Studies
Type of Qualification: Post Graduate Degree
Field of Study: Information Technology
Experience Required
Software Engineering
Technology
8-10 years
Experience as a software engineer or operations engineer, using large-scale production systems and technologies. Experience in designing and executing small- to medium-scale systems automation projects with strong autonomy. Broad experience in translating business and functional requirements into technical specifications. Experience in engaging with delivery partners, both internal and external to the organisation, with a focus on optimising partner performance.
More than 10 years
Experience in transformational projects with a strong technology platform component, demonstrating the realisation of business objectives and affecting client experience. Experience in working with cross-functional business stakeholder groups to facilitate ideation and solution design, ensuring that initiatives have client and business relevance. Experience in ensuring the commercial viability of solutions and creating value for clients, shareholders, and the business.
Additional Information
- Adopting Practical Approaches
- Articulating Information
- Checking Things
- Developing Expertise
- Documenting Facts
- Examining Information
- Interpreting Data
- Managing Tasks
- Producing Output
- Taking Action
- Team Working
- Benefits Management
- IT Applications
- IT Systems
- Technical Analysis
- Use of Build and Test Automation
- Use of Version Control
- Splunk, Appdynamics, Dynatrace
- Python
- IaC (AWS CDK or Terraform)
Please note: All our recruitment processes comply with the applicable local laws and regulations.
We will never ask for money or any form of payment as part of our recruitment process. If you experience this, please contact our Fraud line on +27 800222050
Platform / DevOps / Site Reliability Engineer
Posted 6 days ago
Job Description
Platform / DevOps / Site Reliability Engineer
Location: Remote but ideally based in Johannesburg, Cape Town, Durban
Company: Part of a large ICT group, this company offers globally available cloud services, solutions, and platforms for all. Their expertise empowers clients to adopt and migrate to any cloud, wherever they choose.
The purpose of the role is to create and manage platforms to guarantee the smooth operation of application systems. This involves aligning the planning, execution, and management of cloud infrastructure and software services with the overall business strategy. Collaborates with other teams to ensure the infrastructure remains dependable, scalable, and equipped to meet the evolving requirements of the applications.
Duties & Responsibilities
Requirements:
- 3-5+ years of DevOps / Site Reliability / Platform Engineer or System Administration experience in a software environment.
- Experience working with IaaS and public cloud platforms.
- Managing and provisioning infrastructure through code (IaC).
- Solid experience with and knowledge of Docker.
- Experience with VMware Cloud Services, Amazon Web Services, Microsoft Azure, or Google Cloud Platform.
- Solid experience with one or both of the following: HashiCorp and/or Kubernetes.
- Infrastructure-as-Code tools Terraform, Ansible – essential.
- CircleCI, GitHub Actions, GitLab CI, etc.
- Experience using version control tools such as Git and GitHub.
- Strong knowledge of security risks and mitigation thereof.
- MySQL, Postgres database administration.
Strong skills in the following:
- Network routing and core principles.
- Solid experience and good knowledge working with Linux containers and virtual machines.
- Knowledge of cloud platform environments.
- Experience within a software development environment including a good understanding of software development principles.
- Good knowledge of infrastructure-as-code and automation.
- Solid experience in Unix/Linux administration.
- Experience in Linux container orchestration.
R 35 000 - R 65 000 - Monthly
Site Reliability Engineer (Expert) 0630
Posted 5 days ago
Job Description
- Work on high-availability, multi-region deployments
- Shape our observability strategy and implement automation at scale
- Collaborate with development teams to enhance service reliability
- Lead incident response and drive systematic improvements
- 10+ years in SRE, DevOps, or similar roles
- Strong networking fundamentals
- Skilled with AWS and cloud-native technologies
- Proficiency in Python, Go, or JavaScript/TypeScript
- Experience with Docker, Kubernetes, CI/CD, and GitOps (Flux/ArgoCD)
- Knowledge of monitoring tools (Grafana, Prometheus, Loki, Tempo)
- Advanced Kubernetes certification (CKA/CKAD)
- Experience with Terraform, PostgreSQL, MongoDB
- Expertise in performance optimization & cost management
- Security hardening & compliance implementation
- Containerization: Kubernetes, Docker
- Observability: Grafana Stack, Prometheus
- Infrastructure: Cloud-native technologies
- Programming: Go, Python, TypeScript/JavaScript
- CI/CD: Modern pipeline tools
- Multi-region deployments & microservices architecture
- System Reliability: Design and implement scalable infrastructure solutions
- Observability: Architect and maintain monitoring & alerting systems
- Automation: Develop automated workflows to reduce manual effort
- Incident Management: Lead major incident response and drive improvements
- Technical Leadership: Mentor team members and influence engineering decisions
- Tool Development: Build internal tools to enhance operational efficiency
- Best Practices: Establish and enforce SRE methodologies
Ready to take on this challenge? Apply now with your latest and detailed CV!
Technology/Domain Specialist II (Site Reliability Engineer)
Posted 18 days ago
Job Description
Details
Location:
Johannesburg, ZA
Reference: 140754
Job Classification: 140754 - Technology Domain Specialist (Site Reliability Engineer)
Closing date - 10 July 2025
Job Family: Information Technology
Application Development
Manage Self: Technical
Job Purpose: To actively own and participate in the overall evolution of the Technology or Domain asset while influencing and maintaining the health of the asset. Play a leadership role in the associated COEs.
- Collaborating with stakeholders, engineers, and operational SMEs to ensure all relevant parties are up to date with what is top of mind within the reliability service offerings
- Evolve services based on customer needs and technology to ensure we remain competitive in the market
- Influence and collaborate with squads during service or platform design to proactively prevent system failures and enhance performance
- Engage with Asset/Journey squads to adopt SRE practices with a core focus to contribute towards incident management and advocate for blameless post mortems.
- Engage and influence squads with regard to observability and high availability, utilising new or existing technology, and improve disaster recovery plans.
- Implement automation-based solutions to achieve high availability and efficiency, reduce cost, and improve system performance.
- Coach squads on best practices within the organisation via internal forums to position SRE fundamental knowledge and promote enterprise-wide knowledge sharing
- Assist with creating and maintaining system health and performance metrics reflecting real-time data, enabling proactive resolution and faster troubleshooting.
- Collaborate and partner with the DevOps engineer/coach to ensure efficient CI/CD pipelines, resolving failures and making improvements.
- Take charge of technical leadership, engage with squads to identify the best solutions, and support and guide junior SREs.
- Assist in defining and implementing metrics related to the performance of services, such as SLOs and SLIs.
- Defining and delivering Site Reliability Engineering technical standards in partnership with all disciplines of software engineering.
- Participate in and work closely with relevant COEs to improve the release of new features and shorten time to market.
- Ability to build and maintain strategic relationships with the business units and vendors in order to be in sync on current ways of work and business decisions that are being embraced
- Conduct assessments within squads to measure SRE maturity, provide a report, and outline a plan to help squads move to the next level, with continuous feedback.
- Adhere and comply with Nedbank group information management, data integrity and security policies and best practices.
- Participate and support corporate responsibility initiatives for the achievement of business strategy.
- Manage multiple concurrent objectives, projects, groups, or activities, making effective judgements as to prioritisation and time allocation
- Working Experience of Operating System (Linux or Windows)
- Knowledgeable with microservices and containerization; K8s or Docker
- Troubleshooting and root cause analysis
- SRE Best practices
- In-depth knowledge of DevOps framework
- Experience and knowledge of programming languages (C#, Java, Python, Bash)
- Proactivity in seeking improvement opportunities
- Experience with troubleshooting production systems/applications
- Advanced Diplomas/National 1st Degrees
- Professional Qualifications/Honour’s Degree
Degree or Diploma in IT
Preferred Certifications: Certificate in relevant Technology or Domain
Minimum Experience Level: Min 5 years IT experience with 3 years in relevant technology or domain
Technical / Professional Knowledge
- Asset management
- Data Warehousing
- Information Technology (IT) Architecture
- Decision Making
- Courage
- Stress Tolerance
- Quality Orientation
- Technical/Professional Knowledge and Skills
- Resolving Conflict
---
Please contact the Nedbank Recruiting Team at +27 860 555 566
Nedbank Ltd Reg No 1951/0009/06.
Authorised financial services and registered credit provider (NCRCP16).
Quality & Reliability Engineer (QRE)
Posted 5 days ago
Job Description
WatersEdge Solutions is hiring a Quality & Reliability Engineer (QRE) to lead the charge on software stability, release quality, and development tooling for a global incentive technology platform. If you thrive in the space between engineering, QA, and operations—this is your opportunity to own quality at scale.
About the Role
Reporting to the Tech Lead, you’ll play a central role in building, maintaining, and improving CI/CD pipelines, test automation, observability, and reliability metrics. You’ll collaborate across functions to ensure stable, secure, and high-quality releases, with a strong focus on developer experience and platform integrity.
Key Responsibilities
- Maintain and enhance CI/CD pipelines (GitHub Actions, Heroku)
- Automate quality gates: tests, linting, coverage, type-checks
- Own release processes: staging, production deployments, rollbacks, feature flags
- Monitor production health via logs, Sentry, performance dashboards
- Track SLIs/SLOs and report on metrics like MTTR, CFR, and defect escape rates
- Support incident response and prepare operational runbooks
- Maintain test automation infrastructure and flaky test backlog
- Collaborate with devs to ensure testable, regression-resistant code
- Integrate security checks into pipelines and uphold compliance standards (e.g. SOC 2)
What You’ll Bring
- Strong CI/CD experience (GitHub Actions, CircleCI, GitLab CI)
- Familiarity with cloud platform pipelines (e.g., Heroku)
- Proficient in Python and shell scripting (Django a bonus)
- Experience managing automated test frameworks
- Comfort with observability tools like Sentry and log monitors
- Proven track record in delivering secure, reliable SaaS or FinTech systems
Nice to Have
- Experience with feature flags (e.g., LaunchDarkly, Unleash)
- Knowledge of SOC 2 / ISO 27001 controls in CI/CD
- Exposure to data privacy and multi-tenant architecture
- Experience conducting post-mortems and tracking incident actions
What’s On Offer
- Competitive compensation
- Full ownership of the CI/CD and reliability function within a fast-scaling SaaS product
- High autonomy and low red tape environment
- A mission-driven platform enabling financial equity globally
Company Culture
We value pragmatism, systems thinking, and a passion for enabling others through great tooling. You’ll join a smart, humble, and impact-driven team working in a no-blame culture where quality is everyone’s responsibility. If you believe tests are leverage, incidents are learning opportunities, and performance is about velocity with stability—this is your place.
If you have not been contacted within 10 working days, please consider your application unsuccessful.