110 SRE Manager jobs in South Africa
Site Reliability Engineering (SRE) Lead
Posted 10 days ago
Job Description
Job Overview
As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.
You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.
Key Focus Areas
You will primarily be responsible for:
- Team Management and Growth:
  - Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
- Collaboration:
  - Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
- Infrastructure Reliability and Performance:
  - Monitor, measure, and improve the reliability and performance of our systems.
  - Identify and address bottlenecks, optimise system performance, and implement strategies for scaling infrastructure to meet growing demands.
  - Perform maintenance, upgrades, and security updates.
- Automation and Tooling:
  - Design and develop software and scripts that automate and streamline various aspects of infrastructure and operations.
  - Assist other teams with deployment and updates of their applications and services.
- Administration:
  - Administer our infrastructure accounts and critical services, providing strategic oversight for our hosting infrastructure and vendor relationships.
  - Own the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimisation strategies, ensuring financial efficiency and predictability across our platforms.
- Innovation:
  - Research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proofs of concept and prototypes to demonstrate their feasibility and value.
- Data Management and Security:
  - Ensure data, security, and infrastructure policies and best practices are adhered to.
  - Work with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access, ensuring compliance with data privacy regulations.
  - Implement and monitor security measures to protect sensitive health information.
  - Manage data backups and disaster recovery.
As a lead in the Engineering department, you will contribute to key focus areas in the following ways:
Technical Leadership
- Lead on architectural design and actively manage technical risk/debt against project goals.
- Describe and analyse major technical tradeoffs and decisions, and convince others of them.
Thought Leadership
- Present at relevant conferences, webinars and other opportunities to showcase Reach.
Workflow
- Take ownership of team and technical documentation.
- Ensure your team adopts and follows process & workflow best practices.
Delivery
- Take responsibility for risks in your team’s work.
- Take initiative to identify problems and propose solutions to resolve them.
Strategy
- Provide input into the organisation's technology strategy.
- Play an active role in meeting engineering team KPIs.
Ways of Working
- Suggest and implement improvements to current ways of working.
- Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).
Communication (Internal)
- Share ideas, decisions and plans effectively within Engineering and across the organisation.
- Build relationships with other leads and heads across the organisation to support strong cross-functional teams.
Communication (External)
- Take ownership of external engagements and respond timeously to stakeholders.
Team Management
- Ensure you’re delegating the right things effectively, so that you can work at the right level.
- Identify and support opportunities for growth within your team.
Partnerships & Growth
- Lead on technical proposals and concept notes.
- Pursue opportunities for new partnerships and services.
People Operations
- Proactively highlight gaps and skills needed in Engineering.
- Draft job descriptions and drive recruitment for relevant roles.
Responsibilities and Duties
- Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
- Establish and enforce SRE best practices to improve system reliability and operational efficiency.
- Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
- Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
- Conduct performance reviews, set goals, and facilitate professional development for team members.
- Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
- Monitor system health, analyse trends, and implement proactive measures to prevent incidents.
- Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
- Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
- Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
- Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team, and assist other engineering teams with the same.
- Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.
Qualifications
- An honours degree in Computer Science or Engineering, or equivalent experience.
- 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
- 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.
Skills and Experience Required
- Proficient in one or more programming languages, such as Python, Go, Java, or C++.
- Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
- Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
- Proficient in one or more UNIX-like operating systems.
- Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
- Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
- Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
- Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
- Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
- Proficient in one or more version control and collaboration tools, such as Git.
- Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
- Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
- Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
- Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
- Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
- Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
- Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.
Site Reliability Engineering (SRE) Lead
Posted today
Job Viewed
Job Description
Job Overview
As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.
You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.
Key Focus Areas
You will primarily be responsible for:
- Team Management and Growth:
- Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
- Collaboration:
- Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
- Infrastructure reliability and performance:
- Monitoring, measuring, and improving the reliability and performance of our systems
- Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
- Maintenance, upgrades, and security updates
- Automation and tooling:
- You will design and develop software and scripts that automate and streamline various aspects of infrastructure and operations
- Assisting other teams with deployment and updates of their applications and services.
- Administration:
- Administration of our infrastructure accounts and critical services, providing strategic oversight for our hosting infrastructure and vendor relationships. Owns the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimization strategies, ensuring financial efficiency and predictability across our platforms.
- Innovation:
- You will research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proof-of-concepts and prototypes to demonstrate their feasibility and value.
- Data Management and Security:
- Ensure data, security and infrastructure policies and best practices are adhered to, working with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access to ensure compliance with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.
As a lead in the Engineering department, you will contribute to key focus areas in the following ways:
Technical Leadership
- Lead on architectural design and actively manage technical risk/debt against project goals.
- Able to describe, analyse, and convince others about major technical tradeoffs and decisions.
Thought Leadership
- Present at relevant conferences, webinars and other opportunities to showcase Reach.
Workflow
- Take ownership for team and technical documentation.
- Ensure your team adopts and follows process & workflow best practices.
Delivery
- Take responsibility for risks with your team’s work.
- Take initiative to identify problems and propose solutions to resolve them.
Strategy
- Provide input into the organisation's technology strategy.
- Play an active role in meeting engineering team KPIs.
Ways of Working
- Suggest and implement improvements to current ways of working.
- Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).
Communication (Internal)
- Share ideas, decisions and plans effectively within Engineering and across the organisation.
- Build relationships with other leads and heads across the organisation to support strong cross-functional teams.
Communication (External)
- Take ownership of external engagements and respond timeously to stakeholders.
Team Management
- Ensure you’re delegating the right things effectively, so that you can work at the right level.
- Identify and support opportunities for growth within your team.
Partnerships & Growth
- Lead on technical proposals and concept notes
- Pursue opportunities for new partnerships and services.
People Operations
- Proactively highlight gaps and skills needed in Engineering.
- Draft job descriptions and drive recruitment for relevant roles.
Responsibilities and Duties
- Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
- Establish and enforce SRE best practices to improve system reliability and operational efficiency.
- Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
- Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
- Conduct performance reviews, set goals, and facilitate professional development for team members.
- Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
- Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
- Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
- Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
- Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
- Perform code reviews, testing and debugging and troubleshooting of the software and tools developed by the SRE team and assist other engineering teams with the same.
- Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.
Qualifications
- An honours degree in Computer Science or Engineering or equivalent experience.
- 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
- 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.
Skills and Experience Required
- Proficient in one or more programming languages, such as Python, Go, Java, or C++.
- Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
- Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
- Proficient in one or more UNIX-like operating systems.
- Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
- Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
- Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
- Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
- Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
- Proficient in one or more version control and collaboration tools, such as Git.
- Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
- Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
- Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
- Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
- Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
- Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
- Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.
Cloud Infrastructure Manager
Posted 14 days ago
Job Description
Cloud Infrastructure Manager
Mukuru is on the lookout for a Cloud Infrastructure Manager to lead and scale the engineering capability behind our cloud platforms. This role forms a key part of the Technology Solutions leadership team, sitting alongside our Cloud Architect and reporting to the Head of Technology Solutions. If you are based in Johannesburg, Pretoria, Cape Town or anywhere in South Africa, feel free to apply!
At Mukuru, we don’t just build infrastructure – we build the backbone of meaningful, borderless financial access. You’ll lead a dynamic team of Cloud Engineers, SREs, and DevSecOps professionals, empowering them to deliver secure, scalable, and reliable infrastructure that fuels real-world impact across the continent.
What You’ll Be Doing
Lead, mentor, and grow a high-performing team of Cloud Engineers, SREs, and DevSecOps experts who thrive on innovation and impact.
Drive the execution of our Cloud Infrastructure roadmap, aligning with Mukuru’s strategic platform and business goals.
Take ownership of Mukuru’s AWS-based cloud environments – defined as infrastructure-as-code with Terraform and containerised with Kubernetes – ensuring performance, cost-efficiency, and resilience at every stage of the SDLC.
Champion DevOps culture across engineering, fostering collaboration, shared ownership, and continuous delivery practices.
Ensure uptime and recovery goals are met, and oversee compliance with RTOs, RPOs, patching, monitoring, and alerting standards.
Partner closely with our Cloud Architect to deliver well-architected, observable, and cost-optimised infrastructure solutions.
Collaborate across the business – from Platform Engineering and Product, to Software Engineering, Security, and Governance – to enable cross-functional success.
Implement and maintain controls aligned to compliance frameworks like PCI-DSS and ISO27001.
Build and manage tooling across CI/CD, observability, documentation, and cloud cost monitoring.
Drive innovation by enabling teams with the right tools, autonomy, and environment to experiment, iterate, and deliver value.
Manage team capacity, budgets, and ongoing capability growth in line with emerging technologies and business needs.
What You’ll Need to Succeed
Grade 12 or equivalent (essential)
Related tertiary qualification (desirable)
7+ years in DevOps, Cloud Infrastructure, or Systems Engineering
3+ years in a leadership or technical management role
Proven experience designing and operating production-grade AWS environments
Expertise in Terraform, Kubernetes, Linux systems, CI/CD, and incident management
Experience in high-growth sectors like Fintech, Retail, or Technology
Familiarity with cloud cost optimisation, security, and governance best practices
Experience with regulatory frameworks like PCI-DSS and ISO27001
A working knowledge of agile practices and a strong DevOps mindset
What Sets You Apart
You empower engineers, celebrate delivery, and lead with empathy
You bring calm, clarity, and focus during complex situations
You love turning complex challenges into elegant, automated solutions
You value collaboration and communicate clearly across tech and business lines
You’re passionate about infrastructure that’s secure, observable, and scalable
You care about creating long-term value, not just short-term fixes
Why Mukuru?
Be part of a fast-scaling, purpose-driven fintech making real change across Africa.
Join teams that thrive on collaboration, impact, and innovation.
Embrace a culture of diversity, continuous learning, and inclusion.
Work in an environment that values your growth, your voice, and your ideas.
Not 100% sure you tick every box?
That’s okay! We believe that passion, potential, and purpose go a long way. If this role excites you and you believe you can contribute to our mission, we’d love to hear from you. Tell us how you’d add value at Mukuru – your future team is waiting.
I am sure you are reading this job description and meet the majority of the criteria, BUT you may still not be 100% comfortable applying. We believe that there is a place for everyone under the Mukuru sun and we want YOU to contribute to our diverse tapestry of talent. So come on, take a leap of faith, and send your application if you meet the majority of our requirements. Remember to include a snippet of how you will bring value and help us build a future of success; this will help us determine where and how you may best be suited. Maybe you are just the future Mukurian we need!
Should you be appointed in a remote/work from home role at Mukuru, it is your responsibility to ensure that you have uninterrupted internet connectivity and a ‘work-like’ environment at your home location, in order to deliver your best in terms of performance, productivity and service to our customers.
If you do not receive any response after two weeks, please consider your application unsuccessful.
NB: ALL STAFF APPOINTMENTS WILL BE MADE WITH DUE CONSIDERATION OF THE COMPANY’S DIVERSITY AND INCLUSION PLANS
Cloud infrastructure manager
Posted today