110 SRE Manager jobs in South Africa

Site Reliability Engineering (SRE) Lead

Rippleworks

Posted 10 days ago

Job Description

Job Overview

As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.

You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.

Key Focus Areas

You will primarily be responsible for:

  • Team Management and Growth:
    • Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
  • Collaboration:
    • Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
  • Infrastructure reliability and performance:
    • Monitor, measure, and improve the reliability and performance of our systems.
    • Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
    • Perform maintenance, upgrades, and security updates.
  • Automation and tooling:
    • Design and develop software and scripts that automate and streamline infrastructure and operations.
    • Assist other teams with deploying and updating their applications and services.
  • Administration:
    • Administer our infrastructure accounts and critical services, and provide strategic oversight of our hosting infrastructure and vendor relationships. Own the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimization strategies, to ensure financial efficiency and predictability across our platforms.
  • Innovation:
    • Research and evaluate new technologies and methodologies that can enhance our systems and processes, and build proofs of concept and prototypes to demonstrate their feasibility and value.
  • Data Management and Security:
    • Ensure data, security, and infrastructure policies and best practices are adhered to. Work with the Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access that comply with data privacy regulations. Implement and monitor security measures to protect sensitive health information, and manage data backups and disaster recovery.

As a lead in the Engineering department, you will contribute to key focus areas in the following ways:

Technical Leadership

  • Lead on architectural design and actively manage technical risk/debt against project goals.
  • Describe and analyse major technical tradeoffs and decisions, and persuade others of them.

Thought Leadership

  • Present at relevant conferences, webinars and other opportunities to showcase Reach.

Workflow

  • Take ownership of team and technical documentation.
  • Ensure your team adopts and follows process & workflow best practices.

Delivery

  • Take responsibility for risks in your team’s work.
  • Take initiative to identify problems and propose solutions to resolve them.

Strategy

  • Provide input into the organisation's technology strategy.
  • Play an active role in meeting engineering team KPIs.

Ways of Working

  • Suggest and implement improvements to current ways of working.
  • Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).

Communication (Internal)

  • Share ideas, decisions and plans effectively within Engineering and across the organisation.
  • Build relationships with other leads and heads across the organisation to support strong cross-functional teams.

Communication (External)

  • Take ownership of external engagements and respond timeously to stakeholders.

Team Management

  • Ensure you’re delegating the right things effectively, so that you can work at the right level.
  • Identify and support opportunities for growth within your team.

Partnerships & Growth

  • Lead on technical proposals and concept notes.
  • Pursue opportunities for new partnerships and services.

People Operations

  • Proactively highlight gaps and skills needed in Engineering.
  • Draft job descriptions and drive recruitment for relevant roles.

Responsibilities and Duties

  • Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
  • Establish and enforce SRE best practices to improve system reliability and operational efficiency.
  • Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
  • Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
  • Conduct performance reviews, set goals, and facilitate professional development for team members.
  • Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
  • Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
  • Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
  • Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
  • Review, test, debug, and troubleshoot the software and tools developed by the SRE team, and assist other engineering teams with the same.
  • Suggest and implement improvements to current ways of working and processes, including closing gaps in those processes, where relevant to the current and future success of the SRE team and Reach as a whole.

Qualifications

  • An honours degree in Computer Science or Engineering or equivalent experience.
  • 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
  • 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.

Skills and Experience Required

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
  • Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
  • Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
  • Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
  • Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
  • Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.

Site Reliability Engineering (SRE) Lead

Cape Town, Western Cape Reach Digital Health

Posted 12 days ago


Site Reliability Engineering (SRE) Lead

Johannesburg, Gauteng Reach Digital Health

Posted 12 days ago


Site Reliability Engineering (SRE) Lead

Johannesburg, Gauteng Reach Digital Health

Posted today

Job Viewed

Tap Again To Close

Job Description

Job Overview

As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.

You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.

Key Focus Areas

You will primarily be responsible for:

  • Team Management and Growth:
    • Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
  • Collaboration:
    • Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
  • Infrastructure reliability and performance:
    • Monitoring, measuring, and improving the reliability and performance of our systems
    • Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
    • Maintenance, upgrades, and security updates
  • Automation and tooling:
    • You will design and develop software and scripts that automate and streamline various aspects of infrastructure and operations
    • Assisting other teams with deployment and updates of their applications and services.
  • Administration:
    • Administration of our infrastructure accounts and critical services, providing strategic oversight for our hosting infrastructure and vendor relationships. Owns the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimization strategies, ensuring financial efficiency and predictability across our platforms.
  • Innovation:
    • You will research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proof-of-concepts and prototypes to demonstrate their feasibility and value.
  • Data Management and Security:
    • Ensure data, security and infrastructure policies and best practices are adhered to, working with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access to ensure compliance with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.

As a lead in the Engineering department, you will contribute to key focus areas in the following ways:

Technical Leadership

  • Lead on architectural design and actively manage technical risk/debt against project goals.
  • Able to describe, analyse, and convince others about major technical tradeoffs and decisions.

Thought Leadership

  • Present at relevant conferences, webinars and other opportunities to showcase Reach.

Workflow

  • Take ownership for team and technical documentation.
  • Ensure your team adopts and follows process & workflow best practices.

Delivery

  • Take responsibility for risks with your team’s work.
  • Take initiative to identify problems and propose solutions to resolve them.

Strategy

  • Provide input into the organisation's technology strategy.
  • Play an active role in meeting engineering team KPIs.

Ways of Working

  • Suggest and implement improvements to current ways of working.
  • Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).

Communication (Internal)

  • Share ideas, decisions and plans effectively within Engineering and across the organisation.
  • Build relationships with other leads and heads across the organisation to support strong cross-functional teams.

Communication (External)

  • Take ownership of external engagements and respond timeously to stakeholders.

Team Management

  • Ensure you’re delegating the right things effectively, so that you can work at the right level.
  • Identify and support opportunities for growth within your team.

Partnerships & Growth

  • Lead on technical proposals and concept notes
  • Pursue opportunities for new partnerships and services.

People Operations

  • Proactively highlight gaps and skills needed in Engineering.
  • Draft job descriptions and drive recruitment for relevant roles.

Responsibilities and Duties

  • Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
  • Establish and enforce SRE best practices to improve system reliability and operational efficiency.
  • Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
  • Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
  • Conduct performance reviews, set goals, and facilitate professional development for team members.
  • Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
  • Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
  • Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
  • Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
  • Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team and assist other engineering teams with the same.
  • Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.

Qualifications

  • An honours degree in Computer Science or Engineering or equivalent experience.
  • 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
  • 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.

Skills and Experience Required

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
  • Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
  • Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
  • Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
  • Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
  • Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.

Site Reliability Engineering (SRE) Lead

Cape Town, Western Cape Reach Digital Health

Posted today

Job Description

Job Overview

As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.

You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.

Key Focus Areas

You will primarily be responsible for:

  • Team Management and Growth:
    • Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
  • Collaboration:
    • Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
  • Infrastructure reliability and performance:
    • Monitoring, measuring, and improving the reliability and performance of our systems
    • Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
    • Maintenance, upgrades, and security updates
  • Automation and tooling:
    • You will design and develop software and scripts that automate and streamline various aspects of infrastructure and operations
    • Assisting other teams with deployment and updates of their applications and services.
  • Administration:
    • Administration of our infrastructure accounts and critical services, providing strategic oversight for our hosting infrastructure and vendor relationships. Owns the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimization strategies, ensuring financial efficiency and predictability across our platforms.
  • Innovation:
    • You will research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proof-of-concepts and prototypes to demonstrate their feasibility and value.
  • Data Management and Security:
    • Ensure data, security and infrastructure policies and best practices are adhered to, working with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access to ensure compliance with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.

As a lead in the Engineering department, you will contribute to key focus areas in the following ways:

Technical Leadership

  • Lead on architectural design and actively manage technical risk/debt against project goals.
  • Able to describe, analyse, and convince others about major technical tradeoffs and decisions.

Thought Leadership

  • Present at relevant conferences, webinars and other opportunities to showcase Reach.

Workflow

  • Take ownership for team and technical documentation.
  • Ensure your team adopts and follows process & workflow best practices.

Delivery

  • Take responsibility for risks with your team’s work.
  • Take initiative to identify problems and propose solutions to resolve them.

Strategy

  • Provide input into the organisation's technology strategy.
  • Play an active role in meeting engineering team KPIs.

Ways of Working

  • Suggest and implement improvements to current ways of working.
  • Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).

Communication (Internal)

  • Share ideas, decisions and plans effectively within Engineering and across the organisation.
  • Build relationships with other leads and heads across the organisation to support strong cross-functional teams.

Communication (External)

  • Take ownership of external engagements and respond timeously to stakeholders.

Team Management

  • Ensure you’re delegating the right things effectively, so that you can work at the right level.
  • Identify and support opportunities for growth within your team.

Partnerships & Growth

  • Lead on technical proposals and concept notes.
  • Pursue opportunities for new partnerships and services.

People Operations

  • Proactively highlight gaps and skills needed in Engineering.
  • Draft job descriptions and drive recruitment for relevant roles.

Responsibilities and Duties

  • Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
  • Establish and enforce SRE best practices to improve system reliability and operational efficiency.
  • Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
  • Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
  • Conduct performance reviews, set goals, and facilitate professional development for team members.
  • Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
  • Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
  • Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
  • Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
  • Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team and assist other engineering teams with the same.
  • Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.

Qualifications

  • An honours degree in Computer Science or Engineering or equivalent experience.
  • 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
  • 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.

Skills and Experience Required

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
  • Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
  • Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
  • Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
  • Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
  • Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.

Site Reliability Engineering (SRE) Lead

Johannesburg, Gauteng Reach Digital Health

Posted today

Job Description

permanent
Job Overview

As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.

You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.

Key Focus Areas

You will primarily be responsible for:

  • Team Management and Growth:
    • Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
  • Collaboration:
    • Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
  • Infrastructure reliability and performance:
    • Monitoring, measuring, and improving the reliability and performance of our systems
    • Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
    • Maintenance, upgrades, and security updates
  • Automation and tooling:
    • You will design and develop software and scripts that automate and streamline various aspects of infrastructure and operations
    • Assisting other teams with deployment and updates of their applications and services.
  • Administration:
    • Administration of our infrastructure accounts and critical services, providing strategic oversight for our hosting infrastructure and vendor relationships. Owns the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimization strategies, ensuring financial efficiency and predictability across our platforms.
  • Innovation:
    • You will research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proof-of-concepts and prototypes to demonstrate their feasibility and value.
  • Data Management and Security:
    • Ensure data, security and infrastructure policies and best practices are adhered to, working with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access to ensure compliance with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.

As a lead in the Engineering department, you will contribute to key focus areas in the following ways:

Technical Leadership

  • Lead on architectural design and actively manage technical risk/debt against project goals.
  • Able to describe, analyse, and convince others about major technical tradeoffs and decisions.

Thought Leadership

  • Present at relevant conferences, webinars and other opportunities to showcase Reach.

Workflow

  • Take ownership for team and technical documentation.
  • Ensure your team adopts and follows process & workflow best practices.

Delivery

  • Take responsibility for risks with your team’s work.
  • Take initiative to identify problems and propose solutions to resolve them.

Strategy

  • Provide input into the organisation's technology strategy.
  • Play an active role in meeting engineering team KPIs.

Ways of Working

  • Suggest and implement improvements to current ways of working.
  • Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).

Communication (Internal)

  • Share ideas, decisions and plans effectively within Engineering and across the organisation.
  • Build relationships with other leads and heads across the organisation to support strong cross-functional teams.

Communication (External)

  • Take ownership of external engagements and respond timeously to stakeholders.

Team Management

  • Ensure you’re delegating the right things effectively, so that you can work at the right level.
  • Identify and support opportunities for growth within your team.

Partnerships & Growth

  • Lead on technical proposals and concept notes.
  • Pursue opportunities for new partnerships and services.

People Operations

  • Proactively highlight gaps and skills needed in Engineering.
  • Draft job descriptions and drive recruitment for relevant roles.

Responsibilities and Duties

  • Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
  • Establish and enforce SRE best practices to improve system reliability and operational efficiency.
  • Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
  • Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
  • Conduct performance reviews, set goals, and facilitate professional development for team members.
  • Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
  • Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
  • Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
  • Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
  • Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team and assist other engineering teams with the same.
  • Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.

Qualifications

  • An honours degree in Computer Science or Engineering or equivalent experience.
  • 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
  • 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.

Skills and Experience Required

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
  • Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
  • Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
  • Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
  • Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
  • Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.

Site Reliability Engineering (SRE) Lead

Cape Town, Western Cape Reach Digital Health

Posted today

Job Description

permanent
Job Overview

As Lead Site Reliability Engineer, you will lead the SRE team in applying software engineering principles and practices to infrastructure and operations problems, and design and implement solutions that automate and improve the availability, scalability, and efficiency of our systems. You will also collaborate with other engineering teams, project teams, and other stakeholders to deliver high-quality products and services that meet our customers’ needs and expectations.

You’ll be exposed to unique challenges assisting with the maintenance and stability of Reach infrastructure and services, and have the opportunity to contribute to internal and external projects. We believe in choosing the right tools for the job, and support creativity in solving problems.

Key Focus Areas

You will primarily be responsible for:

  • Team Management and Growth:
    • Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
  • Collaboration:
    • Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
  • Infrastructure reliability and performance:
    • Monitoring, measuring, and improving the reliability and performance of our systems
    • Identify and address bottlenecks, optimize system performance, and implement strategies for scaling infrastructure to meet growing demands.
    • Maintenance, upgrades, and security updates
  • Automation and tooling:
    • You will design and develop software and scripts that automate and streamline various aspects of infrastructure and operations
    • Assisting other teams with deployment and updates of their applications and services.
  • Administration:
    • Administration of our infrastructure accounts and critical services, providing strategic oversight for our hosting infrastructure and vendor relationships. Owns the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimization strategies, ensuring financial efficiency and predictability across our platforms.
  • Innovation:
    • You will research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proof-of-concepts and prototypes to demonstrate their feasibility and value.
  • Data Management and Security:
    • Ensure data, security and infrastructure policies and best practices are adhered to, working with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access to ensure compliance with data privacy regulations, implementing and monitoring security measures to protect sensitive health information, and managing data backups and disaster recovery.

As a lead in the Engineering department, you will contribute to key focus areas in the following ways:

Technical Leadership

  • Lead on architectural design and actively manage technical risk/debt against project goals.
  • Able to describe, analyse, and convince others about major technical tradeoffs and decisions.

Thought Leadership

  • Present at relevant conferences, webinars and other opportunities to showcase Reach.

Workflow

  • Take ownership for team and technical documentation.
  • Ensure your team adopts and follows process & workflow best practices.

Delivery

  • Take responsibility for risks with your team’s work.
  • Take initiative to identify problems and propose solutions to resolve them.

Strategy

  • Provide input into the organisation's technology strategy.
  • Play an active role in meeting engineering team KPIs.

Ways of Working

  • Suggest and implement improvements to current ways of working.
  • Promote consistent adoption of best practices (if you implement a change or improvement, help others do the same).

Communication (Internal)

  • Share ideas, decisions and plans effectively within Engineering and across the organisation.
  • Build relationships with other leads and heads across the organisation to support strong cross-functional teams.

Communication (External)

  • Take ownership of external engagements and respond timeously to stakeholders.

Team Management

  • Ensure you’re delegating the right things effectively, so that you can work at the right level.
  • Identify and support opportunities for growth within your team.

Partnerships & Growth

  • Lead on technical proposals and concept notes.
  • Pursue opportunities for new partnerships and services.

People Operations

  • Proactively highlight gaps and skills needed in Engineering.
  • Draft job descriptions and drive recruitment for relevant roles.

Responsibilities and Duties

  • Lead a team of Site Reliability Engineers and Information Security Officers, providing mentorship, guidance, and technical expertise.
  • Establish and enforce SRE best practices to improve system reliability and operational efficiency.
  • Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
  • Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
  • Conduct performance reviews, set goals, and facilitate professional development for team members.
  • Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
  • Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
  • Advise on and/or contribute to new or emerging technologies that might be relevant to Reach.
  • Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
  • Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team and assist other engineering teams with the same.
  • Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.

Qualifications

  • An honours degree in Computer Science or Engineering or equivalent experience.
  • 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
  • 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.

Skills and Experience Required

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning. Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus. Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment. Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner. Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies. #J-18808-Ljbffr
This advertiser has chosen not to accept applicants from your region.

Cloud Infrastructure Manager

Western Cape, Western Cape Mukuru

Posted 14 days ago

Job Description

Work from home

Cloud Infrastructure Manager

Mukuru is on the lookout for a Cloud Infrastructure Manager to lead and scale the engineering capability behind our cloud platforms. This role forms a key part of the Technology Solutions leadership team, sitting alongside our Cloud Architect and reporting to the Head of Technology Solutions. If you are based in Johannesburg, Pretoria, Cape Town or anywhere in South Africa, feel free to apply!

At Mukuru, we don’t just build infrastructure – we build the backbone of meaningful, borderless financial access. You’ll lead a dynamic team of Cloud Engineers, SREs, and DevSecOps professionals, empowering them to deliver secure, scalable, and reliable infrastructure that fuels real-world impact across the continent.

What You’ll Be Doing

  • Lead, mentor, and grow a high-performing team of Cloud Engineers, SREs, and DevSecOps experts who thrive on innovation and impact.

  • Drive the execution of our Cloud Infrastructure roadmap, aligning with Mukuru’s strategic platform and business goals.

  • Take ownership of Mukuru’s AWS-based cloud environments – defined as infrastructure-as-code with Terraform and containerised with Kubernetes – ensuring performance, cost-efficiency, and resilience at every stage of the SDLC.

  • Champion DevOps culture across engineering, fostering collaboration, shared ownership, and continuous delivery practices.

  • Ensure uptime and recovery goals are met, and oversee compliance with RTOs, RPOs, patching, monitoring, and alerting standards.

  • Partner closely with our Cloud Architect to deliver well-architected, observable, and cost-optimised infrastructure solutions.

  • Collaborate across the business – from Platform Engineering and Product, to Software Engineering, Security, and Governance – to enable cross-functional success.

  • Implement and maintain controls aligned to compliance frameworks like PCI-DSS and ISO27001.

  • Build and manage tooling across CI/CD, observability, documentation, and cloud cost monitoring.

  • Drive innovation by enabling teams with the right tools, autonomy, and environment to experiment, iterate, and deliver value.

  • Manage team capacity, budgets, and ongoing capability growth in line with emerging technologies and business needs.

What You’ll Need to Succeed

  • Grade 12 or equivalent (essential)

  • Related tertiary qualification (desirable)

  • 7+ years in DevOps, Cloud Infrastructure, or Systems Engineering

  • 3+ years in a leadership or technical management role

  • Proven experience designing and operating production-grade AWS environments

  • Expertise in Terraform, Kubernetes, Linux systems, CI/CD, and incident management

  • Experience in high-growth sectors like Fintech, Retail, or Technology

  • Familiarity with cloud cost optimisation, security, and governance best practices

  • Experience with regulatory frameworks like PCI-DSS and ISO27001

  • A working knowledge of agile practices and a strong DevOps mindset

What Sets You Apart

  • You empower engineers, celebrate delivery, and lead with empathy

  • You bring calm, clarity, and focus during complex situations

  • You love turning complex challenges into elegant, automated solutions

  • You value collaboration and communicate clearly across tech and business lines

  • You’re passionate about infrastructure that’s secure, observable, and scalable

  • You care about creating long-term value, not just short-term fixes

Why Mukuru?

  • Be part of a fast-scaling, purpose-driven fintech making real change across Africa.

  • Join teams that thrive on collaboration, impact, and innovation.

  • Embrace a culture of diversity, continuous learning, and inclusion.

  • Work in an environment that values your growth, your voice, and your ideas.

Not 100% sure you tick every box?

That’s okay! We believe that passion, potential, and purpose go a long way. If this role excites you and you believe you can contribute to our mission, we’d love to hear from you. Tell us how you’d add value at Mukuru – your future team is waiting.

You may be reading this job description and find that you meet the majority of the criteria, yet still not feel 100% comfortable applying. We believe that there is a place for everyone under the Mukuru sun, and we want YOU to contribute to our diverse tapestry of talent. So come on, take a leap of faith and send your application if you meet the majority of our requirements. Remember to include a snippet on how you will bring value and help us build a future of success; this will help us determine where and how you may best be suited. Maybe you are just the future Mukurian we need!


Should you be appointed in a remote/work from home role at Mukuru, it is your responsibility to ensure that you have uninterrupted internet connectivity and a ‘work-like’ environment at your home location, in order to deliver your best in terms of performance, productivity and service to our customers.

If you do not receive any response after two weeks, please consider your application unsuccessful.


NB: ALL STAFF APPOINTMENTS WILL BE MADE WITH DUE CONSIDERATION OF THE COMPANY’S DIVERSITY AND INCLUSION PLANS
