61 Site Reliability Engineer jobs in South Africa

DevOps

Gauteng, Gauteng Network It

Job Description

Job Reference: NWA-BOM-2

Do you want to level up by working for an out-of-this-world software powerhouse?

Duties & Responsibilities

Job & Company Description: The client is based in Pretoria and they are looking for talented Developers to join their development team.

They specialise in the insurance industry.

They encourage continuous career growth and have great support systems in place, such as training and study time allocated within the working week.

The Mid-Level Software Developer is responsible for using development languages and tools to write, edit, maintain, and test computer software.

The position will be required to follow the software development lifecycle (SDLC) to plan, design, build, test, and deploy software applications.

In addition to creating new software, you will be required to improve and maintain the working order of existing software.

What's In It For You

Relaxed dress code
Access to Microsoft Certifications
Excellent career growth opportunities

Job Experience & Skills Required

BSc Computer Studies / BEng Computer Engineering
Azure certified
3+ years' experience in Systems Administration / DevOps Engineering / Network Administration
Networking
- Knowledge of private vs public IPs and subnets
- Private network routing
VPN
- Has configured OpenVPN before
System Configuration Management
- Has done automatic system configuration management
- Configuration management skills: CFEngine, Rudder, Chef, Puppet, Ansible, Salt
Linux
- Worked on RedHat / CentOS
- Bash scripting
- Can configure system
- Ability to configure PXE boot
- Ability to configure IPTables
- Experience with LVM
Cloud
- Working experience on Azure
- Knowledge of what an Azure WebApp is
DB
- Ability to administer Postgres as a DBA
- Experience with live WAL streaming
- Has restored a Postgres DB from WAL files with point-in-time recovery
Other
- Package installation
- Azure SQL: Continuous deployment
- DevOps and Agile principles

If you are interested in this opportunity, please apply directly.

Job No Longer Available

This position is no longer listed on WhatJobs. The employer may be reviewing applications, filled the role, or has removed the listing.

However, we have similar jobs available for you below.

Site Reliability Engineer

Flash Group

Posted 2 days ago

Job Description

Flash

2024/12/12 Western Cape

Job Reference Number: T169

Department: Technology

Business Unit:

Industry: Fintech

Job Type: Permanent

Positions Available: 2

Salary: Market Related

We are looking for an individual passionate about technology with experience in developing and managing cutting-edge environment monitoring solutions, as well as using software and automation to solve problems and manage production systems.

Job Description

RESPONSIBILITIES:

Master multiple scripting and programming languages to achieve advanced proficiency and deliver robust solutions.
Drive the design and implementation of sophisticated automation tools and processes for managing large-scale systems.
Lead critical incident responses with composure and efficiency, followed by thorough post-incident reviews to implement preventative measures.
Shape system architecture and design, bringing your vision and expertise to influence high-impact decisions.
Champion the creation and adherence to reliability standards, ensuring scalable and sustainable system operations.
Demonstrate strong strategic thinking and planning abilities to drive team and organizational success.
Exhibit exceptional leadership skills, with the capacity to influence key technical decisions and inspire cross-functional teams.
Possess mentorship and coaching expertise to nurture and develop junior and intermediate team members, fostering a collaborative and growth-oriented environment.

Job Requirements

MINIMUM REQUIREMENTS:

8-10 years' relevant experience in SRE, DevOps, or system engineering.
Matric
Proficiency in scripting languages.
Relevant certifications such as Oracle, Cloud, DevOps.

TECHNICAL SKILLS:

Continuous delivery
Cloud skills & best practices
Observability (System and Application Performance Monitoring)
Infrastructure as code
Configuration management (Infrastructure as a Service)
Containers
Automation
Collaboration and Communication
Coding and Scripting
Azure DevOps
General system uptimes
SLO (Service-level Objectives)
Latency
Incident and outage management
Change management
Capacity planning

Site Reliability Engineer

CXP are now part of the Huntswood Group

Posted 3 days ago

Job Description

Job Overview

The purpose of the role of the Service Reliability Engineer is to focus on the delivery of technology services to internal and external customers. You will take a holistic and forward-looking view of all Huntswood services to minimize service impact.

Job Responsibilities

Identify and rectify shortcomings and weaknesses in:

Windows
- Operating system deployment (AutoPilot) and application deployment (Intune)
- Frequently raised issues for agents
Automation
- We use PowerShell and/or Python
Networking
- Capacity (we use Site24x7 and Meraki to monitor our systems)
- Responsiveness (with several offices in two countries, addressing delays is important)
Regional requirements
- Ensure that each location has its unique (cultural, legal, infrastructure) requirements addressed
Own the delivery of service for the offices
Provide technical escalation point to the ServiceDesk
Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
Able to occasionally work out of hours to avoid disruption to end users
Able to join an out-of-hours on-call rota

Job Requirements

Matric / NQF Level 4
Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
IT qualification advantageous

Required Skills

Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
Used to working in a reactive environment, able to prioritise issues based on impact and other factors
Excellent business communication skills, able to speak to people at all levels

Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:

Bring Your “A” Game
Strive For Greater
Enable and empower all employees
Do the right thing
Own it
Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.

Site Reliability Engineer

Gauteng, Gauteng Asuer (Pty)

Posted 4 days ago

Job Description

Location: North Riding, Johannesburg, South Africa

Type: Full-time

Office: Hybrid, 3 days in office a week

ABOUT YOU

We are looking for.

A committed and capable Site Reliability Engineer (SRE) to take ownership of the uptime, performance, and scalability of our production and development systems. You will be responsible for managing the hosting environments of our ERP, customer platforms, internal applications, databases, and websites, ensuring they are secure, available, and optimised across all stages of deployment. This position is based in Johannesburg, offers a competitive salary, and provides an opportunity to build the foundations of infrastructure excellence for one of South Africa’s most promising fintech ventures.

What you'll get to do and why we need you.

As a Site Reliability Engineer, you will be the guardian of our technical stability and infrastructure performance. You will manage and optimise hosting environments across production and development instances, covering platforms like Odoo ERP, WhatsApp chatbot systems, APIs, internal tools, external facing websites and reporting databases. Your work ensures that the systems powering over 50 000 Sales Force members and thousands of end users remain resilient, scalable, and secure.

You will collaborate with engineers, product managers, and business teams to design infrastructure strategies, improve observability, manage deployments, respond to incidents, and drive continuous improvement. This is a rare opportunity to shape the infrastructure blueprint of a high growth, impact focused business from the ground up.

Infrastructure Management
Security & Uptime
Automation & CI/CD
Collaboration with Engineers

ABOUT US

Who we are and what we do.

Asuer is a fintech company committed to making life simpler and more secure for African communities through innovative financial and technology solutions. We operate across insurance and telecommunications, with plans to expand into digital payments. Our focus is on removing barriers and helping people achieve their goals.

Born from the ongoing digital transformation of Botle Buhle Brands (BBB), one of Africa’s leading direct-selling businesses, Asuer has grown into an independent company centred on financial inclusion and accessible technology. Everything we build is guided by our core values: Impact, Innovation, and Integrity.

Managing and monitoring the infrastructure of our ERP systems, applications, APIs, and databases.
Ensuring high availability and scalability of production environments and development pipelines.
Administering cloud environments including deployments, rollbacks, and updates.
Establishing and maintaining CI/CD workflows for rapid and safe deployments.
Setting up monitoring, logging, and alerting systems to track system health and performance.
Investigating and resolving production incidents in a timely and thorough manner.
Implementing backup, recovery, and failover processes to ensure data integrity.
Improving observability and reporting across environments and services.
Hardening infrastructure security and enforcing access controls and best practices.
Supporting development teams with staging, test, and release environments.
Automating routine tasks to improve system efficiency and reduce human error.

Our requirements include. Technical skills in:

Experience managing Linux based production environments preferably on Ubuntu
Strong proficiency in cloud hosting platforms such as AWS or Google Cloud
Solid understanding of containerisation using Docker and orchestration tools
Experience with CI/CD tools and pipeline automation
Familiarity with infrastructure as code tools such as Terraform or Ansible
Comfortable working with PostgreSQL and database administration best practices
Networking, DNS, and load balancing
Monitoring and alerting using tools like Grafana, Prometheus, or cloud native solutions
Understanding of secure deployment practices including firewalls, SSL, and API rate limiting

Must be able to:

Set up and manage reliable and scalable hosting environments
Diagnose and resolve incidents efficiently with minimal downtime
Collaborate with software teams to enable faster and safer deployments
Document infrastructure processes and maintain infrastructure knowledge bases
Implement DevOps and SRE practices tailored to a fast moving startup context
Build processes that are robust and scale as the company grows
Balance performance, security, and simplicity in all infrastructure decisions

Knowledge & experience:

Odoo hosting and maintenance workflows
Hosting ERP systems, databases, and API driven platforms
Securing web infrastructure and access credentials
Optimising costs and performance in cloud environments
Scripting and automation using Bash, Python, or similar
Logging and system observability tools
Fast recovery planning and disaster mitigation

Prerequisites:

A tertiary qualification in Computer Science, Information Technology, or a related field
Minimum of 3 years of experience in a systems administration, DevOps, or SRE role
Strong problem solving, troubleshooting, and communication skills
Proficiency in English reading, writing, and speaking

A BIT MORE ABOUT US

What we offer.

At Asuer, you’ll join a mission with real meaning, where your work empowers thousands of people across Africa. You’ll collaborate with smart, curious teammates who move fast and build with purpose, without the drag of legacy systems. We offer competitive pay, a flexible environment, and the autonomy to shape systems from the ground up. This is a place for real growth, where you scale products that matter and make a tangible impact every day.

Site Reliability Engineer

Durban, KwaZulu Natal CXP are now part of the Huntswood Group

Posted 5 days ago

Job Description

Job Overview

The purpose of the role of the Service Reliability Engineer is to focus on the delivery of technology services to internal and external customers. You will take a holistic and forward-looking view of all Huntswood services to minimize service impact.

Job Responsibilities

Identify and rectify shortcomings and weaknesses in:

Windows

Operating system deployment (AutoPilot) and application deployment (Intune)
Frequently raised issues for agents

Automation

We use PowerShell and/or Python

Networking

Capacity (we use Site24x7 and Meraki to monitor our systems)
Responsiveness (with several offices in two countries, addressing delays is important)

Regional requirements

Ensure that each location has its unique (cultural, legal, infrastructure) requirements addressed

Own the delivery of service for the offices
Provide technical escalation point to the ServiceDesk
Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
Able to occasionally work out of hours to avoid disruption to end users
Able to join an out-of-hours on-call rota

Job Requirements

Matric / NQF Level 4
Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
IT qualification advantageous

Required Skills

Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
Used to working in a reactive environment, able to prioritise issues based on impact and other factors
Excellent business communication skills, able to speak to people at all levels

Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.

The job holder should align to our 6 Fundamental Values:

Bring Your “A” Game
Strive For Greater
Enable and empower all employees
Do the right thing
Own it
Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.

Site Reliability Engineer

Durban, KwaZulu Natal CXP are now part of the Huntswood Group

Posted 8 days ago

Job Description

Windows

Operating system deployment (AutoPilot) and application deployment (Intune)
Frequently raised issues for agents

Automation

We use PowerShell and/or Python

Networking

Capacity (we use Site24x7 and Meraki to monitor our systems)
Responsiveness (with several offices in two countries, addressing delays is important)

Regional requirements

Ensure that each location has its unique (cultural, legal, infrastructure) requirements addressed

Own the delivery of service for the offices
Provide technical escalation point to the ServiceDesk
Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
Able to occasionally work out of hours to avoid disruption to end users
Able to join an out-of-hours on-call rota

Job Requirements

Matric / NQF Level 4
Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
IT qualification advantageous

Required Skills

Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
Used to working in a reactive environment, able to prioritise issues based on impact and other factors
Excellent business communication skills, able to speak to people at all levels

Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.

The job holder should align to our 6 Fundamental Values:

Bring Your “A” Game
Strive For Greater
Enable and empower all employees
Do the right thing
Own it
Deliver unbelievable service

Seniority level: Entry level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Outsourcing/Offshoring

Site Reliability Engineer

Iqtalent

Posted 13 days ago

Job Description

Who are Tyk, and what do we do?

The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, or media industries (to name just a few!)

If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible. Founded in 2015 with offices in London – UK, London – Ontario, Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell, Dominos and Starbucks to RBS and Societe Generale. We have a varied user base hailing from every continent – even Antarctica.

Our Mission

Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.

Total flexibility, default remote, radical responsibility

We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees; we believe this allows them to achieve their best results. It also means we can build the best possible team: location and working hours are no barrier.

If this sounds like an environment that you believe could work for you then read on to find out more.

The role:

At Tyk, we’re obsessed with building software that solves problems. We count on our Site Reliability Engineers (SREs) to empower users with a rich feature set, high availability, and stellar performance level to pursue their missions.

Our customer base is growing, so we’re seeking an experienced SRE to optimise, automate, and improve our performance, using insights from massive-scale data in real time. We want an original thinker, a challenger, a technical legend, an opinionated collaborator who wants to make things better.

Here’s what you’ll be getting up to:

Proactive Monitoring: Ensure our production Cloud environment operates within defined SLAs through vigilant monitoring and proactive issue resolution.
Alerting and Monitoring: Collaborate with Senior SRE to identify opportunities for building proactive alerting and monitoring systems; implement solutions to enhance system reliability.
Performance Metrics: Contribute to defining key performance metrics for Cloud services, enabling performance improvements and success measurement.
Solutions Development: Propose and develop solutions to maintain and enhance key performance indicators (KPIs) across our Cloud infrastructure.
Data Analysis: Gather and analyse metrics from operating systems and applications to optimise system performance and expedite fault resolution.
Innovation: Drive innovation by optimising system and infrastructure performance, anticipating customer needs, and proactively addressing scaling demands.
Scalability: Work closely with commercial functions to optimise our platform for scalability and meet growing customer demands.
Cloud Infrastructure: Analyse and ensure the automation, scalability, and efficient management of our Cloud infrastructure.
Automation: Execute automation for known cloud operations tasks and create new automation solutions to streamline processes.
Software Development: Design, write, and deliver software and automation solutions to enhance the availability, scalability, latency, and efficiency of our PaaS services.
Root Cause Analysis: Participate in blame-free root cause analysis meetings to promote learning and continuous system improvement in the event of production system incidents.
Documentation: Create and contribute to policies and runbooks to ensure that operational processes are well-documented and consistently followed.
On-call Support: Provide on-call support, ensuring our Cloud services follow a 24/7 model by promptly responding to alerts, meeting SLAs, and automating root cause analysis.
Upgrades and Migrations: Plan and execute software upgrades, including Kubernetes versions. Manage and communicate migrations from Classic Cloud to the new Cloud platform.

Here’s what we’re looking for:

Strong collaboration skills
Launching and operating production Kubernetes clusters
Designing and operating infrastructure on AWS and other providers
Operating MongoDB (or other document database) clusters
Operating Redis (or other key-value storage) clusters
Administering Linux servers
Maintaining distributed software
Operating Prometheus and Grafana
Operating log collection and analysis systems

Skills:

Kubernetes & containers (proficient)
Go and/or Python (advanced)
AWS (proficient)
Linux (proficient)
Terraform and IaC in general (proficient)
Helm (familiar)
MongoDB (or similar)
Redis (or similar)
Monitoring & logging
Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)

Benefits

Here’s why you should join us:

Everyone has unlimited paid holidays.
We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique after all.
Employee share scheme
Generous maternity and paternity leave
Volunteering Days
Company retreats
Employee Wellbeing platform

We all share the same vision – we value authenticity, respect, responsibility, independence, honesty, diversity and inclusion and, most importantly, treating others how you wish to be treated. We look for like-minded people who bring their personalities to work every day, strive to achieve their personal goals and are willing to challenge the way we do things. Why? To make what we do even better!

Our values tell the story of Tyk – here’s how:

It’s ok to screw up!

We’ve found that it’s often the ‘stupid’ or unexpected ideas that turn out to be the successful ones – so try it, at least we can say we have!

The only stupid idea is the untested one!

It’s in our DNA – starting a business with founders 12 hours apart, giving our gateway away for free – sure, we did that, and we’d do it again!

Trust starts with you – make it count!

Trust is a two-way street – instil it from day one!

Assume best intent!

We have each other’s back – we’re all on the same team. Think before you speak or act.

Make things better!

Always try to leave things better than when you found them – change is constant, inevitable and embraced! Be that change we want to see.

Tyk is an equal opportunities employer and we are determined to ensure that no applicant or employee receives less favourable treatment on the grounds of gender, age, disability, religion, belief, sexual orientation, marital status, or race, or is disadvantaged by conditions or requirements which cannot be shown to be justifiable.

Site Reliability Engineer

Johannesburg, Gauteng Ziyasiza Consulting (Pty) Ltd

Posted 20 days ago

Job Description

Key Responsibilities:

Monitoring and Alerting: Implementing and maintaining monitoring systems to track system health and performance, alerting on symptoms rather than just outages.

Incident Response: Responding to and resolving production incidents, troubleshooting across the entire stack, and providing support for product teams.

Automation: Developing and implementing automation to streamline operational tasks, improve efficiency, and reduce manual effort.

Infrastructure Management: Managing and maintaining infrastructure, including platforms

Performance Optimization: Identifying and addressing performance bottlenecks, optimizing existing systems, and contributing to system design and capacity planning.

Collaboration: Working closely with development, operations, and other teams to ensure smooth deployments and efficient operations.

Continuous Improvement: Continuously improving systems and processes through post-incident reviews, documentation, and knowledge sharing.

Proactive Problem Solving: Identifying potential problems before they occur and developing solutions to prevent future issues.

Capacity Planning: Ensuring that systems can handle current and future demands.

Mentoring and Coaching: Sharing knowledge and providing guidance to junior engineers.

Skills and Qualifications:

Strong understanding of system architecture, automation, and infrastructure tools.
Proficiency in programming languages like Python, Go, or Java.
Experience with cloud platforms like AWS, Azure or GCP.
Familiarity with containerization technologies like Docker and Kubernetes.
Experience with monitoring and alerting tools like Prometheus, Grafana, or New Relic.
Strong problem-solving and troubleshooting skills.
Excellent communication and collaboration skills.
Ability to work independently and as part of a team.

Site Reliability Engineer

Pretoria, Gauteng SM Squared Talent (Pty) Ltd

Posted 27 days ago

Job Description

Requirements:

Card payment domain knowledge (mandatory)
Experience with CI/CD and Build pipelines using Jenkins.
Experience in public and private Cloud offerings (PCF, Azure, AWS etc.).
Knowledge of NoSQL & SQL databases such as Mongo / Oracle
Experience and knowledge of managing distributed systems and working with microservices.
Familiarity with Unix tooling, with strong scripting skills
Exposure to working with Monitoring and Alerting tools such as Splunk, Dynatrace
Proficiency in one of the following: Python, Java, Go or equivalent.
Familiarity with defining SLOs and SLAs
Prior experience of working in an SRE/DevOps team and excellent understanding of SRE/DevOps principles.
High degree of initiative and self-motivation, with a willingness to take on challenging opportunities.
Excellent communication and relationship building/collaboration skills.

KPAs

Design, implement and maintain monitoring systems
Identify and resolve reliability issues
Automate manual processes for efficient system operation
Participate in on-call rotation to address system outages
Collaborate with development teams to improve system design
Lead incident management efforts by proactively monitoring and analyzing ISO 8583 financial transaction messages across the 4-party payment model (Cardholder, Merchant, Acquirer, Issuer).

Site Reliability Engineer

Randburg, Gauteng Asuer (Pty)

Posted today

Job Description

Location: North Riding, Johannesburg, South Africa

Type: Full-time

Office: Hybrid, 3 days in office a week

ABOUT YOU

We are looking for.

What you'll get to do and why we need you.

ABOUT US

Who we are and what we do.

Managing and monitoring the infrastructure of our ERP systems, applications, APIs, and databases.
Ensuring high availability and scalability of production environments and development pipelines.
Administering cloud environments including deployments, rollbacks, and updates.
Establishing and maintaining CI/CD workflows for rapid and safe deployments.
Setting up monitoring, logging, and alerting systems to track system health and performance.
Investigating and resolving production incidents in a timely and thorough manner.
Implementing backup, recovery, and failover processes to ensure data integrity.
Improving observability and reporting across environments and services.
Hardening infrastructure security and enforcing access controls and best practices.
Supporting development teams with staging, test, and release environments.
Automating routine tasks to improve system efficiency and reduce human error.

Our requirements include. Technical skills in:

Experience managing Linux based production environments preferably on Ubuntu
Strong proficiency in cloud hosting platforms such as AWS or Google Cloud
Solid understanding of containerisation using Docker and orchestration tools
Experience with CI/CD tools and pipeline automation
Familiarity with infrastructure-as-code tools such as Terraform or Ansible
Comfortable working with PostgreSQL and database administration best practices
Networking, DNS, and load balancing
Monitoring and alerting using tools like Grafana, Prometheus, or cloud-native solutions
Understanding of secure deployment practices including firewalls, SSL, and API rate limiting

Must be able to:

Set up and manage reliable and scalable hosting environments
Diagnose and resolve incidents efficiently with minimal downtime
Collaborate with software teams to enable faster and safer deployments
Document infrastructure processes and maintain infrastructure knowledge bases
Implement DevOps and SRE practices tailored to a fast-moving startup context
Build processes that are robust and scale as the company grows
Balance performance, security, and simplicity in all infrastructure decisions

Knowledge & experience:

Odoo hosting and maintenance workflows
Hosting ERP systems, databases, and API-driven platforms
Securing web infrastructure and access credentials
Optimising costs and performance in cloud environments
Scripting and automation using Bash, Python, or similar
Logging and system observability tools
Fast recovery planning and disaster mitigation

Prerequisites:

A tertiary qualification in Computer Science, Information Technology, or a related field
Minimum of 3 years of experience in a systems administration, DevOps, or SRE role
Strong problem-solving, troubleshooting, and communication skills
Proficiency in English reading, writing, and speaking

A BIT MORE ABOUT US

What we offer.

Site Reliability Engineer

CXP are now part of the Huntswood Group

Posted today

Job Description

Job Overview

The purpose of the Service Reliability Engineer role is to focus on the delivery of technology services to internal and external customers. You will take a holistic and forward-looking view of all Huntswood services to minimize service impact.

Job Responsibilities

Identify and rectify shortcomings and weaknesses in:

Windows
- Operating system deployment (AutoPilot) and application deployment (Intune)
- Frequently raised issues for agents
Automation
- We use PowerShell and/or Python
Networking
- Capacity (we use Site24x7 and Meraki to monitor our systems)
- Responsiveness (with several offices in two countries, addressing delays is important)
Regional requirements
- Ensure that each location has its unique (cultural, legal, infrastructure) requirements addressed

Own the delivery of service for the offices
Provide a technical escalation point to the ServiceDesk
Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
Able to occasionally work out of hours to avoid disruption to end users
Able to join an out-of-hours on-call rota

Job Requirements

Matric / NQF Level 4
Minimum 2 years in an IT tech support role in a large-scale operation, working with Windows servers and networking
IT qualification advantageous

Required Skills

Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, what the difference between UDP and TCP is, and how works
Used to working in a reactive environment, able to prioritise issues based on impact and other factors
Excellent business communication skills, able to speak to people at all levels

Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:

Bring Your “A” Game
Strive For Greater
Enable and empower all employees
Do the right thing
Own it
Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.
