61 Site Reliability Engineer jobs in South Africa
DevOps
Job Description
Job Reference : NWA-BOM-2
Do you want to level up by working for an out-of-this-world software powerhouse?
Duties & Responsibilities
Job & Company Description: The client is based in Pretoria and is looking for talented Developers to join its development team.
They specialise in the insurance industry.
They encourage continuous career growth and have great support systems in place, such as training and study time allocated within the working week.
The Mid-Level Software Developer is responsible for using development languages and tools to write, edit, maintain, and test computer software.
The position will be required to follow the software development lifecycle (SDLC) to plan, design, build, test, and deploy software applications.
In addition to creating new software, you will be required to improve and maintain the working order of existing software.
What's In It For You
- Relaxed dress code
- Access to Microsoft Certifications
- Excellent career growth opportunities
- BSc Computer Studies / BEng Computer Engineering
- Azure certified
- 3+ years' experience in Systems Administration / DevOps Engineering / Network Administration
- Networking: Knowledge of private vs public IPs and subnets
- Private network routing
- VPN: Has configured OpenVPN before
- System Configuration Management: Has done automatic system configuration management
- Configuration Management Skills: CFEngine, Rudder, Chef, Puppet, Ansible, Salt
- Linux: Worked on RedHat / CentOS
- Bash scripting
- Can configure system
- Ability to configure PXE boot
- Ability to configure IPTables
- Experience with LVM
- Cloud: Working experience on Azure
- Knowledge of what an Azure WebApp is
- DB: Ability to act as DBA for Postgres
- Experience with live WAL streaming
- Has restored a Postgres DB from WAL files with point in time recovery
- Other: Package Installation
- Azure SQL: Continuous deployment
- DevOps and Agile principles
If you are interested in this opportunity, please apply directly.
Job No Longer Available
This position is no longer listed on WhatJobs. The employer may be reviewing applications, may have filled the role, or may have removed the listing.
However, we have similar jobs available for you below.
Site Reliability Engineer
Posted 2 days ago
Job Description
Flash
2024/12/12 Western Cape
Job Reference Number: T169
Department: Technology
Business Unit:
Industry: Fintech
Job Type: Permanent
Positions Available: 2
Salary: Market Related
We are looking for an individual passionate about technology with experience in developing and managing cutting-edge environment monitoring solutions, as well as using software and automation to solve problems and manage production systems.
RESPONSIBILITIES:
- Master multiple scripting and programming languages to achieve advanced proficiency and deliver robust solutions.
- Drive the design and implementation of sophisticated automation tools and processes for managing large-scale systems.
- Lead critical incident responses with composure and efficiency, followed by thorough post-incident reviews to implement preventative measures.
- Shape system architecture and design, bringing your vision and expertise to influence high-impact decisions.
- Champion the creation and adherence to reliability standards, ensuring scalable and sustainable system operations.
- Demonstrate strong strategic thinking and planning abilities to drive team and organizational success.
- Exhibit exceptional leadership skills, with the capacity to influence key technical decisions and inspire cross-functional teams.
- Possess mentorship and coaching expertise to nurture and develop junior and intermediate team members, fostering a collaborative and growth-oriented environment.
MINIMUM REQUIREMENTS:
- Matric
- 8-10 years' relevant experience in SRE, DevOps, or system engineering.
- Proficiency in scripting languages.
- Relevant certifications such as Oracle, Cloud, DevOps.
- Continuous delivery
- Cloud skills & best practices
- Observability (System and Application Performance Monitoring)
- Infrastructure as code
- Configuration management (Infrastructure as a Service)
- Containers
- Automation
- Collaboration and Communication
- Coding and Scripting
- Azure DevOps
- General system uptimes
- SLO (Service-level Objectives)
- Latency
- Incident and outage management
- Change management
- Capacity planning
Site Reliability Engineer
Posted 3 days ago
Job Description
The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.
Job Responsibilities
Identify and rectify shortcomings and weaknesses in:
- Windows
- Operating system deployment (AutoPilot) and application deployment (Intune)
- Frequently raised issues for agents
- Automation
- We use PowerShell and/or Python
- Networking
- Capacity (we use Site24x7 and Meraki to monitor our systems)
- Responsiveness (with several offices in two countries, addressing delays is important)
- Regional requirements
- Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
- Own the delivery of service for the offices
- Provide technical escalation point to the ServiceDesk
- Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
- Able to occasionally work out of hours to avoid disruption to end users
- Able to join an out-of-hours on-call rota
- Matric / NQF Level 4
- Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
- IT qualification advantageous
- Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
- Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
- Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
- Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
- Used to working in a reactive environment, able to prioritise issues based on impact and other factors
- Excellent business communication skills, able to speak to people at all levels
Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:
- Bring Your “A” Game
- Strive For Greater
- Enable and empower all employees
- Do the right thing
- Own it
- Deliver unbelievable service
"It's not just about what we do, but the way we do it. And it's our values that make us special."
NB: All appointments are subject to the positive outcome of pre-employment verification checks.
Site Reliability Engineer
Posted 4 days ago
Job Description
Location: North Riding, Johannesburg, South Africa
Type: Full-time
Office: Hybrid, 3 days in office a week
ABOUT YOU
You are a committed and capable Site Reliability Engineer (SRE) ready to take ownership of the uptime, performance, and scalability of our production and development systems. You will be responsible for managing the hosting environments of our ERP, customer platforms, internal applications, databases, and websites, ensuring they are secure, available, and optimised across all stages of deployment. This position is based in Johannesburg, offers a competitive salary, and provides an opportunity to build the foundations of infrastructure excellence for one of South Africa’s most promising fintech ventures.
As a Site Reliability Engineer, you will be the guardian of our technical stability and infrastructure performance. You will manage and optimise hosting environments across production and development instances, covering platforms like Odoo ERP, WhatsApp chatbot systems, APIs, internal tools, external facing websites and reporting databases. Your work ensures that the systems powering over 50 000 Sales Force members and thousands of end users remain resilient, scalable, and secure.
You will collaborate with engineers, product managers, and business teams to design infrastructure strategies, improve observability, manage deployments, respond to incidents, and drive continuous improvement. This is a rare opportunity to shape the infrastructure blueprint of a high growth, impact focused business from the ground up.
ABOUT US
Who we are and what we do.
Asuer is a fintech company committed to making life simpler and more secure for African communities through innovative financial and technology solutions. We operate across insurance and telecommunications, with plans to expand into digital payments. Our focus is on removing barriers and helping people achieve their goals.
Born from the ongoing digital transformation of Botle Buhle Brands (BBB), one of Africa’s leading direct-selling businesses, Asuer has grown into an independent company centred on financial inclusion and accessible technology. Everything we build is guided by our core values: Impact, Innovation, and Integrity.
- Managing and monitoring the infrastructure of our ERP systems, applications, APIs, and databases.
- Ensuring high availability and scalability of production environments and development pipelines.
- Administering cloud environments including deployments, rollbacks, and updates.
- Establishing and maintaining CI/CD workflows for rapid and safe deployments.
- Setting up monitoring, logging, and alerting systems to track system health and performance.
- Investigating and resolving production incidents in a timely and thorough manner.
- Implementing backup, recovery, and failover processes to ensure data integrity.
- Improving observability and reporting across environments and services.
- Hardening infrastructure security and enforcing access controls and best practices.
- Supporting development teams with staging, test, and release environments.
- Automating routine tasks to improve system efficiency and reduce human error.
- Experience managing Linux based production environments preferably on Ubuntu
- Strong proficiency in cloud hosting platforms such as AWS or Google Cloud
- Solid understanding of containerisation using Docker and orchestration tools
- Experience with CI/CD tools and pipeline automation
- Familiarity with infrastructure as code tools such as Terraform or Ansible
- Comfortable working with PostgreSQL and database administration best practices
- Networking, DNS, and load balancing
- Monitoring and alerting using tools like Grafana, Prometheus, or cloud native solutions
- Understanding of secure deployment practices including firewalls, SSL, and API rate limiting
- Set up and manage reliable and scalable hosting environments
- Diagnose and resolve incidents efficiently with minimal downtime
- Collaborate with software teams to enable faster and safer deployments
- Document infrastructure processes and maintain infrastructure knowledge bases
- Implement DevOps and SRE practices tailored to a fast moving startup context
- Build processes that are robust and scale as the company grows
- Balance performance, security, and simplicity in all infrastructure decisions
- Odoo hosting and maintenance workflows
- Hosting ERP systems, databases, and API driven platforms
- Securing web infrastructure and access credentials
- Optimising costs and performance in cloud environments
- Scripting and automation using Bash, Python, or similar
- Logging and system observability tools
- Fast recovery planning and disaster mitigation
- A tertiary qualification in Computer Science, Information Technology, or a related field
- Minimum of 3 years of experience in a systems administration, DevOps, or SRE role
- Strong problem solving, troubleshooting, and communication skills
- Proficiency in English reading, writing, and speaking
A BIT MORE ABOUT US
At Asuer, you’ll join a mission with real meaning, where your work empowers thousands of people across Africa. You’ll collaborate with smart, curious teammates who move fast and build with purpose, without the drag of legacy systems. We offer competitive pay, a flexible environment, and the autonomy to shape systems from the ground up. This is a place for real growth, where you scale products that matter and make a tangible impact every day.
Site Reliability Engineer
Posted 5 days ago
Job Description
Job Overview
The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.
Job Responsibilities
Identify and rectify shortcomings and weaknesses in:
- Windows
- Operating system deployment (AutoPilot) and application deployment (Intune)
- Frequently raised issues for agents
- Automation
- We use PowerShell and/or Python
- Networking
- Capacity (we use Site24x7 and Meraki to monitor our systems)
- Responsiveness (with several offices in two countries, addressing delays is important)
- Regional requirements
- Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
- Own the delivery of service for the offices
- Provide technical escalation point to the ServiceDesk
- Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
- Able to occasionally work out of hours to avoid disruption to end users
- Able to join an out-of-hours on-call rota
- Matric / NQF Level 4
- Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
- IT qualification advantageous
- Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
- Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
- Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
- Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
- Used to working in a reactive environment, able to prioritise issues based on impact and other factors
- Excellent business communication skills, able to speak to people at all levels
Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:
- Bring Your “A” Game
- Strive For Greater
- Enable and empower all employees
- Do the right thing
- Own it
- Deliver unbelievable service
NB: All appointments are subject to the positive outcome of pre-employment verification checks.
Site Reliability Engineer
Posted 8 days ago
Job Description
Join to apply for the Site Reliability Engineer role at CXP are now part of the Huntswood Group
Job Overview
The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.
Job Responsibilities
Identify and rectify shortcomings and weaknesses in:
- Windows
- Operating system deployment (AutoPilot) and application deployment (Intune)
- Frequently raised issues for agents
- Automation
- We use PowerShell and/or Python
- Networking
- Capacity (we use Site24x7 and Meraki to monitor our systems)
- Responsiveness (with several offices in two countries, addressing delays is important)
- Regional requirements
- Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
- Own the delivery of service for the offices
- Provide technical escalation point to the ServiceDesk
- Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
- Able to occasionally work out of hours to avoid disruption to end users
- Able to join an out-of-hours on-call rota
- Matric / NQF Level 4
- Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
- IT qualification advantageous
- Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
- Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
- Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
- Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
- Used to working in a reactive environment, able to prioritise issues based on impact and other factors
- Excellent business communication skills, able to speak to people at all levels
Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:
- Bring Your “A” Game
- Strive For Greater
- Enable and empower all employees
- Do the right thing
- Own it
- Deliver unbelievable service
NB: All appointments are subject to the positive outcome of pre-employment verification checks.
- Seniority level: Entry level
- Employment type: Full-time
- Job function: Engineering and Information Technology
- Industries: Outsourcing/Offshoring
Site Reliability Engineer
Posted 13 days ago
Job Description
Who are Tyk, and what do we do?
The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, or media industries (to name just a few!)
If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible. Founded in 2015 with offices in London – UK, London – Ontario, Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell, Dominos, Starbucks, to RBS and Societe Generale. We have a varied user base hailing from every continent – even Antarctica.
Our Mission
Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.
Total flexibility, default remote, radical responsibility
We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees; we believe this allows them to achieve their best results. It also means we can build the best possible team: location and working hours are no barrier.
If this sounds like an environment that you believe could work for you then read on to find out more.
The role:
At Tyk, we’re obsessed with building software that solves problems. We count on our Site Reliability Engineers (SREs) to empower users with a rich feature set, high availability, and stellar performance level to pursue their missions.
Our customer base is growing, so we’re seeking an experienced SRE to optimise, automate, and improve our performance, using insights from massive-scale data in real time. We want an original thinker, a challenger, a technical legend, an opinionated collaborator who wants to make things better.
Here’s what you’ll be getting up to:
- Proactive Monitoring: Ensure our production Cloud environment operates within defined SLAs through vigilant monitoring and proactive issue resolution.
- Alerting and Monitoring: Collaborate with Senior SRE to identify opportunities for building proactive alerting and monitoring systems; implement solutions to enhance system reliability.
- Performance Metrics: Contribute to defining key performance metrics for Cloud services, enabling performance improvements and success measurement.
- Solutions Development: Propose and develop solutions to maintain and enhance key performance indicators (KPIs) across our Cloud infrastructure.
- Data Analysis: Gather and analyse metrics from operating systems and applications to optimise system performance and expedite fault resolution.
- Innovation: Drive innovation by optimising system and infrastructure performance, anticipating customer needs, and proactively addressing scaling demands.
- Scalability: Work closely with commercial functions to optimise our platform for scalability and meet growing customer demands.
- Cloud Infrastructure: Analyse and ensure the automation, scalability, and efficient management of our Cloud infrastructure.
- Automation: Execute automation for known cloud operations tasks and create new automation solutions to streamline processes.
- Software Development: Design, write, and deliver software and automation solutions to enhance the availability, scalability, latency, and efficiency of our PaaS services.
- Root Cause Analysis: Participate in blame-free root cause analysis meetings to promote learning and continuous system improvement in the event of production system incidents.
- Documentation: Create and contribute to policies and runbooks to ensure that operational processes are well-documented and consistently followed.
- On-call Support: Provide on-call support, ensuring our Cloud services follow a 24/7 model by promptly responding to alerts, meeting SLAs, and automating root cause analysis.
- Upgrades and Migrations: Plan and execute software upgrades, including Kubernetes versions. Manage and communicate migrations from Classic Cloud to the new Cloud platform.
Here’s what we’re looking for:
- Strong collaboration skills
- Launching and operating production Kubernetes clusters
- Designing and operating infrastructure on AWS and other providers
- Operating MongoDB (or other document database) clusters
- Operating Redis (or other key-value storage) clusters
- Administering Linux servers
- Maintaining distributed software
- Operating Prometheus and Grafana
- Operating log collection and analysis systems
Skills:
- Kubernetes & containers (proficient)
- Go and/or Python (advanced)
- AWS (proficient)
- Linux (proficient)
- Terraform and IaC in general (proficient)
- Helm (familiar)
- MongoDB (or similar)
- Redis (or similar)
- Monitoring & logging
- Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
- Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)
Benefits
Here’s why you should join us:
- Everyone has unlimited paid holidays.
- We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique after all.
- Employee share scheme
- Generous maternity and paternity leave
- Volunteering Days
- Company retreats
- Employee Wellbeing platform
We all share the same vision – we value authenticity, respect, responsibility, independence, honesty, diversity and inclusion and most importantly treating others how you wish to be treated. We look for like-minded people who bring their personalities to work every day, strive to achieve their personal goals and who are willing to challenge the way we do things. Why? To make what we do even better!
Our values tell the story of Tyk – here’s how:
- It’s ok to screw up!
We’ve found that it’s often the ‘stupid’ or unexpected ideas that turn out to be the successful ones – so try it, at least we can say we have!
- The only stupid idea is the untested one!
It’s in our DNA – starting a business with founders 12 hours apart, giving our gateway away for free – sure, we did that, and we’d do it again!
- Trust starts with you – make it count!
Trust is a two-way street – instil it from day one!
- Assume best intent!
We have each other’s back – we’re all on the same team. Think before you speak or act.
- Make things better!
Always try to leave things better than when you found them – change is constant, inevitable and embraced! Be that change we want to see.
Tyk is an equal opportunities employer and we are determined to ensure that no applicant or employee receives less favourable treatment on the grounds of gender, age, disability, religion, belief, sexual orientation, marital status, or race, or is disadvantaged by conditions or requirements which cannot be shown to be justifiable.
Site Reliability Engineer
Posted 20 days ago
Job Description
Monitoring and Alerting: Implementing and maintaining monitoring systems to track system health and performance, alerting on symptoms rather than just outages.
Incident Response: Responding to and resolving production incidents, troubleshooting across the entire stack, and providing support for product teams.
Automation: Developing and implementing automation to streamline operational tasks, improve efficiency, and reduce manual effort.
Infrastructure Management: Managing and maintaining infrastructure, including platforms
Performance Optimization: Identifying and addressing performance bottlenecks, optimizing existing systems, and contributing to system design and capacity planning.
Collaboration: Working closely with development, operations, and other teams to ensure smooth deployments and efficient operations.
Continuous Improvement: Continuously improving systems and processes through post-incident reviews, documentation, and knowledge sharing.
Proactive Problem Solving: Identifying potential problems before they occur and developing solutions to prevent future issues.
Capacity Planning: Ensuring that systems can handle current and future demands.
Mentoring and Coaching: Sharing knowledge and providing guidance to junior engineers.
Skills and Qualifications:
- Strong understanding of system architecture, automation, and infrastructure tools.
- Proficiency in programming languages like Python, Go, or Java.
- Experience with cloud platforms like AWS, Azure or GCP.
- Familiarity with containerization technologies like Docker and Kubernetes.
- Experience with monitoring and alerting tools like Prometheus, Grafana, or New Relic.
- Strong problem-solving and troubleshooting skills.
- Excellent communication and collaboration skills.
- Ability to work independently and as part of a team.
Site Reliability Engineer
Posted 27 days ago
Job Description
Requirements:
- Card payment domain knowledge (mandatory)
- Experience with CI/CD and Build pipelines using Jenkins.
- Experience in public and private Cloud offerings (PCF, Azure, AWS etc.).
- Knowledge of NoSQL & SQL databases such as Mongo / Oracle.
- Experience and knowledge of managing distributed systems and working with microservices.
- Familiarity with Unix tooling, with strong scripting skills
- Exposure to working with Monitoring and Alerting tools such as Splunk, Dynatrace
- Proficiency in one of the following: Python, Java, Go or equivalent.
- Familiarity with defining SLOs and SLAs
- Prior experience of working in an SRE/DevOps team and excellent understanding of SRE/DevOps principles.
- High degree of initiative and self-motivation, with a willingness to take on challenging opportunities.
- Excellent communication and relationship building/collaboration skills.
- Design, implement and maintain monitoring systems
- Identify and resolve reliability issues
- Automate manual processes for efficient system operation
- Participate in on-call rotation to address system outages
- Collaborate with development teams to improve system design
- Lead incident management efforts by proactively monitoring and analyzing ISO 8583 financial transaction messages across the 4-party payment model (Cardholder, Merchant, Acquirer, Issuer).
Site Reliability Engineer
Posted today
Job Description
Location: North Riding, Johannesburg, South Africa
Type: Full-time
Office: Hybrid, 3 days in office a week
ABOUT YOU
You are a committed and capable Site Reliability Engineer (SRE) ready to take ownership of the uptime, performance, and scalability of our production and development systems. You will be responsible for managing the hosting environments of our ERP, customer platforms, internal applications, databases, and websites, ensuring they are secure, available, and optimised across all stages of deployment. This position is based in Johannesburg, offers a competitive salary, and provides an opportunity to build the foundations of infrastructure excellence for one of South Africa’s most promising fintech ventures.
As a Site Reliability Engineer, you will be the guardian of our technical stability and infrastructure performance. You will manage and optimise hosting environments across production and development instances, covering platforms like Odoo ERP, WhatsApp chatbot systems, APIs, internal tools, external facing websites and reporting databases. Your work ensures that the systems powering over 50 000 Sales Force members and thousands of end users remain resilient, scalable, and secure.
You will collaborate with engineers, product managers, and business teams to design infrastructure strategies, improve observability, manage deployments, respond to incidents, and drive continuous improvement. This is a rare opportunity to shape the infrastructure blueprint of a high growth, impact focused business from the ground up.
Infrastructure Management, Security & Uptime, Automation & CI/CD, Collaboration with Engineers
ABOUT US
Who we are and what we do.
Asuer is a fintech company committed to making life simpler and more secure for African communities through innovative financial and technology solutions. We operate across insurance and telecommunications, with plans to expand into digital payments. Our focus is on removing barriers and helping people achieve their goals.
Born from the ongoing digital transformation of Botle Buhle Brands (BBB), one of Africa’s leading direct-selling businesses, Asuer has grown into an independent company centred on financial inclusion and accessible technology. Everything we build is guided by our core values: Impact, Innovation, and Integrity.
- Managing and monitoring the infrastructure of our ERP systems, applications, APIs, and databases.
- Ensuring high availability and scalability of production environments and development pipelines.
- Administering cloud environments including deployments, rollbacks, and updates.
- Establishing and maintaining CI/CD workflows for rapid and safe deployments.
- Setting up monitoring, logging, and alerting systems to track system health and performance.
- Investigating and resolving production incidents in a timely and thorough manner.
- Implementing backup, recovery, and failover processes to ensure data integrity.
- Improving observability and reporting across environments and services.
- Hardening infrastructure security and enforcing access controls and best practices.
- Supporting development teams with staging, test, and release environments.
- Automating routine tasks to improve system efficiency and reduce human error.
- Experience managing Linux-based production environments, preferably on Ubuntu
- Strong proficiency in cloud hosting platforms such as AWS or Google Cloud
- Solid understanding of containerisation using Docker and orchestration tools
- Experience with CI/CD tools and pipeline automation
- Familiarity with infrastructure as code tools such as Terraform or Ansible
- Comfortable working with PostgreSQL and database administration best practices
- Networking, DNS, and load balancing
- Monitoring and alerting using tools like Grafana, Prometheus, or cloud native solutions
- Understanding of secure deployment practices including firewalls, SSL, and API rate limiting
- Set up and manage reliable and scalable hosting environments
- Diagnose and resolve incidents efficiently with minimal downtime
- Collaborate with software teams to enable faster and safer deployments
- Document infrastructure processes and maintain infrastructure knowledge bases
- Implement DevOps and SRE practices tailored to a fast moving startup context
- Build processes that are robust and scale as the company grows
- Balance performance, security, and simplicity in all infrastructure decisions
- Odoo hosting and maintenance workflows
- Hosting ERP systems, databases, and API driven platforms
- Securing web infrastructure and access credentials
- Optimising costs and performance in cloud environments
- Scripting and automation using Bash, Python, or similar
- Logging and system observability tools
- Fast recovery planning and disaster mitigation
- A tertiary qualification in Computer Science, Information Technology, or a related field
- Minimum of 3 years of experience in a systems administration, DevOps, or SRE role
- Strong problem solving, troubleshooting, and communication skills
- Proficiency in English reading, writing, and speaking
A BIT MORE ABOUT US
At Asuer, you’ll join a mission with real meaning, where your work empowers thousands of people across Africa. You’ll collaborate with smart, curious teammates who move fast and build with purpose, without the drag of legacy systems. We offer competitive pay, a flexible environment, and the autonomy to shape systems from the ground up. This is a place for real growth, where you scale products that matter and make a tangible impact every day.
Site Reliability Engineer
Posted today
Job Description
Job Responsibilities
Identify and rectify shortcomings and weaknesses in:
- Windows
  - Operating system deployment (AutoPilot) and application deployment (Intune)
  - Frequently raised issues for agents
- Automation
  - We use PowerShell and/or Python
- Networking
  - Capacity (we use Site24x7 and Meraki to monitor our systems)
  - Responsiveness (with several offices in two countries, addressing delays is important)
- Regional requirements
  - Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
- Own the delivery of service for the offices
- Provide a technical escalation point for the ServiceDesk
- Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
- Able to occasionally work out of hours to avoid disruption to end users
- Able to join an out-of-hours on-call rota
- Matric / NQF Level 4
- Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
- IT qualification advantageous
- Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
- Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
- Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
- Networking: You will have managed enterprise network devices such as firewalls and switches, understand what a subnet is used for, what the difference between UDP and TCP is, and how works
- Used to working in a reactive environment, able to prioritise issues based on impact and other factors
- Excellent business communication skills, able to speak to people at all levels
Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:
- Bring Your “A” Game
- Strive For Greater
- Enable and empower all employees
- Do the right thing
- Own it
- Deliver unbelievable service
"It's not just about what we do, but the way we do it. And it's our values that make us special."
NB: All appointments are subject to the positive outcome of pre-employment verification checks.