60 Site Reliability Engineer jobs in South Africa

Site Reliability Engineer

Flash Group

Posted 3 days ago

Job Description

Flash

2024/12/12 Western Cape

Job Reference Number: T169

Department: Technology

Business Unit:

Industry: Fintech

Job Type: Permanent

Positions Available: 2

Salary: Market Related

We are looking for an individual passionate about technology with experience in developing and managing cutting-edge environment monitoring solutions, as well as using software and automation to solve problems and manage production systems.

Job Description

RESPONSIBILITIES:

  • Master multiple scripting and programming languages to achieve advanced proficiency and deliver robust solutions.
  • Drive the design and implementation of sophisticated automation tools and processes for managing large-scale systems.
  • Lead critical incident responses with composure and efficiency, followed by thorough post-incident reviews to implement preventative measures.
  • Shape system architecture and design, bringing your vision and expertise to influence high-impact decisions.
  • Champion the creation and adherence to reliability standards, ensuring scalable and sustainable system operations.
  • Demonstrate strong strategic thinking and planning abilities to drive team and organizational success.
  • Exhibit exceptional leadership skills, with the capacity to influence key technical decisions and inspire cross-functional teams.
  • Possess mentorship and coaching expertise to nurture and develop junior and intermediate team members, fostering a collaborative and growth-oriented environment.
Job Requirements

MINIMUM REQUIREMENTS:

  • 8-10 years' relevant experience in SRE, DevOps, or system engineering.
  • Matric
  • Proficiency in scripting languages.
  • Relevant certifications such as Oracle, Cloud, DevOps.
TECHNICAL SKILLS:
  • Continuous delivery
  • Cloud skills & best practices
  • Observability (System and Application Performance Monitoring)
  • Infrastructure as code
  • Configuration management (Infrastructure as a Service)
  • Containers
  • Automation
  • Collaboration and Communication
  • Coding and Scripting
  • Azure DevOps
  • General system uptimes
  • SLO (Service-level Objectives)
  • Latency
  • Incident and outage management
  • Change management
  • Capacity planning

Site Reliability Engineer

CXP are now part of the Huntswood Group

Posted 4 days ago

Job Description

Job Overview

The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.
Job Responsibilities

Identify and rectify shortcomings and weaknesses in:
  • Windows
    • Operating system deployment (AutoPilot) and application deployment (Intune)
    • Frequently raised issues for agents
  • Automation
    • We use PowerShell and/or Python
  • Networking
    • Capacity (we use Site24x7 and Meraki to monitor our systems)
    • Responsiveness (with several offices in two countries, addressing delays is important)
  • Regional requirements
    • Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
  • Own the delivery of service for the offices
  • Provide technical escalation point to the ServiceDesk
  • Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
  • Able to occasionally work out of hours to avoid disruption to end users
  • Able to join an out-of-hours on-call rota
Job Requirements
  • Matric / NQF Level 4
  • Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
  • IT qualification advantageous
Required Skills
  • Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
  • Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
  • Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
  • Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
  • Used to working in a reactive environment, able to prioritise issues based on impact and other factors
  • Excellent business communication skills, able to speak to people at all levels
Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:

  • Bring Your “A” Game
  • Strive For Greater
  • Enable and empower all employees
  • Do the right thing
  • Own it
  • Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.

Site Reliability Engineer

Gauteng, Gauteng Asuer (Pty)

Posted 5 days ago

Job Description

Location: North Riding, Johannesburg, South Africa

Type: Full-time

Office: Hybrid, 3 days in office a week

ABOUT YOU

We are looking for.


A committed and capable Site Reliability Engineer (SRE) to take ownership of the uptime, performance, and scalability of our production and development systems. You will be responsible for managing the hosting environments of our ERP, customer platforms, internal applications, databases, and websites, ensuring they are secure, available, and optimised across all stages of deployment. This position is based in Johannesburg, offers a competitive salary, and provides an opportunity to build the foundations of infrastructure excellence for one of South Africa’s most promising fintech ventures.

What you'll get to do and why we need you.


As a Site Reliability Engineer, you will be the guardian of our technical stability and infrastructure performance. You will manage and optimise hosting environments across production and development instances, covering platforms like Odoo ERP, WhatsApp chatbot systems, APIs, internal tools, external facing websites and reporting databases. Your work ensures that the systems powering over 50 000 Sales Force members and thousands of end users remain resilient, scalable, and secure.

You will collaborate with engineers, product managers, and business teams to design infrastructure strategies, improve observability, manage deployments, respond to incidents, and drive continuous improvement. This is a rare opportunity to shape the infrastructure blueprint of a high growth, impact focused business from the ground up.

  • Infrastructure Management
  • Security & Uptime
  • Automation & CI/CD
  • Collaboration with Engineers

ABOUT US

Who we are and what we do.

Asuer is a fintech company committed to making life simpler and more secure for African communities through innovative financial and technology solutions. We operate across insurance and telecommunications, with plans to expand into digital payments. Our focus is on removing barriers and helping people achieve their goals.

Born from the ongoing digital transformation of Botle Buhle Brands (BBB), one of Africa’s leading direct-selling businesses, Asuer has grown into an independent company centred on financial inclusion and accessible technology. Everything we build is guided by our core values: Impact, Innovation, and Integrity.

  • Managing and monitoring the infrastructure of our ERP systems, applications, APIs, and databases.
  • Ensuring high availability and scalability of production environments and development pipelines.
  • Administering cloud environments including deployments, rollbacks, and updates.
  • Establishing and maintaining CI/CD workflows for rapid and safe deployments.
  • Setting up monitoring, logging, and alerting systems to track system health and performance.
  • Investigating and resolving production incidents in a timely and thorough manner.
  • Implementing backup, recovery, and failover processes to ensure data integrity.
  • Improving observability and reporting across environments and services.
  • Hardening infrastructure security and enforcing access controls and best practices.
  • Supporting development teams with staging, test, and release environments.
  • Automating routine tasks to improve system efficiency and reduce human error.
Our requirements include. Technical skills in:
  • Experience managing Linux based production environments preferably on Ubuntu
  • Strong proficiency in cloud hosting platforms such as AWS or Google Cloud
  • Solid understanding of containerisation using Docker and orchestration tools
  • Experience with CI/CD tools and pipeline automation
  • Familiarity with infrastructure as code tools such as Terraform or Ansible
  • Comfortable working with PostgreSQL and database administration best practices
  • Networking, DNS, and load balancing
  • Monitoring and alerting using tools like Grafana, Prometheus, or cloud native solutions
  • Understanding of secure deployment practices including firewalls, SSL, and API rate limiting
Must be able to:
  • Set up and manage reliable and scalable hosting environments
  • Diagnose and resolve incidents efficiently with minimal downtime
  • Collaborate with software teams to enable faster and safer deployments
  • Document infrastructure processes and maintain infrastructure knowledge bases
  • Implement DevOps and SRE practices tailored to a fast moving startup context
  • Build processes that are robust and scale as the company grows
  • Balance performance, security, and simplicity in all infrastructure decisions
Knowledge & experience:
  • Odoo hosting and maintenance workflows
  • Hosting ERP systems, databases, and API driven platforms
  • Securing web infrastructure and access credentials
  • Optimising costs and performance in cloud environments
  • Scripting and automation using Bash, Python, or similar
  • Logging and system observability tools
  • Fast recovery planning and disaster mitigation
Prerequisites:
  • A tertiary qualification in Computer Science, Information Technology, or a related field
  • Minimum of 3 years of experience in a systems administration, DevOps, or SRE role
  • Strong problem solving, troubleshooting, and communication skills
  • Proficiency in English reading, writing, and speaking

A BIT MORE ABOUT US

What we offer.

At Asuer, you’ll join a mission with real meaning, where your work empowers thousands of people across Africa. You’ll collaborate with smart, curious teammates who move fast and build with purpose, without the drag of legacy systems. We offer competitive pay, a flexible environment, and the autonomy to shape systems from the ground up. This is a place for real growth, where you scale products that matter and make a tangible impact every day.

Site Reliability Engineer

Durban, KwaZulu Natal CXP are now part of the Huntswood Group

Posted 6 days ago

Job Description

Job Overview

The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.

Job Responsibilities

Identify and rectify shortcomings and weaknesses in:

  • Windows
    • Operating system deployment (AutoPilot) and application deployment (Intune)
    • Frequently raised issues for agents
  • Automation
    • We use PowerShell and/or Python
  • Networking
    • Capacity (we use Site24x7 and Meraki to monitor our systems)
    • Responsiveness (with several offices in two countries, addressing delays is important)
  • Regional requirements
    • Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
  • Own the delivery of service for the offices
  • Provide technical escalation point to the ServiceDesk
  • Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
  • Able to occasionally work out of hours to avoid disruption to end users
  • Able to join an out-of-hours on-call rota
Job Requirements

  • Matric / NQF Level 4
  • Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
  • IT qualification advantageous

Required Skills

  • Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
  • Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
  • Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
  • Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
  • Used to working in a reactive environment, able to prioritise issues based on impact and other factors
  • Excellent business communication skills, able to speak to people at all levels

Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.

The job holder should align to our 6 Fundamental Values:

  • Bring Your “A” Game
  • Strive For Greater
  • Enable and empower all employees
  • Do the right thing
  • Own it
  • Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.

Site Reliability Engineer

Durban, KwaZulu Natal CXP are now part of the Huntswood Group

Posted 9 days ago

Job Description

Job Overview

The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.

Job Responsibilities

Identify and rectify shortcomings and weaknesses in:

  • Windows
    • Operating system deployment (AutoPilot) and application deployment (Intune)
    • Frequently raised issues for agents
  • Automation
    • We use PowerShell and/or Python
  • Networking
    • Capacity (we use Site24x7 and Meraki to monitor our systems)
    • Responsiveness (with several offices in two countries, addressing delays is important)
  • Regional requirements
    • Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
  • Own the delivery of service for the offices
  • Provide technical escalation point to the ServiceDesk
  • Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
  • Able to occasionally work out of hours to avoid disruption to end users
  • Able to join an out-of-hours on-call rota
Job Requirements

  • Matric / NQF Level 4
  • Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
  • IT qualification advantageous

Required Skills

  • Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
  • Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
  • Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
  • Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
  • Used to working in a reactive environment, able to prioritise issues based on impact and other factors
  • Excellent business communication skills, able to speak to people at all levels

Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.

The job holder should align to our 6 Fundamental Values:

  • Bring Your “A” Game
  • Strive For Greater
  • Enable and empower all employees
  • Do the right thing
  • Own it
  • Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.

Seniority level: Entry level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Outsourcing/Offshoring

Site Reliability Engineer

Iqtalent

Posted 14 days ago

Job Description

Who are Tyk, and what do we do?

The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, or media industries (to name just a few!)

If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible. Founded in 2015 with offices in London – UK, London – Ontario, Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell, Dominos, Starbucks, to RBS and Societe Generale. We have a varied user base hailing from every continent – even Antarctica.

Our Mission

Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.

Total flexibility, default remote, radical responsibility

We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees; we believe this allows them to achieve their best results. It also means we can build the best possible team: location and working hours are no barrier.

If this sounds like an environment that you believe could work for you then read on to find out more.

The role:

At Tyk, we’re obsessed with building software that solves problems. We count on our Site Reliability Engineers (SREs) to empower users with a rich feature set, high availability, and stellar performance level to pursue their missions.

Our customer base is growing, so we’re seeking an experienced SRE to optimise, automate, and improve our performance, using insights from massive-scale data in real time. We want an original thinker, a challenger, a technical legend, an opinionated collaborator who wants to make things better.

Here’s what you’ll be getting up to:

  • Proactive Monitoring: Ensure our production Cloud environment operates within defined SLAs through vigilant monitoring and proactive issue resolution.
  • Alerting and Monitoring: Collaborate with Senior SRE to identify opportunities for building proactive alerting and monitoring systems; implement solutions to enhance system reliability.
  • Performance Metrics: Contribute to defining key performance metrics for Cloud services, enabling performance improvements and success measurement.
  • Solutions Development: Propose and develop solutions to maintain and enhance key performance indicators (KPIs) across our Cloud infrastructure.
  • Data Analysis: Gather and analyse metrics from operating systems and applications to optimise system performance and expedite fault resolution.
  • Innovation: Drive innovation by optimising system and infrastructure performance, anticipating customer needs, and proactively addressing scaling demands.
  • Scalability: Work closely with commercial functions to optimise our platform for scalability and meet growing customer demands.
  • Cloud Infrastructure: Analyse and ensure the automation, scalability, and efficient management of our Cloud infrastructure.
  • Automation: Execute automation for known cloud operations tasks and create new automation solutions to streamline processes.
  • Software Development: Design, write, and deliver software and automation solutions to enhance the availability, scalability, latency, and efficiency of our PaaS services.
  • Root Cause Analysis: Participate in blame-free root cause analysis meetings to promote learning and continuous system improvement in the event of production system incidents.
  • Documentation: Create and contribute to policies and runbooks to ensure that operational processes are well-documented and consistently followed.
  • On-call Support: Provide on-call support, ensuring our Cloud services follow a 24/7 model by promptly responding to alerts, meeting SLAs, and automating root cause analysis.
  • Upgrades and Migrations: Plan and execute software upgrades, including Kubernetes versions. Manage and communicate migrations from Classic Cloud to the new Cloud platform.

Here’s what we’re looking for:

  • Strong collaboration skills
  • Launching and operating production Kubernetes clusters
  • Designing and operating infrastructure on AWS and other providers
  • Operating MongoDB (or other document database) clusters
  • Operating Redis (or other key-value storage) clusters
  • Administering Linux servers
  • Maintaining distributed software
  • Operating Prometheus and Grafana
  • Operating log collection and analysis systems

Skills:

  • Kubernetes & containers (proficient)
  • Go and/or Python (advanced)
  • AWS (proficient)
  • Linux (proficient)
  • Terraform and IaC in general (proficient)
  • Helm (familiar)
  • MongoDB (or similar)
  • Redis (or similar)
  • Monitoring & logging
  • Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
  • Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)

Benefits

Here’s why you should join us:

  • Everyone has unlimited paid holidays.
  • We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique after all.
  • Employee share scheme
  • Generous maternity and paternity leave
  • Volunteering Days
  • Company retreats
  • Employee Wellbeing platform

We all share the same vision – we value authenticity, respect, responsibility, independence, honesty, diversity and inclusion and, most importantly, treating others how you wish to be treated. We look for like-minded people who bring their personalities to work every day, strive to achieve their personal goals and are willing to challenge the way we do things. Why? To make what we do even better!

Our values tell the story of Tyk – here’s how:

  • It’s ok to screw up!

We’ve found that it’s often the ‘stupid’ or unexpected ideas that turn out to be the successful ones – so try it, at least we can say we have!

  • The only stupid idea, is the untested one!

It’s in our DNA – starting a business with founders 12 hours apart, giving our gateway away for free – sure, we did that, and we’d do it again!

  • Trust starts with you – make it count!

Trust is a two-way street – instil it from day one!

  • Assume best intent!

We have each other’s back – we’re all on the same team. Think before you speak or act.

  • Make things better!

Always try to leave things better than when you found them – change is constant, inevitable and embraced! Be that change we want to see.

Tyk is an equal opportunities employer and we are determined to ensure that no applicant or employee receives less favourable treatment on the grounds of gender, age, disability, religion, belief, sexual orientation, marital status, or race, or is disadvantaged by conditions or requirements which cannot be shown to be justifiable.

Site Reliability Engineer

Johannesburg, Gauteng Ziyasiza Consulting (Pty) Ltd

Posted 21 days ago

Job Description

Key Responsibilities:

Monitoring and Alerting: Implementing and maintaining monitoring systems to track system health and performance, alerting on symptoms rather than just outages.

Incident Response: Responding to and resolving production incidents, troubleshooting across the entire stack, and providing support for product teams.

Automation: Developing and implementing automation to streamline operational tasks, improve efficiency, and reduce manual effort.

Infrastructure Management: Managing and maintaining infrastructure, including platforms

Performance Optimization: Identifying and addressing performance bottlenecks, optimizing existing systems, and contributing to system design and capacity planning.

Collaboration: Working closely with development, operations, and other teams to ensure smooth deployments and efficient operations.

Continuous Improvement: Continuously improving systems and processes through post-incident reviews, documentation, and knowledge sharing.

Proactive Problem Solving: Identifying potential problems before they occur and developing solutions to prevent future issues.

Capacity Planning: Ensuring that systems can handle current and future demands.

Mentoring and Coaching: Sharing knowledge and providing guidance to junior engineers.

Skills and Qualifications:
  • Strong understanding of system architecture, automation, and infrastructure tools.
  • Proficiency in programming languages like Python, Go, or Java.
  • Experience with cloud platforms like AWS, Azure or GCP.
  • Familiarity with containerization technologies like Docker and Kubernetes.
  • Experience with monitoring and alerting tools like Prometheus, Grafana, or New Relic.
  • Strong problem-solving and troubleshooting skills.
  • Excellent communication and collaboration skills.
  • Ability to work independently and as part of a team.

Site Reliability Engineer

Pretoria, Gauteng SM Squared Talent (Pty) Ltd

Posted 28 days ago

Job Description


Requirements:
  • Card payment domain knowledge (mandatory)
  • Experience with CI/CD and Build pipelines using Jenkins.
  • Experience in public and private Cloud offerings (PCF, Azure, AWS etc.).
  • Knowledge of NoSQL & SQL databases such as Mongo / Oracle/
  • Experience and knowledge of managing distributed systems and working with microservices.
  • Familiarity with Unix tooling, with strong scripting skills
  • Exposure to working with Monitoring and Alerting tools such as Splunk, Dynatrace
  • Proficiency in one of the following: Python, Java, Go or equivalent.
  • Familiarity with defining SLOs and SLAs
  • Prior experience of working in an SRE/DevOps team and excellent understanding of SRE/DevOps principles.
  • High degree of initiative and self-motivation, with a willingness to take on challenging opportunities.
  • Excellent communication and relationship building/collaboration skills.
KPAs
  • Design, implement and maintain monitoring systems
  • Identify and resolve reliability issues
  • Automate manual processes for efficient system operation
  • Participate in on-call rotation to address system outages
  • Collaborate with development teams to improve system design
  • Lead incident management efforts by proactively monitoring and analyzing ISO 8583 financial transaction messages across the 4-party payment model (Cardholder, Merchant, Acquirer, Issuer).

Site Reliability Engineer

CXP are now part of the Huntswood Group

Posted today

Job Description

Job Overview

The purpose of the role of the Service Reliability Engineer is to focus upon the delivery of technology services to internal and external customers. You will take a holistic and forward looking view of all Huntswood services to minimize service impact.
Job Responsibilities
Identify and rectify shortcomings and weaknesses in:
  • Windows
    • Operating system deployment (AutoPilot) and application deployment (Intune)
    • Frequently raised issues for agents
  • Automation
    • We use PowerShell and/or Python
  • Networking
    • Capacity (we use Site24x7 and Meraki to monitor our systems)
    • Responsiveness (with several offices in two countries, addressing delays is important)
  • Regional requirements
    • Ensure that different locations have their unique (cultural, legal, infrastructure) requirements addressed
  • Own the delivery of service for the offices
  • Provide technical escalation point to the ServiceDesk
  • Work with the other technology teams (DevOps, Development, SecOps, Business Apps) when requiring their expertise or identifying improvements they can make
  • Able to occasionally work out of hours to avoid disruption to end users
  • Able to join an out-of-hours on-call rota
Job Requirements
  • Matric / NQF Level 4
  • Min 2 years in an IT Tech support role in a large scale operation working with Windows/Servers and networking
  • IT qualification advantageous
Required Skills
  • Windows: You will be comfortable diagnosing issues with servers and workstations using event log, performance monitor and other diagnostic tools
  • Intune: You have experience with using Intune to manage Windows endpoints, deploying apps and building systems with Autopilot
  • Scripting: You will have written PowerShell or Python scripts to automate repetitive tasks
  • Networking: You will have managed enterprise network devices such as firewalls and switches, and understand what a subnet is used for, the difference between UDP and TCP, and how HTTPS works
  • Used to working in a reactive environment, able to prioritise issues based on impact and other factors
  • Excellent business communication skills, able to speak to people at all levels
Core Behaviour

Huntswood’s employees are described as dependable, driven and collaborative.
The job holder should align to our 6 Fundamental Values:

  • Bring Your “A” Game
  • Strive For Greater
  • Enable and empower all employees
  • Do the right thing
  • Own it
  • Deliver unbelievable service

"It's not just about what we do, but the way we do it. And it's our values that make us special."

NB: All appointments are subject to the positive outcome of pre-employment verification checks.


Site Reliability Engineer

Iqtalent

Posted today

Job Description

Who are Tyk, and what do we do?

The Tyk API Management platform is helping to drive the connected world and power new products and services. We’re changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, or media industries (to name just a few!)

If you’ve banked online, used an app to check the news, or perhaps even driven a connected car, APIs, and by extension Tyk, make that possible. Founded in 2015 with offices in London – UK, London – Ontario, Atlanta and Singapore, we have many thousands of users of our B2B platform across the globe. Brands using Tyk range from Lotte, Bell, Dominos, Starbucks, to RBS and Societe Generale. We have a varied user base hailing from every continent – even Antarctica.

Our Mission

Tyk is on a mission to connect every system in the world. We’ve started by building an API Management platform.

Total flexibility, default remote, radical responsibility

We offer unlimited paid holidays and remote working from anywhere in the world, for everyone. Why? Tyk was founded on the principle of offering flexibility and autonomy to our employees; we believe this allows them to achieve their best results. It also means we can build the best possible team: location and working hours are no barrier.

If this sounds like an environment that you believe could work for you then read on to find out more.

The role:

At Tyk, we’re obsessed with building software that solves problems. We count on our Site Reliability Engineers (SREs) to empower users with a rich feature set, high availability, and stellar performance level to pursue their missions.

Our customer base is growing, so we’re seeking an experienced SRE to optimise, automate, and improve our performance, using insights from massive-scale data in real time. We want an original thinker, a challenger, a technical legend, an opinionated collaborator who wants to make things better.

Here’s what you’ll be getting up to:

  • Proactive Monitoring: Ensure our production Cloud environment operates within defined SLAs through vigilant monitoring and proactive issue resolution.
  • Alerting and Monitoring: Collaborate with Senior SRE to identify opportunities for building proactive alerting and monitoring systems; implement solutions to enhance system reliability.
  • Performance Metrics: Contribute to defining key performance metrics for Cloud services, enabling performance improvements and success measurement.
  • Solutions Development: Propose and develop solutions to maintain and enhance key performance indicators (KPIs) across our Cloud infrastructure.
  • Data Analysis: Gather and analyse metrics from operating systems and applications to optimise system performance and expedite fault resolution.
  • Innovation: Drive innovation by optimising system and infrastructure performance, anticipating customer needs, and proactively addressing scaling demands.
  • Scalability: Work closely with commercial functions to optimise our platform for scalability and meet growing customer demands.
  • Cloud Infrastructure: Analyse and ensure the automation, scalability, and efficient management of our Cloud infrastructure.
  • Automation: Execute automation for known cloud operations tasks and create new automation solutions to streamline processes.
  • Software Development: Design, write, and deliver software and automation solutions to enhance the availability, scalability, latency, and efficiency of our PaaS services.
  • Root Cause Analysis: Participate in blame-free root cause analysis meetings to promote learning and continuous system improvement in the event of production system incidents.
  • Documentation: Create and contribute to policies and runbooks to ensure that operational processes are well-documented and consistently followed.
  • On-call Support: Provide on-call support, ensuring our Cloud services follow a 24/7 model by promptly responding to alerts, meeting SLAs, and automating root cause analysis.
  • Upgrades and Migrations: Plan and execute software upgrades, including Kubernetes versions. Manage and communicate migrations from Classic Cloud to the new Cloud platform.

Here’s what we’re looking for:

  • Strong collaboration skills
  • Launching and operating production Kubernetes clusters
  • Designing and operating infrastructure on AWS and other providers
  • Operating MongoDB (or other document database) clusters
  • Operating Redis (or other key-value storage) clusters
  • Administering Linux servers
  • Maintaining distributed software
  • Operating Prometheus and Grafana
  • Operating log collection and analysis systems

Skills:

  • Kubernetes & containers (proficient)
  • Go and/or Python (advanced)
  • AWS (proficient)
  • Linux (proficient)
  • Terraform and IaC in general (proficient)
  • Helm (familiar)
  • MongoDB (or similar)
  • Redis (or similar)
  • Monitoring & logging
  • Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
  • Common networking protocols (DNS, TCP/IP, TLS, UDP)

Benefits

Here’s why you should join us:

  • Everyone has unlimited paid holidays.
  • We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique after all.
  • Employee share scheme
  • Generous maternity and paternity leave
  • Volunteering Days
  • Company retreats
  • Employee Wellbeing platform

We all share the same vision – we value authenticity, respect, responsibility, independence, honesty, diversity and inclusion and, most importantly, treating others how you wish to be treated. We look for like-minded people who bring their personalities to work every day, strive to achieve their personal goals and are willing to challenge the way we do things. Why? To make what we do even better!

Our values tell the story of Tyk – here’s how:

  • It’s ok to screw up!

We’ve found that it’s often the ‘stupid’ or unexpected ideas that turn out to be the successful ones – so try it, at least we can say we have!

  • The only stupid idea is the untested one!

It’s in our DNA – starting a business with founders 12 hours apart, giving our gateway away for free – sure, we did that, and we’d do it again!

  • Trust starts with you – make it count!

Trust is a two-way street – instil it from day one!

  • Assume best intent!

We have each other’s back – we’re all on the same team. Think before you speak or act.

  • Make things better!

Always try to leave things better than when you found them – change is constant, inevitable and embraced! Be that change we want to see.

What’s it like to work here? Check it out:

Tyk is an equal opportunities employer and we are determined to ensure that no applicant or employee receives less favourable treatment on the grounds of gender, age, disability, religion, belief, sexual orientation, marital status, or race, or is disadvantaged by conditions or requirements which cannot be shown to be justifiable.
