Senior Engineer, Data & Messaging Reliability – Riyadh or Remote, Full Time
NOTE: IN ORDER TO BE CONSIDERED FOR THIS POSITION YOU MUST BE WILLING TO WORK THE KSA WORKING WEEK, WHICH RUNS FROM SUNDAY TO THURSDAY IN THE GMT+3 TIME ZONE.
We are seeking a proactive and highly skilled Senior Data and Messaging Reliability Engineer to join our client’s Cloud Engineering team, based either in Riyadh, KSA, or remotely within a suitable time zone.
In this critical role, you will ensure the reliability, scalability, and high availability of our client’s data systems and messaging platforms. You will also oversee the operational excellence of technologies such as Kafka, ClickHouse, MySQL, and PostgreSQL, enabling secure and robust data operations across the organization.
Responsibilities:

Architect and implement resilient systems for ClickHouse, MySQL, and PostgreSQL to meet stringent uptime requirements.

Monitor and fine-tune database performance to ensure low-latency operations under high loads.

Develop and manage disaster recovery plans and backup strategies for critical data systems.

Forecast future data growth and plan for scalability to maintain reliability.

Design and maintain highly available Kafka clusters, ensuring fault tolerance and minimal downtime.

Optimize Kafka Streams and real-time processing pipelines for consistent data delivery and system reliability.

Ensure seamless integration between messaging systems and other components of the data ecosystem.

Implement and refine monitoring tools and alerting systems to detect and address issues before they escalate.

Lead root cause analysis and postmortems for incidents, applying preventive solutions.

Automate operational tasks such as failovers, scaling, and Kafka partition management.

Work closely with security teams to enforce RBAC, encryption, and compliance standards.

Maintain detailed documentation of reliability processes, disaster recovery strategies, and operational best practices.

Collaborate with engineering, operations, and security teams to align on reliability goals and implement improvements.

Mentor and provide technical guidance to junior engineers.

Stay up to date with industry advancements in reliability engineering, applying new practices to improve system stability.

Qualifications:

Education:

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.

Experience:

5+ years of experience in reliability engineering, data engineering, or messaging systems.

Proven expertise managing Kafka, ClickHouse, MySQL, and PostgreSQL in high-availability production environments.

Technical Skills:

Advanced knowledge of monitoring tools such as Prometheus, Grafana, or Datadog.

Proficiency in database tuning and query optimization for ClickHouse, MySQL, and PostgreSQL.

Expertise in disaster recovery planning and execution.

Strong scripting and programming skills (e.g., Python, Bash, Java).

Familiarity with containerization and orchestration tools like Kubernetes and Docker.

Soft Skills:

Exceptional problem-solving, collaboration, and communication skills.

Preferred Qualifications:

Experience with advanced Kafka features, such as Kafka Connect, Schema Registry, and tiered storage.

Knowledge of multi-region or geo-distributed system architecture design.

Familiarity with cloud-native reliability tools and services.

Expertise in database replication and sharding for scalability.

Proven success in automating operational workflows to improve efficiency and reliability.

FOR MORE INFORMATION
CONTACT Dan Wardle

Kingston Stanley
