SRE Engineer
Redotpay · Lok Ma Chau, 000, Hong Kong · Active · BambooHR
Job facts
| Field | Value |
|---|---|
| Company | Redotpay |
| Title | SRE Engineer |
| Normalized title | - |
| Department / team | 11# Central Hub |
| Location | Lok Ma Chau |
| Work model | - |
| Employment type | Full Time |
| Salary | - |
| Status | active |
| ATS provider | BambooHR |
| Posted / first seen | 2026-05-15 / 2026-05-30 |
| Changed / last seen | 2026-05-30 / 2026-06-06 |
Linked records
| Field | Value |
|---|---|
| Company | Redotpay |
| Source | 1e7f2d06-8d7b-467d-b843-323ee6bc1221 |
| ATS provider | BambooHR |
Description
SRE Engineer
Role Overview
As a Site Reliability Engineer (SRE), you will be the guardian of our app and core business systems, ensuring their stability, availability, and recoverability. Through robust monitoring and alerting, incident response, release governance, capacity planning, automation, and disaster recovery drills, you will safeguard our end-user experience and maintain uninterrupted business continuity.
Core Responsibilities
App Stability Assurance
- Own the stability monitoring for critical user journeys, including login, homepage, trading, payments, deposits/withdrawals, and core APIs.
- Define and track core Service Level Indicators (SLIs) such as user-side availability, API success/error rates, latency, and crash rates.
- Promptly detect and address issues like app launch failures, API timeouts, service degradation, and regional access anomalies.
Monitoring, Alerting & Observability
- Build and optimize comprehensive observability capabilities encompassing logs, metrics, distributed tracing, business probes, and Real User Monitoring (RUM).
- Refine alerting rules to reduce noise/false positives and improve the accuracy of incident detection.
- Establish and enforce tiered incident classification (P0/P1/P2), alongside clear notification, escalation, and response protocols.
Incident Response & Emergency Handling
- Lead or actively participate in production incident triage, mitigation, recovery, and post-mortem analysis.
- Develop and maintain emergency runbooks for critical scenarios (e.g., app downtime, core API failures, database anomalies, cloud service outages, network disruptions).
- Drive Root Cause Analysis (RCA) and ensure the closed-loop implementation of corrective actions.
Release & Change Stability Governance
- Participate in establishing best practices for production releases, canary/gray deployments, rollbacks, change windows, and post-release monitoring.
- Identify and mitigate stability risks during the release pipeline to prevent incidents caused by deployments or configuration changes.
- Champion the adoption of automated deployments, automated rollbacks, and advanced change risk controls.
Capacity, Performance & Resilience
- Contribute to capacity planning, performance stress testing, resource utilization monitoring, and scaling strategies.
- Drive the implementation of reliability patterns, including rate limiting, graceful degradation, circuit breaking, and backup/restore mechanisms.
- Regularly organize or participate in chaos engineering/fault drills, disaster recovery exercises, and restoration validation.
Automation & Toil Reduction
- Develop tools and platforms for automated health checks, alert analysis, and system self-healing.
- Eliminate manual toil to drastically improve the efficiency of production issue resolution.
- Standardize operations by documenting Standard Operating Procedures (SOPs), runbooks, and post-mortem templates.
Qualifications
- Solid understanding of core infrastructure components: Linux, networking, databases, caching, middleware, and cloud services.
- Familiarity with common modern architectures: App backend services, API gateways, load balancing, CDN, and Kubernetes/containerization.
- Hands-on experience with one or more monitoring and observability ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, CloudWatch, APM, distributed tracing).
- Proven track record in handling production incidents, with the ability to independently perform log analysis, trace debugging, performance profiling, and system recovery.
- Strong understanding of SRE workflows, including deployments, canary releases, rollbacks, capacity planning, incident response, and post-mortems.
- Proficiency in scripting or development (Shell, Python, or Go) to build automation tools.
- Preferred: Experience ensuring the stability of global apps, or a background in Payments, FinTech, Web3, or Cross-border businesses.
Full job record
| Field | Value |
|---|---|
| Job ID | 264b462a75ec5f683bbef371749b4bc5793b6440 |
| Org ID | c825e977-0449-4906-8427-64816f10b2c7 |
| Source ID | 1e7f2d06-8d7b-467d-b843-323ee6bc1221 |
| Board ID | 1e7f2d06-8d7b-467d-b843-323ee6bc1221 |
| Provider | bamboohr |
| Provider Job Key | 165 |
| Title | SRE Engineer |
| Normalized Title | — |
| Status | active |
| Active | yes |
| Location Text | Lok Ma Chau, 000, Hong Kong |
| Department | 11# Central Hub |
| Team | — |
| Employment Type | full_time |
| Workplace Type | — |
| Remote Policy | — |
| Country | — |
| Region | — |
| City | Lok Ma Chau |
| Salary Raw | — |
| Salary Min | — |
| Salary Max | — |
| Salary Currency | — |
| Salary Period | — |
| Source URL | https://redotpay.bamboohr.com/careers/165 |
| Apply URL | https://redotpay.bamboohr.com/careers/165 |
| First Seen At | 2026-05-30 05:43:51Z |
| Last Seen At | 2026-06-06 19:39:10Z |
| Last Checked At | 2026-06-06 19:39:10Z |
| Last Changed At | 2026-05-30 05:43:51Z |
| Inactive At | — |
| Source Posted At | 2026-05-15 00:00:00Z |
| Source Updated At | — |
| Raw Payload Uri | s3://job-postings-prod-raw-590183727216/raw/provider=bamboohr/board=redotpay/date=2026-06-06/2026-06-06T19-39-07-895Z-c681d8dcbb5d4afbebb0c25611a2632286e28fd0a525488d173a6892d2842a2e.json |
Event Fields
{
"content_hash": "39d924de4708a15a98ef4e9cfd93af241349469354c326da8b3c755e34825986",
"source_hash": "ab299cd2ae0eec2157c62746f25d11a44b72e5d93bb2896eb66e398fa40f66f8",
"last_changed_at": "2026-05-30T05:43:51.380Z",
"active_status": "active"
}
Parsed Structured
{
"language": "en",
"location": {
"raw": "Lok Ma Chau, 000, Hong Kong",
"city": "Lok Ma Chau",
"region": null,
"country": null,
"is_remote": false,
"confidence": 0.8
},
"salary_max": null,
"salary_min": null,
"inferred_at": "2026-06-06T19:39:10.782Z",
"launch_scope": {
"reason": "bamboohr_production_catalog",
"included": true,
"location": {
"raw": "Lok Ma Chau, 000, Hong Kong",
"city": "Lok Ma Chau",
"region": null,
"country": null,
"is_remote": false,
"confidence": 0.8
},
"countries": []
},
"remote_policy": null,
"salary_period": null,
"workplace_type": null,
"salary_currency": null
}
Extensions
{}
Native Structured
{
"list_job": {
"id": "165",
"isRemote": null,
"location": {
"city": "Lok Ma Chau",
"state": null
},
"atsLocation": {
"city": null,
"state": null,
"country": null,
"province": null
},
"departmentId": "18846",
"locationType": "0",
"jobOpeningName": "SRE Engineer ",
"departmentLabel": "11# Central Hub",
"employmentStatusLabel": "Full-Time"
},
"detail_errors": [],
"detail_job_opening": {
"location": {
"city": "Lok Ma Chau",
"state": null,
"postalCode": "000",
"addressCountry": "Hong Kong"
},
"datePosted": "2026-05-15",
"atsLocation": {
"city": null,
"state": null,
"country": null,
"countryId": null
},
"description": "<p><span style=\"font-size: 18pt\"><span style=\"font-weight: bold\">SRE Engineer </span></span></p>\n<p><span style=\"font-weight: bold\">Role Overview</span></p>\n<p>As a Site Reliability Engineer (SRE), you will be the guardian of our app and core business systems, ensuring their stability, availability, and recoverability. Through robust monitoring and alerting, incident response, release governance, capacity planning, automation, and disaster recovery drills, you will safeguard our end-user experience and maintain uninterrupted business continuity.</p>\n<p><br></p>\n<p><span style=\"font-weight: bold\">Core Responsibilities</span></p>\n<p><span style=\"font-weight: bold\">App Stability Assurance</span></p>\n<ul>\n<li>Own the stability monitoring for critical user journeys, including login, homepage, trading, payments, deposits/withdrawals, and core APIs.</li>\n<li>Define and track core Service Level Indicators (SLIs) such as user-side availability, API success/error rates, latency, and crash rates.</li>\n<li>Promptly detect and address issues like app launch failures, API timeouts, service degradation, and regional access anomalies.</li>\n</ul>\n<p><span style=\"font-weight: bold\">Monitoring, Alerting & Observability</span></p>\n<ul>\n<li>Build and optimize comprehensive observability capabilities encompassing logs, metrics, distributed tracing, business probes, and Real User Monitoring (RUM).</li>\n<li>Refine alerting rules to reduce noise/false positives and improve the accuracy of incident detection.</li>\n<li>Establish and enforce tiered incident classification (P0/P1/P2), alongside clear notification, escalation, and response protocols.</li>\n</ul>\n<p><span style=\"font-weight: bold\">Incident Response & Emergency Handling</span></p>\n<ul>\n<li>Lead or actively participate in production incident triage, mitigation, recovery, and post-mortem analysis.</li>\n<li>Develop and maintain emergency runbooks for critical scenarios (e.g., app 
downtime, core API failures, database anomalies, cloud service outages, network disruptions).</li>\n<li>Drive Root Cause Analysis (RCA) and ensure the closed-loop implementation of corrective actions.</li>\n</ul>\n<p><span style=\"font-weight: bold\">Release & Change Stability Governance</span></p>\n<ul>\n<li>Participate in establishing best practices for production releases, canary/gray deployments, rollbacks, change windows, and post-release monitoring.</li>\n<li>Identify and mitigate stability risks during the release pipeline to prevent incidents caused by deployments or configuration changes.</li>\n<li>Champion the adoption of automated deployments, automated rollbacks, and advanced change risk controls.</li>\n</ul>\n<p><span style=\"font-weight: bold\">Capacity, Performance & Resilience</span></p>\n<ul>\n<li>Contribute to capacity planning, performance stress testing, resource utilization monitoring, and scaling strategies.</li>\n<li>Drive the implementation of reliability patterns, including rate limiting, graceful degradation, circuit breaking, and backup/restore mechanisms.</li>\n<li>Regularly organize or participate in chaos engineering/fault drills, disaster recovery exercises, and restoration validation.</li>\n</ul>\n<p><span style=\"font-weight: bold\">Automation & Toil Reduction</span></p>\n<ul>\n<li>Develop tools and platforms for automated health checks, alert analysis, and system self-healing.</li>\n<li>Eliminate manual toil to drastically improve the efficiency of production issue resolution.</li>\n<li>Standardize operations by documenting Standard Operating Procedures (SOPs), runbooks, and post-mortem templates.</li>\n</ul>\n<p><span style=\"font-weight: bold\">Qualifications</span></p>\n<ul>\n<li>Solid understanding of core infrastructure components: Linux, networking, databases, caching, middleware, and cloud services.</li>\n<li>Familiarity with common modern architectures: App backend services, API gateways, load balancing, CDN, and 
Kubernetes/containerization.</li>\n<li>Hands-on experience with one or more monitoring and observability ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, CloudWatch, APM, distributed tracing).</li>\n<li>Proven track record in handling production incidents, with the ability to independently perform log analysis, trace debugging, performance profiling, and system recovery.</li>\n<li>Strong understanding of SRE workflows, including deployments, canary releases, rollbacks, capacity planning, incident response, and post-mortems.</li>\n<li>Proficiency in scripting or development (Shell, Python, or Go) to build automation tools.</li>\n<li><span style=\"font-weight: bold\">Preferred:</span> Experience ensuring the stability of global apps, or a background in Payments, FinTech, Web3, or Cross-border businesses.</li>\n</ul>",
"compensation": null,
"departmentId": "18846",
"locationType": "0",
"seekPromoted": false,
"jobCategoryId": null,
"jobOpeningName": "SRE Engineer ",
"departmentLabel": "11# Central Hub",
"jobOpeningStatus": "Open",
"minimumExperience": null,
"jobOpeningShareUrl": "https://redotpay.bamboohr.com/careers/165",
"employmentStatusLabel": "Full-Time"
}
}
Get this page with API
Rendered from the bluedoor Job Postings API. Reproduce it:
GET https://api.bluedoor.sh/job-postings/v1/jobs/264b462a75ec5f683bbef371749b4bc5793b6440?include=description
GET https://api.bluedoor.sh/job-postings/v1/orgs/c825e977-0449-4906-8427-64816f10b2c7
GET https://api.bluedoor.sh/job-postings/v1/sources/1e7f2d06-8d7b-467d-b843-323ee6bc1221
GET https://api.bluedoor.sh/job-postings/v1/jobs/264b462a75ec5f683bbef371749b4bc5793b6440/events
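The four requests above could be assembled from the IDs in the job record with a small script. This is a minimal sketch, not an official client: the helper names `build_urls` and `fetch_json` are hypothetical, and the page does not say whether the API requires authentication, so the fetch helper assumes a plain unauthenticated GET that returns JSON.

```python
import json
import urllib.request

# Base path taken from the requests listed on this page.
BASE_URL = "https://api.bluedoor.sh/job-postings/v1"


def build_urls(job_id: str, org_id: str, source_id: str) -> dict[str, str]:
    """Return the four endpoint URLs this page lists (hypothetical helper)."""
    return {
        "job": f"{BASE_URL}/jobs/{job_id}?include=description",
        "org": f"{BASE_URL}/orgs/{org_id}",
        "source": f"{BASE_URL}/sources/{source_id}",
        "events": f"{BASE_URL}/jobs/{job_id}/events",
    }


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode its JSON body (hypothetical helper).

    Assumption: the endpoint answers an unauthenticated GET with JSON.
    Add whatever headers your access requires if that is not the case.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# IDs copied from the "Full job record" table.
urls = build_urls(
    job_id="264b462a75ec5f683bbef371749b4bc5793b6440",
    org_id="c825e977-0449-4906-8427-64816f10b2c7",
    source_id="1e7f2d06-8d7b-467d-b843-323ee6bc1221",
)
```

Calling `fetch_json(urls["job"])` would then issue the first request. The block itself makes no network calls, so it can be run safely before access details are confirmed.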