Luthfi Farabi — Database Engineer

Database engineer keeping a production cloud fleet alive, and building the tools so it mostly keeps itself alive.

Solo DBA running a multi-database Alibaba Cloud fleet (MySQL · PostgreSQL · MongoDB), shipping AI-assisted automation so incidents fix themselves faster than I can type the RCA.

1TB · <1s: online DDL

95% → stable: prod fire, out

~50%: db bill cut

robots > toil: it runs itself

The kit I reach for

No skill bars. Just the tools that have actually been in production with me.

▚ Databases

MySQL · PostgreSQL · MongoDB · MariaDB · Oracle · Sybase · DB2 · Cassandra · Solr

☁ Alibaba Cloud

RDS · ApsaraDB · DAS · DTS · DMS

⚡ Performance

online DDL (INSTANT/INPLACE) · index surgery · execution-plan surgery · PgBouncer · work_mem tuning

⟲ Reliability

replication · HA / DR · failover · backup / restore

▤ Observability

Grafana · Prometheus · Mimir

🖥 OS & Infrastructure

GNU/Linux · AIX · Windows · Networking

⚙ Automation & Code

Bash · Python · Java · Claude + MCP · Jira / Confluence via MCP

◇ Process

5-Whys RCA · runbooks · shadow validation

Work History

The production environments I've managed and evolved over the years.

luthfi@fleet: ~/history

luthfi@fleet:~$ cat experience.txt

▸ Senior DBA 1 @ PT Mid Solusi Nusantara (Mekari) · March 2025 – Present
  Installation, configuration, and monitoring of ALIYUN services (RDS, DAS, DTS). Database logical design refinement, query performance tuning, cost optimization, and database migration.
▸ Data Engineering and Architect Lead @ PT KOLTIVA · July 2024 – March 2025
  Managed a Data Engineer and DBA team. Monitored servers, databases, and APIs with Datadog, Grafana, and AWS CloudWatch. Security assessment, DB versioning, and cost optimization.
▸ Database Engineer Lead & DBA @ PT Investree Radhika Jaya · April 2019 – February 2024
  Installation, configuration, and monitoring of MySQL, MariaDB, and Elasticsearch on ALIYUN. Handled DB tuning, backup/recovery plans, security audits, and automated cron jobs.
▸ System Engineer & Staff DBA @ rumah123.com (REA Group Asia) · February 2016 – March 2019
  Configured AWS services (EC2, RDS, S3). Monitored via New Relic. Docker container creation, Cassandra and Solr setups, backup procedures, and automated Bash scripts for alerting.
▸ TQA / Technical Quality Assurance @ PT Adidata (Project on PT. Bank Mandiri) · April 2015 – January 2016
  Conducted and monitored product testing, investigated user complaints, and prepared quality reports.
▸ Staff Database Administrator @ PT Collega Inti Pratama · July 2013 – December 2014
  Installed and configured Linux/AIX servers and Sybase and DB2 database servers. Supported data center operations.
▸ IT Network @ PT Bramanty Adhikari Tibra Syandana · February 2010 – April 2010
  Installed computer networks at Gatot Subroto Hospital, Jakarta.

luthfi@fleet:~$ cat education.txt

▸ Budi Luhur University · 2009 – 2013
  Faculty of Computer Study, Bachelor's degree, majoring in Computer Science.
▸ SMAN 5 Tangerang · 2006 – 2009
  Senior High Education.

Stuff I've built & fixed

Real incidents and real projects. The mess → what I did → the win.

The page said "high memory"; the cause was three things stacked. No connection pooler in front of a Puma app meant 498 idle connections each holding memory hostage. Long-running statements created lock contention. And a couple of hot queries were doing 23-second sequential scans on tables missing the right index.

The fix went in as a layered stack: PgBouncer in transaction-pooling mode to collapse the connection count, server-side idle timeouts to reap the stragglers, work_mem right-sized so sorts stopped spilling, and a targeted index plan to kill the seq-scans. Then I wrote the 5-Whys RCA in Confluence so the root cause — not just the symptom — is on record.
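To make the connection side of that diagnosis concrete, here is a minimal sketch of the triage step: given connection rows shaped like what `pg_stat_activity` exposes (a state plus how long the session has been in it), count sessions per state and flag the idle ones a timeout would reap. The `Conn` type, the `summarize_connections` name, and the 300-second cutoff are illustrative assumptions, not the tooling used in the incident.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Conn(NamedTuple):
    """One session, as it might be read from pg_stat_activity (assumed shape)."""
    state: str               # e.g. "active", "idle", "idle in transaction"
    seconds_in_state: float  # time since the session last changed state


def summarize_connections(conns: Iterable[Conn], idle_cutoff_s: float = 300.0) -> dict:
    """Count sessions per state and how many idle ones exceed the cutoff."""
    conns = list(conns)
    by_state = Counter(c.state for c in conns)
    # Anything whose state starts with "idle" and has sat past the cutoff
    # is a candidate for a server-side idle timeout to reap.
    reapable = sum(
        1 for c in conns
        if c.state.startswith("idle") and c.seconds_in_state >= idle_cutoff_s
    )
    return {"total": len(conns), "by_state": dict(by_state), "reapable": reapable}


if __name__ == "__main__":
    sample = [
        Conn("active", 0.2),
        Conn("idle", 900.0),
        Conn("idle", 12.0),
        Conn("idle in transaction", 400.0),
    ]
    print(summarize_connections(sample))
```

A high `reapable` count relative to `total` is the signal that pooling and timeouts, rather than more memory, are the lever to pull.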

INSTANT DDL is magic right up until you hit the limits nobody mentions — the instant-add column budget, and the metadata lock that can still stall behind a long transaction. So before touching prod I verified the INSTANT headroom on the table and pre-flighted the metadata locks to be sure the ALTER wouldn't queue behind anything.

Result: the columns landed with no table rewrite and no downtime on a 1TB / 543M-row table, in under a second. The service integration shipped on schedule.
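The pre-flight described above could look roughly like the sketch below. It assumes the caller has already read two things from the server: how many instant row versions the table has consumed, and the age of currently open transactions. The default budget of 64 row versions, the 5-second age limit, and every function and table name here are assumptions for illustration; check the limit for your MySQL version. Open-transaction age is only a conservative proxy for a metadata-lock conflict.

```python
from typing import List, NamedTuple, Sequence


class OpenTrx(NamedTuple):
    trx_id: str
    age_seconds: float


def preflight_instant_alter(
    row_versions_used: int,
    open_trx: Sequence[OpenTrx],
    max_row_versions: int = 64,   # assumed budget; version dependent
    max_trx_age_s: float = 5.0,   # assumed tolerance for open transactions
) -> List[str]:
    """Return a list of reasons NOT to run the ALTER; empty means go."""
    problems = []
    if row_versions_used >= max_row_versions:
        problems.append(
            f"instant budget exhausted: {row_versions_used}/{max_row_versions}"
        )
    for trx in open_trx:
        if trx.age_seconds > max_trx_age_s:
            problems.append(
                f"transaction {trx.trx_id} open for {trx.age_seconds:.0f}s"
            )
    return problems


def build_alter(table: str, column_defs: Sequence[str]) -> str:
    """Build an ADD COLUMN statement that requests ALGORITHM=INSTANT explicitly."""
    adds = ", ".join(f"ADD COLUMN {c}" for c in column_defs)
    return f"ALTER TABLE {table} {adds}, ALGORITHM=INSTANT"


if __name__ == "__main__":
    issues = preflight_instant_alter(3, [OpenTrx("421", 0.4)])
    if not issues:
        # "orders" and "ext_ref" are hypothetical names.
        print(build_alter("orders", ["ext_ref VARCHAR(64) NULL"]))
```

Requesting `ALGORITHM=INSTANT` explicitly is the design choice that matters: if the server cannot do the change instantly, the statement errors out instead of silently falling back to a slower algorithm on a 1TB table.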

The instinct on a CPU-pinned cluster is to scale up. I scaled down — after finding why it was hot. 843,000 slow queries in 27 hours and 94–98% CPU weren't a capacity problem, they were a query problem. The worst offender was a LATERAL JOIN doing a filesort on every execution; the right composite index removed the sort entirely.

On top of that, over 99% of connections were idle — bloat, not load. With the queries fixed and the connection story cleaned up, the 3× 32vCPU/128GB footprint was wild overkill. Right-sizing the boxes took 40–55% off the bill.
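The write-up does not say how the worst offender was identified. One common way to separate a query problem from a capacity problem, sketched here under that caveat, is to collapse slow-log entries into fingerprints (literals replaced by `?`) and rank them by total time. The regexes are deliberately simple and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Tuple

_STR = re.compile(r"'(?:[^'\\]|\\.)*'")  # single-quoted string literals
_NUM = re.compile(r"\b\d+(?:\.\d+)?\b")  # standalone numeric literals
_WS = re.compile(r"\s+")


def fingerprint(sql: str) -> str:
    """Normalize a statement so calls differing only in literals group together."""
    s = _STR.sub("?", sql)
    s = _NUM.sub("?", s)
    return _WS.sub(" ", s).strip().lower()


def rank_by_total_time(
    entries: Iterable[Tuple[str, float]]
) -> List[Tuple[str, int, float]]:
    """Return (fingerprint, call_count, total_seconds), slowest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for sql, seconds in entries:
        bucket = totals[fingerprint(sql)]
        bucket[0] += 1
        bucket[1] += seconds
    ranked = [(fp, n, total) for fp, (n, total) in totals.items()]
    return sorted(ranked, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    log = [
        ("SELECT 1", 0.5),
        ("SELECT * FROM t WHERE id = 1", 2.0),
        ("SELECT * FROM t WHERE id = 2", 3.0),
    ]
    for fp, calls, total in rank_by_total_time(log):
        print(f"{total:8.2f}s  {calls:5d}x  {fp}")
```

When a handful of fingerprints account for most of the total time, fixing those queries comes before resizing the instance.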

You can't alert on what you've never looked at. I read all 33 panels across the 19-host fleet and wrote down what each one was actually telling me. That surfaced the host quietly running 13,000 QPS with 21 slow queries/sec, another near connection saturation, and 17 blind spots — signals we were graphing but not alerting on.

The output wasn't a complaint, it was a spec: the exact alerts, thresholds and owners needed to turn those blind spots into pages before they turn into incidents.
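A spec like that reduces to two small operations: find signals that are graphed but have no rule, and evaluate the rules that do exist. The sketch below shows both; the `AlertRule` fields mirror the "alerts, thresholds and owners" wording, while the signal names, thresholds, and owners in the example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class AlertRule:
    signal: str       # metric name as it appears on the dashboard
    threshold: float  # fire when the current value is at or above this
    owner: str        # who gets paged


def blind_spots(graphed: Iterable[str], rules: Iterable[AlertRule]) -> List[str]:
    """Signals that have a panel but no alert rule."""
    alerted = {rule.signal for rule in rules}
    return sorted(set(graphed) - alerted)


def firing(rules: Iterable[AlertRule], current: Dict[str, float]) -> List[AlertRule]:
    """Rules whose signal is currently at or above threshold."""
    return [r for r in rules if r.signal in current and current[r.signal] >= r.threshold]


if __name__ == "__main__":
    # Hypothetical signal names and thresholds.
    rules = [
        AlertRule("slow_queries_per_sec", 10.0, "dba-oncall"),
        AlertRule("connection_usage_pct", 85.0, "dba-oncall"),
    ]
    panels = ["slow_queries_per_sec", "connection_usage_pct", "replication_lag_s"]
    print("blind spots:", blind_spots(panels, rules))
    print("firing:", firing(rules, {"slow_queries_per_sec": 21.0}))
```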

The thesis: a DBA's judgment can be encoded, and the cloud's control plane can act on it. The roadmap pairs Claude as the brain — anomaly detection, RCA reasoning, index recommendations — with Alibaba RDS as the body that executes scaling, backups and schema changes.

Ten projects, sequenced from safest to boldest, including AI anomaly detection, index tuning with shadow validation (test the index against real traffic before it goes live), predictive scaling, automated backup/restore drills, schema-change CI gates, and auto-RCA. It started as a doc and is now turning into running code (see $ now).
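The shadow-validation idea comes down to a decision gate: replay the same queries with and without the candidate index, then promote the index only if it helps overall and hurts nothing. Here is a minimal sketch of that gate, assuming latencies have already been collected per query. The 20% minimum gain and 10% regression tolerance are arbitrary illustrative values, and `shadow_verdict` is a hypothetical name.

```python
from statistics import median
from typing import Dict, List


def shadow_verdict(
    baseline_ms: Dict[str, List[float]],
    candidate_ms: Dict[str, List[float]],
    min_gain: float = 0.20,        # assumed: require 20% less total median time
    max_regression: float = 0.10,  # assumed: no query may get >10% slower
) -> dict:
    """Compare per-query median latency with and without the candidate index."""
    regressions = []
    total_base = 0.0
    total_cand = 0.0
    for query, base_samples in baseline_ms.items():
        base = median(base_samples)
        cand = median(candidate_ms[query])
        total_base += base
        total_cand += cand
        if cand > base * (1 + max_regression):
            regressions.append(query)
    gain = 1 - total_cand / total_base
    return {
        "accept": not regressions and gain >= min_gain,
        "gain": gain,
        "regressions": regressions,
    }


if __name__ == "__main__":
    before = {"q1": [100, 110, 90], "q2": [10, 10, 10]}
    after = {"q1": [20, 25, 15], "q2": [10, 11, 10]}
    print(shadow_verdict(before, after))
```

Medians are used instead of means so a single outlier run does not decide the verdict.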

Auto-RCA agent: pulls the signals from Grafana, walks a 5-Whys analysis, and writes the draft RCA straight into Confluence via MCP, so the post-incident doc exists before the adrenaline wears off.

Slow-query monitor: a Bash watcher that catches slow queries, hands them to Claude for an explanation + fix suggestion, and drops the whole analysis into Google Chat.

Daily standups: auto-generated from Jira via MCP, so the status update writes itself.
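The real slow-query monitor is a Bash watcher. The sketch below only illustrates its middle step in Python: turning one slow-log header into structured stats and a prompt. It assumes the MySQL slow query log header format. The threshold and the prompt wording are invented for the example, and the calls to Claude and Google Chat are left out.

```python
import re
from typing import Optional

# Matches the statistics header MySQL writes above each slow-log statement.
_HEADER = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)\s+"
    r"Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)


def parse_header(line: str) -> Optional[dict]:
    """Return the header's numbers as a dict, or None for any other line."""
    m = _HEADER.search(line)
    if m is None:
        return None
    return {
        "query_time": float(m["query_time"]),
        "lock_time": float(m["lock_time"]),
        "rows_sent": int(m["rows_sent"]),
        "rows_examined": int(m["rows_examined"]),
    }


def build_prompt(stats: dict, sql: str, threshold_s: float = 1.0) -> Optional[str]:
    """Build the text handed to the model; skip queries under the threshold."""
    if stats["query_time"] < threshold_s:
        return None
    return (
        f"Slow query: {stats['query_time']:.2f}s, "
        f"{stats['rows_examined']} rows examined for {stats['rows_sent']} sent.\n"
        f"Explain the likely cause and suggest a fix:\n{sql.strip()}"
    )


if __name__ == "__main__":
    header = "# Query_time: 23.004512  Lock_time: 0.000120 Rows_sent: 10  Rows_examined: 5400000"
    stats = parse_header(header)
    if stats is not None:
        # The table and column names are hypothetical.
        print(build_prompt(stats, "SELECT * FROM events WHERE account_id = 7;"))
```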

A lock-free schema change on a 100GB+ table tripped a known-nasty MySQL virtual generated-column index bug, and the instance started looping through failovers. The safe path: drop the offending index to stop the bleeding, rebuild the table cleanly via DTS, then swap it in with an atomic rename — all inside the agreed maintenance window so no one downstream felt it.
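The final swap step can be sketched as a single multi-table `RENAME TABLE`, which MySQL performs atomically, so readers see either the old table or the rebuilt one and never a gap. The helper below only builds that statement. The table names and the `_old` suffix are hypothetical, and the identifier check is intentionally strict because the names are interpolated into SQL.

```python
import re

_IDENT = re.compile(r"[A-Za-z0-9_]+")


def build_swap(live: str, rebuilt: str, suffix: str = "_old") -> str:
    """Build one RENAME TABLE that retires `live` and promotes `rebuilt`."""
    for name in (live, rebuilt, live + suffix):
        if _IDENT.fullmatch(name) is None:
            raise ValueError(f"unsafe table name: {name!r}")
    # Both renames go in ONE statement so the swap happens as a unit.
    return (
        f"RENAME TABLE `{live}` TO `{live}{suffix}`, "
        f"`{rebuilt}` TO `{live}`"
    )


if __name__ == "__main__":
    print(build_swap("orders", "orders_rebuilt"))
```

Keeping the old table under a suffixed name, rather than dropping it, leaves a quick rollback path: run the same swap in reverse.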

Same week, on the other engine: a MongoDB COLLSCAN on ApsaraDB tracked down and indexed. Different database, same instinct — find the missing index, stop the scan.

Let's talk databases.

Hiring, a gnarly query that won't behave, or just want to compare automation notes — I'm reachable.

Stuff I've built & fixed

Resolved Critical PostgreSQL Production Incidents & Stabilized Memory

Executed Zero-Downtime Schema Migrations on 1TB+ Databases

Optimized Query Performance & Reduced Cloud Database Costs by ~40–55%

Audited Fleet Observability & Designed Comprehensive Alerting Specifications

Architected an AI-Assisted Database Automation Roadmap

Developed Automated RCA & Incident Triage Bots using AI & Bash

Mitigated High-Impact Database Failover Loops During Maintenance
