fleet status: green · Jakarta, ID 🇮🇩

Luthfi Farabi

Database engineer keeping a production cloud fleet alive, and building the tools so it mostly keeps itself alive.

Solo DBA running a multi-database Alibaba Cloud fleet (MySQL · PostgreSQL · MongoDB), shipping AI-assisted automation so incidents fix themselves faster than I can type the RCA.

1TB · <1s: online DDL
95% → stable: prod fire, out
~50%: db bill cut
robots > toil: it runs itself

$ cat about.md

I'm the person who gets paged when the database is on fire, and the person who, a week later, has built something so it never catches fire the same way twice.

I run a multi-database fleet on Alibaba Cloud solo: MySQL, PostgreSQL, MongoDB. Day to day that's index and execution-plan surgery, lock-free schema changes on terabyte tables, connection-pool sanity, and HA/DR that actually fails over when it has to.

The other half of the job is making the job smaller. I wire Claude + MCP into my workflows (auto-RCA, slow-query analysis, predictive scaling) so the boring, repeatable parts handle themselves.

Hands-on. Ships in production. Automates the boring parts. Writes the runbook after.

$ ls ~/toolbox

The kit I reach for

No skill bars. Just the tools that have actually been in production with me.

▚ Databases

MySQL · PostgreSQL · MongoDB · MariaDB · Oracle · Sybase · DB2 · Cassandra · Solr

☁ Alibaba Cloud

RDS · ApsaraDB · DAS · DTS · DMS

⚡ Performance

online DDL (INSTANT/INPLACE) · index surgery · execution-plan surgery · PgBouncer · work_mem tuning

⟲ Reliability

replication · HA / DR · failover · backup / restore

▤ Observability

Grafana · Prometheus · Mimir

🖥 OS & Infrastructure

GNU/Linux · AIX · Windows · Networking

⚙ Automation & Code

Bash · Python · Java · Claude + MCP · Jira / Confluence via MCP

◇ Process

5-Whys RCA · runbooks · shadow validation

$ cat history.log

Work History

The production environments I've managed and evolved over the years.

luthfi@fleet: ~/history
luthfi@fleet:~$ cat experience.txt
  ▸ Senior DBA 1 @ PT Mid Solusi Nusantara (Mekari), March 2025 – Present. Installation, configuration, and monitoring of ALIYUN services (RDS, DAS, DTS). Logical database design refinement, query performance tuning, cost optimization, and database migration.
  ▸ Data Engineering and Architect Lead @ PT KOLTIVA, July 2024 – March 2025. Managed a team of data engineers and DBAs. Monitored servers, databases, and APIs with Datadog, Grafana, and AWS CloudWatch. Security assessment, DB versioning, and cost optimization.
  ▸ Database Engineer Lead & DBA @ PT Investree Radhika Jaya, April 2019 – February 2024. Installation, configuration, and monitoring of MySQL, MariaDB, and Elasticsearch on ALIYUN. Handled DB tuning, backup/recovery plans, security audits, and automated cron jobs.
  ▸ System Engineer & Staff DBA @ rumah123.com (REA Group Asia), February 2016 – March 2019. Configured AWS services (EC2, RDS, S3). Monitored via New Relic. Docker container creation, Cassandra and Solr setups, backup procedures, and automated Bash scripts for alerting.
  ▸ TQA / Technical Quality Assurance @ PT Adidata (project for PT Bank Mandiri), April 2015 – January 2016. Conducted and monitored product testing, investigated user complaints, and prepared quality reports.
  ▸ Staff Database Administrator @ PT Collega Inti Pratama, July 2013 – December 2014. Installed and configured Linux/AIX servers and Sybase and DB2 database servers. Supported data center operations.
  ▸ IT Network @ PT Bramanty Adhikari Tibra Syandana, February 2010 – April 2010. Installed computer networks at Gatot Subroto Hospital, Jakarta.
luthfi@fleet:~$ cat education.txt
  ▸ Budi Luhur University, 2009 – 2013. Faculty of Computer Study, bachelor's degree, majoring in Computer Science.
  ▸ SMAN 5 Tangerang, 2006 – 2009. Senior high school.

$ git log --oneline --stat

Stuff I've built & fixed

Real incidents and real projects. The mess → what I did → the win.

The page said "high memory"; the cause was three things stacked. No connection pooler in front of a Puma app meant 498 idle connections each holding memory hostage. Long-running statements created lock contention. And a couple of hot queries were doing 23-second sequential scans on tables missing the right index.

The fix went in as a layered stack: PgBouncer in transaction-pooling mode to collapse the connection count, server-side idle timeouts to reap the stragglers, work_mem right-sized so sorts stopped spilling, and a targeted index plan to kill the seq-scans. Then I wrote the 5-Whys RCA in Confluence so the root cause, not just the symptom, is on record.
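As a rough illustration of the first two layers, here is a minimal Python sketch that renders a PgBouncer config with transaction pooling and builds a server-side idle-in-transaction guard. The setting names (`pool_mode`, `default_pool_size`, `max_client_conn`, `server_idle_timeout`, `idle_in_transaction_session_timeout`) are real PgBouncer and PostgreSQL settings. The function names, host, role, and every numeric value are assumptions for illustration, not the values used in the incident.

```python
# Sketch only: all names and numbers passed in are illustrative assumptions,
# not the production settings from the incident.

def render_pgbouncer_ini(db_name: str, db_host: str, pool_size: int,
                         max_clients: int, idle_timeout_s: int) -> str:
    """Render a minimal pgbouncer.ini that uses transaction pooling."""
    lines = [
        "[databases]",
        f"{db_name} = host={db_host} port=5432 dbname={db_name}",
        "",
        "[pgbouncer]",
        # One server connection is shared per transaction, not per client.
        "pool_mode = transaction",
        f"default_pool_size = {pool_size}",
        f"max_client_conn = {max_clients}",
        # Close pooled server connections that stay unused this long (seconds).
        f"server_idle_timeout = {idle_timeout_s}",
    ]
    return "\n".join(lines) + "\n"


def server_timeout_sql(role: str, idle_in_txn_ms: int) -> str:
    """Server-side guard: end sessions that sit idle inside a transaction."""
    return (f"ALTER ROLE {role} SET "
            f"idle_in_transaction_session_timeout = {idle_in_txn_ms};")


if __name__ == "__main__":
    print(render_pgbouncer_ini("app", "10.0.0.5", 20, 500, 60))
    print(server_timeout_sql("app_rw", 60000))
```

The point of the shape: many client connections (`max_client_conn`) fan in to a small number of real server connections (`default_pool_size`), which is what collapses the idle-connection count on the database side.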

INSTANT DDL is magic right up until you hit the limits nobody mentions: the instant-add column budget, and the metadata lock that can still stall behind a long transaction. So before touching prod I verified the INSTANT headroom on the table and pre-flighted the metadata locks to be sure the ALTER wouldn't queue behind anything.

Result: the columns landed with no table rewrite and no downtime on a 1TB / 543M-row table, in under a second. The service integration shipped on schedule.
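A pre-flight like the one described above could be sketched roughly as follows. This assumes MySQL 8.0.29 or later, where `information_schema.INNODB_TABLES` exposes `TOTAL_ROW_VERSIONS`. The row-version limit is a parameter because it differs between MySQL releases (64 is used here as an assumed default). The helper names, the age threshold, and the example table and column are hypothetical.

```python
# Sketch under assumptions: MySQL 8.0.29+. The default limit of 64 row
# versions is an assumption; check the value for your MySQL release.

# Queries to run by hand or through any MySQL client library.
ROW_VERSIONS_SQL = (
    "SELECT NAME, TOTAL_ROW_VERSIONS "
    "FROM information_schema.INNODB_TABLES WHERE NAME = %s"  # 'schema/table'
)
OPEN_TRX_SQL = (
    "SELECT trx_mysql_thread_id, "
    "TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s "
    "FROM information_schema.INNODB_TRX ORDER BY trx_started"
)


def instant_headroom(total_row_versions: int, limit: int = 64) -> int:
    """How many more INSTANT add/drop operations the table can absorb."""
    return max(limit - total_row_versions, 0)


def blockers(trx_rows, max_age_s: int):
    """Thread ids of transactions open at least max_age_s seconds.

    trx_rows is an iterable of (thread_id, age_seconds) tuples, i.e. the
    rows returned by OPEN_TRX_SQL.
    """
    return [thread_id for thread_id, age_s in trx_rows if age_s >= max_age_s]


def safe_to_alter(total_row_versions: int, trx_rows,
                  max_age_s: int = 5, limit: int = 64) -> bool:
    """True when there is INSTANT headroom and no long-open transaction."""
    return (instant_headroom(total_row_versions, limit) > 0
            and not blockers(trx_rows, max_age_s))


def instant_add_column(table: str, column_ddl: str, lock_wait_s: int = 5):
    """Statements for the change itself; fail fast instead of queueing."""
    return [
        f"SET SESSION lock_wait_timeout = {lock_wait_s}",
        f"ALTER TABLE {table} ADD COLUMN {column_ddl}, ALGORITHM=INSTANT",
    ]
```

Requesting `ALGORITHM=INSTANT` explicitly means the statement errors out rather than silently falling back to a table rebuild, and a short `lock_wait_timeout` keeps the ALTER from sitting in the metadata-lock queue and blocking everything behind it. The age filter is only a heuristic: even a young open transaction on the table holds a metadata lock.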

The instinct on a CPU-pinned cluster is to scale up. I scaled down, after finding why it was hot. 843,000 slow queries in 27 hours and 94–98% CPU weren't a capacity problem; they were a query problem. The worst offender was a LATERAL JOIN doing a filesort on every execution; the right composite index removed the sort entirely.

On top of that, over 99% of connections were idle: bloat, not load. With the queries fixed and the connection story cleaned up, the 3× 32 vCPU / 128 GB footprint was wild overkill. Right-sizing the boxes took 40–55% off the bill.
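The composite-index effect from the first half of this story can be reproduced in miniature. The sketch below uses SQLite from Python's standard library purely as a stand-in, not the engine from the incident, and the table, column, and index names are invented. It only demonstrates the general principle: an index on (filter column, sort column) lets the engine read rows already in order, so the per-execution sort step disappears from the plan.

```python
import sqlite3

# Stand-in demo on SQLite (an assumption for portability); the table,
# columns, and index name are hypothetical.

QUERY = ("SELECT id FROM events WHERE account_id = 7 "
         "ORDER BY created_at DESC LIMIT 1")


def plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def uses_sort(plan_rows):
    """SQLite reports an explicit sort step as a temp B-tree."""
    return any("TEMP B-TREE" in detail for detail in plan_rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, account_id INTEGER, created_at TEXT)"
    )
    before = plan(conn, QUERY)  # full scan plus a sort
    # Composite index: equality column first, then the ORDER BY column.
    conn.execute(
        "CREATE INDEX idx_events_account_created "
        "ON events (account_id, created_at)"
    )
    after = plan(conn, QUERY)   # index search, no separate sort
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("sort before index:", uses_sort(before))
    print("sort after index: ", uses_sort(after))
```

Column order matters: the equality filter goes first so the matching rows are contiguous in the index, and the sort column second so they come out already ordered.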

You can't alert on what you've never looked at. I read all 33 panels across the 19-host fleet and wrote down what each one was actually telling me. That surfaced the host quietly running 13,000 QPS with 21 slow queries/sec, another near connection saturation, and 17 blind spots: signals we were graphing but not alerting on.

The output wasn't a complaint, it was a spec: the exact alerts, thresholds and owners needed to turn those blind spots into pages before they turn into incidents.
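One way to picture "blind spots" and the resulting spec is as plain data: a list of graphed panels, a list of alert rules, and the difference between them. The sketch below is an assumed shape, not the actual spec. The `AlertRule` fields, panel names, thresholds, and owners are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical shape for an alert spec; field names and sample values
# are illustrative assumptions.


@dataclass(frozen=True)
class AlertRule:
    panel: str        # dashboard panel the rule watches
    condition: str    # comparison applied to the panel's signal
    threshold: float  # value that triggers the page
    owner: str        # who gets paged


def blind_spots(panels, rules):
    """Panels that are graphed but have no alert rule attached."""
    covered = {rule.panel for rule in rules}
    return sorted(panel for panel in panels if panel not in covered)


if __name__ == "__main__":
    panels = ["connections_used_pct", "slow_queries_per_sec", "replica_lag_s"]
    rules = [AlertRule("connections_used_pct", ">", 85.0, "dba-oncall")]
    print(blind_spots(panels, rules))
```

Each entry that `blind_spots` returns is a line item for the spec: it needs a condition, a threshold, and an owner before it can page anyone.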

The thesis: a DBA's judgment can be encoded, and the cloud's control plane can act on it. The roadmap pairs Claude as the brain (anomaly detection, RCA reasoning, index recommendations) with Alibaba RDS as the body that executes scaling, backups and schema changes.

Ten projects, sequenced from safest to boldest, including AI anomaly detection, index tuning with shadow validation (test the index against real traffic before it goes live), predictive scaling, automated backup/restore drills, schema-change CI gates, and auto-RCA. Doc first, now turning into running code (see $ ./now).

Auto-RCA agent: pulls the signals from Grafana, walks a 5-Whys analysis, and writes the draft RCA straight into Confluence via MCP, so the post-incident doc exists before the adrenaline wears off.

Slow-query monitor: a Bash watcher that catches slow queries, hands them to Claude for an explanation + fix suggestion, and drops the whole analysis into Google Chat.

Daily standups: auto-generated from Jira via MCP, so the status update writes itself.
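The "catch" step of the slow-query monitor described above might look something like this. It is a Python sketch rather than the Bash watcher itself, and it assumes the standard MySQL slow-log text layout (a `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` header followed by the statement). Handing the result to Claude and posting to Google Chat are deliberately left out; the function name and threshold are hypothetical.

```python
import re

# Sketch assuming the standard MySQL slow-log text format. The hand-off
# to an LLM and to chat is out of scope here.

HEADER = re.compile(
    r"^# Query_time: (?P<query_time>\d+\.\d+)\s+Lock_time: \d+\.\d+\s+"
    r"Rows_sent: \d+\s+Rows_examined: (?P<rows_examined>\d+)"
)
SKIP_PREFIXES = ("SET timestamp=", "use ")


def slow_entries(log_text: str, min_seconds: float):
    """Parse slow-log text and keep entries at or above min_seconds."""
    entries = []
    current = None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {
                "query_time": float(match["query_time"]),
                "rows_examined": int(match["rows_examined"]),
                "sql": "",
            }
            entries.append(current)
        elif line.startswith("#"):
            # Another header line (# Time, # User@Host): next entry begins.
            current = None
        elif current is not None and not line.startswith(SKIP_PREFIXES):
            # Statement text, possibly spread over several lines.
            current["sql"] = (current["sql"] + " " + line.strip()).strip()
    return [e for e in entries if e["query_time"] >= min_seconds]
```

A high `rows_examined` next to a small result set is usually the most useful field to pass along with the SQL, since it points straight at a missing or unused index.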

A lock-free schema change on a 100GB+ table tripped a known-nasty MySQL virtual generated-column index bug, and the instance started looping through failovers. The safe path: drop the offending index to stop the bleeding, rebuild the table cleanly via DTS, then swap it in with an atomic rename, all inside the agreed maintenance window so no one downstream felt it.
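The final swap step relies on MySQL's multi-table `RENAME TABLE`, which renames all listed tables as one atomic operation. A minimal sketch of building that statement follows; the function name, the `_old` suffix, and the example table names are assumptions.

```python
# Sketch: table names and the "_old" suffix are hypothetical.


def swap_statement(live: str, rebuilt: str, retired_suffix: str = "_old") -> str:
    """Build one RENAME TABLE that retires `live` and promotes `rebuilt`."""
    retired = f"{live}{retired_suffix}"
    # Both renames happen in a single statement, so clients never see a
    # moment where the live table name is missing.
    return f"RENAME TABLE `{live}` TO `{retired}`, `{rebuilt}` TO `{live}`"


if __name__ == "__main__":
    print(swap_statement("invoices", "invoices_rebuilt"))
```

Keeping the retired table around under its new name gives a cheap rollback path until the rebuilt table has proven itself.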

Same week, on the other engine: a MongoDB COLLSCAN on ApsaraDB tracked down and indexed. Different database, same instinct: find the missing index, stop the scan.
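Spotting a COLLSCAN can be automated by walking the output of `explain()`. The sketch below assumes the classic explain shape, where `queryPlanner.winningPlan` is a tree of stages linked by `inputStage` or `inputStages`; newer MongoDB versions may nest the plan differently, so treat the lookup path as an assumption. The function names are hypothetical and the input is just a Python dict, so no driver is needed.

```python
# Sketch assuming the classic explain() output shape; newer server
# versions may nest the winning plan differently.


def stages(plan: dict):
    """Yield every stage name in a plan tree, depth first."""
    yield plan.get("stage")
    child = plan.get("inputStage")
    if child:
        yield from stages(child)
    for child in plan.get("inputStages", []):
        yield from stages(child)


def has_collscan(explain: dict) -> bool:
    """True when the winning plan reads the whole collection."""
    winning = explain["queryPlanner"]["winningPlan"]
    return "COLLSCAN" in stages(winning)


if __name__ == "__main__":
    unindexed = {"queryPlanner": {"winningPlan": {
        "stage": "SORT", "inputStage": {"stage": "COLLSCAN"}}}}
    indexed = {"queryPlanner": {"winningPlan": {
        "stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}}}
    print(has_collscan(unindexed), has_collscan(indexed))
```

After adding the index, the same check should flip: the winning plan shows an `IXSCAN` feeding a `FETCH` instead of a `COLLSCAN`.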

$ ./now  // what I'm focused on right now

luthfi@fleet: ~/now
luthfi@fleet:~$ now
// last updated 02 Jun 2026 (this changes over time)
  ▸ Killing connection bloat for good. Rolling out PgBouncer transaction pooling and tuning MySQL connection pools (wait_timeout, stale-connection diagnosis) across the whole fleet.
  ▸ Roadmap → running code. Turning the AI + RDS automation roadmap from a doc into something live, starting with anomaly detection and the slow-query analyzer v2.
  ▸ Making the bots boringly reliable. Hardening the Auto-RCA + daily-standup pipeline so the team trusts it running unattended.
  ▸ Sharpening my Claude/MCP skill library. The toolkit I build all my DBA workflows on top of, always being refined.
● fleet status: green · inspired by nownownow.com

$ ./contact

// the fleet is quiet. I have a minute.

Let's talk databases.

Whether it's hiring, a gnarly query that won't behave, or just comparing automation notes, I'm reachable.
