Edward Oo

DevOps / Site Reliability Engineer (SRE)
📍Taipei, Taiwan

About Me

I'm Edward Oo, a passionate DevOps and Site Reliability Engineer based in Taipei, Taiwan.
I specialize in modernizing infrastructure, migrating legacy systems, and streamlining operations through CI/CD pipelines, with a focus on high availability, reliability, and performance.
My strengths include workflow automation, observability, and custom tool development, reflecting a commitment to operational excellence and innovation.
I excel at solving complex challenges and collaborating with cross-functional teams, including developers, operations, and project managers, to deliver resilient and efficient systems.

Projects

  • BlahDNS

    Adblock secure DNS resolver

    459

  • Stacks

    Languages
    Python, Bash, JavaScript, YAML
    Monitoring
    AWS CloudWatch, Grafana, Prometheus
    Clouds
    Amazon Web Services (AWS)
    DevOps
    GitLab, GitHub Actions, AWS CloudFormation, Terraform, AWS CDK, SonarCloud, PagerDuty, Zapier
    Tools
    Cloudflare WAF, Cloudflare Workers

    Work Experience

    • KKCompany, KKStream (BlendVision) Senior Site Reliability Engineer (SRE) | Full-time

      Sep 2022 - Present

      • DevOps to strengthen service reliability
        • Revamped CI/CD flows to enable seamless deployments during large-scale migrations, minimizing downtime and disruptions.
        • Migrated legacy Chef cookbooks to Terraform, increasing team productivity by 30%.
        • Upgraded legacy PHP 5.4 and Ubuntu 14.04/16.04/18.04 systems, reducing P0 alarms related to outdated systems and dependencies by 50%.
        • Built and maintained Golden Images, standardizing environments and accelerating CI environment setup.
        • Migrated the Postfix mail server to Amazon SES, improving email reliability and scalability.
        • Transitioned from Logstash to Fluent Bit, aligning development and operations logging workflows for improved observability.
        • Developed and maintained Slack notification tools, enhancing operational visibility and incident communication.
        • Automated weekly and monthly service latency and SLA reports, reducing manual overhead.
        • Implemented Akamai CDN usage monitoring, optimizing content delivery performance.
        • Migrated legacy CloudFormation stacks to Terraform, streamlining infrastructure management.
        • Introduced a Maintenance Mode feature to prevent unexpected interruptions during deployments and upgrades.
      • Enhanced System Reliability and Migration Success
        • Transitioned from Classic Load Balancer (CLB) to Application Load Balancer (ALB), increasing service resilience and observability.
        • Collaborated with backend teams to migrate OpsWorks EC2 stacks to ECS Fargate, reducing false alarms by 20% and cutting P0 incident recovery time by 50%.
        • Refactored SLA reporting workflows with Lambda and CloudWatch, reducing manual effort and improving reporting accuracy.
      • Infrastructure Optimization
        • Upgraded production MySQL clusters from version 5.x to 8 LTS, enhancing performance and security.
        • Migrated Redis clusters from Redis 4 to 6, doubling IOPS through hardware upgrades and tuning.
        • Migrated infrastructure management from CloudFormation to Terraform, accelerating infrastructure deployment cycles.
        • Built Slack pre-warm workflows for ECS Fargate services and RDS clusters, reducing cold start times and improving availability.
      • Observability
        • Gradually rolled out OpenTelemetry (OTel) across all services, improving system observability.
        • Deployed node-level and pod-level observability for video encoder jobs, enhancing debugging and performance monitoring.
        • Implemented S3 object tagging and alarm monitoring for better resource management and cost tracking.
        • Managed a Prometheus server, exporters, and Amazon Managed Service for Prometheus for metrics storage, with Alertmanager for central alerting and Grafana and CloudWatch for visualization.
    • CoolbitX DevOps / Site Reliability Engineer | Full-time

      Dec 2019 - Sep 2022

      • Logging, Observability and Monitoring
        • Developed in-house tools, including golden images, Slack bots, a changelog generator, semantic release, AWS CDK templates, linters, and custom resources, streamlining DevOps processes.
        • Managed Prometheus, Grafana, and Loki monitoring stacks with Terraform for system insights and issue detection on Google Cloud Platform (GKE).
      • Performance and Security Improvements
        • Optimized the China site's browsing experience by doubling its speed through networking enhancements.
        • Deployed Cloudflare WAF and DDoS mitigation for robust API and site security.
      • Infrastructure and Deployment Advancements
        • Designed and managed architectures using AWS CDK and Terraform, including ECS, RDS, Lambda, API Gateway, DynamoDB, and CI/CD pipelines.
        • Ensured High Availability (HA) and Disaster Recovery (DR) with Multi-AZ deployments.
      • DevSecOps and Automation
        • Integrated DevSecOps practices into workflows, automating vulnerability scans and configuration checks.
        • Built a Slack bot with AWS Lambda for pre-deployment checks, improving production readiness.

    Education

    • Master's degree, Interactive Media Design, National Taipei University of Technology (NTUT) | GPA 3.75

      Sep 2015 - Jun 2019

      Communication Design
      Human-Computer Interaction
      Computable stage lighting performance system built with Max/MSP and the Myo armband
    • University of Applied Sciences Potsdam, Germany

      2016 - 2017 | Exchange semester

      Interface Design
      Human-Computer Interaction (HCI)
    • Information Communication, Bachelor of Science, MingDao University, Taiwan

      Sep 2011 - Jun 2015

      President of the inline skate club
      Vice president of E-learning volunteer
      Class leader for 4 years

    Languages

    Chinese (Native speaker)
    English (Fluent)

    Last updated on April 22, 2025