Brandon Kauffman

Linux Administration

RHCSA and Linux+ certified. I have managed 300+ RHEL servers, implementing Satellite and IAM and automating deployments with Ansible. I have used tools and methods from Brendan Gregg's Systems Performance to improve performance. In one case, I increased IO throughput by 20% by adjusting mount parameters.

Kubernetes

CKA and CCA certified. I have built out on-prem and EKS clusters with Kubernetes and Cilium. Cilium with egress gateway and BGP allowed us to apply the most secure network policies for on-prem environments. In EKS we were able to connect separated clusters. I have experience creating clusters with K3s, Terraform, and Kubespray. I have also worked with Talos for non-production environments.

I have implemented monitoring for Kubernetes clusters nearing 1000 projects. I used eBPF for network tracing, Elasticsearch for APM/Logging, and Prometheus with Grafana for metrics and visualizations. I have implemented Datadog and Vector in EKS environments. Vector was used to reduce metric and logging costs by 80%.

Rust

I have built web servers, front ends, BPF programs, and other tools using Rust. It is my programming language of choice for LeetCode and new projects. I often use Tauri to create desktop applications that speed up my workflow. I've contributed to several open source Rust projects.

Go

I have led a team of two other developers in creating an internal platform for alert management. Over the course of a year, the project reduced MTTA by 20% and MTTR by 80%. I have also used Go to create many internal Prometheus exporters. I've contributed to several open source Go projects.

Python

I have built and maintained a variety of internal Python packages for a team of 12 others. I've done data analysis on Oracle and Microsoft SQL Server with Python. I have built and contributed to Django applications with GraphQL and REST. I've added OTEL instrumentation to an OSS project and modified open source libraries.

Site Reliability Engineering

I have SRE experience with a variety of monitoring products. I have maintained Elasticsearch, Grafana, Prometheus, OTEL with ClickHouse, and Thanos. I have used those products, along with Datadog and CloudWatch, to implement SLOs. By using these tools to guide deployment strategies, I increased uptime from ~99.997% to ~99.99999% for crucial services. I have worked with projects in Golang, Python, Java, and PHP to implement RUM, APM, profiling, and distributed tracing. I've assisted developers in debugging applications to reduce average response time by ~20%.

I've used SNMP to monitor the network and physical appliances that support infrastructure, and created dashboards and reports to improve capacity planning.

Kafka

I maintained a production Kafka cluster with over 250 topics on-prem. The largest topic ingested 100,000 events per second. After switching to Redpanda, this topic was able to sustain 10,000,000 events per second once a failed consumer had been restored.

AWS

In AWS I worked on becoming SOC 2 compliant. I used tools like Prowler to examine the environment. By switching instances to ECS and later EKS, I was able to reduce costs by 10%. I created snapshot policies to further reduce costs while ensuring a stable environment.

DevOps

I have managed CD pipelines with a variety of tools. These pipelines covered mobile applications, Terraform, Ansible, Flux, and more. I was able to reduce CI/CD runtime by 60% by using best practices. I also automated alert resolutions using Ansible. I created a variety of Terraform modules to deploy our services with IaC.

I managed a variety of databases such as Postgres, MongoDB, and SurrealDB. I set up regular backups and replication, and managed several major version upgrades. In Postgres I tuned parameters to optimize the database for dev, staging, and prod workloads.