Machine Learning

A list of interests I wish to pursue in my career:

✦ Distributed Training Infrastructure That Actually Scales ✦

I want to work on scaling transformer-based models, say from 7B to 10T+ parameters, by building every core parallelism strategy from scratch and truly understanding what's happening under the hood. This means implementing data, tensor, pipeline, and context parallelism, rewriting core collectives for fault tolerance with RDMA and GPUDirect so training continues even when nodes fail, and designing systems that can train on 100K+ GPUs across multiple datacenters while maintaining near-perfect utilization. The challenge of debugging training instability, working with mixture-of-experts architectures, and mastering low-precision compute (FP8, FP4) makes this space endlessly compelling.
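As a small illustration of the simplest of these strategies, here is a minimal sketch of hand-rolled data parallelism, assuming a `torchrun`-style launch so `torch.distributed` can initialize from the environment. It is a toy, not how `DistributedDataParallel` does it (which overlaps the collectives with backward), but it shows the core idea: identical replicas, sharded batches, averaged gradients.

```python
# Minimal sketch of data parallelism: every rank runs the same model on its
# own shard of the batch, then gradients are averaged with an all-reduce
# before the optimizer step. Assumes launch via `torchrun` so that
# torch.distributed can read rank/world size from the environment.
import torch
import torch.distributed as dist
import torch.nn as nn

def train_step(model: nn.Module, optimizer, batch, targets, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model(batch), targets)
    loss.backward()
    # Average gradients across ranks; this is the heart of data parallelism
    # (DistributedDataParallel overlaps these collectives with backward).
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
    torch.manual_seed(0)
    model = nn.Linear(32, 4)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(8, 32), torch.randint(0, 4, (8,))
    train_step(model, opt, x, y, nn.CrossEntropyLoss())
    dist.destroy_process_group()
```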

✦ GPU Kernel Engineering & Hardware-Accelerated ML ✦

I'm obsessed with writing high-performance CUDA kernels that push GPUs to their theoretical limits. There's something magical about reverse-engineering FlashAttention implementations and writing custom matmul kernels that hit 102-105% of cuBLAS performance on H100 BF16 operations. I want to master the art of deep kernel work using CUDA, Triton, CuTe DSL, and CUTLASS: profiling with Nsight Systems, debugging SM occupancy issues, fusing operations, optimizing memory traffic patterns, and writing inline PTX assembly when necessary. Projects like ThunderKittens and building FlashAttention variants from scratch represent exactly the kind of work I want to pursue: devastatingly performant GPU code that pushes the boundaries of what's possible on Blackwell and Hopper architectures.
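To make "fusing operations" concrete, here is a toy Triton kernel, the kind of thing a deeper matmul or attention kernel grows out of. It fuses an add and a ReLU into a single pass over memory, which removes an intermediate global-memory round trip; this is only a sketch, not a tuned production kernel.

```python
# Toy fused elementwise kernel in Triton: out = relu(x + y) in one pass
# over memory instead of materializing the intermediate (x + y).
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements           # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)        # one program per 1024-element block
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK=1024)
    return out

if __name__ == "__main__":
    a = torch.randn(1 << 20, device="cuda")
    b = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(fused_add_relu(a, b), torch.relu(a + b))
```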

✦ Production Inference Systems at Ridiculous Scale ✦

I want to work where serving 20K+ QPS of LLM traffic in production is the norm, knowing that every millisecond of latency I shave off improves the experience for millions of users. I have already started contributing to ML inference engines like vLLM and SGLang. More specifically, I look forward to developing prefix-aware routing algorithms that improve cache hit rates, implementing speculative decoding, and building systems that achieve < 10 ms/token latency for 70B-class models.
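As a rough sketch of what prefix-aware routing means, the toy router below hashes a fixed-length prefix of the prompt so requests sharing the same leading tokens (for example, a common system prompt) are pinned to the same replica and keep hitting its prefix/KV cache. Production routers around vLLM and SGLang track cache state explicitly; the class and parameter names here are hypothetical.

```python
# Hypothetical sketch of prefix-aware routing: requests that share a prompt
# prefix are hashed to the same worker to improve prefix-cache hit rates.
import hashlib

class PrefixAffinityRouter:
    def __init__(self, workers: list[str], prefix_tokens: int = 64):
        self.workers = workers
        self.prefix_tokens = prefix_tokens

    def route(self, prompt_token_ids: list[int]) -> str:
        # Hash only the leading tokens so requests with a shared prefix
        # (e.g. the same system prompt) land on the same replica.
        prefix = str(prompt_token_ids[: self.prefix_tokens]).encode("utf-8")
        digest = int.from_bytes(hashlib.sha256(prefix).digest()[:8], "big")
        return self.workers[digest % len(self.workers)]

router = PrefixAffinityRouter(["gpu-0:8000", "gpu-1:8000", "gpu-2:8000"])
shared_system_prompt = list(range(64))
print(router.route(shared_system_prompt + [101, 102]))
print(router.route(shared_system_prompt + [205]))   # same worker as above
```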

The challenge of optimizing multimodal inference for image generation, video synthesis, and audio models excites me the most. I get to design high-throughput, low-latency delivery pipelines, maximize GPU utilization through clever batching and scheduling, and build monitoring tools that identify bottlenecks before they become production incidents. I want to work across the entire inference optimization stack: model compilation, quantization strategies (QAT, PTQ, int8, int4, NormalFloat), serving architectures, writing efficient Triton kernels, and implementing custom collective communication algorithms.
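For the quantization piece, here is the simplest possible PTQ sketch: symmetric per-tensor int8 quantization of a weight matrix, with dequantization to check the error. Real stacks (torchao, for instance) use per-channel scales, calibration data, and fused low-precision kernels; this is just the underlying arithmetic.

```python
# Minimal post-training-quantization sketch: symmetric per-tensor int8.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                        # map max |w| to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_int8(w)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```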

✦ Building Machine Learning Systems from First Principles ✦

I believe the best way to truly understand ML systems is to build them from scratch. I want to create minimal, full-stack training/inference pipelines, and to that end I aim to build an autograd engine, a mini-GPT from scratch, and LoRA, fine-tune models on real data, and, yes, hate CUDA at least once before learning to love it. I'm drawn to challenges like the NanoGPT speedrun, building full C/C++/CUDA implementations of multi-GPU training (ZeRO + FSDP) with quantized LLM training, writing custom CUTLASS kernels for quantized matrix multiplications, implementing distributed PyTorch trainers with FSDP and Tensor Parallelism, and creating LLM-powered applications that run multimodal inference locally.
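The autograd engine is a good example of what "from scratch" means here. A micrograd-style sketch: a scalar node that records its parents and a closure for its local gradient, then backpropagates in reverse topological order. Everything below is illustrative, not a library API.

```python
# Tiny scalar autograd engine in the spirit of micrograd.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad                 # d(a+b)/da = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad    # d(a*b)/da = b
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        order, seen = [], set()
        def topo(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

x, w, b = Value(2.0), Value(-3.0), Value(1.0)
y = x * w + b
y.backward()
print(x.grad, w.grad)   # -3.0, 2.0
```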

Ultimately, I'm drawn to work spanning the entire stack, from writing PTX assembly for tensor cores to designing API interfaces for fine-tuning language models. I excel at solving problems that require understanding Linux kernel internals, driver behavior, network programming, distributed consensus systems, microservices architectures, and ML framework internals. I want to contribute to the fullest, whether that's implementing stable versions of new algorithms proposed by researchers, building instrumentation to eliminate Python GIL contention, improving fine-tuning systems, profiling RL pipelines to find improvement opportunities, or building efficient and scalable distributed RLHF stacks. Challenges spread across the whole stack are what energize me.

✦ Democratizing AI Infrastructure & Working on Open Source ✦

As a proud supporter of the open-source and self-hosting movements, I believe the best way to accelerate AI progress is to build in public, share knowledge freely, and help others level up. I want to contribute to major projects like PyTorch, especially working on FSDP, building custom Python bytecode interpreters for graph capture without forcing users to rewrite code, rewriting C++ code for ABI compatibility, or enabling pdb debugging across distributed training jobs. I mostly hang out in GPU MODE (formerly CUDA MODE), learning how GPUs work from first principles and shipping real-world projects. I want to work on projects like torchao (PyTorch-native quantization) and make substantial PRs to projects like DeepSpeed, TorchTitan, and NeMo.

✦ Reinforcement Learning Systems & Post-Training Infrastructure ✦

Building and scaling distributed RL systems for model training is where the systems problems get uniquely hard, and I love it! I want to work on everything from reward modeling to policy optimization: designing elastic environment microservices, optimizing multi-node training, and implementing policy optimization methods like GRPO and PPO at scale. I want to build systems that handle orchestration, numerics, parallelism, weight transfer, transparent failure recovery, multi-tenant scheduling, and autoscaling, all while presenting researchers with a simple interface.
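One piece of GRPO is simple enough to sketch directly: instead of a learned critic, each prompt's sampled completions are scored and normalized against their own group's mean and standard deviation. A minimal version of that group-relative advantage, with assumed tensor shapes, looks like this:

```python
# Sketch of GRPO's group-relative advantage: normalize each completion's
# reward against its own prompt group's statistics (no learned value model).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: [num_prompts, group_size] scalar rewards per sampled completion
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.1, 0.9, 0.5, 0.5],
                        [1.0, 0.0, 0.0, 0.0]])
print(group_relative_advantages(rewards))
```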

✦ Hardware-Software Co-Design & ML Accelerators ✦

Working at the intersection of hardware design and ML systems is so far out of my comfort zone that I want to risk everything just to see if I can pull it off. I aim to understand chip architecture deeply enough to give hardware designers meaningful feedback on how model changes impact performance. This includes working on compiler infrastructure for ML accelerators, such as Triton, building systems that give expert users low-level control over the hardware, and understanding the trade-offs between different memory hierarchies, interconnect topologies, and compute primitives.