Introduction
Container security remains one of the most challenging aspects of modern cloud infrastructure. At Propel, we've implemented a sophisticated multi-tenant security architecture that leverages hardware-level isolation through Kata Containers and AWS Firecracker microVMs. This post examines the technical implementation details, security boundaries, and performance considerations of our approach.
Container Security: Beyond the Basics
The Fundamental Security Problem
Traditional container runtimes like Docker and containerd rely on Linux kernel primitives for isolation (see the sketch after this list):
- Linux namespaces: `pid`, `net`, `ipc`, `mnt`, `uts`, and `user` namespaces provide logical isolation
- Control groups (cgroups): Limit and account for resource usage
- Seccomp filters: Restrict syscall access to the kernel
- Capabilities: Provide fine-grained permission controls
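To make these primitives concrete, here is a minimal Go sketch that starts a shell inside fresh namespaces; it assumes a Linux host and root privileges, and is purely illustrative rather than part of any runtime.

```go
// Illustrative only: launch a shell in fresh namespaces (Linux, run as root).
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	// Each flag requests a new namespace: the child gets its own PID tree,
	// network stack, IPC objects, mount table, and hostname. All of these
	// boundaries are enforced by the one shared host kernel.
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET |
			syscall.CLONE_NEWIPC |
			syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```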
However, these mechanisms share a critical weakness: they all depend on the host kernel's security boundary. A kernel vulnerability can allow container escape through namespace manipulation.
Recent examples demonstrate this risk:
- CVE-2022-0185: A heap-based buffer overflow in the Linux kernel's filesystem layer that enabled container escape
- CVE-2022-0492: A high-severity vulnerability in the Linux kernel's cgroups `release_agent` handling that was exploited in real-world attacks
Attack Vectors in Traditional Containers
There are several critical attack vectors:
- Kernel Exploits: Vulnerabilities in the 40+ million lines of Linux kernel code
- Syscall Misconfigurations: Improperly configured seccomp profiles
- Side-Channel Attacks: Shared CPU cache timing attacks, Spectre/Meltdown
- Resource Exhaustion: CPU, memory, I/O starvation
- Container Runtime Vulnerabilities: Security issues in Docker, containerd, etc.
These risks are significantly magnified in multi-tenant environments where workloads from different customers run on the same infrastructure.
Kata Containers Architecture
Kata Containers solves these problems by implementing a fundamentally different architecture:
┌───────────────────┐ ┌───────────────────┐
│ Container Process │ │ Container Process │
├───────────────────┤ ├───────────────────┤
│ Guest Kernel │ │ Guest Kernel │
├───────────────────┤ ├───────────────────┤
│ Virtualized HW │ │ Virtualized HW │
├───────────────────┤ ├───────────────────┤
│ Firecracker │ │ Firecracker │
└───────────────────┘ └───────────────────┘
│ │
┌─────────┴─────────────────────┴─────────┐
│ Host Kernel │
└─────────────────────────────────────────┘
Key Components
- Guest Kernel: A minimal, hardened Linux kernel (currently 5.10 LTS)
- Agent: In-VM service that manages container lifecycle
- Runtime: OCI-compatible layer that interfaces with Kubernetes
- VMM: Firecracker hypervisor providing hardware virtualization
Security Boundaries
The security architecture implements multiple defense layers:
- Hardware Virtualization: Uses CPU virtualization extensions (Intel VT-x, AMD-V)
- Memory Isolation: Separate EPT (Extended Page Tables) for each VM
- I/O Virtualization: Virtualized network and storage devices
- Resource Limits: Fine-grained memory, CPU, and I/O controls
This multi-layered defense approach is important because, as demonstrated with vulnerabilities like CVE-2022-0492, multiple security controls can block exploitation even when vulnerabilities exist in one layer.
Firecracker: Technical Implementation
Firecracker is not just a standard VMM—it's a specialized microVM manager with significant security advantages:
Minimalist Design
- 50,000 lines of Rust code: Dramatically smaller attack surface compared to QEMU's 1.4+ million lines
- Single-process architecture: Each Firecracker process manages one VM
- No BIOS/UEFI: Direct kernel boot without legacy firmware components
- Minimal device model: Only essential virtualized devices
Technical Capabilities
```rust
// Firecracker's device model is minimal, with just a few essential devices
struct DeviceManager {
    // Block devices
    block: Vec<Arc<Mutex<Block>>>,
    // Network devices
    net: Vec<Arc<Mutex<Net>>>,
    // Virtual socket devices
    vsock: Option<Arc<Mutex<Vsock>>>,
    // Legacy devices (serial console, etc.)
    legacy: Option<Arc<Mutex<Legacy>>>,
}
```
Firecracker implements a small subset of virtio devices (a hypothetical configuration sketch follows the list):
- `virtio-block`: Block storage
- `virtio-net`: Network interface
- `virtio-vsock`: Host-guest communication
- `virtio-balloon`: Memory management
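To see how small this surface is in practice, here is an illustrative Go sketch that drives Firecracker's REST API over its per-VM Unix socket to size the machine, attach a virtio-block root drive, and boot. The socket, kernel, and rootfs paths are placeholders, not our production layout.

```go
// Illustrative only: configure and boot a microVM through Firecracker's
// HTTP API on its per-VM Unix socket. Socket, kernel, and rootfs paths
// are placeholders.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"strings"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Every API call is routed to the Unix socket, never TCP.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return net.Dial("unix", "/tmp/firecracker.socket")
			},
		},
	}
	put := func(path, body string) {
		req, _ := http.NewRequest(http.MethodPut, "http://localhost"+path, strings.NewReader(body))
		req.Header.Set("Content-Type", "application/json")
		resp, err := client.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Println(path, resp.Status)
	}
	put("/machine-config", `{"vcpu_count": 1, "mem_size_mib": 128}`)
	put("/boot-source", `{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"}`)
	// Attach the virtio-block root drive, then start the instance.
	put("/drives/rootfs", `{"drive_id": "rootfs", "path_on_host": "rootfs.ext4", "is_root_device": true, "is_read_only": false}`)
	put("/actions", `{"action_type": "InstanceStart"}`)
}
```

In production this wiring is handled by the Kata runtime; the point of the sketch is how little configuration a microVM needs compared to a general-purpose VMM.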
Performance Characteristics
Our benchmarks show Firecracker's exceptional performance:
| Metric | Firecracker | QEMU | Docker |
|---|---|---|---|
| Boot Time | ~125ms | ~700ms | ~50ms |
| Memory Overhead | ~5MB | ~15MB | ~2MB |
| CPU Overhead | ~2% | ~4% | ~1% |
| Network Latency | ~70μs | ~50μs | ~25μs |
| Storage Throughput | ~80% of host | ~85% of host | ~95% of host |
Integration with Kubernetes
Propel's architecture integrates Kata Containers with Kubernetes:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc
scheduling:
  nodeSelector:
    kata-runtime: "true"
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "kata"
    effect: "NoSchedule"
```
In Kubernetes environments, it's important to note that VM isolation applies at the pod level rather than at the individual container level. All containers within a pod share the same microVM, so isolation between them still relies on in-guest kernel primitives rather than hardware virtualization.
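Scheduling a workload onto this runtime is then a one-field change in the pod spec. Here is a minimal sketch using the `k8s.io/api` types; the pod name and image are placeholders:

```go
// Placeholder pod that opts into the kata-fc RuntimeClass defined above.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func kataPod() *corev1.Pod {
	runtimeClass := "kata-fc"
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-workload"},
		Spec: corev1.PodSpec{
			// One Firecracker microVM wraps every container in this pod.
			RuntimeClassName: &runtimeClass,
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "nginx:1.25", // placeholder image
			}},
		},
	}
}
```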
Custom Runtime Hook Implementation
We've implemented custom OCI runtime hooks for enhanced security:
```go
func (k *KataAgent) createContainer(ctx context.Context, sandbox *Sandbox, c *Container) error {
	// Apply security policies
	spec := c.GetPatchedOCISpec()

	// Add memory encryption
	if sandbox.config.HypervisorConfig.MemoryEncryption {
		// AMD SEV or Intel TDX configuration
		spec.Linux.SecurityContext.MemoryEncryption = true
	}

	// Apply seccomp profile
	spec.Linux.Seccomp = getSeccompProfile()

	// Configure networking
	netNS := sandbox.networkNamespace()

	// Create the container with the patched spec
	return k.sendReq(ctx, &grpc.CreateContainerRequest{
		ContainerId:      c.id,
		ExecId:           c.id,
		OCI:              spec,
		NetworkNamespace: netNS,
	})
}
```
A critical security consideration: while Docker enables protections like AppArmor and its default seccomp filter out of the box, Kubernetes runs workloads with seccomp unconfined unless a profile is explicitly set. Users must therefore re-enable this protection per workload, or cluster-wide via the kubelet's SeccompDefault feature, as sketched below.
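Restoring that protection is a one-field change. A minimal sketch of the pod-level securityContext, again using the `k8s.io/api` types:

```go
// Pod-level securityContext that restores RuntimeDefault seccomp filtering.
package example

import corev1 "k8s.io/api/core/v1"

// defaultSeccomp returns the securityContext to set on a PodSpec;
// without it, Kubernetes leaves the workload seccomp-unconfined.
func defaultSeccomp() *corev1.PodSecurityContext {
	return &corev1.PodSecurityContext{
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
}
```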
Security Hardening Measures
Our production deployment includes these hardening measures:
- Mandatory Kernel Command Line Parameters:

  ```
  kernel_params = "console=hvc0 rootflags=data=ordered rootfstype=ext4 systemd.unified_cgroup_hierarchy=1 systemd.journald.forward_to_console=1 apparmor=1 security=apparmor page_poison=1 slub_debug=P vsyscall=none debugfs=off oops=panic nmi_watchdog=0 panic=10 lockdown=confidentiality"
  ```

- Seccomp BPF Filters: Restrict available syscalls to the minimum required (see the sketch after this list)
- Resource Limits:

  ```yaml
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 1
      memory: 2Gi
  ```

- SELinux/AppArmor Profiles: Mandatory access control for host protection
- Memory Encryption: Support for AMD SEV (Secure Encrypted Virtualization)
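As an illustration of the seccomp item above, here is a simplified Go sketch of a deny-by-default filter built with `github.com/seccomp/libseccomp-golang`; the allowlist is a toy, not our production profile.

```go
// Illustrative deny-by-default seccomp filter; the allowlist below is a
// toy, not a production profile.
package example

import seccomp "github.com/seccomp/libseccomp-golang"

func applyMinimalFilter() error {
	// Any syscall not explicitly allowed kills the offending process.
	filter, err := seccomp.NewFilter(seccomp.ActKillProcess)
	if err != nil {
		return err
	}
	for _, name := range []string{"read", "write", "exit_group", "futex"} {
		sc, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			return err
		}
		if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
			return err
		}
	}
	// Compile to BPF and load into the kernel for this process.
	return filter.Load()
}
```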
The effectiveness of these security measures depends on proper configuration and regular maintenance. In particular, kernel hardening requires regular kernel updates and system reboots to ensure patched versions are actually in use.
Networking Architecture
Our networking architecture implements multiple isolation layers:
┌───────────────────┐ ┌───────────────────┐
│ Container Network │ │ Container Network │
├───────────────────┤ ├───────────────────┤
│ virtio-net │ │ virtio-net │
├───────────────────┤ ├───────────────────┤
│ TAP Device │ │ TAP Device │
├───────────────────┤ ├───────────────────┤
│ CNI Plugin │ │ CNI Plugin │
└───────────────────┘ └───────────────────┘
│ │
┌─────────┴─────────────────────┴─────────┐
│ Host Network │
└─────────────────────────────────────────┘
Each container gets:
- Dedicated virtio-net device
- Isolated TAP interface (see the sketch after this list)
- Separate network namespace
- Unique IP address and MAC address
- Firewall rules at the host level
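For the TAP layer, the host-side plumbing can be expressed with netlink. A simplified Go sketch using `github.com/vishvananda/netlink` (the interface name is a placeholder); Firecracker is then pointed at the device through its `/network-interfaces` API:

```go
// Illustrative host-side TAP setup; the interface name is a placeholder.
package example

import "github.com/vishvananda/netlink"

func createTap(name string) error {
	tap := &netlink.Tuntap{
		LinkAttrs: netlink.LinkAttrs{Name: name},
		Mode:      netlink.TUNTAP_MODE_TAP,
	}
	// Create the TAP device, then bring it up so the VMM can attach to it.
	if err := netlink.LinkAdd(tap); err != nil {
		return err
	}
	return netlink.LinkSetUp(tap)
}
```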
While this virtualized I/O path provides stronger isolation, it typically introduces some overhead compared to native container implementations. Our benchmarks show approximately a 45μs increase in network latency compared to traditional containers, which is acceptable for most workloads but may impact extremely latency-sensitive applications.
Storage Architecture
Our storage architecture implements multiple isolation and performance layers:
┌───────────────────┐ ┌───────────────────┐
│ Container Storage │ │ Container Storage │
├───────────────────┤ ├───────────────────┤
│ virtio-blk │ │ virtio-blk │
├───────────────────┤ ├───────────────────┤
│ dm-verity │ │ dm-verity │
├───────────────────┤ ├───────────────────┤
│ CSI Plugin │ │ CSI Plugin │
└───────────────────┘ └───────────────────┘
│ │
┌─────────┴─────────────────────┴─────────┐
│ Host Storage │
└─────────────────────────────────────────┘
Each container gets:
- Dedicated virtio-blk device
- Integrity verification through dm-verity (see the sketch after this list)
- Copy-on-write overlay filesystem
- I/O throttling at the host level
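The dm-verity layer can be illustrated with the standard `veritysetup` tool. A simplified Go sketch that maps an integrity-checked view of a device; the device paths, mapping name, and root-hash handling are placeholders:

```go
// Illustrative wrapper around the standard veritysetup CLI.
package example

import "os/exec"

// openVerityDevice maps /dev/mapper/<name> as a read-only,
// integrity-checked view of dataDev; any tampered block fails
// verification at read time. All arguments are placeholders.
func openVerityDevice(dataDev, name, hashDev, rootHash string) error {
	return exec.Command("veritysetup", "open", dataDev, name, hashDev, rootHash).Run()
}
```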
The storage architecture introduces approximately a 15-20% overhead in I/O throughput compared to native containers, which we've found acceptable given the security benefits.
Performance Optimization Techniques
We've implemented several optimization techniques to minimize the virtualization overhead:
- Huge Pages: Reduces TLB misses and improves memory access performance by ~15%
- CPU Pinning: Dedicates specific CPU cores to containers, reducing context switching overhead by ~8% (see the sketch after this list)
- Direct Attach Storage: Bypasses the host filesystem for improved I/O, delivering up to 30% better throughput
- Memory Ballooning: Dynamically adjusts VM memory allocation, improving memory utilization by ~25%
- Hot Plug Resources: Adds/removes CPUs and memory without restart, essential for elastic workloads
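As an example of the CPU pinning item, here is a hypothetical host-side sketch using `golang.org/x/sys/unix`; the PID and core number are placeholders for values our scheduler would pick.

```go
// Illustrative CPU pinning for a vCPU thread; PID and core are placeholders.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// pinToCore restricts pid to one dedicated host core, which removes
// cross-core migrations and most context-switch overhead for that thread.
func pinToCore(pid, core int) error {
	var set unix.CPUSet
	set.Zero()
	set.Set(core)
	return unix.SchedSetaffinity(pid, &set)
}

func main() {
	if err := pinToCore(12345, 3); err != nil { // placeholder PID and core
		fmt.Println("pinning failed:", err)
	}
}
```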
These optimizations collectively reduce the performance gap between virtualized containers and native containers to less than 10% for most workloads, making the security benefits well worth the tradeoff.
Platform Considerations
It's important to note that Kata Containers currently runs only on Linux hosts, which may limit deployment options for organizations with mixed operating system environments. The project is governed by the Open Infrastructure Foundation (formerly the OpenStack Foundation), so users benefit from a well-established development model and community support.
Conclusion
Propel's implementation of Kata Containers with Firecracker provides defense-in-depth security for multi-tenant environments. By leveraging hardware virtualization, we've created true isolation between containers without sacrificing the developer experience or performance.
The multi-layered security approach has proven particularly valuable—even when vulnerabilities have been discovered in one layer, other protection mechanisms have prevented exploitation. This defense-in-depth philosophy is essential for securing modern containerized environments.
Our approach has proven effective in production environments, handling hundreds of containers with security boundaries comparable to traditional VMs but with the efficiency of containers.