Summary
The latest stable kernel is Linux 6.1.11, released by Greg K-H on February 9th 2023.
The latest mainline (development) kernel is 6.2-rc7, released on February 5th 2023.
Linux 6.2 progress
A typical kernel development cycle begins with the “merge window” (a period of time during which disruptive changes are allowed to be merged into the kernel), followed by a series of (weekly) Release Candidate (RC) kernels, and then the final release. In most cases, RC7 is the final RC, but it is not all that unusual to have an extra week, as is likely the case this time around. Linus said a few weeks ago, “I am expecting to do an rc8 this release regardless, just because we effectively had a lost week or two in the early rc’s”, and indeed fixes for RC8 were still coming in as recently as today. At this rate we should see RC8 tomorrow (Sunday is the normal release day), and the 6.3 merge window in another week, meaning we’ll cover the 6.3 merge window in the next edition of this podcast. In the meantime, I encourage listeners to consider subscribing to and supporting LWN (Linux Weekly News), who always have a great merge window summary.
Confidential Compute (aka “CoCo”)
If there were a “theme of the moment” for the industry (other than layoffs), it would probably be Confidential Compute. It seems one can’t go more than 10 minutes without seeing a patch for some new confidential compute feature in one of the major architectures, or the system IP that goes along with it. Examples in just the past few weeks (and which we’ll cover in a bit) include patches from both Intel (TDX) and AMD (SEV-SNP) for their Confidential Compute solutions, as well as PCI pass-through support in Hyper-V for Confidential VMs. At the same time, thought is going into revising the kernel’s “threat model” to update it for a world of Confidential Compute.
A fundamental tenet of Confidential Compute is that guests no longer necessarily have to trust the hypervisor on which they are running, and quite possibly also don’t trust the operator of the system either (whether a cloud, edge network, OEM, etc.). The theory goes that you might even have a server sitting in some (less than friendly) geographical location but still hold out a certain amount of trust for your “confidential” workloads based on properties provided by the silicon (and attested by introspecting the other physical and emulated devices provided by the system). In this model, you necessarily have to trust the silicon vendor, but maybe not much beyond that.
Elena Reshetova (Intel) posted “Linux guest kernel threat model for Confidential Computing” in which she addressed Greg Kroah-Hartman (“Greg K-H”), who apparently previously requested “that we ought to start discussing the updated threat model for kernel”. She included links to quite detailed writeups on Intel’s github. Greg replied to a point about not trusting the hypervisor with “That is, frankly, a very funny threat model. How realistic is it really given all of the other ways that a hypervisor can mess with a guest?”. And that did indeed use to be a good point. Some of the earlier attempts at Confidential Compute included architectural designs in which guest registers were not protected against single-step debug (and introspection) from the hypervisor, for example. So one can be forgiven for thinking that there are some fundamental gaps, but a lot has changed over the past few years, and the architectures have advanced quite a bit since.
Greg also noted that he “hate[s] the term “hardening””, as applied to “hardening” device drivers against malicious hardware implementations (as opposed to just potentially buggy ones). He added, “Please just say it for what it really is, “fixing bugs to handle broken hardware”. We’ve done that for years when dealing with PCI and USB and even CPUs doing things that they shouldn’t be doing. How is this any different in the end? So what you also are saying here now is “we do not trust any PCI devices”, so please just say that (why do you trust USB devices?) If that is something that you all think that Linux should support, then let’s go from there.” David Alan Gilbert piled on with some context around Intel’s and AMD’s implementations, and in particular that more than mere memory encryption is involved; register state, the guest VMSA (VM Save Area), and much more are carefully managed under the new world order.
Daniel Berrange further clarified, in response to a discussion about deliberately malicious implementations of PCI and USB controllers, that, “As a baseline requirement, in the context of confidential computing the guest would not trust the hypervisor with data that needs to remain confidential, but would generally still expect it to provide a faithful implementation of a given device.” A lot of further back and forth took place, with others piling on comments indicating that a few folks weren’t aware of the different technical pieces involved in attesting a device before trusting it from within a guest (e.g. PCI IDE, CMA, DOE, SPDM, and other acronyms), or even that this was possible. The thread was perhaps most informative for revealing that general knowledge of the technology involved in Confidential Compute is not yet broadly pervasive. Perhaps there is an opportunity there for sessions at the newly revived in-person conferences taking place in ’23.
Ongoing Development
Miguel Ojeda posted a patch introducing a new “Rust fixes” branch, noting, “While it may be a bit early to have a “fixes” branch, I guessed it would not hurt to start practicing how to do things for the future when we may get actual users. And since the opportunity presented itself, I wanted to also use this PR to bring up a “policy” topic and ideally get kernel maintainers to think about it.” He went on to describe the PR as containing a fix for a “soundness issue” related to UB (Undefined Behavior) in which “safe” rust code can nonetheless trigger UB in C code. He wanted to understand whether such fixes were truly considered fixes suitable for backport to stable and was mostly interested in addressing the policy aspect of the development process. Linus took the pull request without discussion, so presumably it wasn’t a big deal for him.
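To make the “soundness” concern concrete, here is a hypothetical illustration (in C, since the bindings sit on top of C kernel code): a helper with an unstated precondition. If a Rust abstraction marked “safe” wrapped such a helper without enforcing that precondition, ordinary safe Rust callers could trigger undefined behavior in the C code, which is the class of bug being fixed. The function name and signature below are invented for illustration and are not from the patch.

```c
#include <stddef.h>

/*
 * Hypothetical C helper: callers MUST guarantee idx < len.
 * Nothing in the signature enforces that precondition.
 */
int buf_read(const int *buf, size_t len, size_t idx)
{
    (void)len;          /* the check is the caller's responsibility */
    return buf[idx];    /* idx >= len would be undefined behavior   */
}

/*
 * A Rust wrapper that exposed buf_read() as a "safe" function without
 * checking idx against len would be unsound: safe Rust code could then
 * reach the UB above, even though the bug is on the binding side.
 */
```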
Saurabh Sengar posted “Device tree support for Hyper-V VMBus driver”, which “expands the VMBus driver to include device tree support. This feature allows for a kernel boot without the use of ACPI tables, resulting in a smaller memory footprint and potentially faster boot times. This is tested by enabling CONFIG_FLAT and OF_EARLY_FLATTREE for x86.” It isn’t articulated in the patch series, but this smells like an effort to support a special case minimal kernel – like the kind used by Amazon’s “Firecracker” for fast spinup VMs used to back ephemeral services like “functions”. It will be interesting to see what happens with this.
Elliot Berman (QUIC, part of Qualcomm) posted version 9 of a patch series “Drivers for gunyah hypervisor”. Gunyah is, “a Type-1 hypervisor independent of any high-level OS kernel, and runs in a higher CPU privilege level. It does not depend on any lower-privileged OS kernel/code for its core functionality. This increases its security and can support a much smaller trusted computing base than a Type-2 hypervisor.” The Gunyah source is available on github.
Breno Leitao posted “netpoll: Remove 4s sleep during carrier detection” noting that “Modern NICs do not seem to have this bouncing problem anymore, and this sleep slows down the machine boot unnecessarily”. What he meant is that traditionally the carrier on a link might be reported as “up” while autonegotiation was still underway. As Jakub Kicinski noted, especially on servers the “BMC [is often] communicating over NC-SI via the same NIC as gets used for netconsole. BMC will keep the PHY up, hence the carrier appearing instantly.”
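For context, the setup path in question waits for the link carrier to come up and then historically slept several more seconds in case the carrier was still “bouncing” during autonegotiation. A simplified sketch of that pattern (paraphrased, not the actual netpoll code, with error handling trimmed) looks roughly like this:

```c
#include <linux/netdevice.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

/* Simplified sketch of a netpoll-style carrier wait (not the real code). */
static int wait_for_carrier(struct net_device *ndev)
{
    unsigned long deadline = jiffies + msecs_to_jiffies(30000);

    /* Wait (up to a timeout) for the PHY to report the link as up. */
    while (!netif_carrier_ok(ndev)) {
        if (time_after(jiffies, deadline))
            return -ETIMEDOUT;
        msleep(1);
    }

    /*
     * The historical extra delay: assume a carrier that came up very
     * quickly might still bounce, and sleep a few more seconds. This
     * is the sleep Breno's patch removes for modern NICs.
     */
    msleep(4000);
    return 0;
}
```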
Robin Murphy (Arm) posted a patch series aiming to “retire” the “iommu_ops” per bus IOMMU operations and reconcile around a common kernel implementation.
SeongJae Park continues to organize a periodic virtual “Beer/Coffee/Tea” chat series for those interested in DAMON. The agenda and info are in a shared Google doc.
Architectures
Arm
Suzuki K Poulose posted several sets of (large) related RFC (Request For Comment) patches beginning with, “Support for Arm CCA VMs on Linux”. Arm CCA is a new feature introduced as part of the Armv9 architecture, including both the “Realm Management Extension” (RME) and associated system level IP changes required to build machines that support Confidential Compute “Realms”. In the CCA world, there are additional security states beyond the traditional Secure/Non-Secure. There is now a Realm state in which e.g. a Confidential Compute VM communicates with a new “RMM” (Realm Management Monitor) over an “RSI” (Realm Service Interface) to obtain special services on the Realm’s behalf. The RMM is separated from the traditional hypervisor and “provides standard interfaces – Realm Management Interface (RMI) – to the Normal world hypervisor to manage the VMs running in the Realm world (also called Realms in short).” The idea is that the RMM is well known (e.g. Open Source) code that can be attested and trusted by a Realm to provide it with services on behalf of an untrusted hypervisor.
Arm included links to an updated “FVP” (Fixed Virtual Platform) modeling an RME-enabled v9 platform, alongside patched TF-A (Trusted Firmware), an RMM (Realm Management Monitor), kernel patches, an updated kvmtool (a lightweight alternative to qemu for starting VMs), and updated kvm-unit-tests. Suzuki notes that they are seeking feedback on:
- KVM integration of the Arm CCA
- KVM UABI for managing the Realms, seeking to generalise the operations wherever possible with other Confidential Compute solutions.
- Linux Guest support for Realms
kvx
Yann Sionneau (Kalray) posted version 2 of a patch series, “Upstream kvx Linux port”, which adds support for yet another architecture, as used in the “Coolidge (aka MPPA3-80)” SoC. The architecture is a little-endian VLIW (Very Long Instruction Word) design with 32 and 64-bit execution modes, 64 GPRs, SIMD instructions, and (but of course) a “deep learning co-processor”. The architecture appears to borrow nomenclature from elsewhere, having both an “APIC” and a “GIC” as part of its interrupt controller story; presumably these mean something quite different here. In the mail, Yann notes that this is only an RFC at this stage, “since kvx support is not yet upstreamed into gcc/binutils”. The most infamous example of a VLIW architecture is, of course, Intel’s Itanium. It is slowly being removed from the kernel in a process that began in 2019 with the announcement of the final Itanium systems and the deprecation of GCC and GLIBC support. If things go well, perhaps this new architecture can take Itanium’s place as the kernel’s only VLIW architecture.
RISC-V
Anup Patel (Ventana Micro) is heavily involved in various RISC-V architecture enablement, including for the new “AIA” (Advanced Interrupt Architecture) specification, replacing the de facto use of SiFive’s “PLIC” interrupt controller. The spec has now been frozen (Anup provided a link to the frozen AIA specification) and initial patches are posted by Anup enabling support for guests to see a virtualized set of CSRs (Configuration and Status Registers). AIA is designed to be fully virtualizable, although as this author has noted from reading the spec, it does require an interaction with the IOMMU to interdict messages in order to allow for device live migration.
Sunil V L (Ventana Micro) posted patches to “Add basic ACPI support for RISC-V”. The patches come alongside others for EDK2 (UEFI, aka “Tianocore”), and Qemu (to run that firmware and boot RISC-V kernels enabled with ACPI support). This is an encouraging first step toward an embrace of the kinds of technologies required for viability in the mainstream. This author recalls the uphill battle that was getting ACPI support enabled for Arm. Perhaps the community has more experience to draw upon at this point, and a greater understanding of the importance of such standards to broader ecosystems. In any case, there were no objections this time around.
x86-64
Early last year (2022), David Woodhouse (Amazon) posted the 4th version of a patch series he had been working on titled “Parallel CPU bringup for x86_64”, which aims to speed up the boot process for large SMP x86 systems. Traditionally, x86 systems would enter the Linux kernel in a single-threaded mode, with a “bootcpu” being the first core that happened to start Linux (not necessarily “cpu0”). Once early initialization was complete, this CPU would use a SIPI (Startup IPI, or “Inter-Processor Interrupt”) to signal to the “secondary” cores that they should start booting. The entire process could take quite some time, and it would therefore be better if these “secondary” cores could start their initialization earlier – while the first core was getting things set up – and then rendezvous, waiting for a signal to proceed.
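Conceptually, the change is from bringing each secondary CPU (AP) fully online before waking the next, to waking them all early so that their initialization overlaps, with a rendezvous at the end. A rough pseudo-C sketch of the two shapes follows; send_init_sipi(), wait_until_online() and NR_APS are invented placeholders, not actual kernel interfaces.

```c
/* Pseudo-C sketch only: the helpers below are invented placeholders. */

void bringup_serial(void)
{
    for (int cpu = 1; cpu < NR_APS; cpu++) {
        send_init_sipi(cpu);      /* wake one AP...                  */
        wait_until_online(cpu);   /* ...and wait before the next one */
    }
}

void bringup_parallel(void)
{
    for (int cpu = 1; cpu < NR_APS; cpu++)
        send_init_sipi(cpu);      /* wake all APs up front */

    /*
     * The APs perform early initialization concurrently while the boot
     * CPU continues its own setup, then everyone rendezvouses before
     * the rest of SMP bring-up proceeds.
     */
    for (int cpu = 1; cpu < NR_APS; cpu++)
        wait_until_online(cpu);
}
```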
Usama Arif (Bytedance) noted that these older patches “brought down the smpboot time from ~700ms to 100ms”. That’s a decent savings, especially when using kexec as Usama is doing (perhaps in a “Linuxboot” type of configuration with Linux as a bootloader), and at the scale of a large number of systems. Usama was interested to know whether these patches could be merged. David replied that the last time around there had been some AMD systems that broke with the patches, “We don’t *think* there are any remaining software issues; we think it’s hardware. Either an actual hardware race in CPU or chipset, or perhaps even something as simple as a voltage regulator which can’t cope with an increase in power draw from *all* the CPUs at the same time. We have prodded AMD a few times to investigate, but so far to no avail. Last time I actually spoke to Thomas [Gleixner – one of the core x86 maintainers] in person, I think he agreed that we should just merge it and disable the parallel mode for the affected AMD CPUs.”. The suggestion was to proceed to merge but to disable this feature on all AMD CPUs for the moment out of an abundance of caution.
Nikunj A Dadhania (AMD) posted patches enabling support for a “Secure TSC” for SNP (Secure Nested Paging) guests. SNP is part of AMD’s Confidential Compute strategy and securing the TSC (Time Stamp Counter) is a necessary part of enabling confidential guests to not have to trust the host hypervisor. Prior to these patches, a hypervisor could interdict the TSC, providing a different view of the passage of CPU time to the guest than reality. With the patches, “Secure TSC allows guest to securely use RDTSC/RDTSCP instructions as the parameters being used cannot be changed by hypervisor once the guest is launched. More details in the AMD64 APM Vol 2, Section “Secure TSC”. According to Nikunj, “During the boot-up of the secondary cpus, SecureTSC enabled guests need to query TSC info from Security processor (PSP). This communication channel is encrypted between the security processor and the guest, hypervisor is just the conduit to deliver the guest messages to the security processor. Each message is protected with an AEAD (AES-256 GCM).”
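For reference, RDTSC/RDTSCP are the unprivileged instructions a guest uses to read the Time Stamp Counter; Secure TSC ensures the scaling and offset applied to that reading cannot be changed by the hypervisor after the guest is launched. A minimal userspace example of reading the TSC (illustrative only, not taken from the patch series):

```c
#include <stdint.h>
#include <stdio.h>

/* Read the Time Stamp Counter using the RDTSC instruction (x86-64). */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;

    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t t0 = read_tsc();
    uint64_t t1 = read_tsc();

    /* With Secure TSC, an SNP guest can trust that deltas like this
     * reflect genuine TSC progression rather than values adjusted by
     * the hypervisor behind its back. */
    printf("delta: %llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}
```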
Rick Edgecombe (Intel) posted an updated patch series titled “Shadow stacks for userspace” that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace.” As Rick notes, “The main use case for shadow stack is providing protection against return oriented programming attacks”. ROP attacks string together pre-existing “gadgets” (snippets of existing code, not necessarily well-defined functions in their own right) by exploiting a vulnerability that causes a function to return into an attacker-chosen gadget sequence. Shadow stacks mitigate this by adding a separate, hardware-managed stack that tracks function call and return sequences, ensuring that returns only go back to real call sites (with special handling for longjmp-like sequences).
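To illustrate the class of bug shadow stacks mitigate, here is a textbook vulnerable C function (not taken from the patch series): an out-of-bounds write can overwrite the saved return address on the regular stack and redirect the return into attacker-chosen gadgets. With a CET shadow stack enabled, the hardware keeps a second, protected copy of each return address and raises a control-protection fault if the two disagree on return.

```c
#include <stddef.h>
#include <string.h>

/* Deliberately vulnerable: no bounds check on the copy. */
void parse_input(const char *input, size_t input_len)
{
    char buf[16];

    /* An input longer than 16 bytes overflows buf and can clobber the
     * saved return address on the conventional stack. */
    memcpy(buf, input, input_len);

    /* On return, a corrupted return address would normally transfer
     * control to attacker-chosen "gadgets". With a CET shadow stack,
     * the CPU compares the return address against its protected copy
     * and faults on any mismatch. */
}
```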
Jarkko Sakkinen posted some fixes for issues in AMD’s SEV-SNP hypervisor support that had been discovered by the Enarx developers. I’m mentioning it because this patch series may have been the final one to go out from the startup “Profian”, which had been seeking to commercialize support for Enarx. Profian closed its doors in the past few weeks due to the macro-economic environment. Some great developers are available on the market and looking for new opportunities. If you are hiring, or know folks who are, you can see posts from the Profian engineers on LinkedIn.
Final words
The Open Source Summit North America returns in person (and virtually) this year, beginning May 10th in Vancouver, British Columbia, Canada. Several other events are planned to be colocated alongside the Open Source Summit. These include the (invite-only) Linux Storage, Filesystem, Memory Management, and BPF (LSF/MM/BPF) Summit, for which a CFP is open. Another colocated event is the Linux Security Summit North America, whose CFP was announced by James Morris with a link for submitting proposals.
Cyril Hrubis (SUSE) posted an announcement that the January 2023 release of the Linux Test Project (LTP) is out. It includes a number of new tests, among them “dirtyc0w_shmem aka CVE-2022-2590”. The project has also raised its minimum C requirement to -std=gnu99. Linux itself moved to a baseline of C11 (gnu11, up from the much older gnu89/C89 standard) as of Linux 5.18.
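As a small illustration of what the newer baseline allows (assuming a kernel built with -std=gnu11, as has been the case since 5.18): iterator variables may now be declared inside the for statement itself, a C99-and-later feature that the old gnu89 baseline rejected and which motivated part of the switch.

```c
/* Allowed under -std=gnu11 (the kernel's baseline since 5.18), but
 * rejected by the previous -std=gnu89 baseline. */
int sum_first_n(const int *vals, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++)   /* loop-scoped iterator declaration */
        sum += vals[i];

    return sum;
}
```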