S2E3 – 2023/03/09

Season 2 – Episode 3 – 2023/03/09

Summary

The latest stable kernel is Linux 6.2.2, released by Greg Kroah-Hartman on March 3rd 2023. The latest mainline (development) kernel is 6.3-rc1, released by Linus on March 5th 2023.

Mathieu Desnoyers has announced Userspace RCU release 0.14.0 which adopts a baseline requirement of C99 and C++11, and introduces new APIs for C++.

Alejandro Colomar announced the release of man-pages-6.03. Among the “most notable changes” is that “We now have a hyperlinked PDF book of the Linux man-pages”.

Junio C Hamano announced that “A release candidate Git v2.40.0-rc2 is now available”.

Takashi Sakamoto has stepped up to become the owner of the FireWire subsystem.

Linux 6.2 released

Linux 6.2 was released “right on (the extended) schedule” on February 19th following an extra RC (Release Candidate) motivated by the end of year holidays. Linus noted in his release announcement that “Nothing unexpected happened” toward the end of the cycle but there were a “couple of small things” on the regression side that Thorsten Leemhuis is tracking. Since “they weren’t actively pushed by maintainers…they will have to show up for stable [kernel releases]”.

Thorsten diligently followed up with his summary of regressions, noting that “There are still quite a few known issues from this cycle mentioned below. Afaics none of them affect a lot of people”. He also recently posted “docs: describe how to quickly build a trimmed kernel” as “that’s something users will often have to do when they want to report an issue or test proposed fixes”.

Among the fixes that will come in for stable is a build fix for those running Linux 6.2 on a Talos II (IBM POWER9) machine, who may notice an “undefined reference to ‘hash__tlb_flush’” error during kernel compilation; the fix is being tracked for backport to stable.

Speaking of regressions, Nick Bowler identified an older regression beginning in Linux 6.1 that caused “random crashes” on his SPARC machine. Peter Xu responded that it was likely a THP (Transparent Huge Page) problem, perhaps showing up because THP was disabled (which it was in Nick’s configuration). Nick tested a fix from Peter that seemed to address the issue.

As you’ll see below, ongoing discussions are taking place about the removal of various legacy architectures from the kernel. Another proposal recently made (by Christoph Hellwig) is to “orphan JFS” (the “Journalling File System”). Stefan Tibus was among those who stood up and claimed to still be “a happy user of JFS from quite early on all my Linux installations”.

Linux 6.3-rc1

Linus announced the closure of the merge window (the period of time during which disruptive changes are allowed to be merged into the kernel) with the release of Linux 6.3-rc1, noting, “So after several releases where the merge windows had something odd going on, we finally had just a regular “two weeks of just merge window”. It was quite nice. In fact, it was quite nice in a couple of ways: not only didn’t I have a huge compressed merge window where I felt I had to cram as much as possible into the first few days, but the fact that we _have_ had a couple of merge windows where I really asked for people to have everything ready when the merge window opened seems to have set a pattern: the bulk of everything really did come in early”.

As usual, Linux Weekly News has an excellent summary of part 1 and part 2 of the merge window (across two weeks). I encourage you to subscribe and read it for a full breakdown.

Ongoing Development

Linux 6.2 brought with it initial support for the Rust programming language. Development continues apace upstream, with proposed patches extending the support to include new features. Miguel Ojeda (the Rust for Linux maintainer) posted a pull request for Linux 6.3, including support for various new types. Daniel Almeida recently posted “rust: virtio: add virtio support”, which “adds virtIO support to the rust crate. This includes the capability to create a virtIO driver (through the module_virtio_driver macro and the respective Driver trait)”.

And the work extends to the architectural level also, with Conor Dooley recently posting “RISC-V: enable rust”, which he notes is a “somewhat blind (and maybe foolish) attempt at enabling Rust for RISC-V. I’ve tested this on Icicle [a prominent board], and the modules seem to work. I’d like to play around with Rust on RISC-V, but I’m not interested in using downstream kernels, so figured I should try and see what’s missing…”.

But probably the most interesting development in Rust language land has nothing to do with Rust as a language at all. Instead, it is a patch series titled “Rust DRM subsystem abstractions (& preview AGX driver)” from Asahi Lina. In the patch, Lina notes “This is my first take on the Rust abstractions from the DRM [graphics] subsystem. It includes the abstractions themselves, some minor prerequisite changes to the C side, as well as drm-asahi GPU driver (for reference on how the abstractions are used, but not necessarily intended to land together)”. It’s that last part, patch 18, the one titled “drm/asahi: Add the Asahi driver for Apple AGX GPUs”, which we refer to here. In it, Lina implements support for the GPUs used by the Apple M1, M1 Pro, M1 Max, M1 Ultra, and Apple M2 silicon. This is not a small driver, and it is an interesting demonstration of the level of capability already being reached by upstream Linux’s Rust language support.

Lokesh Gidra posted an “RFC for new feature to move pages from one vma to another without split”, which allows “anonymous” (not file-backed) pages (the page being the fundamental granule by which memory is managed and accounted) to be moved from one part of a runtime heap (VMA) to another without otherwise impacting the state of the overall heap. The intended benefit is to managed runtimes with garbage collection, allowing for simplified “coarse-grained page-level compaction” garbage collection algorithms “wherein pages containing live objects are slid next to each other without touching them, while reclaiming in-between pages which contain only garbage”. The patch posting includes a lengthy writeup explaining the details.

Alison Schofield posted patches titled “CXL Poison List Retrieval & Tracing” targeting the CXL 3.0 specification, which allows OS management software to obtain a list of memory locations that have been poisoned (corrupted due to a RAS event, such as an ECC failure), for example in a “CXL.mem” DDR memory device attached to a system using the serial CXL interconnect.

Dexuan Cui noted that “earlyprintk=ttyS0” was broken on AMD SNP (Confidential Compute) guests running under KVM. This turned out to be due to a particular code branch taken during initialization, which varied depending upon whether the kernel was entered in 64-bit mode via EFI or through a direct load (e.g. kexec, or QEMU’s KVM device-modeling userspace).

Zhangjin Wu posted “Add dead syscalls elimination support”, intended to remove support from the kernel for “dead” syscalls “which are not used in target system”. Presumably this is to benefit deeply embedded systems where any excess memory used by the kernel is precious.

Nick Alcock posted “MODULE_LICENSE removals, first tranche”, intended to remove MODULE_LICENSE usage from files/objects that are not tristate [meaning that they are not actually set up to be used as modules to begin with].

Bobby Eshleman posted “vsock: add support for sockmap”. Bytedance are apparently “testing usage of vsock as a way to redirect guest-local UDS [Unix Domain Socket] requests to the host and this patch series greatly improves the performance of such a setup” – by 121% in throughput.

Chih-En Lin posted version 4 of a patch series “Introduce Copy-On-Write to Page Table” which aims to add support for COW to the other half of the equation. Copy-on-Write is commonly used as an optimization whereby a cloned process (for example, during a fork used to exec a new program) doesn’t actually get a copy of the entire memory used by the original process. Instead, the tracking structures (page tables) are modified to mark all the pages in the new process as read only. Only when it attempts to write to the memory are the actual pages copied. The COW page table patches aim to do the same for the page tables themselves, so that full copies are not needed until the new address space is modified. Pulling off this trick requires that some of the tables are copied, but not the leaf (PTE) entries themselves, which are shared between the two processes. David Hildenbrand thanked Chih-En for the work, and the measurements, but expressed concern about “how intrusive even this basic deduplication approach already is”.
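For readers less familiar with COW, here is a minimal userspace sketch (not from the patch series) showing the existing page-level behavior that the series wants to extend to the page tables themselves:

```c
/* Minimal sketch (not from the patch series): after fork(), parent and
 * child initially share the same physical pages; only a write forces a
 * private copy. The COW page table series applies the same idea to the
 * page tables themselves. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 16 * 4096;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 'p', len);          /* parent touches every page */

    pid_t pid = fork();             /* page tables copied; pages marked read-only */
    if (pid == 0) {
        buf[0] = 'c';               /* write fault: only this page gets copied */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees '%c'\n", buf[0]);  /* 'p': the copies are private */
    return 0;
}
```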

On the subject of page tables, Matthew (Willy) Wilcox posted version 3 of “New page table range API”, which allows for setting up multiple page table entries at once, noting “The point of all this is better performance, and Fengwei Yin has measured improvement on x86”.

Architectures

Arm

Kristina Martsenko posted “arm64: support Armv8.8 memcpy instructions in userspace”, which adds support for (you guessed it) the memcpy instructions that were added in Armv8.8. These are described by the FEAT_MOPS documentation in the Arm ARM. As Kristina puts it, “The aim is to avoid having many different performance-optimal memcpy implementations in software (tailored to CPU model and copy size) and the overhead of selecting between them. The new instructions are intended to be at least as fast as any alternative instruction sequence”.

Various Apple Silicon patches have been posted. As the Asahi Linux project noted recently in “an update and reality check”, Linux “6.2 notably adds device trees and basic support for M1 Pro/Max/Ultra machines. However, there is still a long road before upstream kernels are usable on laptops”. Nonetheless, patches continue to fly, with the latest including “Apple M2 PMU support” from Janne Grunau, which notes that “The PMU itself appears to work in the same way as o[n] M1”, and support for the Broadcom BCM4387 WiFi chip used by Apple M1 platforms from Hector Martin. Hector also posted “Apple T2 platform support” patches.

Itanium

The Intel Itanium architecture, also known as “IA-64”, was originally announced on October 4th 1999. It was intended as the successor to another legacy architecture that Intel had previously introduced back in 1978. That legacy architecture (known as “x86”) had a number of design challenges that could limit its future scalability, but it was also quite popular, and there were a relatively large number of systems deployed. Nonetheless, Intel was determined to replace x86 with a modern architecture designed with the future in mind. Itanium was co-designed with Hewlett-Packard, who created the original ISA specification. It featured 128 64-bit general purpose registers, 128 floating point registers, 64 (one-bit) predicate registers, and more besides.

Itanium was a VLIW (Very Long Instruction Word) machine that leveraged fixed-width 128-bit “bundles” of three instructions, each individually 41 bits wide, plus a 5-bit template describing which types of instructions are present in the bundle. The Itanium implementation of VLIW is referred to as “EPIC” (Explicitly Parallel Instruction Computing) – which one must be careful not to confuse with the highly successful x86 architecture implementation from AMD known as “EPYC”. In Itanium, modern high performance microprocessor innovations such as hardware speculation and Out-of-Order execution take a back seat to software managed speculation, requiring an extremely complicated compiler toolchain that took many years to develop. Even then, it was clear early on that software management of dependencies and speculation could not compete with a hardware implementation, such as that used by contemporary x86 and RISC CPUs.

Intel Itanium processors were officially discontinued in January of 2020. As Ard Biesheuvel noted across several patch postings attempting to remove or mark IA-64 as broken, various support for Itanium has already been removed from dependent projects (such as upstream Tianocore – the EFI implementation needed to boot such systems, from which Intel itself removed such support in 2018), “QEMU no longer implements support for it”, and given the lack of systems and ongoing firmware maintenance, “there is zero test coverage using actual hardware” (“beyond a couple of machines used by distros to churn out packages”). Even this author has long since decommissioned his Itanium system (named “Hamartia” after the tragic flaw), which he acquired during the upstreaming of PCI support for Arm: both of Itanium’s users had expressed concern that Arm support for PCI might break Itanium, and it thus seemed important to be able to test that this mission-critical architecture was not broken in the process.

As of this writing support for Itanium has not (yet) been removed from the kernel.

LoongArch

A lot of work is going into the LoongArch [aside: could someone please let me know how to pronounce it properly?]. Recent patches include a patch from Youling Tang (“Add support for kernel relocation”) that “allows to compile [the] kernel as PIE and to relocate it at any virtual address at runtime” (to “pave the way to KASLR”, added in a later patch). Another patch, “Add hardware breakpoints/watchpoints support”, does what it says on the tin. Finally, Tianrui Zhao posted “Add KVM LoongArch support”, which adds KVM support, noting that the Loongson (the company behind the architecture) “3A5000” chip “supports hardware assisted virtualization”.

RISC-V

Evan Green posted “RISC-V: Add a syscall for HW probing” which started an extremely long discussion about the right (and wrong) ways to handle the myriad (sometimes mutually incompatible) extensions supported by the RISC-V community. Traditionally, architectures were quite standardized with a central authority providing curation. But while RISC-V does have the RISC-V International organization, and the concept of ratification for extensions with a standard set of extensions defined in various profiles, the practical reality is somewhat less rigid than folks may be used to. As a result, there are in fact a very wide range of implementations, and the kernel needs to somehow be able to handle all of the hundreds of permutations.

Most architectures handle minor variation between implementations using the “HWCAP” infrastructure and the “auxiliary vector”, a set of key/value pairs that the kernel passes to every process at startup alongside its environment. This allows (e.g.) userspace software to quickly determine whether a particular feature is supported or not. For example, the feature might be some novel atomic or vector support that isn’t present in older processors. But when it comes to RISC-V this approach isn’t as easy. As Evan said in his posting, “We don’t have enough space for these all in ELF_HWCAP and there’s no system call that quite does this, so let’s just provide an arch-specific one to probe for hardware capabilities. This currently just provides m{arch,imp,vendor}id, but with the key-value pairs we can pass more in the future”.
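For context, this is roughly what the existing HWCAP path looks like from userspace – a minimal sketch using getauxval(); the HWCAP_ASIMD bit shown is an arm64 example (not RISC-V), used here purely for illustration:

```c
/* Minimal sketch of the existing HWCAP mechanism: the kernel exports a
 * bitmask of CPU features through the auxiliary vector, and userspace
 * reads it with a single getauxval() call – no file I/O required.
 * HWCAP_ASIMD is arm64's Advanced SIMD bit, used here as an example. */
#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP_ASIMD
#define HWCAP_ASIMD (1UL << 1)   /* arm64 value, normally from <asm/hwcap.h> */
#endif

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);

    if (hwcap & HWCAP_ASIMD)
        printf("Advanced SIMD supported\n");
    else
        printf("Advanced SIMD not reported\n");
    return 0;
}
```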

The response was swift, and negative, with Greg Kroah-Hartman responding, “Ick, this is exactly what sysfs is designed to export in a sane way. Why not just use that instead? The “key” would be the filename, and the value the value read from the filename”. The response was that this would slow down future RISC-V systems because of the large number of file operations that every process would need to perform on startup in order for the standard libraries to figure out what features were supported or not. Worse, some of the infrastructure for file operations might not be available at the time when it would be needed. This situation is a good reminder of the importance of standardization and the value that it can bring to any modern architecture.
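To see why the sysfs suggestion worried the RISC-V developers, consider what each feature lookup would cost – a purely illustrative sketch; the sysfs path shown below is hypothetical and does not exist:

```c
/* Illustrative only: with a sysfs-style interface, each key becomes a
 * file and each lookup costs an open/read/close. The path used below is
 * hypothetical. Multiply this by dozens of keys at the start of every
 * process and the startup-cost concern becomes clear. */
#include <stdio.h>
#include <string.h>

static int read_sysfs_key(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    return 0;
}

int main(void)
{
    char value[64];

    /* hypothetical key; a real system would expose many such files */
    if (read_sysfs_key("/sys/devices/system/cpu/riscv/isa_ext/zba",
                       value, sizeof(value)) == 0)
        printf("zba: %s\n", value);
    else
        printf("key not present\n");
    return 0;
}
```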

Speaking of standardization, several rounds of patches were posted titled “Add basic ACPI support for RISC-V” which “enables the basic ACPI infrastructure for RISC-V”. According to Sunil V L, who posted the patch series, “Supporting external interrupt controllers is in progress and hence it is tested using poll based HVC SBI console and RAM disk”.

Other patches recently posted for RISC-V include “Introduce virtual kernel mapping KASLR”. The patches note that “The seed needed to virtually move the kernel is taken from the device tree, so we rely on the bootloader to provide the correct seed”. Later patches may add support for the RISC-V “Zkr” random extension so that this can be provided by hardware instead. As a dependent patch, Alexandre Ghiti posted “Introduce 64b relocatable kernel”.

Deepak Gupta posted “riscv control-flow integrity for U mode” in which he notes he has “been working on linux support for shadow stack and landing pad instruction on riscv for a while. These are still RFC quality. But at least they’re in a shape which can start a discussion”. The RISC-V extension adding support for control flow integrity is called Zisslpcfi, which rolls off the tongue just as easily as all of the other extension names, chosen by cats falling on keyboards.

Jesse Taube posted “Add RISC-V 32 NOMMU support”, noting, “This patch-set aims to add NOMMU support to RV32. Many people want to build simple emulators or HDL models of RISC-V. [T]his patch makes it possible to run linux on them”.

Returning to the topic of incompatible vendor extensions, Heiko Stuebner posted “RISC-V: T-Head vector handling”, which notes “As is widely known, the T-Head C9xx cores used for example in the Allwinner D1 implement an older non-ratifed variant of the vector spec. While userspace will probably have a lot more problems implementing support for both, on the kernel side the needed changes are actually somewhat small’ish and can be handled via alternatives somewhat nicely. With this patchset I could run the same userspace program (picked from some riscv-vector-test repository) that does some vector additions on both qemu and a d1-nezha board. On both platforms it ran successfully and even produced the same results”.

Super-H

Returning to the subject of dying architectures once again, an attempt was made by Christoph Hellwig to “Drop arch/sh and everything that depends on it” since “all of the support has been barely maintained for almost 10 years, and not at all for more than 1 year”. Geert Uytterhoeven noted that “The main issue is not the lack of people sending patches and fixes, but those patches never being applied by the maintainers. Perhaps someone is willing to stand up to take over maintainership?” This caused John Paul Adrian Glaubitz to raise his hand and say he “actually would be willing to do it but I’m a bit hesitant as I’m not 100% sure my skills are sufficient”. Rob Landley offered to help out too. It seems sh might survive this round.

x86

Mathieu Desnoyers was interested in formal documentation from Intel concerning concurrent modification of code while it is executing (specifically, updating instructions to patch them as calling a debug handler via “INT3”). He wrote to Peter Anvin saying “I have emails from you dating from a few years back unofficially stating that it’s OK to update the first byte of an instruction with a single-byte int3 concurrently…Olivier Dion is working on the libpatch project aiming to use this property for low-latency/low-overhead live code patching in user-space as well, but we cannot find an official statement from Intel that guarantees this breakpoint-bypass technique is indeed OK without stopping the world while patching”. Steven Rostedt was among those who noted “The fact that we have been using it for over 10 years without issue should be a good guarantee”. Mathieu was able to find comprehensive documentation in the AMD manual that allows it, but noted again “I cannot find anything with respect to asynchronous cross-modification of code stated as clearly in Intel’s documentation”. Anyone want to help him?
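For the curious, the breakpoint-bypass sequence being discussed looks roughly like the following – a heavily simplified sketch in the spirit of the kernel’s text_poke_bp() (error handling, making the code page writable, and the trap handler that must field any int3 hit mid-patch are all glossed over):

```c
/* Heavily simplified sketch of the int3 breakpoint-bypass patching
 * sequence. A trap handler must already be installed so that any CPU
 * executing the target while it is being patched is redirected or
 * retried, and the code page must have been made writable beforehand. */
#include <linux/membarrier.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define INT3 0xCC

/* Force all cores to serialize instruction fetch. Requires a prior
 * MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE registration. */
static void sync_cores(void)
{
    syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
}

void live_patch(uint8_t *target, const uint8_t *new_insn, size_t len)
{
    /* 1. Plant int3 over the first byte so a concurrently executing CPU
     *    traps rather than seeing a torn, half-written instruction. */
    target[0] = INT3;
    sync_cores();

    /* 2. Write the tail of the new instruction while int3 guards it. */
    memcpy(target + 1, new_insn + 1, len - 1);
    sync_cores();

    /* 3. Replace the int3 with the new first byte, making the whole new
     *    instruction visible in a single byte-sized store. */
    target[0] = new_insn[0];
    sync_cores();
}
```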

Development continues toward implementing support for “Flexible Return and Event Delivery” aka “FRED” on Intel architecture. Among the latest patches, Ammar Faizi includes a fix to the “sysret_rip” selftest that handles the fact that, with FRED, the “syscall” instruction (used to enter the kernel from userspace) no longer clobbers (overwrites) the x86 “rcx” and “r11” registers. On the subject of tests, Mingwei Zhang posted patches updating the “amx_test” suite to add support for several of the “new entities” that are present in Intel’s AMX (matrix extension) architecture.

Sean Christopherson posted “KVM: x86: Add “governed” X86_FEATURE framework”, which is intended to “manage and cache KVM-governed features, i.e. CPUID based features that require explicit KVM enabling and/or need to be queried semi-frequently by KVM”. According to Sean, “The idea originally came up in the context of the architectural LBRs [Last Branch Record, a profiling mechanism to record precisely the last N branches taken] series as a way to avoid querying guest CPUID in hot paths without needing a dedicated flag, but as evidenced by the shortlog, the most common usage is to handle the ever-growing list of SVM [Secure Virtual Machine] features that are exposed to L1”. Reducing calls to CPUID is generally a good thing: CPUID is a serializing instruction, and when executed by a guest it results in a (possibly lengthy) trap out to the hypervisor.
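As a reminder of what such a query looks like from software, here is a trivial CPUID example using the compiler’s <cpuid.h> helper (purely illustrative; KVM of course does this in kernel code rather than via this helper):

```c
/* Trivial CPUID query via GCC/Clang's <cpuid.h> helper. CPUID is a
 * serializing instruction, and when executed inside a guest it causes a
 * VM exit, which is why KVM prefers to cache such results rather than
 * re-query them in hot paths. */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("CPUID leaf 1: ecx=%#x edx=%#x\n", ecx, edx);
    return 0;
}
```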

Paolo Bonzini posted “Cross-Thread Return Address Predictions vulnerability”, noting that “Certain AMD processors are vulnerable to a cross-thread return address predictions bug. When running in SMT [Simultaneous Multi-Threading] mode and one of the sibling threads transitions out of C0 state, the other thread gets access to twice as many entries in the RSB [Return Stack Buffer], but unfortunately the predictions of the now-halted logical processor are not purged”. Paolo is referring to the fact that x86 processors include two logical “threads” (which Intel calls “Hyperthreads” – a trademarked name – and which more generally are known as SMT, or Simultaneous Multi-Threading). Most modern x86 processors include an optimization whereby, when one logical thread transitions into what software sees as a “low power” state, its share of the partitioned resources is given to the sibling thread, which consequently sees a boost in performance as it is no longer contending on the back end for execution units, and now has double the store buffer and predictor entries.

But in this case, the RSB [Return Stack Buffer] entries are not zeroed out in the process, meaning that it is possible for a malicious thread to “train” the RSB predictor later used by the peer thread to guess that certain function call return paths will be used. This opens up an opportunity to cause a sibling thread to speculatively execute down a wrong path that leaves cache breadcrumbs, which can be measured in order to potentially leak certain information. Paolo addresses this by adding a KVM (hypervisor) parameter that, “if set, will prevent the user from disabling the HLT, MWAIT, and CSTATE exits”, ensuring that the hypervisor always gets the chance to stuff the RSB with dummy safe values when a sibling thread goes to sleep.

Dionna Glaze posted “Add throttling detection to sev-guest”, noting that “The guest request synchronous API from SEV-SNP [AMD’s Confidential Computing feature] to the host’s security processor consumes a global resource. For this reason, AMD’s docs recommend that the host implements a throttling mechanism. In order for the guest to know it’s been throttled and should try its request again, we need some good-faith communication from the host that the request has been throttled. These patches work with the existing dev/sev-guest ABI”. 

On the subject of Confidential Compute, Kai Huang posted version 9 of a patch series “TDX host kernel support” aiming to add support for Intel’s TDX Confidential Compute extensions, while Jeremi Piotrowski posted “Support nested SNP KVM guests on Hyper-V” intending to add support for nested (hypervisor inside hypervisor) support for AMD’s Confidential Compute to the Hyper-V hypervisor as used by Microsoft Azure. Nested Confidential Compute sounds fun.

Rick Edgecombe posted version 6 of “Shadow stacks for userspace”, a series that “implements Shadow Stacks for userspace using x86’s Control-flow Enforcement Technology (CET)”. As he reminds us, CET supports both shadow stacks and indirect branch tracking (landing pads), but this series “implements just the shadow stack part of this feature, and just for userspace”.

Michael S. Tsirkin posted “revert RNG seed mess” noting “All attempts to fix up passing RNG [random entropy] seed via setup_data entry failed. Let’s just rip out all of it. We’ll start over”.

Arnd Bergmann posted “x86: make 64-bit defconfig the default” noting that 32-bit kernel builds were “rarely what anyone wants these days”. The patch changes “the default so that the 64-bit config gets used unless the user asked for i686_defconfig, uses ARCH=i386 or runs on a system that “uname -m” identifies as i386/i486/i586/i686”.
