Season 2 – Episode 4 – 2023/04/24
The latest stable kernel is Linux 6.3, released by Linus Torvalds on Sunday, April 23rd, 2023.
The latest mainline (development) kernel is 6.3. The Linux 6.4 “merge window” is open.
Linus Torvalds announced the release of Linux 6.3, noting, “It’s been a calm release this time around, and the last week was really no different. So here we are, right on schedule”. As usual, the KernelNewbies website has a summary of Linux 6.3, including links to the appropriate LWN (Linux Weekly News) articles with deep dives for each new feature (if you like this podcast and want to support Linux Kernel journalism, please subscribe to Linux Weekly News).
Linux 6.3 includes additional support for the Rust programming language, a new red-black tree data structure for BPF programs, and the removal of a large number of legacy Arm systems.
With the release of Linux 6.3 comes the opening of the “merge window” (period of time during which disruptive changes are allowed to be merged into the kernel source code) for what will be Linux 6.4 in another couple of months. The next podcast release will include a full summary.
Thorsten Leemhuis has been doing his usual excellent work tracking regressions. He posted multiple updates during the Linux 6.3 development cycle as usual, at one point saying that “The list of regressions from the 6.3 cycle I track is still quite short”. Most seemed to relate to build problems that had stalled for fixes. He had been concerned that there “are two regressions from the 6.2 cycle still not fixed”. These included that “Wake-on-lan (WOL) apparently is broken for a huge number of users” and “a huge number of DISCARD request on NVME devices with Btrfs” causing “a performance regression for some users”. With the final release of Linux 6.3, he has “nothing much to report”, with just “two regression from the 6.3 cycle…worth mentioning”.
Sebastian Andrej Siewior announced pre-empt RT (Real Time) patch v6.3-rc5-rt8.
Shuah Khan posted a summary of complaints addressed by the Linux Kernel Code of Conduct Committee between October 1, 2022 through March 31, 2023. During that time, they received reports of “Unacceptable behavior of comments in email” 6 times. Most were resolved with “Clarification on the Code of Conduct related to maintainer rights and responsibility to reject code”. Overall “The reports were about the decisions made in rejecting code and these actions are not viewed as violations of the Code of Conduct”.
It cannot have escaped anyone’s attention that there is an active military conflict ongoing in Europe. I try to keep politics out of this podcast. We are, after all, not lacking for other places in which to debate our opinions. Similarly, for the most part, it can be convenient as Open Source developers to attempt to live in an online world devoid of politics and physical boundaries, but the real world very much continues to exist, and in the real world there are consequences (in the form of sanctions) faced by those who invade other sovereign nations. Those consequences can be imposed by governments, but also by fellow developers. The latter was the case over the past month with a patch posted to the Linux “netdev” networking development list.
An engineer from (sanctioned) Russian company Baikal Electronics attempted to post some network patches. His post was greeted by a terse response from one of the maintainers: “We don’t feel comfortable accepting patches from or relating to hardware produced by your organization. Please withhold networking contributions until further notice”. Baikal is known for its connections to the Russian state. The question of official policy was subsequently raised by James Harkonnen, citing a message allegedly from Linus in which he reportedly said “I will not stop any kernel developer I trust from taking patches from Russian sources that they in turn trust, but at the same time I will also not override anybody who goes “I don’t want to have anything to do with this” and doesn’t want to work with Russian companies”. James wanted a clarification as to any official position. As of this date no follow up discussion appears to have taken place, and there does not appear to be an official kernel-wide policy on Russian patches.
Konstantin Ryabitsev, who is responsible for running kernel.org on behalf of Linux Foundation, posted “Introducing bugbot”, in which he described a new tool that aims to be “a bridge between bugzilla [as in bugzilla.kernel.org] and public-inbox (the mailing list). The tool is “still a very early release” but it is able to “Create bugs from mailing list discussions, with full history”, and “Start mailing list threads from pre-triaged bugzilla bugs”. He closed (presciently) with “bugbot is very young and probably full of bugs, so it will still see a lot of change and will likely explode a couple of times”. True to the prediction, bugbot saw that it was summoned by the announcement of its existence and it replied to the thread, which Konstantin used as an example of the “may explode” comment he had made. Generally feedback to the new tool was positive.
Anjali Kulkarni posted version 3 of “Process connector bug fixes & enhancements”, a patch series to improve the performance of monitoring the exit of dependent threads. According to Anjali, “Oracle DB runs on a large scale with 100000s of short lived processes, starting up and exiting quickly. A process monitoring DB daemon which tracks and cleans up after processes that have died without a proper exit needs notifications only when a process died with a non-zero exit code (which should be rare)”. The patches allow a “client [to] register to listen for only exit or fork or a mix of all events. This greatly enhances performance”.
Vlastimil Babka posted “remove SLOB and allow kfree() with kmem_cache_alloc()”. In the patch posted, Vlastimil notes that “The SLOB allocator was deprecated in 6.2 so I think we can start exposing the complete removal in for-next and aim at 6.4 if there are no complaints”.
Thorsten Leemhuis (“the Linux kernel’s regression tracker”) poked an older thread about a 20% UDP performance degradation that Tariq Toukan (NVIDIA) had reported a few months ago. The report observed that a specific CFS (Completely Fair Scheduler, the current default Linux scheduler) patch was the culprit, but that the team discovering it “couldn’t come up with a good explanation how this patch causes this issue”. Thorsten tagged the mail for followup tracking.
Lukas Bulwahn posted “Updating information on lanana.org”. Lanana was setup to be “The Linux Assigned Names and Numbers Authority”, a play on organizations like the IANA: Internet Assigned Numbers Authority, that assigns e.g. IP addresses on the internet. As the patches note, “As described in Documentation/admin-guide/devices.rst, the device number register (or linux device list) is at Documentation/admin-guide/devices.txt and no longer maintained at lanana.org”. Lanana still technically hosts some of the LSB (Linux Standard Base) IDs.
On the Rust front, Asahi Lina posted “rust: add uapi crate” that “introduce[s] a new ‘uapi’ crate that will contain only these [uapi] publicly usable definitions” for use by userspace APIs.
Marcelo Tosatti posted “fold per-CPU vmstats remotely”, a patch that notes a (Red Hat) customer had encountered a system in which 48 out of 52 CPUs were in a “nohz_full” state (i.e. completely idle with the idle “tick” interrupt stopped), where a process on the system was “trapped in throttle_direct_reclaim” (a low memory “reclaim” codepath) but was not making progress because the counters the reclaim code wanted to use were stale (coming from a completely idle CPU) and not updating. The patch series causes the “vmstat_shepered” kernel thread to “flush the per-CPU counters to the global counters from remote [other] CPUs”.
Reinette Chatre posted “vfio/pci: Support dynamic allocation of MSI-X interrupts”. MSIs are “Message Signaled Interrupts”, typically used by modern buses, such as PCIe, in which an interrupt is not signaled using a traditional wiggling of a wire, but instead by a memory write to a special magic address that subsequently causes an actual hard-wired interrupt to be asserted. In the patch posting, Reinette noted that “Qemu allocates interrupts incrementally at the time the guest unmasks an interrupt, for example each time a Linux guest runs request_irq(). Dynamic allocation of MSI-X interrupts was not possible until v6.2. This prompted Qemu to, when allocating a new interrupt, first release a previously allocated interrupts (including disable of MSI-X) followed by re-allocation of all interrupts that includes the new interrupt”. This of course may not be possible while a device or accelerator is running. The patches are marked as RFC (Request For Comments) because “vfio support for dynamic MSI-X needs to work with existing user space as well as upcoming user space that takes advantage of this feature”. Reinette adds, “I would appreciate guidance on the expectations and requirements surrounding error handling when considering existing user space”. She provides several scenarios to consider.
Tejun Heo posted version 3 of “sched: Implement BPF extensible scheduler class”, which “proposed a new scheduler class called ‘ext_sched_class’, or sched_ext, which allows scheduling policies to be implemented as BPF programs”. BPF (Berkeley Packet Filter) programs are small specially processed “bytecode” programs that can be loaded into the kernel and run within a special form of sandbox. They are commonly used to implement certain tracing logic and come with restrictions (for obvious reasons) on the nature of the modifications they can make to a running kernel. Due to their complexity, and potential intrusiveness of allowing scheduling algorithms to be implemented in BPF programs, the patches come with a (lengthy) “Motivation” section, describing the “Ease of experimentation and exploration”, among other reasons for allowing BPF extension of the scheduler instead of requiring traditional patches. An example provided includes that of implementing an L1TF (L1 Terminal Fault, a speculation execution security side-channel bug in certain x86 CPUs) aware scheduler that performs co-scheduling of (safe to pair) peer threads using sibling hyperthreads using BPF.
Joel Fernandes sent a patch adding himself as a maintainer for RCU, noting “I have spent years learning / contributing to RCU with several features, talks and presentations, with my most recent work being on Lazy-RCU. Please consider me for M[aintainer], so I can tell my wife why I spend a lot of my weekends and evenings on this complicated and mysterious thing — which is mostly in the hopes of preventing the world from burning down because everything runs on this one way or another”. RCU (Read-Copy-Update) is a notoriously difficult subsystem to understand yet it is a feature of certain modern Operating Systems that allows them to gain significant performance enhancements from the fundamental notion of having different views into the same data, based upon point-in-time producers and consumers that come and go. Joel later followed up with “Core RCU patches for 6.4”, including the shiny new MAINTAINERS change and several other fixes.
Separately, Paul McKenney (the original RCU author, and co-inventor) posted assorted updates to sleepable RCU (SRCU) reducing cache footprint and marking it non-optional in Kconfig (kernel build configuration), “courtesy of new-age printk() requirements”.
Mike Kravetz raised a concern about THP (Transparent Huge Page) “backed thread stacks”. In his mail, he cited a “product team” that had “recently experienced ‘memory bloat’ in their environment” due to the alignment of the allocations they had used for thread local stacks within the Java Virtual Machine (JVM) runtime. Mike questioned whether stacks should always be THP given that “Stacks by their very nature grow in somewhat unpredictable ways over time”. Most replies were along the lines that the JVM should alter how it does allocations to use the MADV_NOHUGEPAGE parameter to madvise when allocating space for thread stacks.
Carlos Llamas posted “Using page-fault handler in binder” about “trying to remove the current page handling in [Android’s userspace IPC] binder and switch to using ->fault() and other mm/ infrastructure”. He was seeking pointers and input on the direction from other developers.
Mike Rapoport posted a patch series that “move[s] core MM initialization to mm/mm_init.c”.
Randy Dunlap noted that uclinux.org was dead and requested references to it be removed from the Linux kernel MAINTAINERS file.
Jonathan Corbet (of LWN) posted various cleanups to the kernel documentation (which he maintains), including an “arch reorg” to clean up architecture specific docs.
Lukasz Luba posted “Introduce runtime modifiable Energy Model”, a patch set that “adds a new feature which allows to modify Energy Model (EM) power values at runtime. It will allow to better reflect power model of a recent SoCs and silicon. Different characteristics of the power usage can be leverages and thus better decisions made during task placement”. Thus, the kernel’s (CFS) scheduler can (with this patch) make a decision about where to schedule (place, or migrate) a running process (known as a task within the kernel) according to the power usage that the silicon knows will vary according to nature of the workload, and its use of hardware. For example, heavy GPU use will cause a GPU to heat up and alter a chip’s (SoC’s) thermal properties in a manner that may make it better to migrate other tasks to a different core.
Reports of Itanium’s demise may not have been greatly exaggerated, but when it comes to the kernel they may have been a little premature by a month or two. Florian Weimer followed up to “Retire IA64/Itanium support” with a question, “Is this still going ahead? In userspace, ia64 is of course full of special cases, too, so many of us really want to see it gone, but we can’t really start the removal process while there is still kernel support”.
Tianrui Zhao posted version 5 of “Add KVM LoongArch support”.
Huacai Chen posted a patch, “LoongArch: Make WriteCombine configurable for ioremap()” that aims to work around a PCIe protocol violation in the implementation of the LS7A chipset.
Separately, Huacai also posted a patch enabling the kernel itself to use FPU (Floating Point Unit) functions. Quoting the patch, “They can be used by some other kernel components, e.g. the AMDGPU graphic driver for DCN”.
WANG Xuerui posted “LongArch: Make bounds-checking instructions useful”, referring to “BCE” (Bounds Checking Error) instructions, similar to those of other architectures, such as x86_64.
Laurent Dufour posted “Online new threads according to the current SMT level”, which aims to balance a hotplugged CPU’s SMT level against the current one used by the overall system. For example, a system capable of SMT8 but booted in SMT4 will currently nonetheless online all 8 SMT threads of a subsequently added CPU, rather than only 4 (to match the system).
Evan Green posted the fourth version of “RISC-V Hardware Probing User Interface”, which aims to handle the number of (potentially incompatible) ISA extensions present in implementations of the RISC-V architecture. The basic idea is to provide a vDSO (virtual Dynamic Shared Object – a kind of library that appears in userspace and is fast to link against, but is owned by the kernel) and backing syscall (for fallback use by the vDSO in certain cases) that can quickly hand an application key/value pairs representative of potential ISA features present on a system. The previous attempts had experienced pushback, so this time Evan came with performance numbers showing the (many) orders of magnitude differences in performance between using a vDSO/syscall approach vs. the sysfs file interface originally counter proposed by Greg KH (Greg Kroah-Hartman). Greg had preferred an application perform many open calls to parse sysfs files in order to determine the capabilities of a system, but this would be expensive for every binary. This patch series was later merged by Palmer Dabbelt (the RISC-V kernel maintainer) and should therefore make its way into the Linux 6.4 kernel series in the next couple of months.
Sia Jee Heng posted version 5 of a patch series implementing hibernation support for RISC-V. According to the posting, “This series adds RISC-V Hibernation/suspend to disk support. Low level Arch functions were created to support hibernation”. The cover letter explains how e.g. swsusp_arch_resume “creates a temporary page table that [covering only] the linear map. It copies the restore code to a ‘safe’ page, then [start] restore the memory image”.
Heiko Stuebner posted “RISC-V: support some cryptography accelerations”. These rely on version 14 of a previous patch series adding experimental support for the “v” (vector) extension, which has not been ratified (made official) by the RISC-V International organization yet. And speaking of this, a recent discussion of the non-standard implementation of the RISC-V vector extension in the “T-Head C9xx” cores suggests describing those as an “errata” implementation.
The PINE64 project recently began shipping a RISC-V development board known as “Star64”. This board uses the StarFive JH7110 SoC for which Samin Guo recently posted an updated ethernet driver, apparently based on the DesignWare MAC from Synopsys. Separately, Walker Chen posted a DMA driver for the same SoC, and Mason Huo posted cpufreq support (which included enabling “the axp15060 pmic for the cpu power source”). Seems an effort is underway to upstream support for this low-cost “Raspberry Pi”-like alternative in the RISC-V ecosystem.
Greg Ungerer posted “riscv: support ELF format binaries in nommu mode” which does what it says on the tin: “add the ability to run ELF format binaries when running RISC-V in nommu mode. That support is actually part of the ELF-FDPIC loader, so these changes are all about making that work on RISC-V”. Greg notes, “These changes have not been used to run actual ELF-FDPIC binaries. It is used to load and run normal ELF – compiled -pie format. Though the underlying changes are expected to work with full ELF-FDPIC binaries if or when that is supported on RISC-V in gcc”.
Anup Patel posted version 18 of “RISC-V IPI Improvements” which aims to teach RISC-V (on suitable hardware) how to use “normal per-CPU interrupts” to send IPIs (Inter-Processor Interrupts), as well as remote TLB (Translation Lookaside Buffer) flushes and cache maintenance operations without having to resort to calls into “M” mode firmware.
Rick Edgecombe posted version 8 of “Shadow stacks for userspace”, to which Borislav Petkov replied “Yes, finally! That was loooong in the making. Thanks for the persistence and patience”. He signed off as having reviewed the patches.
Ian Rogers posted “Event updates for GNR, MTL and SKL”. Apparently these perf events are generated automatically using a script on Intel’s github (that’s pretty sweet).
Usama Arif posted version 15 of “Parallel CPU bringup for x86_64”. This is about doing parallel calls to INIT/SIPI/SIPI (the initialization sequences used by x86 CPUs to bring them up) rather than the single threaded process that previously was used by the Linux kernel.
Tony Luck posted version 2 of “Handle corrected machine check interrupt storms”, which includes additional patches from Smita Koralahalli that “Extend the logic of handling Intel’s corrected machine check interrupt storms to AMD’s threshold interrupts”.
Yi Liu posted “iommu: Add nested domain support”, which “Introduce[s] a new domain type for a user space I/O address, which is nested on top of another address space address represented by a UNMANAGED domain”.
Kirill A. Shutemov posted version 16 of “Linear Address Masking enabling”. As he noted, “(LAM) modifies the checking that is applied to 64-bit linear addresses, allowing software to use of the untranslated address bits for metadata. The capability can be used for efficient address sanitizers (ASAN) implementation and for optimizations in JITs and virtual machines”. It’s also been present in architectures such as Arm for many, many years as TBI (Top Byte Ignore), etc.
Kuppuswamy Sathyanarayanan posted “TDX Guest Quote generation support”, which enables “TDX” (Trusted Domain Extensions – aka Confidential Compute) guests to attest to their “trustworthiness to other entities before provisioning secrets to the guest”. The patch describes a two step process including a “TDREPORT generation” and a “Quote generation”. The report captures measurements while the report is sent to a “Quoting Enclave” (QE) that generates a “remotely verifiable Quote”. A special conduit is provided for guests to send these quotes.
Shan Kang posted some benchmark results from KVM for Intel’s “FRED” (Flexible Return and Event Delivery) new syscall/sysenter enhanced architecture.
Mario Limonciello posted “Add vendor agnostic mechanism to report hardware sleep”, noting that “An import part of validating that S0ix [an SoC level idle power state] worked properly is to check how much of a cycle was spent in a hardware sleep state”.