What is PID 0?
I get nerd-sniped a lot. People offhandedly ask something innocent, and I lose the next several hours (or in this case, days) comprehensively figuring out the answer. Usually this ends up in a rant thread on mastodon or in some private chat group or other. But for once I have the energy to write one up for the blog.
Today’s innocent question:
Is there a reason UIDs start at 0 but PIDs start at 1?
The very short version: Unix PIDs do start at 0! PID 0 just isn’t shown to userspace through traditional APIs. PID 0 starts the kernel, then retires to a quiet life of helping a bit with process scheduling and power management. Also the entire web is mostly wrong about PID 0, because of one sentence on Wikipedia from 16 years ago.
There’s a slightly longer short version right at the end, or you can stick with me for the extremely long middle bit!
But surely you could just google what PID 0 is, right? Why am I even publishing this?
The internet is wrong
At time of writing, if you go ask the web about PID 0, you’ll get a mix of incorrect and misleading information, and almost no correct answers.
After figuring out the truth, I asked Google, Bing, DuckDuckGo and Kagi what PID 0 is on linux. I looked through the top 20 results for each, as well as whatever knowledge boxes and AI word salads they organically gave me. That’s 2 pages of results on Google, for reference.
All of them failed to produce a fully correct answer. Most had a single partially correct answer somewhere in the first 20 results, but never near the top or showcased. DDG did best, with the partially correct answer at number 4. Google did the worst, no correct answer at all. And in any case, the incorrect answers were so prevalent and consistent with each other that you wouldn’t believe the one correct site anyway.
The top two results on all engines were identical, interestingly: a stackoverflow answer that is wrong, and a spammy-looking site that seems to have embraced LLM slop, because partway through failing to explain PID 0 it randomly shifts to talking about PID loops, from control system theory, before snapping out of it a paragraph later and going back to Unix PIDs.
Going directly to the source of the LLM slop fared slightly better, on account of them having stolen from books as well as the web, but they still make shit up in the usual amount. I was able to get a correct answer though, using the classic prompting technique of already knowing the answer and retrying until I got good RNG.
If we set aside the few entirely wrong answers (“there is no PID 0”, “it launches init then exits”, “it’s part of systemd”, “it’s the entire kernel”, “it spins in an infinite loop and nothing else”), the most common answer follows a single theme: PID 0 has something to do with paging, swap space, or virtual memory management in some way.
This theme comes straight from, where else? Wikipedia’s article on PIDs, which said:
There are two tasks with specially distinguished process IDs: swapper or sched has process ID 0 and is responsible for paging, and is actually part of the kernel rather than a normal user-mode process. Process ID 1 is usually the init process primarily responsible for starting and shutting down the system.
That text has been on Wikipedia for 16 years, and in that time has been quoted, paraphrased and distorted across the web to the point that it’s displaced the truth. It’s a pretty funny dynamic, and also a bit sad, given the source code for Linux and the BSDs is right there, you can just check.
(Later note: after I published this, someone went and updated the article to have the correct information. The link above takes you to the old version so that the rest of this explanation still makes sense, but at time of writing this update the current version of the PID article is accurate)
To explain why Wikipedia was inaccurate here, we need to take a little history lesson.
The history of PID 0 in Unix
As I said in the opening TLDR, PID 0 does some scheduling and power management, and no paging. It’s what the scheduler runs when it has nothing else for a CPU core to do.
The exact implementation obviously varies across kernels and versions, but all the ones I inspected follow the same broad pattern: when PID 0 gets to run, it tries to find something else that could run in its place. Failing that, it puts the current CPU core to sleep until something else wakes it back up, and then loops around and starts over.
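If that description feels too abstract, here’s roughly the shape of it as a toy userspace C sketch. To be clear, this is my own illustration of the pattern, not any kernel’s actual code, and the helper functions are hypothetical stand-ins.

/* Toy sketch of the idle-task pattern, in userspace C. The helpers here are
 * made up for illustration; real kernels use scheduler internals and an
 * architecture-specific "halt until interrupt" instruction instead. */
#include <stdbool.h>
#include <unistd.h>

static bool something_else_is_runnable(void) { return false; } /* hypothetical stub */
static void sleep_until_woken(void)          { pause(); }      /* stand-in for halting the core */
static void hand_core_to_scheduler(void)     { }               /* hypothetical stub */

static void idle_loop(void)
{
    for (;;) {
        /* Sleep until an interrupt (timer, device, another core) wakes us. */
        while (!something_else_is_runnable())
            sleep_until_woken();
        /* Something is runnable: let the scheduler put the core to work.
         * When the runqueue drains again, control comes back here. */
        hand_core_to_scheduler();
    }
}

int main(void) { idle_loop(); }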
Don’t take my word for it. Here’s do_idle in the Linux kernel, which is called in an infinite loop by PID 0. nohz_run_idle_balance tries to find alternate work. The while loop puts the core to sleep. After wakeup, schedule_idle lets the scheduler take over and put the core to work again.
But maybe that’s just Linux, I hear you say. Okay, here’s sched_idletd in the FreeBSD kernel. tdq_idled tries to steal runnable tasks from another core. Failing that, cpu_idle puts the core to sleep. Rinse, repeat.
Okay, sure, but these are modern kernels; maybe it was different in the olden days? Okay, how about sched in 4.3BSD, from the summer of 1986? Computers are getting smaller and OSes more compact, so the scheduler and idle loop are now smushed into one routine. It tries to find something to schedule, and failing that sleeps until an external event wakes it back up.
Incidentally, this is the origin of the vague allegation that PID 0 is sometimes called “sched”: in earlier Unixes, the function that implements PID 0 is literally called sched.
Still not sure? Maybe it’s just a weird BSD thing that leaked into Linux?
Okay fine, here’s sched in Unix V4, the first known version of the Unix kernel written in C. Again the scheduler and idle loop are firmly intertwined, and there’s also some PDP-11 esoterics that are confusing to modern eyes, but the same bones are there: find a runnable process and switch to it, or idle and then try again.
You could go further back. The source code for Unix V1 is out there, as well as an early prototype on PDP-7. However, it’s all in PDP assembler, uses some mnemonics that don’t seem to be listed in the surviving assembler references I could find, and the kernel’s structured a fair bit differently from the C version.
That said, if you want to go digging, I believe the swap routine is the meat of the scheduler. And finally we get a clear idea of the root of the Wikipedia claim: in the earliest Unix implementation, the scheduler was sometimes nicknamed the “swapper.”
It was called that because, now that we’re back at the beginning of Unix, one routine encompasses not only scheduling and idling, but also moving entire process memory images between the small core memory and secondary storage. Hard drives, in this case: references in the kernel code, as well as the Computer History wiki, confirm that Bell Labs’s PDP-11 at the time ran an RS11 disk for the core OS and process swapping, and an RK03 for the user filesystem.
(Sidebar! This is where the / vs. /usr split comes from. /usr was the part of early Unix stored on the RK03 disk, whereas the smaller root filesystem was on the RS11. Unless you’re still running on a PDP-11 with single RS11 and RK03 disks, a split /usr is vestigial and causes a variety of problems in early boot)
So now the history is hopefully fairly clear. In the first Unix the world at large saw (Unix V5), entry zero in the process table initialized the kernel, then looped in the sched function, defined in slp.c. Those two names clearly telegraph the loop’s primary functions. However, the scheduling algorithm is quite simple at this point, and so almost all of sched’s code is concerned with swapping process images in and out of core memory in order to make scheduling happen. Thinking of this function as the “swapper” is reasonable, even if the original source code never uses that name.
This essential structure survives to this day, with a lot more complications. Whole-process swapping gave way to demand paging, and so PID 0 stopped concerning itself with even a little memory management. As both the scheduling algorithms and the mechanics of idling a CPU became more complex, scheduling and idling were split out into separate pieces of code, and you end up with what we’ve had for at least two decades: the function implementing PID 0 has sched or idle in its name, and has a supporting role in doing those two things.
Going back to the Wikipedia article, it seems the author of that edit wanted to write “swapping”, in the classic Unix V5 sense of swapping out whole processes as a consequence of scheduling. But the edit didn’t clarify that “swapping” was being used in an archaic sense that was likely to confuse the modern reader. Furthermore, the edit wrote “paging” rather than “swapping”. I don’t know why, but my guess is that it’s because the canonical article for this general memory management concept is titled “Memory paging”, whereas “swapping” is a disambiguation page. In the moment of making the edit, I could definitely see myself swapping one word out for the seemingly preferred term.
Unfortunately, in this particular context, replacing “swapping” with “paging” makes the sentence incorrect. And there it sat for 16 years, slowly leaking into the rest of the web as people quoted wikipedia at each other and paraphrased or elaborated further in the wrong direction.
Okay, end of rant about how the web is turning to ash in our hands. It’d be nice if it didn’t, or at least it’d be nice if half the industry wasn’t breathlessly building ways to spray more petrol on what’s left. So it goes. Back to PID 0 now.
Are those functions really PID 0?
Above, I claim by fiat that the functions I’m linking to are PID 0. Tracing all of them would take a lot more words, but I’ll demonstrate the point on Linux and leave you to trace the others. I encourage you to do so! It’s remarkable how similar to each other different kernels are in this area, both across current OSes and over time. They’ve become more complex, but the family tree is still evident.
Disclaimer: the Linux kernel is a very complex beast. I’m not going to walk through every single thing the kernel does before reaching do_idle. Think of this as signposts to help orient you, not a comprehensive breakdown. This was written using the 6.9 kernel source code, so if you’re visiting from the future: hello! I hope your dilithium matrix is cycling well, and things may have changed.
We begin! The bootloader jumps to the first instruction of kernel code. The first few steps from here are extremely specific to the CPU architecture and nearby chipset hardware. I’m going to skip that and begin at start_kernel, where the machine has been set up to a common baseline and architecture-independent kernel code takes over (albeit still assisted by arch-specific helpers).
At this point, start_kernel is the only thing running on the machine (yes I know about ring minus 1 and SMM and so on, I said I was simplifying). On multicore systems, the bootloader/firmware/hardware arranges for a single CPU core to be running, called the bootstrap core. That single thread of execution is what we’re looking at, and it’s all we get until the kernel starts the other cores itself.
The first thing to get called is set_task_stack_end_magic(&init_task). Well that looks relevant! It’s a very simple function that writes a magic number to the top of init_task’s stack space to detect overflows. init_task is statically defined in init_task.c, and the leading comment tells us it’s the first task.
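If the stack-end-magic trick isn’t familiar, here’s a toy userspace version of the idea, with a made-up sentinel value. It’s a sketch of the concept only, not the kernel’s implementation.

/* Toy illustration of stack-end magic: write a known value at the far end of
 * a stack area, and check later whether anything overwrote it. The sentinel
 * value here is arbitrary, picked for this demo. */
#include <stdio.h>

#define STACK_WORDS 1024
#define STACK_END_MAGIC 0xDEADBEEFUL  /* arbitrary sentinel for this demo */

static unsigned long fake_stack[STACK_WORDS];

int main(void)
{
    /* The stack grows down, so its "end" is the lowest address. */
    fake_stack[0] = STACK_END_MAGIC;

    /* ... imagine the stack being used here ... */

    if (fake_stack[0] != STACK_END_MAGIC)
        puts("stack overran its end!");
    else
        puts("stack end magic still intact");
    return 0;
}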
What’s a task, though?
task_struct, PIDs, TIDs, TGIDs, and oh no
Here we have to take a detour into something very confusing: the Linux kernel and its userspace disagree on the meaning of PID.
In the kernel, the unit of running things is the task_struct. It represents one thread of execution, rather than a whole process. To the kernel, a PID identifies a task, not a process. task_struct.pid is the identifier for that one thread only.
The kernel still needs to represent the concept of a userspace process somehow, but it’s not a nice crunchy data structure you can point at. Instead, threads are collected into “thread groups”, and groups are identified by a thread group identifier, or TGID. Userspace calls thread groups processes, and thus the kernel TGID is called the PID in userspace.
To add confusion, these numbers are often the same. When a new thread group is created (e.g. when userspace runs fork()), the new thread is given a new thread ID, and that ID also becomes the new group’s TGID. So for single-threaded processes, kernel TID and TGID are identical, and asking either the kernel or userspace what this thing’s “PID” is would give you the same number. But that equivalence breaks once you spawn more threads: the new thread gets its own thread ID (which is what the kernel calls a PID), but inherits its parent’s thread group ID (which userspace calls a PID).
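You can watch this split from userspace. Here’s a small demo of my own (not anything from the kernel tree): it prints the kernel’s per-thread ID via the gettid system call and the userspace PID via getpid, first from the initial thread and then from a second one. On the initial thread the two numbers match; on the second thread only the PID does.

/* Quick demo of the kernel's "PID" (a per-thread ID) vs the userspace PID
 * (the thread group ID). Build with: gcc -pthread tids.c -o tids */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void print_ids(const char *who)
{
    /* SYS_gettid returns what the kernel calls this task's PID;
     * getpid() returns the thread group ID, userspace's idea of a PID. */
    printf("%s: tid=%ld, pid=%ld\n", who,
           (long)syscall(SYS_gettid), (long)getpid());
}

static void *thread_main(void *arg)
{
    (void)arg;
    print_ids("second thread");   /* tid differs from pid here */
    return NULL;
}

int main(void)
{
    pthread_t t;
    print_ids("initial thread");  /* tid == pid here */
    pthread_create(&t, NULL, thread_main, NULL);
    pthread_join(t, NULL);
    return 0;
}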
To add even more confusion, the arrival of containers forces threads and processes to have multiple identities. The thing that’s PID 1 in a Docker container is very much not the same as PID 1 outside the container. This is tracked in a separate pid struct, which keeps track of the different thread IDs a task_struct has, depending on which PID namespace is asking.
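You can poke at this from userspace too. Here’s a small sketch of my own that needs root (or CAP_SYS_ADMIN): the parent unshares into a new PID namespace, so its first child is PID 1 inside that namespace, while the parent still knows it by an ordinary PID in the original namespace.

/* Small demo of one process having different PIDs in different PID
 * namespaces. Needs root (or CAP_SYS_ADMIN). Build: gcc newpid.c -o newpid */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* After this, children of this process are created in a fresh PID namespace. */
    if (unshare(CLONE_NEWPID) != 0) {
        perror("unshare (are you root?)");
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        /* Inside the new namespace, this process is PID 1... */
        printf("child: I think my pid is %d\n", (int)getpid());
        return 0;
    }
    /* ...but the parent's namespace knows it by a perfectly normal PID. */
    printf("parent: the child's pid here is %d\n", (int)child);
    waitpid(child, NULL, 0);
    return 0;
}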
I’m a userspace enjoyer by day, so when I started this rabbithole I interpreted “PID 0” in the question as an analog to the PID 1 I know, that /bin/init thing. But now the question is ambiguous! PID 0 could mean thread 0, or it could mean thread group 0.
At the beginning of the kernel, the answer is fortunately easy: init_task represents PID 0 by everyone’s definition. It’s the thread with ID 0 (which is the PID according to the kernel), it’s the only thread in the group with ID 0 (which is the PID according to userspace), and no child PID namespaces exist yet, so there are no other numbers for init_task to be.
This will get muddier later on because thread group 0 is going to grow more threads, so in userspace terms we’ll have a PID 0 process that contains several threads, one of which has TID 0.
In the rest of this post I’m going to try and say “task” or “thread” to mean a single thread of execution, the thing described by a task_struct; and “thread group” for the thing userspace would call a process. But it’s not just you, it’s terribly confusing.
Erratum: some folks pointed out that I got two details above wrong! I’m grateful for the corrections, which are as follows.
It’s true in general that TIDs and TGIDs are sometimes the same as described above, but it’s possible to construct a fresh userspace process in which the single initial thread has a TID that doesn’t match its TGID. If you execve() in a multithreaded process from any thread other than the initial thread, the kernel will kill all other threads, and make the exec-ing thread the leader of the thread group. The TID of the thread doesn’t change, and so the new process will execute on a thread whose TID doesn’t match its TGID.
The second error is more specific to this post’s topic: on Linux, all threads within thread group 0 have thread ID 0! It’s explicitly special-cased in a few places, and as far as I can tell is the only place in the kernel where multiple definitely different threads have the exact same identity.
Okay, back to the code walk…
The path to the idle task
So, we know init_task is the PID 0 we’re looking for, albeit now it’s actually two different PID 0s at the same time because it’s the thread with ID 0 within the thread group with ID 0. How do we know that init_task describes the currently-executing CPU context?
There are a few things. We know we’re the only thread of execution currently happening, and init_task is described as the first task, aka the first thread. That sounds like us. It’s using init_stack as its stack, which is the stack we’re currently using (proving this requires digging into arch-specific code and gcc linker scripts, so I’m going to skip it, but have fun!). Its __state is TASK_RUNNING, which means it’s either running right now, or it’s runnable and waiting for CPU time. The kernel scheduler isn’t initialized yet, so there can’t really be any other runnable task at this point. This could be a setup for an elaborate trolling, but the evidence suggests that this init_task is us. And spoiler: we’re not being trolled, init_task is indeed the initial thread that executes start_kernel.
At this point a lot of early kernel initialization happens. We can skip over all that for our purposes, and pick up at the call to sched_init. This function does basic initialization of the CPU scheduler’s data structures. A lot happens here, because the scheduler is a large beast; we’ll just peek at a couple of relevant lines:
/*
* The idle task doesn't need the kthread struct to
* function, but it is dressed up as a per-CPU
* kthread and thus needs to play the part if we want
* to avoid special-casing it in code that deals with
* per-CPU kthreads.
*/
WARN_ON(!set_kthread_struct(current));
/*
* Make us the idle thread. Technically, schedule()
* should not be called from this thread, however
* somewhere below it might be, but because we are the
* idle thread, we just pick up running again when this
* runqueue becomes "idle".
*/
init_idle(current, smp_processor_id());
The first line describes the currently executing thread as “the idle task,” and mentions that it’s a special kernel thread: most kernel threads are run by kthreadd, which is task 2 and doesn’t exist yet. If you’re on Linux, ps ax | grep kthreadd will show that kthreadd is PID 2 in userspace, in addition to also being thread/task ID 2 in the kernel.
The second line explicitly tells the scheduler that the currently running thread is the “idle thread” for the bootstrap CPU core. current is a pointer to the currently-running task_struct, which at this point in execution points to init_task. The implementation of current is another very architecture-specific piece of code, so I’m going to encourage you to go poke at it if curious, and move right along.
Going back to start_kernel, the remaining initialization code doesn’t concern us, so we can skip straight to the call to rest_init. This function is short and sweet: it spawns task 1, which will become the init process in userspace; and task 2 for kthreadd, which manages all future kernel threads.
We’ll be following the life of task 1, and although it will someday become PID 1 in userspace, to start it’ll run kernel_init. Not yet though. These new tasks exist and are known to the scheduler, but they’re not running yet because we haven’t asked the scheduler to do its thing yet. (caveat: in some kernel configurations, the scheduler may get a chance to switch to task 1 and 2 sooner than what I’m about to describe, but these first tasks are orchestrated such that the outcome is nearly identical.)
Finally, rest_init calls cpu_startup_entry, which goes into an infinite loop of calling do_idle. And here we are, we’ve become the idle task on the bootstrap CPU core. On the first iteration, we don’t put the CPU to sleep because there are other runnable tasks (the two we just made). So we drop to the bottom of do_idle, and go into schedule_idle. The scheduler finally gets to run, and we switch away from task 0. kthreadd in task 2 isn’t terribly interesting, it does a little initialization then yields the CPU again until something else asks to create kernel threads. Let’s follow task 1 instead, it’s much more fun.
Task 1 starts at kernel_init. This does even more kernel initialization, including bringing up all device drivers and mounting either the initramfs or the final root filesystem. And then, at last, it calls run_init_process to drop out of kernel mode and execute userspace’s init program. If init(1) asks the kernel who it is, it’ll be told that it is thread 1, which is part of thread group 1. Or thread 1 in PID 1, in the conventional userspace vocabulary.
It was a surprise to me that task/PID 1 does a whole bunch of kernel work before it morphs into the familiar userspace process! A large chunk of what I think of as the kernel booting technically happens in PID 1, albeit in a very different-looking universe to init(1) in userspace. Why not do those bits in task 0, like the earlier bits of init?
PID 0 in multicore systems
If you’ve been following carefully so far, you may be wondering about the other CPU cores. So far we’ve run entirely single-threaded, and when we initialized the scheduler we explicitly told it to pin task 0 to the bootstrap core. When does that change?
The answer is, in task 1! The first thing kernel_init does is start up all other CPU cores. This means the bulk of the boot process that happens in kernel_init can make use of all available CPU power, rather than being stuck on a single thread. Starting CPU cores is quite intricate, but the exciting bit for our purposes is the call to smp_init. In turn, it calls fork_idle for each non-bootstrap core, creating a new idle thread and pinning it to that core.
This is where the “PID 0” term gets muddy, because these new idle tasks have non-zero thread IDs, but they are still part of thread group 0. So, in userspace parlance, PID 0 is a process that contains one pinned thread per core, with thread 0 pinned to the bootstrap core.
Erratum: some folks pointed out that the above paragraph is wrong! I’m grateful for the correction, which is as follows. As mentioned in the erratum earlier in the post, idle tasks are special-cased, and all idle threads across all cores share the same identity: thread ID 0, and thread group ID 0. This happens in a few separate places in code because of the different sets of fields that record TID and TGID, but it’s all within the fork_idle call.
First, fork_idle calls the common copy_process function to make a new task as a copy of the currently running task. Normally this would allocate a new TID for the new task. However, there is a special case that skips allocation of a new struct pid if the caller signals that it’s making an idle task. Then, fork_idle calls init_idle_pids, which further explicitly resets all the task’s identifiers to match init_struct_pid, which is the identity of init_task. As a result, every idle task on every CPU core shares an identity with the init_task we’ve followed through early kernel boot, and they all have PID 0 under both the kernel and userspace’s definition of a PID.
After that, smp_init runs bringup_nonboot_cpus, which does architecture-specific incantations to wake up the cores. As each core starts, it does a bit of arch-specific setup to make itself presentable, then runs cpu_startup_entry and do_idle, just like the bootstrap core did with task 0. All CPU cores are now alive and can run tasks, and kernel_init proceeds with the rest of boot.
I’m bad at conclusions
And that’s it! To summarize:
PID 0 does exist: it’s the one thread that starts the kernel, provided by the bootstrap CPU core.
PID 0 runs early kernel initialization, then becomes the bootstrap CPU core’s idle task, and plays a minor supporting role in scheduling and power management.
PID 0 has done this, with different degrees of fanciness but the same broad strokes, since the first Unix kernels. You can go read the source code of many of them and see for yourself! That’s cool.
PID 0 has nothing to do with memory management today. In early Unix kernels it did some incidental memory management as part of process scheduling, but it stopped doing that many decades ago.
On Linux, “PID” is ambiguous because userspace and the kernel use “PID” to refer to different values: the TID for the kernel, and the TGID for userspace. The kernel’s definition wins in practice for PID 0, because none of the entities that make up PID 0 are visible to userspace through the traditional Unix APIs.
On multicore Linux systems, every CPU core gets an idle thread. All those idle threads are part of thread group 0, which userspace would call PID 0. They are also a special case in the kernel, and all share the single thread ID 0.
Seemingly all Q&A websites on the internet function primarily by paraphrasing Wikipedia. This is made evident and awkward when Wikipedia accidentally makes the web repeat incorrect information for 16 years.
This conclusion used to say that I now need to figure out how to submit an edit to wikipedia while complying with the various policies on self-promotion, sourcing, primary research and so on… But it looks like someone’s already made an edit, and provided additional sources for the important modifications. Thanks, mysterious benefactor!
Thanks for joining me on this chronicling of how I end up going on very large sidequests when presented with short, odd questions.