Linux Audio

Check our new training course

Loading...
Note: File does not exist in v5.4.
   1.. This file is dual-licensed: you can use it either under the terms
   2.. of the GPL 2.0 or the GFDL 1.2 license, at your option. Note that this
   3.. dual licensing only applies to this file, and not this project as a
   4.. whole.
   5..
   6.. a) This file is free software; you can redistribute it and/or
   7..    modify it under the terms of the GNU General Public License as
   8..    published by the Free Software Foundation version 2 of
   9..    the License.
  10..
  11..    This file is distributed in the hope that it will be useful,
  12..    but WITHOUT ANY WARRANTY; without even the implied warranty of
  13..    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  14..    GNU General Public License for more details.
  15..
  16.. Or, alternatively,
  17..
  18.. b) Permission is granted to copy, distribute and/or modify this
  19..    document under the terms of the GNU Free Documentation License,
  20..    Version 1.2 version published by the Free Software
  21..    Foundation, with no Invariant Sections, no Front-Cover Texts
  22..    and no Back-Cover Texts. A copy of the license is included at
  23..    Documentation/userspace-api/media/fdl-appendix.rst.
  24..
  25.. TODO: replace it to GPL-2.0 OR GFDL-1.2 WITH no-invariant-sections
  26
  27===========================
  28Lockless Ring Buffer Design
  29===========================
  30
  31Copyright 2009 Red Hat Inc.
  32
  33:Author:   Steven Rostedt <srostedt@redhat.com>
  34:License:  The GNU Free Documentation License, Version 1.2
  35           (dual licensed under the GPL v2)
  36:Reviewers:  Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
  37	     and Frederic Weisbecker.
  38
  39
  40Written for: 2.6.31
  41
  42Terminology used in this Document
  43---------------------------------
  44
  45tail
  46	- where new writes happen in the ring buffer.
  47
  48head
  49	- where new reads happen in the ring buffer.
  50
  51producer
  52	- the task that writes into the ring buffer (same as writer)
  53
  54writer
  55	- same as producer
  56
  57consumer
  58	- the task that reads from the buffer (same as reader)
  59
  60reader
  61	- same as consumer.
  62
  63reader_page
  64	- A page outside the ring buffer used solely (for the most part)
  65	  by the reader.
  66
  67head_page
  68	- a pointer to the page that the reader will use next
  69
  70tail_page
  71	- a pointer to the page that will be written to next
  72
  73commit_page
  74	- a pointer to the page with the last finished non-nested write.
  75
  76cmpxchg
  77	- hardware-assisted atomic transaction that performs the following::
  78
  79	    A = B if previous A == C
  80
  81	    R = cmpxchg(A, C, B) is saying that we replace A with B if and only
  82		if current A is equal to C, and we put the old (current)
  83		A into R
  84
  85	    R gets the previous A regardless if A is updated with B or not.
  86
  87	  To see if the update was successful a compare of ``R == C``
  88	  may be used.
  89
  90The Generic Ring Buffer
  91-----------------------
  92
  93The ring buffer can be used in either an overwrite mode or in
  94producer/consumer mode.
  95
  96Producer/consumer mode is where if the producer were to fill up the
  97buffer before the consumer could free up anything, the producer
  98will stop writing to the buffer. This will lose most recent events.
  99
 100Overwrite mode is where if the producer were to fill up the buffer
 101before the consumer could free up anything, the producer will
 102overwrite the older data. This will lose the oldest events.
 103
 104No two writers can write at the same time (on the same per-cpu buffer),
 105but a writer may interrupt another writer, but it must finish writing
 106before the previous writer may continue. This is very important to the
 107algorithm. The writers act like a "stack". The way interrupts works
 108enforces this behavior::
 109
 110
 111  writer1 start
 112     <preempted> writer2 start
 113         <preempted> writer3 start
 114                     writer3 finishes
 115                 writer2 finishes
 116  writer1 finishes
 117
 118This is very much like a writer being preempted by an interrupt and
 119the interrupt doing a write as well.
 120
 121Readers can happen at any time. But no two readers may run at the
 122same time, nor can a reader preempt/interrupt another reader. A reader
 123cannot preempt/interrupt a writer, but it may read/consume from the
 124buffer at the same time as a writer is writing, but the reader must be
 125on another processor to do so. A reader may read on its own processor
 126and can be preempted by a writer.
 127
 128A writer can preempt a reader, but a reader cannot preempt a writer.
 129But a reader can read the buffer at the same time (on another processor)
 130as a writer.
 131
 132The ring buffer is made up of a list of pages held together by a linked list.
 133
 134At initialization a reader page is allocated for the reader that is not
 135part of the ring buffer.
 136
 137The head_page, tail_page and commit_page are all initialized to point
 138to the same page.
 139
 140The reader page is initialized to have its next pointer pointing to
 141the head page, and its previous pointer pointing to a page before
 142the head page.
 143
 144The reader has its own page to use. At start up time, this page is
 145allocated but is not attached to the list. When the reader wants
 146to read from the buffer, if its page is empty (like it is on start-up),
 147it will swap its page with the head_page. The old reader page will
 148become part of the ring buffer and the head_page will be removed.
 149The page after the inserted page (old reader_page) will become the
 150new head page.
 151
 152Once the new page is given to the reader, the reader could do what
 153it wants with it, as long as a writer has left that page.
 154
 155A sample of how the reader page is swapped: Note this does not
 156show the head page in the buffer, it is for demonstrating a swap
 157only.
 158
 159::
 160
 161  +------+
 162  |reader|          RING BUFFER
 163  |page  |
 164  +------+
 165                  +---+   +---+   +---+
 166                  |   |-->|   |-->|   |
 167                  |   |<--|   |<--|   |
 168                  +---+   +---+   +---+
 169                   ^ |             ^ |
 170                   | +-------------+ |
 171                   +-----------------+
 172
 173
 174  +------+
 175  |reader|          RING BUFFER
 176  |page  |-------------------+
 177  +------+                   v
 178    |             +---+   +---+   +---+
 179    |             |   |-->|   |-->|   |
 180    |             |   |<--|   |<--|   |<-+
 181    |             +---+   +---+   +---+  |
 182    |              ^ |             ^ |   |
 183    |              | +-------------+ |   |
 184    |              +-----------------+   |
 185    +------------------------------------+
 186
 187  +------+
 188  |reader|          RING BUFFER
 189  |page  |-------------------+
 190  +------+ <---------------+ v
 191    |  ^          +---+   +---+   +---+
 192    |  |          |   |-->|   |-->|   |
 193    |  |          |   |   |   |<--|   |<-+
 194    |  |          +---+   +---+   +---+  |
 195    |  |             |             ^ |   |
 196    |  |             +-------------+ |   |
 197    |  +-----------------------------+   |
 198    +------------------------------------+
 199
 200  +------+
 201  |buffer|          RING BUFFER
 202  |page  |-------------------+
 203  +------+ <---------------+ v
 204    |  ^          +---+   +---+   +---+
 205    |  |          |   |   |   |-->|   |
 206    |  |  New     |   |   |   |<--|   |<-+
 207    |  | Reader   +---+   +---+   +---+  |
 208    |  |  page ----^                 |   |
 209    |  |                             |   |
 210    |  +-----------------------------+   |
 211    +------------------------------------+
 212
 213
 214
 215It is possible that the page swapped is the commit page and the tail page,
 216if what is in the ring buffer is less than what is held in a buffer page.
 217
 218::
 219
 220            reader page    commit page   tail page
 221                |              |             |
 222                v              |             |
 223               +---+           |             |
 224               |   |<----------+             |
 225               |   |<------------------------+
 226               |   |------+
 227               +---+      |
 228                          |
 229                          v
 230      +---+    +---+    +---+    +---+
 231  <---|   |--->|   |--->|   |--->|   |--->
 232  --->|   |<---|   |<---|   |<---|   |<---
 233      +---+    +---+    +---+    +---+
 234
 235This case is still valid for this algorithm.
 236When the writer leaves the page, it simply goes into the ring buffer
 237since the reader page still points to the next location in the ring
 238buffer.
 239
 240
 241The main pointers:
 242
 243  reader page
 244	    - The page used solely by the reader and is not part
 245              of the ring buffer (may be swapped in)
 246
 247  head page
 248	    - the next page in the ring buffer that will be swapped
 249              with the reader page.
 250
 251  tail page
 252	    - the page where the next write will take place.
 253
 254  commit page
 255	    - the page that last finished a write.
 256
 257The commit page only is updated by the outermost writer in the
 258writer stack. A writer that preempts another writer will not move the
 259commit page.
 260
 261When data is written into the ring buffer, a position is reserved
 262in the ring buffer and passed back to the writer. When the writer
 263is finished writing data into that position, it commits the write.
 264
 265Another write (or a read) may take place at anytime during this
 266transaction. If another write happens it must finish before continuing
 267with the previous write.
 268
 269
 270   Write reserve::
 271
 272       Buffer page
 273      +---------+
 274      |written  |
 275      +---------+  <--- given back to writer (current commit)
 276      |reserved |
 277      +---------+ <--- tail pointer
 278      | empty   |
 279      +---------+
 280
 281   Write commit::
 282
 283       Buffer page
 284      +---------+
 285      |written  |
 286      +---------+
 287      |written  |
 288      +---------+  <--- next position for write (current commit)
 289      | empty   |
 290      +---------+
 291
 292
 293 If a write happens after the first reserve::
 294
 295       Buffer page
 296      +---------+
 297      |written  |
 298      +---------+  <-- current commit
 299      |reserved |
 300      +---------+  <--- given back to second writer
 301      |reserved |
 302      +---------+ <--- tail pointer
 303
 304  After second writer commits::
 305
 306
 307       Buffer page
 308      +---------+
 309      |written  |
 310      +---------+  <--(last full commit)
 311      |reserved |
 312      +---------+
 313      |pending  |
 314      |commit   |
 315      +---------+ <--- tail pointer
 316
 317  When the first writer commits::
 318
 319       Buffer page
 320      +---------+
 321      |written  |
 322      +---------+
 323      |written  |
 324      +---------+
 325      |written  |
 326      +---------+  <--(last full commit and tail pointer)
 327
 328
 329The commit pointer points to the last write location that was
 330committed without preempting another write. When a write that
 331preempted another write is committed, it only becomes a pending commit
 332and will not be a full commit until all writes have been committed.
 333
 334The commit page points to the page that has the last full commit.
 335The tail page points to the page with the last write (before
 336committing).
 337
 338The tail page is always equal to or after the commit page. It may
 339be several pages ahead. If the tail page catches up to the commit
 340page then no more writes may take place (regardless of the mode
 341of the ring buffer: overwrite and produce/consumer).
 342
 343The order of pages is::
 344
 345 head page
 346 commit page
 347 tail page
 348
 349Possible scenario::
 350
 351                               tail page
 352    head page         commit page  |
 353        |                 |        |
 354        v                 v        v
 355      +---+    +---+    +---+    +---+
 356  <---|   |--->|   |--->|   |--->|   |--->
 357  --->|   |<---|   |<---|   |<---|   |<---
 358      +---+    +---+    +---+    +---+
 359
 360There is a special case that the head page is after either the commit page
 361and possibly the tail page. That is when the commit (and tail) page has been
 362swapped with the reader page. This is because the head page is always
 363part of the ring buffer, but the reader page is not. Whenever there
 364has been less than a full page that has been committed inside the ring buffer,
 365and a reader swaps out a page, it will be swapping out the commit page.
 366
 367::
 368
 369            reader page    commit page   tail page
 370                |              |             |
 371                v              |             |
 372               +---+           |             |
 373               |   |<----------+             |
 374               |   |<------------------------+
 375               |   |------+
 376               +---+      |
 377                          |
 378                          v
 379      +---+    +---+    +---+    +---+
 380  <---|   |--->|   |--->|   |--->|   |--->
 381  --->|   |<---|   |<---|   |<---|   |<---
 382      +---+    +---+    +---+    +---+
 383                          ^
 384                          |
 385                      head page
 386
 387
 388In this case, the head page will not move when the tail and commit
 389move back into the ring buffer.
 390
 391The reader cannot swap a page into the ring buffer if the commit page
 392is still on that page. If the read meets the last commit (real commit
 393not pending or reserved), then there is nothing more to read.
 394The buffer is considered empty until another full commit finishes.
 395
 396When the tail meets the head page, if the buffer is in overwrite mode,
 397the head page will be pushed ahead one. If the buffer is in producer/consumer
 398mode, the write will fail.
 399
 400Overwrite mode::
 401
 402              tail page
 403                 |
 404                 v
 405      +---+    +---+    +---+    +---+
 406  <---|   |--->|   |--->|   |--->|   |--->
 407  --->|   |<---|   |<---|   |<---|   |<---
 408      +---+    +---+    +---+    +---+
 409                          ^
 410                          |
 411                      head page
 412
 413
 414              tail page
 415                 |
 416                 v
 417      +---+    +---+    +---+    +---+
 418  <---|   |--->|   |--->|   |--->|   |--->
 419  --->|   |<---|   |<---|   |<---|   |<---
 420      +---+    +---+    +---+    +---+
 421                                   ^
 422                                   |
 423                               head page
 424
 425
 426                      tail page
 427                          |
 428                          v
 429      +---+    +---+    +---+    +---+
 430  <---|   |--->|   |--->|   |--->|   |--->
 431  --->|   |<---|   |<---|   |<---|   |<---
 432      +---+    +---+    +---+    +---+
 433                                   ^
 434                                   |
 435                               head page
 436
 437Note, the reader page will still point to the previous head page.
 438But when a swap takes place, it will use the most recent head page.
 439
 440
 441Making the Ring Buffer Lockless:
 442--------------------------------
 443
 444The main idea behind the lockless algorithm is to combine the moving
 445of the head_page pointer with the swapping of pages with the reader.
 446State flags are placed inside the pointer to the page. To do this,
 447each page must be aligned in memory by 4 bytes. This will allow the 2
 448least significant bits of the address to be used as flags, since
 449they will always be zero for the address. To get the address,
 450simply mask out the flags::
 451
 452  MASK = ~3
 453
 454  address & MASK
 455
 456Two flags will be kept by these two bits:
 457
 458   HEADER
 459	- the page being pointed to is a head page
 460
 461   UPDATE
 462	- the page being pointed to is being updated by a writer
 463          and was or is about to be a head page.
 464
 465::
 466
 467	      reader page
 468		  |
 469		  v
 470		+---+
 471		|   |------+
 472		+---+      |
 473			    |
 474			    v
 475	+---+    +---+    +---+    +---+
 476    <---|   |--->|   |-H->|   |--->|   |--->
 477    --->|   |<---|   |<---|   |<---|   |<---
 478	+---+    +---+    +---+    +---+
 479
 480
 481The above pointer "-H->" would have the HEADER flag set. That is
 482the next page is the next page to be swapped out by the reader.
 483This pointer means the next page is the head page.
 484
 485When the tail page meets the head pointer, it will use cmpxchg to
 486change the pointer to the UPDATE state::
 487
 488
 489              tail page
 490                 |
 491                 v
 492      +---+    +---+    +---+    +---+
 493  <---|   |--->|   |-H->|   |--->|   |--->
 494  --->|   |<---|   |<---|   |<---|   |<---
 495      +---+    +---+    +---+    +---+
 496
 497              tail page
 498                 |
 499                 v
 500      +---+    +---+    +---+    +---+
 501  <---|   |--->|   |-U->|   |--->|   |--->
 502  --->|   |<---|   |<---|   |<---|   |<---
 503      +---+    +---+    +---+    +---+
 504
 505"-U->" represents a pointer in the UPDATE state.
 506
 507Any access to the reader will need to take some sort of lock to serialize
 508the readers. But the writers will never take a lock to write to the
 509ring buffer. This means we only need to worry about a single reader,
 510and writes only preempt in "stack" formation.
 511
 512When the reader tries to swap the page with the ring buffer, it
 513will also use cmpxchg. If the flag bit in the pointer to the
 514head page does not have the HEADER flag set, the compare will fail
 515and the reader will need to look for the new head page and try again.
 516Note, the flags UPDATE and HEADER are never set at the same time.
 517
 518The reader swaps the reader page as follows::
 519
 520  +------+
 521  |reader|          RING BUFFER
 522  |page  |
 523  +------+
 524                  +---+    +---+    +---+
 525                  |   |--->|   |--->|   |
 526                  |   |<---|   |<---|   |
 527                  +---+    +---+    +---+
 528                   ^ |               ^ |
 529                   | +---------------+ |
 530                   +-----H-------------+
 531
 532The reader sets the reader page next pointer as HEADER to the page after
 533the head page::
 534
 535
 536  +------+
 537  |reader|          RING BUFFER
 538  |page  |-------H-----------+
 539  +------+                   v
 540    |             +---+    +---+    +---+
 541    |             |   |--->|   |--->|   |
 542    |             |   |<---|   |<---|   |<-+
 543    |             +---+    +---+    +---+  |
 544    |              ^ |               ^ |   |
 545    |              | +---------------+ |   |
 546    |              +-----H-------------+   |
 547    +--------------------------------------+
 548
 549It does a cmpxchg with the pointer to the previous head page to make it
 550point to the reader page. Note that the new pointer does not have the HEADER
 551flag set.  This action atomically moves the head page forward::
 552
 553  +------+
 554  |reader|          RING BUFFER
 555  |page  |-------H-----------+
 556  +------+                   v
 557    |  ^          +---+   +---+   +---+
 558    |  |          |   |-->|   |-->|   |
 559    |  |          |   |<--|   |<--|   |<-+
 560    |  |          +---+   +---+   +---+  |
 561    |  |             |             ^ |   |
 562    |  |             +-------------+ |   |
 563    |  +-----------------------------+   |
 564    +------------------------------------+
 565
 566After the new head page is set, the previous pointer of the head page is
 567updated to the reader page::
 568
 569  +------+
 570  |reader|          RING BUFFER
 571  |page  |-------H-----------+
 572  +------+ <---------------+ v
 573    |  ^          +---+   +---+   +---+
 574    |  |          |   |-->|   |-->|   |
 575    |  |          |   |   |   |<--|   |<-+
 576    |  |          +---+   +---+   +---+  |
 577    |  |             |             ^ |   |
 578    |  |             +-------------+ |   |
 579    |  +-----------------------------+   |
 580    +------------------------------------+
 581
 582  +------+
 583  |buffer|          RING BUFFER
 584  |page  |-------H-----------+  <--- New head page
 585  +------+ <---------------+ v
 586    |  ^          +---+   +---+   +---+
 587    |  |          |   |   |   |-->|   |
 588    |  |  New     |   |   |   |<--|   |<-+
 589    |  | Reader   +---+   +---+   +---+  |
 590    |  |  page ----^                 |   |
 591    |  |                             |   |
 592    |  +-----------------------------+   |
 593    +------------------------------------+
 594
 595Another important point: The page that the reader page points back to
 596by its previous pointer (the one that now points to the new head page)
 597never points back to the reader page. That is because the reader page is
 598not part of the ring buffer. Traversing the ring buffer via the next pointers
 599will always stay in the ring buffer. Traversing the ring buffer via the
 600prev pointers may not.
 601
 602Note, the way to determine a reader page is simply by examining the previous
 603pointer of the page. If the next pointer of the previous page does not
 604point back to the original page, then the original page is a reader page::
 605
 606
 607             +--------+
 608             | reader |  next   +----+
 609             |  page  |-------->|    |<====== (buffer page)
 610             +--------+         +----+
 611                 |                | ^
 612                 |                v | next
 613            prev |              +----+
 614                 +------------->|    |
 615                                +----+
 616
 617The way the head page moves forward:
 618
 619When the tail page meets the head page and the buffer is in overwrite mode
 620and more writes take place, the head page must be moved forward before the
 621writer may move the tail page. The way this is done is that the writer
 622performs a cmpxchg to convert the pointer to the head page from the HEADER
 623flag to have the UPDATE flag set. Once this is done, the reader will
 624not be able to swap the head page from the buffer, nor will it be able to
 625move the head page, until the writer is finished with the move.
 626
 627This eliminates any races that the reader can have on the writer. The reader
 628must spin, and this is why the reader cannot preempt the writer::
 629
 630              tail page
 631                 |
 632                 v
 633      +---+    +---+    +---+    +---+
 634  <---|   |--->|   |-H->|   |--->|   |--->
 635  --->|   |<---|   |<---|   |<---|   |<---
 636      +---+    +---+    +---+    +---+
 637
 638              tail page
 639                 |
 640                 v
 641      +---+    +---+    +---+    +---+
 642  <---|   |--->|   |-U->|   |--->|   |--->
 643  --->|   |<---|   |<---|   |<---|   |<---
 644      +---+    +---+    +---+    +---+
 645
 646The following page will be made into the new head page::
 647
 648             tail page
 649                 |
 650                 v
 651      +---+    +---+    +---+    +---+
 652  <---|   |--->|   |-U->|   |-H->|   |--->
 653  --->|   |<---|   |<---|   |<---|   |<---
 654      +---+    +---+    +---+    +---+
 655
 656After the new head page has been set, we can set the old head page
 657pointer back to NORMAL::
 658
 659             tail page
 660                 |
 661                 v
 662      +---+    +---+    +---+    +---+
 663  <---|   |--->|   |--->|   |-H->|   |--->
 664  --->|   |<---|   |<---|   |<---|   |<---
 665      +---+    +---+    +---+    +---+
 666
 667After the head page has been moved, the tail page may now move forward::
 668
 669                      tail page
 670                          |
 671                          v
 672      +---+    +---+    +---+    +---+
 673  <---|   |--->|   |--->|   |-H->|   |--->
 674  --->|   |<---|   |<---|   |<---|   |<---
 675      +---+    +---+    +---+    +---+
 676
 677
 678The above are the trivial updates. Now for the more complex scenarios.
 679
 680
 681As stated before, if enough writes preempt the first write, the
 682tail page may make it all the way around the buffer and meet the commit
 683page. At this time, we must start dropping writes (usually with some kind
 684of warning to the user). But what happens if the commit was still on the
 685reader page? The commit page is not part of the ring buffer. The tail page
 686must account for this::
 687
 688
 689            reader page    commit page
 690                |              |
 691                v              |
 692               +---+           |
 693               |   |<----------+
 694               |   |
 695               |   |------+
 696               +---+      |
 697                          |
 698                          v
 699      +---+    +---+    +---+    +---+
 700  <---|   |--->|   |-H->|   |--->|   |--->
 701  --->|   |<---|   |<---|   |<---|   |<---
 702      +---+    +---+    +---+    +---+
 703                 ^
 704                 |
 705             tail page
 706
 707If the tail page were to simply push the head page forward, the commit when
 708leaving the reader page would not be pointing to the correct page.
 709
 710The solution to this is to test if the commit page is on the reader page
 711before pushing the head page. If it is, then it can be assumed that the
 712tail page wrapped the buffer, and we must drop new writes.
 713
 714This is not a race condition, because the commit page can only be moved
 715by the outermost writer (the writer that was preempted).
 716This means that the commit will not move while a writer is moving the
 717tail page. The reader cannot swap the reader page if it is also being
 718used as the commit page. The reader can simply check that the commit
 719is off the reader page. Once the commit page leaves the reader page
 720it will never go back on it unless a reader does another swap with the
 721buffer page that is also the commit page.
 722
 723
 724Nested writes
 725-------------
 726
 727In the pushing forward of the tail page we must first push forward
 728the head page if the head page is the next page. If the head page
 729is not the next page, the tail page is simply updated with a cmpxchg.
 730
 731Only writers move the tail page. This must be done atomically to protect
 732against nested writers::
 733
 734  temp_page = tail_page
 735  next_page = temp_page->next
 736  cmpxchg(tail_page, temp_page, next_page)
 737
 738The above will update the tail page if it is still pointing to the expected
 739page. If this fails, a nested write pushed it forward, the current write
 740does not need to push it::
 741
 742
 743             temp page
 744                 |
 745                 v
 746              tail page
 747                 |
 748                 v
 749      +---+    +---+    +---+    +---+
 750  <---|   |--->|   |--->|   |--->|   |--->
 751  --->|   |<---|   |<---|   |<---|   |<---
 752      +---+    +---+    +---+    +---+
 753
 754Nested write comes in and moves the tail page forward::
 755
 756                      tail page (moved by nested writer)
 757              temp page   |
 758                 |        |
 759                 v        v
 760      +---+    +---+    +---+    +---+
 761  <---|   |--->|   |--->|   |--->|   |--->
 762  --->|   |<---|   |<---|   |<---|   |<---
 763      +---+    +---+    +---+    +---+
 764
 765The above would fail the cmpxchg, but since the tail page has already
 766been moved forward, the writer will just try again to reserve storage
 767on the new tail page.
 768
 769But the moving of the head page is a bit more complex::
 770
 771              tail page
 772                 |
 773                 v
 774      +---+    +---+    +---+    +---+
 775  <---|   |--->|   |-H->|   |--->|   |--->
 776  --->|   |<---|   |<---|   |<---|   |<---
 777      +---+    +---+    +---+    +---+
 778
 779The write converts the head page pointer to UPDATE::
 780
 781              tail page
 782                 |
 783                 v
 784      +---+    +---+    +---+    +---+
 785  <---|   |--->|   |-U->|   |--->|   |--->
 786  --->|   |<---|   |<---|   |<---|   |<---
 787      +---+    +---+    +---+    +---+
 788
 789But if a nested writer preempts here, it will see that the next
 790page is a head page, but it is also nested. It will detect that
 791it is nested and will save that information. The detection is the
 792fact that it sees the UPDATE flag instead of a HEADER or NORMAL
 793pointer.
 794
 795The nested writer will set the new head page pointer::
 796
 797             tail page
 798                 |
 799                 v
 800      +---+    +---+    +---+    +---+
 801  <---|   |--->|   |-U->|   |-H->|   |--->
 802  --->|   |<---|   |<---|   |<---|   |<---
 803      +---+    +---+    +---+    +---+
 804
 805But it will not reset the update back to normal. Only the writer
 806that converted a pointer from HEAD to UPDATE will convert it back
 807to NORMAL::
 808
 809                      tail page
 810                          |
 811                          v
 812      +---+    +---+    +---+    +---+
 813  <---|   |--->|   |-U->|   |-H->|   |--->
 814  --->|   |<---|   |<---|   |<---|   |<---
 815      +---+    +---+    +---+    +---+
 816
 817After the nested writer finishes, the outermost writer will convert
 818the UPDATE pointer to NORMAL::
 819
 820
 821                      tail page
 822                          |
 823                          v
 824      +---+    +---+    +---+    +---+
 825  <---|   |--->|   |--->|   |-H->|   |--->
 826  --->|   |<---|   |<---|   |<---|   |<---
 827      +---+    +---+    +---+    +---+
 828
 829
 830It can be even more complex if several nested writes came in and moved
 831the tail page ahead several pages::
 832
 833
 834  (first writer)
 835
 836              tail page
 837                 |
 838                 v
 839      +---+    +---+    +---+    +---+
 840  <---|   |--->|   |-H->|   |--->|   |--->
 841  --->|   |<---|   |<---|   |<---|   |<---
 842      +---+    +---+    +---+    +---+
 843
 844The write converts the head page pointer to UPDATE::
 845
 846              tail page
 847                 |
 848                 v
 849      +---+    +---+    +---+    +---+
 850  <---|   |--->|   |-U->|   |--->|   |--->
 851  --->|   |<---|   |<---|   |<---|   |<---
 852      +---+    +---+    +---+    +---+
 853
 854Next writer comes in, and sees the update and sets up the new
 855head page::
 856
 857  (second writer)
 858
 859             tail page
 860                 |
 861                 v
 862      +---+    +---+    +---+    +---+
 863  <---|   |--->|   |-U->|   |-H->|   |--->
 864  --->|   |<---|   |<---|   |<---|   |<---
 865      +---+    +---+    +---+    +---+
 866
 867The nested writer moves the tail page forward. But does not set the old
 868update page to NORMAL because it is not the outermost writer::
 869
 870                      tail page
 871                          |
 872                          v
 873      +---+    +---+    +---+    +---+
 874  <---|   |--->|   |-U->|   |-H->|   |--->
 875  --->|   |<---|   |<---|   |<---|   |<---
 876      +---+    +---+    +---+    +---+
 877
 878Another writer preempts and sees the page after the tail page is a head page.
 879It changes it from HEAD to UPDATE::
 880
 881  (third writer)
 882
 883                      tail page
 884                          |
 885                          v
 886      +---+    +---+    +---+    +---+
 887  <---|   |--->|   |-U->|   |-U->|   |--->
 888  --->|   |<---|   |<---|   |<---|   |<---
 889      +---+    +---+    +---+    +---+
 890
 891The writer will move the head page forward::
 892
 893
 894  (third writer)
 895
 896                      tail page
 897                          |
 898                          v
 899      +---+    +---+    +---+    +---+
 900  <---|   |--->|   |-U->|   |-U->|   |-H->
 901  --->|   |<---|   |<---|   |<---|   |<---
 902      +---+    +---+    +---+    +---+
 903
 904But now that the third writer did change the HEAD flag to UPDATE it
 905will convert it to normal::
 906
 907
 908  (third writer)
 909
 910                      tail page
 911                          |
 912                          v
 913      +---+    +---+    +---+    +---+
 914  <---|   |--->|   |-U->|   |--->|   |-H->
 915  --->|   |<---|   |<---|   |<---|   |<---
 916      +---+    +---+    +---+    +---+
 917
 918
 919Then it will move the tail page, and return back to the second writer::
 920
 921
 922  (second writer)
 923
 924                               tail page
 925                                   |
 926                                   v
 927      +---+    +---+    +---+    +---+
 928  <---|   |--->|   |-U->|   |--->|   |-H->
 929  --->|   |<---|   |<---|   |<---|   |<---
 930      +---+    +---+    +---+    +---+
 931
 932
 933The second writer will fail to move the tail page because it was already
 934moved, so it will try again and add its data to the new tail page.
 935It will return to the first writer::
 936
 937
 938  (first writer)
 939
 940                               tail page
 941                                   |
 942                                   v
 943      +---+    +---+    +---+    +---+
 944  <---|   |--->|   |-U->|   |--->|   |-H->
 945  --->|   |<---|   |<---|   |<---|   |<---
 946      +---+    +---+    +---+    +---+
 947
 948The first writer cannot know atomically if the tail page moved
 949while it updates the HEAD page. It will then update the head page to
 950what it thinks is the new head page::
 951
 952
 953  (first writer)
 954
 955                               tail page
 956                                   |
 957                                   v
 958      +---+    +---+    +---+    +---+
 959  <---|   |--->|   |-U->|   |-H->|   |-H->
 960  --->|   |<---|   |<---|   |<---|   |<---
 961      +---+    +---+    +---+    +---+
 962
 963Since the cmpxchg returns the old value of the pointer the first writer
 964will see it succeeded in updating the pointer from NORMAL to HEAD.
 965But as we can see, this is not good enough. It must also check to see
 966if the tail page is either where it use to be or on the next page::
 967
 968
 969  (first writer)
 970
 971                 A        B    tail page
 972                 |        |        |
 973                 v        v        v
 974      +---+    +---+    +---+    +---+
 975  <---|   |--->|   |-U->|   |-H->|   |-H->
 976  --->|   |<---|   |<---|   |<---|   |<---
 977      +---+    +---+    +---+    +---+
 978
 979If tail page != A and tail page != B, then it must reset the pointer
 980back to NORMAL. The fact that it only needs to worry about nested
 981writers means that it only needs to check this after setting the HEAD page::
 982
 983
 984  (first writer)
 985
 986                 A        B    tail page
 987                 |        |        |
 988                 v        v        v
 989      +---+    +---+    +---+    +---+
 990  <---|   |--->|   |-U->|   |--->|   |-H->
 991  --->|   |<---|   |<---|   |<---|   |<---
 992      +---+    +---+    +---+    +---+
 993
 994Now the writer can update the head page. This is also why the head page must
 995remain in UPDATE and only reset by the outermost writer. This prevents
 996the reader from seeing the incorrect head page::
 997
 998
 999  (first writer)
1000
1001                 A        B    tail page
1002                 |        |        |
1003                 v        v        v
1004      +---+    +---+    +---+    +---+
1005  <---|   |--->|   |--->|   |--->|   |-H->
1006  --->|   |<---|   |<---|   |<---|   |<---
1007      +---+    +---+    +---+    +---+