v5.9
.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against corruption in the case of a system crash. A small
contiguous region of disk (default 128MiB) is reserved inside the
filesystem as a place to land “important” data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
the journal code writes the transactions to their final locations on
disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.
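
The discard rule above can be sketched in a few lines of Python. This is
only an illustration, not the kernel's replay code: it assumes the block
headers have already been parsed into ``(h_blocktype, h_sequence)`` pairs,
it ignores checksums, and the function name and input format are
hypothetical.

```python
# Block type codes as listed in the jbd2 block type table of this document.
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5


def committed_transactions(blocks):
    """Return the IDs of transactions that end with a commit block.

    `blocks` is a hypothetical, already-parsed list of
    (h_blocktype, h_sequence) pairs in log order.
    """
    committed = []
    open_tid = None
    for blocktype, tid in blocks:
        if blocktype in (DESCRIPTOR, REVOKE) and open_tid is None:
            open_tid = tid            # a transaction starts here
        elif blocktype == COMMIT and tid == open_tid:
            committed.append(tid)     # matching commit: transaction is complete
            open_tid = None
    return committed                  # a transaction left open is discarded
```

A log that ends with a descriptor but no matching commit therefore
contributes nothing for that last transaction.
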
External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - h\_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - \_\_be32
     - h\_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - \_\_be32
     - h\_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.
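
As a rough illustration of the header layout, the following Python sketch
decodes the three big-endian fields and maps the block type to a label.
The function name and the label strings are made up for this example; only
the offsets, the magic number and the type codes come from the tables
above.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}


def parse_header(block):
    """Decode the 12-byte common header; return None if the magic is wrong."""
    # ">III" = three big-endian 32-bit fields: h_magic, h_blocktype, h_sequence
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence
```

Data blocks carry no header, so a ``None`` result does not by itself mean
the journal is damaged.
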
Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal\_header\_t (12 bytes)
     - s\_header
     - Common header identifying this as a superblock.
   * - 0xC
     - \_\_be32
     - s\_blocksize
     - Journal device block size.
   * - 0x10
     - \_\_be32
     - s\_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - \_\_be32
     - s\_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - \_\_be32
     - s\_sequence
     - First commit ID expected in log.
   * - 0x1C
     - \_\_be32
     - s\_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - \_\_be32
     - s\_errno
     - Error value, as set by jbd2\_journal\_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - \_\_be32
     - s\_feature\_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - \_\_be32
     - s\_feature\_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - \_\_be32
     - s\_feature\_ro\_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - \_\_u8
     - s\_uuid[16]
     - 128-bit UUID for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - \_\_be32
     - s\_nr\_users
     - Number of file systems sharing this journal.
   * - 0x44
     - \_\_be32
     - s\_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - \_\_be32
     - s\_max\_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - \_\_be32
     - s\_max\_trans\_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - \_\_u8
     - s\_checksum\_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - \_\_u8[3]
     - s\_padding2
     -
   * - 0x54
     - \_\_u32
     - s\_padding[42]
     -
   * - 0xFC
     - \_\_be32
     - s\_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - \_\_u8
     - s\_users[16\*48]
     - IDs of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.
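
To make the offsets concrete, here is a small Python sketch that pulls a
handful of fields out of a raw 1024-byte journal superblock. It follows
the table above and nothing else; the function name and the dictionary
keys are illustrative choices, and it does not verify ``s_checksum``.

```python
import struct
import uuid


def parse_journal_superblock(sb):
    """Decode selected fields of a 1024-byte journal superblock."""
    # Six consecutive __be32 fields: s_blocksize .. s_errno (0x0C - 0x23)
    (blocksize, maxlen, first,
     sequence, start, errno) = struct.unpack_from(">6I", sb, 0x0C)
    # The three feature words (0x24 - 0x2F), only valid in a v2 superblock
    compat, incompat, ro_compat = struct.unpack_from(">3I", sb, 0x24)
    return {
        "blocksize": blocksize,
        "maxlen": maxlen,
        "first": first,
        "sequence": sequence,
        "start": start,
        "errno": errno,
        "feature_compat": compat,
        "feature_incompat": incompat,
        "feature_ro_compat": ro_compat,
        "uuid": uuid.UUID(bytes=bytes(sb[0x30:0x40])),
        "nr_users": struct.unpack_from(">I", sb, 0x40)[0],
        "checksum_type": sb[0x50],
    }
```

A caller would normally check the common header at offset 0x0 first and
only trust the feature words when the block type says "superblock, v2".
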
.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
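
Since the feature words are plain bit masks, decoding them is a matter of
testing each bit. The sketch below does this for ``s_feature_incompat``
using the values from the table above; the function name and the shortened
flag names are hypothetical.

```python
# Bit values from the incompat feature table; names shortened for display.
INCOMPAT_FLAGS = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
}


def decode_incompat(value):
    """Split s_feature_incompat into known flag names and leftover bits."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    unknown = value & ~sum(INCOMPAT_FLAGS)   # bits this table does not list
    return names, unknown
```

A tool that finds a non-zero ``unknown`` value would be wise to leave the
journal alone, because an unrecognised incompat bit signals an on-disk
format it does not understand.
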
.. _jbd2_checksum_type:

Journal checksum type codes are one of the following.  crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal\_block\_tag\_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32 bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be32
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32 bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
       not enabled.
   * - 0xC
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32 bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be16
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - \_\_be16
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32 bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
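
Because the tag size varies, walking a descriptor block means deciding the
layout per journal and then checking the flags of every tag. The following
Python sketch does so strictly according to the two tag tables in this
document; it is not a transcription of the kernel's parsing code, the
function and parameter names are invented, and checksums are ignored.

```python
import struct

FLAG_SAME_UUID = 0x2   # uuid[16] is omitted after this tag
FLAG_LAST_TAG = 0x8    # no more tags in this descriptor block


def parse_tags(block, csum_v3, has_64bit):
    """Yield (blocknr, flags) for each tag after the 12-byte block header."""
    tag_size = 16 if csum_v3 else (12 if has_64bit else 8)
    off = 12
    while off + tag_size <= len(block):
        if csum_v3:
            lo, flags, hi, _csum = struct.unpack_from(">IIII", block, off)
        else:
            lo, _csum, flags = struct.unpack_from(">IHH", block, off)
            hi = 0
            if has_64bit:
                (hi,) = struct.unpack_from(">I", block, off + 8)
        off += tag_size
        if not flags & FLAG_SAME_UUID:
            off += 16                  # skip the trailing uuid[16]
        yield (hi << 32) | lo, flags
        if flags & FLAG_LAST_TAG:
            break
```

Each yielded block number names the final on-disk location of the next
data block that follows the descriptor in the journal, in the same order.
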
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.
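
The escaping rule is simple enough to show in full. This Python sketch
mirrors the description above; the two function names are hypothetical,
and the flag value 0x1 is taken from the tag flags table.

```python
JBD2_MAGIC_BYTES = (0xC03B3998).to_bytes(4, "big")   # magic as stored on disk
FLAG_ESCAPED = 0x1


def escape_block(data):
    """Return (bytes to store in the journal, tag flags) for one data block."""
    if data[:4] == JBD2_MAGIC_BYTES:
        return bytes(4) + data[4:], FLAG_ESCAPED
    return data, 0


def unescape_block(stored, flags):
    """Undo escape_block() when replaying a block from the journal."""
    if flags & FLAG_ESCAPED:
        return JBD2_MAGIC_BYTES + stored[4:]
    return stored
```

Without this step, a data block that happens to begin with the magic
number could be mistaken for a journal metadata block when the log is
scanned.
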
Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - r\_header
     - Common block header.
   * - 0xC
     - \_\_be32
     - r\_count
     - Number of bytes used in this block.
   * - 0x10
     - \_\_be32 or \_\_be64
     - blocks[0]
     - Blocks to revoke.

After r\_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
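
Reading the revoked block numbers could look roughly like the Python
sketch below. It assumes that ``r_count`` counts bytes from the start of
the block, so that it includes the 16-byte header; the table above only
says "bytes used in this block", so treat that as an assumption. The
function and parameter names are hypothetical.

```python
import struct


def parse_revoke_block(block, has_64bit):
    """Return the list of revoked block numbers in one revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0x0C)
    fmt, width = (">Q", 8) if has_64bit else (">I", 4)
    end = min(r_count, len(block))     # never read past the buffer
    # Entries start right after the header, at offset 0x10.
    return [struct.unpack_from(fmt, block, off)[0]
            for off in range(0x10, end - width + 1, width)]
```
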
If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - r\_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentinel that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which is 60
bytes long according to the offsets below (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h\_chksum\_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h\_chksum\_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h\_padding[2]
     -
   * - 0x10
     - \_\_be32
     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - \_\_be64
     - h\_commit\_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - \_\_be32
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.
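
The commit block layout maps directly onto a single ``struct`` format
string. The Python sketch below decodes it according to the table above;
the function name and dictionary keys are illustrative, and the checksum
is returned as stored rather than verified.

```python
import struct
from datetime import datetime, timezone

# header(12) type(1) size(1) padding(2) h_chksum(8 x be32) sec(be64) nsec(be32)
COMMIT_FMT = ">12sBB2x8IQI"


def parse_commit_block(block):
    """Decode the fixed-size fields at the start of a commit block."""
    fields = struct.unpack_from(COMMIT_FMT, block, 0)
    _header, chksum_type, chksum_size = fields[:3]
    chksum = fields[3:11]              # the eight __be32 checksum slots
    sec, nsec = fields[11:]
    return {
        "chksum_type": chksum_type,
        "chksum_size": chksum_size,
        "first_chksum": chksum[0],
        "committed": datetime.fromtimestamp(sec, tz=timezone.utc),
        "nsec": nsec,
    }
```

``struct.calcsize(COMMIT_FMT)`` evaluates to 60, which matches the sum of
the field sizes in the table.
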
v6.13.7
.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus it makes the fast commit area empty for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit UUID for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __be32
     - s_head
     - Block number of the head (first unused block) of the journal, only
       up-to-date when the journal is empty.
   * - 0x5C
     - __u32
     - s_padding[40]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - IDs of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
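
The two fast-commit additions in this version, the 0x20 incompat bit and
the ``s_num_fc_blocks`` field, can be checked together. The Python sketch
below reads both from a raw journal superblock using the offsets in the
tables above; the function name is hypothetical, and how the kernel treats
a zero ``s_num_fc_blocks`` is outside what this document states.

```python
import struct

INCOMPAT_FAST_COMMIT = 0x20


def fast_commit_blocks(sb):
    """Return s_num_fc_blocks, or None if the fast commit bit is not set."""
    (incompat,) = struct.unpack_from(">I", sb, 0x28)   # s_feature_incompat
    if not incompat & INCOMPAT_FAST_COMMIT:
        return None
    (num_fc_blocks,) = struct.unpack_from(">I", sb, 0x54)
    return num_fc_blocks
```
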
325.. _jbd2_checksum_type:
326
327Journal checksum type codes are one of the following.  crc32 or crc32c are the
328most likely choices.
329
330.. list-table::
331   :widths: 16 64
332   :header-rows: 1
333
334   * - Value
335     - Description
336   * - 1
337     - CRC32
338   * - 2
339     - MD5
340   * - 3
341     - SHA1
342   * - 4
343     - CRC32C
344
345Descriptor Block
346~~~~~~~~~~~~~~~~
347
348The descriptor block contains an array of journal block tags that
349describe the final locations of the data blocks that follow in the
350journal. Descriptor blocks are open-coded instead of being completely
351described by a data structure, but here is the block structure anyway.
352Descriptor blocks consume at least 36 bytes, but use a full block:
353
354.. list-table::
355   :widths: 8 8 24 40
356   :header-rows: 1
357
358   * - Offset
359     - Type
360     - Name
361     - Descriptor
362   * - 0x0
363     - journal_header_t
364     - (open coded)
365     - Common block header.
366   * - 0xC
367     - struct journal_block_tag_s
368     - open coded array[]
369     - Enough tags either to fill up the block or to describe all the data
370       blocks that follow this descriptor block.
371
372Journal block tags have any of the following formats, depending on which
373journal feature and block tag flags are set.
374
375If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
376defined as ``struct journal_block_tag3_s``, which looks like the
377following. The size is 16 or 32 bytes.
378
379.. list-table::
380   :widths: 8 8 24 40
381   :header-rows: 1
382
383   * - Offset
384     - Type
385     - Name
386     - Description
387   * - 0x0
388     - __be32
389     - t_blocknr
390     - Lower 32-bits of the location of where the corresponding data block
391       should end up on disk.
392   * - 0x4
393     - __be32
394     - t_flags
395     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
396       more info.
397   * - 0x8
398     - __be32
399     - t_blocknr_high
400     - Upper 32-bits of the location of where the corresponding data block
401       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
402       not enabled.
403   * - 0xC
404     - __be32
405     - t_checksum
406     - Checksum of the journal UUID, the sequence number, and the data block.
407   * -
408     -
409     -
410     - This field appears to be open coded. It always comes at the end of the
411       tag, after t_checksum. This field is not present if the "same UUID" flag
412       is set.
413   * - 0x10
414     - char
415     - uuid[16]
416     - A UUID to go with this tag. This field appears to be copied from the
417       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
418       field.
419
420.. _jbd2_tag_flags:
421
422The journal tag flags are any combination of the following:
423
424.. list-table::
425   :widths: 16 64
426   :header-rows: 1
427
428   * - Value
429     - Description
430   * - 0x1
431     - On-disk block is escaped. The first four bytes of the data block just
432       happened to match the jbd2 magic number.
433   * - 0x2
434     - This block has the same UUID as previous, therefore the UUID field is
435       omitted.
436   * - 0x4
437     - The data block was deleted by the transaction. (Not used?)
438   * - 0x8
439     - This is the last tag in this descriptor block.
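
As a rough illustration of the v3 tag layout and the flag bits above, the
following Python sketch decodes one ``struct journal_block_tag3_s`` from a
descriptor block buffer. The helper name ``parse_tag3`` and the ``FLAG_*``
constant names are made up for this example; only the offsets, field widths,
and flag values are taken from the tables.

```python
import struct

# Tag flag bits from the table above; the constant names are illustrative.
FLAG_ESCAPED = 0x1
FLAG_SAME_UUID = 0x2
FLAG_DELETED = 0x4
FLAG_LAST_TAG = 0x8


def parse_tag3(buf, offset=0):
    """Decode one v3 block tag; return (fields, size_in_bytes)."""
    # Four big-endian 32-bit fields: t_blocknr, t_flags, t_blocknr_high,
    # t_checksum.
    blocknr_lo, flags, blocknr_hi, checksum = struct.unpack_from(
        ">IIII", buf, offset)
    size = 16
    uuid = None
    if not flags & FLAG_SAME_UUID:
        # The 16-byte UUID follows t_checksum unless "same UUID" is set.
        uuid = bytes(buf[offset + 16:offset + 32])
        size += 16
    fields = {
        "blocknr": (blocknr_hi << 32) | blocknr_lo,
        "flags": flags,
        "checksum": checksum,
        "uuid": uuid,
        "last": bool(flags & FLAG_LAST_TAG),
    }
    return fields, size
```

A caller would advance ``offset`` by the returned size and stop once a tag
reports ``last``.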
440
441If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
442is defined as ``struct journal_block_tag_s``, which looks like the
443following. The size is 8, 12, 24, or 28 bytes:
444
445.. list-table::
446   :widths: 8 8 24 40
447   :header-rows: 1
448
449   * - Offset
450     - Type
451     - Name
452     - Description
453   * - 0x0
454     - __be32
455     - t_blocknr
456     - Lower 32-bits of the location of where the corresponding data block
457       should end up on disk.
458   * - 0x4
459     - __be16
460     - t_checksum
461     - Checksum of the journal UUID, the sequence number, and the data block.
462       Note that only the lower 16 bits are stored.
463   * - 0x6
464     - __be16
465     - t_flags
466     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
467       more info.
468   * -
469     -
470     -
471     - This next field is only present if the super block indicates support for
472       64-bit block numbers.
473   * - 0x8
474     - __be32
475     - t_blocknr_high
476     - Upper 32-bits of the location of where the corresponding data block
477       should end up on disk.
478   * -
479     -
480     -
481     - This field appears to be open coded. It always comes at the end of the
482       tag, after t_flags or t_blocknr_high. This field is not present if the
483       "same UUID" flag is set.
484   * - 0x8 or 0xC
485     - char
486     - uuid[16]
487     - A UUID to go with this tag. This field appears to be copied from the
488       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
489       field.
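
The four possible sizes of the older tag format follow from two independent
choices: whether 64-bit block numbers are enabled, and whether the "same UUID"
flag is set. The Python sketch below works that out and decodes the fixed
fields. The function names are hypothetical, and ``has_64bit`` stands in for
the superblock's 64-bit feature check.

```python
import struct

FLAG_SAME_UUID = 0x2  # from the tag flags table; the name is illustrative


def legacy_tag_size(has_64bit, flags):
    """Return the on-disk size of a pre-v3 journal block tag."""
    size = 8                      # t_blocknr + t_checksum + t_flags
    if has_64bit:
        size += 4                 # t_blocknr_high
    if not flags & FLAG_SAME_UUID:
        size += 16                # trailing uuid[16]
    return size


def parse_legacy_tag(buf, offset, has_64bit):
    """Decode a pre-v3 tag; return (blocknr, checksum16, flags, size)."""
    blocknr, checksum16, flags = struct.unpack_from(">IHH", buf, offset)
    if has_64bit:
        (high,) = struct.unpack_from(">I", buf, offset + 8)
        blocknr |= high << 32
    return blocknr, checksum16, flags, legacy_tag_size(has_64bit, flags)
```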
490
491If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
492JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
493``struct jbd2_journal_block_tail``, which looks like this:
494
495.. list-table::
496   :widths: 8 8 24 40
497   :header-rows: 1
498
499   * - Offset
500     - Type
501     - Name
502     - Description
503   * - 0x0
504     - __be32
505     - t_checksum
506     - Checksum of the journal UUID + the descriptor block, with this field set
507       to zero.
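
A minimal sketch of how a reader might separate the tail from the rest of the
descriptor block, assuming the tail occupies the final four bytes of the
block. The function name is hypothetical, and the checksum algorithm itself is
left out.

```python
import struct


def split_block_tail(block):
    """Return (stored_checksum, block_with_checksum_field_zeroed)."""
    # t_checksum is a big-endian 32-bit value at the very end of the block.
    (stored,) = struct.unpack_from(">I", block, len(block) - 4)
    # The checksum is computed with this field set to zero, so hand back a
    # copy of the block in that state for verification.
    zeroed = bytes(block[:-4]) + b"\x00" * 4
    return stored, zeroed
```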
508
509Data Block
510~~~~~~~~~~
511
512In general, the data blocks being written to disk through the journal
513are written verbatim into the journal file after the descriptor block.
514However, if the first four bytes of the block match the jbd2 magic
515number then those four bytes are replaced with zeroes and the “escaped”
516flag is set in the descriptor block tag.
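
The escaping rule can be sketched in Python as follows. The magic number
``0xC03B3998`` is the value defined in the kernel's jbd2 header and is stored
big-endian on disk; the function and constant names are illustrative only.

```python
import struct

JBD2_MAGIC = struct.pack(">I", 0xC03B3998)  # on-disk (big-endian) magic
FLAG_ESCAPED = 0x1                           # from the tag flags table


def escape_block(data):
    """Return (journal_copy, tag_flags) for a block headed to the journal."""
    if data[:4] == JBD2_MAGIC:
        # Zero the first four bytes so replay never mistakes this data
        # block for a journal metadata block.
        return b"\x00" * 4 + data[4:], FLAG_ESCAPED
    return data, 0


def unescape_block(data, flags):
    """Undo escape_block() when replaying the journal copy."""
    if flags & FLAG_ESCAPED:
        return JBD2_MAGIC + data[4:]
    return data
```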
517
518Revocation Block
519~~~~~~~~~~~~~~~~
520
521A revocation block is used to prevent replay of a block in an earlier
522transaction. This is used to mark blocks that were journalled at one
523time but are no longer journalled. Typically this happens if a metadata
524block is freed and re-allocated as a file data block; in this case, a
525journal replay after the file block was written to disk will cause
526corruption.
527
528**NOTE**: This mechanism is NOT used to express “this journal block is
529superseded by this other journal block”, as the author (djwong)
530mistakenly thought. Any block being added to a transaction will cause
531the removal of all existing revocation records for that block.
532
533Revocation blocks are described in
534``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
535length, but use a full block:
536
537.. list-table::
538   :widths: 8 8 24 40
539   :header-rows: 1
540
541   * - Offset
542     - Type
543     - Name
544     - Description
545   * - 0x0
546     - journal_header_t
547     - r_header
548     - Common block header.
549   * - 0xC
550     - __be32
551     - r_count
552     - Number of bytes used in this block.
553   * - 0x10
554     - __be32 or __be64
555     - blocks[0]
556     - Blocks to revoke.
557
558After r_count is a linear array of block numbers that are effectively
559revoked by this transaction. The size of each block number is 8 bytes if
560the superblock advertises 64-bit block number support, or 4 bytes
561otherwise.
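
The following Python sketch reads the revoked block numbers out of a
revocation block. It assumes that ``r_count`` is measured from the start of
the block and therefore includes the 16-byte header; the function name and the
``has_64bit`` parameter are hypothetical.

```python
import struct


def parse_revoke_block(block, has_64bit):
    """Return the list of block numbers revoked by this block."""
    (r_count,) = struct.unpack_from(">I", block, 0xC)
    # Entries are 8 bytes wide with 64-bit block numbers, 4 bytes otherwise.
    width, fmt = (8, ">Q") if has_64bit else (4, ">I")
    blocks = []
    offset = 0x10  # the array starts right after r_count
    while offset + width <= r_count:
        blocks.append(struct.unpack_from(fmt, block, offset)[0])
        offset += width
    return blocks
```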
562
563If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
564JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
565block is a ``struct jbd2_journal_revoke_tail``, which has this format:
566
567.. list-table::
568   :widths: 8 8 24 40
569   :header-rows: 1
570
571   * - Offset
572     - Type
573     - Name
574     - Description
575   * - 0x0
576     - __be32
577     - r_checksum
578     - Checksum of the journal UUID + revocation block
579
580Commit Block
581~~~~~~~~~~~~
582
583The commit block is a sentry that indicates that a transaction has been
584completely written to the journal. Once this commit block reaches the
585journal, the data stored with this transaction can be written to their
586final locations on disk.
587
588The commit block is described by ``struct commit_header``, which is 60
589bytes long (but uses a full block):
590
591.. list-table::
592   :widths: 8 8 24 40
593   :header-rows: 1
594
595   * - Offset
596     - Type
597     - Name
598     - Description
599   * - 0x0
600     - journal_header_t
601     - (open coded)
602     - Common block header.
603   * - 0xC
604     - unsigned char
605     - h_chksum_type
606     - The type of checksum to use to verify the integrity of the data blocks
607       in the transaction. See jbd2_checksum_type_ for more info.
608   * - 0xD
609     - unsigned char
610     - h_chksum_size
611     - The number of bytes used by the checksum. Most likely 4.
612   * - 0xE
613     - unsigned char
614     - h_padding[2]
615     -
616   * - 0x10
617     - __be32
618     - h_chksum[JBD2_CHECKSUM_BYTES]
619     - 32 bytes of space to store checksums. If
620       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
621       is set, the first ``__be32`` is the checksum of the journal UUID and
622       the entire commit block, with this field zeroed. If
623       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
624       crc32 of all the blocks already written to the transaction.
625   * - 0x30
626     - __be64
627     - h_commit_sec
628     - The time that the transaction was committed, in seconds since the epoch.
629   * - 0x38
630     - __be32
631     - h_commit_nsec
632     - Nanoseconds component of the above timestamp.
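
To make the offsets in the table concrete, here is a small Python sketch that
pulls the commit block fields out of a raw block. The function name and the
dictionary keys are illustrative; the checksum type names come from the
jbd2_checksum_type_ table, and no checksum verification is attempted.

```python
import struct

# Checksum type codes from the jbd2_checksum_type table.
CHECKSUM_TYPES = {1: "CRC32", 2: "MD5", 3: "SHA1", 4: "CRC32C"}


def parse_commit_block(block):
    """Decode the fields that follow the 12-byte common block header."""
    chksum_type, chksum_size = struct.unpack_from(">BB", block, 0xC)
    # 32 bytes of checksum space, viewed as eight big-endian 32-bit words.
    chksum = struct.unpack_from(">8I", block, 0x10)
    commit_sec, commit_nsec = struct.unpack_from(">QI", block, 0x30)
    return {
        "chksum_type": CHECKSUM_TYPES.get(chksum_type, "unknown"),
        "chksum_size": chksum_size,
        "chksum": chksum,
        "commit_sec": commit_sec,
        "commit_nsec": commit_nsec,
    }
```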
633
634Fast commits
635~~~~~~~~~~~~
636
637The fast commit area is organized as a log of tag-length-value (TLV) records.
638Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the length
639of the entire field, followed by a variable-length, tag-specific value.
640Here is the list of supported tags and their meanings:
641
642.. list-table::
643   :widths: 8 20 20 32
644   :header-rows: 1
645
646   * - Tag
647     - Meaning
648     - Value struct
649     - Description
650   * - EXT4_FC_TAG_HEAD
651     - Fast commit area header
652     - ``struct ext4_fc_head``
653     - Stores the TID of the transaction after which these fast commits should
654       be applied.
655   * - EXT4_FC_TAG_ADD_RANGE
656     - Add extent to inode
657     - ``struct ext4_fc_add_range``
658     - Stores the inode number and the extent to be added to this inode
659   * - EXT4_FC_TAG_DEL_RANGE
660     - Remove logical offsets from inode
661     - ``struct ext4_fc_del_range``
662     - Stores the inode number and the logical offset range that needs to be
663       removed
664   * - EXT4_FC_TAG_CREAT
665     - Create directory entry for a newly created file
666     - ``struct ext4_fc_dentry_info``
667     - Stores the parent inode number, inode number and directory entry of the
668       newly created file
669   * - EXT4_FC_TAG_LINK
670     - Link a directory entry to an inode
671     - ``struct ext4_fc_dentry_info``
672     - Stores the parent inode number, inode number and directory entry
673   * - EXT4_FC_TAG_UNLINK
674     - Unlink a directory entry of an inode
675     - ``struct ext4_fc_dentry_info``
676     - Stores the parent inode number, inode number and directory entry
677
678   * - EXT4_FC_TAG_PAD
679     - Padding (unused area)
680     - None
681     - Unused bytes in the fast commit area.
682
683   * - EXT4_FC_TAG_TAIL
684     - Mark the end of a fast commit
685     - ``struct ext4_fc_tail``
686     - Stores the TID of the commit and the CRC of the fast commit that this
687       tag terminates
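
Walking such a log can be sketched in Python as shown below. This is not the
kernel's replay code. It assumes that ``struct ext4_fc_tl`` consists of two
little-endian 16-bit fields (tag, then length) and that the length counts only
the value bytes; both assumptions should be checked against the kernel
headers. The function name is hypothetical, and tags are returned as plain
integers rather than mapped to the names in the table.

```python
import struct

TL_HEADER_SIZE = 4  # assumed: 16-bit tag followed by 16-bit length


def walk_tlvs(area):
    """Yield (tag, value_bytes) for each TLV found in the buffer."""
    offset = 0
    while offset + TL_HEADER_SIZE <= len(area):
        tag, length = struct.unpack_from("<HH", area, offset)
        offset += TL_HEADER_SIZE
        if offset + length > len(area):
            break  # truncated value: stop rather than read past the end
        yield tag, bytes(area[offset:offset + length])
        offset += length
```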
688
689Fast Commit Replay Idempotence
690~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
691
692Fast commit tags are idempotent, provided that the recovery code follows
693certain rules. The guiding principle of the commit path is that it stores
694the result of a particular operation instead of storing the procedure that
695produced it.
696
697Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
698was associated with inode 10. During fast commit, instead of storing this
699operation as a procedure "rename a to b", we store the resulting file system
700state as a "series" of outcomes:
701
702- Link dirent b to inode 10
703- Unlink dirent a
704- Inode 10 with valid refcount
705
706Now when the recovery code runs, it needs to "enforce" this state on the file
707system. This is what guarantees idempotence of fast commit replay.
708
709Let's take an example of a procedure that is not idempotent and see how fast
710commits make it idempotent. Consider the following sequence of operations:
711
7121) rm A
7132) mv B A
7143) read A
715
716If we store this sequence of operations as is, then the replay is not idempotent.
717Let's say that during replay, we crash after (2). During the second replay,
718file A (which was actually created as a result of the "mv B A" operation) would get
719deleted. Thus, the file named A would be absent when we try to read A. So, this
720sequence of operations is not idempotent. However, as mentioned above, instead
721of storing the procedure, fast commits store the outcome of each procedure. Thus,
722the fast commit log for the above procedure would be as follows:
723
724(Let's assume dirent A was linked to inode 10 and dirent B was linked to
725inode 11 before the replay)
726
7271) Unlink A
7282) Link A to inode 11
7293) Unlink B
7304) Inode 11
731
732If we crash after (3) we will have file A linked to inode 11. During the second
733replay, we will remove file A (inode 11). But we will create it back and make
734it point to inode 11. We won't find B, so we'll just skip that step. At this
735point, the refcount for inode 11 is not reliable, but that gets fixed by the
736replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure
737into a series of idempotent outcomes, fast commits ensure idempotence during
738the replay.
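
The argument above can be checked with a toy model in Python. The directory is
reduced to a name-to-inode mapping, the log holds only the link and unlink
outcomes (the inode refcount tag is left out), and an interrupted replay is
followed by a full one. Everything here, including the ``replay`` helper and
the tuple encoding of log records, is invented for illustration.

```python
def replay(dirents, log, stop=None):
    """Apply outcome records log[:stop] to a name -> inode mapping."""
    for op, name, inode in log[:stop]:
        if op == "unlink":
            dirents.pop(name, None)   # a missing name is simply skipped
        elif op == "link":
            dirents[name] = inode     # enforce the final mapping
    return dirents


# Outcome log for "rm A; mv B A", with A -> inode 10 and B -> inode 11.
LOG = [("unlink", "A", None), ("link", "A", 11), ("unlink", "B", None)]

clean = replay({"A": 10, "B": 11}, LOG)
crashed = replay({"A": 10, "B": 11}, LOG, stop=2)  # crash before "Unlink B"
recovered = replay(crashed, LOG)                    # second, full replay
```

Both ``clean`` and ``recovered`` end up with A linked to inode 11 and no entry
for B, which is the property the text describes.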
739
740Journal Checkpoint
741~~~~~~~~~~~~~~~~~~
742
743Checkpointing the journal ensures all transactions and their associated buffers
744are submitted to the disk. In-progress transactions are waited upon and included
745in the checkpoint. Checkpointing is used internally during critical updates to
746the filesystem including journal recovery, filesystem resizing, and freeing of
747the journal_t structure.
748
749A journal checkpoint can be triggered from userspace via the ioctl
750EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
751Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
752can be used to verify input to the ioctl. It returns an error if there is any
753invalid input, otherwise it returns success without performing
754any checkpointing. This can be used to check whether the ioctl exists on a
755system and to verify there are no issues with arguments or flags. The
756other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
757EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
758discarded or zero-filled, respectively, after the journal checkpoint is
759complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
760cannot both be set. The ioctl may be useful when snapshotting a system or for
761complying with content deletion SLOs.
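
The flag rules described above can be expressed as a small userspace-side
check, sketched here in Python. The numeric flag values are assumptions made
for this example and should be confirmed against the kernel's ext4 UAPI header
before use; the function name is hypothetical, and the sketch does not issue
the ioctl itself.

```python
# Assumed bit values; verify against the kernel headers.
EXT4_IOC_CHECKPOINT_FLAG_DISCARD = 0x1
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT = 0x2
EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN = 0x4

VALID_FLAGS = (EXT4_IOC_CHECKPOINT_FLAG_DISCARD
               | EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
               | EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)


def check_checkpoint_flags(flags):
    """Raise ValueError for flag combinations the text rules out."""
    if flags & ~VALID_FLAGS:
        raise ValueError("unknown checkpoint flag bits set")
    both = (EXT4_IOC_CHECKPOINT_FLAG_DISCARD
            | EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT)
    if flags & both == both:
        raise ValueError("DISCARD and ZEROOUT cannot both be set")
    return flags
```

A caller could combine ``EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN`` with the intended
flags first, to probe whether the running kernel accepts them before
requesting a real checkpoint.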