Note: File does not exist in v6.13.7.
  1
  2Ext4 Filesystem
  3===============
  4
Ext4 is an advanced level of the ext3 filesystem which incorporates
scalability and reliability enhancements for supporting large filesystems
(64 bit) in keeping with increasing disk capacities and state-of-the-art
feature requirements.
  9
 10Mailing list:	linux-ext4@vger.kernel.org
 11Web site:	http://ext4.wiki.kernel.org
 12
 13
 141. Quick usage instructions:
 15===========================
 16
 17Note: More extensive information for getting started with ext4 can be
 18      found at the ext4 wiki site at the URL:
 19      http://ext4.wiki.kernel.org/index.php/Ext4_Howto
 20
 21  - Compile and install the latest version of e2fsprogs (as of this
 22    writing version 1.41.3) from:
 23
 24    http://sourceforge.net/project/showfiles.php?group_id=2406
 25	
 26	or
 27
 28    ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
 29
 30	or grab the latest git repository from:
 31
 32    git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
 33
 34  - Note that it is highly important to install the mke2fs.conf file
 35    that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If
 36    you have edited the /etc/mke2fs.conf file installed on your system,
 37    you will need to merge your changes with the version from e2fsprogs
 38    1.41.x.
 39
 40  - Create a new filesystem using the ext4 filesystem type:
 41
 42    	# mke2fs -t ext4 /dev/hda1
 43
 44    Or to configure an existing ext3 filesystem to support extents: 
 45
 46	# tune2fs -O extents /dev/hda1
 47
 48    If the filesystem was created with 128 byte inodes, it can be
 49    converted to use 256 byte for greater efficiency via:
 50
 51        # tune2fs -I 256 /dev/hda1
 52
    (Note: we currently do not have tools to convert an ext4
    filesystem back to ext3; so please do not try this on production
    filesystems.)
 56
 57  - Mounting:
 58
 59	# mount -t ext4 /dev/hda1 /wherever
 60
  - When comparing performance with other filesystems, it's always
    important to try multiple workloads; very often a subtle change in a
    workload parameter can completely change the ranking of which
    filesystems do well compared to others.  When comparing versus ext3,
    note that ext4 enables write barriers by default, while ext3 does
    not enable write barriers by default.  So it is useful to
    explicitly specify whether barriers are enabled or not via the
    '-o barrier=[0|1]' mount option for both ext3 and ext4 filesystems
    for a fair comparison.  When tuning ext3 for best benchmark numbers,
    it is often worthwhile to try changing the data journaling mode; '-o
    data=writeback' can be faster for some workloads.  (Note however that
    running mounted with data=writeback can potentially leave stale data
    exposed in recently written files in case of an unclean shutdown,
    which could be a security exposure in some situations.)  Configuring
    the filesystem with a large journal can also be helpful for
    metadata-intensive workloads.
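
    As a rough illustration of the fair-comparison advice above, the
    following Python sketch builds mount command lines with the barrier
    setting spelled out for both filesystem types.  The helper name
    mount_argv, the device and the mount point are hypothetical; only the
    '-t' and '-o barrier=' arguments come from the text above.

```python
def mount_argv(fstype, device, mountpoint, barrier, extra_opts=()):
    """Return an argv list for mount(8) with the barrier setting made explicit."""
    if fstype not in ("ext3", "ext4"):
        raise ValueError("this sketch only covers ext3 and ext4")
    # Spell out barrier=0 or barrier=1 instead of relying on per-filesystem defaults.
    opts = ["barrier=%d" % (1 if barrier else 0)]
    opts.extend(extra_opts)
    return ["mount", "-t", fstype, "-o", ",".join(opts), device, mountpoint]


# Use the same barrier setting for both filesystems so the runs are comparable.
for fstype in ("ext3", "ext4"):
    print(" ".join(mount_argv(fstype, "/dev/hda1", "/wherever", barrier=True)))
```

    The same helper can carry other options under test, for example
    extra_opts=["data=writeback"], so that each benchmark run records
    exactly which options were in effect.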
 77
 782. Features
 79===========
 80
 812.1 Currently available
 82
* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
  internal redundancy in tree
* improved file allocation (multi-block alloc)
* lift 32000 subdirectory limit imposed by i_links_count[1]
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
* reduced e2fsck time via uninit_bg feature
* journal checksumming for robustness, performance
* persistent file preallocation (e.g. for streaming media, databases)
* ability to pack bitmaps and inode tables into larger virtual groups via the
  flex_bg feature
* large file support
* inode allocation using large virtual block groups via flex_bg
* delayed allocation
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
  the ordering)
102
103[1] Filesystems with a block size of 1k may see a limit imposed by the
104directory hash tree having a maximum depth of two.
105
1062.2 Candidate features for future inclusion
107
108* Online defrag (patches available but not well tested)
109* reduced mke2fs time via lazy itable initialization in conjunction with
110  the uninit_bg feature (capability to do this is available in e2fsprogs
111  but a kernel thread to do lazy zeroing of unused inode table blocks
112  after filesystem is first mounted is required for safety)
113
114There are several others under discussion, whether they all make it in is
115partly a function of how much time everyone has to work on them. Features like
116metadata checksumming have been discussed and planned for a bit but no patches
117exist yet so I'm not sure they're in the near-term roadmap.
118
119The big performance win will come with mballoc, delalloc and flex_bg
120grouping of bitmaps and inode tables.  Some test results available here:
121
122 - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html
123 - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html
124
1253. Options
126==========
127
When mounting an ext4 filesystem, the following options are accepted:
(*) == default
130
131ro                   	Mount filesystem read only. Note that ext4 will
132                     	replay the journal (and thus write to the
133                     	partition) even when mounted "read only". The
134                     	mount options "ro,noload" can be used to prevent
135		     	writes to the filesystem.
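
The interaction between "ro" and "noload" described above can be sketched
as a small check on a mount option string.  This is only an illustrative
model of the documented behaviour, not kernel code; the function name
mount_may_write is hypothetical.

```python
def mount_may_write(options):
    """Return True if mounting with this option string may write to the device."""
    opts = {o.strip() for o in options.split(",") if o.strip()}
    if "ro" not in opts:
        # A read-write mount can obviously write.
        return True
    # "ro" alone may still replay the journal; noload/norecovery skips the replay.
    return not (opts & {"noload", "norecovery"})


print(mount_may_write("ro"))         # journal replay is still possible
print(mount_may_write("ro,noload"))  # no journal replay, no writes
```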
136
journal_checksum	Enable checksumming of the journal transactions.
			This will allow the recovery code in e2fsck and the
			kernel to detect corruption in the journal.  It is a
			compatible change and will be ignored by older kernels.
141
142journal_async_commit	Commit block can be written to disk without waiting
143			for descriptor blocks. If enabled older kernels cannot
144			mount the device. This will enable 'journal_checksum'
145			internally.
146
147journal=update		Update the ext4 file system's journal to the current
148			format.
149
150journal_dev=devnum	When the external journal device's major/minor numbers
151			have changed, this option allows the user to specify
152			the new journal location.  The journal device is
153			identified through its new major/minor numbers encoded
154			in devnum.
155
156norecovery		Don't load the journal on mounting.  Note that
157noload			if the filesystem was not unmounted cleanly,
158                     	skipping the journal replay will lead to the
159                     	filesystem containing inconsistencies that can
160                     	lead to any number of problems.
161
162data=journal		All data are committed into the journal prior to being
163			written into the main file system.
164
165data=ordered	(*)	All data are forced directly out to the main file
166			system prior to its metadata being committed to the
167			journal.
168
169data=writeback		Data ordering is not preserved, data may be written
170			into the main file system after its metadata has been
171			committed to the journal.
172
173commit=nrsec	(*)	Ext4 can be told to sync all its data and metadata
174			every 'nrsec' seconds. The default value is 5 seconds.
175			This means that if you lose your power, you will lose
176			as much as the latest 5 seconds of work (your
177			filesystem will not be damaged though, thanks to the
178			journaling).  This default value (or any low value)
179			will hurt performance, but it's good for data-safety.
180			Setting it to 0 will have the same effect as leaving
181			it at the default (5 seconds).
182			Setting it to very large values will improve
183			performance.
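
The special handling of commit=0 is easy to overlook, so here is a minimal
Python sketch of the rule as stated above.  The function and constant names
are hypothetical and this only models the documented behaviour.

```python
DEFAULT_COMMIT_SECONDS = 5  # default stated in the text above


def effective_commit_interval(nrsec=None):
    """Return the commit interval, in seconds, implied by commit=nrsec."""
    if nrsec is None or nrsec == 0:
        # No option given, or commit=0: both mean the default interval.
        return DEFAULT_COMMIT_SECONDS
    if nrsec < 0:
        raise ValueError("commit interval cannot be negative")
    return nrsec


# The returned value is also roughly the window of work that can be lost
# on power failure.
print(effective_commit_interval(0))
print(effective_commit_interval(30))
```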
184
185barrier=<0|1(*)>	This enables/disables the use of write barriers in
186barrier(*)		the jbd code.  barrier=0 disables, barrier=1 enables.
187nobarrier		This also requires an IO stack which can support
188			barriers, and if jbd gets an error on a barrier
189			write, it will disable again with a warning.
190			Write barriers enforce proper on-disk ordering
191			of journal commits, making volatile disk write caches
192			safe to use, at some performance penalty.  If
193			your disks are battery-backed in one way or another,
194			disabling barriers may safely improve performance.
195			The mount options "barrier" and "nobarrier" can
196			also be used to enable or disable barriers, for
197			consistency with other ext4 mount options.
198
199inode_readahead_blks=n	This tuning parameter controls the maximum
200			number of inode table blocks that ext4's inode
201			table readahead algorithm will pre-read into
202			the buffer cache.  The default value is 32 blocks.
203
204orlov		(*)	This enables the new Orlov block allocator. It is
205			enabled by default.
206
207oldalloc		This disables the Orlov block allocator and enables
208			the old block allocator.  Orlov should have better
209			performance - we'd like to get some feedback if it's
210			the contrary for you.
211
212user_xattr		Enables Extended User Attributes.  Additionally, you
213			need to have extended attribute support enabled in the
214			kernel configuration (CONFIG_EXT4_FS_XATTR).  See the
215			attr(5) manual page and http://acl.bestbits.at/ to
216			learn more about extended attributes.
217
218nouser_xattr		Disables Extended User Attributes.
219
220acl			Enables POSIX Access Control Lists support.
221			Additionally, you need to have ACL support enabled in
222			the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
223			See the acl(5) manual page and http://acl.bestbits.at/
224			for more information.
225
226noacl			This option disables POSIX Access Control List
227			support.
228
229bsddf		(*)	Make 'df' act like BSD.
230minixdf			Make 'df' act like Minix.
231
232debug			Extra debugging information is sent to syslog.
233
234abort			Simulate the effects of calling ext4_abort() for
235			debugging purposes.  This is normally used while
236			remounting a filesystem which is already mounted.
237
238errors=remount-ro	Remount the filesystem read-only on an error.
239errors=continue		Keep going on a filesystem error.
240errors=panic		Panic and halt the machine if an error occurs.
241                        (These mount options override the errors behavior
242                        specified in the superblock, which can be configured
243                        using tune2fs)
244
245data_err=ignore(*)	Just print an error message if an error occurs
246			in a file data buffer in ordered mode.
247data_err=abort		Abort the journal if an error occurs in a file
248			data buffer in ordered mode.
249
grpid			Give objects the same group ID as their parent
bsdgroups		directory.

nogrpid		(*)	New objects have the group ID of their creator.
sysvgroups
255
256resgid=n		The group ID which may use the reserved blocks.
257
258resuid=n		The user ID which may use the reserved blocks.
259
260sb=n			Use alternate superblock at this location.
261
262quota			These options are ignored by the filesystem. They
263noquota			are used only by quota tools to recognize volumes
264grpquota		where quota should be turned on. See documentation
265usrquota		in the quota-tools package for more details
266			(http://sourceforge.net/projects/linuxquota).
267
268jqfmt=<quota type>	These options tell filesystem details about quota
269usrjquota=<file>	so that quota information can be properly updated
270grpjquota=<file>	during journal replay. They replace the above
271			quota options. See documentation in the quota-tools
272			package for more details
273			(http://sourceforge.net/projects/linuxquota).
274
275stripe=n		Number of filesystem blocks that mballoc will try
276			to use for allocation size and alignment. For RAID5/6
277			systems this should be the number of data
278			disks *  RAID chunk size in file system blocks.
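
As a worked example of the stripe formula above, the following Python
sketch converts a RAID geometry into a value for stripe=n.  The function
name and the sample geometries are assumptions chosen for illustration.

```python
def stripe_blocks(data_disks, chunk_size_bytes, block_size_bytes=4096):
    """Return stripe=n: data disks * RAID chunk size in filesystem blocks."""
    if chunk_size_bytes % block_size_bytes != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    chunk_blocks = chunk_size_bytes // block_size_bytes
    return data_disks * chunk_blocks


# RAID5 over 4 disks has 3 data disks; 64 KiB chunks are 16 blocks of 4 KiB.
print(stripe_blocks(data_disks=3, chunk_size_bytes=64 * 1024))   # 3 * 16 = 48
# RAID6 over 6 disks has 4 data disks; 128 KiB chunks are 32 blocks of 4 KiB.
print(stripe_blocks(data_disks=4, chunk_size_bytes=128 * 1024))  # 4 * 32 = 128
```

The first geometry would be mounted with '-o stripe=48'.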
279
delalloc	(*)	Defer block allocation until just before ext4
			writes out the block(s) in question.  This
			allows ext4 to make better allocation decisions
			more efficiently.
284nodelalloc		Disable delayed allocation.  Blocks are allocated
285			when the data is copied from userspace to the
286			page cache, either via the write(2) system call
287			or when an mmap'ed page which was previously
288			unallocated is written for the first time.
289
max_batch_time=usec	Maximum amount of time ext4 should wait for
			additional filesystem operations to be batched
			together with a synchronous write operation.
			Since a synchronous write operation is going to
			force a commit and then a wait for the I/O to
			complete, waiting a small amount of time to see
			if any other transactions can piggyback on the
			synchronous write doesn't cost much, and can be
			a huge throughput win.  The
			algorithm used is designed to automatically tune
			for the speed of the disk, by measuring the
			amount of time (on average) that it takes to
			finish committing a transaction.  Call this time
			the "commit time".  If the time that the
			transaction has been running is less than the
			commit time, ext4 will try sleeping for the
			commit time to see if other operations will join
			the transaction.  The commit time is capped by
			the max_batch_time, which defaults to 15000us
			(15ms).  This optimization can be turned off
			entirely by setting max_batch_time to 0.
311
312min_batch_time=usec	This parameter sets the commit time (as
313			described above) to be at least min_batch_time.
314			It defaults to zero microseconds.  Increasing
315			this parameter may improve the throughput of
316			multi-threaded, synchronous workloads on very
317			fast disks, at the cost of increasing latency.
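
The batching rule described for max_batch_time and min_batch_time can be
modelled in a few lines of Python.  This is a simplified sketch of the
description above, not the kernel implementation; the function name and
the order in which the two limits are applied are assumptions.

```python
def batch_sleep_usec(avg_commit_usec, running_usec,
                     min_batch_usec=0, max_batch_usec=15000):
    """Return how long (in usec) to wait for other operations to join."""
    # Raise the measured commit time to at least min_batch_time ...
    commit_time = max(avg_commit_usec, min_batch_usec)
    # ... and cap it at max_batch_time (0 disables the optimization).
    commit_time = min(commit_time, max_batch_usec)
    if running_usec < commit_time:
        # The transaction is still young: sleep for the commit time.
        return commit_time
    return 0


print(batch_sleep_usec(4000, 1000))    # slow disk: waits 4000 usec
print(batch_sleep_usec(40000, 1000))   # capped at the 15000 usec default
print(batch_sleep_usec(4000, 5000))    # transaction is old enough: no wait
```

With a very fast disk the measured commit time is tiny, so little batching
happens unless min_batch_time raises the floor, which is the case the
min_batch_time description refers to.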
318
journal_ioprio=prio	The I/O priority (from 0 to 7, where 0 is the
			highest priority) which should be used for I/O
			operations submitted by kjournald2 during a
			commit operation.  This defaults to 3, which is
			a slightly higher priority than the default I/O
			priority.
325
326auto_da_alloc(*)	Many broken applications don't use fsync() when 
327noauto_da_alloc		replacing existing files via patterns such as
328			fd = open("foo.new")/write(fd,..)/close(fd)/
329			rename("foo.new", "foo"), or worse yet,
330			fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
331			If auto_da_alloc is enabled, ext4 will detect
332			the replace-via-rename and replace-via-truncate
333			patterns and force that any delayed allocation
334			blocks are allocated such that at the next
335			journal commit, in the default data=ordered
336			mode, the data blocks of the new file are forced
337			to disk before the rename() operation is
338			committed.  This provides roughly the same level
339			of guarantees as ext3, and avoids the
340			"zero-length" problem that can happen when a
341			system crashes before the delayed allocation
342			blocks are forced to disk.
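
For contrast with the broken patterns named above, this Python sketch shows
the replace-via-rename pattern with the fsync() call that applications are
expected to make themselves.  The function name and the ".new" suffix are
illustrative choices, not part of any ext4 interface.

```python
import os


def replace_file_durably(path, data):
    """Replace path with data (bytes) using write, fsync, then rename."""
    tmp_path = path + ".new"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Force the new contents to disk before the rename makes them visible.
        os.fsync(fd)
    finally:
        os.close(fd)
    # rename() atomically swaps the new file into place.
    os.rename(tmp_path, path)
```

An application written this way does not depend on auto_da_alloc; the
option exists to protect applications that skip the fsync() step.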
343
noinit_itable		Do not initialize any uninitialized inode table
			blocks in the background.  This feature may be
			used by installation CD's so that the install
			process can complete as quickly as possible; the
			inode table initialization process would then be
			deferred until the next time the file system
			is unmounted.
351
init_itable=n		The lazy itable init code will wait n times the
			number of milliseconds it took to zero out the
			previous block group's inode table.  This
			minimizes the impact on the system performance
			while the file system's inode table is being
			initialized.
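
The arithmetic behind init_itable=n can be sketched as follows.  The
function names are hypothetical, and the duty-cycle figure is a direct
consequence of the rule above under the assumption that every block group
takes about the same time to zero.

```python
def itable_init_wait_ms(n, last_zeroing_ms):
    """Return the pause before zeroing the next group's inode table."""
    if n < 0 or last_zeroing_ms < 0:
        raise ValueError("n and last_zeroing_ms must be non-negative")
    return n * last_zeroing_ms


def zeroing_duty_cycle(n):
    """Fraction of time spent zeroing: t / (t + n * t) = 1 / (n + 1)."""
    return 1.0 / (n + 1)


# If the previous group took 3 ms to zero and n is 10, wait 30 ms.
print(itable_init_wait_ms(10, 3))
# With n = 9 the background thread is busy about a tenth of the time.
print(zeroing_duty_cycle(9))
```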
357
358discard			Controls whether ext4 should issue discard/TRIM
359nodiscard(*)		commands to the underlying block device when
360			blocks are freed.  This is useful for SSD devices
361			and sparse/thinly-provisioned LUNs, but it is off
362			by default until sufficient testing has been done.
363
nouid32			Disables 32-bit UIDs and GIDs.  This is for
			interoperability with older kernels which only
			store and expect 16-bit values.
367
resize			Allows resizing the filesystem to the end of the
			last existing block group; further resizing has to
			be done with resize2fs, either online or offline.
			It can be used only in conjunction with remount.
372
block_validity		These options enable/disable the in-kernel
noblock_validity	facility for tracking filesystem metadata blocks
			within internal data structures. This allows the
			multi-block allocator and other routines to quickly
			locate extents which might overlap with filesystem
			metadata blocks. This option is intended for
			debugging purposes and, since it negatively affects
			performance, it is off by default.
381
dioread_lock		Controls whether or not ext4 should use the DIO read
dioread_nolock		locking. If the dioread_nolock option is specified,
			ext4 will allocate an uninitialized extent before a
			buffer write and convert the extent to initialized
			after the IO completes. This approach allows ext4
			code to avoid using the inode mutex, which improves
			scalability on high-speed storage. However, this
			does not work with data journaling, and the
			dioread_nolock option will be ignored with a kernel
			warning. Note that the dioread_nolock code path is
			only used for extent-based files.  Because of the
			restrictions this option imposes, it is off by
			default (i.e. dioread_lock).
394
395i_version		Enable 64-bit inode version support. This option is
396			off by default.
397
398Data Mode
399=========
400There are 3 different data modes:
401
402* writeback mode
403In data=writeback mode, ext4 does not journal data at all.  This mode provides
404a similar level of journaling as that of XFS, JFS, and ReiserFS in its default
405mode - metadata journaling.  A crash+recovery can cause incorrect data to
406appear in files which were written shortly before the crash.  This mode will
407typically provide the best ext4 performance.
408
409* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
groups metadata information related to data changes with the data blocks into a
single unit called a transaction.  When it's time to write the new metadata
out to disk, the associated data blocks are written first.  In general,
this mode performs slightly slower than writeback but significantly faster
than journal mode.
415
416* journal mode
data=journal mode provides full data and metadata journaling.  All new data is
written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state.  This mode is the slowest except when data
needs to be read from and written to disk at the same time, where it
outperforms all other modes.  Currently ext4 does not have delayed
allocation support if this data journaling mode is selected.
424
425/proc entries
426=============
427
Information about mounted ext4 file systems can be found in
/proc/fs/ext4.  Each mounted filesystem will have a directory in
/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
/proc/fs/ext4/dm-0).  The files in each per-device directory are shown
in the table below.
433
434Files in /proc/fs/ext4/<devname>
435..............................................................................
436 File            Content
437 mb_groups       details of multiblock allocator buddy cache of free blocks
438..............................................................................
439
440/sys entries
441============
442
Information about mounted ext4 file systems can be found in
/sys/fs/ext4.  Each mounted filesystem will have a directory in
/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
/sys/fs/ext4/dm-0).  The files in each per-device directory are shown
in the table below.
448
449Files in /sys/fs/ext4/<devname>
450(see also Documentation/ABI/testing/sysfs-fs-ext4)
451..............................................................................
452 File                         Content
453
454 delayed_allocation_blocks    This file is read-only and shows the number of
455                              blocks that are dirty in the page cache, but
456                              which do not have their location in the
457                              filesystem allocated yet.
458
459 inode_goal                   Tuning parameter which (if non-zero) controls
460                              the goal inode used by the inode allocator in
461                              preference to all other allocation heuristics.
462                              This is intended for debugging use only, and
463                              should be 0 on production systems.
464
465 inode_readahead_blks         Tuning parameter which controls the maximum
466                              number of inode table blocks that ext4's inode
467                              table readahead algorithm will pre-read into
468                              the buffer cache
469
470 lifetime_write_kbytes        This file is read-only and shows the number of
471                              kilobytes of data that have been written to this
472                              filesystem since it was created.
473
 max_writeback_mb_bump        The maximum number of megabytes the writeback
                              code will try to write out before moving on to
                              another inode.
477
478 mb_group_prealloc            The multiblock allocator will round up allocation
479                              requests to a multiple of this tuning parameter if
480                              the stripe size is not set in the ext4 superblock
481
482 mb_max_to_scan               The maximum number of extents the multiblock
483                              allocator will search to find the best extent
484
485 mb_min_to_scan               The minimum number of extents the multiblock
486                              allocator will search to find the best extent
487
488 mb_order2_req                Tuning parameter which controls the minimum size
489                              for requests (as a power of 2) where the buddy
490                              cache is used
491
492 mb_stats                     Controls whether the multiblock allocator should
493                              collect statistics, which are shown during the
494                              unmount. 1 means to collect statistics, 0 means
495                              not to collect statistics
496
497 mb_stream_req                Files which have fewer blocks than this tunable
498                              parameter will have their blocks allocated out
499                              of a block group specific preallocation pool, so
500                              that small files are packed closely together.
501                              Each large file will have its blocks allocated
502                              out of its own unique preallocation pool.
503
504 session_write_kbytes         This file is read-only and shows the number of
505                              kilobytes of data that have been written to this
506                              filesystem since it was mounted.
507..............................................................................
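
The per-device files listed above hold plain decimal values, so they can be
read with ordinary file I/O.  The following Python sketch collects a few of
them into a dictionary; the function name is hypothetical, the sysfs_root
parameter exists only so the code can be pointed at a test directory, and
which files are present depends on the running kernel.

```python
import os


def read_ext4_tunables(devname, names, sysfs_root="/sys/fs/ext4"):
    """Return {name: int or None} for files under <sysfs_root>/<devname>."""
    values = {}
    dev_dir = os.path.join(sysfs_root, devname)
    for name in names:
        try:
            with open(os.path.join(dev_dir, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or non-numeric on this kernel: report None.
            values[name] = None
    return values
```

For example, read_ext4_tunables("dm-0", ["mb_stats",
"session_write_kbytes"]) would report whether allocator statistics are
being collected and how much has been written since mount, assuming a
mounted ext4 filesystem on dm-0.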
508
509Ioctls
510======
511
There is some Ext4 specific functionality which can be accessed by applications
through the system call interfaces. The list of all Ext4 specific ioctls is
shown in the table below.
515
516Table of Ext4 specific ioctls
517..............................................................................
518 Ioctl			      Description
519 EXT4_IOC_GETFLAGS	      Get additional attributes associated with inode.
520			      The ioctl argument is an integer bitfield, with
521			      bit values described in ext4.h. This ioctl is an
522			      alias for FS_IOC_GETFLAGS.
523
524 EXT4_IOC_SETFLAGS	      Set additional attributes associated with inode.
525			      The ioctl argument is an integer bitfield, with
526			      bit values described in ext4.h. This ioctl is an
527			      alias for FS_IOC_SETFLAGS.
528
 EXT4_IOC_GETVERSION
 EXT4_IOC_GETVERSION_OLD
			      Get the inode i_generation number stored for
			      each inode. The i_generation number is normally
			      changed only when a new inode is created and it
			      is particularly useful for network filesystems.
			      The '_OLD' version of this ioctl is an alias for
			      FS_IOC_GETVERSION.
537
538 EXT4_IOC_SETVERSION
539 EXT4_IOC_SETVERSION_OLD
540			      Set the inode i_generation number stored for
541			      each inode. The '_OLD' version of this ioctl
542			      is an alias for FS_IOC_SETVERSION.
543
 EXT4_IOC_GROUP_EXTEND	      This ioctl has the same purpose as the resize
			      mount option. It allows resizing the filesystem
			      to the end of the last existing block group;
			      further resizing has to be done with resize2fs,
			      either online or offline. The argument points
			      to the unsigned long number representing the
			      filesystem's new block count.
551
552 EXT4_IOC_MOVE_EXT	      Move the block extents from orig_fd (the one
553			      this ioctl is pointing to) to the donor_fd (the
554			      one specified in move_extent structure passed
555			      as an argument to this ioctl). Then, exchange
556			      inode metadata between orig_fd and donor_fd.
557			      This is especially useful for online
558			      defragmentation, because the allocator has the
559			      opportunity to allocate moved blocks better,
560			      ideally into one contiguous extent.
561
 EXT4_IOC_GROUP_ADD	      Add a new group descriptor to an existing or
			      new group descriptor block. The new group
			      descriptor is described by the
			      ext4_new_group_input structure, which is passed
			      as an argument to this ioctl. This is especially
			      useful in conjunction with EXT4_IOC_GROUP_EXTEND,
			      which allows online resize of the filesystem
			      to the end of the last existing block group.
			      Those two ioctls combined are used in userspace
			      online resize tools (e.g. resize2fs).
572
 EXT4_IOC_MIGRATE	      This ioctl operates on the filesystem itself.
			      It converts (migrates) an ext3 indirect block
			      mapped inode to an ext4 extent mapped inode by
			      walking through the indirect block mapping of
			      the original inode and converting contiguous
			      block ranges into ext4 extents of a temporary
			      inode. Then, the inodes are swapped. This ioctl
			      might help when migrating from an ext3 to an
			      ext4 filesystem; however, the suggestion is to
			      create a fresh ext4 filesystem and copy data
			      from the backup. Note that the filesystem has
			      to support extents for this ioctl to work.
585
 EXT4_IOC_ALLOC_DA_BLKS	      Force all of the delay allocated blocks to be
			      allocated to preserve application-expected ext3
			      behaviour. Note that this will also start
			      triggering a write of the data blocks, but this
			      behaviour may change in the future as it is
			      not necessary and has been done this way only
			      for the sake of simplicity.
593..............................................................................
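
As an illustration of how an application reaches these ioctls, the Python
sketch below reads the inode flags through FS_IOC_GETFLAGS, which the table
above lists as the alias of EXT4_IOC_GETFLAGS.  The request number is
computed with the generic Linux _IOR encoding for ('f', 1, long); that
encoding is an assumption that holds on common architectures such as x86
and ARM but not on all of them, and the helper names are hypothetical.

```python
import fcntl
import struct

_IOC_READ = 2  # assumption: asm-generic ioctl direction encoding


def ioc_read(type_char, nr, size):
    """Build an _IOR() request number from type, number and argument size."""
    return (_IOC_READ << 30) | (size << 16) | (ord(type_char) << 8) | nr


# FS_IOC_GETFLAGS is declared as _IOR('f', 1, long) in the kernel headers.
FS_IOC_GETFLAGS = ioc_read("f", 1, struct.calcsize("l"))


def get_inode_flags(path):
    """Return the inode flag bitfield for path as an integer."""
    buf = bytearray(struct.calcsize("l"))
    with open(path, "rb") as f:
        # The kernel fills the buffer in place; the flags are a 32-bit value.
        fcntl.ioctl(f.fileno(), FS_IOC_GETFLAGS, buf)
    return struct.unpack_from("I", buf)[0]
```

Calling get_inode_flags() on a file returns a bitfield whose bit values
are described in ext4.h; the call raises OSError on filesystems that do
not implement the ioctl.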
594
595References
596==========
597
598kernel source:	<file:fs/ext4/>
599		<file:fs/jbd2/>
600
601programs:	http://e2fsprogs.sourceforge.net/
602
603useful links:	http://fedoraproject.org/wiki/ext3-devel
604		http://www.bullopensource.org/ext4/
605		http://ext4.wiki.kernel.org/index.php/Main_Page
606		http://fedoraproject.org/wiki/Features/Ext4