.. SPDX-License-Identifier: GPL-2.0

===============
Shared Subtrees
===============

.. Contents:
        1) Overview
        2) Features
        3) Setting mount states
        4) Use cases
        5) Detailed semantics
        6) Quiz
        7) FAQ
        8) Implementation


1) Overview
-----------

Consider the following situation:

A process wants to clone its own namespace, but still wants to access the CD
that got mounted recently. Shared subtree semantics provide the necessary
mechanism to accomplish this.

They also provide the necessary building blocks for features such as
per-user namespaces and versioned filesystems.

2) Features
-----------

Shared subtrees provide four different flavors of mounts (struct vfsmount, to
be precise):

    a. shared mount
    b. slave mount
    c. private mount
    d. unbindable mount


2a) A shared mount can be replicated to as many mountpoints as needed, and
all the replicas continue to be exactly the same.

    Here is an example:

    Let's say /mnt has a mount that is shared::

        mount --make-shared /mnt

    Note: the mount(8) command now supports the --make-shared flag,
    so the sample 'smount' program is no longer needed and has been
    removed.

    ::

        # mount --bind /mnt /tmp

    The above command replicates the mount at /mnt to the mountpoint /tmp
    and the contents of both the mounts remain identical.

    ::

        # ls /mnt
        a b c

        # ls /tmp
        a b c

    Now let's say we mount a device at /tmp/a::

        # mount /dev/sd0 /tmp/a

        # ls /tmp/a
        t1 t2 t3

        # ls /mnt/a
        t1 t2 t3

    Note that the mount has propagated to the mount at /mnt as well.

    The same is true when /dev/sd0 is mounted on /mnt/a instead: the
    contents will be visible under /tmp/a too.


2b) A slave mount is like a shared mount except that mount and umount events
only propagate towards it.

    All slave mounts have a master mount which is shared.

    Here is an example:

    Let's say /mnt has a mount which is shared::

        # mount --make-shared /mnt

    Let's bind mount /mnt to /tmp::

        # mount --bind /mnt /tmp

    The new mount at /tmp becomes a shared mount and it is a replica of
    the mount at /mnt.

    Now let's make the mount at /tmp a slave of /mnt::

        # mount --make-slave /tmp

    Let's mount /dev/sd0 on /mnt/a::

        # mount /dev/sd0 /mnt/a

        # ls /mnt/a
        t1 t2 t3

        # ls /tmp/a
        t1 t2 t3

    Note that the mount event has propagated to the mount at /tmp.

    However, let's see what happens if we mount something on the mount at
    /tmp::

        # mount /dev/sd1 /tmp/b

        # ls /tmp/b
        s1 s2 s3

        # ls /mnt/b

    Note how the mount event has not propagated to the mount at /mnt.


2c) A private mount does not forward or receive propagation.

    This is the mount we are familiar with. It's the default type.


2d) An unbindable mount is a private mount that cannot be bind mounted.

    Let's say we have a mount at /mnt and we make it unbindable::

        # mount --make-unbindable /mnt

    Let's try to bind mount this mount somewhere else::

        # mount --bind /mnt /tmp
        mount: wrong fs type, bad option, bad superblock on /mnt,
               or too many mounted file systems

    Binding an unbindable mount is an invalid operation.


3) Setting mount states
-----------------------

    The mount command (util-linux package) can be used to set mount
    states::

        mount --make-shared mountpoint
        mount --make-slave mountpoint
        mount --make-private mountpoint
        mount --make-unbindable mountpoint


4) Use cases
------------

    A) A process wants to clone its own namespace, but still wants to
       access the CD that got mounted recently.

       Solution:

       The system administrator can make the mount at /cdrom shared::

           mount --bind /cdrom /cdrom
           mount --make-shared /cdrom

       Now any process that clones off a new namespace will have a
       mount at /cdrom which is a replica of the same mount in the
       parent namespace.

       So when a CD is inserted and mounted at /cdrom, that mount gets
       propagated to the other mounts at /cdrom in all the other cloned
       namespaces.

    B) A process wants its mounts invisible to any other process, but
       still be able to see the other system mounts.

       Solution:

       To begin with, the administrator can mark the entire mount tree
       as shareable::

           mount --make-rshared /

       A new process can clone off a new namespace and mark some part
       of its namespace as slave::

           mount --make-rslave /myprivatetree

       Henceforth any mounts within /myprivatetree done by the
       process will not show up in any other namespace. However, mounts
       done in the parent namespace under /myprivatetree still show
       up in the process's namespace.


    Apart from the above semantics, this feature provides the
    building blocks to solve the following problems:

    C) Per-user namespace

       The above semantics allow a way to share mounts across
       namespaces. But namespaces are associated with processes. If
       namespaces are made first-class objects with a user API to
       associate/disassociate a namespace with a userid, then each user
       could have his/her own namespace and tailor it to his/her
       requirements. This needs to be supported in PAM.

    D) Versioned files

       If the entire mount tree is visible at multiple locations, then
       an underlying versioning file system can return different
       versions of the file depending on the path used to access that
       file.

       An example is::

           mount --make-shared /
           mount --rbind / /view/v1
           mount --rbind / /view/v2
           mount --rbind / /view/v3
           mount --rbind / /view/v4

       and if /usr has a versioning filesystem mounted, then that
       mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
       /view/v4/usr too.

       A user can request the v3 version of the file /usr/fs/namespace.c
       by accessing /view/v3/usr/fs/namespace.c. The underlying
       versioning filesystem can then decipher that the v3 version of the
       filesystem is being requested and return the corresponding
       inode.

5) Detailed semantics
---------------------

    The section below explains the detailed semantics of
    bind, rbind, move, mount, umount and clone-namespace operations.

    Note: the word 'vfsmount' and the noun 'mount' have been used
    to mean the same thing throughout this document.

5a) Mount states

    A given mount can be in one of the following states:

    1) shared
    2) slave
    3) shared and slave
    4) private
    5) unbindable

    A 'propagation event' is defined as an event generated on a vfsmount
    that leads to mount or unmount actions in other vfsmounts.

    A 'peer group' is defined as a group of vfsmounts that propagate
    events to each other.

    (1) Shared mounts

        A 'shared mount' is defined as a vfsmount that belongs to a
        'peer group'.

        For example::

            mount --make-shared /mnt
            mount --bind /mnt /tmp

        The mount at /mnt and that at /tmp are both shared and belong
        to the same peer group. Anything mounted or unmounted under
        /mnt or /tmp is reflected in all the other mounts of its peer
        group.


    (2) Slave mounts

        A 'slave mount' is defined as a vfsmount that receives
        propagation events and does not forward propagation events.

        A slave mount, as the name implies, has a master mount from which
        mount/unmount events are received. Events do not propagate from
        the slave mount to the master. Only a shared mount can be made
        a slave by executing the following command::

            mount --make-slave mount

        A shared mount that is made a slave is no longer shared unless
        it is modified to become shared again.

    (3) Shared and Slave

        A vfsmount can be both shared and slave. This state
        indicates that the mount is a slave of some vfsmount, and
        has its own peer group too. This vfsmount receives propagation
        events from its master vfsmount, and also forwards propagation
        events to its 'peer group' and to its slave vfsmounts.

        Strictly speaking, the vfsmount is shared, having its own
        peer group, and this peer group is a slave of some other
        peer group.

        Only a slave vfsmount can be made 'shared and slave', either
        by executing the following command::

            mount --make-shared mount

        or by moving the slave vfsmount under a shared vfsmount.

    (4) Private mount

        A 'private mount' is defined as a vfsmount that does not
        receive or forward any propagation events.

    (5) Unbindable mount

        An 'unbindable mount' is defined as a vfsmount that does not
        receive or forward any propagation events and cannot
        be bind mounted.


    State diagram:

    The state diagram below explains the state transition of a mount,
    in response to various commands::

        ------------------------------------------------------------------------------
        |             |make-shared | make-slave     | make-private | make-unbindable |
        |-------------|------------|----------------|--------------|-----------------|
        |shared       |shared      | *slave/private | private      | unbindable      |
        |             |            |                |              |                 |
        |-------------|------------|----------------|--------------|-----------------|
        |slave        |shared      | **slave        | private      | unbindable      |
        |             |and slave   |                |              |                 |
        |-------------|------------|----------------|--------------|-----------------|
        |shared       |shared      | slave          | private      | unbindable      |
        |and slave    |and slave   |                |              |                 |
        |-------------|------------|----------------|--------------|-----------------|
        |private      |shared      | **private      | private      | unbindable      |
        |-------------|------------|----------------|--------------|-----------------|
        |unbindable   |shared      | **unbindable   | private      | unbindable      |
        ------------------------------------------------------------------------------

    * If the shared mount is the only mount in its peer group, making it
    a slave makes it private automatically, since there is no master to
    which it can be slaved.

    ** Slaving a non-shared mount has no effect on the mount.

    Apart from the commands listed above, the 'move' operation also changes
    the state of a mount depending on the type of the destination mount. It
    is explained in section 5d.
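
The state table can also be read as a small function. The following is a minimal Python sketch of that table, not kernel code: the name ``next_state`` and the ``alone_in_peer_group`` parameter are assumptions made for illustration only.

```python
# Hypothetical model of the state table above; all names are illustrative.
SHARED = "shared"
SLAVE = "slave"
SHARED_SLAVE = "shared and slave"
PRIVATE = "private"
UNBINDABLE = "unbindable"


def next_state(state, command, alone_in_peer_group=False):
    """Return the propagation state after applying a --make-* command."""
    if command == "make-private":
        return PRIVATE
    if command == "make-unbindable":
        return UNBINDABLE
    if command == "make-shared":
        # A slave keeps its master and additionally gets a peer group.
        return SHARED_SLAVE if state in (SLAVE, SHARED_SLAVE) else SHARED
    if command == "make-slave":
        if state == SHARED:
            # Footnote (*): with no peers there is no master to follow.
            return PRIVATE if alone_in_peer_group else SLAVE
        if state == SHARED_SLAVE:
            return SLAVE
        # Footnote (**): slaving a non-shared mount has no effect.
        return state
    raise ValueError(f"unknown command: {command}")
```

For example, ``next_state("slave", "make-shared")`` yields ``"shared and slave"``, which corresponds to the second row of the table.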

5b) Bind semantics

    Consider the following command::

        mount --bind A/a B/b

    where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
    is the destination mount and 'b' is the dentry in the destination mount.

    The outcome depends on the type of mount of 'A' and 'B'. The table
    below is a quick reference::

        ---------------------------------------------------------------------
        |                       BIND MOUNT OPERATION                        |
        |*******************************************************************|
        | source(A)->| shared   | private   | slave            | unbindable |
        | dest(B)    |          |           |                  |            |
        |    |       |          |           |                  |            |
        |    v       |          |           |                  |            |
        |*******************************************************************|
        | shared     | shared   | shared    | shared and slave | invalid    |
        |            |          |           |                  |            |
        | non-shared | shared   | private   | slave            | invalid    |
        *********************************************************************

    Details:

    1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C',
       which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
       mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
       'C3' ... are created and mounted at the dentry 'b' on all mounts
       where 'B' propagates to. A new propagation tree containing 'C1', ..,
       'Cn' is created. This propagation tree is identical to the
       propagation tree of 'B'. And finally the peer group of 'C' is merged
       with the peer group of 'A'.

    2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C',
       which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
       mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
       'C3' ... are created and mounted at the dentry 'b' on all mounts
       where 'B' propagates to. A new propagation tree is set up containing
       all the new mounts 'C', 'C1', .., 'Cn' with exactly the same
       configuration as the propagation tree for 'B'.

    3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
       mount 'C', which is a clone of 'A', is created. Its root dentry is
       'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1',
       'C2', 'C3' ... are created and mounted at the dentry 'b' on all
       mounts where 'B' propagates to. A new propagation tree containing
       the new mounts 'C', 'C1', .., 'Cn' is created. This propagation tree
       is identical to the propagation tree for 'B'. And finally the mount
       'C' and its peer group are made the slave of mount 'Z'. In other
       words, mount 'C' is in the state 'slave and shared'.

    4. 'A' is an unbindable mount and 'B' is a shared mount. This is an
       invalid operation.

    5. 'A' is a private mount and 'B' is a non-shared (private or slave or
       unbindable) mount. A new mount 'C', which is a clone of 'A', is
       created. Its root dentry is 'a'. 'C' is mounted on mount 'B' at
       dentry 'b'.

    6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C',
       which is a clone of 'A', is created. Its root dentry is 'a'. 'C' is
       mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
       peer group of 'A'.

    7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
       new mount 'C', which is a clone of 'A', is created. Its root dentry
       is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as
       a slave mount of 'Z'. In other words, 'A' and 'C' are both slave
       mounts of 'Z'. All mount/unmount events on 'Z' propagate to 'A' and
       'C'. But mount/unmount events on 'A' do not propagate anywhere else.
       Similarly, mount/unmount events on 'C' do not propagate anywhere
       else.

    8. 'A' is an unbindable mount and 'B' is a non-shared mount. This is an
       invalid operation. An unbindable mount cannot be bind mounted.
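
The eight cases above reduce to a lookup keyed on the source type and on whether the destination is shared. The sketch below is an illustrative Python rendering of the bind table only; ``BIND_RESULT`` and ``bind_result`` are hypothetical names, and the creation of the propagated clones 'C1' .. 'Cn' is not modelled.

```python
# Illustrative lookup of the bind table above; this is not kernel code.
BIND_RESULT = {
    # (destination is shared?, source type) -> type of the new mount 'C'
    (True, "shared"): "shared",
    (True, "private"): "shared",
    (True, "slave"): "shared and slave",
    (False, "shared"): "shared",
    (False, "private"): "private",
    (False, "slave"): "slave",
}


def bind_result(source_type, dest_is_shared):
    """Return the type of the clone created by `mount --bind A/a B/b`."""
    if source_type == "unbindable":
        # Cases 4 and 8: an unbindable mount can never be bind mounted.
        raise ValueError("an unbindable mount cannot be bind mounted")
    return BIND_RESULT[(dest_is_shared, source_type)]
```

Note that the destination type only matters when the source is private or slave: a shared source always yields a shared clone, and an unbindable source is always rejected.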

5c) Rbind semantics

    rbind is the same as bind, except that bind replicates only the
    specified mount, whereas rbind replicates all the mounts in the tree
    belonging to the specified mount. An rbind mount is a bind mount
    applied to all the mounts in the tree.

    If the source tree being rbound has some unbindable mounts, then each
    unbindable mount and the subtree under it are pruned in the new
    location.

    For example, let's say we have the following mount tree::

              A
            /   \
           B     C
          / \   / \
         D   E F   G

    Let's say all the mounts in the tree except mount C are of a type
    other than unbindable.

    If this tree is rbound to, say, Z, we will have the following tree at
    the new location::

              Z
              |
              A'
             /
            B'        Note how the tree under C is pruned
           / \        in the new location.
          D'  E'


5d) Move semantics

    Consider the following command::

        mount --move A B/b

    where 'A' is the source mount, 'B' is the destination mount and 'b' is
    the dentry in the destination mount.

    The outcome depends on the type of the mount of 'A' and 'B'. The table
    below is a quick reference::

        ---------------------------------------------------------------------
        |                       MOVE MOUNT OPERATION                        |
        |*******************************************************************|
        | source(A)->| shared   | private   | slave            | unbindable |
        | dest(B)    |          |           |                  |            |
        |    |       |          |           |                  |            |
        |    v       |          |           |                  |            |
        |*******************************************************************|
        | shared     | shared   | shared    | shared and slave | invalid    |
        |            |          |           |                  |            |
        | non-shared | shared   | private   | slave            | unbindable |
        *********************************************************************

    .. Note:: moving a mount residing under a shared mount is invalid.

    Details follow:

    1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
       mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2' ...
       'An' are created and mounted at dentry 'b' on all mounts that
       receive propagation from mount 'B'. A new propagation tree is
       created in the exact same configuration as that of 'B'. This new
       propagation tree contains all the new mounts 'A1', 'A2' ... 'An'.
       And this new propagation tree is appended to the already existing
       propagation tree of 'A'.

    2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
       mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2' ...
       'An' are created and mounted at dentry 'b' on all mounts that
       receive propagation from mount 'B'. The mount 'A' becomes a shared
       mount and a propagation tree is created which is identical to that
       of 'B'. This new propagation tree contains all the new mounts 'A1',
       'A2' ... 'An'.

    3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
       mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts
       'A1', 'A2' ... 'An' are created and mounted at dentry 'b' on all
       mounts that receive propagation from mount 'B'. A new propagation
       tree is created in the exact same configuration as that of 'B'. This
       new propagation tree contains all the new mounts 'A1', 'A2' ...
       'An'. And this new propagation tree is appended to the already
       existing propagation tree of 'A'. Mount 'A' continues to be the
       slave mount of 'Z' but it also becomes 'shared'.

    4. 'A' is an unbindable mount and 'B' is a shared mount. The operation
       is invalid, because mounting anything on the shared mount 'B' can
       create new mounts that get mounted on the mounts that receive
       propagation from 'B'. And since the mount 'A' is unbindable, cloning
       it to mount at other mountpoints is not possible.

    5. 'A' is a private mount and 'B' is a non-shared (private or slave or
       unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry
       'b'.

    6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
       is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
       shared mount.

    7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
       The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
       continues to be a slave mount of mount 'Z'.

    8. 'A' is an unbindable mount and 'B' is a non-shared mount. The mount
       'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be
       an unbindable mount.
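
Unlike bind, a move keeps the state of 'A' unless the destination is shared. A rule-based Python sketch of the move table follows; ``move_result`` is a hypothetical helper name, and the extra restriction from the note above (the source must not reside under a shared mount) is deliberately left out.

```python
def move_result(source_type, dest_is_shared):
    """State of mount 'A' after `mount --move A B/b`, per the move table."""
    if not dest_is_shared:
        # Cases 5 to 8: the mount is simply re-attached, state unchanged.
        return source_type
    if source_type == "unbindable":
        # Case 4: propagation would need clones of an unclonable mount.
        raise ValueError("cannot move an unbindable mount under a shared mount")
    if source_type == "slave":
        # Case 3: keeps its master 'Z' and gains a peer group.
        return "shared and slave"
    # Cases 1 and 2: shared stays shared, private becomes shared.
    return "shared"
```

The only cell that differs from the bind table is the unbindable source with a non-shared destination: moving it is allowed because no clone has to be created.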

5e) Mount semantics

    Consider the following command::

        mount device B/b

    'B' is the destination mount and 'b' is the dentry in the destination
    mount.

    The above operation is the same as the bind operation with the
    exception that the source mount is always a private mount.


5f) Unmount semantics

    Consider the following command::

        umount A

    where 'A' is a mount mounted on mount 'B' at dentry 'b'.

    If mount 'B' is shared, then all most-recently-mounted mounts at dentry
    'b' on mounts that receive propagation from mount 'B', and that do not
    have sub-mounts within them, are unmounted.

    Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
    each other.

    Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
    'B1', 'B2' and 'B3' respectively.

    Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
    mount 'B1', 'B2' and 'B3' respectively.

    If 'C1' is unmounted, all the mounts that are most-recently-mounted on
    'B1' and on the mounts that 'B1' propagates to are unmounted.

    'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
    on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.

    So all of 'C1', 'C2' and 'C3' should be unmounted.

    If either 'C2' or 'C3' has child mounts, then that mount is not
    unmounted, but all other mounts are unmounted. However, if 'C1' itself
    has sub-mounts, the umount operation fails entirely.
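
The example can be traced with a toy model. The Python sketch below assumes a hypothetical representation in which each peer mount maps to the list of mounts stacked at dentry 'b' (oldest first), and in which every key of ``stacks`` receives propagation from the target's parent. It illustrates the rule as described here and is not the kernel's algorithm.

```python
def propagate_umount(stacks, has_children, target_parent):
    """Unmount the top mount at dentry 'b' of `target_parent` and propagate.

    stacks:       dict mapping each peer mount to the mounts stacked at
                  dentry 'b', oldest first (hypothetical representation).
    has_children: set of mounts that have sub-mounts of their own.
    Returns the list of mounts that were unmounted.
    """
    top = stacks[target_parent][-1]
    if top in has_children:
        # The explicitly requested mount is busy: fail the whole operation.
        raise OSError("target has sub-mounts")
    removed = []
    for stack in stacks.values():
        # Only the most recently mounted mount is a candidate, and it is
        # skipped when it has sub-mounts.
        if stack and stack[-1] not in has_children:
            removed.append(stack.pop())
    return removed


stacks = {"B1": ["A1", "C1"], "B2": ["A2", "C2"], "B3": ["A3", "C3"]}
removed = propagate_umount(stacks, has_children={"C2"}, target_parent="B1")
# removed == ["C1", "C3"]; 'C2' stays mounted because it has a child mount
```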

5g) Clone Namespace

    A cloned namespace contains all the mounts of the parent namespace.

    Let's say 'A' and 'B' are the corresponding mounts in the parent and the
    child namespace.

    If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
    each other.

    If 'A' is a slave mount of 'Z', then 'B' is also a slave mount of
    'Z'.

    If 'A' is a private mount, then 'B' is a private mount too.

    If 'A' is an unbindable mount, then 'B' is an unbindable mount too.


6) Quiz
-------

    A. What is the result of the following command sequence?

       ::

           mount --bind /mnt /mnt
           mount --make-shared /mnt
           mount --bind /mnt /tmp
           mount --move /tmp /mnt/1

       What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
       Should they all be identical, or should only /mnt and /mnt/1 be
       identical?


    B. What is the result of the following command sequence?

       ::

           mount --make-rshared /
           mkdir -p /v/1
           mount --rbind / /v/1

       What should the content of /v/1/v/1 be?


    C. What is the result of the following command sequence?

       ::

           mount --bind /mnt /mnt
           mount --make-shared /mnt
           mkdir -p /mnt/1/2/3 /mnt/1/test
           mount --bind /mnt/1 /tmp
           mount --make-slave /mnt
           mount --make-shared /mnt
           mount --bind /mnt/1/2 /tmp1
           mount --make-slave /mnt

       At this point we have the first mount at /tmp and
       its root dentry is 1. Let's call this mount 'A'.
       And then we have a second mount at /tmp1 with root
       dentry 2. Let's call this mount 'B'.
       Next we have a third mount at /mnt with root dentry
       mnt. Let's call this mount 'C'.

       'B' is the slave of 'A' and 'C' is a slave of 'B'::

           A -> B -> C

       At this point, if we execute the following command::

           mount --bind /bin /tmp/test

       the mount is attempted on 'A'.

       Will the mount propagate to 'B' and 'C'?

       What would the contents of /mnt/1/test be?

7) FAQ
------

    Q1. Why is bind mount needed? How is it different from symbolic links?

        Symbolic links can get stale if the destination mount gets
        unmounted or moved. Bind mounts continue to exist even if the
        other mount is unmounted or moved.

    Q2. Why can't the shared subtree be implemented using exportfs?

        exportfs is a heavyweight way of accomplishing part of what
        shared subtree can do. I cannot imagine a way to implement the
        semantics of slave mount using exportfs.

    Q3. Why is unbindable mount needed?

        Let's say we want to replicate the mount tree at multiple
        locations within the same subtree.

        If one rbind mounts a tree within the same subtree 'n' times,
        the number of mounts created is an exponential function of 'n'.
        Having unbindable mounts can help prune the unneeded bind
        mounts. Here is an example.

700
701 step 1:
702 let's say the root tree has just two directories with
703 one vfsmount::
704
705 root
706 / \
707 tmp usr
708
709 And we want to replicate the tree at multiple
710 mountpoints under /root/tmp
711
712 step 2:
713 ::
714
715
716 mount --make-shared /root
717
718 mkdir -p /tmp/m1
719
720 mount --rbind /root /tmp/m1
721
722 the new tree now looks like this::
723
724 root
725 / \
726 tmp usr
727 /
728 m1
729 / \
730 tmp usr
731 /
732 m1
733
734 it has two vfsmounts
735
736 step 3:
737 ::
738
739 mkdir -p /tmp/m2
740 mount --rbind /root /tmp/m2
741
742 the new tree now looks like this::
743
744 root
745 / \
746 tmp usr
747 / \
748 m1 m2
749 / \ / \
750 tmp usr tmp usr
751 / \ /
752 m1 m2 m1
753 / \ / \
754 tmp usr tmp usr
755 / / \
756 m1 m1 m2
757 / \
758 tmp usr
759 / \
760 m1 m2
761
762 it has 6 vfsmounts
763
764 step 4:
765 ::
766 mkdir -p /tmp/m3
767 mount --rbind /root /tmp/m3
768
769 I won't draw the tree..but it has 24 vfsmounts
770
771
772 at step i the number of vfsmounts is V[i] = i*V[i-1].
773 This is an exponential function. And this tree has way more
774 mounts than what we really needed in the first place.
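
The recurrence is easy to check numerically. The snippet below is a small Python sketch, assuming V[1] = 1 as in step 1; ``vfsmount_count`` is an illustrative name.

```python
def vfsmount_count(steps):
    """Evaluate V[1] = 1 and V[i] = i * V[i-1], as stated in the text."""
    v = 1
    for i in range(2, steps + 1):
        v *= i
    return v


counts = [vfsmount_count(i) for i in range(1, 5)]
# counts == [1, 2, 6, 24], matching steps 1 to 4 above
```

Under this recurrence ten such rbind steps would already give 3628800 vfsmounts, which is why pruning matters.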

        One could use a series of umount at each step to prune
        out the unneeded mounts. But there is a better solution.
        Unbindable mounts come in handy here.

        step 1:
           let's say the root tree has just two directories with
           one vfsmount::

                  root
                 /    \
               tmp    usr

           How do we set up the same tree at multiple locations under
           /root/tmp?

        step 2:
           ::

               mount --bind /root/tmp /root/tmp

               mount --make-rshared /root
               mount --make-unbindable /root/tmp

               mkdir -p /tmp/m1

               mount --rbind /root /tmp/m1

           the new tree now looks like this::

                  root
                 /    \
               tmp    usr
               /
              m1
             /  \
           tmp  usr

        step 3:
           ::

               mkdir -p /tmp/m2
               mount --rbind /root /tmp/m2

           the new tree now looks like this::

                      root
                     /    \
                   tmp    usr
                  /   \
                m1     m2
               /  \   /  \
             tmp usr tmp usr

        step 4:
           ::

               mkdir -p /tmp/m3
               mount --rbind /root /tmp/m3

           the new tree now looks like this::

                          root
                      /           \
                     tmp           usr
                 /    \    \
               m1     m2     m3
              /  \    /  \    /  \
            tmp  usr  tmp usr tmp usr

8) Implementation
-----------------

8A) Datastructure

    4 new fields are introduced to struct vfsmount:

    *   ->mnt_share
    *   ->mnt_slave_list
    *   ->mnt_slave
    *   ->mnt_master

    ->mnt_share
        links together all the mounts to/from which this vfsmount
        sends/receives propagation events.

    ->mnt_slave_list
        links all the mounts to which this vfsmount propagates.

    ->mnt_slave
        links together all the slaves that its master vfsmount
        propagates to.

    ->mnt_master
        points to the master vfsmount from which this vfsmount
        receives propagation.

    ->mnt_flags
        takes two more flags to indicate the propagation status of
        the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
        vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
        replicated.

    All the shared vfsmounts in a peer group form a cyclic list through
    ->mnt_share.

    All vfsmounts with the same ->mnt_master form a cyclic list anchored
    in ->mnt_master->mnt_slave_list and going through ->mnt_slave.

    ->mnt_master can point to arbitrary (and possibly different) members
    of the master peer group. To find all immediate slaves of a peer group
    you need to go through _all_ ->mnt_slave_list of its members.
    Conceptually it's just a single set: distribution among the
    individual lists does not affect propagation or the way the propagation
    tree is modified by operations.

    All vfsmounts in a peer group have the same ->mnt_master. If it is
    non-NULL, they form a contiguous (ordered) segment of the slave list.

    An example propagation tree looks as shown in the figure below.
    [ NOTE: Though it looks like a forest, if we consider all the shared
    mounts as a conceptual entity called 'pnode', it becomes a tree]::


                A <--> B <--> C <---> D
               /|\            /|      |\
              / F G          J K      H I
             /
            E<-->K
                /|\
               M L N

    In the above figure A, B, C and D are all shared and propagate to each
    other. 'A' has 3 slave mounts 'E', 'F' and 'G'; 'C' has 2 slave
    mounts 'J' and 'K'; and 'D' has 2 slave mounts 'H' and 'I'.
    'E' is also shared with 'K' and they propagate to each other. And
    'K' has 3 slaves 'M', 'L' and 'N'.

    A's ->mnt_share links with the ->mnt_share of 'B', 'C' and 'D'

    A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'

    E's ->mnt_share links with ->mnt_share of K

    'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'

    'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'

    K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'

    C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'

    J and K's ->mnt_master points to struct vfsmount of C

    and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'

    'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
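
The rule that the immediate slaves of a peer group are spread over the ->mnt_slave_list of all its members can be illustrated with a toy model. The Python sketch below uses plain lists in place of the kernel's cyclic lists; the class and function names are assumptions, and only the top row of the figure is modelled (the 'E'/'K' peer group is left out).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Mount:
    name: str
    master: "Mount | None" = None                    # models ->mnt_master
    peers: list = field(default_factory=list)        # models the ->mnt_share ring
    slave_list: list = field(default_factory=list)   # models ->mnt_slave_list


def make_peers(*mounts):
    """Put the given mounts into one peer group."""
    for m in mounts:
        m.peers = [p for p in mounts if p is not m]


def add_slave(master, slave):
    """Make `slave` receive propagation from `master`."""
    slave.master = master
    master.slave_list.append(slave)


def immediate_slaves(mount):
    """Collect slaves from every member of the mount's peer group."""
    found = []
    for member in [mount] + mount.peers:
        found.extend(s.name for s in member.slave_list)
    return found


a, b, c, d = (Mount(n) for n in "ABCD")
make_peers(a, b, c, d)
for master, names in ((a, "EFG"), (c, "J"), (d, "HI")):
    for n in names:
        add_slave(master, Mount(n))
```

Here ``immediate_slaves(b)`` returns the slaves hanging off 'A', 'C' and 'D' even though 'B' has an empty slave list of its own.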


    NOTE: The propagation tree is orthogonal to the mount tree.

8B) Locking

    ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
    by namespace_sem (exclusive for modifications, shared for reading).

    Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
    There are two exceptions: do_add_mount() and clone_mnt().
    The former modifies a vfsmount that has not been visible in any shared
    data structures yet.
    The latter holds namespace_sem and the only references to vfsmount
    are in lists that can't be traversed without namespace_sem.

8C) Algorithm

    The crux of the implementation resides in the rbind/move operation.

    The overall algorithm breaks the operation into 3 phases (look at
    attach_recursive_mnt() and propagate_mnt()):

    1. prepare phase
    2. commit phase
    3. abort phase

    Prepare phase:

    For each mount in the source tree:

    a) Create the necessary number of mount trees to
       be attached to each of the mounts that receive
       propagation from the destination mount.
    b) Do not attach any of the trees to its destination.
       However, note down its ->mnt_parent and ->mnt_mountpoint.
    c) Link all the new mounts to form a propagation tree that
       is identical to the propagation tree of the destination
       mount.

    If this phase is successful, there should be 'n' new
    propagation trees, where 'n' is the number of mounts in the
    source tree. Go to the commit phase.

    Also there should be 'm' new mount trees, where 'm' is
    the number of mounts to which the destination mount
    propagates.

    If any memory allocations fail, go to the abort phase.

    Commit phase:
        attach each of the mount trees to their corresponding
        destination mounts.

    Abort phase:
        delete all the newly created trees.
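
The three phases follow a common all-or-nothing pattern: build everything first, then either attach it all or throw it all away. The Python sketch below only illustrates that pattern; it is not a rendering of attach_recursive_mnt() or propagate_mnt(), and the ``clone``, ``attach`` and ``discard`` hooks are hypothetical caller-supplied callables.

```python
def attach_with_propagation(source_tree, destinations, clone, attach, discard):
    """Three-phase attach: build every copy first, then commit or abort.

    clone/attach/discard are caller-supplied callables (hypothetical hooks).
    Returns True if all copies were attached, False if the operation aborted.
    """
    prepared = []
    try:
        # Prepare: one copy of the source tree per receiving mount,
        # remembering where it will go but not attaching it yet.
        for dest in destinations:
            prepared.append((dest, clone(source_tree)))
    except MemoryError:
        # Abort: nothing was attached, so just drop the copies made so far.
        for _, tree in prepared:
            discard(tree)
        return False
    # Commit: only reached when every allocation succeeded.
    for dest, tree in prepared:
        attach(dest, tree)
    return True
```

Because nothing is attached during the prepare phase, a failure midway leaves the existing mount tree untouched.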

    .. Note::
        all the propagation related functionality resides in the file pnode.c


------------------------------------------------------------------------

version 0.1  (created the initial document, Ram Pai linuxram@us.ibm.com)

version 0.2  (Incorporated comments from Al Viro)
150
151 The mount command (util-linux package) can be used to set mount
152 states::
153
154 mount --make-shared mountpoint
155 mount --make-slave mountpoint
156 mount --make-private mountpoint
157 mount --make-unbindable mountpoint
158
159
1604) Use cases
161------------
162
163 A) A process wants to clone its own namespace, but still wants to
164 access the CD that got mounted recently.
165
166 Solution:
167
168 The system administrator can make the mount at /cdrom shared::
169
170 mount --bind /cdrom /cdrom
171 mount --make-shared /cdrom
172
173 Now any process that clones off a new namespace will have a
174 mount at /cdrom which is a replica of the same mount in the
175 parent namespace.
176
177 So when a CD is inserted and mounted at /cdrom that mount gets
178 propagated to the other mount at /cdrom in all the other clone
179 namespaces.
180
181 B) A process wants its mounts invisible to any other process, but
182 still be able to see the other system mounts.
183
184 Solution:
185
186 To begin with, the administrator can mark the entire mount tree
187 as shareable::
188
189 mount --make-rshared /
190
191 A new process can clone off a new namespace. And mark some part
192 of its namespace as slave::
193
194 mount --make-rslave /myprivatetree
195
196 Hence forth any mounts within the /myprivatetree done by the
197 process will not show up in any other namespace. However mounts
198 done in the parent namespace under /myprivatetree still shows
199 up in the process's namespace.
200
201
202 Apart from the above semantics this feature provides the
203 building blocks to solve the following problems:
204
205 C) Per-user namespace
206
207 The above semantics allows a way to share mounts across
208 namespaces. But namespaces are associated with processes. If
209 namespaces are made first class objects with user API to
210 associate/disassociate a namespace with userid, then each user
211 could have his/her own namespace and tailor it to his/her
212 requirements. This needs to be supported in PAM.
213
214 D) Versioned files
215
216 If the entire mount tree is visible at multiple locations, then
217 an underlying versioning file system can return different
218 versions of the file depending on the path used to access that
219 file.
220
221 An example is::
222
223 mount --make-shared /
224 mount --rbind / /view/v1
225 mount --rbind / /view/v2
226 mount --rbind / /view/v3
227 mount --rbind / /view/v4
228
229 and if /usr has a versioning filesystem mounted, then that
230 mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and
231 /view/v4/usr too
232
233 A user can request v3 version of the file /usr/fs/namespace.c
234 by accessing /view/v3/usr/fs/namespace.c . The underlying
235 versioning filesystem can then decipher that v3 version of the
236 filesystem is being requested and return the corresponding
237 inode.
238
2395) Detailed semantics
240---------------------
241 The section below explains the detailed semantics of
242 bind, rbind, move, mount, umount and clone-namespace operations.
243
244 Note: the word 'vfsmount' and the noun 'mount' have been used
245 to mean the same thing, throughout this document.
246
2475a) Mount states
248
249 A given mount can be in one of the following states
250
251 1) shared
252 2) slave
253 3) shared and slave
254 4) private
255 5) unbindable
256
257 A 'propagation event' is defined as event generated on a vfsmount
258 that leads to mount or unmount actions in other vfsmounts.
259
260 A 'peer group' is defined as a group of vfsmounts that propagate
261 events to each other.
262
263 (1) Shared mounts
264
265 A 'shared mount' is defined as a vfsmount that belongs to a
266 'peer group'.
267
268 For example::
269
270 mount --make-shared /mnt
271 mount --bind /mnt /tmp
272
273 The mount at /mnt and that at /tmp are both shared and belong
274 to the same peer group. Anything mounted or unmounted under
275 /mnt or /tmp reflect in all the other mounts of its peer
276 group.
277
278
279 (2) Slave mounts
280
281 A 'slave mount' is defined as a vfsmount that receives
282 propagation events and does not forward propagation events.
283
284 A slave mount as the name implies has a master mount from which
285 mount/unmount events are received. Events do not propagate from
286 the slave mount to the master. Only a shared mount can be made
287 a slave by executing the following command::
288
289 mount --make-slave mount
290
291 A shared mount that is made as a slave is no more shared unless
292 modified to become shared.
293
294 (3) Shared and Slave
295
296 A vfsmount can be both shared as well as slave. This state
297 indicates that the mount is a slave of some vfsmount, and
298 has its own peer group too. This vfsmount receives propagation
299 events from its master vfsmount, and also forwards propagation
300 events to its 'peer group' and to its slave vfsmounts.
301
302 Strictly speaking, the vfsmount is shared having its own
303 peer group, and this peer-group is a slave of some other
304 peer group.
305
306 Only a slave vfsmount can be made as 'shared and slave' by
307 either executing the following command::
308
309 mount --make-shared mount
310
311 or by moving the slave vfsmount under a shared vfsmount.
312
313 (4) Private mount
314
315 A 'private mount' is defined as vfsmount that does not
316 receive or forward any propagation events.
317
318 (5) Unbindable mount
319
320 A 'unbindable mount' is defined as vfsmount that does not
321 receive or forward any propagation events and cannot
322 be bind mounted.
323
324
325 State diagram:
326
327 The state diagram below explains the state transition of a mount,
328 in response to various commands::
329
330 -----------------------------------------------------------------------
331 | |make-shared | make-slave | make-private |make-unbindab|
332 --------------|------------|--------------|--------------|-------------|
333 |shared |shared |*slave/private| private | unbindable |
334 | | | | | |
335 |-------------|------------|--------------|--------------|-------------|
336 |slave |shared | **slave | private | unbindable |
337 | |and slave | | | |
338 |-------------|------------|--------------|--------------|-------------|
339 |shared |shared | slave | private | unbindable |
340 |and slave |and slave | | | |
341 |-------------|------------|--------------|--------------|-------------|
342 |private |shared | **private | private | unbindable |
343 |-------------|------------|--------------|--------------|-------------|
344 |unbindable |shared |**unbindable | private | unbindable |
345 ------------------------------------------------------------------------
346
347 * if the shared mount is the only mount in its peer group, making it
348 slave, makes it private automatically. Note that there is no master to
349 which it can be slaved to.
350
351 ** slaving a non-shared mount has no effect on the mount.
352
353 Apart from the commands listed below, the 'move' operation also changes
354 the state of a mount depending on type of the destination mount. Its
355 explained in section 5d.
356
3575b) Bind semantics
358
359 Consider the following command::
360
361 mount --bind A/a B/b
362
363 where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
364 is the destination mount and 'b' is the dentry in the destination mount.
365
366 The outcome depends on the type of mount of 'A' and 'B'. The table
367 below contains quick reference::
368
369 --------------------------------------------------------------------------
370 | BIND MOUNT OPERATION |
371 |************************************************************************|
372 |source(A)->| shared | private | slave | unbindable |
373 | dest(B) | | | | |
374 | | | | | | |
375 | v | | | | |
376 |************************************************************************|
377 | shared | shared | shared | shared & slave | invalid |
378 | | | | | |
379 |non-shared| shared | private | slave | invalid |
380 **************************************************************************
381
382 Details:
383
384 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
385 which is clone of 'A', is created. Its root dentry is 'a' . 'C' is
386 mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
387 are created and mounted at the dentry 'b' on all mounts where 'B'
388 propagates to. A new propagation tree containing 'C1',..,'Cn' is
389 created. This propagation tree is identical to the propagation tree of
390 'B'. And finally the peer-group of 'C' is merged with the peer group
391 of 'A'.
392
393 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
394 which is clone of 'A', is created. Its root dentry is 'a'. 'C' is
395 mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ...
396 are created and mounted at the dentry 'b' on all mounts where 'B'
397 propagates to. A new propagation tree is set containing all new mounts
398 'C', 'C1', .., 'Cn' with exactly the same configuration as the
399 propagation tree for 'B'.
400
401 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
402 mount 'C' which is clone of 'A', is created. Its root dentry is 'a' .
403 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
404 'C3' ... are created and mounted at the dentry 'b' on all mounts where
405 'B' propagates to. A new propagation tree containing the new mounts
406 'C','C1',.. 'Cn' is created. This propagation tree is identical to the
407 propagation tree for 'B'. And finally the mount 'C' and its peer group
408 is made the slave of mount 'Z'. In other words, mount 'C' is in the
409 state 'slave and shared'.
410
411 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a
412 invalid operation.
413
414 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
415 unbindable) mount. A new mount 'C' which is clone of 'A', is created.
416 Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'.
417
418 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C'
419 which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
420 mounted on mount 'B' at dentry 'b'. 'C' is made a member of the
421 peer-group of 'A'.
422
423 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A
424 new mount 'C' which is a clone of 'A' is created. Its root dentry is
425 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a
426 slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of
427 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But
428 mount/unmount on 'A' do not propagate anywhere else. Similarly
429 mount/unmount on 'C' do not propagate anywhere else.
430
431 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a
432 invalid operation. A unbindable mount cannot be bind mounted.
433
4345c) Rbind semantics
435
436 rbind is same as bind. Bind replicates the specified mount. Rbind
437 replicates all the mounts in the tree belonging to the specified mount.
438 Rbind mount is bind mount applied to all the mounts in the tree.
439
440 If the source tree that is rbind has some unbindable mounts,
441 then the subtree under the unbindable mount is pruned in the new
442 location.
443
444 eg:
445
446 let's say we have the following mount tree::
447
448 A
449 / \
450 B C
451 / \ / \
452 D E F G
453
454 Let's say all the mount except the mount C in the tree are
455 of a type other than unbindable.
456
457 If this tree is rbound to say Z
458
459 We will have the following tree at the new location::
460
461 Z
462 |
463 A'
464 /
465 B' Note how the tree under C is pruned
466 / \ in the new location.
467 D' E'
468
469
470
4715d) Move semantics
472
473 Consider the following command
474
475 mount --move A B/b
476
477 where 'A' is the source mount, 'B' is the destination mount and 'b' is
478 the dentry in the destination mount.
479
480 The outcome depends on the type of the mount of 'A' and 'B'. The table
481 below is a quick reference::
482
483 ---------------------------------------------------------------------------
484 | MOVE MOUNT OPERATION |
485 |**************************************************************************
486 | source(A)->| shared | private | slave | unbindable |
487 | dest(B) | | | | |
488 | | | | | | |
489 | v | | | | |
490 |**************************************************************************
491 | shared | shared | shared |shared and slave| invalid |
492 | | | | | |
493 |non-shared| shared | private | slave | unbindable |
494 ***************************************************************************
495
496 .. Note:: moving a mount residing under a shared mount is invalid.
497
498 Details follow:
499
500 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
501 mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
502 are created and mounted at dentry 'b' on all mounts that receive
503 propagation from mount 'B'. A new propagation tree is created in the
504 exact same configuration as that of 'B'. This new propagation tree
505 contains all the new mounts 'A1', 'A2'... 'An'. And this new
506 propagation tree is appended to the already existing propagation tree
507 of 'A'.
508
509 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
510 mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An'
511 are created and mounted at dentry 'b' on all mounts that receive
512 propagation from mount 'B'. The mount 'A' becomes a shared mount and a
513 propagation tree is created which is identical to that of
514 'B'. This new propagation tree contains all the new mounts 'A1',
515 'A2'... 'An'.
516
517 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
518 mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
519 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
520 receive propagation from mount 'B'. A new propagation tree is created
521 in the exact same configuration as that of 'B'. This new propagation
522 tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
523 propagation tree is appended to the already existing propagation tree of
524 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also
525 becomes 'shared'.
526
527 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation
528 is invalid. Because mounting anything on the shared mount 'B' can
529 create new mounts that get mounted on the mounts that receive
530 propagation from 'B'. And since the mount 'A' is unbindable, cloning
531 it to mount at other mountpoints is not possible.
532
533 5. 'A' is a private mount and 'B' is a non-shared(private or slave or
534 unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'.
535
536 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A'
537 is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
538 shared mount.
539
540 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount.
541 The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A'
542 continues to be a slave mount of mount 'Z'.
543
544 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount
545 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
546 unbindable mount.
547
5485e) Mount semantics
549
550 Consider the following command::
551
552 mount device B/b
553
554 'B' is the destination mount and 'b' is the dentry in the destination
555 mount.
556
557 The above operation is the same as bind operation with the exception
558 that the source mount is always a private mount.
559
560
5615f) Unmount semantics
562
563 Consider the following command::
564
565 umount A
566
567 where 'A' is a mount mounted on mount 'B' at dentry 'b'.
568
569 If mount 'B' is shared, then all most-recently-mounted mounts at dentry
570 'b' on mounts that receive propagation from mount 'B' and does not have
571 sub-mounts within them are unmounted.
572
573 Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
574 each other.
575
576 let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
577 'B1', 'B2' and 'B3' respectively.
578
579 let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
580 mount 'B1', 'B2' and 'B3' respectively.
581
582 if 'C1' is unmounted, all the mounts that are most-recently-mounted on
583 'B1' and on the mounts that 'B1' propagates-to are unmounted.
584
585 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
586 on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
587
588 So all 'C1', 'C2' and 'C3' should be unmounted.
589
590 If any of 'C2' or 'C3' has some child mounts, then that mount is not
591 unmounted, but all other mounts are unmounted. However if 'C1' is told
592 to be unmounted and 'C1' has some sub-mounts, the umount operation is
593 failed entirely.
594
5955g) Clone Namespace
596
597 A cloned namespace contains all the mounts as that of the parent
598 namespace.
599
600 Let's say 'A' and 'B' are the corresponding mounts in the parent and the
601 child namespace.
602
603 If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
604 each other.
605
606 If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
607 'Z'.
608
609 If 'A' is a private mount, then 'B' is a private mount too.
610
611 If 'A' is unbindable mount, then 'B' is a unbindable mount too.
612
613
6146) Quiz
615
616 A. What is the result of the following command sequence?
617
618 ::
619
620 mount --bind /mnt /mnt
621 mount --make-shared /mnt
622 mount --bind /mnt /tmp
623 mount --move /tmp /mnt/1
624
625 what should be the contents of /mnt /mnt/1 /mnt/1/1 should be?
626 Should they all be identical? or should /mnt and /mnt/1 be
627 identical only?
628
629
630 B. What is the result of the following command sequence?
631
632 ::
633
634 mount --make-rshared /
635 mkdir -p /v/1
636 mount --rbind / /v/1
637
638 what should be the content of /v/1/v/1 be?
639
640
641 C. What is the result of the following command sequence?
642
643 ::
644
645 mount --bind /mnt /mnt
646 mount --make-shared /mnt
647 mkdir -p /mnt/1/2/3 /mnt/1/test
648 mount --bind /mnt/1 /tmp
649 mount --make-slave /mnt
650 mount --make-shared /mnt
651 mount --bind /mnt/1/2 /tmp1
652 mount --make-slave /mnt
653
654 At this point we have the first mount at /tmp and
655 its root dentry is 1. Let's call this mount 'A'
656 And then we have a second mount at /tmp1 with root
657 dentry 2. Let's call this mount 'B'
658 Next we have a third mount at /mnt with root dentry
659 mnt. Let's call this mount 'C'
660
661 'B' is the slave of 'A' and 'C' is a slave of 'B'
662 A -> B -> C
663
664 at this point if we execute the following command
665
666 mount --bind /bin /tmp/test
667
668 The mount is attempted on 'A'
669
670 will the mount propagate to 'B' and 'C' ?
671
672 what would be the contents of
673 /mnt/1/test be?
674
6757) FAQ
676
677 Q1. Why is bind mount needed? How is it different from symbolic links?
678 symbolic links can get stale if the destination mount gets
679 unmounted or moved. Bind mounts continue to exist even if the
680 other mount is unmounted or moved.
681
682 Q2. Why can't the shared subtree be implemented using exportfs?
683
684 exportfs is a heavyweight way of accomplishing part of what
685 shared subtree can do. I cannot imagine a way to implement the
686 semantics of slave mount using exportfs?
687
688 Q3 Why is unbindable mount needed?
689
690 Let's say we want to replicate the mount tree at multiple
691 locations within the same subtree.
692
693 if one rbind mounts a tree within the same subtree 'n' times
694 the number of mounts created is an exponential function of 'n'.
695 Having unbindable mount can help prune the unneeded bind
696 mounts. Here is an example.
697
698 step 1:
699 let's say the root tree has just two directories with
700 one vfsmount::
701
702 root
703 / \
704 tmp usr
705
706 And we want to replicate the tree at multiple
707 mountpoints under /root/tmp
708
709 step 2:
710 ::
711
712
713 mount --make-shared /root
714
715 mkdir -p /tmp/m1
716
717 mount --rbind /root /tmp/m1
718
719 the new tree now looks like this::
720
721 root
722 / \
723 tmp usr
724 /
725 m1
726 / \
727 tmp usr
728 /
729 m1
730
731 it has two vfsmounts
732
733 step 3:
734 ::
735
736 mkdir -p /tmp/m2
737 mount --rbind /root /tmp/m2
738
739 the new tree now looks like this::
740
741 root
742 / \
743 tmp usr
744 / \
745 m1 m2
746 / \ / \
747 tmp usr tmp usr
748 / \ /
749 m1 m2 m1
750 / \ / \
751 tmp usr tmp usr
752 / / \
753 m1 m1 m2
754 / \
755 tmp usr
756 / \
757 m1 m2
758
759 it has 6 vfsmounts
760
761 step 4:
762 ::
763 mkdir -p /tmp/m3
764 mount --rbind /root /tmp/m3
765
766 I won't draw the tree..but it has 24 vfsmounts
767
768
769 at step i the number of vfsmounts is V[i] = i*V[i-1].
770 This is an exponential function. And this tree has way more
771 mounts than what we really needed in the first place.
772
773 One could use a series of umount at each step to prune
774 out the unneeded mounts. But there is a better solution.
775 Unclonable mounts come in handy here.
776
777 step 1:
778 let's say the root tree has just two directories with
779 one vfsmount::
780
781 root
782 / \
783 tmp usr
784
785 How do we set up the same tree at multiple locations under
786 /root/tmp
787
788 step 2:
789 ::
790
791
792 mount --bind /root/tmp /root/tmp
793
794 mount --make-rshared /root
795 mount --make-unbindable /root/tmp
796
797 mkdir -p /tmp/m1
798
799 mount --rbind /root /tmp/m1
800
801 the new tree now looks like this::
802
803 root
804 / \
805 tmp usr
806 /
807 m1
808 / \
809 tmp usr
810
811 step 3:
812 ::
813
814 mkdir -p /tmp/m2
815 mount --rbind /root /tmp/m2
816
817 the new tree now looks like this::
818
819 root
820 / \
821 tmp usr
822 / \
823 m1 m2
824 / \ / \
825 tmp usr tmp usr
826
827 step 4:
828 ::
829
830 mkdir -p /tmp/m3
831 mount --rbind /root /tmp/m3
832
833 the new tree now looks like this::
834
835 root
836 / \
837 tmp usr
838 / \ \
839 m1 m2 m3
840 / \ / \ / \
841 tmp usr tmp usr tmp usr
842
8438) Implementation
844
8458A) Datastructure
846
847 4 new fields are introduced to struct vfsmount:
848
849 * ->mnt_share
850 * ->mnt_slave_list
851 * ->mnt_slave
852 * ->mnt_master
853
854 ->mnt_share
855 links together all the mount to/from which this vfsmount
856 send/receives propagation events.
857
858 ->mnt_slave_list
859 links all the mounts to which this vfsmount propagates
860 to.
861
862 ->mnt_slave
863 links together all the slaves that its master vfsmount
864 propagates to.
865
866 ->mnt_master
867 points to the master vfsmount from which this vfsmount
868 receives propagation.
869
870 ->mnt_flags
871 takes two more flags to indicate the propagation status of
872 the vfsmount. MNT_SHARE indicates that the vfsmount is a shared
873 vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be
874 replicated.
875
876 All the shared vfsmounts in a peer group form a cyclic list through
877 ->mnt_share.
878
879 All vfsmounts with the same ->mnt_master form on a cyclic list anchored
880 in ->mnt_master->mnt_slave_list and going through ->mnt_slave.
881
882 ->mnt_master can point to arbitrary (and possibly different) members
883 of master peer group. To find all immediate slaves of a peer group
884 you need to go through _all_ ->mnt_slave_list of its members.
885 Conceptually it's just a single set - distribution among the
886 individual lists does not affect propagation or the way propagation
887 tree is modified by operations.
888
889 All vfsmounts in a peer group have the same ->mnt_master. If it is
890 non-NULL, they form a contiguous (ordered) segment of slave list.
891
892 A example propagation tree looks as shown in the figure below.
893 [ NOTE: Though it looks like a forest, if we consider all the shared
894 mounts as a conceptual entity called 'pnode', it becomes a tree]::
895
896
897 A <--> B <--> C <---> D
898 /|\ /| |\
899 / F G J K H I
900 /
901 E<-->K
902 /|\
903 M L N
904
905 In the above figure A,B,C and D all are shared and propagate to each
906 other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave
907 mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'.
908 'E' is also shared with 'K' and they propagate to each other. And
909 'K' has 3 slaves 'M', 'L' and 'N'
910
911 A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D'
912
913 A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G'
914
915 E's ->mnt_share links with ->mnt_share of K
916
917 'E', 'K', 'F', 'G' have their ->mnt_master point to struct vfsmount of 'A'
918
919 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K'
920
921 K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N'
922
923 C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K'
924
925 J and K's ->mnt_master points to struct vfsmount of C
926
927 and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I'
928
929 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'.
930
931
932 NOTE: The propagation tree is orthogonal to the mount tree.
933
9348B Locking:
935
936 ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected
937 by namespace_sem (exclusive for modifications, shared for reading).
938
939 Normally we have ->mnt_flags modifications serialized by vfsmount_lock.
940 There are two exceptions: do_add_mount() and clone_mnt().
941 The former modifies a vfsmount that has not been visible in any shared
942 data structures yet.
943 The latter holds namespace_sem and the only references to vfsmount
944 are in lists that can't be traversed without namespace_sem.
945
9468C Algorithm:
947
948 The crux of the implementation resides in rbind/move operation.
949
950 The overall algorithm breaks the operation into 3 phases: (look at
951 attach_recursive_mnt() and propagate_mnt())
952
953 1. prepare phase.
954 2. commit phases.
955 3. abort phases.
956
957 Prepare phase:
958
959 for each mount in the source tree:
960
961 a) Create the necessary number of mount trees to
962 be attached to each of the mounts that receive
963 propagation from the destination mount.
964 b) Do not attach any of the trees to its destination.
965 However note down its ->mnt_parent and ->mnt_mountpoint
966 c) Link all the new mounts to form a propagation tree that
967 is identical to the propagation tree of the destination
968 mount.
969
970 If this phase is successful, there should be 'n' new
971 propagation trees; where 'n' is the number of mounts in the
972 source tree. Go to the commit phase
973
974 Also there should be 'm' new mount trees, where 'm' is
975 the number of mounts to which the destination mount
976 propagates to.
977
978 if any memory allocations fail, go to the abort phase.
979
980 Commit phase
981 attach each of the mount trees to their corresponding
982 destination mounts.
983
984 Abort phase
985 delete all the newly created trees.
986
987 .. Note::
988 all the propagation related functionality resides in the file pnode.c
989
990
991------------------------------------------------------------------------
992
993version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com)
994
995version 0.2 (Incorporated comments from Al Viro)