Conditions under which pivot_root succeeds

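Something made me wonder about the restrictions listed in man 2 pivot_root, so this entry is a rough read through the kernel code. I only skimmed it, so corrections are welcome. In fact, I wrote this largely in order to be corrected 😅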

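How pivot_root is used

When a container starts and the container image's root is set as the container's root, chroot is sometimes used, with privileges restricted so that the process cannot escape the chroot.

LXC and Docker, on the other hand, use pivot_root. chroot is comparatively easy to use, whereas pivot_root comes with several restrictions.

man 2 pivot_root describes those restrictions.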

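The following restrictions apply to new_root and put_old:

They must be directories.

new_root and put_old must not be on the same file system as the current root.

put_old must be underneath new_root. That is, adding one or more ../ to the string that points to put_old must yield the same directory as new_root.

No other file system may be mounted on put_old.

https://linuxjm.osdn.jp/html/LDP_man-pages/man2/pivot_root.2.html

(In the Japanese text of that page, 「差す」 looks like a typo for 「指す」, "to point to".)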

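This description is a little unsatisfying, though. LXC, for example, bind-mounts the container image's root onto a directory such as /usr/lib/lxc/rootfs and then calls pivot_root on it.

Because it is a bind mount, new_root can be seen as a newly mounted file system, but it can equally be seen as a directory on the same file system.

Then there is this: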

        /* change into new root fs */
        if (fchdir(newroot)) {
                SYSERROR("can't chdir to new rootfs '%s'", rootfs);
                goto fail;
        }

        /* pivot_root into our new root fs */
        if (pivot_root(".", ".")) {
                SYSERROR("pivot_root syscall failed");
                goto fail;
        }

        /*
         * at this point the old-root is mounted on top of our new-root
         * To unmounted it we must not be chdir'd into it, so escape back
         * to old-root
         */
        if (fchdir(oldroot) < 0) {
                SYSERROR("Error entering oldroot");
                goto fail;
        }
        if (umount2(".", MNT_DETACH) < 0) {
                SYSERROR("Error detaching old root");
                goto fail;
        }

        if (fchdir(newroot) < 0) {
                SYSERROR("Error re-entering newroot");
                goto fail;
        }

(src/lxc/conf.c as of lxc-2.0.8)

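This is the part in question. After changing into the container's root that is to become the new root (newroot), it calls pivot_root(".", "."), which mounts the previous root (oldroot) on the very same directory as newroot. It then unmounts oldroot. This does not look like a case where "adding one or more ../ to the string that points to put_old must yield the same directory as new_root".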
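As a minimal sketch of the same sequence outside LXC, the steps could be written as below. The helper name pivot_into is hypothetical and is not LXC's function. The sketch assumes Linux, calls the raw system call through syscall(2) because glibc provides no pivot_root wrapper, and assumes the caller already has the needed privileges inside its own mount namespace with new_root_path prepared as a mount point.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not LXC's code): make new_root_path the root of the
 * calling process using the pivot_root(".", ".") trick shown above.
 * Returns 0 on success and -1 on the first failing step.
 */
int pivot_into(const char *new_root_path)
{
        int ret = -1;
        int oldroot, newroot;

        /* Keep a handle on the current root so we can return to it later. */
        oldroot = open("/", O_DIRECTORY | O_RDONLY | O_CLOEXEC);
        if (oldroot < 0)
                return -1;

        newroot = open(new_root_path, O_DIRECTORY | O_RDONLY | O_CLOEXEC);
        if (newroot < 0)
                goto out_old;

        /* Change into the new root, then use "." for both arguments. */
        if (fchdir(newroot) < 0)
                goto out;
        if (syscall(SYS_pivot_root, ".", ".") < 0)
                goto out;

        /* The old root is now stacked on the new root: enter it and detach it. */
        if (fchdir(oldroot) < 0)
                goto out;
        if (umount2(".", MNT_DETACH) < 0)
                goto out;

        /* Finish inside the new root. */
        if (fchdir(newroot) < 0)
                goto out;

        ret = 0;
out:
        close(newroot);
out_old:
        close(oldroot);
        return ret;
}
```

A call such as pivot_into("/usr/lib/lxc/rootfs") would only succeed when every condition examined in the rest of this entry holds. With a path that does not exist it returns -1 before pivot_root is ever reached.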

So I looked at the kernel code to see what actually happens.

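The kernel comment

The pivot_root system call lives in fs/namespace.c. For some reason I have the 4.1.15 source tree at hand, and there the implementation starts at around line 2941.

In fact the comment there already contains a detailed explanation. Read it and the case may well be closed!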

/*
 * pivot_root Semantics:
 * Moves the root file system of the current process to the directory put_old,
 * makes new_root as the new root file system of the current process, and sets
 * root/cwd of all processes which had them on the current root to new_root.
 *
 * Restrictions:
 * The new_root and put_old must be directories, and  must not be on the
 * same file  system as the current process root. The put_old  must  be
 * underneath new_root,  i.e. adding a non-zero number of /.. to the string
 * pointed to by put_old must yield the same directory as new_root. No other
 * file system may be mounted on put_old. After all, new_root is a mountpoint.
 *
 * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
 * See Documentation/filesystems/ramfs-rootfs-initramfs.txt for alternatives
 * in this situation.
 *
 * Notes:
 *  - we don't move root/cwd if they are not at the root (reason: if something
 *    cared enough to change them, it's probably wrong to force them elsewhere)
 *  - it's okay to pick a root that isn't the root of a file system, e.g.
 *    /nfs/my_root where /nfs is the mount point. It must be a mountpoint,
 *    though, so you may need to say mount --bind /nfs/my_root /nfs/my_root
 *    first.
 */

My very (that is, terribly) loose rendering of it:

/*
 * pivot_root semantics:
 * Move the root file system of the current process to the directory
 * put_old and make new_root the new root file system of the current
 * process. Then set the root and cwd of every process that was using
 * the current root to new_root.
 *
 * Restrictions:
 * new_root and put_old must be directories, and they must not be on the
 * same file system as the current process's root. put_old must be
 * underneath new_root: appending one or more /.. to the string that
 * put_old points to must lead to the same directory as new_root. No other
 * file system may be mounted on put_old. In the end, new_root has to be a
 * mount point.
 *
 * In addition, the current root must not be on 'rootfs' (the initial
 * ramfs). For alternatives in that case, see
 * Documentation/filesystems/ramfs-rootfs-initramfs.txt.
 *
 * Notes:
 *  - A process's root/cwd is left alone if it is not at the current root
 *    (reason: if something bothered to change them, forcing them
 *    somewhere else is probably wrong).
 *  - A root that is not the root of a file system can be chosen, for
 *    example /nfs/my_root where /nfs is the mount point. It does have to
 *    be a mount point, so mount --bind /nfs/my_root /nfs/my_root may be
 *    needed first.
 */

That more or less settles it, doesn't it? :-)

Kernel code

Still, let's follow the kernel code and confirm which conditions are actually enforced.

SYSCALL_DEFINE2(pivot_root, const char __user *, new_root,
                const char __user *, put_old)
{
        struct path new, old, parent_path, root_parent, root;
        struct mount *new_mnt, *root_mnt, *old_mnt;
        struct mountpoint *old_mp, *root_mp;
        int error;

It obtains a struct path for each path given as a string. This seems to be where new_root and put_old are checked to be directories. It also obtains the path of the current process's root (= root).

        error = user_path_dir(new_root, &new);
  :(snip)
        error = user_path_dir(put_old, &old);
  :(snip)
        get_fs_root(current->fs, &root);

From each path it then obtains the struct mount.

        new_mnt = real_mount(new.mnt);
        root_mnt = real_mount(root.mnt);
        old_mnt = real_mount(old.mnt);

Not a shared mount

The first condition:

        if (IS_MNT_SHARED(old_mnt) ||
                IS_MNT_SHARED(new_mnt->mnt_parent) ||
                IS_MNT_SHARED(root_mnt->mnt_parent))

new_mnt and root_mnt are struct mount objects. Their mnt_parent member refers to the parent mount when mounts are nested, that is, when the mount point lies inside another mount and the mount is attached there (probably). (struct mount)

The mount that contains put_old, the parent mount of new_root's mount, and the parent mount of the current process's root mount must not be shared mounts.
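From user space, one way to see whether a mount is shared is the optional fields of /proc/[pid]/mountinfo, where a shared mount carries a shared:N tag. The sketch below only inspects a single line of that file. The helper name is hypothetical, and it reports on the mount described by the line itself, whereas the kernel check above looks at the parent mounts for new_root and the current root.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: report whether one line of /proc/<pid>/mountinfo
 * describes a shared mount. The optional fields (where "shared:N" can
 * appear) sit before the " - " separator, so a tag found after the
 * separator is ignored.
 */
bool mountinfo_line_is_shared(const char *line)
{
        const char *sep = strstr(line, " - ");
        const char *tag = strstr(line, " shared:");

        return tag != NULL && (sep == NULL || tag < sep);
}
```

Feeding it each line of /proc/self/mountinfo would show which mounts are likely to trip this condition when they are the parent of new_root or of the current root.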

The new root's mount is in the same mount namespace as the current process

The second condition:

 if (!check_mnt(root_mnt) || !check_mnt(new_mnt))
        goto out4;

check_mnt checks whether the mount namespace of the mount passed as mnt is the same as the current process's mount namespace:

static inline int check_mnt(struct mount *mnt)
{
        return mnt->mnt_ns == current->nsproxy->mnt_ns;
}

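In other words, the current root's mount and the new root's mount must both belong to the current process's mount namespace.

The current process's root, new, and old are different mounts

Going round in circles on a single file system is not allowed, so: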

        if (new_mnt == root_mnt || old_mnt == root_mnt)
                goto out4; /* loop, on the same file system  */

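the mount structure for the new root is the same as the mount structure for the current process's root, that is, they are the same mount (file system)
the mount structure for put_old is the same as the mount structure for the current process's root, that is, they are the same mount (file system)

Either case is an error. It is tempting to think that old_mnt and root_mnt ought to be the same thing, but old_mnt is not the old root: it comes from real_mount(old.mnt), so it is the mount on which the put_old directory currently sits. The check therefore rejects a put_old that lives directly on the current root's mount.

In the first place, if new_mnt and root_mnt were the same, pivot_root would not be moving the root anywhere.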

The current process's root is a mount point

root.mnt->mnt_root gives the root dentry of the mount (file system) that holds the current process's root. If it differs from the dentry of the current process's root, the call fails.

        if (root.mnt->mnt_root != root.dentry)
                goto out4; /* not a mountpoint */

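This is confusing, but the current process's root directory must not differ from the root of its mount. The failing case is presumably a process that has used chroot to make something other than a mount point its root.

The current process's root is attached

I do not fully understand this one, but it checks whether the current process's root mount is part of the mount parent/child tree.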

        if (!mnt_has_parent(root_mnt))
                goto out4; /* not attached */

mnt_has_parent lives in fs/mount.h.

static inline int mnt_has_parent(struct mount *mnt)
{
    return mnt != mnt->mnt_parent;
}

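It checks whether the given mnt differs from its own mnt_parent member, in other words whether it has a parent mount. (The structure is initialised with mnt->mnt_parent = mnt, so a mount that has not been placed in a parent/child relationship should have mnt == mnt->mnt_parent. Probably.)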
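The self-pointing convention can be illustrated with a toy model. This is a user-space sketch with made-up toy_ names, keeping only the one field that matters here; the real struct mount and the real attach path are far more involved.

```c
/* Toy stand-in for the kernel's struct mount: only the parent pointer. */
struct toy_mount {
        struct toy_mount *mnt_parent;
};

/* A freshly initialised mount is its own parent, as described above. */
void toy_mount_init(struct toy_mount *m)
{
        m->mnt_parent = m;
}

/* Simplified stand-in for attaching a mount underneath a parent mount. */
void toy_attach(struct toy_mount *child, struct toy_mount *parent)
{
        child->mnt_parent = parent;
}

/* Same comparison as mnt_has_parent(): attached means "not my own parent". */
int toy_mnt_has_parent(const struct toy_mount *m)
{
        return m != m->mnt_parent;
}
```

In this model a mount reports a parent only after toy_attach has been called on it, which is the property the "not attached" check relies on.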

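I believe that every properly mounted mount, including the root set up at boot, has a parent mount. When the root named on the kernel command line is mounted at boot, the code goes through do_move_mount, which performs a similar check, and if that failed the root could not be mounted (probably).

Somewhere along that path the original root's mount is stored in mnt_parent (probably in the do_move_mount → attach_mnt → mnt_set_mountpoint flow).

The new root is a mount point and is attached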

        if (new.mnt->mnt_root != new.dentry)
                goto out4; /* not a mountpoint */
        if (!mnt_has_parent(new_mnt))
                goto out4; /* not attached */

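This is the same as the checks on the current process's root described above. The new root directory must be the root of its mount, and that mount must be part of the mount parent/child tree.

old is underneath new (the new root)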

        /* make sure we can reach put_old from new_root */
        if (!is_path_reachable(old_mnt, old.dentry, &new))
                goto out4;

The name is_path_reachable gives a good idea of what it does:

/*
 * Return true if path is reachable from root
 *
 * namespace_sem or mount_lock is held
 */
bool is_path_reachable(struct mount *mnt, struct dentry *dentry,
                         const struct path *root)
{
        while (&mnt->mnt != root->mnt && mnt_has_parent(mnt)) {
                dentry = mnt->mnt_mountpoint;
                mnt = mnt->mnt_parent;
        }
        return &mnt->mnt == root->mnt && is_subdir(dentry, root->dentry);
}

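That is the function. In other words, it first runs this step:

while old and new are on different mounts and old's mount has a parent, move up to the parent mount (taking the mount point in the parent as the dentry)

and only then evaluates the condition, so the requirement is:

the mount reached by walking up from old is new's mount
old's directory (or the mount point reached by the walk) is a subdirectory of new's directory

Incidentally, is_subdir looks like this: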

3302/**
3303 * is_subdir - is new dentry a subdirectory of old_dentry
3304 * @new_dentry: new dentry
3305 * @old_dentry: old dentry
3306 *
3307 * Returns 1 if new_dentry is a subdirectory of the parent (at any depth).
3308 * Returns 0 otherwise.
3309 * Caller must ensure that "new_dentry" is pinned before calling is_subdir()
3310 */
3311
3312int is_subdir(struct dentry *new_dentry, struct dentry *old_dentry)
3313{
3314        int result;
3315        unsigned seq;
3316
3317        if (new_dentry == old_dentry)
3318                return 1;

(around is_subdir in fs/dcache.c)

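Line 3317 returns 1 even when the two directories are the same, so perhaps the description does not need the part about adding "/.." at all? This is also why LXC's pivot_root(".", "."), where put_old and new_root are the same directory, passes this check.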
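To make the walk concrete, here is a toy user-space model of the two functions. All toy_ names and struct layouts are made up for illustration. The real is_subdir also deals with RCU and rename sequence counters, which this sketch leaves out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy dentry: the root of a tree is its own parent. */
struct toy_dentry {
        struct toy_dentry *d_parent;
};

/* Toy mount: an unattached mount is its own parent. */
struct toy_mount {
        struct toy_mount *mnt_parent;
        struct toy_dentry *mnt_mountpoint; /* dentry covered in the parent */
};

struct toy_path {
        struct toy_mount *mnt;
        struct toy_dentry *dentry;
};

/* True if d is dir itself or lies somewhere below dir. */
static bool toy_is_subdir(struct toy_dentry *d, struct toy_dentry *dir)
{
        for (;;) {
                if (d == dir)
                        return true;
                if (d->d_parent == d) /* reached the top without a match */
                        return false;
                d = d->d_parent;
        }
}

/* Same shape as is_path_reachable(): climb mounts, then compare dentries. */
bool toy_is_path_reachable(struct toy_mount *mnt, struct toy_dentry *dentry,
                           const struct toy_path *root)
{
        while (mnt != root->mnt && mnt->mnt_parent != mnt) {
                dentry = mnt->mnt_mountpoint;
                mnt = mnt->mnt_parent;
        }
        return mnt == root->mnt && toy_is_subdir(dentry, root->dentry);
}
```

In this model, passing the same mount and dentry for both put_old and new_root skips the loop and returns true from the first comparison in toy_is_subdir, while a put_old that sits on the old root's mount is never reachable from new_root.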

new is underneath the current process's root

        /* make certain new is below the root */
        if (!is_path_reachable(new_mnt, new.dentry, &root))
                goto out4;

This is the same as the previous check (old is underneath new).

Summary

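Having got this far, I realised that the reason the man page and the code comment leave me unsure whether I have understood them is that the terminology is ambiguous.

File system: the information about a mount attached at a mount point, together with the tree it contains (= the `mount` structure)

Read that way, things become clear. That is,

For new_root and put_old, as man 2 pivot_root says:

They must be directories
They must not be on the same file system as the current root file system
put_old must be underneath new_root
No other file system may be mounted on put_old

In addition,

new_root must be underneath the root of the current process's root file system
The root of the current process's root file system must be a mount point
- chroot must not have moved the root away from the mount point
The root directory of new_root must be a mount point
The file system that contains put_old, the parent file system of new_mnt, and the parent file system of the current process's root file system must not be shared mounts

That is roughly it.

The actual contents of the file system (the directories and files underneath) play no part in these checks; what matters is the mount itself and the mount point. Since the new root only has to be a mount point, a bind mount is fine too.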
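The mount --bind /nfs/my_root /nfs/my_root step suggested in the kernel comment can be sketched with mount(2) directly. The helper name is hypothetical. The sketch assumes Linux and sufficient privileges, and it only covers the "must be a mount point" condition: the shared-mount check can still reject the result, depending on the propagation type of the parent mount.

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: bind-mount a directory onto itself so that it
 * becomes a mount point, like "mount --bind dir dir".
 * Returns 0 on success and -1 on failure, with errno set by mount(2).
 */
int make_mount_point(const char *dir)
{
        return mount(dir, dir, NULL, MS_BIND, NULL);
}
```

Calling make_mount_point on the intended new root before pivot_root is the programmatic counterpart of the bind mount that LXC performs on its rootfs directory.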

I ran out of steam after following just the condition checks in pivot_root, so there is no walkthrough of the actual mounting and moving 😅

TenForward

A technical blog. Moved here from Hatena Diary.
