The XFS real-time subvolume in Linux

Dubravko Markic
Hello,
       My name is Dubravko Markic and I am working with the XFS filesystem
as part of my project. I am using Linux, specifically a SuSE CORE 9
system. I have installed an XFS filesystem on one of the partitions on my
computer. The filesystem has the default configuration, which means that
no real-time subvolume is enabled.

My question to you is: what do I need to do to enable the real-time
subvolume on my XFS filesystem? Also, if I understood correctly, Linux does
not have the GRIO (guaranteed-rate I/O) option; it is only available under
the IRIX operating system, right? Thanks a bunch

Regards,
Dubo

Re: The XFS real-time subvolume in Linux

Eric Sandeen
Dubravko Markic wrote:
> Hello,
>       My name is Dubravko Markic and I am working with the XFS
> filesystem as part of my project. I am using Linux, specifically a
> SuSE CORE 9 system. I have installed an XFS filesystem on one of the
> partitions on my computer. The filesystem has the default
> configuration, which means that no real-time subvolume is enabled.
>
> My question to you is: what do I need to do to enable the real-time
> subvolume on my XFS filesystem?

If it is truly not enabled on your SuSE, then you will need to rebuild XFS
with CONFIG_XFS_RT turned on.  But:

penguin3:~ # zcat /proc/config.gz  | grep XFS_RT
CONFIG_XFS_RT=y
penguin3:~ # uname -a
Linux penguin3 2.6.5-7.193-debug #1 SMP Wed Jul 20 14:39:18 UTC 2005 i686 i686 i386 GNU/Linux

This is a SLES9SP2 box, but it appears that recent SuSE kernels are RT-enabled.
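The same check can be scripted. Below is a minimal Python sketch equivalent to `zcat /proc/config.gz | grep XFS_RT`. It assumes the running kernel exposes its build configuration as /proc/config.gz (only true when it was built with CONFIG_IKCONFIG_PROC), and the helper name `kernel_option` is hypothetical.

```python
import gzip
from typing import Optional


def kernel_option(config_path: str, option: str) -> Optional[str]:
    """Return the value of a kernel config option ("y", "m", ...) or None.

    config_path is a gzip-compressed kernel config such as /proc/config.gz.
    None means the option is absent or appears as "# CONFIG_FOO is not set"
    (such comment lines never match the "CONFIG_FOO=" prefix).
    """
    prefix = option + "="
    with gzip.open(config_path, "rt") as config:
        for line in config:
            line = line.strip()
            if line.startswith(prefix):
                return line[len(prefix):]
    return None
```

For example, `kernel_option("/proc/config.gz", "CONFIG_XFS_RT")` should return "y" on a kernel like the one above. Kernel support is only half of the answer: the real-time subvolume itself lives on a separate device that is named when the filesystem is created (mkfs.xfs -r rtdev=...) and again at mount time (mount -o rtdev=...).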

> Also, if I understood correctly,
> Linux does not have the GRIO (guaranteed-rate I/O) option; it is
> only available under the IRIX operating system, right? Thanks a bunch

That is basically true, yes.  There is a non-free GRIOV2 product in use with
CXFS, but for your purposes, I think it is safe to say that there is no
standalone GRIO equivalent on Linux.

-Eric

Re: The XFS real-time subvolume in Linux

Andi Kleen
Eric Sandeen <[hidden email]> writes:
>
> That is basically true, yes.  There is a non-free GRIOV2 product in
> use with CXFS, but for your purposes, I think it is safe to say that
> there is no standalone GRIO equivalent on Linux.

It's not. In fact it's a standard feature now.

The CFQ2 IO scheduler has IO priorities settable with ionice, including
an RT class with 8 priorities.

It's not available in SLES9 though, only in newer kernels (2.6.13+)
and SUSE releases (like SL10.0)
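To make the interface Andi describes concrete: ionice (for example `ionice -c 1 -n 0 <command>` for the highest RT level) and the underlying ioprio_set(2) system call express a priority as a class plus a level from 0 to 7. The Python sketch below only shows how the two are packed into a single integer; the constants mirror the kernel's ioprio definitions, the helper names are hypothetical, and Python's standard library has no wrapper for the system call itself.

```python
from enum import IntEnum
from typing import Tuple

# Layout used by the kernel's ioprio interface (see ioprio_set(2)):
# the class sits at bit 13 and above, the level (0..7) below it.
IOPRIO_CLASS_SHIFT = 13
IOPRIO_PRIO_MASK = (1 << IOPRIO_CLASS_SHIFT) - 1


class IoprioClass(IntEnum):
    NONE = 0  # no explicit priority set
    RT = 1    # real-time: served ahead of best-effort I/O
    BE = 2    # best-effort: the normal class
    IDLE = 3  # only served when nobody else needs the disk


def ioprio_value(io_class: IoprioClass, level: int) -> int:
    """Pack a class and a level (0 = highest, 7 = lowest) into one integer."""
    if not 0 <= level <= 7:
        raise ValueError("priority level must be between 0 and 7")
    return (int(io_class) << IOPRIO_CLASS_SHIFT) | level


def ioprio_split(value: int) -> Tuple[IoprioClass, int]:
    """Inverse of ioprio_value: recover the class and the level."""
    return IoprioClass(value >> IOPRIO_CLASS_SHIFT), value & IOPRIO_PRIO_MASK
```

Nothing in this value expresses a rate: it orders requests, it does not reserve bandwidth, which is the distinction the rest of the thread turns on.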

-Andi

Re: The XFS real-time subvolume in Linux

Eric Sandeen
Andi Kleen wrote:
> Eric Sandeen <[hidden email]> writes:
>
>>That is basically true, yes.  There is a non-free GRIOV2 product in
>>use with CXFS, but for your purposes, I think it is safe to say that
>>there is no standalone GRIO equivalent on Linux.
>
>
> It's not. In fact it's a standard feature now.

Well, I stand corrected then :)

> The CFQ2 IO scheduler has IO priorities settable with ionice, including
> an RT class with 8 priorities.

Well, that still sounds a bit different from the original IRIX GRIO
implementation, FWIW.

      grio - guaranteed-rate I/O

  DESCRIPTION

      Guaranteed-rate I/O (GRIO) refers to a guarantee made by the system to a
      user process indicating that the given process will receive data from a
      system resource at a predefined rate regardless of any other activity on
      the system.

While 2.6 can set priorities on I/O, it does not offer a hard guaranteed I/O
rate, does it?  Now, I'm not saying one scheme is necessarily better or
worse than the other, but they are different, I think.

-Eric

Re: The XFS real-time subvolume in Linux

Steve Lord
In reply to this post by Andi Kleen
Andi Kleen wrote:

> Eric Sandeen <[hidden email]> writes:
>
>>That is basically true, yes.  There is a non-free GRIOV2 product in
>>use with CXFS, but for your purposes, I think it is safe to say that
>>there is no standalone GRIO equivalent on Linux.
>
>
> It's not. In fact it's a standard feature now.
>
> The CFQ2 IO scheduler has IO priorities settable with ionice, including
> an RT class with 8 priorities.
>
> It's not available in SLES9 though, only in newer kernels (2.6.13+)
> and SUSE releases (like SL10.0)
>
> -Andi
>

Ah, but is there a bandwidth reservation system to go with it? That is
the missing link here, being able to say 'I need 10 Mbytes/sec until
further notice'.
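What Steve describes is admission control: a request for a rate is granted only if it still fits in what the device can deliver. A minimal Python sketch, assuming a single fixed worst-case capacity figure for the whole device (obtaining that figure is the hard part, as the rest of the thread discusses); the class and method names are hypothetical, and nothing like this is part of the ioprio interface.

```python
from typing import Dict


class BandwidthLedger:
    """Hypothetical admission control for bandwidth reservations.

    capacity_mbs is the rate (MB/s) the storage is assumed to sustain in the
    worst case; a reservation is admitted only if the total stays within it.
    """

    def __init__(self, capacity_mbs: float) -> None:
        self.capacity_mbs = capacity_mbs
        self.reservations: Dict[str, float] = {}

    def reserved(self) -> float:
        return sum(self.reservations.values())

    def reserve(self, client: str, rate_mbs: float) -> bool:
        """Reserve rate_mbs "until further notice"; False if it does not fit."""
        if rate_mbs <= 0:
            raise ValueError("rate must be positive")
        # A client re-reserving replaces its old figure rather than adding to it.
        current = self.reservations.get(client, 0.0)
        if self.reserved() - current + rate_mbs > self.capacity_mbs:
            return False
        self.reservations[client] = rate_mbs
        return True

    def release(self, client: str) -> None:
        self.reservations.pop(client, None)
```

With a 25 MB/s budget, two 10 MB/s reservations are admitted and a third is refused until one of the first two is released.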

Steve

Re: The XFS real-time subvolume in Linux

Andi Kleen
On Tuesday 04 October 2005 17:41, Steve Lord wrote:

> Andi Kleen wrote:
> > Eric Sandeen <[hidden email]> writes:
> >>That is basically true, yes.  There is a non-free GRIOV2 product in
> >>use with CXFS, but for your purposes, I think it is safe to say that
> >>there is no standalone GRIO equivalent on Linux.
> >
> > It's not. In fact it's a standard feature now.
> >
> > The CFQ2 IO scheduler has IO priorities settable with ionice, including
> > an RT class with 8 priorities.
> >
> > It's not available in SLES9 though, only in newer kernels (2.6.13+)
> > and SUSE releases (like SL10.0)
> >
> > -Andi
>
> Ah, but is there a bandwidth reservation system to go with it? That is
> the missing link here, being able to say 'I need 10 Mbytes/sec until
> further notice'.

Indirectly there is. The RT priority defines how many time slots you
get, and the length of the time slots is configurable using sysfs. If you
know the bandwidth of the disk, you can use that to define an approximation
of the guaranteed bandwidth for a specific RT priority.

On the other hand, I don't really see how you can get real bandwidths.
E.g. on most disks the bandwidth varies greatly depending on where
the blocks are allocated and how much seeking the disk does. If you take
all that into account, you'll probably get a pretty slow worst case
as the baseline to divide.

There is nothing that actually gives out bandwidths, but you could
do that in user space.
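Andi's indirect approach reduces to simple arithmetic: a queue's share of each scheduling round, times the disk's worst-case rate. A minimal Python sketch under loudly stated assumptions: fixed slice lengths, every busy queue using its whole slice, and no cost for switching between queues. This is a simplified model, not CFQ's actual scheduling algorithm, and the function name is hypothetical.

```python
from typing import Sequence


def approx_share_mbs(own_slice_ms: float,
                     other_slices_ms: Sequence[float],
                     worst_case_disk_mbs: float) -> float:
    """Approximate throughput one queue can count on in a round of time slices.

    Assumes every busy queue uses its full slice each round, that switching
    between queues costs nothing, and that the disk delivers at least
    worst_case_disk_mbs while serving this queue.
    """
    round_ms = own_slice_ms + sum(other_slices_ms)
    if round_ms <= 0:
        raise ValueError("slice lengths must add up to a positive round")
    return worst_case_disk_mbs * own_slice_ms / round_ms
```

For example, a 100 ms slice competing with one other 100 ms slice on a disk whose worst case is 30 MB/s works out to 15 MB/s. Ignoring seeks at slice boundaries makes the figure optimistic, and worst_case_disk_mbs is exactly the "pretty slow worst case" Andi warns about.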

Jens may correct me if I'm wrong on anything.

-Andi

Re: The XFS real-time subvolume in Linux

Steve Lord
Andi Kleen wrote:

> On Tuesday 04 October 2005 17:41, Steve Lord wrote:
>
>>Andi Kleen wrote:
>>
>>>Eric Sandeen <[hidden email]> writes:
>>>
>>>>That is basically true, yes.  There is a non-free GRIOV2 product in
>>>>use with CXFS, but for your purposes, I think it is safe to say that
>>>>there is no standalone GRIO equivalent on Linux.
>>>
>>>It's not. In fact it's a standard feature now.
>>>
>>>The CFQ2 IO scheduler has IO priorities settable with ionice, including
>>>an RT class with 8 priorities.
>>>
>>>It's not available in SLES9 though, only in newer kernels (2.6.13+)
>>>and SUSE releases (like SL10.0)
>>>
>>>-Andi
>>
>>Ah, but is there a bandwidth reservation system to go with it? That is
>>the missing link here, being able to say 'I need 10 Mbytes/sec until
>>further notice'.
>
>
> Indirectly there is. The RT priority defines how many time slots you
> get, and the length of the time slots is configurable using sysfs. If you
> know the bandwidth of the disk, you can use that to define an approximation
> of the guaranteed bandwidth for a specific RT priority.
>
> On the other hand, I don't really see how you can get real bandwidths.
> E.g. on most disks the bandwidth varies greatly depending on where
> the blocks are allocated and how much seeking the disk does. If you take
> all that into account, you'll probably get a pretty slow worst case
> as the baseline to divide.

If you get into this stuff seriously, you have to dedicate hardware
all the way from the CPU to the disks and make worst-case estimates of how
fast your disks will go. Multiply all this out and say that to record n
streams of HD video without dropping a frame, you need this much hardware.
It is a bit of a black art, and not something you just go out and
buy a PC to do. It tends to waste a lot of the potential bandwidth of
the hardware, but you try explaining to CNN why their satellite feed just
went black ;-)
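Steve's "multiply all this out" step is plain arithmetic. A Python sketch with illustrative figures that are not from the thread: the per-stream rate, the worst-case disk rate and the headroom are all assumptions you would have to measure or choose, the function name is hypothetical, and the assumption that bandwidth scales linearly with the number of disks is itself optimistic.

```python
import math


def disks_needed(streams: int, stream_mbs: float, disk_worst_mbs: float,
                 headroom: float = 0.25) -> int:
    """Disks needed so that n streams fit in the budgeted worst-case bandwidth.

    headroom is the fraction of each disk's worst-case rate left unused as a
    safety margin. Assumes streams are spread evenly across the disks and
    that aggregate bandwidth scales linearly with the number of disks.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_per_disk = disk_worst_mbs * (1 - headroom)
    return math.ceil(streams * stream_mbs / usable_per_disk)
```

Six 20 MB/s streams on disks whose worst case is 30 MB/s, keeping 25% headroom, come out at six disks rather than the four that a raw 120/30 division suggests; that gap is the "wasted" bandwidth Steve mentions.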

Steve

>
> There is nothing that actually gives out bandwidths, but you could
> do that in user space.
>
> Jens may correct me if I'm wrong on anything.
>
> -Andi
>

Re: The XFS real-time subvolume in Linux

Jens Axboe
On Tue, Oct 04 2005, Steve Lord wrote:

> Andi Kleen wrote:
> >On Tuesday 04 October 2005 17:41, Steve Lord wrote:
> >
> >>Andi Kleen wrote:
> >>
> >>>Eric Sandeen <[hidden email]> writes:
> >>>
> >>>>That is basically true, yes.  There is a non-free GRIOV2 product in
> >>>>use with CXFS, but for your purposes, I think it is safe to say that
> >>>>there is no standalone GRIO equivalent on Linux.
> >>>
> >>>It's not. In fact it's a standard feature now.
> >>>
> >>>The CFQ2 IO scheduler has IO priorities settable with ionice, including
> >>>an RT class with 8 priorities.
> >>>
> >>>It's not available in SLES9 though, only in newer kernels (2.6.13+)
> >>>and SUSE releases (like SL10.0)
> >>>
> >>>-Andi
> >>
> >>Ah, but is there a bandwidth reservation system to go with it? That is
> >>the missing link here, being able to say 'I need 10 Mbytes/sec until
> >>further notice'.
> >
> >
> >Indirectly there is. The RT priority defines how many time slots you
> >get, and the length of the time slots is configurable using sysfs. If you
> >know the bandwidth of the disk, you can use that to define an approximation
> >of the guaranteed bandwidth for a specific RT priority.
> >
> >On the other hand, I don't really see how you can get real bandwidths.
> >E.g. on most disks the bandwidth varies greatly depending on where
> >the blocks are allocated and how much seeking the disk does. If you take
> >all that into account, you'll probably get a pretty slow worst case
> >as the baseline to divide.
>
> If you get into this stuff seriously, you have to dedicate hardware
> all the way from the CPU to the disks and make worst-case estimates of how
> fast your disks will go. Multiply all this out and say that to record n
> streams of HD video without dropping a frame, you need this much hardware.
> It is a bit of a black art, and not something you just go out and
> buy a PC to do. It tends to waste a lot of the potential bandwidth of
> the hardware, but you try explaining to CNN why their satellite feed just
> went black ;-)

That's exactly why the Linux ioprio stuff has been designed the way it
is right now: it's not overengineered for something we cannot support
anyway. The CFQ I/O priorities will work well enough for general use; if
you are basing your business on GRIO, it's a different game completely. I
don't want to add kernel infrastructure for something that is very
specialized, especially because the code to do so would be 10 times
bigger and more complex than the current stuff.

--
Jens Axboe

Re: The XFS real-time subvolume in Linux

Eric Sandeen
Jens Axboe wrote:
> That's exactly why the Linux ioprio stuff has been designed the way it
> is right now: it's not overengineered for something we cannot support
> anyway. The CFQ I/O priorities will work well enough for general use; if
> you are basing your business on GRIO, it's a different game completely. I
> don't want to add kernel infrastructure for something that is very
> specialized, especially because the code to do so would be 10 times
> bigger and more complex than the current stuff.

Jens, I didn't mean to imply that you -should- have done a GRIO-type
design (and I doubt that Steve did, either.)  My only point was that
GRIO and ioprio are two different IO control mechanisms.

-Eric

Re: The XFS real-time subvolume in Linux

Jens Axboe
On Wed, Oct 05 2005, Eric Sandeen wrote:

> Jens Axboe wrote:
> >That's exactly why the Linux ioprio stuff has been designed the way it
> >is right now: it's not overengineered for something we cannot support
> >anyway. The CFQ I/O priorities will work well enough for general use; if
> >you are basing your business on GRIO, it's a different game completely. I
> >don't want to add kernel infrastructure for something that is very
> >specialized, especially because the code to do so would be 10 times
> >bigger and more complex than the current stuff.
>
> Jens, I didn't mean to imply that you -should- have done a GRIO-type
> design (and I doubt that Steve did, either.)  My only point was that
> GRIO and ioprio are two different IO control mechanisms.

Oh, I agree; I was mainly trying to clarify that they pertain to two
different market segments. Sorry if that wasn't clear; it's not a
criticism of GRIO (I don't really know anything about SGI's GRIO).

--
Jens Axboe

Re: The XFS real-time subvolume in Linux

Andi Kleen
In reply to this post by Eric Sandeen
On Wednesday 05 October 2005 16:11, Eric Sandeen wrote:

> Jens Axboe wrote:
> > That's exactly why the Linux ioprio stuff has been designed the way it
> > is right now: it's not overengineered for something we cannot support
> > anyway. The CFQ I/O priorities will work well enough for general use; if
> > you are basing your business on GRIO, it's a different game completely. I
> > don't want to add kernel infrastructure for something that is very
> > specialized, especially because the code to do so would be 10 times
> > bigger and more complex than the current stuff.
>
> Jens, I didn't mean to imply that you -should- have done a GRIO-type
> design (and I doubt that Steve did, either.)  My only point was that
> GRIO and ioprio are two different IO control mechanisms.

I suspect for most people they will be pretty much equivalent.

-Andi

Re: The XFS real-time subvolume in Linux

Jens Axboe
On Wed, Oct 05 2005, Andi Kleen wrote:

> On Wednesday 05 October 2005 16:11, Eric Sandeen wrote:
> > Jens Axboe wrote:
> > > That's exactly why the Linux ioprio stuff has been designed the way it
> > > is right now: it's not overengineered for something we cannot support
> > > anyway. The CFQ I/O priorities will work well enough for general use; if
> > > you are basing your business on GRIO, it's a different game completely. I
> > > don't want to add kernel infrastructure for something that is very
> > > specialized, especially because the code to do so would be 10 times
> > > bigger and more complex than the current stuff.
> >
> > Jens, I didn't mean to imply that you -should- have done a GRIO-type
> > design (and I doubt that Steve did, either.)  My only point was that
> > GRIO and ioprio are two different IO control mechanisms.
>
> I suspect for most people they will be pretty much equivalent.

Indeed, only for the 'obscure' life-or-death type setups will there be a
real difference.

--
Jens Axboe

Re: The XFS real-time subvolume in Linux

Andi Kleen
On Wednesday 05 October 2005 16:20, Jens Axboe wrote:

> Indeed, only for the 'obscure' life-or-death type setups will there be a
> real difference.

Even for those, couldn't you in theory do bandwidth allocation on top of
the RT classes by tweaking the time slices and knowing the worst-case
transfer rate of the HD? Basically it would be a user-space problem.

-Andi

Re: The XFS real-time subvolume in Linux

Jens Axboe
On Wed, Oct 05 2005, Andi Kleen wrote:
> On Wednesday 05 October 2005 16:20, Jens Axboe wrote:
>
> > Indeed, only for the 'obscure' life-or-death type setups will there be a
> > real difference.
>
> Even for those, couldn't you in theory do bandwidth allocation on top of
> the RT classes by tweaking the time slices and knowing the worst-case
> transfer rate of the HD? Basically it would be a user-space problem.

There are still unknowns, the HD still being the biggest one of course.
The problem is that you don't know the worst-case HD performance; it
might be doing all sorts of rewriting, calibration, error correction, etc.
that can still screw you. So I think without definite knowledge of
what the HD will do in case of errors (or a way to control that, which
you definitely can on some drives), it's still pretty hazy. It gets
better, but if you are looking for complete guarantees, I don't think
it's good enough.

--
Jens Axboe

Re: The XFS real-time subvolume in Linux

Andi Kleen
On Wednesday 05 October 2005 16:58, Jens Axboe wrote:

> There are still unknowns, the HD still being the biggest one of course.
> The problem is that you don't know the worst-case HD performance; it
> might be doing all sorts of rewriting, calibration, error correction, etc.
> that can still screw you. So I think without definite knowledge of
> what the HD will do in case of errors (or a way to control that, which
> you definitely can on some drives), it's still pretty hazy. It gets
> better, but if you are looking for complete guarantees, I don't think
> it's good enough.

Yes, but GRIO has exactly the same problem. I assume they need custom
calibration for each IO subsystem.

-Andi

Re: The XFS real-time subvolume in Linux

Jens Axboe
On Wed, Oct 05 2005, Andi Kleen wrote:

> On Wednesday 05 October 2005 16:58, Jens Axboe wrote:
>
> > There are still unknowns, the HD still being the biggest one of course.
> > The problem is that you don't know the worst-case HD performance; it
> > might be doing all sorts of rewriting, calibration, error correction, etc.
> > that can still screw you. So I think without definite knowledge of
> > what the HD will do in case of errors (or a way to control that, which
> > you definitely can on some drives), it's still pretty hazy. It gets
> > better, but if you are looking for complete guarantees, I don't think
> > it's good enough.
>
> Yes, but GRIO has exactly the same problem. I assume they need custom
> calibration for each IO subsystem.

Indeed it does, and yes, if they really want to provide the type of
guarantees that Steve listed, then that needs a custom box with either
custom or known disk firmware options. Otherwise you cannot give absolute
guarantees and expect to always honor them. That's in addition to
anything you may need to change in software; if using Linux, you would
need to audit/fix lots of things in the I/O path.

--
Jens Axboe

Re: The XFS real-time subvolume in Linux

Steve Lord
Jens Axboe wrote:

> On Wed, Oct 05 2005, Andi Kleen wrote:
>
>>On Wednesday 05 October 2005 16:58, Jens Axboe wrote:
>>
>>
>>>There are still unknowns, the HD still being the biggest one of course.
>>>The problem is that you don't know the worst-case HD performance; it
>>>might be doing all sorts of rewriting, calibration, error correction, etc.
>>>that can still screw you. So I think without definite knowledge of
>>>what the HD will do in case of errors (or a way to control that, which
>>>you definitely can on some drives), it's still pretty hazy. It gets
>>>better, but if you are looking for complete guarantees, I don't think
>>>it's good enough.
>>
>>Yes, but GRIO has exactly the same problem. I assume they need custom
>>calibration for each IO subsystem.
>
>
> Indeed it does, and yes, if they really want to provide the type of
> guarantees that Steve listed, then that needs a custom box with either
> custom or known disk firmware options. Otherwise you cannot give absolute
> guarantees and expect to always honor them. That's in addition to
> anything you may need to change in software; if using Linux, you would
> need to audit/fix lots of things in the I/O path.
>

Definitely. From memory, grio had a tool for measuring a disk subsystem.
I think reality is closer to over-speccing the hardware than to providing an
absolute guarantee of bandwidth. Being able to prioritize individual I/O calls
is just part of the picture.
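Steve mentions grio's tool for measuring a disk subsystem. The Python sketch below is not that tool and does not reproduce its method; it is a hypothetical illustration of the basic idea, timing a sequential read chunk by chunk so you can look at the spread rather than just an average. As written it reads through the page cache, so recently read files will look faster than the disk really is; a real calibration would bypass the cache and cover the whole device.

```python
import os
import time
from typing import List


def measure_read_mbs(path: str, chunk_bytes: int = 1 << 20) -> List[float]:
    """Read path sequentially and return the throughput of each chunk in MB/s.

    Reads go through the page cache, so cached data inflates the numbers.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    rates: List[float] = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            start = time.perf_counter()
            data = os.read(fd, chunk_bytes)
            elapsed = time.perf_counter() - start
            if not data:
                break  # end of file
            if elapsed > 0:  # skip chunks too fast for the timer to resolve
                rates.append(len(data) / elapsed / 1e6)
    finally:
        os.close(fd)
    return rates
```

Comparing `min(rates)` with `statistics.median(rates)` is the interesting part: the gap between them is the margin you end up over-speccing for, and even the minimum you observe is not necessarily the worst case the hardware can produce.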

Steve

Re: The XFS real-time subvolume in Linux

Jens Axboe
On Wed, Oct 05 2005, Steve Lord wrote:

> Jens Axboe wrote:
> >On Wed, Oct 05 2005, Andi Kleen wrote:
> >
> >>On Wednesday 05 October 2005 16:58, Jens Axboe wrote:
> >>
> >>
> >>>There are still unknowns, the HD still being the biggest one of course.
> >>>The problem is that you don't know the worst-case HD performance; it
> >>>might be doing all sorts of rewriting, calibration, error correction, etc.
> >>>that can still screw you. So I think without definite knowledge of
> >>>what the HD will do in case of errors (or a way to control that, which
> >>>you definitely can on some drives), it's still pretty hazy. It gets
> >>>better, but if you are looking for complete guarantees, I don't think
> >>>it's good enough.
> >>
> >>Yes, but GRIO has exactly the same problem. I assume they need custom
> >>calibration for each IO subsystem.
> >
> >
> >Indeed it does, and yes, if they really want to provide the type of
> >guarantees that Steve listed, then that needs a custom box with either
> >custom or known disk firmware options. Otherwise you cannot give absolute
> >guarantees and expect to always honor them. That's in addition to
> >anything you may need to change in software; if using Linux, you would
> >need to audit/fix lots of things in the I/O path.
> >
>
> Definitely. From memory, grio had a tool for measuring a disk subsystem.
> I think reality is closer to over-speccing the hardware than to providing an
> absolute guarantee of bandwidth. Being able to prioritize individual I/O calls
> is just part of the picture.

The spec'ing part is the last piece in the puzzle; it will only tell
you what the I/O subsystem will typically do. It won't tell you how it
behaves in boundary or error conditions. It's trivial to say "oh, the
disk does 35MiB/sec at the end zone, let's cap it at 20MiB/sec" and have
it work 99.9% of the time; the tricky part is the last 0.1%. Or whatever
percentage you want to assign to it :-)
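Jens's 99.9% figure can be read as a measurement: over many sampled intervals, in what fraction did the device deliver at least the capped rate? A minimal Python sketch with a hypothetical helper name; the samples would come from your own measurements, and the figures in the example below are illustrative only. A perfect score covers only the conditions you actually observed, which is exactly his point about error handling and the last 0.1%.

```python
from typing import Sequence


def cap_hit_rate(samples_mibs: Sequence[float], cap_mibs: float) -> float:
    """Fraction of sampled intervals in which the device delivered >= cap_mibs."""
    if not samples_mibs:
        raise ValueError("need at least one sample")
    ok = sum(1 for s in samples_mibs if s >= cap_mibs)
    return ok / len(samples_mibs)
```

For example, 998 intervals at 35 MiB/s plus two slow intervals (12 and 4 MiB/s) give a hit rate of 0.998 against a 20 MiB/s cap; the two misses are the intervals a guarantee would have to explain.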

--
Jens Axboe