all 47 comments

[–]alexbuzzbee 27 points (7 children)

The missing 1.5 GiB/s is probably kernel overhead and other processes.

Try it in emergency mode for slightly more speed!

[–]kjensenxz[S] 13 points (6 children)

I considered running it in single-user mode, writing a simple ring 0 program to boot off of, or building a custom tiny kernel that uses it as init, to squeeze as much speed as possible out of the program, but I think I've spent enough time on this; I started writing it around 4 or 5 hours ago. If anyone would like to take a crack at it, I'd love to see how it compares to running on a regular system.

[–]josuahdemangeon 6 points (5 children)

I learned something today!

For the yes command, I still prefer the first implementation. Maybe dd also has this kind of optimization.

[–]kjensenxz[S] 2 points (4 children)

I really like the readability of the first iteration and NetBSD's, which are very similar, but they just aren't as quick. It makes me wonder whether several subsequent calls to a stdio function could be optimized to the same speed inside the library itself. Maybe another time I'll look into that, dd, and cat!
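
For instance, a minimal sketch of that idea: keep the readable loop, but hand stdio a large, fully buffered output buffer so the library batches the small writes itself (setvbuf is standard C; the 64 KiB size is an arbitrary pick):

#include <stdio.h>

#define BUFSIZE (1 << 16) /* 64 KiB, a multiple of the 4096-byte page */

int main(void) {
    static char buf[BUFSIZE];
    /* fully buffered: stdout is flushed only when buf fills */
    setvbuf(stdout, buf, _IOFBF, BUFSIZE);
    for (;;)
        fputs("y\n", stdout);
}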

[–]Malor 1 point (3 children)

How often do you need 10 gigabytes a second of 'y'? That's just a bizarre use case. I have never seen a program that can handle the user saying "yes" ten billion times a second.

10 gigs a second of null chars or zeroes or spaces, sure, those might be useful. But overoptimizing "yes" seems a little strange to me.

If anything, you'd want to generalize to emit any character; at that point, the speed might matter.

[–]kjensenxz[S] 1 point (0 children)

You make an excellent point, and yes is meant to do this (it sends argv instead of "y"); the programs could easily be modified to send any value based on argv just by changing the buffer subroutine, along the lines of the sketch below. I would have added that to the program demos, but I felt it would be excessive.
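
Something along these lines would do it (a hypothetical sketch; it falls back to "y" when no argument is given and reuses the plain write() loop):

#include <string.h>
#include <unistd.h>

#define TOTAL 8192

int main(int argc, char **argv) {
    char buf[TOTAL];
    char *word = argc > 1 ? argv[1] : "y"; /* emit argv[1] if given */
    size_t len = strlen(word);
    size_t used = 0;
    /* repeat "word\n" until another copy won't fit */
    while (used + len + 1 <= TOTAL) {
        memcpy(buf + used, word, len);
        used += len;
        buf[used++] = '\n';
    }
    while (write(1, buf, used) > 0);
    return 1;
}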

[–]iluvatar 0 points (1 child)

If anything, you'd want to generalize to emit any character

yes already does this. Indeed, it goes further and repeatedly emits any arbitrary string. It's had this behaviour for at least the 30 years that I've been using it.

[–]Malor 0 points (0 children)

Huh, I never realized that. "yes no" is a thing.

[–]jmtd 21 points (1 child)

It's a shame they didn't finish their kernel, but at least they got yes working at 10GiB/s.

[–]kjensenxz[S] 2 points (0 children)

This should be a fortune

[–]pixel4 8 points (5 children)

I wonder if it would help to page-align your heap allocation (buf).

[–]kjensenxz[S] 5 points (4 children)

I used aligned_alloc and actually got worse performance, generally 0.2 GiB/s slower than elagergren's Go implementation and the C/assembly implementations (a modified 4th iteration, if you'd like to check):

//char *buf = malloc(TOTAL);
char *buf = aligned_alloc(4096, TOTAL); /* align to the 4096-byte page size */

[–]patrickbarnes 1 point (3 children)

What happens if you stack allocate your buf?

[–]kjensenxz[S] 0 points (2 children)

That's effectively what happens in the assembly code, since it compiles the values into the binary. Here's a sample (the .y's repeat for another 500 lines or so):

00000080: 48ff c748 be9b 0040 0000 0000 00ba 0020  H..H...@....... 
00000090: 0000 b801 0000 000f 05eb f779 0a79 0a79  ...........y.y.y
000000a0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y
000000b0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y
000000c0: 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79  .y.y.y.y.y.y.y.y

I don't know if the stack has any greater performance than the heap for something like this (we don't really need to do any memory "bookkeeping", and after all, memory is just memory), and it might mean slower initialization of the program, since it would have to read a larger binary for the buffer rather than build one in memory.

[–]Vogtinator 1 point (1 child)

That's effectively what happens in the assembly code, since it compiles the values into the binary.

That's not the stack, that's .data (or in this case .text, since no section is specified otherwise)

To get it on the stack, you would need to:

sub rsp, 8192   ; reserve 8 KiB of stack space
mov rdi, rsp    ; dst: the new stack buffer
mov rsi, y      ; src: the "y\n" data baked into the binary
mov rdx, 8192   ; length
call memcpy

Or something like that.
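
For comparison, a rough C equivalent of the same idea (hypothetical; the buffer is an automatic array, so it lives on the stack and is filled at startup):

#include <string.h>
#include <unistd.h>

#define TOTAL 8192

int main(void) {
    char buf[TOTAL]; /* automatic storage: the stack, not .text or the heap */
    for (int i = 0; i < TOTAL; i += 2)
        memcpy(buf + i, "y\n", 2);
    while (write(1, buf, TOTAL) > 0);
    return 1;
}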

[–]kjensenxz[S] 1 point (0 children)

Thanks, I thought certain data in .text was put onto the stack (e.g. consts).

[–]jmickeyd 6 points (1 child)

I'm curious about vmsplice performance on Linux. You could potentially have a single page of "y\n"s passed multiple times in the iov. That way you have fewer syscalls without using more RAM. Although at some point (possibly already), pv is going to be the bottleneck.
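
A rough sketch of that idea, assuming Linux's vmsplice(2) (stdout must actually be a pipe for this to work, and the iovec count is an arbitrary pick):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>

#define PAGE 4096
#define IOVECS 64

int main(void) {
    /* one page of "y\n"s, referenced IOVECS times per syscall */
    static char page[PAGE] __attribute__((aligned(PAGE)));
    struct iovec iov[IOVECS];
    for (int i = 0; i < PAGE; i += 2)
        memcpy(page + i, "y\n", 2);
    for (int i = 0; i < IOVECS; i++) {
        iov[i].iov_base = page;
        iov[i].iov_len = PAGE;
    }
    /* hands the same user page to the pipe repeatedly, rather than
       copying the data into a fresh buffer for every entry */
    while (vmsplice(1, iov, IOVECS, 0) > 0);
    return 1;
}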

[–]kjensenxz[S] 2 points (0 children)

When I was writing the conclusion, I wondered how much pv was limiting. I took a stab at it with dd, but it was an even worse bottleneck:

$ ./yes | dd of=/dev/null bs=8192
29703569408 bytes (30 GB, 28 GiB) copied, 5.34847 s, 5.6 GB/s

I've seen pv measure as high as 11.2 GiB/s, which really makes me wonder how much of a bottleneck each piece actually is; if it weren't so late, I would definitely go poking around to check. I'll try to remember to do it tomorrow, and of course, anyone else who's interested is invited to as well!

[–]pixel4 5 points (7 children)

On my MacBook, BUFSIZ is only 1024. But if I make my buffer 16k then things speed up. Maybe it's making better use of L1. shrug

[–]kjensenxz[S] 1 point (6 children)

I'm not sure which architecture your MacBook is (x86_64? ARM? Ancient PPC?), but I noticed that the speed really has to do with the size of your buffer compared to your pages (4096 bytes on x86), and making sure that you can fill up at least one (two is better, IIRC). I'm not sure how much of it is served from L1, but if it were, throughput should be in the hundreds of gigabytes per second, in which case pv would definitely be the bottleneck.
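
A short sketch of that rule of thumb, assuming POSIX sysconf rather than a hardcoded 4096 (the two-page choice is arbitrary):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* query the real page size instead of assuming 4096 */
    long page = sysconf(_SC_PAGESIZE);
    long total = 2 * page; /* fill at least two pages */
    printf("page size: %ld bytes, buffer size: %ld bytes\n", page, total);
    return 0;
}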

[–]wrosecrans 5 points (5 children)

It'll be x86_64 (or technically it could be x86 if it is the first gen Core Duo.) The PPC Laptops were all branded "PowerBook" or "iBook," and Apple hasn't shipped an ARM laptop.

[–]kjensenxz[S] 0 points (4 children)

Thanks! I didn't know about the PPC branding or the lack of an ARM; I was thinking the A10 was in the MacBook Air.

[–]wrosecrans 4 points (3 children)

The phones and tablets are all ARM. At this point, the iPad Pro with an optional keyboard attached to it is suspiciously similar to a laptop, but not quite. The Mac is currently all x86_64. The MacBook Pro does have a little ARM in it hidden away to control the touchbar panel, but you generally wouldn't program it directly. (Most systems have a couple of little processors like that in them these days. There's probably at least one more in the wifi controller or something.)

Running a normal process in the OS is always on the Intel CPU.

Historical trivia: The PowerPC laptops were called "PowerBook." The PowerPC Macs were called "PowerMac." But the original PowerBooks predated the PPC CPUs and were all 68k. It was just coincidence when the CPU and laptop branding lined up with Power in the name.

[–]kjensenxz[S] 3 points (2 children)

The MacBook Pro does have a little ARM in it hidden away to control the touchbar panel, but you generally wouldn't program it directly

Someone put the original Doom on the touch bar, which makes me wonder about the interface with the operating system and hardware, and the specs of it - how fast can it run yes?

[–]jmtd 3 points (0 children)

That is a cute hack, but I think they're still running Doom on the CPU and only rendering on the bar; not running it on the ARM.

[–]video_descriptionbot 0 points (0 children)

Title: Doom on the MacBook Pro Touch Bar
Description: Doom runs on everything… but can it run on the new MacBook Pro Touch Bar? Let's find out!
Length: 0:00:58

I am a bot, this is an auto-generated reply

[–]tiltowaitt 4 points (3 children)

This is pretty interesting. Is there a real-world advantage on modern systems to such speed in the GNU yes?

[–]kjensenxz[S] 6 points (2 children)

I really can't think of any real advantage of yes being faster other than being able to say "look, mine's faster!", since the likelihood of needing 5 billion "y's" per second is almost 0. It might have one or two use cases in which its efficiency is actually useful, perhaps in embedded systems running several operations concurrently? A couple of people have mentioned dd and cat, which makes me wonder if the same thing could be done to either (or both) of them to speed them up as greatly, and I plan on taking a stab at them fairly soon if someone doesn't beat me to it.

[–]shitty_po 3 points (1 child)

dd is somewhat bound by POSIX saying the default block size needs to be 512 bytes.

You can specify another with bs=, but many people don't know about it.
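
For example (bs= is dd's operand for overriding the block size; throughput will vary by machine):

$ dd if=/dev/zero of=/dev/null bs=512   # POSIX default: 512-byte blocks
$ dd if=/dev/zero of=/dev/null bs=64K   # larger blocks, far fewer syscalls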

[–]kjensenxz[S] 1 point (0 children)

Good to know; I would have gone hacking at the source and might have accidentally PR'd something non-compliant. It'd make a good exercise for a custom (read: nonstandard) system, though.

[–]phedny 3 points (3 children)

I've been able to increase speed using scatter/gather I/O with this implementation. Would love to see how it measures up on the machine you used for the other measurements:

#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

#define LEN 2
#define TOTAL 8192
#define IOVECS 256

int main() {
    char yes[LEN] = {'y', '\n'};
    char *buf = malloc(TOTAL);
    int bufused = 0;
    int i;
    struct iovec iov[IOVECS];
    while (bufused < TOTAL) {
        memcpy(buf + bufused, yes, LEN);
        bufused += LEN;
    }
    /* every iovec entry points at the same 8 KiB buffer */
    for (i = 0; i < IOVECS; i++) {
        iov[i].iov_base = buf;
        iov[i].iov_len = TOTAL;
    }
    while (writev(1, iov, IOVECS));
    return 1;
}

[–]kjensenxz[S] 0 points (2 children)

What's your speed on both GNU yes and your revision? On the OP build machine:

$ gcc yes.c
$ ./yes | pv > /dev/null
... [9.05GiB/s] ...

[–]phedny 1 point (1 child)

I did this on a VPS, so numbers are not very stable, but around 1 GB/s on iteration 4 and around 1.7 GB/s on the iovec version. There might be another bottleneck at play here.

[–]kjensenxz[S] 0 points (0 children)

I did this on a VPS

Interesting, I just tried this on my VPS:

$ ./yes | pv > /dev/null #iteration 4
... [ 488MiB/s] ...
$ ./iovecyes | pv > /dev/null
... [ 964MiB/s] ...

Very strange, so I decided to test it in a virtual machine (NetBSD):

$ ./yes | pv > /dev/null
... [ 801MiB/s] ...
$ ./iovecyes | pv >/dev/null
... [ 990 MiB/s] ...

Both of these fluctuated from about 450 to 993 MiB/s. I don't know if results gathered under a hypervisor can be considered conclusive, given how much they fluctuate both within a run and from run to run.

[–]stw 2 points (1 child)

Just a small nitpick: puts appends a newline, so puts("y\n") writes 2 newlines.
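
A minimal illustration (fputs, unlike puts, appends nothing):

puts("y\n");          /* writes "y\n\n": two newlines */
puts("y");            /* writes "y\n" */
fputs("y\n", stdout); /* also writes "y\n" */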

[–]kjensenxz[S] 1 point (0 children)

Thanks! I completely overlooked that, and was off by about 50%. I edited the OP to reflect the real values.

[–]SixLegsGood 1 point (4 children)

  1. What happened to the caches? Shouldn't this tiny program and the tiny amount of the OS being exercised fit within the L2 cache? Why then should it be limited to main memory speed?
  2. Is 'pv' a bottleneck? I see a comment below that you tried sending the output through dd to /dev/null. Perhaps try running something like:

    pv < /dev/zero

(although I wouldn't be surprised to find that /dev/zero is slower than yes...)

[–]kjensenxz[S] 1 point (3 children)

What happened to the caches? Shouldn't this tiny program and the tiny amount of the OS being exercised fit within the L2 cache? Why then should it be limited to main memory speed?

This is a great question; in fact, it should fit in L1 on my processor (32K data, 32K instructions). I would assume it's stuck at memory speed because there is a pipe involved, and now that you mention it, the best way to measure this would probably be to use an internal timer and counter, along the lines of the sketch below.
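
A rough sketch of such a timer (hypothetical; clock_gettime with CLOCK_MONOTONIC, and TOTAL/ITERS are arbitrary picks):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define TOTAL 8192
#define ITERS 1000000L

int main(void) {
    char buf[TOTAL];
    struct timespec t0, t1;
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return 1;
    for (int i = 0; i < TOTAL; i += 2)
        memcpy(buf + i, "y\n", 2);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        write(fd, buf, TOTAL); /* no pipe: straight to /dev/null */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GiB/s\n", (double)TOTAL * ITERS / secs / (1 << 30));
    return 0;
}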

Is 'pv' a bottleneck? I see a comment below that you tried sending the output through dd to /dev/null. Perhaps try running something like: pv < /dev/zero

$ pv < /dev/zero
... [4.79MiB/s] ...
$ pv > /dev/null < /dev/zero
... [20.6GiB/s] ...

Honestly, at this point, it's very difficult to say whether pv is a bottleneck. Several people have mentioned it, and I've thought about it, and I think the real bottleneck has to be the pipe, because everything sent through it has to pass through memory.

[–]SixLegsGood 1 point (2 children)

Wow, thanks for the quick reply and benchmark!

IIRC, back in the day, IRIX used to support a crude form of zero-copy I/O where, if you were reading / writing page-sized chunks of memory that were properly aligned, it would use page table trickery to share the data between processes (or between OS drivers and processes), so that the reads and writes really did do nothing. In practice, the optimisation never seemed to be too useful, there were always too many constraints that made the 'zero-copy' cost more than simple data transfer (the sending process/driver needed to not touch the memory again, the receiver mustn't alter the data in the pages, the trick added extra locking, and on many systems, the cost of updating page tables was slower than just copying the 4kb chunks of memory). But for this particular benchmark, I suspect it could hit a crazy theoretical 'transfer' speed...

[–]kjensenxz[S] 1 point (1 child)

IRIX used to support a crude form of zero-copy I/O where, if you were reading / writing page-sized chunks of memory that were properly aligned, it would use page table trickery to share the data between processes (or between OS drivers and processes), so that the reads and writes really did do nothing.

You know, I have a spare computer, and IRIX is available on a torrent site, which makes me wonder if I could (or should) try installing it and benchmarking this application on bare metal (hypervisors seem to completely ruin benchmarking).

[–]SixLegsGood 1 point (0 children)

You'd definitely need to run it on bare metal to test this optimisation, the virtualisation would be emulating all of the pagetable stuff. I think it also only worked on specific SGI hardware (or maybe it was specific to the MIPS architecture?), and there were other restrictions, like the read()s and write()s had to be 4kb (I think) chunks, 4kb aligned, possibly with a spare 4kb page either side of the buffers too. It may also have been restricted to driver<->application transfers, the use case I encountered was in a web server that was writing static files out to the network as fast as possible.

[–]emn13 1 point (1 child)

You state the memory bandwidth is 12.8 GB/s - but that's per channel, and my guess is that you're running a dual channel setup (most people are). That puts the theoretical peak at 2 × 12.8 = 25.6 GB/s, so 10.2 GiB/s (about 11.0 GB/s) is a little less than half of it.

Also, note that because you're writing to /dev/null, it's conceivable no reads ever occur, even at a low level, so full-throughput sequential writes really are achievable.

Oh, and additionally it's not trivially obvious (to the non-OS geek me, anyhow) why this benchmark even needs to hit RAM - is there some cross-process TLB flush going on? After all, you may be writing a lot of memory, but you're doing so in small, very cachable chunks, and you're discarding those immediately - so why can't this all stay within some level of cache?

[–]kjensenxz[S] 1 point (0 children)

You state the memory bandwidth is 12.8 GB/s - but that's per channel, and my guess is that you're running a dual channel setup (most people are). That puts the theoretical peak at 2 × 12.8 = 25.6 GB/s, so 10.2 GiB/s (about 11.0 GB/s) is a little less than half of it.

You're right, I am on a dual channel setup, but as far as I know (and I don't know much about RAM), it would only be hitting a single channel.

Also, note that because you're writing to /dev/null, it's conceivable no reads ever occur, even at a low level, so full-throughput sequential writes really are achievable.

Oh, and additionally it's not trivially obvious (to the non-OS geek me, anyhow) why this benchmark even needs to hit RAM - is there some cross-process TLB flush going on? After all, you may be writing a lot of memory, but you're doing so in small, very cachable chunks, and you're discarding those immediately - so why can't this all stay within some level of cache?

As far as I know, the series of "y\n" is in the cache, there's plenty of room in L1 and L2. But since the output of yes is being redirected through a pipe, it does need to be read by the program on the other end (pv), which normally would throw it up on standard out, but discards it to /dev/null. To communicate through a pipe, the standard output of one program has to be buffered into memory that the end program can read, which is achieved through the kernel (pipe is a syscall). Might the halving of the memory speed be from the simultaneous read/writes?

If I implemented a timer and counter in the same program, it would probably never need to leave cache, and would instead see how quickly write() could be called to /dev/null opened as a file descriptor (might make an interesting memory/cache speed benchmark program).

[–]vvhy 0 points (0 children)

Uh oh - look at Kiki!

[–]Malor 0 points (0 children)

I just had another thought: you might be bottlenecking on pv.

You might get your missing performance back if you "yesed" directly to a ramdisk file... or maybe an SSD, if you have one that can handle 11 gigs a second. I'm not sure if any of them do, yet.

[–]crowdedconfirm 0 points (0 children)

Interestingly, yes on my MacBook Air seems to be much slower than the statistics you posted, although for most practical purposes I don't see it making much of a difference.

1.66GiB 0:01:01 [28.9MiB/s] [ <=> ]