Each metadata write requires a commit to stable storage, even if it’s done in bulk. That assumes you write the name, the timestamps, any access control lists, advisory access controls, etc. all in one operation, which typically you don’t: it’s usually 4 operations, plus one or more additional operations per ACL, depending on whether there are inherited container ACLs or not.

Each file close requires a commit to stable storage.

1,000 files means a minimum of 2,000 sync-to-disk flush-track-buffer operations, probably closer to 6,000, compared to 2–6 of them for a single large file.
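The arithmetic works out like this (a toy model, not a measurement; the per-file commit counts are the figures quoted above):

```python
# Toy model of stable-storage commits during a copy. The default
# per-file figures (4 metadata commits, 2 for the close path) are the
# estimates from the text, not measured values.
def commits_for_copy(num_files, metadata_commits=4, close_commits=2):
    """Total sync-to-stable-storage operations for copying num_files files."""
    return num_files * (metadata_commits + close_commits)

thousand_small = commits_for_copy(1000)  # roughly 6,000 commits
one_big = commits_for_copy(1)            # a handful of commits
bare_minimum = commits_for_copy(1000, metadata_commits=1, close_commits=1)
```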

Additionally, even if the directory is structured like a B-tree, each new create is going to cost a log2(N) traversal, where N is the number of entries already in the directory tree.

Copy operations of this style typically do a depth-first traversal, which means the writes should really be issued as a breadth-first iteration; but as a practical matter, they can’t be, so the initial traversal should be breadth-first instead.

It typically isn’t, because directory positional data isn’t communicated to user space when traversing a directory tree.

For a bulk copy operation, you’d actually want a flag to pass into opendir() to force the iteration in the other direction, and that would also require modifying the file system.
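No such opendir() flag exists today, but user space can at least do its own breadth-first walk instead of the usual recursive descent. A minimal sketch, using only the Python standard library:

```python
import os
from collections import deque

def walk_breadth_first(root):
    """Yield (directory, filenames) level by level: every directory is
    visited before any directory deeper than it, by queueing
    subdirectories rather than recursing into them immediately."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        with os.scandir(directory) as entries:
            children = list(entries)
        yield directory, [e.name for e in children if e.is_file()]
        queue.extend(e.path for e in children
                     if e.is_dir(follow_symlinks=False))
```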

One of the most obvious examples of the problem is the “ports” tarball in FreeBSD, where you pull it up forwards and lay it down forwards, and it takes forever; but if you modify tar to add a flag to go breadth-first on write, given that you have the metadata in hand, it’s a couple of seconds instead of 20 minutes.

As a purely practical matter, the flag isn’t there by default, because tar is often used in “copy mode”, e.g.:

  1. tar cf - . | (cd /some/new/place ; tar xvf -) 

And the data you’d want to unpack first is not in that order in the pipeline.


When you copy a single 1GB file, you:

* Read the file’s information from the disk directory
* Locate the file on disk
* Locate free space on the destination
* Write the file’s directory information on the destination
* Read as much of the file as will fit in RAM
* Write what you’ve read to the destination
* Close the file on the destination
* Release the file’s handle on the source
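The steps above map almost one-to-one onto a minimal chunked copy loop. A sketch (illustrative only; a real copier would also carry permissions and timestamps across):

```python
import os

def copy_file(src, dst, chunk_size=64 * 1024 * 1024):
    """Copy src to dst in RAM-sized chunks, mirroring the steps above."""
    fd_in = os.open(src, os.O_RDONLY)           # locate/open the source file
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while True:
            chunk = os.read(fd_in, chunk_size)  # read as much as fits in RAM
            if not chunk:
                break
            os.write(fd_out, chunk)             # write what you've read
    finally:
        os.close(fd_out)                        # close the destination
        os.close(fd_in)                         # release the source handle
```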

When you copy 1,000 1MB files, you:

* Read the first file’s information from the disk directory
* Locate the first file on disk
* Locate free space on the destination
* Write the first file’s directory information on the destination
* Read the first file into RAM
* Write what you’ve read to the destination
* Close the first file on the destination
* Release the first file’s handle on the source
* Repeat 999 more times

With a thousand files, all the faffing about (reading directory information, allocating space on the destination, writing directory information, and so forth) can actually take more time, and involve writing more data, than the contents of the small files themselves!

Even if you gang up (locate and read the first file, locate and read the second file, locate and read the third file, construct file system entries in memory, allocate space for the files all at once, write the directory entries, write the file data), that’s still a lot of faffing about.



There are some reasons why it is harder to copy 1,000 small files than 1 larger one, as described by other people.

However, most of the time the slowdown is not only due to inherent difficulty, but also due to a badly written copy program.

So why are some copy programs badly written?

  • Some programs were written in the days of dumb drives and haven’t changed since.
  • Some programmers don’t know how to write fast copy programs for modern “smart” drives.
  • Some programmers don’t bother writing fast copying code because they think that on modern hardware (e.g. an NVMe SSD) the difference is negligible.

Some programs are slow despite being well-written because they try to save memory, so they represent a compromise between speed and memory use.

On Windows, generally Explorer copies faster than some file managers such as Far, but slower than specialized programs such as FastCopy.

Why is copying 1,000 1MB files so much slower than copying 1 1GB file, given that the same amount of data is being copied?

Because although you copy (roughly) the same amount of data, you pay the overhead of copying a file not once but a thousand times. It’s the 999 additional rounds of overhead that make this operation so much slower.
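You can put that into a rough formula: total time ≈ files × per-file overhead + bytes ÷ bandwidth. A sketch with purely illustrative numbers (10 ms of overhead per file, 500 MB/s of raw bandwidth; neither is measured):

```python
def copy_time(num_files, file_bytes, per_file_overhead_s, bandwidth_bps):
    """Rough model: a fixed per-file cost plus the raw transfer time."""
    return (num_files * per_file_overhead_s
            + num_files * file_bytes / bandwidth_bps)

# Same gigabyte of data, very different totals:
one_big = copy_time(1, 10**9, 0.010, 500 * 10**6)     # about 2 seconds
many_small = copy_time(1000, 10**6, 0.010, 500 * 10**6)  # about 12 seconds
```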



The difference in speed when copying 1,000 1MB files versus a single 1GB file can be attributed to several factors:

  1. File System Overhead: Each file copy operation involves some overhead associated with file system metadata management. This includes creating entries in the file system for each file, updating directory structures, and managing file attributes (like timestamps and permissions). Copying many small files results in a lot of this overhead, which accumulates and slows down the process.
  2. Fragmentation: Small files may not be stored contiguously on the disk, leading to fragmentation. When copying many small files, the read/write head of a traditional hard drive has to move around more, which increases seek time and reduces overall transfer speed. In contrast, a single large file is more likely to be contiguous, minimizing seek time.
  3. I/O Operations: Each file copy operation typically involves multiple input/output (I/O) operations. For 1,000 files, there are 1,000 separate I/O operations, whereas for a single 1GB file, there's only one. More I/O operations mean more time spent managing the operations rather than transferring data.
  4. Buffering and Caching: When copying a single large file, the operating system can take advantage of buffering and caching more effectively. It can read larger chunks of data into memory at once, reducing the number of read/write operations. With many small files, the benefits of caching are diminished because each file may be accessed independently.
  5. Network Transfer: If the files are being copied over a network, there may be additional latency per file due to the need to establish connections or negotiate transfer parameters for each individual file. A single large file can often be transferred more efficiently in one go.

In summary, while the total amount of data is the same in both scenarios, the way that data is managed, accessed, and transferred significantly affects the speed of the copy operation.

For the real answer see Franklin Veaux’s answer.

Here’s an analogy.

Why is it faster to put a single 10,000 page book on the library shelf, and update the library catalogue, than to put 1000 10-page pamphlets on the library shelves and update the catalogue for each one?

All that extra housekeeping.

Just try:

Open the door, carry a bottle of water inside. Get out. Close the door. Do that 10 times.
Or: Open the door, carry 10 bottles of water inside. Get out. Close the door.

You could spare some time by leaving the door open. Systems do that too: they use caches, don’t reread the catalog all the time, and don’t rewrite the catalog entries after each copied block. But still there will be lots of housekeeping that cannot be spared.



Well, having read the comments, and acknowledging I don’t know “journaled file systems,” I’d like to add that both the writer and upvoter are missing the real power of the OS: caching. Unless you make the copy to un-cached media, the process is totally different, much simpler and faster, but still much slower than copying 1 chunk of data. I know this from practical test results. You’re better off, say, zipping all the files into 1 before copying.

The point I’d like to make is that caching eliminates the individually troublesome, long data-handling procedure by copying into RAM (1,000 times (HDD) or 100 times (SSD) faster).
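The zip-first suggestion is easy to sketch with the standard library; the point is that the transfer then becomes one large sequential stream instead of thousands of tiny ones:

```python
import os
import zipfile

def bundle(paths, archive):
    """Pack many small files into one archive so the subsequent copy is
    a single large sequential transfer."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=os.path.basename(p))
```

Copy the resulting archive, then unzip on the other side; the per-file overhead is paid locally, where it is cheapest.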

For pretty much the same reason that it’s a lot quicker to move a pack of 500 sheets of paper from one shelf to another than it is to move each sheet one at a time.



The answers posted have covered this quite well, but let’s take a look from a different angle.

Compare this question to the many posted here about disk fragmentation and how that affects performance as well. Just as many small fragments of a file require additional work to retrieve, so does the placement of many small files on a disk compared to a single large file. The process is the same: larger “chunks” or “files” will require fewer overhead tasks.

Another answer used mail as an example. Let’s use another non-computer example to illustrate this too. Which would you prefer: carrying one 2-liter bottle of water at a time to a table, or bringing a case (12) at a time and opening the case at the table? At most, you could carry 3 or 4 bottles by hand. With cases, you could potentially carry 24 (2 cases). Again, look at the overhead here (walking to/from the table). Moving the cases takes less time since you travel the pathway only once for every 24 bottles, versus 6 times (at best) if you can carry 4 bottles at a time. That is a 6X improvement!!

Makes sense right?


I’ve already seen quite a few good answers here.

The general answer is that each file requires a certain amount of “overhead”. Each is handled separately, by the file system, and by your network, and the delays caused by that handling add up. (This is especially true in systems that have been optimized to handle large files - rather than small files - efficiently.)

If you’re talking about network transfers, a lot of that overhead involves how the packets are routed from one point to another and the limitations of the equipment involved. For example, if you look at the hardware limitations of a typical small router, you might find:

  • maximum throughput = 100 Mbits/second
    (limited by how fast it can move data)
  • maximum routing throughput = 100,000 packets/second
    (limited by how long it takes to decide where each packet should be sent)

So, if you have one big file, being sent as relatively few large packets, the size is what counts. However, with many small files, the limitation is going to be how long it takes to decide where to send each piece.
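Those two limits combine as a minimum: the effective rate is whichever runs out first, wire speed or routing speed. A quick model using the figures above (packet sizes are illustrative):

```python
def effective_bps(link_bps, max_pps, packet_bytes):
    """A router delivers at the lower of its wire speed and its
    packet-routing speed for packets of a given size."""
    return min(link_bps, max_pps * packet_bytes * 8)

# 100 Mbit/s link, 100,000 packets/second:
big_packets = effective_bps(100e6, 100_000, 1500)  # wire-speed limited
tiny_packets = effective_bps(100e6, 100_000, 100)  # routing limited
```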

Imagine moving one bucket full of sand… from upstairs to downstairs… through a 1/4″ hole in the floor. If you dump the bucket onto the floor, over the hole, the size of the hole will be the limiting factor (network speed). But, if you carry the sand to the hole between your fingers, one pinch at a time, the limiting factor will be the time you spend picking up pinches of sand, walking across the room with them, and dropping them through the hole.

Note that, once you know this, there are ways to optimize the process. For example, if you have 10,000 tiny files, and you know your system will be terribly slow transferring them, then combine them into one large ZIP or RAR archive file, send that across your connection, then extract the separate files at the other end. Even though it will cost you time to combine them into a single file, and extract them at the other end, you may save MORE time by transferring one big ZIP file instead of 10,000 separate files. (This is the way software is often distributed - for this and other reasons.)

The way packet traffic works across a network is often compared to “cutting a book up into individual pages and sending each in a separate envelope”. Note, however, that the post office would never be foolish enough to send envelopes separately - instead they combine envelopes destined for the same block into sacks - and pack sacks destined for the same area into trucks. At some points on a network packets are treated more like molecules of water flowing through pipes… but, at others, they are treated more like envelopes, which can be consolidated, and transported in large batches, significantly cutting down on handling overhead.

Handshaking and send/receive confirmation between each small file adds significant time to the copy process.


When a file is created, in addition to the contents of the file being written, a record is added to the filesystem on the drive. This record specifies the file name, the location of the data sectors on the drive, access rights, etc.

When many tiny files are transferred, this operation has to be repeated thousands of times. When you transfer a single file, it is only done once.

Moreover, the record to update (which represents the directory data) becomes bigger with each cycle. It takes more and more time to read it, update it, and write it back. There’s no significant difference between, say, 10,000 and 10,001 entries, but it is noticeable between 1 and 10,001.

One more factor is the location of file data and directory data on the drive, if it is a classic disk drive (not flash/SSD). Once upon a time the drive had to position the magnetic head to the location where the new file would be stored, write the file contents, then position the head to the location where the directory data is stored and update the directory, repeating this sequence for each file in a loop. This took a LOOOOOOT of time and produced a lot of mechanical noise when you copied hundreds or thousands of files; we even used that to ‘play drums’ on big computers decades ago. Later, operating systems started to cache such changes, so this process became much faster: the cache accumulates the data to write and flushes it to the disk periodically, making far fewer ‘real’ read-write operations and thus spending much less time repositioning the heads.


Pretend you are a courier: Why does it take longer to deliver 100 1-pound packages to 100 addresses than to deliver one 100-pound package to one address?

Each of those files has to have its own file name, address, and entry in the catalog.
Then there is the block size. An NTFS drive has a minimum block size of 4K. A file that is 2K in size will still take one block (4K) worth of space. A 4.1K file will take 2 blocks, or 8K, so many small files can lead to a lot of wasted space. A large file, by contrast, fills its blocks fully until it gets to the last block, so only a portion of one block is wasted.
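The block-size rounding is simple to compute: on-disk size is the file size rounded up to a whole number of blocks. A sketch:

```python
import math

def on_disk_size(file_bytes, block_bytes=4096):
    """Space a file actually consumes: its size rounded up to whole blocks."""
    return math.ceil(file_bytes / block_bytes) * block_bytes

two_k_file = on_disk_size(2 * 1024)  # 2K of data still occupies 4096 bytes
just_over = on_disk_size(4198)       # a ~4.1K file spills into a second block
```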


The simplest analogy is this…

Imagine you have a library of books, and a card catalog that tells you where the books are on the shelves. Moving a file on a computer is essentially like deciding that the catalog card for a particular book should be in a different drawer. You pull the card out of the drawer it is in and put the card in the new drawer, then you’re done. The book stays exactly where it was in the library, and the card still points to that same shelf, and the only thing that changes is the position of the card.

On the other hand, copying a file is like deciding that you need two different copies of a book on different shelves in the library. Not only do you have to make a new card for the catalog, you also have to get another copy of the book itself and put it out on the shelves, which is much more labor intensive. You’ll now have two catalog cards pointing to two different copies of the same book.

Computers use catalogs just like libraries do. When you request a file from a computer, first it looks in the catalog to find the location of the file’s data, and then it goes and retrieves that data from storage. Moving a file amounts to reorganizing the catalog without touching the data; copying a file duplicates all of the data.

If you move a file across file systems (to different disks, drives, or media) a move is effectively the same as a copy. That’s because each file system has its own catalog and own data area, and cannot refer to catalogs or data areas on other file systems. It has to copy the data into the new file system so that file system’s catalog can access it.
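This is exactly why tools like Python’s shutil.move try a cheap rename first (re-filing the catalog card) and only fall back to copy-and-delete when the rename fails across file systems. A sketch of that pattern:

```python
import errno
import os
import shutil

def move(src, dst):
    """Try the catalog-only rename first; fall back to a full
    copy-and-delete when src and dst are on different file systems."""
    try:
        os.rename(src, dst)        # just moves the catalog card
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)     # duplicate the data...
        os.remove(src)             # ...then drop the original
```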


There are many factors that impact copy speed. The file system (NTFS, FAT32, HFS+, ext2/ext3/ext4, XFS, JFS, ReiserFS, btrfs) can hinder or speed up the process. Older file systems are single-threaded, meaning one copy operation at a time instead of many simultaneous ones. Other factors are overhead: metadata about the files on the filesystem requires updating, plus expensive journaling operations, RAID operations across multiple disks, and other hardware overhead. Disk has always been the performance bottleneck, and it is where the fastest improvements have been made recently, with SSDs replacing traditional platters of spinning rust and magnetic fields, and PCIe bypassing SATA controllers to double and quadruple speeds. Some operating system file managers need to scan all the files and subdirectories before even starting the copy; you see this in Windows Explorer or the Mac Finder when it says “Preparing to Copy”. Other systems, typically command-line tools, don’t prepare at all; they just start copying.

When copying over a network the factors are increased a great deal. Even though you may have incredible bandwidth available it likely won’t be used to its potential and you can add the file transfer protocol overhead as well (SMB/CIFS, SSH/SFTP, NFS) on top of whatever filesystem and hardware overhead.

The biggest reason that many small files takes longer than a single large file is they are copied one at a time and they complete the copy before reaching maximum speed and then the next file starts to copy and increases speed then it finishes and the next file starts to copy and increases speed. But a large file takes longer to copy and therefore can reach maximum peak speed and maintain that maximum speed before it finishes copying. In this case it’s more efficient and accelerates faster for a longer period of time. It’s like a 0–60 mph acceleration of a car for each file. They complete the copy before the car even reaches 20mph, the car stops then starts again from 0mph accelerating until the copy finishes then comes to a stop. Rinse and repeat for 50,000 small files and it takes forever to complete. Copying a single large file is like a train loaded with cars accelerating to maximum speed from point A to point B. So zipping or otherwise compressing the many small files into a large file prior to copy can speed things up considerably. Looking at network bandwidth graphs while copying you would see many small spikes for each file but a large square wave reaching peak speed continuing to completion. Overall the speed of copy will be faster to complete.

In Unix/Linux/Mac systems you can effectively pipe multiple commands together to compress the files, pass them between hosts, and decompress on the other side. This is accomplished by tar-gzipping the source files, piping to netcat (which opens a raw network port and sends to an IP address waiting to receive via netcat), then piping to a tar-gunzip process on the destination host. In this case the compression ratio isn’t as important as the speed with which the data can be compressed, so bzip2 would be slower. This technique eliminates the network transfer protocol overhead by using raw network ports, and it creates one large continuous stream of data during the copy, so network bandwidth is allowed to peak and become saturated. This is ideal on an isolated network that is not throttled and would not impact other networked systems. In real life, this technique ran file copies at maximum speed and completed the transfer of hundreds of gigabytes over gigabit Ethernet hours faster than a normal copy of many mostly-small files. The other advantage is that free space on the source might not be enough to hold the entire gzipped archive, which is why it gzips on the fly and pipes into netcat.

Other fast downloaders or file copy tools offer spinning up multiple threads. The Robocopy utility on Windows can do this, but it’s only effective on many small files; spin up 60 threads and it will finish the copy faster. There is the aria2c utility that does the same for HTTP, FTP, SFTP and BitTorrent. I recently used it to download Xcode, which is 4.54 GB and was taking forever because the App Store was swamped right after Apple’s WWDC. I grabbed it from the developer website instead of the App Store and had aria2c download it very quickly.

Modern 64-bit/128-bit file systems that support multithreaded access, snapshots, and deduplication can speed things up considerably, depending on what you are doing. ZFS and Apple’s new APFS can duplicate a huge file in seconds and then let you make changes to the duplicate while tracking those changes in snapshots. On ZFS you can send snapshots to a second system, and it will copy only the deltas, saving you time in file transfer. This is how online backups like CrashPlan work: they only copy the changes. Snapshots should eventually speed things up in Time Machine as well, while taking less space on the local disk until you attach the backup drive.

Apple also supports Fusion drives, which combine an SSD with a larger traditional disk. You could have a 256GB SSD paired with a 2TB hard disk, and the CoreStorage system is smart enough to put your most frequently used files, operating system, and applications on the SSD to speed things up, while archiving older items on the slower 2TB drive. It all looks like one single disk, but it’s not. This is getting improved in High Sierra with the new Apple File System (APFS), which won’t need CoreStorage anymore.

Another reason copying files can be slow is that the operating system doesn’t want a single copy operation to swamp the whole system to the point it becomes unresponsive, so copies are intentionally throttled. Otherwise your system can seem to hang while it’s busy with a long copy operation. Servers can be swamped by too many disk operations at once from too many users, and everyone is slowed down at that point. On servers, throttling is definitely used, especially on virtual machine hosts: you don’t want one virtual machine stalling performance for all the others, so both CPU and disk are throttled as the resources are shared.


Because copying a file is a process.

Let's just deal with this theoretically. We'll also just refer to hard drives. A similar concept applies to other storage devices.

A computer is nothing more than a human has made it to be. Every tiny detail has to be planned by a human.

This is like that old trick of teaching how computers work.

Me: I'm a computer. Program me to get a glass of water.

You: Okay, go to the kitchen.

Me: Invalid command. You have to be more specific.

You: Fine. Get up.

Me: (I get up.)

You: Walk to the kitchen.

Me: Unrecognized argument, “kitchen.”

You: What?

Me: You have to be more specific.

On a hardware level, like processors and hard drives, it would be even more ridiculous than this. “Get up” would be an unrecognized command. You're entering commands into my brain so you would have to tell it to send messages to each muscle necessary to get up. To do that you would have to tell it which muscles those are and what message to send.

Hard drives have two commands in our theoretical world. They had to have hardware designed to know how to carry out those commands. One is “give me data,” the other is “store this data.”

To copy a file, the processor has to say “give me data,” then “store this data,” and repeat that over and over until the file is copied. But how does the hard drive know which data? The processor has to figure that out and tell the hard drive which data to get.
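
The “give me data” / “store this data” loop described above can be sketched in a few lines; the chunk size here is an arbitrary illustrative choice:

```python
# Sketch of the "give me data" / "store this data" loop the
# processor runs to copy a file, one chunk at a time.
def copy_file(src_path, dst_path, chunk_size=64 * 1024):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)   # "give me data"
            if not chunk:
                break                      # end of file reached
            dst.write(chunk)               # "store this data"
```

The processor (here, the Python program) is the one deciding which data to ask for next; the drive just serves and stores blocks.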

Couldn't you just add a “copy this data” command to the hard drive?

Yes! You could add whatever you want! But it'll take more hardware on the hard drive to do it. Hard drives would have to be more expensive as a result. Copying a file uses so little of the processor's resources that it isn't worth it. The closer a hard drive comes to being its own mini computer, the greater the risk of bugs and failure, and the more expensive it will be.

That's why copying files uses processing power: it's a very fair trade-off compared to what's required for the hard drive to be more autonomous. Generally, simple processes are left to the processor. Complex things that are difficult for the processor and slow it down considerably are moved off into other hardware, like a sound chip or a graphics chip.

Then there's the more political issue of the processor worrying about hard drives becoming too powerful and taking over the computer. It's already seen this happen to some degree with GPUs. Now it has to share the spotlight with them. They don't want other devices rising up as well. It threatens their processor privilege so they have to oppress the lowly hard drive. Processors are really quite devicist at their core, and look at how many cores they have these days!

Each file has an entry for its name, date information, and the logical block where it is located on the disk. Then there is a bitmap of which blocks are used and which are empty. When any file is created, an entry in the FAT is created and a search of the bitmap is made to find enough blocks to copy the file into. For performance the bitmap can be cached so that not too many reads are needed to find a free location. At best this takes one read and one write, but on a fragmented disk it can take many reads to find a large enough free space. Now the file is allocated its logical blocks and the creation date and time are recorded in the FAT table, which is another write. Then the data is written into the blocks, and at the end the FAT is updated with date information, completing the process. A FAT entry is so important that it is written immediately to the drive. How does a large file differ? There is usually only one entry: the starting address and the length in blocks. There is no waiting for the drive to turn or the head to move to a different block. By the way, when a file is moved within the same hard drive, only the FAT entry needs to be changed; the file does not move, only the pointers.
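
As a rough sketch (a simplified model, not how any real FAT driver is written), the bitmap search for a run of free blocks might look like this:

```python
# Simplified model of a FAT-style allocator: scan a free-space
# bitmap for a run of contiguous free blocks, then mark them used.
def allocate(bitmap, blocks_needed):
    """Return the index of the first free run of blocks_needed
    blocks, marking them used, or None if the disk is too
    fragmented to hold the file contiguously."""
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = i + 1, 0   # restart the run search
        else:
            run_len += 1
            if run_len == blocks_needed:
                for j in range(run_start, run_start + blocks_needed):
                    bitmap[j] = True        # mark blocks allocated
                return run_start
    return None
```

On a fragmented bitmap the scan has to pass over many short free runs before finding one big enough, which is exactly the “many reads” cost described above.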

One word. Metadata. Metadata is the bottleneck of the modern storage system.

From a conceptual level, think about a computer filesystem like a book with an index and you’re an editor.

The entire book has 20 chapters of 100 pages each, and each chapter is roughly 1 MB in size on disk. You do need to check the index, but if you’re deleting an entire chapter it’s relatively quick to identify and delete its pages.

The filesystem functions similarly. It has an index, references, and pages that allocate data and even redirect to other portions of the disk when say a page or paragraph is moved.

Essentially it comes down to performing at least two orders of magnitude more operations to delete the files, likely sequentially.

This is dramatically exaggerated by systems with erasure coding, such as RAID and LRC, and by other technologies like copy-on-write or snapshotting, which must be consolidated.

That’s the gist of it.

To go into further conceptual detail: you can actually make the 1,000 operations faster than the single 1 GB delete if you’re very skilled with the operating system, the calls it makes, and how it specifically behaves with regard to parallel execution, file handles, and disk and filesystem buffers and caches. With the 1,000 small files, you can take advantage of parallel operations on a much larger scale.

This is one foundational cornerstone that large-scale distributed storage systems rely on in order to store and process data at the petabyte and exabyte scale efficiently and reliably.
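
As a sketch of that idea, a thread pool can have several small deletes in flight at once; whether this actually beats a sequential loop depends entirely on the OS, filesystem, and device:

```python
# Sketch: deleting many small files with a thread pool so several
# deletes can be in flight at once. Whether this beats a simple
# sequential loop depends on the OS, filesystem, and device.
import os
from concurrent.futures import ThreadPoolExecutor

def delete_all(paths, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all deletes to complete before returning
        list(pool.map(os.remove, paths))
```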

Several reasons:

  • Overhead of deleting 10000 directory entries instead of 1 directory entry
  • Overhead of deallocating a minimum of 10000 regions of disk vs overhead of deallocating a minimum of 1 region of disk
  • Disk seek time seeking to 10000 directory entries instead of 1 directory entry

There are likely other reasons as well.

Imagine you have two "to-do" lists. One has 10000 small bank accounts that you have to close. The other has one big bank account that you have to close. Can you see how the amount of time it takes might depend more on how many accounts there are, than on how much money is in the accounts?

It is not simply a matter of moving the file contents. There is also overhead associated with each file being processed. As the diagram (for Unix/Linux) below begins to indicate, there is quite a bit of non-content data involved…which has to be examined & updated.

Note that for cut/copy there is also the need to find space for each file and then to transfer it. Clearly that is a simpler task for a single file than for multiple ones. A single, large file might not even be (or have to be) fragmented. Even if it is, it isn’t likely to be in 16,000 pieces.

Each file copied requires a notation in the FAT (File Allocation Table); then the data is written to the disk. Remove 999 writes to the FAT and you speed things up.

For pretty much the same reason it takes longer to ring up three shoppers in the express lane, each of whom has ten items in their basket, compared to one shopper who has 30 items in their basket.

It takes time to “setup” (key stuff in) and “teardown” (accept payment, coupons, etc.) a customer’s order at the supermarket, so there is more to do:

3 orders x 10 items = 3x setup + 30 items + 3x teardown

1 order x 30 items = 1x setup + 30 items + 1x teardown

In the case of files we have:

16000 files = 16000 setup (find the file) operations + 16000 delete operations (totalling 1GB of deleted files)

1 file = 1 setup (find the file) + 1 delete operation (which also totals 1GB of deleted files)

So it’s not really the delete operation itself; it’s the stuff in between the deletes: finding the files, marking them deleted in the directory, and releasing the storage associated with file information which is not stored inside the file but somewhere else in the filesystem (owner, file size, permissions, etc.).
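
The setup/teardown arithmetic above can be put into a toy cost model; the constants here are invented purely for illustration:

```python
# Toy cost model from the supermarket analogy: total time is
# per-item work plus per-order (per-file) setup and teardown.
# The constants are invented for illustration only.
def total_cost(n_files, items_per_file, setup=1.0, teardown=1.0,
               per_item=0.01):
    return n_files * (setup + teardown) + n_files * items_per_file * per_item

# Same total amount of data, very different totals:
many_small = total_cost(16000, 1)   # 16,000 one-item "orders"
one_big = total_cost(1, 16000)      # 1 order with 16,000 items
```

With these numbers the per-file setup/teardown dominates completely for the 16,000-file case, even though the per-item work is identical in both.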

Because each file has an overhead for writing. For 1,000 small files, the system has to write the metadata 1,000 times.

Let’s compare.

One single file of 1 GiB: you open it once and start transferring. Once the 1 GiB of data is transferred, you close the file, write the metadata, and that’s that.

Now 1,024 files of 1 MiB each: you open file one, transfer 1 MiB of data, close the file, and write the metadata. Then repeat for the second file, and so on. (We are not taking into consideration multiple threads that the system might start.)

Notice how in the second case you have repeated opening and closing of files and writing of metadata.

That’s why we zip or tar files on web servers before downloading them: to consolidate multiple smaller files into a single large file that’s easier and faster to download.
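
As a minimal illustration of that consolidation step, using Python’s standard tarfile module (the paths are hypothetical):

```python
# Consolidate many small files into one compressed archive, so a
# transfer pays the per-file overhead once instead of N times.
import os
import tarfile

def bundle(paths, archive_path):
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in paths:
            # store each file under its base name inside the archive
            tar.add(p, arcname=os.path.basename(p))
```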

Creating 10 files in a file system takes a lot less time than creating 400000 files, no matter how large or small the files are. Allocating disk space, searching the directory for existing entries, and inserting a new entry takes time, and writing each file requires a separate seek operation on the disk, after writing the directory entry.

Think of how long it would take to fill a barrel with water taking one cup at a time versus ...

That makes sense. “Files” aren’t really a thing at the hardware level - they’re an abstraction, a concept used by computer software to make it easier for humans to understand what they’re doing. In hardware, a file is little more than a series of 1’s and 0’s on a disk.

Computers track files on a disk using a list of some sort (in NTFS, used by Windows, this is called the MFT, or master file table), where each entry in the list points to the location and size on disk of the file in question.

Deleting a file is a simple matter of removing the file’s entry in the list. The amount of data involved in a single list entry is unlikely to exceed even a fraction of a kilobyte.

Copying a file, however, requires reading every single byte of the file from one location, and writing it to the second. That means potentially gigabytes of reads and gigabytes of writes, depending on file size.

I should note, further, that some filesystems support copy-on-write. In a copy-on-write scheme, when a file is copied, the second copy initially points to the same exact data as the first. It is only when data is written to one of the copies, that this copying process is performed; and then only the portion changed is copied. The file table can then be updated so each file gets its own correct version (the original is unchanged, but the edited copy will show the desired changes). This way, the user can still create and edit copied files normally, but the computer saves on unneeded disk accesses.

One such copy-on-write filesystem is btrfs, a common file system used by Linux computers.
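
A toy model of copy-on-write (not btrfs’s actual implementation) makes the block sharing concrete:

```python
# Minimal model of copy-on-write: a clone shares block references
# with the original, and only a block that gets written is
# actually duplicated.
class CowFile:
    def __init__(self, blocks):
        self.blocks = blocks          # list of shared bytearray blocks

    def clone(self):
        # Cloning costs O(number of block pointers), not O(data):
        return CowFile(list(self.blocks))

    def write(self, block_index, data):
        # Duplicate only the one block being modified.
        self.blocks[block_index] = bytearray(data)
```

The clone is instant because no data moves; the cost is deferred to the first write of each block, and only for the blocks that actually change.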

Each file needs to be set up, transmitted, and saved separately. Any missing packets must be fixed before transmission of the next file starts, or, if several (say 5) run in parallel, before the next (6th) file starts.

The initial speed indication includes all of the file that’s been read before you see anything happening, so it’s a lot higher than the speed at which the computer can actually copy the file. As the indication starts matching the actual copy speed, it keeps getting lower. (If you’re copying a 1GB file and 1MB has been copied before you see any change on screen, the first indication is 1MB/s, because the indicated speed can’t be shown for fractions of a second. By the time you’ve copied 500MB, that initial 1MB doesn’t change the indicated speed by much: initially it was 100% of the amount copied, but by 500MB it’s only 0.2% of it, so effectively it never happened.)

Please learn this term first

What is the Master Boot Record (MBR)?
Learn how the Master Boot Record provides information about the hard disk partitions and the OS so it can be loaded for the system boot. Explore how it works.

What you get from that lesson is that your hard drive is divided into partitions and sectors. It is the job of your computer to organize and control all of that.

Mechanics VS Computer Memory

I imagine that much of the time is spent with the hard drive’s head moving from one spot to another on the disk to reach a source file’s starting position. Then the file is copied from its stored cells. Then the head must move to the destination location from the index and stop at the starting position assigned by the hard drive software. Then the process pastes file content into predefined cells, all of equal size. It is possible that some files have contents located all over the hard drive. (We call that “fragmentation,” and it slows down a hard drive.) When you save a file, the system looks for an empty cell and starts filling it up. If it fills up a cell and the next cell has content, it has to continue in another free cell. Your computer’s file system has to keep track of all of that so that your files don’t clobber each other.

Perhaps you have never used a record player before. To play a song you have to first move the needle to where the song starts and then place it carefully.

A hard drive has to be more precise and faster than a record player spins.

Copying content partly from memory is fast. (A computer program loads data into memory for faster access.) Using the hardware directly will always be slower, as it operates at the speed its mechanical components and electrical inputs allow.

So a larger file copies faster. We have faster computers today so large files are less of a problem.

If you want to create your own media file on a CD or USB drive it helps a lot to have all of your files together and not fragmented. I partition my disk because it keeps content separated.

When a file is copied:

  1. The source file location on the disk must be found
  2. The file must be opened for reading
  3. A file at the destination must be created
  4. The destination file must be opened for writing
  5. The file contents are copied
  6. The source file must be closed
  7. The destination file must be closed

For one file each step is performed once. For 1024 files steps 1, 2, 3, 4, 6 and 7 must be repeated 1024 times, so it is much slower.
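
A sketch of those seven steps repeated per file, which makes the per-file overhead visible (the directory names are illustrative):

```python
# The seven steps above, repeated once per file; each iteration
# pays the open/create/close overhead again.
import os
import shutil

def copy_all(src_dir, dst_dir):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):           # step 1: find each file
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        with open(src, "rb") as fin:           # step 2: open for reading
            with open(dst, "wb") as fout:      # steps 3-4: create and open
                shutil.copyfileobj(fin, fout)  # step 5: copy the contents
        # steps 6-7: both files are closed when the with-blocks exit
```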

Many possible reasons.

A common one is that most hard drives have a substantial RAM cache built into them. When you write some data, it goes to the RAM cache on the drive first, then the drive controller actually writes it to the physical disk platters over the next second or two.

This makes the write appear to complete very fast, although it hasn't really completed at all.

If you write a LOT of data, that cache fills up, and each new write has to wait for one of the older writes to physically complete before the drive can receive the new data.

That causes the sudden decrease in throughput that you see in the 'file copy details’ dialog.
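
A toy simulation of that cache behavior (the block counts are made up) shows how stalls appear only once the cache fills:

```python
# Toy model of the drive's write cache: writes are "instant" until
# the cache fills, then a new write must wait for a flush first.
class CachedDrive:
    def __init__(self, cache_blocks=4):
        self.cache_blocks = cache_blocks
        self.pending = 0     # blocks sitting in the RAM cache
        self.stalls = 0      # writes that had to wait on a flush

    def write(self, n_blocks=1):
        if self.pending + n_blocks > self.cache_blocks:
            self.stalls += 1
            self.pending = 0  # pretend the cache flushed to the platter
        self.pending += n_blocks
```

The first few writes complete “instantly”; every write after the cache fills pays the flush penalty, which is the throughput drop seen in the copy dialog.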

The way the drive holds files is the data is on the disk, and there's an index at the beginning of the disk that tells the computer where each file is. So when you delete a file, the computer just removes the INDEX entry.

The space on the disk can be full of gibberish data; as long as the INDEX at the beginning says it's "empty", it's considered empty. If you were to download another 1 GB game or movie, the disk would likely reuse the gibberish space, overwriting it with the new file.

So because the actual erasing happens when you put in a NEW file, there's no need to actually erase the old file, there's no need to "make the disk all zeroes" so to speak. It's just unnecessary effort. The computer just deletes the file location from the index, and calls it done.
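
A toy model of that index-only delete (not any real filesystem’s on-disk format):

```python
# Toy model: deleting a file only removes its index entry; the
# data blocks stay as "gibberish" until something overwrites them.
class Disk:
    def __init__(self):
        self.index = {}                  # name -> (start_block, length)
        self.blocks = bytearray(1024)    # raw "platter" contents

    def delete(self, name):
        del self.index[name]             # data in self.blocks is untouched
```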

Copy and paste:

  1. Get file reference
  2. Read data from reference and write to new location.
  3. Display new reference

The entire file content needs to be read and moved.

Cut and paste (on same disk):

  1. Get file reference
  2. Replace original reference with new

Only a reference of a few bytes is changed.

Cut and paste (on different disk):

  1. Get file reference
  2. Read data from reference and write to new location.
  3. Replace original reference with new

As you can see, reading and writing the entire file is what takes time. A copy & paste to a different disk should take about the same amount of time as a cut & paste.
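
In Python terms, the same-disk cut corresponds to a rename, while a copy rereads every byte; a minimal sketch:

```python
# "Cut" on the same filesystem is a rename: only directory
# references change. "Copy" must read and rewrite every byte.
import os
import shutil

def cut_paste(src, dst):
    os.rename(src, dst)    # metadata-only when src and dst share a filesystem

def copy_paste(src, dst):
    shutil.copy(src, dst)  # reads the whole file and writes it out again
```

Note that os.rename raises an error across filesystems, which is why a cross-disk “cut” is really a copy followed by a delete.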

It’s a matter of filesystem overhead mainly.

For each file, the filesystem needs to create an entry, allocate one or more blocks on the drive to it, and finally copy the data to those blocks.

Depending on the filesystem it might also create checksum data so that upon reading the file back it can verify that the data stored matches the data that was originally written. This checksum data also needs to be written to the target disk.

This “housekeeping” needs to be performed on every file written to the disk, and takes approximately the same amount of time regardless of file size.

For a large file, once these operations are taken care of all that’s left for the drive to do is copy a series of full blocks of data one after the other from the source drive to the target drive, and drives are able to do this quite quickly because it’s a really simple, predictable operation that can be optimized in their firmware.

SSDs are better at small, random read/write operations than mechanical HDDs, since they don’t have to physically move heads to the desired location on a platter, but they still have to do a lot of behind-the-scenes work when things happen at random (from the drive’s standpoint) versus just reading/writing a continuous stream of data.

For example writing a small amount of checksum data might actually involve reading a much larger “chunk” of existing data, adding the new data to it, erasing the original “chunk”, and replacing it with the new combined “chunk”.

This is because there is a minimum block size they can work with by design. If they make these blocks too small, they have to work with many blocks to handle large files, and this costs performance. Make them too big and you end up having to do more of these read/modify/erase/write cycles to avoid having partially filled blocks which is a waste of space.
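
A toy model of that read/modify/erase/write cycle (the block size and layout are illustrative):

```python
# Toy model: the drive can only erase/program whole blocks, so
# changing a few bytes rewrites the entire block containing them.
BLOCK_SIZE = 4096

def write_bytes(blocks, offset, data):
    """blocks is a list of bytearrays, BLOCK_SIZE each; data is
    assumed not to cross a block boundary in this sketch."""
    i = offset // BLOCK_SIZE
    chunk = bytearray(blocks[i])            # read the whole block
    start = offset % BLOCK_SIZE
    chunk[start:start + len(data)] = data   # modify the few bytes
    blocks[i] = chunk                       # erase + program the whole block
    return 1                                # blocks rewritten for this write
```

Writing 2 bytes costs an entire 4096-byte block rewrite, which is why many tiny scattered writes are so much more expensive than one long sequential one.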

The answer is already in the question itself.

Windows reads the metadata of a file. Metadata is a part of the file that stores information about it, like file size, creation date, access date, and lots of other information. You can see some of this information in the file’s Properties dialog in Windows.

Now, every time Windows needs to copy a file to another location, it needs to read the metadata. This metadata is roughly the same size for any file. If you have 1,000 small files, Windows needs to read the metadata for each file, and this takes a lot of time. Instead, if you have one big file, Windows reads the metadata just one time. Thereafter it just needs to copy pieces of data from one location to another, and this is faster.

- - - - - - -

Diego Vaccher

There are a lot of reasons!

  1. Location. Computers are very good at repeating one task very fast, so copying one file is very straightforward. Ask the OS to retrieve a file at location X; the OS sends a couple of CPU interrupts, and the CPU begins streaming bytes from disk to RAM, GPU, network, or wherever it's instructed to send the data. The bytes are also all in a line in the same location, so the CPU doesn't have to waste lots of time moving through the index. Instead it starts at the beginning and goes to the end. If it's a lot of files, however, one of the biggest time-consuming parts will be looking for the files on the disk. Sure, you provided the location, but the OS still has to find the disk and sector of each file, set up the CPU to read it, read it, and then, instead of being done, start the process all over again for the next file.
  2. Fragmentation. This isn't a super big problem nowadays because most people are beginning to use SSDs. However, if you use a hard drive, this can be a big problem. Big files can be written to a large unused portion of the disk, but as you delete files the disk can become "fragmented," with lots of unused little bits of space. Sometimes a large file might have some parts written to a fragmented area, meaning the physical head on the hard drive must move somewhere else to find the needed information. But in general a big file will have long uninterrupted sections of data it can read, meaning the data can be read very quickly without stopping. Small files, however, may be broken up to fill a lot of tiny unused spots, meaning the disk must stop reading, move, and start reading somewhere else a lot more, and this eats up more time.

I'm sure there are more reasons; these are just what I happened to think of. It also depends on where you are copying to, and how you are copying. Thank you for reading. Have a great day!
