Each metadata write requires a commit to stable storage, even if it’s done in bulk. That assumes you write the name, the timestamps, any access control lists, advisory access controls, etc. all in one operation, which typically you don’t: it’s usually 4 operations, plus one or more additional operations per ACL, depending on whether there are inherited container ACLs or not.

Each file close requires a commit to stable storage.

1,000 files means a minimum of 2,000 sync-to-disk flush-track-buffer operations, probably closer to 6,000, compared to 2–6 of them for a single large file.
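The arithmetic works out like this (a toy model, not a measurement; the per-file commit counts are the figures quoted above):

```python
# Toy model of stable-storage commits during a copy. The default
# per-file figures (4 metadata commits, 2 for the close path) are the
# estimates from the text, not measured values.
def commits_for_copy(num_files, metadata_commits=4, close_commits=2):
    """Total sync-to-stable-storage operations for copying num_files files."""
    return num_files * (metadata_commits + close_commits)

thousand_small = commits_for_copy(1000)  # roughly 6,000 commits
one_big = commits_for_copy(1)            # a handful of commits
bare_minimum = commits_for_copy(1000, metadata_commits=1, close_commits=1)
```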

Additionally, even if the directory is structured like a B-tree, each new create is going to cost a log2(N) traversal, where N is the number of entries already in the directory tree.

Copy operations of this style typically do a depth-first traversal, which means the writes should really be issued as a breadth-first iteration; but as a practical matter, they can’t be, so the initial traversal should be breadth-first instead.

It typically isn’t, because directory positional data isn’t communicated to user space when traversing a directory tree.

For a bulk copy operation, you’d actually want a flag to pass into opendir() to force the iteration in the other direction, and that would also require modifying the file system.
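No such opendir() flag exists today, but user space can at least do its own breadth-first walk instead of the usual recursive descent. A minimal sketch, using only the Python standard library:

```python
import os
from collections import deque

def walk_breadth_first(root):
    """Yield (directory, filenames) level by level: every directory is
    visited before any directory deeper than it, by queueing
    subdirectories rather than recursing into them immediately."""
    queue = deque([root])
    while queue:
        directory = queue.popleft()
        with os.scandir(directory) as entries:
            children = list(entries)
        yield directory, [e.name for e in children if e.is_file()]
        queue.extend(e.path for e in children
                     if e.is_dir(follow_symlinks=False))
```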

One of the most obvious examples of the problem is the “ports” tarball in FreeBSD, where you pull it up forwards and lay it down forwards, and it takes forever; but if you modify tar to add a flag to go breadth-first on write, given that you have the metadata in hand, it’s a couple of seconds instead of 20 minutes.

As a purely practical matter, the flag isn’t there by default, because tar is often used in “copy mode”, e.g.:

  1. tar cf - . | (cd /some/new/place ; tar xvf -) 

And the data you’d want to unpack first is not in that order in the pipeline.


When you copy a single 1GB file, you:

* Read the file’s information from the disk directory
* Locate the file on disk
* Locate free space on the destination
* Write the file’s directory information on the destination
* Read as much of the file as will fit in RAM
* Write what you’ve read to the destination
* Close the file on the destination
* Release the file’s handle on the source
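The steps above map almost one-to-one onto a minimal chunked copy loop. A sketch (illustrative only; a real copier would also carry permissions and timestamps across):

```python
import os

def copy_file(src, dst, chunk_size=64 * 1024 * 1024):
    """Copy src to dst in RAM-sized chunks, mirroring the steps above."""
    fd_in = os.open(src, os.O_RDONLY)           # locate/open the source file
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while True:
            chunk = os.read(fd_in, chunk_size)  # read as much as fits in RAM
            if not chunk:
                break
            os.write(fd_out, chunk)             # write what you've read
    finally:
        os.close(fd_out)                        # close the destination
        os.close(fd_in)                         # release the source handle
```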

When you copy 1,000 1MB files, you:

* Read the first file’s information from the disk directory
* Locate the first file on disk
* Locate free space on the destination
* Write the first file’s directory information on the destination
* Read the first file into RAM
* Write what you’ve read to the destination
* Close the first file on the destination
* Release the first file’s handle on the source
* Repeat 999 more times

With a thousand files, all the faffing about (reading directory information, allocating space on the destination, writing directory information, and so forth) can actually take more time, and involve writing more data, than the contents of the small files themselves!

Even if you gang up (locate and read the first file, locate and read the second file, locate and read the third file, construct file system entries in memory, allocate space for the files all at once, write the directory entries, write the file data), that’s still a lot of faffing about.



There are some reasons why it is harder to copy 1,000 small files than 1 larger one, as described by other people.

However, most of the time the slowdown is not only due to inherent difficulty, but also due to a badly written copy program.

So why are some copy programs badly written?

  • Some programs were written in the days of dumb drives and haven’t changed since.
  • Some programmers don’t know how to write fast copy programs for modern “smart” drives.
  • Some programmers don’t bother writing fast copying code because they think that on modern hardware (e.g. an NVMe SSD) the difference is negligible.

Some programs are slow despite being well-written because they try to save memory, so they represent a compromise between speed and memory use.

On Windows, generally Explorer copies faster than some file managers such as Far, but slower than specialized programs such as FastCopy.

Why is copying 1,000 1MB files so much slower than copying 1 1GB file, given that the same amount of data is being copied?

Because although you copy (roughly) the same amount of data, you pay the overhead of copying a file not once but a thousand times. It’s the 999 additional rounds of overhead that make this operation so much slower.
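You can put that into a rough formula: total time ≈ files × per-file overhead + bytes ÷ bandwidth. A sketch with purely illustrative numbers (10 ms of overhead per file, 500 MB/s of raw bandwidth; neither is measured):

```python
def copy_time(num_files, file_bytes, per_file_overhead_s, bandwidth_bps):
    """Rough model: a fixed per-file cost plus the raw transfer time."""
    return (num_files * per_file_overhead_s
            + num_files * file_bytes / bandwidth_bps)

# Same gigabyte of data, very different totals:
one_big = copy_time(1, 10**9, 0.010, 500 * 10**6)     # about 2 seconds
many_small = copy_time(1000, 10**6, 0.010, 500 * 10**6)  # about 12 seconds
```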



The difference in speed when copying 1,000 1MB files versus a single 1GB file can be attributed to several factors:

  1. File System Overhead: Each file copy operation involves some overhead associated with file system metadata management. This includes creating entries in the file system for each file, updating directory structures, and managing file attributes (like timestamps and permissions). Copying many small files results in a lot of this overhead, which accumulates and slows down the process.
  2. Fragmentation: Small files may not be stored contiguously on the disk, leading to fragmentation. When copying many small files, the read/write head of a traditional hard drive has to move around more, which increases seek time and reduces overall transfer speed. In contrast, a single large file is more likely to be contiguous, minimizing seek time.
  3. I/O Operations: Each file copy operation typically involves multiple input/output (I/O) operations. For 1,000 files, there are 1,000 separate I/O operations, whereas for a single 1GB file, there's only one. More I/O operations mean more time spent managing the operations rather than transferring data.
  4. Buffering and Caching: When copying a single large file, the operating system can take advantage of buffering and caching more effectively. It can read larger chunks of data into memory at once, reducing the number of read/write operations. With many small files, the benefits of caching are diminished because each file may be accessed independently.
  5. Network Transfer: If the files are being copied over a network, there may be additional latency per file due to the need to establish connections or negotiate transfer parameters for each individual file. A single large file can often be transferred more efficiently in one go.

In summary, while the total amount of data is the same in both scenarios, the way that data is managed, accessed, and transferred significantly affects the speed of the copy operation.

For the real answer see Franklin Veaux’s answer.

Here’s an analogy.

Why is it faster to put a single 10,000 page book on the library shelf, and update the library catalogue, than to put 1000 10-page pamphlets on the library shelves and update the catalogue for each one?

All that extra housekeeping.

Just try:

Open the door, carry a bottle of water inside. Get out. Close the door. Do that 10 times.
Or: Open the door, carry 10 bottles of water inside. Get out. Close the door.

You could spare some time by leaving the door open. Systems do that too: they use caches, don’t reread the catalog all the time, and don’t rewrite the catalog entries after each copied block. But still there will be lots of housekeeping that cannot be spared.



Well, having read the comments, and acknowledging I don’t know “journaled file systems,” I’d like to add that both the writer and upvoter are missing the real power of the OS: caching. Unless you make the copy to un-cached media, the process is totally different, much simpler and faster, but still much slower than copying 1 chunk of data. I know this from practical test results. You’re better off, say, zipping all the files into 1 before copying.

The point I’d like to make is that caching eliminates the individually troublesome, long data-handling procedure by copying into RAM (1,000 times (HDD) or 100 times (SSD) faster).
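The zip-first suggestion is easy to sketch with the standard library; the point is that the transfer then becomes one large sequential stream instead of thousands of tiny ones:

```python
import os
import zipfile

def bundle(paths, archive):
    """Pack many small files into one archive so the subsequent copy is
    a single large sequential transfer."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=os.path.basename(p))
```

Copy the resulting archive, then unzip on the other side; the per-file overhead is paid locally, where it is cheapest.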

For pretty much the same reason that it’s a lot quicker to move a pack of 500 sheets of paper from one shelf to another than it is to move each sheet one at a time.



The answers posted have covered this quite well, but let’s take a look from a different angle.

Compare this question to the many posted here about disk fragmentation and how that affects performance as well. Just as many small fragments of a file require additional work to retrieve, so does the placement of many small files on a disk compared to a single large file. The process is the same: larger “chunks” or “files” will require fewer overhead tasks.

Another answer used mail as an example. Let’s use another non-computer example to illustrate this too. Which would you prefer: carrying one 2-liter bottle of water at a time to a table, or bringing a case (12) at a time and opening the case at the table? At most, you could carry 3 or 4 bottles by hand. With cases, you could potentially carry 24 (2 cases). Again, look at the overhead here (walking to/from the table). Moving the cases takes less time since you travel the pathway only once for every 24 bottles, versus 6 times (at best) if you can carry 4 bottles at a time. That is a 6X improvement!!

Makes sense right?


I’ve already seen quite a few good answers here.

The general answer is that each file requires a certain amount of “overhead”. Each is handled separately, by the file system, and by your network, and the delays caused by that handling add up. (This is especially true in systems that have been optimized to handle large files - rather than small files - efficiently.)

If you’re talking about network transfers, a lot of that overhead involves how the packets are routed from one point to another and the limitations of the equipment involved. For example, if you look at the hardware limitations of a typical small router, you might find:

  • maximum throughput = 100 Mbits/second
    (limited by how fast it can move data)
  • maximum routing throughput = 100,000 packets/second
    (limited by how long it takes to decide where each packet should be sent)

So, if you have one big file, being sent as relatively few large packets, the size is what counts. However, with many small files, the limitation is going to be how long it takes to decide where to send each piece.
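Those two limits combine as a minimum: the effective rate is whichever runs out first, wire speed or routing speed. A quick model using the figures above (packet sizes are illustrative):

```python
def effective_bps(link_bps, max_pps, packet_bytes):
    """A router delivers at the lower of its wire speed and its
    packet-routing speed for packets of a given size."""
    return min(link_bps, max_pps * packet_bytes * 8)

# 100 Mbit/s link, 100,000 packets/second:
big_packets = effective_bps(100e6, 100_000, 1500)  # wire-speed limited
tiny_packets = effective_bps(100e6, 100_000, 100)  # routing limited
```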

Imagine moving one bucket full of sand… from upstairs to downstairs… through a 1/4″ hole in the floor. If you dump the bucket onto the floor, over the hole, the size of the hole will be the limiting factor (network speed). But, if you carry the sand to the hole between your fingers, one pinch at a time, the limiting factor will be the time you spend picking up pinches of sand, walking across the room with them, and dropping them through the hole.

Note that, once you know this, there are ways to optimize the process. For example, if you have 10,000 tiny files, and you know your system will be terribly slow transferring them, then combine them into one large ZIP or RAR archive file, send that across your connection, then extract the separate files at the other end. Even though it will cost you time to combine them into a single file, and extract them at the other end, you may save MORE time by transferring one big ZIP file instead of 10,000 separate files. (This is the way software is often distributed - for this and other reasons.)

The way packet traffic works across a network is often compared to “cutting a book up into individual pages and sending each in a separate envelope”. Note, however, that the post office would never be foolish enough to send envelopes separately - instead they combine envelopes destined for the same block into sacks - and pack sacks destined for the same area into trucks. At some points on a network packets are treated more like molecules of water flowing through pipes… but, at others, they are treated more like envelopes, which can be consolidated, and transported in large batches, significantly cutting down on handling overhead.

Handshaking and send/receive confirmation between each small file adds significant time to the copy process.


When a file is created, in addition to the contents of the file being written, a record is added to the filesystem on the drive. This record specifies the file name, the location of the data sectors on the drive, access rights, etc.

When many tiny files are transferred, this operation has to be repeated thousands of times. When you transfer a single file, it is only done once.

Moreover, the record to update (which represents the directory data) becomes bigger with each cycle. It takes more and more time to read it, update it, and write it back. There’s no significant difference between, say, 10,000 and 10,001 entries, but it is noticeable between 1 and 10,001.

One more factor is the location of file data and directory data on the drive, if it is a classic disk drive (not flash/SSD). Once upon a time the drive had to position the magnetic head to the location where the new file would be stored, write the file contents, then position the head to the location where the directory data is stored and update the directory, repeating this sequence for each file in a loop. This took a LOOOOOOT of time and produced a lot of mechanical noise when you copied hundreds or thousands of files; we even used that to ‘play drums’ on big computers decades ago. Later, operating systems started to cache such changes, so this process became much faster: the cache accumulates the data to write and flushes it to the disk periodically, making far fewer ‘real’ read-write operations and thus spending much less time repositioning the heads.


Pretend you are a courier: Why does it take longer to deliver 100 1-pound packages to 100 addresses than to deliver one 100-pound package to one address?

Each of those files has to have its own file name, address, and entry in the catalog.
Then there is the block size. An NTFS drive has a minimum block size of 4K. A file that is 2K in size will still take one block (4K) worth of space. A 4.1K file will take 2 blocks, or 8K, so many small files can lead to a lot of wasted space. A large file, by contrast, fills its blocks fully until it gets to the last block, so only a portion of one block is wasted.
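The block-size rounding is simple to compute: on-disk size is the file size rounded up to a whole number of blocks. A sketch:

```python
import math

def on_disk_size(file_bytes, block_bytes=4096):
    """Space a file actually consumes: its size rounded up to whole blocks."""
    return math.ceil(file_bytes / block_bytes) * block_bytes

two_k_file = on_disk_size(2 * 1024)  # 2K of data still occupies 4096 bytes
just_over = on_disk_size(4198)       # a ~4.1K file spills into a second block
```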


The simplest analogy is this…

Imagine you have a library of books, and a card catalog that tells you where the books are on the shelves. Moving a file on a computer is essentially like deciding that the catalog card for a particular book should be in a different drawer. You pull the card out of the drawer it is in and put the card in the new drawer, then you’re done. The book stays exactly where it was in the library, and the card still points to that same shelf, and the only thing that changes is the position of the card.

On the other hand, copying a file is like deciding that you need two different copies of a book on different shelves in the library. Not only do you have to make a new card for the catalog, you also have to get another copy of the book itself and put it out on the shelves, which is much more labor intensive. You’ll now have two catalog cards pointing to two different copies of the same book.

Computers use catalogs just like libraries do. When you request a file from a computer, first it looks in the catalog to find the location of the file’s data, and then it goes and retrieves that data from storage. Moving a file amounts to reorganizing the catalog without touching the data; copying a file duplicates all of the data.

If you move a file across file systems (to different disks, drives, or media) a move is effectively the same as a copy. That’s because each file system has its own catalog and own data area, and cannot refer to catalogs or data areas on other file systems. It has to copy the data into the new file system so that file system’s catalog can access it.
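This is exactly why tools like Python’s shutil.move try a cheap rename first (re-filing the catalog card) and only fall back to copy-and-delete when the rename fails across file systems. A sketch of that pattern:

```python
import errno
import os
import shutil

def move(src, dst):
    """Try the catalog-only rename first; fall back to a full
    copy-and-delete when src and dst are on different file systems."""
    try:
        os.rename(src, dst)        # just moves the catalog card
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, dst)     # duplicate the data...
        os.remove(src)             # ...then drop the original
```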


There are many factors that impact copy speed. The file system (NTFS, FAT32, HFS+, ext2/ext3/ext4, XFS, JFS, ReiserFS, btrfs) can hinder or speed up the process. Older file systems are single-threaded, meaning one copy operation at a time instead of many simultaneous ones. Other factors are overhead: metadata about the files on the filesystem requires updating, plus expensive journaling operations, RAID operations across multiple disks, and other hardware overhead. Disk has always been the performance bottleneck, and it is where the fastest improvements have been made recently, with SSDs replacing traditional platters of spinning rust and magnetic fields, and PCIe bypassing SATA controllers to double and quadruple speeds. Some operating system file managers need to scan all the files and subdirectories before even starting the copy; you see this in Windows Explorer or the Mac Finder when it says “Preparing to Copy”. Other systems, typically command-line tools, don’t prepare at all; they just start copying.

When copying over a network the factors are increased a great deal. Even though you may have incredible bandwidth available it likely won’t be used to its potential and you can add the file transfer protocol overhead as well (SMB/CIFS, SSH/SFTP, NFS) on top of whatever filesystem and hardware overhead.

The biggest reason that many small files takes longer than a single large file is they are copied one at a time and they complete the copy before reaching maximum speed and then the next file starts to copy and increases speed then it finishes and the next file starts to copy and increases speed. But a large file takes longer to copy and therefore can reach maximum peak speed and maintain that maximum speed before it finishes copying. In this case it’s more efficient and accelerates faster for a longer period of time. It’s like a 0–60 mph acceleration of a car for each file. They complete the copy before the car even reaches 20mph, the car stops then starts again from 0mph accelerating until the copy finishes then comes to a stop. Rinse and repeat for 50,000 small files and it takes forever to complete. Copying a single large file is like a train loaded with cars accelerating to maximum speed from point A to point B. So zipping or otherwise compressing the many small files into a large file prior to copy can speed things up considerably. Looking at network bandwidth graphs while copying you would see many small spikes for each file but a large square wave reaching peak speed continuing to completion. Overall the speed of copy will be faster to complete.

In Unix/Linux/Mac systems you can effectively pipe multiple commands together to compress the files, pass them between hosts, and decompress on the other side. This is accomplished by tar-gzipping the source files, piping to netcat (which opens a raw network port and sends to an IP address waiting to receive via netcat), then piping to a tar-gunzip process on the destination host. In this case the compression ratio isn’t as important as the speed with which the data can be compressed, so bzip2 would be slower. This technique eliminates the network transfer protocol overhead by using raw network ports, and it creates one large continuous stream of data during the copy, so network bandwidth is allowed to peak and become saturated. This is ideal on an isolated network that is not throttled and would not impact other networked systems. In real life, this technique ran file copies at maximum speed and completed the transfer of hundreds of gigabytes over gigabit Ethernet hours faster than a normal copy of many mostly-small files. The other advantage is that free space on the source might not be enough to hold the entire gzipped archive, which is why it gzips on the fly and pipes into netcat.

Other fast downloaders or file copy tools offer spinning up multiple threads. The Robocopy utility on Windows can do this, but it’s only effective on many small files; spin up 60 threads and it will finish the copy faster. There is the aria2c utility that does the same for HTTP, FTP, SFTP and BitTorrent. I recently used it to download Xcode, which is 4.54 GB and was taking forever because the App Store was swamped right after Apple’s WWDC. I grabbed it from the developer website instead of the App Store and had aria2c download it very quickly.

Modern 64-bit/128-bit file systems that support multithreaded access, snapshots, and deduplication can speed things up considerably, depending on what you are doing. ZFS and Apple’s new APFS can duplicate a huge file in seconds and then let you make changes to the duplicate while tracking those changes in snapshots. On ZFS you can send snapshots to a second system, and it will copy only the deltas, saving you time in file transfer. This is how online backups like CrashPlan work: they only copy the changes. Snapshots should eventually speed things up in Time Machine as well, while taking less space on the local disk until you attach the backup drive.

Apple also supports Fusion drives, which combine an SSD with a larger traditional disk. You could have a 256GB SSD paired with a 2TB hard disk, and the CoreStorage system is smart enough to put your most frequently used files, operating system, and applications on the SSD to speed things up, while archiving older items on the slower 2TB drive. It all looks like one single disk, but it’s not. This is getting improved in High Sierra with the new Apple File System (APFS), which won’t need CoreStorage anymore.

Another reason copying files can be slow is that the operating system doesn’t want a single copy operation to swamp the whole system to the point it becomes unresponsive, so copies are intentionally throttled. Otherwise your system can seem to hang while it’s busy with a long copy operation. Servers can be swamped by too many disk operations at once from too many users, and everyone is slowed down at that point. On servers, throttling is definitely used, especially on virtual machine hosts: you don’t want one virtual machine stalling performance for all the others, so both CPU and disk are throttled as the resources are shared.


Because copying a file is a process.

Let's just deal with this theoretically. We'll also just refer to hard drives. A similar concept applies to other storage devices.

A computer is nothing more than a human has made it to be. Every tiny detail has to be planned by a human.

This is like that old trick of teaching how computers work.

Me: I'm a computer. Program me to get a glass of water.

You: Okay, go to the kitchen.

Me: Invalid command. You have to be more specific.

You: Fine. Get up.

Me: (I get up.)

You: Walk to the kitchen.

Me: Unrecognized argument, “kitchen.”

You: What?

Me: You have to be more specific.

On a hardware level, like processors and hard drives, it would be even more ridiculous than this. “Get up” would be an unrecognized command. You're entering commands into my brain so you would have to tell it to send messages to each muscle necessary to get up. To do that you would have to tell it which muscles those are and what message to send.

Hard drives have two commands in our theoretical world. They had to have hardware designed to know how to carry out those commands. One is “give me data,” the other is “store this data.”

To copy a file, the processor has to say “give me data,” then “store this data,” and repeat that over and over until the file is copied. But how does the hard drive know which data? The processor has to figure that out and tell the hard drive which data to get.
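
The “give me data” / “store this data” loop described above can be sketched in a few lines; the chunk size here is an arbitrary illustrative choice:

```python
# Sketch of the "give me data" / "store this data" loop the
# processor runs to copy a file, one chunk at a time.
def copy_file(src_path, dst_path, chunk_size=64 * 1024):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)   # "give me data"
            if not chunk:
                break                      # end of file reached
            dst.write(chunk)               # "store this data"
```

The processor (here, the Python program) is the one deciding which data to ask for next; the drive just serves and stores blocks.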

Couldn't you just add a “copy this data” command to the hard drive?

Yes! You could add whatever you want! But it'll take more hardware on the hard drive to do it. Hard drives would have to be more expensive as a result. Copying a file uses so little of the processor's resources that it isn't worth it. The closer a hard drive comes to being its own mini computer, the greater the risk of bugs and failure, and the more expensive it will be.

That's why copying files uses processing power: it's a very fair trade-off compared to what's required for the hard drive to be more autonomous. Generally, simple processes are left to the processor. Complex things that are difficult for the processor and slow it down considerably are moved off into other hardware, like a sound chip or a graphics chip.

Then there's the more political issue of the processor worrying about hard drives becoming too powerful and taking over the computer. It's already seen this happen to some degree with GPUs. Now it has to share the spotlight with them. They don't want other devices rising up as well. It threatens their processor privilege so they have to oppress the lowly hard drive. Processors are really quite devicist at their core, and look at how many cores they have these days!

Each file has an entry for its name, date information, and the logical block where it is located on the disk. Then there is a bitmap of which blocks are used and which are empty. When any file is created, an entry in the FAT is created and a search of the bitmap is made to find enough blocks to copy the file into. For performance the bitmap can be cached so that not too many reads are needed to find a free location. At best this takes one read and one write, but on a fragmented disk it can take many reads to find a large enough free space. Now the file is allocated its logical blocks and the creation date and time are recorded in the FAT table, which is another write. Then the data is written into the blocks, and at the end the FAT is updated with date information, completing the process. A FAT entry is so important that it is written immediately to the drive. How does a large file differ? There is usually only one entry: the starting address and the length in blocks. There is no waiting for the drive to turn or the head to move to a different block. By the way, when a file is moved within the same hard drive, only the FAT entry needs to be changed; the file does not move, only the pointers.
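
As a rough sketch (a simplified model, not how any real FAT driver is written), the bitmap search for a run of free blocks might look like this:

```python
# Simplified model of a FAT-style allocator: scan a free-space
# bitmap for a run of contiguous free blocks, then mark them used.
def allocate(bitmap, blocks_needed):
    """Return the index of the first free run of blocks_needed
    blocks, marking them used, or None if the disk is too
    fragmented to hold the file contiguously."""
    run_start, run_len = 0, 0
    for i, used in enumerate(bitmap):
        if used:
            run_start, run_len = i + 1, 0   # restart the run search
        else:
            run_len += 1
            if run_len == blocks_needed:
                for j in range(run_start, run_start + blocks_needed):
                    bitmap[j] = True        # mark blocks allocated
                return run_start
    return None
```

On a fragmented bitmap the scan has to pass over many short free runs before finding one big enough, which is exactly the “many reads” cost described above.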

One word. Metadata. Metadata is the bottleneck of the modern storage system.

From a conceptual level, think about a computer filesystem like a book with an index and you’re an editor.

The entire book has 20 chapters of 100 pages each, and each chapter is roughly 1 MB in size on disk. You do need to check the index, but if you’re deleting an entire chapter it’s relatively quick to identify and delete its pages.

The filesystem functions similarly. It has an index, references, and pages that allocate data and even redirect to other portions of the disk when say a page or paragraph is moved.

Essentially it comes down to performing at least two orders of magnitude more operations to delete the files, likely sequentially.

This is dramatically exaggerated by systems with erasure coding, such as RAID and LRC, and by other technologies like copy-on-write or snapshotting, which must be consolidated.

That’s the gist of it.

To go into further conceptual detail: you can actually make the 1,000 operations faster than the single 1 GB delete if you’re very skilled with the operating system, the calls it makes, and how it specifically behaves with regard to parallel execution, file handles, and disk and filesystem buffers and caches. With the 1,000 small files, you can take advantage of parallel operations on a much larger scale.

This is one foundational cornerstone that large-scale distributed storage systems rely on in order to store and process data at the petabyte and exabyte scale efficiently and reliably.
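
As a sketch of that idea, a thread pool can have several small deletes in flight at once; whether this actually beats a sequential loop depends entirely on the OS, filesystem, and device:

```python
# Sketch: deleting many small files with a thread pool so several
# deletes can be in flight at once. Whether this beats a simple
# sequential loop depends on the OS, filesystem, and device.
import os
from concurrent.futures import ThreadPoolExecutor

def delete_all(paths, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all deletes to complete before returning
        list(pool.map(os.remove, paths))
```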

Several reasons:

  • Overhead of deleting 10000 directory entries instead of 1 directory entry
  • Overhead of deallocating a minimum of 10000 regions of disk vs overhead of deallocating a minimum of 1 region of disk
  • Disk seek time seeking to 10000 directory entries instead of 1 directory entry

There are likely other reasons as well.

Imagine you have two "to-do" lists. One has 10000 small bank accounts that you have to close. The other has one big bank account that you have to close. Can you see how the amount of time it takes might depend more on how many accounts there are, than on how much money is in the accounts?

It is not simply a matter of moving the file contents. There is also overhead associated with each file being processed. As the diagram (for Unix/Linux) below begins to indicate, there is quite a bit of non-content data involved…which has to be examined & updated.

Note that for cut/copy there is also the need to find space for each file and then to transfer it. Clearly that is a simpler task for a single file than for multiple ones. A single, large file might not even be (or have to be) fragmented. Even if it is, it isn’t likely to be in 16,000 pieces.

Each file copied requires a notation in the FAT (File Allocation Table); then the data is written to the disk. Remove 999 writes to the FAT and you speed things up.

For pretty much the same reason it takes longer to ring up three shoppers in the express lane, each of whom has ten items in their basket, compared to one shopper who has 30 items in their basket.

It takes time to “setup” (key stuff in) and “teardown” (accept payment, coupons, etc.) a customer’s order at the supermarket, so there is more to do:

3 orders x 10 items = 3x setup + 30 items + 3x teardown

1 order x 30 items = 1x setup + 30 items + 1x teardown

In the case of files we have:

16000 files = 16000 setup (find the file) operations + 16000 delete operations (totalling 1GB of deleted files)

1 file = 1 setup (find the file) + 1 delete operation (which also totals 1GB of deleted files)

So it’s not really the delete operation itself; it’s the stuff in between the deletes: finding the files, marking them deleted in the directory, and releasing the storage associated with file information which is not stored inside the file but somewhere else in the filesystem (owner, file size, permissions, etc.).
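
The setup/teardown arithmetic above can be put into a toy cost model; the constants here are invented purely for illustration:

```python
# Toy cost model from the supermarket analogy: total time is
# per-item work plus per-order (per-file) setup and teardown.
# The constants are invented for illustration only.
def total_cost(n_files, items_per_file, setup=1.0, teardown=1.0,
               per_item=0.01):
    return n_files * (setup + teardown) + n_files * items_per_file * per_item

# Same total amount of data, very different totals:
many_small = total_cost(16000, 1)   # 16,000 one-item "orders"
one_big = total_cost(1, 16000)      # 1 order with 16,000 items
```

With these numbers the per-file setup/teardown dominates completely for the 16,000-file case, even though the per-item work is identical in both.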

Because each file has an overhead for writing. For 1,000 small files, the system has to write the metadata 1,000 times.

Let’s compare.

One single file of 1 GiB: you open it once and start transferring. Once the 1 GiB of data is transferred, you close the file, write the metadata, and that’s that.

Now 1,024 files of 1 MiB each: you open file one, transfer 1 MiB of data, close the file, and write the metadata. Then repeat for the second file, and so on. (We are not taking into consideration multiple threads that the system might start.)

Notice how in the second case you have repeated opening and closing of files and writing of metadata.

That’s why we zip or tar files on web servers before downloading them: to consolidate multiple smaller files into a single large file that’s easier and faster to download.
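
As a minimal illustration of that consolidation step, using Python’s standard tarfile module (the paths are hypothetical):

```python
# Consolidate many small files into one compressed archive, so a
# transfer pays the per-file overhead once instead of N times.
import os
import tarfile

def bundle(paths, archive_path):
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in paths:
            # store each file under its base name inside the archive
            tar.add(p, arcname=os.path.basename(p))
```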

Creating 10 files in a file system takes a lot less time than creating 400000 files, no matter how large or small the files are. Allocating disk space, searching the directory for existing entries, and inserting a new entry takes time, and writing each file requires a separate seek operation on the disk, after writing the directory entry.

Think of how long it would take to fill a barrel with water taking one cup at a time versus ...

That makes sense. “Files” aren’t really a thing at the hardware level - they’re an abstraction, a concept used by computer software to make it easier for humans to understand what they’re doing. In hardware, a file is little more than a series of 1’s and 0’s on a disk.

Computers track files on a disk using a list of some sort (in NTFS, used by Windows, this is called the MFT, or master file table), where each entry in the list points to the location and size on disk of the file in question.

Deleting a file is a simple matter of removing the file’s entry in the list. The amount of data involved in a single list entry is unlikely to exceed even a fraction of a kilobyte.

Copying a file, however, requires reading every single byte of the file from one location, and writing it to the second. That means potentially gigabytes of reads and gigabytes of writes, depending on file size.

I should note, further, that some filesystems support copy-on-write. In a copy-on-write scheme, when a file is copied, the second copy initially points to the same exact data as the first. It is only when data is written to one of the copies, that this copying process is performed; and then only the portion changed is copied. The file table can then be updated so each file gets its own correct version (the original is unchanged, but the edited copy will show the desired changes). This way, the user can still create and edit copied files normally, but the computer saves on unneeded disk accesses.

One such copy-on-write filesystem is btrfs, a common file system used by Linux computers.
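
A toy model of copy-on-write (not btrfs’s actual implementation) makes the block sharing concrete:

```python
# Minimal model of copy-on-write: a clone shares block references
# with the original, and only a block that gets written is
# actually duplicated.
class CowFile:
    def __init__(self, blocks):
        self.blocks = blocks          # list of shared bytearray blocks

    def clone(self):
        # Cloning costs O(number of block pointers), not O(data):
        return CowFile(list(self.blocks))

    def write(self, block_index, data):
        # Duplicate only the one block being modified.
        self.blocks[block_index] = bytearray(data)
```

The clone is instant because no data moves; the cost is deferred to the first write of each block, and only for the blocks that actually change.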

Each file needs to be set up, transmitted, and saved separately. Any missing packets must be fixed before transmission of the next file starts, or, if several (say 5) run in parallel, before the next (6th) file starts.

The initial speed indication includes all of the file that’s been read before you see anything happening, so it’s a lot higher than the speed at which the computer can actually copy the file. As the indication starts matching the actual copy speed, it keeps getting lower. (If you’re copying a 1GB file and 1MB has been copied before you see any change on screen, the first indication is 1MB/s, because the indicated speed can’t be shown for fractions of a second. By the time you’ve copied 500MB, that initial 1MB doesn’t change the indicated speed by much: initially it was 100% of the amount copied, but by 500MB it’s only 0.2% of it, so effectively it never happened.)

Please learn this term first

What is the Master Boot Record (MBR)?
Learn how the Master Boot Record provides information about the hard disk partitions and the OS so it can be loaded for the system boot. Explore how it works.

What you get from that lesson is that your hard drive is divided into partitions and sectors. It is the job of your computer to organize and control all of that.

Mechanics VS Computer Memory

I imagine that much of the time is spent with the hard drive’s head moving from one spot to another on the disk to reach a source file’s starting position. Then the file is copied from its stored cells. Then the head must move to the destination location from the index and stop at the starting position assigned by the hard drive software. Then the process pastes file content into predefined cells, all of equal size. It is possible that some files have contents located all over the hard drive. (We call that “fragmentation,” and it slows down a hard drive.) When you save a file, the system looks for an empty cell and starts filling it up. If it fills up a cell and the next cell has content, it has to continue in another free cell. Your computer’s file system has to keep track of all of that so that your files don’t clobber each other.

Perhaps you have never used a record player before. To play a song you have to first move the needle to where the song starts and then place it carefully.

A hard drive has to be more precise and faster than a record player spins.

Copying content partly from memory is fast. (A computer program loads data into memory for faster access.) Using the hardware directly will always be slower, as it operates at the speed its mechanical components and electrical inputs allow.

So a larger file copies faster. We have faster computers today so large files are less of a problem.

If you want to create your own media file on a CD or USB drive it helps a lot to have all of your files together and not fragmented. I partition my disk because it keeps content separated.

When a file is copied:

  1. The source file location on the disk must be found
  2. The file must be opened for reading
  3. A file at the destination must be created
  4. The destination file must be opened for writing
  5. The file contents are copied
  6. The source file must be closed
  7. The destination file must be closed

For one file each step is performed once. For 1024 files steps 1, 2, 3, 4, 6 and 7 must be repeated 1024 times, so it is much slower.
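
A sketch of those seven steps repeated per file, which makes the per-file overhead visible (the directory names are illustrative):

```python
# The seven steps above, repeated once per file; each iteration
# pays the open/create/close overhead again.
import os
import shutil

def copy_all(src_dir, dst_dir):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):           # step 1: find each file
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        with open(src, "rb") as fin:           # step 2: open for reading
            with open(dst, "wb") as fout:      # steps 3-4: create and open
                shutil.copyfileobj(fin, fout)  # step 5: copy the contents
        # steps 6-7: both files are closed when the with-blocks exit
```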

Many possible reasons.

A common one is that most hard drives have a substantial RAM cache built into them. When you write some data, it goes to the RAM cache on the drive first, then the drive controller actually writes it to the physical disk platters over the next second or two.

This makes the write appear to complete very fast, although it hasn't really completed at all.

If you write a LOT of data, that cache fills up, and each new write has to wait for one of the older writes to physically complete before the drive can receive the new data.

That causes the sudden decrease in throughput that you see in the 'file copy details’ dialog.
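
A toy simulation of that cache behavior (the block counts are made up) shows how stalls appear only once the cache fills:

```python
# Toy model of the drive's write cache: writes are "instant" until
# the cache fills, then a new write must wait for a flush first.
class CachedDrive:
    def __init__(self, cache_blocks=4):
        self.cache_blocks = cache_blocks
        self.pending = 0     # blocks sitting in the RAM cache
        self.stalls = 0      # writes that had to wait on a flush

    def write(self, n_blocks=1):
        if self.pending + n_blocks > self.cache_blocks:
            self.stalls += 1
            self.pending = 0  # pretend the cache flushed to the platter
        self.pending += n_blocks
```

The first few writes complete “instantly”; every write after the cache fills pays the flush penalty, which is the throughput drop seen in the copy dialog.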

The way the drive holds files is the data is on the disk, and there's an index at the beginning of the disk that tells the computer where each file is. So when you delete a file, the computer just removes the INDEX entry.

The space on the disk can be full of gibberish data; as long as the INDEX at the beginning says it's "empty", it's considered empty. If you were to download another 1 GB game or movie, the disk would likely reuse the gibberish space, overwriting it with the new file.

So because the actual erasing happens when you put in a NEW file, there's no need to actually erase the old file, there's no need to "make the disk all zeroes" so to speak. It's just unnecessary effort. The computer just deletes the file location from the index, and calls it done.
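
A toy model of that index-only delete (not any real filesystem’s on-disk format):

```python
# Toy model: deleting a file only removes its index entry; the
# data blocks stay as "gibberish" until something overwrites them.
class Disk:
    def __init__(self):
        self.index = {}                  # name -> (start_block, length)
        self.blocks = bytearray(1024)    # raw "platter" contents

    def delete(self, name):
        del self.index[name]             # data in self.blocks is untouched
```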

Copy and paste:

  1. Get file reference
  2. Read data from reference and write to new location.
  3. Display new reference

The entire file content needs to be read and moved.

Cut and paste (on same disk):

  1. Get file reference
  2. Replace original reference with new

Only a reference of a few bytes is changed.

Cut and paste (on different disk):

  1. Get file reference
  2. Read data from reference and write to new location.
  3. Replace original reference with new

As you can see, reading and writing the entire file is what takes time. A copy & paste to a different disk should take about the same amount of time as a cut & paste.
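
In Python terms, the same-disk cut corresponds to a rename, while a copy rereads every byte; a minimal sketch:

```python
# "Cut" on the same filesystem is a rename: only directory
# references change. "Copy" must read and rewrite every byte.
import os
import shutil

def cut_paste(src, dst):
    os.rename(src, dst)    # metadata-only when src and dst share a filesystem

def copy_paste(src, dst):
    shutil.copy(src, dst)  # reads the whole file and writes it out again
```

Note that os.rename raises an error across filesystems, which is why a cross-disk “cut” is really a copy followed by a delete.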

It’s a matter of filesystem overhead mainly.

For each file, the filesystem needs to create an entry, allocate one or more blocks on the drive to it, and finally copy the data to those blocks.

Depending on the filesystem it might also create checksum data so that upon reading the file back it can verify that the data stored matches the data that was originally written. This checksum data also needs to be written to the target disk.

This “housekeeping” needs to be performed on every file written to the disk, and takes approximately the same amount of time regardless of file size.

For a large file, once these operations are taken care of all that’s left for the drive to do is copy a series of full blocks of data one after the other from the source drive to the target drive, and drives are able to do this quite quickly because it’s a really simple, predictable operation that can be optimized in their firmware.

SSDs are better at small, random read/write operations than mechanical HDDs, since they don’t have to physically move heads to the desired location on a platter, but they still have to do a lot of behind-the-scenes work when things happen at random (from the drive’s standpoint) versus just reading/writing a continuous stream of data.

For example writing a small amount of checksum data might actually involve reading a much larger “chunk” of existing data, adding the new data to it, erasing the original “chunk”, and replacing it with the new combined “chunk”.

This is because there is a minimum block size they can work with by design. If they make these blocks too small, they have to work with many blocks to handle large files, and this costs performance. Make them too big and you end up having to do more of these read/modify/erase/write cycles to avoid having partially filled blocks which is a waste of space.
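
A toy model of that read/modify/erase/write cycle (the block size and layout are illustrative):

```python
# Toy model: the drive can only erase/program whole blocks, so
# changing a few bytes rewrites the entire block containing them.
BLOCK_SIZE = 4096

def write_bytes(blocks, offset, data):
    """blocks is a list of bytearrays, BLOCK_SIZE each; data is
    assumed not to cross a block boundary in this sketch."""
    i = offset // BLOCK_SIZE
    chunk = bytearray(blocks[i])            # read the whole block
    start = offset % BLOCK_SIZE
    chunk[start:start + len(data)] = data   # modify the few bytes
    blocks[i] = chunk                       # erase + program the whole block
    return 1                                # blocks rewritten for this write
```

Writing 2 bytes costs an entire 4096-byte block rewrite, which is why many tiny scattered writes are so much more expensive than one long sequential one.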

The answer is already in the question itself.

Windows reads the metadata of a file. Metadata is a part of the file that stores information about it, like file size, creation date, access date, and lots of other information. You can see some of this information in the file’s Properties dialog in Windows.

Now, every time Windows needs to copy a file to another location, it needs to read the metadata. This metadata is roughly the same size for any file. If you have 1,000 small files, Windows needs to read the metadata for each file, and this takes a lot of time. Instead, if you have one big file, Windows reads the metadata just one time. Thereafter it just needs to copy pieces of data from one location to another, and this is faster.

- - - - - - -

Diego Vaccher

There are a lot of reasons!

  1. Location. Computers are very good at repeating one task very fast, so copying one file is very straightforward. Ask the OS to retrieve a file at location X; the OS sends a couple of CPU interrupts, and the CPU begins streaming bytes from disk to RAM, GPU, network, or wherever it's instructed to send the data. The bytes are also all in a line in the same location, so the CPU doesn't have to waste lots of time moving through the index. Instead it starts at the beginning and goes to the end. If it's a lot of files, however, one of the biggest time-consuming parts will be looking for the files on the disk. Sure, you provided the location, but the OS still has to find the disk and sector of each file, set up the CPU to read it, read it, and then, instead of being done, start the process all over again for the next file.
  2. Fragmentation. This isn't a super big problem nowadays because most people are beginning to use SSDs. However, if you use a hard drive, this can be a big problem. Big files can be written to a large unused portion of the disk, but as you delete files the disk can become "fragmented," with lots of unused little bits of space. Sometimes a large file might have some parts written to a fragmented area, meaning the physical head on the hard drive must move somewhere else to find the needed information. But in general a big file will have long uninterrupted sections of data it can read, meaning the data can be read very quickly without stopping. Small files, however, may be broken up to fill a lot of tiny unused spots, meaning the disk must stop reading, move, and start reading somewhere else a lot more, and this eats up more time.

I'm sure there are more reasons; these are just what I happened to think of. It also depends on where you are copying to, and how you are copying. Thank you for reading. Have a great day!
