Description
Per WARC/1.0 spec section 5.9:
The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).
The entity-body is the HTTP body without transfer encoding per section 4.3 in RFC 2616. (In the newer RFC 723x family, it's called "payload body" instead and is defined in section 3.3 of RFC 7230.)
Just to be clear to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.
warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
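For concreteness, here is a minimal standalone sketch (not warcio's actual code; the example chunked body is made up) contrasting the digest of the raw chunked bytes, which is what the record block contains, with the digest of the de-chunked entity-body that section 5.9 calls for:

```python
import hashlib
from base64 import b32encode

raw_body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"  # body as sent, with chunked transfer encoding

def dechunk(data: bytes) -> bytes:
    """Decode a chunked transfer-encoded body (chunk extensions/trailers not handled)."""
    out, pos = b"", 0
    while True:
        end = data.index(b"\r\n", pos)
        size = int(data[pos:end], 16)
        if size == 0:
            return out
        out += data[end + 2:end + 2 + size]
        pos = end + 2 + size + 2  # skip chunk data plus its trailing CRLF

def sha1_b32(data: bytes) -> str:
    return "sha1:" + b32encode(hashlib.sha1(data).digest()).decode()

print(sha1_b32(raw_body))           # digest of the bytes as stored in the record block
print(sha1_b32(dechunk(raw_body)))  # digest of the entity-body, per the spec
```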
Activity
wumpus commented on Apr 30, 2019
Is this a bug in the standard instead of a bug in warcio and many other tools? Can you point out any common tool that actually obeys the standard?
There are a couple of examples where the standard was changed to match the tools, this might be a new one.
I see from your links that this has been discussed by the WARC standards folks, but it sure doesn't seem like the subset of tools that preserve transfer encoding have adopted that digest algorithm.
One thing this does inspire me to do is to add some code to "warcio check" and "warcio test" to check for and comment on this situation.
JustAnotherArchivist commented on Apr 30, 2019
From the discussion I linked, it seems like the intention was as written, not as implemented. So in that sense, I'd say it's not a bug in the standard.
I haven't investigated every tool; I just did some validation of WARCs created with wpull and warcio, both of which had this issue. From a glance over the relevant code, it seems like wget does remove chunking for the payload digest. However, it also looks like it doesn't handle compression correctly (with which wget has had issues before). Due to the lack of a suitable testing setup, I haven't been able to test that yet though.
wumpus commented on Apr 30, 2019
Well, if you'd like to look farther, the develop branch of warcio supports "warcio check", which can be used to also check digests -- albeit with warcio's current opinion that digests are computed on the bytes without removing transfer encoding. You can also install the branch from my "warcio test" pullreq (which is a work in progress) and give "warcio test" a spin!
It's my hope that "warcio test" will lead the community to interact about all of these conflicts between the standard and the tools people actually use.
ikreymer commented on Apr 30, 2019
Thanks for flagging this! Hm, I wonder if any tools actually do what the standard suggests in this regard? I think the content-encoding would also need to be decoded.
The change in warcio should be simple, I think just:
Of course, this would invalidate previous payload digests, so I'm not sure if it's the right thing to do, especially if other tools also behave the same way now.
Also, I wonder what the implications of this are. Presumably the problem would be for deduplication: if the same content is served with different transfer-encodings and/or content-encodings, it would unnecessarily be fetched again? Or chunks of different sizes used for the same content?
I wonder how likely/prevalent this is in practice.
I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content-encoding, while keeping it as is might affect some deduplication.
wumpus commented on Apr 30, 2019
These issues are exactly why I suspect the standard is what's going to change. Those "bad" digests are all over petabytes of response and revisit records, and the cdx indices ...
JustAnotherArchivist commented on Apr 30, 2019
@wumpus
Yes, I want to play around with this further. I need to set up a test HTTP server that can produce all the different relevant combinations (unchunked+uncompressed [which should be unproblematic in all implementations], chunked+uncompressed, unchunked+compressed, and chunked+compressed). httpbin unfortunately can't do it; it only has compression options for content encoding, not for transfer encoding, and it doesn't seem to be possible to combine it with chunking either. I guess I'll just write something that serves some hardcoded example data for each of these.
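For what it's worth, a rough sketch of that kind of hardcoded test server (the paths, port, and payload here are made up; chunking is written by hand so each of the four combinations can be exercised):

```python
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"Hello, WARC!" * 100

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        gzipped = "gzip" in self.path      # e.g. /gzip, /chunked-gzip
        chunked = "chunked" in self.path   # e.g. /chunked, /chunked-gzip
        body = gzip.compress(BODY) if gzipped else BODY
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        if gzipped:
            self.send_header("Content-Encoding", "gzip")
        if chunked:
            self.send_header("Transfer-Encoding", "chunked")
        else:
            self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if chunked:
            # split into two chunks so clients actually have framing to remove
            for part in (body[:len(body) // 2], body[len(body) // 2:]):
                self.wfile.write(b"%x\r\n%s\r\n" % (len(part), part))
            self.wfile.write(b"0\r\n\r\n")
        else:
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```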
@ikreymer
No, the entity-body in RFC 2616 is the (potentially content-encoded) body before applying transfer encoding (if any), so only the transfer encoding needs to be stripped; the content encoding needs to stay intact. At least that's what the standards say. Whether that actually makes sense, I'm not sure.
I guess the intention is indeed deduplication when the chunk boundaries or something in the compression changes. I've definitely seen varying chunks returned from web servers for identical content, which isn't too surprising since those chunks are often based on a buffer which may get flushed before being full depending on a variety of factors (current CPU/disk speed, concurrent activity, etc.). I have little experience with compression as a transfer encoding.
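As a small made-up illustration of that distinction: for a body that is gzipped via Content-Encoding and then chunked on the wire, the spec's payload digest covers the still-gzipped bytes, not the fully decoded content:

```python
import gzip
import hashlib

original = b"some page content"
entity_body = gzip.compress(original)  # Content-Encoding: gzip, applied "at rest"
# On the wire the server may additionally chunk entity_body (Transfer-Encoding:
# chunked); only that chunk framing is stripped before digesting.

print(hashlib.sha1(entity_body).hexdigest())  # what WARC-Payload-Digest should cover
print(hashlib.sha1(original).hexdigest())     # not this: the content encoding stays intact
```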
But I wonder now whether such deduplication is really a good idea. WARC is all about preserving HTTP requests and responses as accurately as possible. But if revisit records might not actually be identical duplicates, that's a loss of information, namely that the transfer encoding from that server is variable rather than constant. That's only a tiny bit of information and probably not at all useful on its own, but it still seems a bit odd.
I'm aware of the implications for existing archives. But unfortunately, those files do not conform to the standard as it's currently written, and I believe that's a significant issue that needs to be solved, whether that requires changes in the software or in the standard. Based on the discussion linked above (or rather, the fact that this exact issue was discussed fairly extensively before), I assumed that a change of the standard is not wanted/needed. But I realise now that this may very well just be an issue of the people working on the standard not knowing how the existing implementations work in this regard.
I'll continue my investigations soon into how other tools handle this, and file an issue on the WARC specification repo instead if it turns out that the majority of relevant tools don't adhere to the written standard, i.e. if the standard should likely be changed.
wumpus commented on Apr 30, 2019
I'm planning on writing a jumbo article on what "warcio test" finds, and this will end up being a part of that, if you don't beat me to it. I've got a repo with a pile of warcs from all of the major tools, etc etc.
ikreymer commented on Apr 30, 2019
Thanks for looking into this more deeply. At first glance, removing transfer-encoding but not content-encoding seems to make even less sense than keeping both or removing both.
I wanted to add that the IIPC hosts a weekly open source web archiving call on Tuesdays, currently alternating between 9am and 6pm EST. I think this would be a great conversation to bring up there as well for more immediate feedback. The next one is tomorrow, April 29 at 9am EST; the following week will be 6pm. I can also mention this issue on the IIPC slack channel if you'd like.
bnewbold commented on Apr 30, 2019
This came up for me recently as well; it broke my expectations about how CDX indexes worked, in that when I fetched files via wayback (without replay) and wrote out to disk, the on-disk blob's digest didn't match the digest in the corresponding CDX line. Our CDX build process simply copies the WARC content digest header as-is without verification or re-calculation, and the browser (and/or wayback earlier) strips any encoding before writing to disk.
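To reproduce that kind of comparison, here is a small sketch that hashes an on-disk blob the way CDX digests are commonly encoded (assuming the usual unprefixed base-32 SHA-1 digest column):

```python
import hashlib
import sys
from base64 import b32encode

def cdx_style_digest(path: str) -> str:
    """SHA-1 of a file, base-32 encoded like a typical CDX digest field."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return b32encode(h.digest()).decode()

if __name__ == "__main__":
    print(cdx_style_digest(sys.argv[1]))  # compare against the CDX line by hand
```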
Here's a URL which consistently induces the ambiguity: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf
A notable tool which does "follow the spec" is warcprox, used as part of brozzler, as of 2017 (internetarchive/warcprox@3a0f6e0947)
wumpus commented on Apr 30, 2019
That's an easy problem to see, because many HTTP clients will undo chunked encoding and decompress by default. So you can't depend on the checksum not changing. And even if the community agreed about transfer encoding and checksums, decompression is still going to change it.
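A quick illustration of that client behaviour, assuming the Python requests library (the URL is just a placeholder):

```python
import hashlib
import requests

r = requests.get("http://example.com/")  # placeholder URL
decoded = r.content  # chunk framing removed and gzip/deflate content-decoded

# To keep the content-encoded bytes, stream and read without automatic
# decompression (chunk framing is already gone at this layer regardless):
r2 = requests.get("http://example.com/", stream=True)
raw = r2.raw.read(decode_content=False)

print(hashlib.sha1(decoded).hexdigest())
print(hashlib.sha1(raw).hexdigest())
```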
JustAnotherArchivist commented on Apr 30, 2019
@wumpus Sounds great! I remember your request on the wpull repository from a few months ago. Great to hear that this project of yours is coming along nicely. :-)
I assume those WARCs all cover very different content, right? It would be very interesting to do a WARC-writing tool intercomparison where all tools (both current and common versions, I think, but not all historical releases) are run against the same sample server. Perhaps the small static server I hinted at could be used for this, so that it'd be "hey, you're maintaining a software that makes HTTP requests and writes to WARC, could you run it against this server and this list of URLs please?" instead of "oh, I need to hunt down an old version of this OS and that library to get this to run". I'll be looking into writing such a server soon.
@ikreymer At first glance, it seems silly, yes. But Transfer-Encoding is intended for encoding the data on-the-fly whereas Content-Encoding is at-rest compression. So the content-encoded data should never change between requests unless the source data has changed whereas the transfer encoding may vary freely. So I guess it kind-of makes sense.
This is the first time I'm hearing about that call, after working with WARCs daily for two years, although I must admit that I never searched for things like this either. Is it publicly accessible?
Please do spread it wherever it makes sense. We should also probably find a better place for this discussion than here, but I have no idea where that might be.
@bnewbold That example URL uses chunked encoding, so it makes perfect sense that this causes issues. Thanks for pointing out how warcprox behaves!
ikreymer commented on Apr 30, 2019
@JustAnotherArchivist Yes, I see your point, though I suppose both could just as easily be applied on-the-fly, e.g. with nginx's gzip on and chunked_transfer_encoding on options.
The IIPC call is open to everyone and there is more info in the IIPC slack at https://iipc.slack.com, #oh-sos channel (I can invite you if you'd like, PM your email).
@bnewbold Good to know that warcprox does follow the standard, was about to look at what it does.
So, I decided to run a few quick tests using different tools and got 3 different digests! However, they were consistent across multiple tries of the same tool.
I tried: 1) pywb, 2) warcio (this does preserve the chunked encoding in the WARC record), and 3) warcprox, which also preserves the chunked encoding.
At first glance, the chunking in 2) and 3) appeared to be the same, though the digests were different.
Trying each tool multiple times produces the same results, so that's good news.
And running warcio extract --payload on each produced exactly the same payloads as well, also good. But clearly a few are off, such as 1) stripping out the transfer-encoding altogether, while 2) and 3) should be matching. Also, warcio check appears to be getting confused by the chunked encoding, hitting the 'digest present but not checked' condition (@wumpus if you have a chance to take a look at that).
I've attached three files here. Will try to investigate further.
1-pywb.warc.gz
2-warcio.warc.gz
3-warcprox.warc.gz
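A small sketch for comparing what the three attachments actually record, assuming warcio's ArchiveIterator API:

```python
from warcio.archiveiterator import ArchiveIterator

for name in ("1-pywb.warc.gz", "2-warcio.warc.gz", "3-warcprox.warc.gz"):
    with open(name, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            print(name,
                  record.rec_headers.get_header("WARC-Payload-Digest"),
                  record.http_headers.get_header("Transfer-Encoding"))
```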
JustAnotherArchivist commented on Apr 30, 2019
Thanks, will contact you shortly.
1-pywb.warc.gz is arguably the worst because it preserves the Transfer-Encoding header while decoding the body. That means the data consumer will have to detect and work around that. It does mean that the payload digest is according to spec, though.
2-warcio.warc.gz has the expected, spec-violating "payload with transfer encoding" hash.
3-warcprox.warc.gz has the (arguably) correct "payload with transfer encoding removed" hash. But it writes the hash in hexadecimal form rather than the base-32 one which is common in other WARCing tools.
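The hexadecimal and base-32 spellings encode the same underlying SHA-1 bytes; converting between them is straightforward, shown here for an empty body:

```python
import hashlib
from base64 import b32encode

digest = hashlib.sha1(b"").digest()  # SHA-1 of an empty payload
print(digest.hex())                  # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(b32encode(digest).decode())    # 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```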
JustAnotherArchivist commented on Apr 30, 2019
Looks like 3-warcprox.warc.gz has an incorrect hash for the request record though: it is equal to the block hash when it should be sha1('') since there is no request body.
CorentinB commented on Jun 23, 2022
Hi all, apparently nobody cares much about this issue, but it's actually quite concerning, and I think that warcio shouldn't offer the possibility to check a WARC's validity if it is not capable of properly calculating the Payload-Digest. It makes warcio report false information, and a crawler writing proper WARC files would fail the check because of warcio not being able to read and write WARC files entirely as defined by the specs.
If nobody wants to fix that issue, it's fine, I don't personally use warcio and this is one of the reasons why, but at least warcio check shouldn't be a thing. https://github.com/webrecorder/warcio#check
ikreymer commented on Jun 23, 2022
It's not that 'nobody cares'; this was an ambiguity in the standard, and we needed to be careful about making a backwards-incompatible change like this. Unfortunately, there hasn't been much discussion of this elsewhere for a while.
But yes, we should figure out a solution one way or another. Probably the safest thing to do would be to add a command line switch to alter the behavior, rather than changing the default behavior. We can take another look.
CorentinB commented on Jun 24, 2022
I understand that it is a mistake that is fairly easy to make because it's not clearly stated in the specs, but it's not ambiguous, because entity-body is clearly defined in RFC 2616.
I think the priority is to stop giving the possibility to verify WARCs with warcio until this is fixed. And I definitely think the default behavior should be fixed; the "good way" shouldn't be optional, it should be the default and the only way to do it.
If I understand all the messages above, it sounds like people are expecting the standard to change to match what tools like warcio are doing. That is not happening right now and it won't happen tomorrow. We can't know if it will happen one day, but in the meantime I think warcio should be fixed and you should stop offering warcio check until this is fixed.
I also understand that fixing warcio isn't easy. I really do. That's why I put an emphasis on removing the check capability until you figure out how to fix it.
nclarkekb commented on Apr 12, 2023
JWAT and https://github.com/nlnwa/warchaeology also mark these digests as incorrect.
I thought the purpose of these digests was to validate the content as it is in the payload.
It makes no sense at all to have a digest that needs extra computation in order to validate the payload.
It just adds another useless layer with more possibilities for errors in writers and validators.
This is the first time since writing JWAT that I've been made aware of this digest issue.
So, as pointed out elsewhere, there are massive amounts of existing archives with exact payload digests that could be affected by changing how digests are calculated.
Arkiver2 commented on Jun 23, 2023
Wget has this issue as well. At Archive Team we use the modified version of Wget at https://github.com/ArchiveTeam/wget-lua. This is used to archive petabytes of data.
This modified Wget is now updated (ArchiveTeam/wget-lua#13) and, as of version 1.21.3-at.20230623.01, calculates the correct WARC-Payload-Digest value - that is, with transfer encoding stripped for the calculation of the digest, while preserving the transfer encoding in the WARC record block.
Over the next few days this will be rolled out to all Archive Team distributed Warrior projects.
ikreymer commented on Jun 24, 2023
Coming back to this after a few years, I think the best solution is to strip out the Transfer-Encoding altogether when writing the WARCs, so that the data is written with the Transfer-Encoding removed, the header replaced with something like X-Removed-Transfer-Removed (we might have discussed that in another issue), and the payload digest matching the data with no Transfer-Encoding.
After all these years, I haven't really seen a need to access the transfer-encoded data, so I agree with @nclarkekb that it just creates more work for anyone using the data.
WARCs created via other capture methods, such as browser extensions / warcio.js, also have the encoding already removed.
The idea that the WARC data represents raw network traffic has not been true since the advent of TLS and HTTP/2 and HTTP/3, so we might as well represent the 'canonical' text-based HTTP response, with no transfer encoding applied and no transfer-encoding header, so that the data makes sense.
JustAnotherArchivist commented on Jun 24, 2023
I also remember that (brief) discussion, but don't know where it happened. I thought it was in iipc/warc-specifications but couldn't find it there. It's certainly a discussion worth having, probably over there.
However, any tools that currently strip the transfer encoding from the record body, transform the HTTP headers, or write HTTP/2 or HTTP/3 data to WARC are undeniably violating the standard. This is not ambiguous in the WARC spec: only HTTP/1.1 with the exact HTTP-layer-level data as sent/received is compatible with WARC versions 1.0 and 1.1. And since this is a format intended for long-term archival, spec compliance is crucial in my opinion.
(Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.)
Arkiver2 commented on Jun 24, 2023
@ikreymer I very strongly disagree with this idea, and I agree with what @JustAnotherArchivist noted.
If the WARCs captured with "browser extensions / warcio.js" already have the encoding removed, then these implementations are in violation of the standard and do not create proper WARCs. Whether this is an issue is up to the individual using the tool, but they should be aware of it, if the issue is known.
Previously (#74 (comment)) the argument was made that
If the suggestion is to strip the Transfer-Encoding entirely, store this in the WARC, and calculate the payload digest on the stripped content anyway, then what is stopping us from just calculating the payload digest on the stripped body, while writing the raw body (including Transfer-Encoding) to the WARC?
The WARC standard notes clearly what should be stored, and what the digests should be calculated over. For many implementations the standard was not accurately read/analysed. I would say it would be best if implementations are fixed whenever possible and simply start calculating the correct digests moving forward.
We should agree on a solution for this in https://github.com/iipc/warc-specifications, and move forward with that.
Yes, I agree with this. When people use tools that write WARCs, they will likely assume that these WARCs are entirely following the WARC standard. Some tools do not. Sometimes that is known, sometimes it is yet to be discovered. I think it would be good to make clear to people who use WARC tools in what ways these tools do not follow the WARC standard, and possibly why they do not follow the WARC standard. They can then decide for themselves whether that is a problem, possibly with recommendations from the IIPC on the long term aspect.
wumpus commented on Jun 27, 2023
Weird that after 4 years of discussion, no one has yet looked at IA's and CC's WARCs to see what they do with this issue.
kngenie commented on Jun 28, 2023
Just putting my 2c, as a developer / maintainer of several different versions of crawlers and wayback machine at Internet Archive...
Content-Encoding: gzip or not), but makes sense to me because that's how most basic HTTP clients would work (they never fail to decode Transfer-Encoding, but may not handle Content-Encoding)
CorentinB commented on Apr 4, 2024
Hey, so we are now more than a year since the last message, and almost 5 years since this issue was opened. The damage done has been huge, so I wonder: still no plan for fixing this library? Thanks.