Description
Per WARC/1.0 spec section 5.9:
The payload of an application/http block is its ‘entity-body’ (per [RFC2616]).
The entity-body is the HTTP body without transfer encoding per section 4.3 in RFC 2616. (In the newer RFC 723x family, it's called "payload body" instead and is defined in section 3.3 of RFC 7230.)
Just to be clear to avoid confusion: this is the definition of the payload; the WARC record should still contain the exact response sent by the server with transfer encoding intact. But when calculating the WARC-Payload-Digest, the transfer encoding must be stripped.
warcio (like many other tools) passes the response data directly into the payload digester without removing transfer encoding. This means that it produces an invalid WARC-Payload-Digest when the HTTP body is transfer-encoded.
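For concreteness, here is a minimal standalone sketch (not warcio's actual code; the example chunked body is made up) contrasting the digest of the raw chunked bytes, which is what the record block contains, with the digest of the de-chunked entity-body that section 5.9 calls for:

```python
import hashlib
from base64 import b32encode

raw_body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"  # body as sent, with chunked transfer encoding

def dechunk(data: bytes) -> bytes:
    """Decode a chunked transfer-encoded body (chunk extensions/trailers not handled)."""
    out, pos = b"", 0
    while True:
        end = data.index(b"\r\n", pos)
        size = int(data[pos:end], 16)
        if size == 0:
            return out
        out += data[end + 2:end + 2 + size]
        pos = end + 2 + size + 2  # skip chunk data plus its trailing CRLF

def sha1_b32(data: bytes) -> str:
    return "sha1:" + b32encode(hashlib.sha1(data).digest()).decode()

print(sha1_b32(raw_body))           # digest of the bytes as stored in the record block
print(sha1_b32(dechunk(raw_body)))  # digest of the entity-body, per the spec
```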
Activity
wumpus commented on Apr 30, 2019
Is this a bug in the standard instead of a bug in warcio and many other tools? Can you point out any common tool that actually obeys the standard?
There are a couple of examples where the standard was changed to match the tools, this might be a new one.
I see from your links that this has been discussed by the WARC standards folks, but it sure doesn't seem like the subset of tools that preserve transfer encoding have adopted that digest algorithm.
One thing this does inspire me to do is to add some code to "warcio check" and "warcio test" to check for and comment on this situation.
JustAnotherArchivist commented on Apr 30, 2019
From the discussion I linked, it seems like the intention was as written, not as implemented. So in that sense, I'd say it's not a bug in the standard.
I haven't investigated every tool; I just did some validation of WARCs created with wpull and warcio, both of which had this issue. From a glance over the relevant code, it seems like wget does remove chunking for the payload digest. However, it also looks like it doesn't handle compression correctly (with which wget has had issues before). Due to the lack of a suitable testing setup, I haven't been able to test that yet though.
wumpus commented on Apr 30, 2019
Well, if you'd like to look farther, the develop branch of warcio supports "warcio check", which can be used to also check digests -- albeit with warcio's current opinion that digests are computed on the bytes without removing transfer encoding. You can also install the branch from my "warcio test" pullreq (which is a work in progress) and give "warcio test" a spin!
It's my hope that "warcio test" will lead the community to interact about all of these conflicts between the standard and the tools people actually use.
ikreymer commented on Apr 30, 2019
Thanks for flagging this! Hm, I wonder if any tools actually do what the standard suggests in this regard? I think the content-encoding would also need to be decoded.
The change in warcio should be simple, I think just:
Of course, this would invalidate previous payload digests, so I'm not sure if it's the right thing to do, especially if other tools also behave the same way now.
Also, I wonder what the implications of this are. Presumably the problem would be for deduplication: if the same content is served with different transfer-encodings and/or content-encodings, it would unnecessarily be fetched again? Or chunks of different sizes used for the same content?
I wonder how likely/prevalent this is in practice.
I ask this because making such a change will make the payload digests mismatch and affect deduplication with any existing records that have transfer- and content-encoding, while keeping it as is might affect some deduplication.
wumpus commented on Apr 30, 2019
These issues are exactly why I suspect the standard is what's going to change. Those "bad" digests are all over petabytes of response and revisit records, and the cdx indices ...
JustAnotherArchivist commented on Apr 30, 2019
@wumpus
Yes, I want to play around with this further. I need to set up a test HTTP server that can produce all the different relevant combinations (unchunked+uncompressed [which should be unproblematic in all implementations], chunked+uncompressed, unchunked+compressed, and chunked+compressed). httpbin unfortunately can't do it; it only has compression options for content encoding, not for transfer encoding, and it doesn't seem to be possible to combine it with chunking either. I guess I'll just write something that serves some hardcoded example data for each of these.
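For what it's worth, a rough sketch of that kind of hardcoded test server (the paths, port, and payload here are made up; chunking is written by hand so each of the four combinations can be exercised):

```python
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"Hello, WARC!" * 100

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        gzipped = "gzip" in self.path      # e.g. /gzip, /chunked-gzip
        chunked = "chunked" in self.path   # e.g. /chunked, /chunked-gzip
        body = gzip.compress(BODY) if gzipped else BODY
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        if gzipped:
            self.send_header("Content-Encoding", "gzip")
        if chunked:
            self.send_header("Transfer-Encoding", "chunked")
        else:
            self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if chunked:
            # split into two chunks so clients actually have framing to remove
            for part in (body[:len(body) // 2], body[len(body) // 2:]):
                self.wfile.write(b"%x\r\n%s\r\n" % (len(part), part))
            self.wfile.write(b"0\r\n\r\n")
        else:
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```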
@ikreymer
No, the entity-body in RFC 2616 is the (potentially content-encoded) body before applying transfer encoding (if any), so only the transfer encoding needs to be stripped; the content encoding needs to stay intact. At least that's what the standards say. Whether that actually makes sense, I'm not sure.
I guess the intention is indeed deduplication when the chunk boundaries or something in the compression changes. I've definitely seen varying chunks returned from web servers for identical content, which isn't too surprising since those chunks are often based on a buffer which may get flushed before being full depending on a variety of factors (current CPU/disk speed, concurrent activity, etc.). I have little experience with compression as a transfer encoding.
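As a small made-up illustration of that distinction: for a body that is gzipped via Content-Encoding and then chunked on the wire, the spec's payload digest covers the still-gzipped bytes, not the fully decoded content:

```python
import gzip
import hashlib

original = b"some page content"
entity_body = gzip.compress(original)  # Content-Encoding: gzip, applied "at rest"
# On the wire the server may additionally chunk entity_body (Transfer-Encoding:
# chunked); only that chunk framing is stripped before digesting.

print(hashlib.sha1(entity_body).hexdigest())  # what WARC-Payload-Digest should cover
print(hashlib.sha1(original).hexdigest())     # not this: the content encoding stays intact
```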
But I wonder now whether such deduplication is really a good idea. WARC is all about preserving HTTP requests and responses as accurately as possible. But if revisit records might not actually be identical duplicates, that's a loss of information, namely that the transfer encoding from that server is variable rather than constant. That's only a tiny bit of information and probably not at all useful on its own, but it still seems a bit odd.
I'm aware of the implications for existing archives. But unfortunately, those files do not conform to the standard as it's currently written, and I believe that's a significant issue that needs to be solved, whether that requires changes in the software or in the standard. Based on the discussion linked above (or rather, the fact that this exact issue was discussed fairly extensively before), I assumed that a change of the standard is not wanted/needed. But I realise now that this may very well just be an issue of the people working on the standard not knowing how the existing implementations work in this regard.
I'll continue my investigations soon into how other tools handle this, and file an issue on the WARC specification repo instead if it turns out that the majority of relevant tools don't adhere to the written standard, i.e. if the standard should likely be changed.
wumpus commented on Apr 30, 2019
I'm planning on writing a jumbo article on what "warcio test" finds, and this will end up being a part of that, if you don't beat me to it. I've got a repo with a pile of warcs from all of the major tools, etc etc.
ikreymer commented on Apr 30, 2019
Thanks for looking into this more deeply. At first glance, removing transfer-encoding but not content-encoding seems to make even less sense than keeping both or removing both.
I wanted to add that the IIPC hosts a weekly open source web archiving call on Tuesdays, currently alternating between 9am and 6pm EST. I think this would be a great conversation to bring up there as well for more immediate feedback. The next one is tomorrow, April 29 at 9am EST; the following week will be 6pm. I can also mention this issue on the IIPC slack channel if you'd like.
bnewbold commented on Apr 30, 2019
This came up for me recently as well; it broke my expectations about how CDX indexes worked, in that when I fetched files via wayback (without replay) and wrote out to disk, the on-disk blob's digest didn't match the digest in the corresponding CDX line. Our CDX build process simply copies the WARC content digest header as-is without verification or re-calculation, and the browser (and/or wayback earlier) strips any encoding before writing to disk.
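To reproduce that kind of comparison, here is a small sketch that hashes an on-disk blob the way CDX digests are commonly encoded (assuming the usual unprefixed base-32 SHA-1 digest column):

```python
import hashlib
import sys
from base64 import b32encode

def cdx_style_digest(path: str) -> str:
    """SHA-1 of a file, base-32 encoded like a typical CDX digest field."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return b32encode(h.digest()).decode()

if __name__ == "__main__":
    print(cdx_style_digest(sys.argv[1]))  # compare against the CDX line by hand
```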
Here's a URL which consistently induces the ambiguity: http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2012/Fischer12-BAF.pdf
A notable tool which does "follow the spec" is warcprox, used as part of brozzler, as of 2017 (internetarchive/warcprox@3a0f6e0947)
wumpus commented on Apr 30, 2019
That's an easy problem to see, because many HTTP clients will undo chunked encoding and decompress by default. So you can't depend on the checksum not changing. And even if the community agreed about transfer encoding and checksums, decompression is still going to change it.
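A quick illustration of that client behaviour, assuming the Python requests library (the URL is just a placeholder):

```python
import hashlib
import requests

r = requests.get("http://example.com/")  # placeholder URL
decoded = r.content  # chunk framing removed and gzip/deflate content-decoded

# To keep the content-encoded bytes, stream and read without automatic
# decompression (chunk framing is already gone at this layer regardless):
r2 = requests.get("http://example.com/", stream=True)
raw = r2.raw.read(decode_content=False)

print(hashlib.sha1(decoded).hexdigest())
print(hashlib.sha1(raw).hexdigest())
```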
JustAnotherArchivist commented on Apr 30, 2019
@wumpus Sounds great! I remember your request on the wpull repository from a few months ago. Great to hear that this project of yours is coming along nicely. :-)
I assume those WARCs all cover very different content, right? It would be very interesting to do a WARC-writing tool intercomparison where all tools (both current and common versions, I think, but not all historical releases) are run against the same sample server. Perhaps the small static server I hinted at could be used for this, so that it'd be "hey, you're maintaining a software that makes HTTP requests and writes to WARC, could you run it against this server and this list of URLs please?" instead of "oh, I need to hunt down an old version of this OS and that library to get this to run". I'll be looking into writing such a server soon.
@ikreymer At first glance, it seems silly, yes. But Transfer-Encoding is intended for encoding the data on-the-fly whereas Content-Encoding is at-rest compression. So the content-encoded data should never change between requests unless the source data has changed whereas the transfer encoding may vary freely. So I guess it kind-of makes sense.
This is the first time I'm hearing about that call, after working with WARCs daily for two years, although I must admit that I never searched for things like this either. Is it publicly accessible?
Please do spread it wherever it makes sense. We should also probably find a better place for this discussion than here, but I have no idea where that might be.
@bnewbold That example URL uses chunked encoding, so it makes perfect sense that this causes issues. Thanks for pointing out how warcprox behaves!
ikreymer commented on Apr 30, 2019
@JustAnotherArchivist Yes, I see your point, though I suppose both could just as easily be applied on-the-fly, e.g. with nginx's gzip on and chunked_transfer_encoding on options.
The IIPC call is open to everyone and there is more info in the IIPC slack at https://iipc.slack.com, #oh-sos channel (I can invite you if you'd like, PM your email).
@bnewbold Good to know that warcprox does follow the standard, was about to look at what it does.
So, I decided to run a few quick tests using different tools and got 3 different digests! However, they were consistent across multiple tries of the same tool.
I tried: 1) pywb, 2) warcio (this does preserve the chunked encoding in the WARC record), and 3) warcprox, which also preserves the chunked encoding.
At first glance, the chunking in 2) and 3) appeared to be the same, though the digests were different.
Trying each tool multiple times produces the same results, so that's good news.
And running warcio extract --payload on each produced exactly the same payloads as well, also good. But clearly a few are off, such as 1) stripping out the transfer-encoding altogether, while 2) and 3) should be matching. Also, warcio check appears to be getting confused by the chunked encoding, hitting the 'digest present but not checked' condition (@wumpus if you have a chance to take a look at that).
I've attached three files here. Will try to investigate further.
1-pywb.warc.gz
2-warcio.warc.gz
3-warcprox.warc.gz
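A small sketch for comparing what the three attachments actually record, assuming warcio's ArchiveIterator API:

```python
from warcio.archiveiterator import ArchiveIterator

for name in ("1-pywb.warc.gz", "2-warcio.warc.gz", "3-warcprox.warc.gz"):
    with open(name, "rb") as f:
        for record in ArchiveIterator(f):
            if record.rec_type != "response":
                continue
            print(name,
                  record.rec_headers.get_header("WARC-Payload-Digest"),
                  record.http_headers.get_header("Transfer-Encoding"))
```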
JustAnotherArchivist commented on Apr 30, 2019
Thanks, will contact you shortly.
1-pywb.warc.gz is arguably the worst because it preserves the Transfer-Encoding header while decoding the body. That means the data consumer will have to detect and work around that. It does mean that the payload digest is according to spec, though.
2-warcio.warc.gz has the expected, spec-violating "payload with transfer encoding" hash.
3-warcprox.warc.gz has the (arguably) correct "payload with transfer encoding removed" hash. But it writes the hash in hexadecimal form rather than the base-32 one which is common in other WARCing tools.
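The hexadecimal and base-32 spellings encode the same underlying SHA-1 bytes; converting between them is straightforward, shown here for an empty body:

```python
import hashlib
from base64 import b32encode

digest = hashlib.sha1(b"").digest()  # SHA-1 of an empty payload
print(digest.hex())                  # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(b32encode(digest).decode())    # 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
```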
JustAnotherArchivist commented on Apr 30, 2019
Looks like 3-warcprox.warc.gz has an incorrect hash for the request record though: it is equal to the block hash when it should be sha1('') since there is no request body.
CorentinB commented on Jun 23, 2022
Hi all, apparently nobody cares much about this issue, but it's actually quite concerning, and I think that warcio shouldn't offer the possibility to check a WARC's validity if it is not capable of properly calculating the Payload-Digest. It makes warcio report false information, and a crawler writing proper WARC files would fail the check because of warcio not being able to read and write WARC files entirely as defined by the specs.
If nobody wants to fix that issue, it's fine, I don't personally use warcio and this is one of the reasons why, but at least warcio check shouldn't be a thing. https://github.com/webrecorder/warcio#check
ikreymer commented on Jun 23, 2022
It's not that 'nobody cares'; this was an ambiguity in the standard, and we needed to be careful about making a backwards-incompatible change like this. Unfortunately, there hasn't been much discussion of this elsewhere for a while.
But yes, we should figure out a solution one way or another. Probably the safest thing to do would be to add a command line switch to alter the behavior, rather than changing the default behavior. We can take another look.
CorentinB commented on Jun 24, 2022
I understand that it is a mistake that is fairly easy to make because it's not clearly stated in the specs, but it's not ambiguous, because entity-body is clearly defined in RFC 2616.
I think the priority is to stop giving the possibility to verify WARCs with warcio until this is fixed. And I definitely think the default behavior should be fixed; the "good way" shouldn't be optional, it should be the default and the only way to do it.
If I understand all the messages above, it sounds like people are expecting the standard to change to match what tools like warcio are doing. That is not happening right now and it won't happen tomorrow. We can't know if it will happen one day, but in the meantime I think warcio should be fixed and you should stop offering warcio check until this is fixed.
I also understand that fixing warcio isn't easy. I really do. That's why I put an emphasis on removing the check capability until you figure out how to fix it.
nclarkekb commented on Apr 12, 2023
JWAT and https://github.com/nlnwa/warchaeology also mark these digests as incorrect.
I thought the purpose of these digests was to validate the content as it is in the payload.
It makes no sense at all to have a digest that needs extra computation in order to validate the payload.
It just adds another useless layer with more possibilities for errors in writers and validators.
This is the first time since writing JWAT that I've been made aware of this digest issue.
So, as pointed out elsewhere, there are massive amounts of existing archives with exact payload digests that could be affected by changing how digests are calculated.
Arkiver2 commented on Jun 23, 2023
Wget has this issue as well. At Archive Team we use the modified version of Wget at https://github.com/ArchiveTeam/wget-lua. This is used to archive petabytes of data.
This modified Wget is now updated (ArchiveTeam/wget-lua#13) and, as of version 1.21.3-at.20230623.01, calculates the correct WARC-Payload-Digest value - that is, with transfer encoding stripped for the calculation of the digest, while preserving the transfer encoding in the WARC record block.
Over the next few days this will be rolled out to all Archive Team distributed Warrior projects.
ikreymer commented on Jun 24, 2023
Coming back to this after a few years, I think the best solution is to strip out the Transfer-Encoding altogether when writing the WARCs, so that the data is written with the Transfer-Encoding removed, the header replaced with something like X-Removed-Transfer-Removed (we might have discussed that in another issue), and the payload digest matching the data with no Transfer-Encoding.
After all these years, I haven't really seen a need to access the transfer-encoded data, so I agree with @nclarkekb that it just creates more work for anyone using the data.
WARCs created via other capture methods, such as browser extensions / warcio.js, also have the encoding already removed.
The idea that the WARC data represents raw network traffic has not been true since the advent of TLS and HTTP/2 and HTTP/3, so we might as well represent the 'canonical' text-based HTTP response, with no transfer encoding applied and no transfer-encoding header, so that the data makes sense.
JustAnotherArchivist commented on Jun 24, 2023
I also remember that (brief) discussion, but don't know where it happened. I thought it was in iipc/warc-specifications but couldn't find it there. It's certainly a discussion worth having, probably over there.
However, any tools that currently strip the transfer encoding from the record body, transform the HTTP headers, or write HTTP/2 or HTTP/3 data to WARC are undeniably violating the standard. This is not ambiguous in the WARC spec: only HTTP/1.1 with the exact HTTP-layer-level data as sent/received is compatible with WARC versions 1.0 and 1.1. And since this is a format intended for long-term archival, spec compliance is crucial in my opinion.
(Side note: perhaps the IIPC repo should also contain a list of known WARC spec violations found in the wild. It would fit in with the implementation recommendations and the annotated spec, I think. Apart from this issue right here, the angle brackets misunderstanding would also belong there.)
Arkiver2 commented on Jun 24, 2023
@ikreymer I very strongly disagree with this idea, and I agree with what @JustAnotherArchivist noted.
If the WARCs captured with "browser extensions / warcio.js" already have the encoding removed, then these implementations are in violation of the standard and do not create proper WARCs. Whether this is an issue is up to the individual using the tool, but they should be aware of it, if the issue is known.
Previously (#74 (comment)) the argument was made that
If the suggestion is to strip the Transfer-Encoding entirely, store this in the WARC, and calculate the payload digest on the stripped content anyway, then what is stopping us from just calculating the payload digest on the stripped body, while writing the raw body (including Transfer-Encoding) to the WARC?
The WARC standard notes clearly what should be stored, and what the digests should be calculated over. For many implementations the standard was not accurately read/analysed. I would say it would be best if implementations are fixed whenever possible and simply start calculating the correct digests moving forward.
We should agree on a solution for this in https://github.com/iipc/warc-specifications, and move forward with that.
Yes, I agree with this. When people use tools that write WARCs, they will likely assume that these WARCs are entirely following the WARC standard. Some tools do not. Sometimes that is known, sometimes it is yet to be discovered. I think it would be good to make clear to people who use WARC tools in what ways these tools do not follow the WARC standard, and possibly why they do not follow the WARC standard. They can then decide for themselves whether that is a problem, possibly with recommendations from the IIPC on the long term aspect.
wumpus commented on Jun 27, 2023
Weird that after 4 years of discussion, no one has yet looked at IA's and CC's WARCs to see what they do with this issue.
kngenie commented on Jun 28, 2023
Just putting my 2c, as a developer / maintainer of several different versions of crawlers and wayback machine at Internet Archive...
Content-Encoding: gzip or not), but makes sense to me because that's how most basic HTTP clients would work (they never fail to decode Transfer-Encoding, but may not handle Content-Encoding)
CorentinB commented on Apr 4, 2024
Hey, so we are now more than a year since the last message, and almost 5 years since this issue was opened. The damage done has been huge, so I wonder: still no plan for fixing this library? Thanks.