Skip to content

Conversation

Frogging101
Copy link

@Frogging101 Frogging101 commented Mar 26, 2017

I remembered to:

  • Update or add unit tests if needed
  • Update or add documentation/comments if needed
  • Made sure stray files or whitespace didn't get committed
  • If significant changes, branch from develop and set to merge into develop
  • Read the guidelines for contributing

Changes:

  • Fix a problem where HTTP headers with values containing extra leading/trailing whitespace would cause wpull to calculate an incorrect payload hash.
  • Fix another problem that I noticed after adding the test case for the above issue, where CDX records would be missing status code and MIME type if the HTTP headers had multiple lines (i.e. anything more than the status line).

@chfoo chfoo added the todo label Aug 30, 2017
…espace

The payload offset was being obtained by taking len(response.to_bytes()),
but since leading/trailing whitespace is discarded from the response
class's name/value pairs, the length of the generated string would not
necessarily reflect the actual size of the received headers, leading to a
checksum being calculated from the wrong position in the file.

To prevent this, the WARC recorder will now independently figure out where
the headers really end in the file.
The regex used to find the end of the HTTP headers would not match if
there was a newline in between, so get_http_header would return nothing
and the status code and MIME type fields would be empty in the CDX record.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants