Author | Commit | Message | Date
Eric van der Vlist | 5e2b674092 | Store the crawl log into the archive | 2012-05-04 19:49:41 +02:00
Eric van der Vlist | 9bce34f7c6 | Rewriting links in HTML and CSS resources within WARC archives | 2012-04-27 18:29:15 +02:00
Eric van der Vlist | 5b162a64df | WARC mail extract loop | 2012-04-27 17:34:18 +02:00
Eric van der Vlist | 466d4473ce | Generating a resource index to facilitate further processing. | 2012-04-27 17:04:17 +02:00
Eric van der Vlist | 675ed04aba | Download and convert the crawl log | 2012-04-26 17:08:28 +02:00
Eric van der Vlist | be1a361ab9 | Implementing yet another WARC parser (the Heritrix one didn't work well with Orbeon due to HTTP client library conflicts). | 2012-04-26 09:48:43 +02:00
Eric van der Vlist | 22c3028c38 | First stab at WARC packaging. | 2012-04-23 11:26:59 +02:00