Eric van der Vlist
|
bf2980567a
|
Cleaning the algorithm to compute friendly local names.
|
2012-04-28 18:36:16 +02:00 |
Eric van der Vlist
|
a7c3525ef6
|
Hmmm... HTML should be serialized as HTML, of course!
|
2012-04-28 16:52:28 +02:00 |
Eric van der Vlist
|
c79bd8e49c
|
Forcing HTML content type for XHTML documents
|
2012-04-28 09:42:21 +02:00 |
Eric van der Vlist
|
9bce34f7c6
|
Rewriting links in HTML and CSS resources within WARC archives
|
2012-04-27 18:29:15 +02:00 |
Eric van der Vlist
|
5b162a64df
|
WARC mail extract loop
|
2012-04-27 17:34:18 +02:00 |
Eric van der Vlist
|
466d4473ce
|
Generating a resource index to facilitate further processing.
|
2012-04-27 17:04:17 +02:00 |
Eric van der Vlist
|
675ed04aba
|
Download and convert the crawl log
|
2012-04-26 17:08:28 +02:00 |
Eric van der Vlist
|
be1a361ab9
|
Implementing yet another WARC parser (the Heritrix one didn't work well with Orbeon due to HTTP client library conflicts).
|
2012-04-26 09:48:43 +02:00 |
Eric van der Vlist
|
307b6d2a72
|
Adding whois records
|
2012-04-23 12:11:17 +02:00 |
Eric van der Vlist
|
22c3028c38
|
First stab at WARC packaging.
|
2012-04-23 11:26:59 +02:00 |
Eric van der Vlist
|
51c2058aa6
|
Queue an action to package the Heritrix WARC.
|
2012-04-23 11:09:36 +02:00 |
Eric van der Vlist
|
b346236789
|
Adding a mechanism to delay actions in the queue.
|
2012-04-22 18:56:15 +02:00 |
Eric van der Vlist
|
3bcb813cb7
|
Unpause Heritrix job.
|
2012-04-22 17:59:39 +02:00 |
Eric van der Vlist
|
f25a9246bc
|
Modifying the way the Heritrix (Spring) config file is generated, since it seems to be picky about whitespace and indentation...
|
2012-04-22 16:27:16 +02:00 |
Eric van der Vlist
|
a3fa073667
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-22 08:44:12 +02:00 |
Eric van der Vlist
|
a1dc635607
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-22 00:01:51 +02:00 |
Eric van der Vlist
|
57daa703da
|
Now building and launching Heritrix jobs...
|
2012-04-21 23:42:16 +02:00 |
Eric van der Vlist
|
be2f974a4c
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-21 22:51:58 +02:00 |
Eric van der Vlist
|
c4c4108025
|
Starting to write pipeline actions that interact with a Heritrix server
|
2012-04-20 20:39:00 +02:00 |
Eric van der Vlist
|
ad35672603
|
Still a work in progress, but the WARC archive now validates with warc-tools' warcvalid.py...
|
2012-04-15 00:12:29 +02:00 |
Eric van der Vlist
|
ba51ddfb0b
|
Starting to support content lengths in WARC archives
|
2012-04-14 22:32:33 +02:00 |
Eric van der Vlist
|
9d99928c60
|
Removing the last action from the queue
|
2012-04-13 19:17:20 +02:00 |
Eric van der Vlist
|
01a66903f3
|
First version that can produce a packaged archive.
|
2012-04-13 19:08:04 +02:00 |
Eric van der Vlist
|
5ac9ea90bb
|
Packaging resources that have not been rewritten...
|
2012-04-13 18:42:32 +02:00 |
Eric van der Vlist
|
0e7bdd1de4
|
Adding a basic skeleton to generate what should ultimately be a WARC archive
|
2012-04-13 18:01:53 +02:00 |
Eric van der Vlist
|
3d18e9d8a4
|
Adding a mechanism to avoid archiving the same resource multiple times within a single archive set.
|
2012-04-13 13:05:25 +02:00 |
Eric van der Vlist
|
cf97a98416
|
First version supporting CSS rewriting
|
2012-04-13 12:27:04 +02:00 |
Eric van der Vlist
|
750ccaac7c
|
Dummy (passthrough) implementation of the CSS support...
|
2012-04-13 11:58:38 +02:00 |
Eric van der Vlist
|
16cc943d48
|
Refactoring before supporting CSS
|
2012-04-13 11:16:40 +02:00 |
Eric van der Vlist
|
11027c068a
|
Moving action pipelines into their own directory
|
2012-04-13 10:53:25 +02:00 |