Eric van der Vlist
|
b9c833fd17
|
Removing intermediary directories
|
2020-05-01 12:09:23 +02:00 |
Eric van der Vlist
|
f907af85c7
|
Fixing #9
|
2014-01-11 22:37:00 +01:00 |
Eric van der Vlist
|
5acb10101f
|
Rewriting resources with no archived out links
|
2012-05-09 19:38:21 +02:00 |
Eric van der Vlist
|
4473ad6e15
|
Support HTML @background
|
2012-05-04 19:57:24 +02:00 |
Eric van der Vlist
|
94d335170f
|
Map application/xhtml+xml to .html
|
2012-05-04 19:52:56 +02:00 |
Eric van der Vlist
|
5e2b674092
|
Store the crawl log in the archive
|
2012-05-04 19:49:41 +02:00 |
Eric van der Vlist
|
c25b18f9f5
|
Support HTML embed/@src
|
2012-05-04 19:43:20 +02:00 |
Eric van der Vlist
|
16ef7979b0
|
Trying to guess content types
|
2012-04-28 23:12:20 +02:00 |
Eric van der Vlist
|
bc581fabf9
|
Adapting relative links to match the structure of the browsable archive
|
2012-04-28 22:29:43 +02:00 |
Eric van der Vlist
|
bf2980567a
|
Cleaning the algorithm to compute friendly local names.
|
2012-04-28 18:36:16 +02:00 |
Eric van der Vlist
|
cfaf8ae9c2
|
Adding XSLTUnit tests for the local-name function.
|
2012-04-28 17:29:52 +02:00 |
Eric van der Vlist
|
a7c3525ef6
|
Hmmm... HTML should be serialized as HTML, of course!
|
2012-04-28 16:52:28 +02:00 |
Eric van der Vlist
|
c79bd8e49c
|
Forcing HTML content type for XHTML documents
|
2012-04-28 09:42:21 +02:00 |
Eric van der Vlist
|
9bce34f7c6
|
Rewriting links in HTML and CSS resources within WARC archives
|
2012-04-27 18:29:15 +02:00 |
Eric van der Vlist
|
5b162a64df
|
WARC main extract loop
|
2012-04-27 17:34:18 +02:00 |
Eric van der Vlist
|
466d4473ce
|
Generating a resource index to facilitate further processing.
|
2012-04-27 17:04:17 +02:00 |
Eric van der Vlist
|
675ed04aba
|
Download and convert the crawl log
|
2012-04-26 17:08:28 +02:00 |
Eric van der Vlist
|
6f64c7f8a9
|
Handling payload content types
|
2012-04-26 14:13:24 +02:00 |
Eric van der Vlist
|
be1a361ab9
|
Implementing yet another WARC parser (the heritrix one didn't work well with Orbeon due to http client library conflicts).
|
2012-04-26 09:48:43 +02:00 |
Eric van der Vlist
|
307b6d2a72
|
Adding whois records
|
2012-04-23 12:11:17 +02:00 |
Eric van der Vlist
|
22c3028c38
|
First stab at WARC packaging.
|
2012-04-23 11:26:59 +02:00 |
Eric van der Vlist
|
51c2058aa6
|
Queue an action to package the Heritrix WARC.
|
2012-04-23 11:09:36 +02:00 |
Eric van der Vlist
|
b346236789
|
Adding a mechanism to delay actions in the queue.
|
2012-04-22 18:56:15 +02:00 |
Eric van der Vlist
|
3bcb813cb7
|
Unpause Heritrix job.
|
2012-04-22 17:59:39 +02:00 |
Eric van der Vlist
|
f25a9246bc
|
Modifying the way the Heritrix (Spring) config file is generated, since it seems to be picky about whitespace and indentation...
|
2012-04-22 16:27:16 +02:00 |
Eric van der Vlist
|
a3fa073667
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-22 08:44:12 +02:00 |
Eric van der Vlist
|
a1dc635607
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-22 00:01:51 +02:00 |
Eric van der Vlist
|
57daa703da
|
Now building and launching Heritrix jobs...
|
2012-04-21 23:42:16 +02:00 |
Eric van der Vlist
|
be2f974a4c
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-21 22:51:58 +02:00 |
Eric van der Vlist
|
c4c4108025
|
Starting to write pipeline actions that interact with a Heritrix server
|
2012-04-20 20:39:00 +02:00 |
Eric van der Vlist
|
ad35672603
|
Still work in progress, but the WARC archive now validates with warc-tools' warcvalid.py...
|
2012-04-15 00:12:29 +02:00 |
Eric van der Vlist
|
ba51ddfb0b
|
Starting to support content lengths in WARC archives
|
2012-04-14 22:32:33 +02:00 |
Eric van der Vlist
|
9d99928c60
|
Removing the last action from the queue
|
2012-04-13 19:17:20 +02:00 |
Eric van der Vlist
|
01a66903f3
|
First version that can produce a packaged archive.
|
2012-04-13 19:08:04 +02:00 |
Eric van der Vlist
|
5ac9ea90bb
|
Packaging resources that have not been rewritten...
|
2012-04-13 18:42:32 +02:00 |
Eric van der Vlist
|
0e7bdd1de4
|
Adding a basic skeleton to generate what should ultimately be a WARC archive
|
2012-04-13 18:01:53 +02:00 |
Eric van der Vlist
|
3d18e9d8a4
|
Adding a mechanism to avoid archiving the same resource multiple times in a single archive set.
|
2012-04-13 13:05:25 +02:00 |
Eric van der Vlist
|
cf97a98416
|
First version supporting CSS rewriting
|
2012-04-13 12:27:04 +02:00 |
Eric van der Vlist
|
750ccaac7c
|
Dummy (passthrough) implementation of the CSS support...
|
2012-04-13 11:58:38 +02:00 |
Eric van der Vlist
|
16cc943d48
|
Refactoring before supporting CSS
|
2012-04-13 11:16:40 +02:00 |
Eric van der Vlist
|
11027c068a
|
Moving action pipelines into their own directory
|
2012-04-13 10:53:25 +02:00 |
Eric van der Vlist
|
a0bd1a56fd
|
Adding a priority mechanism
|
2012-04-12 14:06:23 +02:00 |
Eric van der Vlist
|
6b10b3e51c
|
Removing an xsl:message.
|
2012-04-12 12:56:21 +02:00 |
Eric van der Vlist
|
fd2ca8f305
|
Adding timestamps to the archive indexes
|
2012-04-12 12:42:55 +02:00 |
Eric van der Vlist
|
c71d5b202d
|
Starting to implement a version of the archiver based on Orbeon's XPL.
|
2012-04-12 11:19:46 +02:00 |
Eric van der Vlist
|
0424eedb2e
|
Adding credits for the logo
|
2012-04-11 15:27:04 +02:00 |
Eric van der Vlist
|
bbe3c7fa0c
|
Logos by Michel Duperrier
|
2012-03-21 22:25:45 +01:00 |
Eric van der Vlist
|
eef5297f98
|
Quick fix for Wikipedia archives issues #6.
|
2012-01-28 11:16:17 +01:00 |
Eric van der Vlist
|
6332cf69a5
|
Adding an empty archives directory in git.
|
2012-01-27 10:42:55 +01:00 |
Eric van der Vlist
|
158172880f
|
Support wget 1.11 (ticket #5)
|
2012-01-26 22:16:57 +01:00 |