Eric van der Vlist
|
a7c3525ef6
|
Hmmm... HTML should be serialized as HTML, of course!
|
2012-04-28 16:52:28 +02:00 |
Eric van der Vlist
|
c79bd8e49c
|
Forcing HTML content type for XHTML documents
|
2012-04-28 09:42:21 +02:00 |
Eric van der Vlist
|
9bce34f7c6
|
Rewriting links in HTML and CSS resources within WARC archives
|
2012-04-27 18:29:15 +02:00 |
Eric van der Vlist
|
5b162a64df
|
WARC mail extract loop
|
2012-04-27 17:34:18 +02:00 |
Eric van der Vlist
|
466d4473ce
|
Generating a resource index to facilitate further processing.
|
2012-04-27 17:04:17 +02:00 |
Eric van der Vlist
|
675ed04aba
|
Download and convert the crawl log
|
2012-04-26 17:08:28 +02:00 |
Eric van der Vlist
|
6f64c7f8a9
|
Handling payload content types
|
2012-04-26 14:13:24 +02:00 |
Eric van der Vlist
|
be1a361ab9
|
Implementing yet another WARC parser (the Heritrix one didn't work well with Orbeon due to HTTP client library conflicts).
|
2012-04-26 09:48:43 +02:00 |
Eric van der Vlist
|
307b6d2a72
|
Adding whois records
|
2012-04-23 12:11:17 +02:00 |
Eric van der Vlist
|
22c3028c38
|
First stab at WARC packaging.
|
2012-04-23 11:26:59 +02:00 |
Eric van der Vlist
|
51c2058aa6
|
Queue an action to package the Heritrix WARC.
|
2012-04-23 11:09:36 +02:00 |
Eric van der Vlist
|
b346236789
|
Adding a mechanism to delay actions in the queue.
|
2012-04-22 18:56:15 +02:00 |
Eric van der Vlist
|
3bcb813cb7
|
Unpause Heritrix job.
|
2012-04-22 17:59:39 +02:00 |
Eric van der Vlist
|
f25a9246bc
|
Modifying the way the Heritrix (Spring) config file is generated, since it seems to be picky about whitespace and indentation...
|
2012-04-22 16:27:16 +02:00 |
Eric van der Vlist
|
a3fa073667
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-22 08:44:12 +02:00 |
Eric van der Vlist
|
a1dc635607
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-22 00:01:51 +02:00 |
Eric van der Vlist
|
57daa703da
|
Now building and launching Heritrix jobs...
|
2012-04-21 23:42:16 +02:00 |
Eric van der Vlist
|
be2f974a4c
|
Update to follow changes to Orbeon Forms experimental features...
|
2012-04-21 22:51:58 +02:00 |
Eric van der Vlist
|
c4c4108025
|
Starting to write pipeline actions that interact with a Heritrix server
|
2012-04-20 20:39:00 +02:00 |
Eric van der Vlist
|
ad35672603
|
Still work in progress, but the WARC archive now validates with warc-tools' warcvalid.py...
|
2012-04-15 00:12:29 +02:00 |
Eric van der Vlist
|
ba51ddfb0b
|
Starting to support content lengths in WARC archives
|
2012-04-14 22:32:33 +02:00 |
Eric van der Vlist
|
9d99928c60
|
Removing the last action from the queue
|
2012-04-13 19:17:20 +02:00 |
Eric van der Vlist
|
01a66903f3
|
First version that can produce a packaged archive.
|
2012-04-13 19:08:04 +02:00 |
Eric van der Vlist
|
5ac9ea90bb
|
Packaging resources that have not been rewritten...
|
2012-04-13 18:42:32 +02:00 |
Eric van der Vlist
|
0e7bdd1de4
|
Adding a basic skeleton to generate what should ultimately be a WARC archive
|
2012-04-13 18:01:53 +02:00 |
Eric van der Vlist
|
3d18e9d8a4
|
Adding a mechanism to avoid archiving the same resource multiple times in a single archive set.
|
2012-04-13 13:05:25 +02:00 |
Eric van der Vlist
|
cf97a98416
|
First version supporting CSS rewriting
|
2012-04-13 12:27:04 +02:00 |
Eric van der Vlist
|
750ccaac7c
|
Dummy (passthrough) implementation of the CSS support...
|
2012-04-13 11:58:38 +02:00 |
Eric van der Vlist
|
16cc943d48
|
Refactoring before supporting CSS
|
2012-04-13 11:16:40 +02:00 |
Eric van der Vlist
|
11027c068a
|
Moving action pipelines into their own directory
|
2012-04-13 10:53:25 +02:00 |
Eric van der Vlist
|
a0bd1a56fd
|
Adding a priority mechanism
|
2012-04-12 14:06:23 +02:00 |
Eric van der Vlist
|
6b10b3e51c
|
Removing an xsl:message.
|
2012-04-12 12:56:21 +02:00 |
Eric van der Vlist
|
fd2ca8f305
|
Adding timestamps to the archive indexes
|
2012-04-12 12:42:55 +02:00 |
Eric van der Vlist
|
c71d5b202d
|
Starting to implement a version based on Orbeon's XPL for the archiver.
|
2012-04-12 11:19:46 +02:00 |
Eric van der Vlist
|
0424eedb2e
|
Adding credit to the logo
|
2012-04-11 15:27:04 +02:00 |
Eric van der Vlist
|
bbe3c7fa0c
|
Logos by Michel Duperrier
|
2012-03-21 22:25:45 +01:00 |
Eric van der Vlist
|
eef5297f98
|
Quick fix for Wikipedia archives issues #6.
|
2012-01-28 11:16:17 +01:00 |
Eric van der Vlist
|
6332cf69a5
|
Adding an empty archives directory in git.
|
2012-01-27 10:42:55 +01:00 |
Eric van der Vlist
|
158172880f
|
Support wget 1.11 (ticket #5)
|
2012-01-26 22:16:57 +01:00 |
Eric van der Vlist
|
be2719ed73
|
Support wget 1.11 (ticket #5)
|
2012-01-26 22:15:38 +01:00 |
Eric van der Vlist
|
7543bba0b3
|
Include wp-admin/includes/plugin.php when needed.
|
2011-06-05 00:11:24 +02:00 |
Eric van der Vlist
|
1033614814
|
#4 detection of the encoding used in the archives.
|
2011-06-04 20:06:00 +02:00 |
Eric van der Vlist
|
5a50ccf29c
|
#3: supporting filenames other than index.html (enhancement)
|
2011-06-04 17:18:47 +02:00 |
Eric van der Vlist
|
136dad5a15
|
#3: supporting filenames other than index.html
|
2011-06-04 16:33:57 +02:00 |
Eric van der Vlist
|
dad0250e5a
|
#2: trying to implement a semaphore with wp_options...
|
2011-06-04 15:55:15 +02:00 |
Eric van der Vlist
|
30fae5a621
|
Removing the last references to perwarc (the former bash implementation)
|
2011-06-04 09:40:37 +02:00 |
Eric van der Vlist
|
3c55dce51f
|
Implementing the archive retrieval using wget.
|
2011-06-04 00:40:16 +02:00 |
Eric van der Vlist
|
bb2bfbace3
|
Checking that we can execute wget (we have no need for passthru).
|
2011-06-03 20:36:25 +02:00 |
Eric van der Vlist
|
b34168327f
|
Checking that we can execute wget.
|
2011-06-03 20:34:54 +02:00 |
Eric van der Vlist
|
5b7eecf33e
|
Checking that we can create an archives subdirectory.
|
2011-06-03 20:02:15 +02:00 |