Commit Graph

54 Commits

Author SHA1 Message Date
Eric van der Vlist c79bd8e49c Forcing HTML content type for XHTML documents 2012-04-28 09:42:21 +02:00
Eric van der Vlist 9bce34f7c6 Rewriting links in HTML and CSS resources within WARC archives 2012-04-27 18:29:15 +02:00
Eric van der Vlist 5b162a64df WARC mail extract loop 2012-04-27 17:34:18 +02:00
Eric van der Vlist 466d4473ce Generating a resource index to facilitate further processing. 2012-04-27 17:04:17 +02:00
Eric van der Vlist 675ed04aba Download and convert the crawl log 2012-04-26 17:08:28 +02:00
Eric van der Vlist 6f64c7f8a9 Handling payload content types 2012-04-26 14:13:24 +02:00
Eric van der Vlist be1a361ab9 Implementing yet another WARC parser (the heritrix one didn't work well with Orbeon due to http client library conflicts). 2012-04-26 09:48:43 +02:00
Eric van der Vlist 307b6d2a72 Adding whois records 2012-04-23 12:11:17 +02:00
Eric van der Vlist 22c3028c38 First stab of WARC packaging. 2012-04-23 11:26:59 +02:00
Eric van der Vlist 51c2058aa6 Queue an action to package the Heritrix WARC. 2012-04-23 11:09:36 +02:00
Eric van der Vlist b346236789 Adding a mechanism to delay actions in the queue. 2012-04-22 18:56:15 +02:00
Eric van der Vlist 3bcb813cb7 Unpause Heritrix job. 2012-04-22 17:59:39 +02:00
Eric van der Vlist f25a9246bc Modifying the way the Heritrix (spring) config file is generated since it seems to be picky on whitespaces and indentation... 2012-04-22 16:27:16 +02:00
Eric van der Vlist a3fa073667 Update to follow changes to Orbeon Forms experimental features... 2012-04-22 08:44:12 +02:00
Eric van der Vlist a1dc635607 Update to follow changes to Orbeon Forms experimental features... 2012-04-22 00:01:51 +02:00
Eric van der Vlist 57daa703da Now building and launching Heritrix jobs... 2012-04-21 23:42:16 +02:00
Eric van der Vlist be2f974a4c Update to follow changes to Orbeon Forms experimental features... 2012-04-21 22:51:58 +02:00
Eric van der Vlist c4c4108025 Starting to write pipeline actions that interact with a Heritrix server 2012-04-20 20:39:00 +02:00
Eric van der Vlist ad35672603 Still work in progress, but the WARC archive now validates with warc-tools' warcvalid.py... 2012-04-15 00:12:29 +02:00
Eric van der Vlist ba51ddfb0b Starting to support content lengths in warc archives 2012-04-14 22:32:33 +02:00
Eric van der Vlist 9d99928c60 Removing the last action from the queue 2012-04-13 19:17:20 +02:00
Eric van der Vlist 01a66903f3 First version that can produce a packaged archive. 2012-04-13 19:08:04 +02:00
Eric van der Vlist 5ac9ea90bb Packaging resources that have not been rewritten... 2012-04-13 18:42:32 +02:00
Eric van der Vlist 0e7bdd1de4 Adding a basic skeleton to generate what should ultimately be a WARC archive 2012-04-13 18:01:53 +02:00
Eric van der Vlist 3d18e9d8a4 Adding a mechanism to avoid archiving the same resource multiple times for a single archive set. 2012-04-13 13:05:25 +02:00
Eric van der Vlist cf97a98416 First version supporting CSS rewriting 2012-04-13 12:27:04 +02:00
Eric van der Vlist 750ccaac7c Dummy (passthrough) implementation of the CSS support... 2012-04-13 11:58:38 +02:00
Eric van der Vlist 16cc943d48 Refactoring before supporting CSS 2012-04-13 11:16:40 +02:00
Eric van der Vlist 11027c068a Moving action pipelines in their own directory 2012-04-13 10:53:25 +02:00
Eric van der Vlist a0bd1a56fd Adding a priority mechanism 2012-04-12 14:06:23 +02:00
Eric van der Vlist 6b10b3e51c Removing an xsl:message. 2012-04-12 12:56:21 +02:00
Eric van der Vlist fd2ca8f305 Adding timestamps to the archive indexes 2012-04-12 12:42:55 +02:00
Eric van der Vlist c71d5b202d Starting to implement a version based on Orbeon's XPL or the archiver. 2012-04-12 11:19:46 +02:00
Eric van der Vlist 0424eedb2e Adding credential to the logo 2012-04-11 15:27:04 +02:00
Eric van der Vlist bbe3c7fa0c Logos by Michel Duperrier 2012-03-21 22:25:45 +01:00
Eric van der Vlist eef5297f98 Quick fix for Wikipedia archives issues #6. 2012-01-28 11:16:17 +01:00
Eric van der Vlist 6332cf69a5 Adding an empty archives directory in git. 2012-01-27 10:42:55 +01:00
Eric van der Vlist 158172880f Support wget 1.11 (ticket #5) 2012-01-26 22:16:57 +01:00
Eric van der Vlist be2719ed73 Support wget 1.11 (ticket #5) 2012-01-26 22:15:38 +01:00
Eric van der Vlist 7543bba0b3 Include wp-admin/includes/plugin.php when needed. 2011-06-05 00:11:24 +02:00
Eric van der Vlist 1033614814 #4 detection of the encoding used in the archives. 2011-06-04 20:06:00 +02:00
Eric van der Vlist 5a50ccf29c #3: supporting other filenames than index.html (enhancement) 2011-06-04 17:18:47 +02:00
Eric van der Vlist 136dad5a15 #3: supporting other filenames than index.html 2011-06-04 16:33:57 +02:00
Eric van der Vlist dad0250e5a #2: trying to implement a semaphore with wp_options... 2011-06-04 15:55:15 +02:00
Eric van der Vlist 30fae5a621 Removing the last references to perwarc (former bash implementation) 2011-06-04 09:40:37 +02:00
Eric van der Vlist 3c55dce51f Implementing the archive retrieval using wget. 2011-06-04 00:40:16 +02:00
Eric van der Vlist bb2bfbace3 Checking that we can execute wget (no need of passthru for us). 2011-06-03 20:36:25 +02:00
Eric van der Vlist b34168327f Checking that we can execute wget. 2011-06-03 20:34:54 +02:00
Eric van der Vlist 5b7eecf33e Checking that we can create an archives sub directory. 2011-06-03 20:02:15 +02:00
Eric van der Vlist 3fe3b3652a Checking that the Broken Link Checker plugin is installed. 2011-06-03 19:35:18 +02:00