Commit Graph

41 Commits

Author SHA1 Message Date
Eric van der Vlist 94d335170f Map application/xhtml+xml to .html 2012-05-04 19:52:56 +02:00
Eric van der Vlist 5e2b674092 Store the crawl log into the archive 2012-05-04 19:49:41 +02:00
Eric van der Vlist c25b18f9f5 Support HTML embed/@src 2012-05-04 19:43:20 +02:00
Eric van der Vlist 16ef7979b0 Trying to guess content types 2012-04-28 23:12:20 +02:00
Eric van der Vlist bc581fabf9 Adapting relative links to match the structure of the browsable archive 2012-04-28 22:29:43 +02:00
Eric van der Vlist bf2980567a Cleaning the algorithm to compute friendly local names. 2012-04-28 18:36:16 +02:00
Eric van der Vlist cfaf8ae9c2 Adding XSLTUnit tests for the local-name function. 2012-04-28 17:29:52 +02:00
Eric van der Vlist a7c3525ef6 Hmmm... HTML should be serialized as HTML, of course! 2012-04-28 16:52:28 +02:00
Eric van der Vlist c79bd8e49c Forcing HTML content type for XHTML documents 2012-04-28 09:42:21 +02:00
Eric van der Vlist 9bce34f7c6 Rewriting links in HTML and CSS resources within WARC archives 2012-04-27 18:29:15 +02:00
Eric van der Vlist 5b162a64df WARC main extract loop 2012-04-27 17:34:18 +02:00
Eric van der Vlist 466d4473ce Generating a resource index to facilitate further processing. 2012-04-27 17:04:17 +02:00
Eric van der Vlist 675ed04aba Download and convert the crawl log 2012-04-26 17:08:28 +02:00
Eric van der Vlist 6f64c7f8a9 Handling payload content types 2012-04-26 14:13:24 +02:00
Eric van der Vlist be1a361ab9 Implementing yet another WARC parser (the heritrix one didn't work well with Orbeon due to http client library conflicts). 2012-04-26 09:48:43 +02:00
Eric van der Vlist 307b6d2a72 Adding whois records 2012-04-23 12:11:17 +02:00
Eric van der Vlist 22c3028c38 First stab at WARC packaging. 2012-04-23 11:26:59 +02:00
Eric van der Vlist 51c2058aa6 Queue an action to package the Heritrix WARC. 2012-04-23 11:09:36 +02:00
Eric van der Vlist b346236789 Adding a mechanism to delay actions in the queue. 2012-04-22 18:56:15 +02:00
Eric van der Vlist 3bcb813cb7 Unpause Heritrix job. 2012-04-22 17:59:39 +02:00
Eric van der Vlist f25a9246bc Modifying the way the Heritrix (spring) config file is generated since it seems to be picky about whitespace and indentation... 2012-04-22 16:27:16 +02:00
Eric van der Vlist a3fa073667 Update to follow changes to Orbeon Forms experimental features... 2012-04-22 08:44:12 +02:00
Eric van der Vlist a1dc635607 Update to follow changes to Orbeon Forms experimental features... 2012-04-22 00:01:51 +02:00
Eric van der Vlist 57daa703da Now building and launching Heritrix jobs... 2012-04-21 23:42:16 +02:00
Eric van der Vlist be2f974a4c Update to follow changes to Orbeon Forms experimental features... 2012-04-21 22:51:58 +02:00
Eric van der Vlist c4c4108025 Starting to write pipeline actions that interact with a Heritrix server 2012-04-20 20:39:00 +02:00
Eric van der Vlist ad35672603 Still work in progress, but the WARC archive now validates with warc-tools' warcvalid.py... 2012-04-15 00:12:29 +02:00
Eric van der Vlist ba51ddfb0b Starting to support content lengths in warc archives 2012-04-14 22:32:33 +02:00
Eric van der Vlist 9d99928c60 Removing the last action from the queue 2012-04-13 19:17:20 +02:00
Eric van der Vlist 01a66903f3 First version that can produce a packaged archive. 2012-04-13 19:08:04 +02:00
Eric van der Vlist 5ac9ea90bb Packaging resources that have not been rewritten... 2012-04-13 18:42:32 +02:00
Eric van der Vlist 0e7bdd1de4 Adding a basic skeleton to generate what should ultimately be a WARC archive 2012-04-13 18:01:53 +02:00
Eric van der Vlist 3d18e9d8a4 Adding a mechanism to avoid archiving the same resource multiple times within a single archive set. 2012-04-13 13:05:25 +02:00
Eric van der Vlist cf97a98416 First version supporting CSS rewriting 2012-04-13 12:27:04 +02:00
Eric van der Vlist 750ccaac7c Dummy (passthrough) implementation of the CSS support... 2012-04-13 11:58:38 +02:00
Eric van der Vlist 16cc943d48 Refactoring before supporting CSS 2012-04-13 11:16:40 +02:00
Eric van der Vlist 11027c068a Moving action pipelines into their own directory 2012-04-13 10:53:25 +02:00
Eric van der Vlist a0bd1a56fd Adding a priority mechanism 2012-04-12 14:06:23 +02:00
Eric van der Vlist 6b10b3e51c Removing an xsl:message. 2012-04-12 12:56:21 +02:00
Eric van der Vlist fd2ca8f305 Adding timestamps to the archive indexes 2012-04-12 12:42:55 +02:00
Eric van der Vlist c71d5b202d Starting to implement a version of the archiver based on Orbeon's XPL. 2012-04-12 11:19:46 +02:00