Tuesday, July 09, 2013

Some Thoughts about Archive.org Digital Texts

A few months ago I was asked to undertake a project for Tyndale House which involved searching through their catalogue for out-of-copyright books and trying to link these with electronic versions already on-line. This naturally brought me to archive.org to search through its massive collection of on-line texts. On the basis of that experience I thought that it might be helpful to write down a few thoughts on the subject that may spark a discussion.

First of all, here are some of the many positive features of archive.org texts:

  1. There is a huge amount of material available. Well over 90% of the 450 titles I searched for were already there.
  2. This material can be downloaded in a wide range of formats, including PDF, DJVU, TEXT, HTML and Kindle-compatible files.
  3. The site is supported by an enthusiastic user-base who are constantly adding new material.
Now, some issues that need to be considered.
  1. Some books that are still under copyright in the UK because they were printed there are listed as being in the Public Domain on archive.org because it is hosted in the United States. In order to prevent them being downloaded outside the US, Google Books (linked from archive.org) has blocked non-US IP addresses from accessing them - which of course can always be circumvented using a US-based proxy.
  2. Some material that is in the Public Domain in the UK is being blocked by Google Books.
  3. The first two points serve as a reminder that users cannot rely on the accuracy of the copyright declaration on the site outside of the US - you need to double check everything.
  4. Some scans are incomplete and/or of poor quality.
  5. Scans to PDF are often very large files. In one trial I conducted, reprocessing the file reduced its size by 50%.
  6. The search facility is fine if you know the exact title of the work you are after. However, if you misspell it or get a word wrong then the book you are after will not appear in the results.
  7. Perhaps as a result of (6), the download counts listed next to certain titles are often surprisingly low.
Please "weigh" rather than just "count" the points above, as the benefits of the site far outweigh the negative issues. For me they indicate a number of opportunities to take this work further:
  • Important UK-published theological books in the Public Domain could be re-scanned and hosted so as to avoid the unnecessary blocks on accessing them.
  • Poor quality scans can be replaced.
  • When serving users on dial-up or other slow Internet connections there is scope for reprocessing selected works and hosting them elsewhere to reduce the file sizes.
  • The site lends itself to specialist bibliographies (such as those provided by the TheologyOnTheWeb sites) that link directly to material hosted on archive.org. This gets round the search problem, provided the material is not being blocked.
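For anyone compiling such a bibliography, the exact-title problem can be softened on your own side before you ever reach the site's search box. What follows is only a rough sketch in Python, using the standard library's difflib module: it compares a possibly misspelled title against a list of titles you already hold and returns the nearest ones. The function names, the sample titles and the 0.6 similarity cutoff are my own assumptions for illustration; this has nothing to do with how archive.org's search actually works.

```python
import difflib


def normalise(title):
    """Lower-case a title and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def closest_titles(query, titles, limit=3, cutoff=0.6):
    """Return up to `limit` titles that most resemble `query`, best match first.

    `cutoff` is an assumed similarity threshold between 0 and 1; tune it
    against your own catalogue.
    """
    # Map each normalised form back to the title as originally written.
    lookup = {normalise(t): t for t in titles}
    matches = difflib.get_close_matches(
        normalise(query), list(lookup), n=limit, cutoff=cutoff
    )
    return [lookup[m] for m in matches]


# Illustrative bibliography entries only, not taken from any real catalogue.
titles = [
    "The Life and Epistles of St. Paul",
    "A Commentary on the Book of Psalms",
    "The Expositor's Greek Testament",
]

if __name__ == "__main__":
    # A misspelled, slightly reworded query still finds the intended entry.
    print(closest_titles("Life and Epistels of Saint Paul", titles))
```

A lookup like this only helps with titles you have already catalogued, but it does mean that a slip of the keyboard no longer makes a book disappear.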
What has been your experience with using archive.org? Can you suggest any other ways in which the wealth of material there can be better used?
