“Yes We Scan”

In September 2011, the White House launched an online petition web site, We the People, where anyone can post an idea asking the Obama administration to take action on a range of issues, get signatures, and get a response from their government.

It’s an experiment in democracy, which is generating new ideas and improving on old ideas every day. One of those rising ideas is “Yes We Scan.”

Yes We Scan is an effort by the Center for American Progress and Public.Resource.org to promote digitization of all government information and make it more accessible to the world.

Here at the National Archives, we house the nation’s permanent records, and we think increasing access to our collections in this way is a great idea. Our most recent efforts to do this ourselves, as part of our OpenGov initiative, include the Citizen Archivist project, a Wikipedian in Residence, Tag it Tuesdays, and Scanathons. We are also moving forward on implementing the President’s recent Memorandum on Managing Government Records, which focuses on the need to update policies and practices for the digital age.

Wikipedia “ExtravaSCANza” at the National Archives in College Park, MD.  January 6, 2012. Source: Wikimedia Commons

But all those things aren’t enough. Yes We Scan calls for a national strategy, and even a Federal Scanning Commission, to figure out what it would take to digitize the holdings of many federal entities, from the Library of Congress to the Government Printing Office to the Smithsonian Institution.

These things, though important, are not the same as a national strategy and the commitment of every federal agency and cultural institution to solving this problem. There are many questions to be answered – what should the National Archives’ priorities be? Do we focus on preserving deteriorating paper records, still bound with red ribbons from two centuries ago? Do we make digital copies of Vietnam Era film footage? Should we focus on preserving those older paper records while citizens volunteer to digitize more recent, and better preserved, records?

These are all questions that the idea behind Yes We Scan touches on. There are many more – we can start a national conversation by discussing them here first. What are your thoughts on how the National Archives and other agencies should proceed? What questions should we be asking ourselves?


34 thoughts on “Yes We Scan”

  1. I think the impulse and the idea behind Yes We Scan are a strong foundation, but I fail to see how such a sweeping strategy could be implemented (not to mention funded). The plain truth is that the vast majority of information held by institutions like NARA, LOC, and the GPO does not have much of an audience in the general public. Why spend valuable time and funding digitizing items for access when no one will be accessing them? Those researchers who need to access them will make the necessary efforts to do so (as they have for decades).

    That being said, the options you present are good ones. It is reasonable to digitize items where there is high interest or where preservation concerns demand it, and to take advantage of volunteer labor where it is available. However, in my opinion, Yes We Scan’s mission to digitize the entire holdings of multiple federal entities is unreasonable and unnecessary (unless they are offering to perform the work). This is not an issue of transparency or lack thereof, but of making the best use of what resources are available.

    Thanks for an interesting post!

  2. I think it would be really interesting to see how a Scanning Commission would quantify the effort it would take to scan all of the holdings and make them accessible. In quantifying the effort it would take, I hope it wouldn’t be put in terms that paralyze our efforts to actually do it. We need creative thinking and economies of scale, not paralyzing statements of scale. With regard to electronic records, I often used to hear folks saying that if you printed out a terabyte of information it would be as tall as the Empire State Building. This statement was often repeated as a way to grasp the size of a terabyte, but it did nothing to help us deal with the scale, never mind that we would never want to print out that terabyte and that it’s managed differently than paper. We just need to be cautious that when we set out to quantify the effort it would take to scan everything we don’t get caught up in awe of the size of the task.

  3. The stated goal is admirable, “… digitization of all government information in an effort to make it more accessible to the world.” However, as stated, in this economy it is untenable. This project is as important to the economic and social infrastructure of the United States as is the renewed highway system and energy grid. It will be a more difficult sell to the politicians and taxpayers than the more visible roads and electric lights. Such a daunting cost and unclear return on public investment may hinder the project.
    As an exercise in information governance, I would suggest that the Federal Scanning Commission (created soon?) go for a quick win by prioritizing the digitization, based on public interest and utility. It is my opinion that a demonstration of results would gain public and budgetary support and act as a proof of concept of the value of digitization, which will lead to private sector involvement and support. Certain digitization projects may lead to direct citizen benefit, such as reduced access costs, improved ability to apply for federal money, etc. Other projects may generate sufficient interest to allow profitable advertising linkage.

  4. Based on eight hours of scanning NARA photographs, and eight years of cataloging monographs, I’d like to offer a few observations.

    1. Most of the NARA photographs I worked on actually do contain source material about specific places at a specific time. Now that we have geotagging, it’s possible to link photos to their locations, and use a smartphone or VR app to have an on-the-spot, augmented reality view of what places looked like in the past. Geotagging adds a new twist on the traditional historical value of photos, and people may well find very creative ways to use NARA images that we haven’t discovered yet.

    2. Adding some dedicated public scanners to the reading rooms of NARA, LOC, and other agencies, and throwing some more “scanning parties” would be great!

    This sort of work can be pretty dreary and monotonous when you’re sitting by yourself, but when you’re working at a table with other people, chatting and sharing your discoveries, it turns into a lot of fun! The scan-a-thon was a lot like an old-fashioned quilting bee or husking bee (http://en.wikipedia.org/wiki/Bee_%28gathering%29). Reminded me of how my urban homesteader brother-in-law threw a bushel of freshly harvested hops in the middle of the dining room table, and had eight of us turn to and clean it.

    Scan-a-thons would be a great work project for folks who need to get out of the house and interact with people. You might well find that folks of all ages would enjoy working on scan-a-thons for minimum wage or as volunteers, if you set it up as a social gathering.

    3. On the other hand, there is a case to be made for the factory-work approach (especially for books), even if the tedium of this work results in poor scans at times.

    Some years ago, I heard a fellow estimate that he could get everything at the Library of Congress scanned for something on the order of $36 million, if you zipped through the project with an industrial strength workforce, proper equipment, and didn’t get stuck trying to sort out the copyrighted vs non-copyrighted materials.

    If it appears possible to make a scanned backup copy of the LC physical holdings for under $100 million, you’d think this project would be underway already!

    4. Given that NSF has worked out all that cyberinfrastructure for e-science out in San Diego, they’d be a logical partner in this effort.

    5. NDIIPP, the National Digital Information Infrastructure and Preservation Program, and homegrown digitization projects have done a lot of experimentation, but are not able to systematically provide a comprehensive bulk upload of the nation’s legacy data from printed/recorded to digital formats.

    6. A non-governmental standards body could determine the technical requirements necessary to ensure high quality, preservation, and veracity of data, incorporating the needs of different sorts of stakeholders.

    But again, it’s hard to see how such a standards body would provide the needed bulk uploading capacity.

    7. After knocking myself out for 8 years cataloging books, it’s clear to me that when it comes to organizing the world’s information, you need lots and lots of people to do it. Different sets of eyes add something new to the keywords/metadata, and adding more sets of hands helps sort the physical material that holds the information.

    The Wikipedia model is a great way to get lots of people together to work on organizing information, and could be really useful for adding the metadata.

    But again, it’s hard to see how the current Wikipedia volunteer base can provide enough sets of hands to systematically work through that bulk upload from physical data …

    8. A Federal Scanning Project would make a lot of sense, and is really the only way to ensure that the bulk of the nation’s printed / recorded cultural heritage gets converted to digital form.

    Piecemeal efforts to digitize will leave some very unpredictable gaps in our knowledge. These gaps aren’t an issue for those of us over 50, who understand that everything “really isn’t on the internet!” However, these gaps in knowledge are likely to result in some very major gaffes and screwups for the younger people coming up, who will get the wool pulled over their eyes on a regular basis, when key information is hidden from them in print archives and behind pay-per-view firewalls.

    9. A Federal Scanning Project has precedent: much of the indexing of our 19th century newspapers took place as a result of WPA efforts to put educated people who were unemployed to work serving the nation.

    We could put lots of folks to work, go full-force at digitizing the nation’s backup copy of our cultural heritage, and let Congress sort out how to release it under the copyright laws. Why not? Americans used to build a Liberty Ship in 3 days. All it takes is the collective determination to get things done.

    10. Conclusion: Though you asked for a comment, not a dissertation, the prospect of a Federal Scanning Commission or Federal Scanning Project is pretty exciting. Best of luck!

  5. We don’t know until we try. Please sign the petition https://wwws.whitehouse.gov/petitions/!/petition/start-national-effort-digitize-all-public-government-info/15vthgVB

    As to the comment that govt information does not have much of an audience: that’s just flat out wrong. Govt information ranges from the eclectic (http://freegovinfo.info/best) to the critical (http://www.ncbi.nlm.nih.gov/pubmed/), from the scientific (http://www.osti.gov/) to environmental (http://www.epa.gov/enviro/) and statistical (http://fedstats.gov/) and information from all 3 branches of govt (http://www.gpo.gov/fdsys/). There’s an amazing array of information published by our govt both online and distributed to more than 1200 federal depository libraries around the country — check out http://www.fdlp.gov/ to find out where. (disclosure: I’m a govt information librarian)

  6. I agree with Jessica’s opinion that the public is uninterested in the overwhelming amount of information contained in the billions of documents held by federal repositories. I’m not even sure what’s to be gained by spending the time and money on trying to establish what it might cost to scan it all. At best, whatever the figure, it’s a very rough idea. And with the scale we’re talking about, it’s bound to be way, way off.

    The federal government cannot seem to get its act together to figure out ways to repair our crumbling infrastructure. Does anybody really think that the proposed scanning endeavor would ever come to fruition?!

  7. While the thought of scanning all government records appears to be good for the people, perhaps the right answer is to first create an intervention. Throughout the whole of the US Government we have the issue of paper documents still being created due to a lack of standards for electronic publishing, signing, and retaining.
    Start at NARA by embracing the PDF/A standard (ISO 19005-1). Ensure the standard is referenced in the NARA accepted practices (http://www.archives.gov/records-mgmt/policy/guidance-regulations.html). Provide government agencies with the authority to embrace this practice. Then facilitate the intervention by requiring records to be scanned to the PDF/A standard as the process to be followed.
    Finally, end the madness. As records are submitted for long-term or permanent storage, refuse to accept paper generated after the PDF/A process is established as policy.

  8. Some years ago NARA surveyed its own existing digital resources with the hope that they be consolidated in a central repository, presumably ARC. These include tens of thousands of scans of NARA holdings generated for exhibits, print publications, publicity, marketing, preservation, web use, and reference, which have accumulated in day-to-day routine business operations, not major digitization projects. Today, we would add social media to the collection. To my knowledge, there was no follow-through, and these digital assets remain dispersed through NARA, some online, some not, and may or may not survive. For the vast majority, I would assume there is minimal archival metadata attached, metadata which would add to their value and facilitate their addition to ARC. Therein lies a problem and an opportunity. The opportunity is to devise a structured system so that whenever NARA holdings are scanned, both the images and basic archival metadata are captured and funneled to ARC. This would be most easily enforced in our own Digital labs and on other NARA-operated equipment.

    Many will dismiss this as cherrypicking. In fact, these images usually represent the value-added work of a researcher or other individual who has waded through the records to identify items of particular value in an exhibit, an article or book, or some other end use with public value. The question to us at NARA is whether we leverage this work into a system available and useful to all. In recent years, Exhibits, the ARC Team and custodial units have worked together to make available in ARC scans and metadata collected for recent exhibit projects. Unfortunately, I do not have the brainpower to structure such a NARA-wide system or the political skills to make it happen, but maybe someone out there does.

  9. First, I agree with the original blog post that we need a national strategy. Second, I agree that scanning everything may not be feasible at this time and in this economy; even performing the analysis requested by the “Yes We Scan” petition would be very costly and likely infeasible.

    Approaches taken by cultural heritage institutions may serve as possible partial solutions to the scanning portion of this issue. One, capture what is being scanned already. Two, “scan on demand”: whatever is requested in the archives is scanned, a box at a time. If no metadata to speak of is created, then at least the contents can be linked to from the finding aid, and online if permissible. Three, if resources are available to scan more, begin with the collections which historically have the greatest demand and work towards those which don’t.

    Glen has a good point in that we need to educate creators of content that if they want the information they produce to have a prayer of being accessible for very long, we need that info in archival formats. So the next big issue is selecting appropriate formats, making it easy to create content using them, and educate, educate, educate content creators. This won’t help with social media capture, obviously, but hopefully we will then at least be able to retain support for more vital documents.

    Then there’s the 3rd part of the issue, also spelled out in the “Yes We Scan” petition. We must have planning and implementation of both a governmental and a national digital preservation strategy. Much has been accomplished with NDIIPP and the effort to develop trusted digital repositories, but we are far from having a complete strategy in place and available to all who need to make use of it. The DPOE is an effort to spread awareness, but it will also require the infrastructure and sustainable funding on a national scale to make a real national digital preservation program a viable reality. Collaborations are a starting point — but every chain is only as strong as its weakest link. If the plan developed is as viable as the internet, then that is perhaps the best we can accomplish.

  10. I’ve said this before, and I’ll say it again. As a projects archivist, I would love to have a legal-sized scanner at my desk, hooked up, and have the ability to scan and upload to (a less clunky version of) ARC.

  11. I agree with Jody’s point: we need to capture what is being scanned already.

    And to Glen’s point, we need to educate, educate, educate, and that includes our own internal NARA staff.

    Often, records that will one day be accessioned into the Archives are scanned for customers as part of FRC reference or reimbursable FRC services. We should be following our own guidance and policies when creating digital formats, even when doing so at the request of our customers.

  12. Nice idea, but I agree with some of the comments others have already made regarding cost and the amount of interest the public would have in some of our records. Also, and not sure if this was mentioned already, but I would think that these scanned documents would have to be indexed with basic metadata. I can’t see how having billions of scanned images up online with nothing to hang them on is going to help a researcher.

  13. http://www.ldschurchnews.com/articles/48623/Digitizing-hastens-at-microfilm-vault.html

    The government should consult with the Mormon Church in Salt Lake City. They have recently announced that they are scanning their awesome collection of microfilm rolls held in their granite mountain vault.

    The following is from the LDS Newsletter:
    Digitizing hastens at microfilm vault; Computerizing Church’s 2.4 million rolls is now feasible.
    “New, faster technology announced here March 3 will cut decades off the expected completion date, which remains unannounced.” About two years ago, a new approach to scanning began with a team under the direction of Derek B. Dobson, product manager. A pre-study on what would most benefit expert and novice researchers alike helped the team crystallize its objectives. From the study, one thing loomed largest: easy access.

    “We would like to have people be successful (in their research) in short bursts of time,” he said. He emphasized that in scanning the microfilm, there is no latitude for missed images, those too dark or light, blurry or clipped. And the results had to come much faster.

    The result was FamilySearch Scanning 1.0, a state-of-the-art computer system, called a breakthrough, “cutting edge in the industry,” said Brother Dobson.

  14. What a great way to preserve our history. Researchers and historians will really enjoy this. What a great idea, and if there is a natural disaster, then we have everything backed up!!

  15. It is not a waste of time or money, Phil. For researchers who lay the foundation of understanding that leads to solutions, innovations, and the future, historical preservation is essential, unless we want to keep repeating what has already been done before, unless we want to keep stumbling through the dark wondering why, how, what, when, where. I realize that history often repeats itself, but maybe if we studied it more we would progress. History is a foundational element for progress and understanding for individuals as well as nations. Unfortunately, for most students history was taught in the schools as a time line of facts which had little relevance. I hope that NARA can make this happen. I know this nation has pressing issues, but that doesn’t mean we should or have to lose our history in order to survive the crisis of today, which is nothing new in the history of the world.

  16. So even if everything was scanned, how would it be stored and then retrieved? I already see lots and lots of records on the National Archives website that are “cataloged” but not in a way that a researcher can find them. Archivists are predisposed to store things by who created them, not what they are about, and historians search for things based on what they are about without regard to who made them. This is a serious problem. Archivists and researchers need to come to a common understanding of search parameters so that documents can be recovered. And a third “actor” in this drama is the computer programmers, who know nothing about how research is conducted and care even less about the provenance of records. I still share with fellow historians the story of seeing that, according to Footnote.com, over 60,000 men served the Union Army from the State of South Carolina, an obvious error since there were no organized Union army regiments from S.C. When I examined the pension application cards for a handful of these men, I figured out that Footnote’s people had scanned the USCI file, pertaining to the United States Colored Infantry, as “South Carolina” after ignoring the U and the I, because they don’t know enough history to recognize what are for historians obvious errors. And I can’t help but believe that hundreds of descendants of these Colored soldiers have written in their family histories that their ancestors served in South Carolina regiments that in fact didn’t exist or, worse, that these soldiers lived in South Carolina.

    “Fortunately” I am long retired and will no longer be alive when historians fifty years from now are trying to write the history of modern times for which they can find no documentation. Seriously, is anyone saving email, tweets, and other electronic communication?

  17. Many of the comments above imply that it should be an all or nothing approach to scanning or even old records versus new records. I think NARA has enough experience in providing reference that we can make good decisions in setting our own digitization priorities based on researcher and citizen needs.

    I do agree that we need better ways to capture and provide access to “born digital” records and to manage the many images that are made daily to fulfill researcher, exhibit, or other ad hoc requests.

  18. I know plenty of moving image archivists who would like to work on a digital archives project for the U.S. govt. However, “hiring” volunteers is a pathetic strategy in this terrible economy we’re in.

    I agree this is a potential “public works” WPA-type national infrastructure project! Let’s put people back to work without requiring deeply discounted labor.

  19. I think that digitizing all documents is a great idea, and I would presume it to take significantly less money than many other federal programs. Information and ideas should be free for all, not commodified. If the public can access the Library’s contents for free in person, I don’t see why not online.

  20. I was hired by my agency specifically to deal with this in our office. We were buried in 50 years’ accumulation of old record and reference material, and I was brought in to organize, archive, and clear out the clutter. After a year and a half of research, interviews with program staff, cataloguing and indexing, we managed to reduce our paper bulk by 90%, digitized all pertinent materials (my agency’s Disposition Schedule does not allow electronic records because it hasn’t been updated since 1988), and turned over 100 feet of permanent records to the good folks at NARA. If the President flipped the switch on the OMB’s December 2009 Open Government Directive to publish all FOIA-subject materials on open web sites, my office could comply immediately.

    If this experience taught me anything, it’s that this effort is worthwhile (productivity and information-sharing have skyrocketed due to the ready-to-hand e-Files Library we created) and that it has to be carried out at the local office level. There simply is no way to do this from a top down approach without creating havoc.

  21. Patricia, you said: “for most students history was taught in the schools as a time line of facts” and you are so correct. While I always enjoyed history and have read many historically based novels, my appreciation for the history of this country has come from doing genealogical research. What a history we truly have – and so many know so little.

    Mr. Davenport, I have to agree. My mother kept every letter (or at least a whole bunch of them!) from her siblings and friends/family during her life. When she died, I found her hope chest filled to the brim with these documents that contain not only her history but that of those she cared about. It’s a treasure that is simply irreplaceable. She left 13 photo albums as well.

    And all of them need to be sorted, scanned and preserved so that my grandchildren’s grandchildren will have some point of view about their history.

    And it is going to be nothing less than a daunting task. As with any task of this magnitude, breaking it down into smaller parts makes it much less daunting.

    And as with any project, you need to know:
    What is the scope of the project
    What are the tools needed and available to complete the project
    What resources will be needed (financial/physical)
    How can the project be delegated to ensure success

    I heartily agree with the previous comment regarding how to catalog the material. There is much cross over between content and author and the content can often involve multiple subjects. So not only should the data be properly cataloged, it also needs to be searchable.

    The State Historical Society of Missouri has a good system in place with cross indexing of their documents: http://shs.umsystem.edu

    Rather than reinventing the wheel, I’d suggest some collaboration with those who have done this type of work to determine best practices, and what mistakes were identified during the process so they are not repeated.

    Not sure what is currently in place for those documents being created as we speak – I’m sure there are server backups – why not embed in any document being created at least the basics – which agency/department/section/etc. So at least there is basic indexing from the get go.

    I love the GPS idea. I have tons of old post cards my aunt received from a traveling salesman beau (yes, he was!) that were sent during the early 1900s. They are just a fabulous picture of towns and places all across the country and I intend to share them along with my mother’s photos on our family website. Embedding the GPS info in them is a great idea!

    And I’m going to go back to the WPA: there are literally thousands upon thousands of records that were saved because these workers copied them from ripped, torn, partially burned documents. It’s bad enough we lost so many records when the British burned the courthouses and the records in Washington when they came blazing through during the War of 1812.

    I think there is no question that the project must begin with the documents that are most in danger of being lost forever.

    This is our heritage folks. When these documents are lost, they are lost forever. I believe this is probably one of the most important projects the National Archives can undertake …and I am quite certain a call for volunteers will garner a huge response. Beyond that, leave it up to the professionals to decide once they have categorized their inventory.

    Work with the various states’ librarians and historical societies to delegate these tasks around the country and you will get the help you need.

    And as far as funding: show enough information in the search to confirm that this is the record you’re looking for, and charge a reasonable fee to download it. I’d happily pay the fee to avoid the cost of a trip around the country to find these documents.

    Just my two (or three -:) ) cents.

  22. To me the question you are asking seems to sound like “how should we preserve the stuff we’ve already got?” An important question, but it’s a closed problem: there is a limited (albeit almost unimaginably VAST) amount of it.

    But that vast closed problem is growing exponentially each year, to the point where you already have to perform triage and decide which swathes should be simply ignored and allowed to die.

    So the open problem is the one to tackle first, since it lets you “cap the bottle” and avoid that: “how can we encourage local and national government bodies to preserve the data they produce?”

    Data production is growing exponentially. All that information needs to be both public and intelligently searchable and mineable. Millions of lines of code, photos, artworks, and hours of video are created by the government each year. Where’s it all going? Not into some central searchable repository, that’s for sure.

    WHY NOT?

    *That*, as I understood it, was the point behind the petitions that were responded to here: https://wwws.whitehouse.gov/petitions/!/response/digitizing-federal-public-records – that right now, people’s life’s works are simply being shredded, deleted, and recorded over, because there is no central policy or repository for preserving the NEW stuff.

    Screw the old stuff. Our parents made it, our children can scan it. Our first responsibility should be dealing with the data we make ourselves.

  23. I love this idea. I think having fragile records handled by professionals, and others by volunteer citizens, would be a good implementation. Assuming there’s a comprehensive and searchable online database listing all the materials, the public could submit online requests for scans. This would be analogous to requesting an interlibrary loan, except digital access becomes available not just for the original requester, but for every future person interested in the material. In addition, this would effectively identify which materials are of interest to the public, so that the archivists don’t waste time and resources blindly going from beginning to end of the catalog.

  24. I have been party to a major project of digitizing printed documents thought to be in the public interest and, in the end, we realized that there was only one way to go about it: start at the beginning and just get to work.

    Naturally, in any collection of data this massive, there will be plenty of things – in this case, millions of things – that someone, somewhere will say are of no reasonable interest to the public and that time and resources will be better spent elsewhere.

    But there are two simple reasons why this position doesn’t hold water.

    To start, these records belong to us. If they were generated or owned or housed by our government, then they are ours and we have a right to access them. Paper and film are simply too temporary to ensure that right for future generations, and they will be lost if not preserved. That is a guarantee.

    More important still is the fact that one never really knows what one might need until the occasion presents itself. An 1850s paper on the decline of prairie dog populations would easily have been judged not worth the government’s best preservation efforts. Perspective is different 160 years later, when they’re all but gone. Even our best judgments today may become abject folly in the eyes of those who will come after us.

    Prioritizing will only distract. It is much harder to debate and determine what is most important than it is to determine what is oldest.

    Find the first thing you have. Note the year. Then get everything else from that year. Have one group scan it and another categorize, tag and file it. Then move on to the next.

  25. What should be digitized?
    Pentagon Papers, global climate data from the beginning to the present, all data on the POTUS including televised debates, all global conflict data involving the U.S. or its allies supplied by the UN, Oxfam, and Amnesty International.

  26. Two observations I’d like to make:

    First, about the cost of scanning. Yes, it costs money to scan and catalog and maintain a scanned collection. HOWEVER, compare this to the cost of maintaining physical storage for the equivalent paper and microforms and I think it’s possible we might realize quite a bit of savings in personnel, storage containers, maintenance of facilities, and facilities that might be sold off.

    Second, about priorities. Many libraries are moving to an “on-demand” purchase system, particularly for ebooks — a more patron-driven system. I think it might be good to take a two-pronged approach — have one division dedicated to scanning collections methodically based on a rubric of condition and importance, and another dedicated to scanning based on customer demand, starting with items which have been most requested in paper.

  27. I’d like to suggest that scanning all of the NARA finding aids is a priority. I realize that some of them are old and that some are out of date (so there should be a clear explanation of when the aid was created and how likely it is to be up to date). Nevertheless, anyone planning to do research at any of the NARA facilities would greatly appreciate this, because at least we’d be able to do some planning before we arrive.

  28. Hey Patricia: Aimlessly scanning documents with no metadata and no way to house them is A WASTE OF MONEY AND TIME!

  29. The primary focus right now should be on developing a good process for digitization and archiving, one that will scale well and can work for organizations in the Federal government other than just NARA.
    If everyone can be persuaded not only to scan their paper documents and ensure they are easily accessible (i.e. tagged with the proper metadata, in a central location), but to create them digitally to begin with, then at least the amount of materials to scan will stop growing at such an alarming rate.

    As for how to prioritize digitization of existing documents, I would suggest a two-fold approach. Have archivist experts begin working on scanning the most fragile documents, to ensure they are preserved while it is still possible.
    At the same time, enact a policy that all new documents are scanned when being placed into the archives, and require that any documents already in the archives that are not already digitized are scanned after viewing. That way the public will organically determine which pre-existing documents are most important to scan.

  30. The comments above relating to “Yes We Scan” are excellent.

    Key issues including cost, prioritization, standardization, organization, and accessibility of the existing mass of material are presented.

    As a technologist, I offer the following thought. A sliver of funding could be allocated to a DARPA-like office to solicit ideas, then award and manage funding to encourage creation of technological solutions to archival issues. One track could be hardware oriented – perhaps to develop low cost, high throughput scanners. An example is this DIY effort – http://diybookscanner.org/. Another track would be to fund the development of software that would address the indexing/classification issues. Indeed, nothing will replace a human, but developing software that lets the available resources focus on the specific tasks the machine is not yet capable of is a worthy goal. This capability would quickly evolve if seeded by modest funding. For example, the capture of research “debris” from other research efforts is likely a good area to explore. I doubt our economy can afford to employ additional people to support indexing efforts. Our infrastructure issues are more pressing and provide a higher ROI.

    I suggest that the development efforts of the internet giants and other U.S. agencies to house and access data (the Cloud) are sufficient to support massive archives. Funds need not be allocated for development of this element. The technology is there, and it becomes a question of whether to buy or to rent capacity.

    Anyhow, food for thought.


  31. Using volunteers is an amazing way to work on this huge project for many reasons: 1) There are huge numbers of retirees who would love to get out of the house and contribute to a big project. The social aspect of these sorts of things is amazing; 2) Unpaid internships for students are common – why not create a bunch of these and teach students about digitization, scanning, and technology?; 3) Teaching the unemployed new technology skills for digitization and management of digital records would be a great service.

    As for the comments about no interest from the public, most people have never considered what NARA holds. Education about the holdings would go a long way.

    And these documents are part of our infrastructure, part of who we are as a country. Just because a few people think they are boring is beside the point of the need to preserve this record.

  32. Has anyone considered, for a project of this magnitude, incorporating social tagging/metadata to help create another layer of organization and structure to NARA’s digital holdings? LOC’s The Commons project on Flickr has met with at least some success (http://blogs.loc.gov/loc/2008/12/library-releases-report-on-flickr-pilot/), and it would help address the issues David Paul raised–the potential disconnect between archivists’ and historians’ priorities and search techniques. Likewise, our perspectives of the holdings will continue to evolve and change, and as new scholarship develops, new descriptors about these holdings will have to develop, as well. The costs in manpower could be less if applying advanced metadata (beyond, say, creator and subject headings) was in the hands of the researchers.

  33. Aside from the economic realities of initiating such an effort, NARA has millions of records that are quite old, fragile, and in many cases falling apart. Proper stabilization and preservation of these records before scanning will be required–we’re also responsible for preserving these records, folks.

Comments are closed.