History and Historians, Now in Actual Work

Ex Readex: Not Much?

Or, Caveat NewsBank

UPDATE: See the subsequent post for the thrilling reveal!

In harmony with one of the recent memes floating around the world of digital history (the happy attention to what historians don’t know about database design, to how particular databases are missing parts of texts within particular series, and to proposals for how we might address this issue directly, as a collective), I thought it might be worthwhile to add my own experience to the pile.

Briefly stated: one of the standard databases, Readex’s America’s Historical Newspapers, seems to have a shockingly low number of texts available for the Jacksonian-to-Antebellum period, fewer even than their own product descriptions (which emphasize the coverage of particular years) would lead you to believe. Here’s a picture:

(see below for a table with the raw numbers)

But before I get too far in here, let me emphasize the caveat: I say “seems” for a couple of reasons.

First and foremost: I may be completely misunderstanding something about how searches work in this database. The y-axis on the above graph is the number of “hits” that a blank or wildcard (* or ?) search in the fulltext field of the database returned for each five-year interval (blanks and wildcards returned equal numbers). This may or may not be the same as the number of “documents” (articles) or “images” in the database, though I should think it would be.

Second, the results I’m getting seem to run counter to some of the statistics Readex itself provides about the component databases searched by “America’s Historical Newspapers.” See, for example, what they say about the number of images in each of the seven (7!) component series of “Early American Newspapers” in the product description. The numbers of images available seem out of whack with what you’d expect…but since these are such big date ranges, it could be that what I’ve found is still true for this period; I don’t know. There also seems to be a discrepancy between these figures and those produced on different search pages available on the Readex site.

On another level, though, all this jibes with something I’ve long suspected: the digitization of print materials from the U.S. follows an uneven U-shaped curve, where the trough is roughly 1800 to 1850. Broadly speaking, it seems like every possible scrap of material from the colonial and revolutionary era has been digitized, extending, in the case of the Founders’ Papers projects (e.g. Rotunda), far into manuscript materials. Then, just as the print explosion begins in the U.S., digitized materials drop off, picking up again with the Civil War and increasing as we approach the 20th century. That seems to be borne out here (or, at least as far as I was willing to go with the data entry).

This curve is in many ways totally understandable; there are fewer colonial and revolutionary periodicals, so why not be complete about it? And obviously there is more interest in the more recent past (perhaps the post-war stuff is digitized because it’s close enough that it might be good for local histories and genealogical work?). But on another level, it’s troubling; especially given how historians are beginning to use this and like databases to talk about the appearance of particular terms. Comprehensiveness, esp. relative comprehensiveness really matters there.

That’s how I happened on this case. I came across this oddity while trying to control for changes in the size of the database while tracking changes in the occurrence of a particular set of terms.1
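For the curious, here is a rough sketch of the kind of control I mean: dividing the hits for a search string by the total hits in the same interval, so that a shrinking database doesn’t masquerade as a change in usage. The two totals come from the raw numbers table at the end of this post; the term counts are hypothetical placeholders (I’m not reporting my actual results here), and the function name is my own invention.

```python
def hits_per_100k(term_hits: int, total_hits: int) -> float:
    """Return term hits per 100,000 total hits in the same interval."""
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return term_hits / total_hits * 100_000


# Totals from the raw numbers table (AHN portal).
totals = {"1815-1819": 6_449_231, "1835-1839": 1_702_150}

# HYPOTHETICAL term counts, chosen only to illustrate the point:
# the same raw count in both intervals.
term_hits = {"1815-1819": 400, "1835-1839": 400}

rates = {
    years: hits_per_100k(term_hits[years], totals[years])
    for years in totals
}

for years, rate in rates.items():
    print(f"{years}: {rate:.2f} hits per 100,000")
```

With identical raw counts, the normalized rate for 1835-1839 comes out several times higher than for 1815-1819, which is exactly why the size of the underlying database matters for this sort of term-tracking.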

What really shocks me about the numbers I’ve pulled out of AHN (which is, to my knowledge, far and away the most comprehensive database for this period there is) is how much the absolute number of articles scanned drops over time. I figured, at best, that the coverage was reduced only in terms of geographic range, or narrowed by a focus on particular publishers, with only New York, Philadelphia and Boston well covered, for example, and not the vast West and South. But apparently (and again, I want to emphasize the tentative nature of my conclusion here), that was dead wrong.

The upshot: for given values of the “Early Republic,” digitization is still a ways away, and we should not trust any database’s comprehensiveness, even if a first glance (or, in my case, continuous usage over, aargh, years) seems to suggest that it contains a lot of material.

Okay, so now some blegs: Any thoughts on this? What am I getting wrong? As I said, I can’t help but think this puts a major crimp in what we can use these databases for, in terms of reliability — but I’d be glad to have any mistakes I’m making here pointed out, the sooner the better.

1.) If you’re interested, the string I was searching was this horrible stew of syntax:

(“East Indies” OR “East India” OR “East Indian” OR China OR Chinese OR Orient OR Orient*) NEAR25(specie OR silver OR dollar? OR currency OR circulati*) AND (trade OR commerce) NEAR25(specie OR silver OR dollar? OR currency OR circulati*) AND (drai* OR expor*) NEAR25(specie OR silver OR dollar? OR currency OR circulati*)

Suggestions on how to improve that monster would be very welcome.

2.) There is also a discrepancy between two portals to search the Readex newspaper database. When I search only newspapers from the Archive of Americana portal, I consistently get higher returns than when I search America’s Historical Newspapers directly. The difference is potentially significant: in the period 1835-1839, AHN returns 1,702,150 hits compared to AA’s 1,933,685, a difference of 231,535, or 13.6% of the AHN figure.

I’m not sure why this is so; the two searches say they are tapping into the same databases, to wit:

AA’s search says it includes:

Early American Newspapers, Series 1 (1690 – 1876), Early American Newspapers, Series 2 (1758 – 1900), Early American Newspapers, Series 3 (1829 – 1922), Early American Newspapers, Series 4 (1756 – 1922), Early American Newspapers, Series 5 (1777 – 1922), Early American Newspapers, Series 6 (1741 – 1922), Early American Newspapers, Series 7 (1773 – 1922), Hispanic American Newspapers (1808 – 1980), African American Newspapers, 1827-1998 (1827 – 1998) and Ethnic American Newspapers from the Balch Collection (1808 – 1980).

While AHN’s claims:

Early American Newspapers Series 1 – 7, 1690-1922; African American Newspapers, 1827-1998; Ethnic American Newspapers from the Balch Collection, 1799-1971; Hispanic American Newspapers, 1808-1980 and Selected Historical Newspapers.

That seems comparable to me. If anything, the AHN search should include more, what with the inclusion of “Selected Historical Newspapers.”

I’m planning to e-mail the Readex people to find out what’s going on — and what I might be missing — but any suggestions in the meantime are welcome.
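If you want to check my arithmetic on the portal discrepancy in note 2, here is a minimal sketch using the two 1835-1839 figures quoted there. The variable names are mine; note that the percentage depends on which portal you treat as the base, and the 13.6% above takes AHN as the base.

```python
# Hit counts for 1835-1839, as quoted in note 2.
ahn_hits = 1_702_150  # America's Historical Newspapers portal
aa_hits = 1_933_685   # Archive of Americana portal

difference = aa_hits - ahn_hits

# Same gap, expressed against each portal's total.
pct_of_ahn = difference / ahn_hits * 100
pct_of_aa = difference / aa_hits * 100

print(f"Difference: {difference:,}")
print(f"As a share of AHN hits: {pct_of_ahn:.1f}%")
print(f"As a share of AA hits:  {pct_of_aa:.1f}%")
```

Measured against the larger AA total, the same gap comes to roughly 12%, so it is worth saying which base you mean when comparing the two portals.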


Raw Numbers

(Note: these figures come from searches performed using the AHN portal, not the AA portal)

| Years | Total “Hits” (articles?) |
|---|---|
| 1795-1799 | 3,626,530 |
| 1800-1804 | 4,422,965 |
| 1805-1809 | 5,041,412 |
| 1810-1814 | 4,838,756 |
| 1815-1819 | 6,449,231 |
| 1820-1824 | 3,856,979 |
| 1825-1829 | 2,338,139 |
| 1830-1834 | 1,991,623 |
| 1835-1839 | 1,702,150 |
| 1840-1844 | 1,907,799 |
| 1845-1849 | 2,398,359 |
| 1850-1854 | 2,682,211 |
| 1855-1859 | 2,762,811 |
| 1860-1864 | 2,757,069 |
| 1865-1869 | 3,725,627 |
| 1870-1874 | 4,531,278 |
| 1875-1879 | 4,566,376 |
| 1880-1884 | 5,015,152 |
| 1885-1889 | 6,958,484 |
| 1890-1894 | 9,701,775 |
| 1895-1899 | 11,397,028 |
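For anyone who would rather poke at these figures than squint at them, here is a small sketch that finds the trough in the raw numbers and the first later interval to exceed the early peak. The figures are transcribed from the list above; the variable names are my own, and the usual caveat applies that “hits” may not equal articles.

```python
# (interval, total hits) transcribed from the raw numbers above (AHN portal).
hits = [
    ("1795-1799", 3_626_530), ("1800-1804", 4_422_965),
    ("1805-1809", 5_041_412), ("1810-1814", 4_838_756),
    ("1815-1819", 6_449_231), ("1820-1824", 3_856_979),
    ("1825-1829", 2_338_139), ("1830-1834", 1_991_623),
    ("1835-1839", 1_702_150), ("1840-1844", 1_907_799),
    ("1845-1849", 2_398_359), ("1850-1854", 2_682_211),
    ("1855-1859", 2_762_811), ("1860-1864", 2_757_069),
    ("1865-1869", 3_725_627), ("1870-1874", 4_531_278),
    ("1875-1879", 4_566_376), ("1880-1884", 5_015_152),
    ("1885-1889", 6_958_484), ("1890-1894", 9_701_775),
    ("1895-1899", 11_397_028),
]

# Lowest interval overall.
trough_index = min(range(len(hits)), key=lambda i: hits[i][1])
trough_years, trough_hits = hits[trough_index]

# Highest interval before the trough.
peak_years, peak_hits = max(hits[:trough_index], key=lambda row: row[1])

# First interval after the trough that exceeds the earlier peak.
recovery_years = next(
    (years for years, count in hits[trough_index + 1:] if count > peak_hits),
    None,
)

print(f"Trough: {trough_years} ({trough_hits:,} hits)")
print(f"Earlier peak: {peak_years} ({peak_hits:,} hits)")
print(f"Peak-to-trough ratio: {peak_hits / trough_hits:.2f}")
print(f"First interval above the earlier peak: {recovery_years}")
```

On these numbers the low point is 1835-1839, and hit counts do not climb back above the 1815-1819 level until 1885-1889.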

History and Historians, Power At Play, The Past is a Foreign...Something

Triumphant Return! Et L’Affaire Cronon

 

It’s Been A While

As Spring threatens to return, I find my thoughts turning once more (as do those of so many rapidly middle-aging historians) to blogging. I know, gentle readers, that I’ve left you without terrible puns and alliterative link dumps for far too long; the Goose Commerce thread in your RSS reader is, no doubt, covered in dust, mites, and then more dust. And, may I say: that’s disgusting.

But awake! Or at least, don’t delete. I’m back! And plan to post at least weekly here until I lose interest again.1

So, to business…

As I’m sure you’re aware, the newest shiny debate in PastLand is L’affaire Cronon, aka the Wisconsin Republican party’s bizarre attack on one of my favorite authors, William Cronon (Mr. Nature’s Metropolis). The AHA has a full roundup of everything you need to catch yourself up.

There’s been a lot of commentary, obviously, but for my own purposes the most interesting include those smart things said about the wider legal context of this attack at the egregiously inappropriately-named AmericanScience blog: Part 1, Part 2.

As for what the heck Cronon himself is up to, the best read I’ve seen so far is what Ben Schmidt, professional history’s own Nate Silver, has said over at the wonderful and informative Sapping Attention. I agree with all that Schmidt says2: it seems like the deliberative democracy shoe, consciously consensual and wholly impractical, is what fits.

While I admire Cronon’s position (especially given that he is ascending to the highest honorific position within the guild, usually not a place that one achieves by making political waves), I can’t say that I agree with his theory of politics. I side with Martin Van Buren: we need parties, and partisanship, to make the system go; playing the center (ideologically, philosophically) is a fool’s game. Conflict is a feature, not a bug: because people just disagree, that’s why.

Which still leaves us with the problem of establishing and policing standards of discourse: so maybe Prof. Cronon has the right idea after all.

In any case, I look forward to making more of these uninformed comments in the future! Now back to actual work for a change.


Image: law_keven, “Do you think he’s alive???…..” Flickr, CC License

1.) Hey, I’m nothing if not realistic.

2.) Save the bit about Changes in the Land being the better book: it’s good, but clearly, Nature’s Metropolis is in every way more interesting.

History and Historians

What If Historians Wrote the News?


Shit happened.1


Cf. Greg Marx, “Embrace the Wonk,” Columbia Journalism Review (May/June 2010); Chris Beam, “The Only Politics Article You’ll Ever Have to Read: What if political scientists covered the news?” Slate, 4 June 2010; Conor Friedersdorf, “It Depends Who Writes the News,” True/Slant, 7 June 2010; Jonathan Chait, “A Sociologist Covers the News,” The New Republic, 7 June 2010; and Georg Wilhelm Friedrich Hegel, trans. John Sibree, Philosophy of History (New York: American Home Library Company, 1902), §4:66.

Image cite: Frank Wuestefeld, “Shit Happens,” Flickr, CC License

History and Historians

Unduly Interested in Note-Taking

Or, Back of the Hand History



Worth reading the whole, delightfully meandering piece:

Historians are like reliable local guides. Ideally, they will know the terrain like the backs of their hands. They recognise all the inhabitants and have a sharp eye for strangers and impostors. They may not have much sense of world geography and probably can’t even draw a map. But if you want to know how to get somewhere, they are the ones to take you.
~Keith Thomas, “Diary,” London Review of Books 32, No. 11 (10 June 2010)


Balthus Van Tassel, “Sacromonte – Travel guide,” Flickr, CC License

History and Historians, Ivory Towers

The Golden Ratio

Or, Φ Upon Lesser Calculations!


If you have any interest in the whole Library of Congress / Twitter development, you should go read Dan Cohen’s smart post on the topic.

Cohen’s post is largely about how to apply the key insight from William Press’s work on the efficacy of “strong” profiling to archival practice like the LoC’s acquisition of the Twitchive (Twarchive?). He comes up with what he terms a “calculus of importance” — but what I’m going to call Press-Cohen’s law, cause that’s more internet sciency — for best allotting collection and curation resources:

In other words, if you believe that the notebooks of a known writer are likely to be 100 times more important to future historians and researchers than the blog of a nobody, you should spend 10, not 100, times the resources in preserving those notebooks over the blog. It’s still a considerable gap, but much less than the traditional (authoritarian) model would suggest. The calculus of importance thus implies that libraries and archives should consciously pursue contents such as those in the Cambridge University Library tower, even if they feel it runs counter to common sense.

A perspicacious friend and colleague of mine wondered if a corollary to Press-Cohen’s law would make sense for research, as well as archive compilation. That is, “should a historian spend only 10 times as much effort pursuing the obvious characters and institutions (or historiographies), instead of 100?”

PF&C suggested that “standard disciplinary practice already says yes” — and I would agree, and even go further and say that it is probably worthwhile to make the use of such a ratio explicit (hence the post!).

Going to known wells and looking from a new perspective needs to be part of our practice, but within limits (you should dig in lots of new places, too). The 100/10 ratio seems a pretty reasonable rule of thumb, in a world of limited time and resources for research.
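For the quasi-social scientists in the audience, here is a minimal sketch of the 100/10 rule, assuming (my reading, not a quotation) that it is the square-root relation from Press’s profiling result: effort should scale with the square root of expected importance. The function names and the sample weights are hypothetical.

```python
import math


def relative_effort(importance_ratio: float) -> float:
    """Effort ratio under the assumed square-root rule."""
    if importance_ratio <= 0:
        raise ValueError("importance_ratio must be positive")
    return math.sqrt(importance_ratio)


def allocate(importance: dict[str, float]) -> dict[str, float]:
    """Split one unit of effort in proportion to sqrt(importance)."""
    roots = {name: math.sqrt(weight) for name, weight in importance.items()}
    total = sum(roots.values())
    return {name: root / total for name, root in roots.items()}


# HYPOTHETICAL weights: the known writer's notebooks are judged
# 100 times more important than the blog of a nobody.
shares = allocate({"notebooks": 100.0, "blog": 1.0})

print(relative_effort(100))  # 10.0
for name, share in shares.items():
    print(f"{name}: {share:.1%} of effort")
```

Under that assumption the notebooks get ten times the blog’s share of effort rather than a hundred, which is the same gap the quoted passage describes.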

What think you, yea historians of teh internets?

PS: Isn’t quasi-social science fun?


Image cite: fd, “Golden Spirals,” Flickr, CC License