Or, the world in a grain of ads
You’ll recall that in my last post I wondered “What am I getting wrong?” That is a big question, for sure, with many and varied answers, as friends, acquaintances and passers-by would be happy to tell you. But in this case I was specifically concerned with what I was misunderstanding about the search results I was receiving from a Readex database, America’s Historical Newspapers.
Well, you’ll be pleased to know that Readex, in the person of their marketing director, David Loiterstein, was kind enough to get in touch by e-mail and tell me exactly that. And the answer? Granularity.
Basically, the AHN database does not consistently break down advertising sections at the same level of granularity; it has changed over time. As David explained:
Initially, particularly for the 18th century in which the first series of newspapers was so heavily concentrated, we identified individually every advertisement on every page; however, in later series multiple contiguous advertisements were identified in groups.
So: sometimes individual ads count as individual “articles,” sometimes a multiple-ad block counts as one, and sometimes entire columns of ads count as one unit; and the granularity of the ads gets coarser, generally speaking, over time. Which means that my results, which covered all article types, ads included, were skewed by the ways ads are counted.
David provided a graph of his own, illustrating this effect, and suggesting a way to get clear of it (reproduced here with permission):
I’ll let him explain:
This approach seen above—in which advertisements are isolated and an aggregate number of the other article types is counted separately—provides a more representative measure of available “texts.” While the data does in fact indicate fewer “articles” available between 1820 and 1850 in what is otherwise a steady increase in articles available between 1690 and 1819 and between 1850 and 1922. The declining number of ads as a percentage of “articles” or “text” is a result not of fewer ads but the changing approach by which we identify them.
Thus, practically speaking, if you want to get some kind of a baseline for how representative a given search’s results are, you’re going to have to sacrifice including ads in those search results. Not ideal, of course, but much better than not knowing what your results mean. In addition to responding directly to this specific question, David also mentioned that Readex was working to update the Readex Help section, and fix the discrepancy between the two portals I had noticed.
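For the computationally inclined, here is a minimal sketch of what that baseline might look like in practice. Everything in it is an assumption of mine: the article-type labels, the function names, and above all the numbers, which are invented for illustration and are not Readex figures. The idea is simply to divide a search's non-ad hits by the count of non-ad "articles" for the same period, so that the changing way ads are grouped drops out of the denominator.

```python
from typing import Dict

# Hypothetical label for the ad article type; the real label may differ.
AD_LABEL = "advertisement"


def non_ad_total(counts_by_type: Dict[str, int]) -> int:
    """Sum article counts across every type except advertisements."""
    return sum(n for kind, n in counts_by_type.items() if kind != AD_LABEL)


def hits_per_thousand(hits: int, baseline: int) -> float:
    """Express search hits as a rate per 1,000 baseline articles."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1000 * hits / baseline


def normalized_rates(
    counts: Dict[int, Dict[str, int]], hits: Dict[int, int]
) -> Dict[int, float]:
    """Rate of hits per 1,000 non-ad articles, keyed by decade."""
    return {
        decade: hits_per_thousand(hits[decade], non_ad_total(by_type))
        for decade, by_type in counts.items()
    }


# Invented numbers, for illustration only (not Readex data).
decade_counts = {
    1790: {"advertisement": 60000, "news": 30000, "letter": 10000},
    1830: {"advertisement": 5000, "news": 20000, "letter": 5000},
}
# Hits for some hypothetical query, with ads excluded from the search.
search_hits = {1790: 400, 1830: 150}

rates = normalized_rates(decade_counts, search_hits)
```

With these made-up figures, dividing by all "articles" (ads included) would suggest the query becomes slightly more common from the 1790s to the 1830s, while dividing by non-ad articles suggests the opposite. That reversal is the kind of skew the ad-counting change can introduce.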
So where does this leave us?
Well, with a much better understanding of how one of the most important databases in Early American historical research functions. I am grateful to David and his colleagues for their quick response and kind explanation.
I would note, though, that even using the new numbers, the curve still shows an unexpected dip in the 1820s and 1830s, the heart of the Jacksonian era, when most historians would tell you that print, and especially newspapers, exploded. As I said before, I do not think this is unique to Readex; rather, it is likely an artifact of the way many digitization projects have done triage (or, alternately, it might be proof that print output indeed declined, in which case steam-powered presses were not actually all that important in the development of American democracy! But let’s hope not, as then we’d have to revise a lot of historiography…).
In any case, all good factors to keep in mind when trying to use large collections to buttress claims about relative representativeness, ubiquity, or uniqueness. And now on to new and exciting problems…