Over the last two decades, various efforts have been made to digitise texts, including books and newspapers, which are primary sources in most of our societies. The wealth of geographic information in such digital archives has been little used, although it is very valuable for the study of cities. However, because of the time and labour needed for data collection, these studies were limited to a very small number of cities or to short periods of time. Afterwards, we present the tests we ran to measure the accuracy of our method. By selecting newspapers according to this time dimension, we ensure that they had sufficient circulation to survive for at least two consecutive decades. After that, type describes whether the city is mentioned in an article, an advertisement, a family announcement, or the caption of an illustration. An attempt is made to identify the dynamics of urban systems during the historical process of their evolution. An early paper by Zipf (1946) used local newspapers to study the interactions between distant 'communities' and used these data in a gravity model. It produces the open access academic publishing portal … the place where the news item is published), as well as the importance of the possible places. Named entities can be locations, persons, organisations, dates, measures (money, weight, distance, percent…), etc. We then counted the number of true positives, true negatives, false positives and false negatives to derive precision and recall indices for our three time periods.

Michel J.-B., Shen Y. K., Aiden A. P., Veres A., Gray M. K., Team T. G. B., et al., 2011, "Quantitative Analysis of Culture Using Millions of Digitized Books", Science, Vol.331, No.6014, 176-182.

Our objective is to look for macroscopic spatial trends in the way information is diffused and in how this changes over time.

Figure 6: Information field extracted from 15 local newspapers.
It makes it possible to gain knowledge of the spatial organisation of territories through time. This is illustrated with the case of European cities between 1200 and 1990, using harmonised historical databases. The column ppn corresponds to a unique identifier given to each newspaper title. Table 2 shows that the vast majority of city names are not ambiguous (86.4%) and do not require the use of NLP techniques. More generally, the methodology proposed in this data paper is of interest to people working on extracting geographic information from unstructured text data. Table 4 shows the results of these two calculations for the three different periods. For cities with multiple names, we ran multiple string queries via the SRU protocol. In this project, we geocoded place names contained in a selection of 102 million news items to build origin-destination matrices with places mentioned in the news items (o) and places where the newspapers were issued (d) for 125 years (t). This operation could be done in a reasonable amount of time. The following period is one of development of the press, ending in a peak during the Second World War, a period when many anti- and pro-German newspapers were created, most of the anti-German ones being underground. A seminal study by Michel et al. Other files are also included, such as freq_count_corps.csv, which contains the total number of items published each year by every newspaper and can be used, for example, to standardise the data. The application of these four criteria resulted in a sub-corpus of 81 newspapers that still covers a substantial part of the Delpher archive. In this example, we can see that, according to the first line of the table, Amsterdam was mentioned in 347 articles of De Maasbode, a Rotterdam newspaper, in 1871. Quantitative analyses do not replace in-depth readings, but they are a new way of looking at these sources and can reveal hidden patterns that appear only at the macroscopic scale.
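As a minimal sketch of how origin-destination counts of this kind could be aggregated and standardised, the Python snippet below works on a handful of illustrative records. Only the first record (Amsterdam, De Maasbode, Rotterdam, 1871, 347) comes from the example in the text; the other rows, the yearly totals, the newspaper called "Hypothetical Courant" and the function names are assumptions made for illustration and do not reproduce the actual layout of the dataset files.

```python
from collections import defaultdict

# (mentioned city o, newspaper, publication place d, year t, number of items)
# Only the first row comes from the text; the other rows are made up.
RECORDS = [
    ("Amsterdam", "De Maasbode", "Rotterdam", 1871, 347),
    ("Utrecht", "De Maasbode", "Rotterdam", 1871, 120),
    ("Amsterdam", "Hypothetical Courant", "Groningen", 1871, 80),
]

# Hypothetical totals of items published per (newspaper, year), standing in
# for the kind of information described for freq_count_corps.csv.
TOTALS = {
    ("De Maasbode", 1871): 5000,
    ("Hypothetical Courant", 1871): 2000,
}


def build_od(records):
    """Sum mention counts into a mapping year -> (o, d) -> count."""
    od = defaultdict(lambda: defaultdict(int))
    for o, _newspaper, d, t, n in records:
        od[t][(o, d)] += n
    return od


def relative_frequencies(records, totals):
    """Divide each count by the total items of that newspaper and year."""
    return {
        (o, newspaper, t): n / totals[(newspaper, t)]
        for o, newspaper, _d, t, n in records
    }


od = build_od(RECORDS)
rel = relative_frequencies(RECORDS, TOTALS)
print(od[1871][("Amsterdam", "Rotterdam")])
print(round(rel[("Amsterdam", "De Maasbode", 1871)], 4))
```

Keeping the raw counts and the totals separate means the same records can be standardised in different ways (per newspaper, per place of publication, per year) without rebuilding the matrices.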
First, simple maps show a general expansion in the number and size of cities over time, … As we are interested in the amount of non-local information received by urban dwellers, we took this time mark as our starting point because, from this period on, newspapers became the backbone of information diffusion in the Netherlands. Researchers have shown that these massive digital archives can be used to identify macroscopic trends related to historical and cultural changes.

", Journal of Informetrics, Vol.10, No.4, 1025-1036.

However, we did not apply any disambiguation algorithm, as the 15 cities from the list have homonyms of much smaller size (Figure 4). For example, in the case of the third row of Table 3, the string 'Goes' has been identified in the text of the news item, but the multiNER did not classify it as a place name, so it does not appear in the 'NER result' column. We kept only the woonplaatsen with more than 10,000 inhabitants. Homonymy: several places can share the same name. More recently, a study on British newspapers has used more refined techniques such as named-entity recognition to study the content of a massive corpus of historical newspapers (Lansdall-Welfare et al., 2017). The very short lifespan of most titles is consistent with the findings of Van Kranenburg et al. The file for ambiguous place names is structured in almost the same way. Studies applied to historical newspapers have shown that the level of performance of these algorithms can differ significantly (Ehrmann, Colavizza, Rochat, Kaplan, 2016; Mosallam, Abi-Haidar, Ganascia, 2014). We decided to look at the terms people use to say where they live because place names have greater inertia than the boundaries of local governments.
Cities can be defined according to many criteria: they can be continuous built-up areas or functional entities, or be designated by a certain level of urban functions or by administrative status. Such organisations would have been very difficult to identify considering the cross-temporal dimension. International, national and institutional contexts have led to the redefinition of a project that began in 2003 and that has already fulfilled its original … Cybergeo, the electronic European Journal of Geography, is intended to promote faster communication of research and greater direct contact between authors and readers. Created with the aim of encouraging the exchange of ideas, methods and results, it publishes in any European language. Over the past two decades, major efforts have been undertaken to digitise historical texts, notably books and newspapers, which are very rich sources on the societies that produced them. Information circulation has been identified as a key factor in urban dynamics. We designed DIGGER to study the evolution of the Dutch urban system by investigating information flows extracted from historical newspapers going back to 1869. For a maximum level of precision, it would have been necessary to develop a specific disambiguation algorithm that uses the sentence around the named entity, the metadata of the newspaper (i.e. Allan Pred (1977) also used local newspapers from different American cities to measure the time it took for information to travel from one place to another. We opted for a mixed technique to retrieve the data on cities in a reasonable amount of time. A more extensive study on the diffusion of information between the Dutch cities and its evolution over time can be found in Peris et al. (submitted).
We applied four criteria in the selection of newspapers: the newspaper had to be issued after 1869; its publication place had to be in the Netherlands; the newspaper had to exist during at least two consecutive decades; and we dropped the many small newspapers that were published only during the Second World War. In this article, we present DIGGER, a database recently built from Delpher, the digital archive of historical newspapers of the National Library of the Netherlands.

Schwartz T., 2011, "Culturomics: Periodicals Gauge Culture's Pulse", Science, Vol.332, No.6025, 35-36.

Data covering long time periods are relatively rare in the study of cities, yet they are essential for understanding their long-term dynamics. The most important sources of errors leading to false positives are listed below. However, many efforts are being made to constantly improve OCR quality. This database can be used to analyse more than a century of development of the Dutch urban system, as well as to study the diffusion of information or spatial biases in media coverage. The first important step in any quantitative study using a text archive is to select a relevant corpus. This threshold is often used by statistical agencies and scholars as the lower limit to define urban centres, and significantly reduces the number of places to query for. For these 274 cities, we performed SRU queries using city names as simple search terms to retrieve the relevant articles from the corpus.

Lansdall-Welfare T., Sudhahar S., Thompson J., Lewis J., Team F. N., Cristianini N., 2017, "Content analysis of 150 years of British periodicals", Proceedings of the National Academy of Sciences, Vol.114, No.4, E457-E465.
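The simple string queries described above can be sketched as follows. This is only an illustration of how SRU searchRetrieve requests with a city name as search term might be assembled: the endpoint URL, the SRU version, the maximumRecords value, the function name and the pair of name variants are assumptions, and the actual Delpher endpoint, indexes and date filters used in the project are not reproduced here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only so that the example is self-contained.
BASE_URL = "https://sru.example.org/sru"


def build_sru_urls(names, base_url=BASE_URL, max_records=1):
    """Return one searchRetrieve URL per name variant of a city."""
    urls = []
    for name in names:
        params = {
            "operation": "searchRetrieve",  # standard SRU operation name
            "version": "1.2",               # assumed protocol version
            # CQL terms that contain spaces must be quoted
            "query": f'"{name}"',
            "maximumRecords": max_records,
        }
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


# Illustrative city with two name variants, queried separately.
urls = build_sru_urls(["Den Haag", "'s-Gravenhage"])
for url in urls:
    print(url)
```

Issuing one request per name variant mirrors the approach of running multiple string queries for cities with several names; the hit counts of the variants would then have to be combined with care, since one article can mention both.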
This dataset can be used to study the evolution of the Dutch urban system as well as aspects related to the spatial diffusion of information and geographical bias in media coverage. The different steps of the data collection are summarized in Figure 5. In our case, defining our primary units of analysis is made difficult by the fact that the data collection is meant for a corpus that covers more than one century. The precision P corresponds to the share of relevant instances among the retrieved instances, and can be defined as:

$$P = \frac{tp}{tp + fp}$$

where tp corresponds to the true positives and fp to the false positives. Given the close connection of most of these organisations with their place, one could argue that this is less of a problem, as news items using these organisation names will often refer to something related to or happening in that place. This problem is particularly acute for data on intercity relations, at the scale of systems of cities.

Publisher: UMR 8504 Géographie-cités

Because of the considerable variability in the number of news items published in each newspaper, we decided to plot the frequency of place-name mentions relative to the total number of news items published.

Table 2: Issues in city name recognition and their solution.

Figure 2: Location of the 317 cities for which data is collected.

In the case of organisations, we could not apply NER because we had insufficient knowledge of the organisations using the city names from the list. NER was used only for ambiguous cases. Then, it presents issues in place-name recognition and the choices made to deal with them.

Neudecker C., Wilms L., Faber W. J., van Veen T., 2014, "Large-scale refinement of digital historic newspapers with named entity recognition", 16.

Meijers E., Peris A., 2018, "Using toponym co-occurrences to measure relationships between places: review, application and evaluation", International Journal of Urban Sciences, 1-23.
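The precision measure defined above, together with the standard definition of recall (the share of relevant instances that were actually retrieved), can be computed as in the short Python sketch below. The counts are made up for illustration and are not results from the paper; the function names are likewise only illustrative.

```python
def precision(tp, fp):
    """Share of retrieved items that are relevant: tp / (tp + fp)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp, fn):
    """Share of relevant items that were retrieved: tp / (tp + fn)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


# Made-up counts for one period, for illustration only.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp), recall(tp, fn))
```

Returning 0.0 when nothing was retrieved (or nothing was relevant) is a convention chosen here to avoid a division by zero; other conventions are possible.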
Researchers have shown that these massive digital archives can be used to identify macroscopic trends related to historical and cultural changes. The two cities that were selected are Best, a small town close to Eindhoven whose name is a very common word in Dutch (the superlative of "good", as in English), and Dordrecht, a bigger city in South Holland whose name has a very low chance of producing false positives. Various studies have highlighted the importance of long-term data for the study of cities; such sources, however, are relatively scarce. This resulted in the presence of many short-lived newspapers published only during the Second World War (n=2139), which can be very interesting for historians of the war but are less relevant for long-term studies. NER is a subtask of Natural Language Processing (NLP) that aims to locate and classify entities from a given text into pre-defined categories. Pred (1971) defines information fields as the total array of non-local contacts of individual places. The only difference is that, in addition to the frequency returned by the simple string query, there is an extra column with the number of hits after performing NER on the individual articles returned after the first query:

Table 4: Structure of the freq_count_ner.csv file.

Figure 1: News items per year in Delpher and in the sub-corpus.

This work was funded through a VIDI grant (452-14-004) provided by the Netherlands Organisation for Scientific Research (NWO), and through the researcher-in-residence program of the Koninklijke Bibliotheek, the national library of the Netherlands.
However, problems related to extracting spatial information from text were not addressed, including the variety of scales (an article can mention a street, a city, a country, etc.). The content of a digital archive might be influenced by many factors such as digitisation policies, projects targeting a specific part of the media landscape (a newspaper, a region or a time period) or copyright issues.

