Main database tables and column names
These are the key tables found in a castarter database:
- index_id - a table with three columns:
  - id: a unique integer identifier corresponding to a unique url
  - url: a url
  - index_group: a textual string, by default index. It is not infrequent to have separate index pages for different sections of a website (e.g. “news”, “events”, “statements”, etc.), different tags, or different levels of the indexing process (they can, for example, be called step_01, step_02). In such cases, it is useful to separate these different types of sources in case of updates: one would be interested in downloading the latest example.com/news/page/1 and the latest example.com/statements/page/1, and following, but not necessarily all index pages.
- index_download - a table with five columns. New rows appear here only when a download has been attempted.
  - id: an integer, matching the identifier defined in the index_id table
  - batch: an integer, starting from 1 and increasing. It identifies the download batch and allows for matching data with a specific download instance.
  - datetime: timestamp of when the download was attempted
  - status: http response status code, such as 200 for successful, 404 for not found, etc.
  - size: size of the downloaded file
- contents_id - a table with five columns, similar to the one outlined above:
  - id: a unique integer identifier corresponding to a unique url
  - url: a url
  - link_text: text used for the link
  - source_index_id: the identifier of the url from where the link was extracted
  - source_index_batch: the identifier of the download batch from where the link was obtained
- contents_download - a table with five columns, similar to the one outlined above. New rows appear here only when a download has been attempted.
  - id: an integer, matching the identifier defined in the contents_id table
  - batch: an integer, starting from 1 and increasing. It identifies the download batch and allows for matching data with a specific download instance.
  - datetime: timestamp of when download was attempted
  - status: http response status code, such as 200 for successful, 404 for not found, etc.
  - size: size of the downloaded file
- contents_data - a table with an unspecified number of columns. They must include:
  - id - an integer, matching the identifier defined in the contents_id table
  - url - url from which the contents have been extracted. In principle, this is redundant as it can be derived from the contents_id table. However, given the importance of ensuring full consistency between data and their source, some redundancy may be warranted.
  - … - value columns with the actual contents for the field.
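
To make the relationships between these tables more concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not how castarter itself creates or queries its database. The column types, the sample rows and timestamps, the title and text value columns in contents_data, and the helper names build_example_db and latest_download are all assumptions made for illustration. Only the table and column names come from the description above. The query shows how the batch column can be used to find the most recent download attempt for each index page in a given index_group.

```python
import sqlite3

# Column names follow the description above; the column types are assumed.
SCHEMA = """
CREATE TABLE index_id (
    id INTEGER PRIMARY KEY, url TEXT UNIQUE, index_group TEXT DEFAULT 'index');
CREATE TABLE index_download (
    id INTEGER, batch INTEGER, datetime TEXT, status INTEGER, size INTEGER);
CREATE TABLE contents_id (
    id INTEGER PRIMARY KEY, url TEXT UNIQUE, link_text TEXT,
    source_index_id INTEGER, source_index_batch INTEGER);
CREATE TABLE contents_download (
    id INTEGER, batch INTEGER, datetime TEXT, status INTEGER, size INTEGER);
-- title and text are hypothetical value columns
CREATE TABLE contents_data (id INTEGER, url TEXT, title TEXT, text TEXT);
"""


def build_example_db():
    """Create an in-memory database with a few made-up rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany(
        "INSERT INTO index_id (id, url, index_group) VALUES (?, ?, ?)",
        [
            (1, "https://example.com/news/page/1", "news"),
            (2, "https://example.com/statements/page/1", "statements"),
            (3, "https://example.com/news/page/2", "news"),
        ],
    )
    # Rows exist only for attempted downloads; id 3 was never attempted.
    con.executemany(
        "INSERT INTO index_download VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "2024-01-01T10:00:00", 200, 51200),
            (2, 1, "2024-01-01T10:00:05", 404, 0),
            (1, 2, "2024-01-02T10:00:00", 200, 51890),
        ],
    )
    con.commit()
    return con


def latest_download(con, index_group):
    """Return (id, url, batch, status) of the latest attempt per index url.

    batch and status are None for urls that were never attempted.
    """
    query = """
        SELECT i.id, i.url, d.batch, d.status
        FROM index_id AS i
        LEFT JOIN index_download AS d
          ON d.id = i.id
         AND d.batch = (SELECT MAX(batch) FROM index_download WHERE id = i.id)
        WHERE i.index_group = ?
        ORDER BY i.id
    """
    return con.execute(query, (index_group,)).fetchall()


con = build_example_db()
for row in latest_download(con, "news"):
    print(row)
```

The same pattern would apply to contents_id and contents_download, since they share the id and batch columns. The LEFT JOIN keeps urls with no matching row in the download table, which is one way to spot pages that have been indexed but not yet attempted.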