Main database tables and column names
These are the key tables found in a castarter database:
index_id - a table with three columns:
- id: a unique integer identifier corresponding to a unique url
- url: a url
- index_group: a textual string, by default “index”. It is not infrequent to have separate index pages for different sections of a website (e.g. “news”, “events”, “statements”, etc.), different tags, or different levels of the indexing process (they can, for example, be called “step_01”, “step_02”). In such cases, it is useful to keep these different types of sources separate in case of updates: one would be interested in downloading the latest example.com/news/page/1 and the latest example.com/statements/page/1, and the pages that follow them, but not necessarily all index pages.
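As an illustration of why index_group is useful, here is a minimal sketch in Python. It assumes the database can be represented as SQLite; the column types, the sample urls and the helper urls_for_group are hypothetical and only serve to show how one group of index pages can be selected for an update.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a castarter database.
conn = sqlite3.connect(":memory:")

# Column names follow the description above; the column types are assumed.
conn.execute(
    """
    CREATE TABLE index_id (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE,
        index_group TEXT NOT NULL DEFAULT 'index'
    )
    """
)

# Hypothetical index pages, grouped by website section.
rows = [
    (1, "https://example.com/news/page/1", "news"),
    (2, "https://example.com/news/page/2", "news"),
    (3, "https://example.com/statements/page/1", "statements"),
]
conn.executemany(
    "INSERT INTO index_id (id, url, index_group) VALUES (?, ?, ?)", rows
)


def urls_for_group(connection, index_group):
    """Return the index urls that belong to one index_group, ordered by id."""
    cursor = connection.execute(
        "SELECT url FROM index_id WHERE index_group = ? ORDER BY id",
        (index_group,),
    )
    return [row[0] for row in cursor]


news_urls = urls_for_group(conn, "news")
print(news_urls)
```

Selecting by group in this way makes it possible to refresh only the “news” index pages without touching the others.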
index_download - a table with five columns. New rows appear here only when a download has been attempted.
- id: an integer, matching the identifier defined in the index_id table
- batch: an integer, starting from 1 and increasing. It identifies the download batch and allows for matching data with a specific download instance.
- datetime: timestamp of when the download was attempted
- status: HTTP response status code, such as 200 for successful, 404 for not found, etc.
- size: size of the downloaded file
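Since rows are added to index_download only when a download is attempted, the table can be used to work out what still needs downloading. The sketch below assumes a SQLite representation with assumed column types and invented sample rows; the helpers next_batch and ids_without_success are hypothetical, and computing the next batch as the current maximum plus one is an assumption about how batches are numbered.

```python
import sqlite3

# Assumption: an in-memory SQLite database with assumed column types.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE index_id (id INTEGER PRIMARY KEY, url TEXT, index_group TEXT);
    CREATE TABLE index_download (
        id INTEGER, batch INTEGER, datetime TEXT, status INTEGER, size INTEGER
    );
    """
)

# Hypothetical index pages.
conn.executemany(
    "INSERT INTO index_id VALUES (?, ?, ?)",
    [
        (1, "https://example.com/news/page/1", "news"),
        (2, "https://example.com/news/page/2", "news"),
        (3, "https://example.com/news/page/3", "news"),
    ],
)

# Hypothetical download attempts: id 1 succeeded twice, id 2 failed once,
# id 3 was never attempted and therefore has no row at all.
conn.executemany(
    "INSERT INTO index_download VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01T10:00:00", 200, 15320),
        (2, 1, "2024-01-01T10:00:05", 404, 0),
        (1, 2, "2024-01-08T10:00:00", 200, 15811),
    ],
)


def next_batch(connection):
    """Return the batch number a new download run would use (assumed: max + 1)."""
    cursor = connection.execute(
        "SELECT COALESCE(MAX(batch), 0) + 1 FROM index_download"
    )
    return cursor.fetchone()[0]


def ids_without_success(connection):
    """Return ids that have no attempt with status 200, including unattempted ones."""
    cursor = connection.execute(
        """
        SELECT i.id
        FROM index_id AS i
        WHERE NOT EXISTS (
            SELECT 1 FROM index_download AS d
            WHERE d.id = i.id AND d.status = 200
        )
        ORDER BY i.id
        """
    )
    return [row[0] for row in cursor]


print(next_batch(conn))
print(ids_without_success(conn))
```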
contents_id - a table with five columns, similar to the one outlined above:
- id: a unique integer identifier corresponding to a unique url
- url: a url
- link_text: text used for the link
- source_index_id: the identifier of the url from where the link was extracted
- source_index_batch: the identifier of the download batch from where the link was obtained
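The source_index_id column makes it possible to trace each content url back to the index page it was extracted from. A minimal sketch of that lookup follows, assuming a SQLite representation; the column types and sample rows are invented for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database with assumed column types.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE index_id (id INTEGER PRIMARY KEY, url TEXT, index_group TEXT);
    CREATE TABLE contents_id (
        id INTEGER PRIMARY KEY,
        url TEXT,
        link_text TEXT,
        source_index_id INTEGER,
        source_index_batch INTEGER
    );
    """
)

# Hypothetical index page and two links extracted from it in different batches.
conn.execute(
    "INSERT INTO index_id VALUES (1, 'https://example.com/news/page/1', 'news')"
)
conn.executemany(
    "INSERT INTO contents_id VALUES (?, ?, ?, ?, ?)",
    [
        (1, "https://example.com/news/article-a", "Article A", 1, 1),
        (2, "https://example.com/news/article-b", "Article B", 1, 2),
    ],
)

# Join each content url to the index page it came from.
provenance = conn.execute(
    """
    SELECT c.id, c.url, c.link_text, i.url AS source_url, c.source_index_batch
    FROM contents_id AS c
    JOIN index_id AS i ON i.id = c.source_index_id
    ORDER BY c.id
    """
).fetchall()

for row in provenance:
    print(row)
```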
contents_download - a table with five columns, similar to the one outlined above. New rows appear here only when a download has been attempted.
- id: an integer, matching the identifier defined in the contents_id table
- batch: an integer, starting from 1 and increasing. It identifies the download batch and allows for matching data with a specific download instance.
- datetime: timestamp of when the download was attempted
- status: HTTP response status code, such as 200 for successful, 404 for not found, etc.
- size: size of the downloaded file
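Because every attempt is recorded with its batch and status, a quick summary per batch can show how a download run went. The following sketch assumes a SQLite representation with assumed column types and invented sample rows.

```python
import sqlite3

# Assumption: an in-memory SQLite database with assumed column types.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE contents_download (
        id INTEGER, batch INTEGER, datetime TEXT, status INTEGER, size INTEGER
    )
    """
)

# Hypothetical attempts: id 3 fails in batch 1 and succeeds in batch 2.
conn.executemany(
    "INSERT INTO contents_download VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01T11:00:00", 200, 48210),
        (2, 1, "2024-01-01T11:00:03", 200, 51877),
        (3, 1, "2024-01-01T11:00:06", 404, 0),
        (3, 2, "2024-01-08T11:00:00", 200, 47002),
    ],
)

# Count attempts for each combination of batch and HTTP status code.
status_counts = conn.execute(
    """
    SELECT batch, status, COUNT(*) AS attempts
    FROM contents_download
    GROUP BY batch, status
    ORDER BY batch, status
    """
).fetchall()

for batch, status, attempts in status_counts:
    print(batch, status, attempts)
```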
contents_data - a table with an unspecified number of columns. They must include:
- id: an integer, matching the identifier defined in the contents_id table
- url: url from which the contents have been extracted. In principle, this is redundant as it can be derived from the contents_id table. However, given the importance of ensuring full consistency between data and their source, some redundancy may be warranted.
- …: value columns with the actual contents for the field.
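The redundant url column can be used to check that extracted data still match their source. The sketch below assumes a SQLite representation; the column types, the title value column, the sample rows and the helper mismatched_ids are all hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database with assumed column types.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contents_id (
        id INTEGER PRIMARY KEY,
        url TEXT,
        link_text TEXT,
        source_index_id INTEGER,
        source_index_batch INTEGER
    );
    -- "title" is a hypothetical value column; real value columns will differ.
    CREATE TABLE contents_data (id INTEGER, url TEXT, title TEXT);
    """
)

conn.executemany(
    "INSERT INTO contents_id VALUES (?, ?, ?, ?, ?)",
    [
        (1, "https://example.com/news/article-a", "Article A", 1, 1),
        (2, "https://example.com/news/article-b", "Article B", 1, 1),
    ],
)

# The second row deliberately carries a url that disagrees with contents_id.
conn.executemany(
    "INSERT INTO contents_data VALUES (?, ?, ?)",
    [
        (1, "https://example.com/news/article-a", "Title A"),
        (2, "https://example.com/news/article-b-moved", "Title B"),
    ],
)


def mismatched_ids(connection):
    """Return ids whose url in contents_data is missing from or differs in contents_id."""
    cursor = connection.execute(
        """
        SELECT d.id
        FROM contents_data AS d
        LEFT JOIN contents_id AS c ON c.id = d.id
        WHERE c.url IS NULL OR c.url <> d.url
        ORDER BY d.id
        """
    )
    return [row[0] for row in cursor]


print(mismatched_ids(conn))
```

An empty result would indicate that every row in contents_data agrees with its entry in contents_id.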