Database structure in castarter

To keep track of the urls being processed, castarter facilitates storing urls, along with some basic metadata about them, in an orderly fashion.

Database location and database file naming conventions

By default, databases are stored in the same folder as website data, e.g. under base_folder/project/website/.

The location of the database can be retrieved with:

cas_get_db_folder()
cas_get_db_file()
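
Once project, website, and base folder have been set and castarter is loaded, the resulting SQLite file can also be opened directly with standard database tools. The following sketch relies on the DBI and RSQLite packages rather than on castarter itself, and assumes that cas_get_db_file() returns the full path to the SQLite file; if it returns only the file name, combine it with the output of cas_get_db_folder().

library(DBI)
library(RSQLite)

# Connect to the SQLite database managed by castarter
db <- DBI::dbConnect(RSQLite::SQLite(), dbname = cas_get_db_file())

# List the tables described in the rest of this page
DBI::dbListTables(db)

DBI::dbDisconnect(db)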

The filename of the SQLite database includes a reference to both the project and the website name. This makes it possible to store the database files of different projects in a single folder, as the file naming convention should prevent overlaps.
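
Since all files follow this convention, the databases kept in a shared folder can be listed with base R. This is only a sketch: the exact file extension used on your system may differ.

# List database files in the castarter database folder
list.files(path = cas_get_db_folder(), pattern = "\\.sqlite$")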

What if details about multiple projects and websites are to be stored in a single database, e.g. because the workflow relies on a MySQL database hosted on a server rather than on local SQLite databases? In that case, the database will need an additional table with a list of projects and websites, each associated with a unique id. That id is then used in the table names of each project and website. This approach prevents potential issues with project or website names that include characters not appropriate for a database table name. (#TODO: not yet implemented)
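
Purely as an illustration of the approach sketched above (not an implemented feature), such a lookup table could be as simple as a data frame mapping each project/website pair to an integer id; all names below are hypothetical.

# Hypothetical lookup table for a shared (e.g. MySQL) database
projects_websites <- data.frame(
  id = c(1L, 2L),
  project = c("example_project", "another_project"),
  website = c("example.com", "another-site.org")
)
# Table names would then be derived from the id, e.g. "index_id_1",
# "index_id_2", avoiding problematic characters in table names.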

Main database tables and column names

These are the key tables found in a castarter database (see the query sketch after this list for how they relate to each other):

  • index_id - a table with three columns:

    • id: a unique integer identifier corresponding to a unique url
    • url: a url
    • index_group: a textual string, by default index. It is not infrequent to have separate index pages for different sections of a website (e.g. “news”, “events”, “statements”, etc.), different tags, or different levels of the indexing process (these can, for example, be called step_01, step_02). In such cases, it is useful to keep these different types of sources separate when updating: one would typically want to download the latest example.com/news/page/1 and the latest example.com/statements/page/1, and the pages that follow them, but not necessarily all index pages.
  • index_download - a table with five columns. New rows appear here only when a download has been attempted.

    • id: an integer, matching the identifier defined in the previous table
    • batch: an integer, starting from 1 and increasing. It identifies the download batch and allows for matching data with a specific download instance.
    • datetime: timestamp of when download was attempted
    • status: http response status code, such as 200 for successful, 404 for not found, etc.
    • size: size of the downloaded file
  • contents_id - a table with five columns, similar to the one outlined above:

    • id: a unique integer identifier corresponding to a unique url
    • url: a url
    • link_text: text used for the link
    • source_index_id: the identifier of the url from where the link was extracted
    • source_index_batch: the identifier of the download batch from where the link was obtained
  • contents_download - a table with five columns, similar to the one outlined above. New rows appear here only when a download has been attempted.

    • id: an integer, matching the identifier defined in the contents_id table
    • batch: an integer, starting from 1 and increasing. It identifies the download batch and allows for matching data with a specific download instance.
    • datetime: timestamp of when download was attempted
    • status: http response status code, such as 200 for successful, 404 for not found, etc.
    • size: size of the downloaded file
  • contents_data - a table with an unspecified number of columns. They must include:

    • id - an integer, matching the identifier defined in the contents_id table
    • url - url from which the contents have been extracted. In principle, this is redundant as it can be derived from the contents_id table. However, given the importance of ensuring full consistency between data and their source, some redundancy may be warranted.
    • … - value columns with the actual contents for each extracted field.
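
As a concrete illustration of how these tables relate to each other, the sketch below reads them with DBI and dplyr and combines them by id. It uses the same direct-connection approach shown earlier and is not part of castarter's own interface; table and column names follow the description above.

library(DBI)
library(RSQLite)
library(dplyr)

db <- DBI::dbConnect(RSQLite::SQLite(), dbname = cas_get_db_file())

index_id <- DBI::dbReadTable(db, "index_id")
index_download <- DBI::dbReadTable(db, "index_download")
contents_download <- DBI::dbReadTable(db, "contents_download")

# Index urls for which no successful download (status 200) has been recorded
index_id |>
  dplyr::anti_join(
    index_download |> dplyr::filter(status == 200),
    by = "id"
  )

# Number of downloaded content pages by download batch and http status
contents_download |>
  dplyr::count(batch, status)

DBI::dbDisconnect(db)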

Additional tables

Additional tables can of course be kept in the same database to store additional data, metadata, or information about relevant contents.

For example, castarter facilitates checking and storing information about the availability of pages on the Internet Archive’s Wayback Machine, relying on its API through the cas_check_ia() function; cas_save_ia() makes it possible to ask the Wayback Machine to store a copy of a given page. The results of these checks are kept in the table described below (a query sketch follows the table description).

  • ia_check - a table with six columns:

    • url - the url checked for availability on the Wayback Machine.
    • status - as returned from the Internet Archive’s API. NA if not available.
    • available - logical, TRUE if an archived copy of the url is available on the Wayback Machine.
    • ia_url - latest available url on the Internet Archive. NA if not available.
    • timestamp - timestamp of the latest available page. NA if not available.
    • checked_at - timestamp, based on the local computer’s clock, recording when the given url was checked.
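
The same direct-query approach can be used with this table as well. For instance, the sketch below (again using DBI and dplyr rather than castarter functions) lists the checked urls for which no archived copy was found, relying on the ia_url column being NA in that case.

library(DBI)
library(RSQLite)
library(dplyr)

db <- DBI::dbConnect(RSQLite::SQLite(), dbname = cas_get_db_file())

# Checked urls without an archived copy on the Wayback Machine
DBI::dbReadTable(db, "ia_check") |>
  dplyr::filter(is.na(ia_url))

DBI::dbDisconnect(db)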