Title: | Content Analysis Starter Toolkit |
---|---|
Description: | Consistent approaches for basic web scraping, text mining and word frequency analysis of textual datasets |
Authors: | Giorgio Comai [aut, cre, cph] |
Maintainer: | Giorgio Comai <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0.9012 |
Built: | 2024-10-26 15:21:55 UTC |
Source: | https://github.com/giocomai/castarter |
Archive originals of downloaded files in compressed folders
cas_archive( path = NULL, file_format = "tar.gz", index = TRUE, contents = TRUE, remove_original = TRUE, db_connection = NULL, db_folder = NULL, ... )
path |
Path to archive directory, defaults to NULL. If NULL, path is set to the project/website/archive folder. |
file_format |
Defaults to "tar.gz", to ensure cross-platform compatibility. No other formats are supported at this stage. |
remove_original |
Defaults to TRUE. If TRUE, after local files have been confirmed to be stored in the relevant compressed file, they are removed from their original folders, and the empty folders deleted. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
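The archiving logic described above can be illustrated with base R alone. The sketch below is not castarter's implementation: `archive_folder()` is a hypothetical helper, and the file names are made up. It compresses the files of a folder into a tar.gz archive, checks that every file is listed in the archive, and only then removes the originals, mirroring the documented `remove_original` behaviour.

```r
# Hypothetical helper for illustration; not part of castarter.
archive_folder <- function(folder, tarfile, remove_original = TRUE) {
  # Work from inside the folder so that paths in the archive are relative
  old_wd <- setwd(folder)
  on.exit(setwd(old_wd), add = TRUE)

  files <- list.files(".")
  utils::tar(tarfile, files = files, compression = "gzip", tar = "internal")

  # Confirm that each file is stored in the archive before deleting anything
  archived <- utils::untar(tarfile, list = TRUE, tar = "internal")
  confirmed <- all(files %in% basename(archived))
  if (remove_original && confirmed) {
    file.remove(files)
  }
  invisible(confirmed)
}

# Usage with temporary files (tarfile must be an absolute path)
folder <- file.path(tempdir(), "cas_archive_example")
dir.create(folder, showWarnings = FALSE)
writeLines("<html>1</html>", file.path(folder, "1.html"))
writeLines("<html>2</html>", file.path(folder, "2.html"))
tarfile <- file.path(tempdir(), "cas_archive_example.tar.gz")

confirmed <- archive_folder(folder, tarfile)
remaining <- list.files(folder)
```

Checking the archive listing before calling `file.remove()` is the design point: originals are deleted only once their presence in the compressed file has been verified.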
Backup files to Google Drive
cas_backup_gd( glob = c("*.tar.gz", "*.sqlite"), email = gargle::gargle_oauth_email(), scopes = "https://www.googleapis.com/auth/drive.file", client = cas_google_client, ... )
glob |
A character vector with all glob selectors for the type of files
to be stored. Defaults to |
email |
If given, email of the Google account to use for storing files. |
scopes |
Defaults to |
client |
Google app client, defaults to |
... |
This function is typically used to check a web page when extracting links from index, or contents from contents pages.
cas_browse( index = FALSE, remote = TRUE, id = NULL, batch = NULL, index_group = NULL, file_format = "html", sample = 1, disconnect_db = TRUE, ... )
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
remote |
Defaults to TRUE. If TRUE, opens relevant url online. If FALSE, it opens the locally stored file. |
sample |
Defaults to 1. By default, it opens one random url. |
... |
Passed to |
Convenience function typically used to generate urls to index pages listing articles.
cas_build_urls( url, url_ending = "", glue = FALSE, start_page = NULL, end_page = NULL, increase_by = 1, date_format = "Ymd", start_date = NULL, end_date = Sys.Date() - 1, date_separator = NULL, increase_date_by = "day", reversed_order = FALSE, index_group = "index", index = TRUE, write_to_db = FALSE, ... )
url |
First part of index link that does not change in other index pages. |
url_ending |
Part of index link appended after the part of the link that varies. If not relevant, may be left empty. |
glue |
Logical, defaults to FALSE. If TRUE, the url is parsed with
|
start_page |
If the urls include a numerical component, define first number of the sequence. Defaults to NULL. If given, coerced to numeric, expected to be an integer. |
end_page |
If the urls include a numerical component, define last number of the sequence. Defaults to NULL. If given, coerced to numeric, expected to be an integer. |
increase_by |
Defines by how much the number in the link should be increased in the numerical sequence. Defaults to 1. |
date_format |
A character string, defaults to "Ymd". Check
|
start_date |
Defaults to NULL. If given, a date, or a character vector
of length one coercible to date with |
end_date |
Defaults to |
increase_date_by |
Defaults to "day". See |
reversed_order |
Logical, defaults to FALSE. If TRUE, the order of urls in the output is reversed. |
index_group |
A character vector, defaults to "index". Used for differentiating among different types of index or links in local databases. |
index |
Defaults to TRUE. Relevant only if |
write_to_db |
Defaults to FALSE. If set to TRUE, stores the newly created URLs to the local database. |
A data frame with three columns, id, url, and index_group. Typically, url corresponds to a vector of unique urls.
It is not uncommon, in particular for index pages, to include dates in the URL, along the lines of example.com/archive/2022-01-01, example.com/archive/2022-01-02, etc. To build such urls, cas_build_urls needs a start_date and an end_date.
The formatting of the date can be defined either by providing to the parameter date_format a string that strptime is able to interpret directly, or a simplified string (such as "Ymd", without the "%"), adding a date_separator such as "-" as needed.
cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)
cas_build_urls(
  url = "https://example.com/news/?skip=",
  start_page = 0,
  end_page = 100,
  increase_by = 10
)
cas_build_urls(
  url = "https://example.com/archive/",
  start_date = "2022-01-01",
  end_date = "2022-12-31",
  date_separator = "-"
) %>%
  head()
cas_build_urls(
  url = "https://example.com/archive/?from={here}&to={here}",
  glue = TRUE,
  start_date = "2011-01-01",
  end_date = "2022-12-31",
  date_separator = ".",
  date_format = "dmY",
  index_group = "news"
)
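To make the date handling more concrete, here is a minimal base R sketch of how date-based urls of this kind could be generated. It is not the code used by cas_build_urls: `build_date_urls()` is a hypothetical helper, and the way it expands a simplified format such as "Ymd" into a strptime-style string is an assumption made for illustration.

```r
# Hypothetical helper for illustration; not part of castarter.
build_date_urls <- function(url,
                            start_date,
                            end_date,
                            date_format = "Ymd",
                            date_separator = "-",
                            increase_date_by = "day") {
  # Expand a simplified format such as "Ymd" into "%Y-%m-%d" (assumed logic)
  if (!grepl("%", date_format, fixed = TRUE)) {
    parts <- strsplit(date_format, split = "", fixed = TRUE)[[1]]
    date_format <- paste0("%", parts, collapse = date_separator)
  }
  dates <- seq(
    from = as.Date(start_date),
    to = as.Date(end_date),
    by = increase_date_by
  )
  # Same three columns as described for the output: id, url, index_group
  data.frame(
    id = seq_along(dates),
    url = paste0(url, format(dates, date_format)),
    index_group = "index",
    stringsAsFactors = FALSE
  )
}

urls_df <- build_date_urls(
  url = "https://example.com/archive/",
  start_date = "2022-01-01",
  end_date = "2022-01-03"
)
urls_df$url
```

With these inputs the sketch yields one url per day, ending in 2022-01-01, 2022-01-02 and 2022-01-03.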
Checks if the given corpus exists and, optionally, updates it
cas_check_corpus( ..., update = FALSE, keep_only_latest = FALSE, path = NULL, file_format = "parquet", partition = NULL, token = "full_text", corpus_folder = "corpus" )
... |
Passed to |
update |
Logical, defaults to FALSE. If set to TRUE, it checks if the local database has contents with a higher content id than is currently available in previously exported corpus, if any. If so, it writes a new, updated corpus. |
keep_only_latest |
Logical, defaults to FALSE. If set to TRUE, it deletes previous, older, corpora of the same type. |
path |
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder. |
file_format |
Defaults to "parquet". Currently, other options are not implemented. |
partition |
Defaults to NULL. If NULL, the parquet file is not
partitioned. "year" is a common alternative: if set to "year", the parquet
file is partitioned by year. If a |
token |
Defaults to "full_text", which does not tokenise the text
column. If different from |
Path to corpus. NULL, if no corpus is found and update is set to FALSE.
Checks if database folder exists; if not, returns an informative message
cas_check_db_folder()
If the database folder exists, returns TRUE. Otherwise throws an error.
Other database functions: cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
# If database folder does not exist, it throws an error
tryCatch(cas_check_db_folder(),
  error = function(e) {
    return(e)
  }
)
# Create database folder
cas_set_db_folder(path = fs::path(
  tempdir(),
  "cas_db_folder"
))
cas_create_db_folder(ask = FALSE)
cas_check_db_folder()
contents_data table in the database; if corpus is given, it just returns that instead. Mostly used internally.
cas_check_read_db_contents_data( corpus = NULL, collect = FALSE, db_connection = NULL, db_folder = NULL, ... )
collect |
Logical, defaults to FALSE. If TRUE, it always returns a data frame and not a database connection, no matter the input. |
... |
Passed to |
Mostly used internally in functions, exported for reference.
cas_check_use_db(use_db = NULL, ...)
use_db |
Defaults to NULL. If NULL, checks current use_db settings. If given, returns given value, ignoring use_db. |
Either TRUE or FALSE, depending on current use_db settings.
Other database functions: cas_check_db_folder(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
cas_check_use_db()
Parameters can be left to NULL; it will then rely on parameters set with cas_set_options()
cas_check_website_folder(base_folder = NULL, project = NULL, website = NULL)
base_folder |
Defaults to NULL, can be set once per session with
|
project |
Defaults to NULL. Project name, can be set once per session
with |
website |
Defaults to NULL. Website name, can be set once per session
with |
Logical, TRUE if website folder exists, FALSE if it does not.
Return a connection to be used for caching
cas_connect_to_db( db_connection = NULL, use_db = NULL, db_type = NULL, db_folder = NULL, read_only = FALSE, ... )
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
use_db |
Defaults to NULL. If given, it should be given either TRUE or
FALSE. Typically set with |
read_only |
Defaults to FALSE. Passed to |
... |
Passed to |
A connection object.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
if (interactive()) {
  db_connection <- DBI::dbConnect(
    RSQLite::SQLite(), # or e.g. odbc::odbc(),
    Driver = ":memory:", # or e.g. "MariaDB",
    Host = "localhost",
    database = "example_db",
    UID = "example_user",
    PWD = "example_pwd"
  )
  cas_connect_to_db(db_connection)
  db_settings <- list(
    driver = "MySQL",
    host = "localhost",
    port = 3306,
    database = "castarter",
    user = "secret_username",
    pwd = "secret_password"
  )
  cas_connect_to_db(db_settings)
}
Convert database type, e.g. from DuckDB to SQLite
cas_convert_db_type( source_db_type, destination_db_type, disconnect_db = FALSE, ... )
source_db_type |
A database type, such as "DuckDB" or "SQLite". Must be declared explicitly. |
destination_db_type |
A database type, such as "DuckDB" or "SQLite". Must be declared explicitly. |
Count strings in a corpus
cas_count( corpus, pattern, text = text, group_by = date, ignore_case = TRUE, drop_na = TRUE, fixed = FALSE, full_words_only = FALSE, pattern_column_name = pattern, n_column_name = n, locale = "en" )
corpus |
A textual corpus as a data frame. |
pattern |
A character vector of one or more words or strings to be counted. |
text |
Defaults to |
group_by |
Defaults to NULL. If given, the unquoted name of the column to be used for grouping (e.g. date, or doc_id, or source, etc.) |
ignore_case |
Defaults to TRUE. |
drop_na |
Defaults to TRUE. If TRUE, all rows where either |
full_words_only |
Defaults to FALSE. If FALSE, string is counted even when it is found in the middle of a word (e.g. if FALSE, "ratio" would be counted as match in the word "irrational"). |
pattern_column_name |
Defaults to |
n_column_name |
Defaults to |
locale |
Locale to be used when ignore_case is set to TRUE. Passed to
|
A data frame
## Not run:
cas_count(
  corpus = corpus,
  pattern = c("dogs", "cats", "horses"),
  text = text,
  group_by = date,
  n_column_name = n
)
## End(Not run)
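The effect of `ignore_case`, `full_words_only` and `group_by` can be illustrated with a small base R sketch. This is not how cas_count is implemented: `count_pattern()` is a hypothetical helper, the word-boundary approach is an assumption, and the toy corpus is invented for the example.

```r
# Hypothetical helper for illustration; not part of castarter.
count_pattern <- function(text, pattern, ignore_case = TRUE, full_words_only = FALSE) {
  if (full_words_only) {
    # Only match the pattern when it is delimited by word boundaries
    pattern <- paste0("\\b", pattern, "\\b")
  }
  hits <- gregexpr(pattern, text, ignore.case = ignore_case, perl = TRUE)
  # regmatches() returns character(0) for texts without matches
  vapply(regmatches(text, hits), length, integer(1))
}

# Toy corpus with a date and a text column
corpus <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02")),
  text = c("The ratio is irrational.", "Ratio again", "nothing here"),
  stringsAsFactors = FALSE
)

# "ratio" is also found inside "irrational" unless full words are required
n_any <- count_pattern(corpus$text, "ratio")
n_full <- count_pattern(corpus$text, "ratio", full_words_only = TRUE)

# Sum the matches by date, in the spirit of group_by = date
corpus$n <- n_any
by_date <- aggregate(n ~ date, data = corpus, FUN = sum)
```

Here the first text contributes two matches without the full-word restriction and one with it, which is the "ratio" versus "irrational" case mentioned in the description of `full_words_only`.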
Count strings in a corpus relative to the number of words
cas_count_relative( corpus, pattern, text = text, group_by = date, ignore_case = TRUE, fixed = FALSE, full_words_only = FALSE, pattern_column_name = pattern, n_column_name = n, locale = "en" )
corpus |
A textual corpus as a data frame. |
pattern |
A character vector of one or more words or strings to be counted. |
text |
Defaults to text. The unquoted name of the column of the corpus data frame to be used for matching. |
group_by |
Defaults to NULL. If given, the unquoted name of the column to be used for grouping (e.g. date, or doc_id, or source, etc.) |
ignore_case |
Defaults to TRUE. |
full_words_only |
Defaults to FALSE. If FALSE, string is counted even when it is found in the middle of a word (e.g. if FALSE, "ratio" would be counted as match in the word "irrational"). |
pattern_column_name |
Defaults to 'word'. The unquoted name of the column to be used for the word in the output (if |
n_column_name |
Defaults to 'n'. The unquoted name of the column to be used for the count in the output. |
locale |
Locale to be used when ignore_case is set to TRUE. Passed to |
A data frame
## Not run:
cas_count_relative(
  corpus = corpus,
  pattern = c("dogs", "cats", "horses"),
  text = text,
  group_by = date,
  n_column_name = n
)
## End(Not run)
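As a rough illustration of what a count "relative to the number of words" means, the following base R sketch divides the number of pattern matches by the number of words, with words counted using the `"\\w+"` pattern that appears as the default in cas_count_total_words. `count_matches()` is a hypothetical helper and the two texts are invented; the package may compute and group the ratio differently.

```r
# Hypothetical helper for illustration; not part of castarter.
count_matches <- function(text, pattern, ignore_case = TRUE) {
  hits <- gregexpr(pattern, text, ignore.case = ignore_case, perl = TRUE)
  vapply(regmatches(text, hits), length, integer(1))
}

texts <- c("Dogs chase cats", "cats sleep all day long")

# Number of matches of the pattern in each text
n_pattern <- count_matches(texts, "cats")
# Number of words in each text, using "\\w+" as the word pattern
n_words <- count_matches(texts, "\\w+")

# Share of all words that match the pattern
relative <- sum(n_pattern) / sum(n_words)
```

In this toy case there are two matches over eight words, so the relative frequency is 0.25.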
Count total words in a dataset
cas_count_total_words( corpus, pattern = "\\w+", text = text, group_by = date, ignore_case = TRUE, n_column_name = n, locale = "en" )
corpus |
A textual corpus as a data frame. |
pattern |
Defaults to pattern commonly used to count words. |
text |
Defaults to |
group_by |
Defaults to NULL. If given, the unquoted name of the column to be used for grouping (e.g. date, or doc_id, or source, etc.) |
ignore_case |
Defaults to TRUE. |
n_column_name |
Defaults to |
locale |
Locale to be used when ignore_case is set to TRUE. Passed to
|
Creates the base folder where castarter stores the project database.
cas_create_db_folder(path = NULL, ask = TRUE, ...)
ask |
Logical, defaults to TRUE. If FALSE, and database folder does not exist, it just creates it without asking (useful for non-interactive sessions). |
Nothing, used for its side effects.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
cas_create_db_folder(path = fs::path(fs::path_temp(), "cas_data"))
cas_write_corpus(). Typically used for file maintenance, especially when datasets are routinely updated.
cas_delete_corpus( keep = 1, ask = TRUE, file_format = "parquet", partition = "year", token = "full_text", corpus_folder = "corpus", path = NULL, ... )
keep |
Numeric, defaults to 1. Number of corpus files to keep. Only the most recent files are kept. |
file_format |
Defaults to "parquet". Currently, other options are not implemented. |
partition |
Defaults to NULL. If NULL, the parquet file is not
partitioned. "year" is a common alternative: if set to "year", the parquet
file is partitioned by year. If a |
token |
Defaults to "full_text", which does not tokenise the text
column. If different from |
path |
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder. |
... |
Passed to |
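The `keep` logic ("only the most recent files are kept") can be sketched in base R as follows. `select_corpora_to_delete()` is a hypothetical helper, and the file names and timestamps are invented; the sketch only shows the selection step and does not delete anything.

```r
# Hypothetical helper for illustration; not part of castarter.
select_corpora_to_delete <- function(paths, modified, keep = 1) {
  # Order files from most to least recently modified
  newest_first <- paths[order(modified, decreasing = TRUE)]
  # Everything after the first `keep` files is a candidate for deletion
  newest_first[seq_along(newest_first) > keep]
}

# Invented file names and modification times
paths <- c("corpus_a.parquet", "corpus_b.parquet", "corpus_c.parquet")
modified <- as.POSIXct(
  c("2024-01-01 10:00:00", "2024-03-01 10:00:00", "2024-02-01 10:00:00"),
  tz = "UTC"
)

to_delete <- select_corpora_to_delete(paths, modified, keep = 1)
none_to_delete <- select_corpora_to_delete(paths, modified, keep = 3)
```

Comparing positions with `keep` (rather than using negative indexing) also behaves sensibly when `keep` is 0 or larger than the number of files.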
Delete rows from selected database table
cas_delete_from_db( table, id = NULL, batch = NULL, index_group = NULL, ask = TRUE, db_folder = NULL, db_connection = NULL, disconnect_db = FALSE, ... )
table |
Name of the table from where rows should be deleted. |
id |
Defaults to NULL. A vector of id. Rows with the given id will be removed from the database. |
batch |
Defaults to NULL. A vector of batch identifiers. Rows with the given batch id will be removed from the database. |
index_group |
Defaults to NULL. A vector of "index_group" names. Rows with the given "index_group" will be removed from the database. |
ask |
Defaults to TRUE. If TRUE, it runs a query checking how many rows would be deleted, and actually deletes them only after confirming. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
... |
Passed to |
Nothing, used for its side effects.
## Not run:
if (interactive()) {
  cas_delete_from_db(table = "contents_data", id = id_to_delete)
}
## End(Not run)
Disable caching for the current session
cas_disable_db()
Nothing, used for its side effects.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
if (interactive()) { cas_disable_db() }
Ensure that connection to database is disconnected consistently
cas_disconnect_from_db( db_connection = NULL, db_type = NULL, use_db = NULL, disconnect_db = FALSE )
db_connection |
Defaults to NULL. If NULL, and database is enabled, |
use_db |
Defaults to NULL. If given, it should be given either TRUE or FALSE. Typically set with |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
Nothing, used for its side effects.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
cas_disconnect_from_db()
Downloads files systematically, and stores details about the download in a local database
cas_download( download_df = NULL, index = FALSE, index_group = NULL, file_format = "html", overwrite_file = FALSE, create_folder_if_missing = NULL, ignore_id = TRUE, wait = 1, pause_base = 2, pause_cap = 256, pause_min = 4, sample = FALSE, retry_times = 3, terminate_on = NULL, user_agent = NULL, download_again_if_status_is_not = NULL, ... )
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
overwrite_file |
Logical, defaults to FALSE. If TRUE, files are downloaded again even if already present, overwriting previously downloaded items. |
wait |
Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue. |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
retry_times |
Defaults to 3. Number of times to retry download in case of errors. |
user_agent |
Defaults to NULL. If given, passed to download method. |
... |
Passed to |
urls_df |
A data frame with at least two columns named |
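The `pause_base`, `pause_cap`, `pause_min` and `retry_times` arguments are not described above. Assuming they behave like the same-named arguments of `httr::RETRY()`, which uses exponential back-off with jitter, the waiting time before each retry could be sketched in base R as below. Both helper functions are hypothetical and only illustrate that assumption; they do not download anything.

```r
# Hypothetical helpers for illustration; not part of castarter.
# Assumption: arguments behave like the same-named ones in httr::RETRY().

# Longest possible pause before a given retry attempt
backoff_upper_bound <- function(attempt, pause_base = 2, pause_cap = 256) {
  pmin(pause_cap, pause_base * 2^attempt)
}

# Random pause below the upper bound, but never shorter than pause_min
sample_pause <- function(attempt, pause_base = 2, pause_cap = 256, pause_min = 4) {
  upper <- backoff_upper_bound(attempt, pause_base, pause_cap)
  pmax(pause_min, stats::runif(length(attempt), min = 0, max = upper))
}

# Upper bounds for eight retries with the defaults shown in the usage
upper_bounds <- backoff_upper_bound(1:8)

set.seed(42)
pauses <- sample_pause(1:8)
```

Under this assumption the upper bound doubles with each attempt until it reaches `pause_cap`, so repeated failures lead to progressively longer waits without growing indefinitely.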
Downloads one file at a time with chromote
cas_download_chromote( download_df = NULL, index = FALSE, index_group = NULL, overwrite_file = FALSE, ignore_id = TRUE, wait = 1, db_connection = NULL, sample = FALSE, file_format = "html", download_again = FALSE, disconnect_db = FALSE, ... )
download_df |
A data frame with four columns: |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
overwrite_file |
Logical, defaults to FALSE. |
wait |
Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
... |
Passed to |
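The behaviour of `sample` described above (FALSE keeps the order, TRUE randomises it, a number randomises it and caps the number of items) can be sketched in base R. `apply_sample()` is a hypothetical helper, and the data frame is a made-up stand-in for `download_df`.

```r
# Hypothetical helper for illustration; not part of castarter.
apply_sample <- function(download_df, sample = FALSE) {
  n <- nrow(download_df)
  if (isTRUE(sample)) {
    # Randomise the order, keep all rows
    download_df[base::sample.int(n), , drop = FALSE]
  } else if (is.numeric(sample)) {
    # Randomise the order and keep at most `sample` rows
    download_df[base::sample.int(n, size = min(sample, n)), , drop = FALSE]
  } else {
    # Leave the order untouched
    download_df
  }
}

# Made-up data frame with ids and urls
download_df <- data.frame(
  id = 1:5,
  url = paste0("https://example.com/", 1:5),
  stringsAsFactors = FALSE
)

set.seed(1)
shuffled <- apply_sample(download_df, sample = TRUE)
subset_df <- apply_sample(download_df, sample = 2)
unchanged <- apply_sample(download_df)
```

Because the argument is itself called `sample`, the sketch calls `base::sample.int()` explicitly to avoid any confusion with the argument.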
Mostly used internally by cas_download.
cas_download_httr( download_df = NULL, index = FALSE, index_group = NULL, overwrite_file = FALSE, ignore_id = TRUE, wait = 1, create_folder_if_missing = NULL, pause_base = 2, pause_cap = 256, pause_min = 4, terminate_on = NULL, retry_times = 3, db_connection = NULL, disconnect_db = FALSE, sample = FALSE, file_format = "html", user_agent = NULL, download_again_if_status_is_not = NULL, ... )
download_df |
A data frame with four columns: |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
overwrite_file |
Logical, defaults to FALSE. |
wait |
Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue. |
retry_times |
Defaults to 3. Number of times to retry download in case of errors. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
user_agent |
Defaults to NULL. If given, passed to download method. |
... |
Passed to |
Invisibly returns the full httr response.
Downloads index files systematically, and stores details about the download in a local database
cas_download_index( download_df = NULL, index_group = NULL, file_format = "html", overwrite_file = FALSE, create_folder_if_missing = NULL, wait = 1, pause_base = 2, pause_cap = 256, pause_min = 4, sample = FALSE, retry_times = 8, terminate_on = 404, user_agent = NULL, download_again_if_status_is_not = NULL, ... )
index |
Mostly used internally by cas_download.
cas_download_internal( download_df = NULL, index = FALSE, index_group = NULL, overwrite_file = FALSE, ignore_id = TRUE, wait = 1, create_folder_if_missing = NULL, db_connection = NULL, disconnect_db = FALSE, sample = FALSE, file_format = "html", ... )
download_df |
A data frame with four columns: |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
overwrite_file |
Logical, defaults to FALSE. |
wait |
Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
... |
Passed to |
Invisibly returns the full httr response.
Downloads html pages based on a vector of links.
cas_download_legacy( url, type = "contents", custom_folder = NULL, custom_path = NULL, file_format = "html", url_to_download = NULL, size = 500, wget_system = FALSE, method = "auto", missing_pages = TRUE, start = 1, wait = 1, ignore_ssl_certificates = FALSE, use_headless_chromium = FALSE, headless_chromium_wait = 1, use_phantomjs = FALSE, create_script = FALSE, project = NULL, website = NULL, base_folder = NULL )
url |
A character vector of urls, or a data frame with at least two columns named |
type |
Accepted values are either "contents" (default) or "index". |
custom_folder |
Defaults to NULL. If given, overrides the "type" param and stores files in given path as a subfolder of project/website. Folder must already exist, and should be empty. |
url_to_download |
Defaults to NULL. If given, expected to be a logical vector to be applied to the given urls. If given, it takes precedence over |
size |
Defaults to 500. It represents the minimum size in bytes that downloaded html files should have: files that are smaller will be downloaded again. Used only when missing_pages == FALSE. |
wget_system |
Logical, defaults to FALSE. Calls wget as a system command through the system() function. Wget must be previously installed on the system. |
method |
Defaults to "auto". Method is passed to the function utils::download.file(); available options are "internal", "wininet" (Windows only), "libcurl", "wget" and "curl". For more information see ?utils::download.file() |
missing_pages |
Logical, defaults to TRUE. If TRUE, verifies if a downloaded html file exists for each element in articlesLinks; when there is no such file, it downloads it. |
start |
Integer. Only url with position higher than start in the url vector will be downloaded: |
ignore_ssl_certificates |
Logical, defaults to FALSE. If TRUE it uses wget to download the page, and does not check if the SSL certificate is valid. Useful, for example, for https pages with expired or mis-configured SSL certificate. |
use_headless_chromium |
Logical, defaults to FALSE. If TRUE uses the |
headless_chromium_wait |
Numeric, in seconds. How long should headless chrome wait after loading page? |
create_script |
Logical, defaults to FALSE. Tested on Linux only. If TRUE, creates a downloadPages.sh executable file that can be used to download all relevant pages from a terminal. |
project |
Name of 'castarter2' project. Must correspond to the name of a folder in the current working directory. |
website |
Name of a website included in a 'castarter2' project. Must correspond to the name of a sub-folder of the project folder. |
path |
Defaults to NULL. If given, overrides the "type" and "custom_folder" param and stores files in given path. |
By default, returns nothing, used for its side effects (downloads html files in relevant folder). Downloaded files can then be imported in a vector with the function ImportHtml.
## Not run:
if (interactive()) {
  cas_download(url)
}
## End(Not run)
Enable caching for the current session
cas_enable_db(db_type = "SQLite")
Nothing, used for its side effects.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
if (interactive()) { cas_enable_db() }
Run the Shiny Application
cas_explorer( corpus = castarter::cas_demo_corpus, default_pattern = NULL, title = "castarter", collect = FALSE, advanced = FALSE, custom_head_html = "<meta name=\"referrer\" content=\"no-referrer\" />", footer_html = shiny::tagList(), onStart = NULL, options = list(), enableBookmarking = NULL, uiPattern = "/", ... )
collect |
Defaults to FALSE. If TRUE, retrieves the corpus in memory,
even if it is originally read from a parquet file or a database. With
|
custom_head_html |
Chunk of code to be included in the app's |
onStart |
A function that will be called before the app is actually run.
This is only needed for |
options |
Named options that should be passed to the |
enableBookmarking |
Can be one of |
uiPattern |
A regular expression that will be applied to each |
... |
arguments to pass to golem_opts. See |
Run the Shiny Application
cas_explorer_legacy( corpus = castarter::cas_demo_corpus, default_string = NULL, custom_head_html = "<meta name=\"referrer\" content=\"no-referrer\" />", onStart = NULL, options = list(), enableBookmarking = NULL, uiPattern = "/", ... )
onStart |
A function that will be called before the app is actually run.
This is only needed for |
options |
Named options that should be passed to the |
enableBookmarking |
Can be one of |
uiPattern |
A regular expression that will be applied to each |
... |
arguments to pass to golem_opts.
See |
Export database tables to another format such as csv
cas_export_tables( path = NULL, file_format = "csv.gz", tables = NULL, db_connection = NULL, disconnect_db = FALSE, db_folder = NULL, ... )
path |
Defaults to NULL. If NULL, path is set to the project/website/export/file_format folder. |
file_format |
Defaults to "csv.gz", i.e. compressed csv files. All
formats supported by |
tables |
Defaults to NULL. If NULL, all database tables are exported. If
given, names of the database tables to export. See
|
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
... |
Passed to |
## Not run:
if (interactive()) {
  cas_export_tables(file_format = "csv")
}
## End(Not run)
Extract fields and contents from downloaded files
cas_extract( extractors, post_processing = NULL, id = NULL, ignore_id = TRUE, custom_path = NULL, index = FALSE, store_as_character = TRUE, check_previous = TRUE, db_connection = NULL, file_format = "html", sample = FALSE, write_to_db = FALSE, keep_if_status = 200, encoding = "UTF-8", readability = FALSE, ... )
extractors |
A named list of functions. See examples for details. |
post_processing |
Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns. |
id |
Defaults to NULL, identifiers to process when extracting. If given,
must be a numeric vector, logically corresponding to the identifiers in the
|
ignore_id |
Defaults to TRUE. If TRUE, it checks if identifiers have
been added to the local ignore list, typically with |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
store_as_character |
Logical, defaults to TRUE. If TRUE, it converts to character all extracted contents before writing them to database. This reduces issues of type conversions with the default database backend (for example, SQLite automatically converts dates to numeric) or using different backends. This implies you will need to set data types when you read the database, but it also means that you can consistently expect all columns to be character vectors, which in one form or another are consistently implemented across database backends. Set to FALSE if you want to remain in control of column types. |
check_previous |
Logical, defaults to TRUE. If FALSE, no check will be
conducted to verify if the same content had been previously extracted. If
FALSE, |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
keep_if_status |
Defaults to 200. Keep only if recorded download status matches the given status. |
... |
Passed to |
## Not run:
if (interactive()) {
  ### Post-processing example ####

  # For example, in order to add a column called `internal_id`
  # that takes the ending digits of the url (assuming the url ends with digits)
  # a function such as the following would be passed to cas_extract
  pp <- function(df) {
    df |>
      dplyr::mutate(internal_id = stringr::str_extract(url, "[[:digit:]]+$"))
  }

  cas_extract(
    extractors = extractors_l, # assuming it has already been set
    post_processing = pp
  )
}
## End(Not run)
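To see what a `post_processing` function of this kind does without running a full extraction, here is a minimal sketch applied to a toy data frame. It swaps the dplyr/stringr calls of the example above for base R equivalents so that it runs without extra packages; the helper name `pp_base` and the sample urls are hypothetical, not part of castarter.

```r
# Hypothetical base R variant of the `pp` function shown in the example above.
# It adds an `internal_id` column with the digits found at the end of each url.
pp_base <- function(df) {
  match_position <- regexpr("[[:digit:]]+$", df$url)
  df$internal_id <- NA_character_
  # regmatches() returns only the matched elements, so assign them
  # to the rows where a match was actually found
  df$internal_id[match_position > 0] <- regmatches(df$url, match_position)
  df
}

# Toy input: one url ending with digits, one without
urls_df <- data.frame(
  url = c("https://example.com/news/123", "https://example.com/about"),
  stringsAsFactors = FALSE
)

result <- pp_base(urls_df)
result
```

Rows whose url does not end with digits get `NA` in the new column, which mirrors the behaviour of `stringr::str_extract()` when there is no match.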
Facilitates extraction of contents from an html file
cas_extract_html( html_document, container = NULL, container_class = NULL, container_id = NULL, container_name = NULL, container_property = NULL, container_itemprop = NULL, container_instance = NULL, attribute = NULL, sub_element = NULL, no_children = NULL, trim = TRUE, squish = FALSE, no_match = "", exclude_css_path = NULL, exclude_xpath = NULL, custom_xpath = NULL, custom_css_path = NULL, keep_everything = FALSE, extract_text = TRUE, as_character = TRUE )
html_document |
An html document parsed with |
container |
Defaults to NULL. Type of html container from where links
are to be extracted, such as "div", "ul", and others. Either
|
container_class |
Defaults to NULL. If provided, also |
container_id |
Defaults to NULL. If provided, also |
container_itemprop |
Defaults to NULL. If provided, also |
container_instance |
Defaults to NULL. If given, it must be an integer. If a given combination is found more than once in the same page, the relevant occurrence is kept. Use with caution, as not all pages always include the same number of elements of the same class/with the same id. |
attribute |
Defaults to NULL. If given, type of attribute to extract.
Typically used in combination with container, as in
|
sub_element |
Defaults to NULL. If provided, also |
no_children |
Defaults to FALSE, i.e. by default all subelements of the
selected combination (e.g. div with given class) are extracted. If TRUE,
only text found under the given combination (but not its subelements) will
be extracted. Corresponds to the xpath string |
trim |
Defaults to TRUE. If TRUE, applies |
squish |
Defaults to FALSE. If TRUE, applies |
no_match |
Defaults to "". A common alternative would be NA. Value to return when the given container, selector or element is not found. |
exclude_css_path |
Defaults to NULL. To remove script, for example, use
|
exclude_xpath |
Defaults to NULL. A common pattern when extracting text
would be |
custom_xpath |
Defaults to NULL. If given, all other parameters are ignored and given Xpath used instead. |
custom_css_path |
Defaults to NULL. If given, all other parameters are ignored and given CSSpath used instead. |
keep_everything |
Defaults to FALSE. If TRUE, all text included in the page is returned as a single string. |
extract_text |
Defaults to TRUE. If TRUE, text is extracted. |
as_character |
Defaults to TRUE. If FALSE, and if |
A character vector of length one.
## Not run:
if (interactive()) {
  url <- "https://example.com"
  html_document <- rvest::read_html(x = url)

  # example for a tag that looks like:
  # <meta name="twitter:title" content="Example title" />
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "twitter:title",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta name="keywords" content="various;keywords;">
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "keywords",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_property = "article:published_time",
    attribute = "content"
  )
}
## End(Not run)
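For readers who want to check a container/attribute combination by hand, the following sketch shows roughly what such a selection corresponds to when written directly with rvest and XPath. It is an illustration under stated assumptions, not the internal implementation of `cas_extract_html()`: the html snippet is made up, and the XPath strings are written for this example only.

```r
# Requires the rvest package; the html below is a made-up example page.
html_document <- rvest::read_html(
  '<html><head>
     <meta name="twitter:title" content="Example title" />
     <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
   </head><body></body></html>'
)

# Select the first <meta> tag with the given name, then read its "content" attribute
title_node <- rvest::html_element(html_document, xpath = "//meta[@name='twitter:title']")
title <- rvest::html_attr(title_node, "content")

# Same approach, selecting on the "property" attribute instead of "name"
published_node <- rvest::html_element(
  html_document,
  xpath = "//meta[@property='article:published_time']"
)
published <- rvest::html_attr(published_node, "content")

# When nothing matches, html_element() returns a missing node
# and html_attr() returns NA for it
missing_node <- rvest::html_element(html_document, xpath = "//meta[@name='keywords']")
keywords <- rvest::html_attr(missing_node, "content")
```

Note that plain rvest returns `NA` when a tag is not found, whereas `cas_extract_html()` returns the value given to `no_match` (by default an empty string).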
Extract direct links to individual content pages from index pages
cas_extract_links( id = NULL, batch = "latest", domain = NULL, index = TRUE, index_group = NULL, output_index = FALSE, output_index_group = NULL, include_when = NULL, exclude_when = NULL, container = NULL, container_class = NULL, container_id = NULL, custom_xpath = NULL, custom_css = NULL, match = NULL, min_length = NULL, max_length = NULL, attribute_type = "href", append_string = NULL, remove_string = NULL, write_to_db = FALSE, file_format = "html", keep_only_within_domain = TRUE, sample = FALSE, check_previous = TRUE, check_again = FALSE, encoding = "UTF-8", reverse_order = FALSE, db_connection = NULL, disconnect_db = TRUE, ... )
id |
Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to given id will be processed. |
domain |
Defaults to NULL. Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address this should be ignored. |
output_index |
Defaults to FALSE. If FALSE, new links are added to the
contents table. If TRUE, the links extracted will be stored again as
index, using |
output_index_group |
Defaults to NULL. Relevant only when |
include_when |
Part of URL found only in links of individual articles to be downloaded. If more than one is provided, it includes all links that contain any of the strings provided. |
exclude_when |
If a URL includes this string, it is excluded from the output. One or more strings may be provided. |
container |
Defaults to NULL. Type of html container from where links
are to be extracted, such as "div", "ul", and others. Either
|
container_class |
Defaults to NULL. If provided, also |
container_id |
Defaults to NULL. If provided, also |
custom_xpath |
Defaults to NULL. If given, all other parameters are ignored and given Xpath used instead. |
match |
Defaults to NULL. Used when extracting json files. Name of property from where url is to be extracted. N.B. Only partly implemented, please report issues along with specific example where it emerged. |
min_length |
If a link is shorter than the number of characters given in min_length, it is excluded from the output. |
max_length |
If a link is longer than the number of characters given in max_length, it is excluded from the output. |
attribute_type |
Defaults to "href". Type of attribute to extract from links. |
append_string |
If provided, appends given string to the extracted articles. Typically used to create links for print or mobile versions of the extracted page. |
remove_string |
If provided, remove given string (or strings) from links. |
write_to_db |
Logical, defaults to FALSE. If TRUE, stores newly extracted links in the database, associates each of them with an id, and records the source for each link. |
keep_only_within_domain |
Logical, defaults to TRUE. If TRUE, and domain given, links to external websites are dropped. |
check_previous |
Defaults to TRUE. If TRUE, checks if newly found links
are previously stored in database, and if they are, it discards them. If
FALSE, and |
check_again |
Defaults to FALSE. If FALSE, files from which at least one link has already been extracted are not re-processed. If TRUE, they are processed again. By default, only new links are then actually included in the output or stored in the local database. |
reverse_order |
Logical, defaults to FALSE. If TRUE, index files are
processed in reverse order of |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to TRUE. If FALSE, leaves the connection to database open. |
... |
Passed to |
A data frame.
## Not run:
links <- cas_extract_links(domain = "http://www.example.com/")
## End(Not run)
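The combined effect of `include_when`, `exclude_when`, `min_length` and `max_length` can be sketched on a plain character vector. The helper `filter_links()` below is hypothetical and written only for illustration; in particular, it assumes that the strings are matched literally rather than as regular expressions, which the descriptions above do not specify.

```r
# Hypothetical helper, not part of castarter: keeps only links that pass all
# the given filters. Strings are matched literally (an assumption).
filter_links <- function(links,
                         include_when = NULL,
                         exclude_when = NULL,
                         min_length = NULL,
                         max_length = NULL) {
  # TRUE for each link that contains at least one of the given strings
  contains_any <- function(x, strings) {
    Reduce(`|`, lapply(strings, function(s) grepl(s, x, fixed = TRUE)))
  }

  keep <- rep(TRUE, length(links))
  if (!is.null(include_when)) keep <- keep & contains_any(links, include_when)
  if (!is.null(exclude_when)) keep <- keep & !contains_any(links, exclude_when)
  if (!is.null(min_length)) keep <- keep & nchar(links) >= min_length
  if (!is.null(max_length)) keep <- keep & nchar(links) <= max_length

  links[keep]
}

links <- c(
  "https://example.com/news/1",
  "https://example.com/news/archive",
  "https://example.com/tag/x",
  "/n"
)

kept <- filter_links(
  links,
  include_when = "/news/",
  exclude_when = "archive",
  min_length = 10
)
kept
```

Here only the first link survives: the second is dropped by `exclude_when`, while the third and fourth do not contain the `include_when` string.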
Extracts scripts from an html page
cas_extract_script( html_document, script_type = NULL, match = NULL, accessors = NULL, remove_from_script = NULL )
html_document |
An html document parsed with |
script_type |
Defaults to NULL. Type of script. Common script types
include |
match |
Defaults to NULL. If given, used to filter extracted scripts.
Must be a named vector in the format |
accessors |
Defaults to NULL. If given, a vector of accessors passed to
|
remove_from_script |
Defaults to NULL. If given, removed after the script has been extracted but before processing the json. |
May return a list or a character vector. If no match is found, returns NA_character_
## Not run:
if (interactive()) {
  url <- "https://www.digi24.ro/stiri/externe/casa-alba-pune-capat-isteriei-globale-nu-exista-indicii-ca-obiectele-zburatoare-doborate-de-rachetele-sua-ar-fi-extraterestre-2250863"
  html_document <- rvest::read_html(x = url)

  cas_extract_script(
    html_document = html_document,
    script_type = "application/ld+json"
  )

  # get date published
  cas_extract_script(
    html_document = html_document,
    script_type = "application/ld+json",
    match = c(`@type` = "NewsArticle"),
    accessors = "datePublished"
  )

  # get title
  cas_extract_script(
    html_document = html_document,
    script_type = "application/ld+json",
    match = c(`@type` = "NewsArticle"),
    accessors = "headline"
  )

  # get nested element, e.g. url of the logo of the publisher
  cas_extract_script(
    html_document = html_document,
    script_type = "application/ld+json",
    match = c(`@type` = "NewsArticle"),
    accessors = c("publisher", "logo", "url")
  )
}
## End(Not run)
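The logic behind `script_type`, `match` and `accessors` can be illustrated offline with a small made-up page. The sketch below uses rvest and jsonlite directly and a hypothetical helper, `get_nested()`, that walks a parsed list one accessor at a time; it is meant to convey the idea, and does not claim to reproduce how `cas_extract_script()` is implemented.

```r
# Requires the rvest and jsonlite packages; the page below is a made-up example.
page <- rvest::read_html(
  '<html><head><script type="application/ld+json">
   {"@type": "NewsArticle",
    "headline": "Example headline",
    "datePublished": "2023-02-14",
    "publisher": {"logo": {"url": "https://example.com/logo.png"}}}
   </script></head><body></body></html>'
)

# Step 1: keep only scripts of the requested type and parse them as json
script_nodes <- rvest::html_elements(page, xpath = "//script[@type='application/ld+json']")
parsed <- lapply(rvest::html_text(script_nodes), jsonlite::fromJSON, simplifyVector = FALSE)

# Step 2: keep only scripts where the `@type` property has the expected value
news <- Filter(function(x) identical(x[["@type"]], "NewsArticle"), parsed)

# Step 3: hypothetical helper that follows a vector of accessors into a nested list
get_nested <- function(x, accessors) {
  Reduce(function(node, key) node[[key]], accessors, x)
}

headline <- get_nested(news[[1]], "headline")
logo_url <- get_nested(news[[1]], c("publisher", "logo", "url"))
```

With a vector of accessors such as `c("publisher", "logo", "url")`, each element selects one level deeper in the json structure.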
cas_extract_html()
This may or may not work, but it may be worth giving it a quick try before looking for alternatives. The parameters returned first should work best.
cas_find_extractor( html_document, pattern, containers = c("h1", "h2", "h3", "h4", "span", "td", "p", "div"), exclude_css_path = NULL )
html_document |
An html document parsed with |
pattern |
A text string to be matched. |
containers |
Containers to be parsed for best matches. By default:
|
exclude_css_path |
Defaults to NULL. To remove script, for example, use
|
A data frame list with container and class or id of values that should work if passed to cas_extract_html().
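The general idea of looking for a known text string across candidate containers can be sketched with rvest alone. The helper `find_candidates()` below is hypothetical and much simpler than `cas_find_extractor()`: it assumes literal matching, reports only the `class` attribute, and makes no attempt to rank results.

```r
# Requires the rvest package; the page below is a made-up example.
page <- rvest::read_html(
  '<html><body>
     <h1 class="headline">Example title</h1>
     <p class="meta">By Jane Doe</p>
   </body></html>'
)

# Hypothetical helper: for each container type, find the elements whose text
# includes the pattern, and report the container and its class
find_candidates <- function(html_document, pattern, containers = c("h1", "p", "div")) {
  rows <- lapply(containers, function(tag) {
    nodes <- rvest::html_elements(html_document, tag)
    hit <- grepl(pattern, rvest::html_text(nodes), fixed = TRUE)
    data.frame(
      container = rep(tag, sum(hit)),
      container_class = rvest::html_attr(nodes[hit], "class"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

hits <- find_candidates(page, pattern = "Example title")
hits
```

The resulting container and class could then be tried as `container` and `container_class` in `cas_extract_html()`.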
## Not run:
if (interactive()) {
  # not ideal example, but you'll get the gist, see additional example below
  library("castarter")
  url <- "https://www.nasa.gov/news-release/nasa-sets-coverage-for-roscosmos-spacewalk-outside-space-station/"
  html_page <- rvest::read_html(url)

  cas_find_extractor(
    html_document = html_page,
    pattern = "NASA Sets Coverage for Roscosmos Spacewalk Outside Space Station"
  )
  cas_find_extractor(html_document = html_page, pattern = "Oct 23, 2023")
  cas_find_extractor(html_document = html_page, pattern = "Roxana Bardan")
  cas_find_extractor(html_document = html_page, pattern = "RELEASE")

  ## Use this information to extract contents
  library("castarter")
  url <- "https://www.state.gov/designating-russian-virtual-currency-money-launderer/"
  html_page <- rvest::read_html(url)

  cas_find_extractor(
    html_document = html_page,
    pattern = "Designating Russian Virtual Currency Money Launderer"
  )
  cas_extract_html(
    html_document = html_page,
    container = "span",
    container_class = "bc_current collapse"
  )
  cas_extract_html(
    html_document = html_page,
    container = "h1",
    container_class = "featured-content__headline stars-above"
  )

  cas_find_extractor(html_document = html_page, pattern = "Press Statement")
  cas_extract_html(
    html_document = html_page,
    container = "p",
    container_class = "article-meta doctype-meta"
  )

  cas_find_extractor(
    html_document = html_page,
    pattern = "Matthew Miller, Department Spokesperson"
  )
  cas_extract_html(
    html_document = html_page,
    container = "p",
    container_class = "article-meta__author-bureau"
  )

  cas_find_extractor(html_document = html_page, pattern = "November 3, 2023")
  cas_extract_html(
    html_document = html_page,
    container = "p",
    container_class = "article-meta__publish-date"
  )

  cas_find_extractor(
    html_document = html_page,
    pattern = "The United States is sanctioning Ekaterina Zhdanova",
    exclude_css_path = "script"
  )
  cas_extract_html(
    html_document = html_page,
    container = "div",
    container_class = "entry-content",
    exclude_css_path = "script"
  )
}
## End(Not run)
Generate basic metadata about the corpus, including start and end date and total number of items available.
cas_generate_metadata( corpus = NULL, db_connection = NULL, db_folder = NULL, ... )
... |
Passed to |
A list.
Get base folder under which files will be stored.
cas_get_base_folder(..., level = "website", custom_path = NULL)
... |
Passed to |
level |
Defaults to "website". Valid values are "website", "project", and "base". |
custom_path |
Defaults to NULL. If given, all other parameters and settings are ignored, and folder is set to this value. |
Build full path to base working folder
cas_get_base_path( create_folder_if_missing = NULL, custom_path = NULL, custom_folder = NULL, index = FALSE, file_format = "html", ... )
create_folder_if_missing |
Logical, defaults to NULL. If NULL, it will ask before creating a new folder. If TRUE, it will create it without asking. |
custom_path |
Defaults to NULL. If given, all other parameters and settings are ignored, and folder is set to this value. |
file_format |
|
... |
Passed to |
Path to base folder. A character vector of length one of class fs_path.
Get path to folder where the corpus is stored.
cas_get_corpus_path( ..., corpus_folder = "corpus", file_format = "parquet", partition = NULL, token = "full_text" )
... |
Passed to |
file_format |
Defaults to "parquet". Currently, other options are not implemented. |
partition |
Defaults to NULL. If NULL, the parquet file is not
partitioned. "year" is a common alternative: if set to "year", the parquet
file is partitioned by year. If a |
token |
Defaults to "full_text", which does not tokenise the text
column. If different from |
## Not run:
cas_get_corpus_path()
## End(Not run)
Get connection to database with details about current website
cas_get_db( db_folder = NULL, base_folder = NULL, project = NULL, website = NULL )
db_folder |
Defaults to NULL, can be set once per session with
|
base_folder |
Defaults to NULL, can be set once per session with
|
project |
Defaults to NULL. Project name, can be set once per session
with |
website |
Defaults to NULL. Website name, can be set once per session
with |
cas_get_db( base_folder = fs::path_temp(), project = "example_project", website = "example_website" )
Gets location of database file
cas_get_db_file(db_folder = NULL, ...)
A character vector of length one with location of the SQLite database file.
cas_set_db_folder(path = tempdir())
db_file_location <- cas_get_db_file(project = "test-project")
# outputs location of database file
db_file_location
Typically set with cas_set_db()
cas_get_db_settings()
A list with all database parameters as stored in environment variables.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_read_from_db(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
cas_get_db_settings()
Create a data frame with not yet downloaded files
cas_get_files_to_download( urls = NULL, index = FALSE, index_group = NULL, ignore_id = TRUE, desc_id = FALSE, batch = NULL, create_folder_if_missing = NULL, custom_folder = NULL, custom_path = NULL, file_format = "html", db_connection = NULL, download_again = FALSE, download_again_if_status_is_not = NULL, ... )
urls |
Defaults to NULL. If given, it should correspond with a data
frame with at least two columns named |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
desc_id |
Logical, defaults to FALSE. If TRUE, results are returned with highest id first. |
batch |
An integer, defaults to NULL. If not given, a check is performed in the database to find if previous downloads have taken place. If so, by default, the current batch will be one unit higher than the highest batch number found in the database. |
download_again_if_status_is_not |
Defaults to NULL. If given, it must be a
status code as integer, typically |
... |
Arguments passed on to
|
A data frame with four columns: id, url, path and type.
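The default behaviour described for `batch` (one unit higher than the highest batch number previously recorded) amounts to a very small rule, sketched below. The helper name `next_batch()` is hypothetical, and the choice of 1 as the first batch number when nothing has been downloaded yet is an assumption, not something stated above.

```r
# Hypothetical helper, not part of castarter.
# `previous_batches` stands for the batch numbers already recorded in the database.
next_batch <- function(previous_batches) {
  if (length(previous_batches) == 0) {
    # assumption: the first batch is numbered 1
    return(1)
  }
  max(previous_batches) + 1
}

first_batch <- next_batch(numeric(0))
following_batch <- next_batch(c(1, 2, 2))
```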
Mostly used internally by cas_extract or for troubleshooting.
cas_get_files_to_extract( id = NULL, ignore_id = TRUE, custom_path = NULL, index = FALSE, store_as_character = TRUE, check_previous = TRUE, db_connection = NULL, file_format = "html", sample = FALSE, keep_if_status = 200, ... )
id |
Defaults to NULL, identifiers to process when extracting. If given,
must be a numeric vector, logically corresponding to the identifiers in the
|
ignore_id |
Defaults to TRUE. If TRUE, it checks if identifiers have
been added to the local ignore list, typically with |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
store_as_character |
Logical, defaults to TRUE. If TRUE, it converts to character all extracted contents before writing them to database. This reduces issues of type conversions with the default database backend (for example, SQLite automatically converts dates to numeric) or using different backends. This implies you will need to set data types when you read the database, but it also means that you can consistently expect all columns to be character vectors, which in one form or another are consistently implemented across database backends. Set to FALSE if you want to remain in control of column types. |
check_previous |
Logical, defaults to TRUE. If FALSE, no check will be
conducted to verify if the same content had been previously extracted. If
FALSE, |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
keep_if_status |
Defaults to 200. Keep only if recorded download status matches the given status. |
... |
Passed to |
## Not run:
if (interactive()) {
  cas_get_files_to_extract()
}
## End(Not run)
Get key project parameters that determine the folder used for storing project files
cas_get_options( project = NULL, website = NULL, use_db = NULL, base_folder = NULL, db_type = NULL, db_folder = NULL, ... )
project |
Defaults to NULL. Project name, can be set once per session
with |
website |
Defaults to NULL. Website name, can be set once per session
with |
use_db |
Defaults to TRUE. If TRUE, stores information about the download process and extracted text in a local database. |
base_folder |
Defaults to NULL, can be set once per session with
|
db_folder |
Defaults to NULL, can be set once per session with
|
A list object with the given or previously set options.
Other settings:
cas_set_options()
# it is possible to set only a few options, and let others be added when calling functions
cas_set_options(base_folder = fs::path(fs::path_temp(), "castarter_data"))
cas_options_list <- cas_get_options()
cas_options_list

cas_options_list2 <- cas_get_options(project = "test_project")
cas_options_list2

cas_set_options(
  base_folder = fs::path(fs::path_temp(), "castarter_data"),
  project = "test_project",
  website = "test_website"
)
cas_options_list3 <- cas_get_options()
cas_options_list3

# Passing an argument overwrites the arguments set with options
cas_options_list4 <- cas_get_options(website = "test_website4")
cas_options_list4
This function relies on data stored in the database.
cas_get_path_to_files( urls = NULL, id = NULL, batch = "latest", status = 200, index = FALSE, index_group = NULL, custom_folder = NULL, custom_path = NULL, file_format = "html", sample = FALSE, db_connection = NULL, db_folder = NULL, disconnect_db = TRUE, ... )
batch |
Defaults to "latest": returns only the path to the file with the highest batch identifier available. Valid values are: "latest", "all", or a numeric identifier corresponding to the desired batch. |
status |
Defaults to 200. Keeps only files downloaded with the given status (can be more than one, given as a vector). If NULL, no filter based on status is applied. |
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
sample |
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
A data frame of one row if "batch" is set to "latest". Possibly more than one row in other cases.
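The effect of the batch and status arguments can be illustrated with a minimal base R sketch. The `downloads` data frame, its column names, and the `keep_latest()` helper are hypothetical stand-ins, not the actual structure of the castarter database or the package's implementation.

```r
# Hypothetical download log: one row per download attempt
# (an assumption for illustration, not castarter's real schema).
downloads <- data.frame(
  id = c(1, 1, 2, 2, 2),
  batch = c(1, 2, 1, 2, 3),
  status = c(200, 200, 404, 200, 200)
)

# Keep only rows with an accepted status, then the highest batch for each id.
keep_latest <- function(df, status = 200) {
  if (!is.null(status)) {
    df <- df[df$status %in% status, , drop = FALSE]
  }
  latest <- df$batch == ave(df$batch, df$id, FUN = max)
  df[latest, , drop = FALSE]
}

latest_df <- keep_latest(downloads)
latest_df
```

With `status = NULL` the status filter is skipped, mirroring the behaviour described for the `status` argument.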
Checks that a given input corresponds to the format expected of a download data frame, consistently returns expected format
cas_get_urls_df(urls = NULL, index = FALSE, index_group = NULL, ...)
urls |
A character vector or a data frame with at least two columns, |
Consistently returns a data frame with at least two columns: a numeric id column, and a character url column.
cas_get_urls_df(c( "https://example.com/a/", "https://example.com/b/" ))
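The normalisation described above can be sketched in base R. `as_urls_df_sketch()` is a hypothetical helper written for illustration; assigning sequential identifiers to a character vector is an assumption, and the package may handle inputs differently.

```r
# Hypothetical helper, not the package's implementation: coerce the input to a
# data frame with a numeric `id` column and a character `url` column.
as_urls_df_sketch <- function(urls) {
  if (is.data.frame(urls)) {
    stopifnot(all(c("id", "url") %in% names(urls)))
    urls$id <- as.numeric(urls$id)
    urls$url <- as.character(urls$url)
    return(urls)
  }
  # Assumption: a plain character vector gets sequential identifiers.
  data.frame(
    id = as.numeric(seq_along(urls)),
    url = as.character(urls),
    stringsAsFactors = FALSE
  )
}

urls_df <- as_urls_df_sketch(c(
  "https://example.com/a/",
  "https://example.com/b/"
))
urls_df
```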
Get folder where files and data related to the current website are stored
cas_get_website_folder(base_folder = NULL, project = NULL, website = NULL)
base_folder |
Defaults to NULL, can be set once per session with
|
project |
Defaults to NULL. Project name, can be set once per session
with |
website |
Defaults to NULL. Website name, can be set once per session
with |
A path to a folder.
cas_get_website_folder()
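As a rough illustration of how a website folder nests under a project folder and a base folder, the path can be composed with base R. The values below are placeholders; the actual function may add further normalisation.

```r
# Illustrative values only; in castarter these are usually set with cas_set_options().
base_folder <- file.path(tempdir(), "castarter_data")
project <- "example_project"
website <- "example_website"

# The website folder sits under the project folder, which sits under the base folder.
website_folder <- file.path(base_folder, project, website)
website_folder
```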
For details on API access to the Wayback Machine see: https://archive.org/help/wayback_api.php
cas_ia_check( url = NULL, wait = 1, retry_times = 3, pause_base = 2, pause_cap = 512, pause_min = 4, db_connection = NULL, disconnect_db = FALSE, check_db = TRUE, write_db = TRUE, output_only_newly_checked = FALSE, ... )
url |
A character vector of length one, a URL. |
wait |
Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue. |
retry_times |
Defaults to 3. Number of times to retry download in case of errors. |
check_db |
Defaults to TRUE. If TRUE, checks if given URL has already been checked in local database, and queries APIs only for URLs that have not been previously checked. |
write_db |
Defaults to TRUE. If TRUE, writes result to a local database. |
... |
Passed to |
For an R package facilitating more extensive interaction with the API, see: https://github.com/hrbrmstr/wayback
Integration with Wayback CDX Server API to be considered.
A url linking to the version on the Internet Archive
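For orientation, the availability endpoint documented at the Wayback Machine API page linked above takes the target address as a `url` query parameter. The sketch below only builds the request URL; `wayback_available_url()` is a hypothetical helper name, and nothing here is meant to describe how `cas_ia_check()` is implemented internally.

```r
# Build the request URL for the Wayback Machine availability endpoint
# described at https://archive.org/help/wayback_api.php.
# `wayback_available_url()` is a hypothetical helper, not part of castarter.
wayback_available_url <- function(url) {
  paste0(
    "https://archive.org/wayback/available?url=",
    utils::URLencode(url, reserved = TRUE)
  )
}

request_url <- wayback_available_url("https://example.com/a/")
request_url
```

The response is JSON; no request is sent in this sketch, so it can be run offline.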
Consider using long waiting times and a high number of retries. Retries are handled gracefully using httr::RETRY, respecting the waiting time given when error 429 "too many requests" is returned by the server. This is still likely to take a long time.
cas_ia_save( url = NULL, wait = 32, retry_times = 3, pause_base = 16, pause_cap = 1024, pause_min = 64, only_if_unavailable = TRUE, ia_check = TRUE, ia_check_wait = 2, db_connection = NULL, check_db = TRUE, write_db = TRUE, ... )
url |
A character vector of length one, a URL. |
wait |
Defaults to 32. I have found no information online about what wait time is considered suitable by Archive.org itself, but I've noticed that with wait time shorter than 10 seconds the whole process stops getting positive replies from the server very soon. |
retry_times |
Defaults to 3. Number of times to retry download in case of errors. |
pause_base , pause_cap
|
This method uses exponential back-off with full
jitter - this means that each request will randomly wait between
|
pause_min |
Minimum time to wait in the backoff; generally only necessary if you need pauses less than one second (which may not be kind to the server, use with caution!). |
only_if_unavailable |
Defaults to TRUE. If TRUE, checks for availability of urls before attempting to save them. |
ia_check |
Defaults to TRUE. If TRUE, checks again the URL after saving it and keeps record in the local database. |
ia_check_wait |
Defaults to 2, passed to |
check_db |
Defaults to TRUE. If TRUE, checks if given URL has already been checked in local database, and queries APIs only for URLs that have not been previously checked. |
write_db |
Defaults to TRUE. If TRUE, writes result to a local database. |
... |
Passed to |
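The interplay of pause_base, pause_cap and pause_min can be sketched as follows. The exact formula is an assumption inferred from the parameter descriptions (exponential back-off with full jitter, bounded below by pause_min and above by pause_cap); it is not copied from the httr source, and `backoff_wait()` is a hypothetical helper.

```r
# Sketch of exponential back-off with full jitter.
# Assumption: the upper bound doubles with each attempt until it reaches
# pause_cap, and the random wait is never shorter than pause_min.
backoff_wait <- function(attempt,
                         pause_base = 16,
                         pause_cap = 1024,
                         pause_min = 64) {
  upper <- min(pause_cap, pause_base * 2^attempt)
  max(pause_min, stats::runif(1, min = 0, max = upper))
}

set.seed(1)
waits <- vapply(1:8, backoff_wait, numeric(1))
waits
```

With the defaults of `cas_ia_save()`, early attempts are dominated by pause_min, while later attempts spread randomly up to pause_cap.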
## Not run:
if (interactive()) {
  # Once the usual parameters are set with `cas_set_options()` it is generally
  # ok to just let it get urls from the database and let it run without any
  # additional parameter.
  cas_ia_save()
}
## End(Not run)
Adds a column with n words before and after the selected pattern to see keywords in context
cas_kwic( corpus, pattern, text = text, words_before = 5, words_after = 5, same_sentence = TRUE, period_at_end_of_sentence = TRUE, ignore_case = TRUE, regex = TRUE, full_words_only = FALSE, full_word_with_partial_match = TRUE, pattern_column_name = pattern )
corpus |
A textual corpus as a data frame. |
pattern |
A pattern, typically of one or more words, to be used to break text. Should be of length 1 or length equal to the number of rows. |
text |
Defaults to text. The unquoted name of the column of the corpus data frame to be used for matching. |
words_before |
Integer, defaults to 5. Number of columns to include in
the |
words_after |
Integer, defaults to 5. Number of columns to include in
the |
same_sentence |
Logical, defaults to TRUE. If TRUE, before and after include only words found in the sentence including the matched pattern. |
period_at_end_of_sentence |
Logical, defaults to TRUE. If TRUE, a period
(".") is always included at the end of a sentence. Relevant only if
|
ignore_case |
Defaults to TRUE. |
regex |
Defaults to TRUE. Treat pattern as regex. |
full_words_only |
Defaults to FALSE. If FALSE, pattern is counted even when it is found in the middle of a word (e.g. if FALSE, "ratio" would be counted as match in the word "irrational"). |
full_word_with_partial_match |
Defaults to TRUE. If TRUE, if there is a
partial match of the pattern, the |
pattern_column_name |
Defaults to 'pattern'. The unquoted name of the column to be used for the word in the output. |
A data frame (a tibble), with the same columns as input, plus three columns: before, pattern, and after. Only rows where the pattern is found are included.
cas_kwic( corpus = tifkremlinen::kremlin_en, pattern = c("china", "india") )
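The idea behind keywords in context can be shown with a small base R sketch working on a single string. `kwic_sketch()` is a hypothetical helper; it ignores sentence boundaries and the other options documented above, so it should be read as an illustration of the before/pattern/after split rather than as the package's algorithm.

```r
# Hypothetical helper: keyword in context for a single string, base R only.
kwic_sketch <- function(text, pattern, words_before = 5, words_after = 5) {
  words <- strsplit(text, "\\s+")[[1]]
  hits <- grep(pattern, words, ignore.case = TRUE)
  rows <- lapply(hits, function(i) {
    # Indices of up to `words_before` words preceding the match.
    before_idx <- utils::tail(seq_len(i - 1), words_before)
    # Indices of up to `words_after` words following the match.
    after_idx <- utils::head(seq_len(length(words) - i) + i, words_after)
    data.frame(
      before = paste(words[before_idx], collapse = " "),
      pattern = words[i],
      after = paste(words[after_idx], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  # Returns NULL when the pattern is not found.
  do.call(rbind, rows)
}

kwic_df <- kwic_sketch(
  text = "Trade between China and India grew faster than expected last year",
  pattern = "india",
  words_before = 2,
  words_after = 3
)
kwic_df
```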
Adds a column with n words before and after the selected pattern to see keywords in context
cas_kwic_single_pattern( corpus, pattern, text = text, words_before = 5, words_after = 5, same_sentence = TRUE, period_at_end_of_sentence = TRUE, ignore_case = TRUE, regex = TRUE, full_words_only = FALSE, full_word_with_partial_match = TRUE, pattern_column_name = pattern )
corpus |
A textual corpus as a data frame. |
pattern |
A pattern, typically of one or more words, to be used to break text. Should be of length 1 or length equal to the number of rows. |
text |
Defaults to text. The unquoted name of the column of the corpus data frame to be used for matching. |
words_before |
Integer, defaults to 5. Number of columns to include in
the |
words_after |
Integer, defaults to 5. Number of columns to include in
the |
same_sentence |
Logical, defaults to TRUE. If TRUE, before and after include only words found in the sentence including the matched pattern. |
period_at_end_of_sentence |
Logical, defaults to TRUE. If TRUE, a period
(".") is always included at the end of a sentence. Relevant only if
|
ignore_case |
Defaults to TRUE. |
regex |
Defaults to TRUE. Treat pattern as regex. |
full_words_only |
Defaults to FALSE. If FALSE, pattern is counted even when it is found in the middle of a word (e.g. if FALSE, "ratio" would be counted as match in the word "irrational"). |
full_word_with_partial_match |
Defaults to TRUE. If TRUE, if there is a
partial match of the pattern, the |
pattern_column_name |
Defaults to 'pattern'. The unquoted name of the column to be used for the word in the output. |
A data frame (a tibble), with the same columns as input, plus three columns: before, pattern, and after. Only rows where the pattern is found are included.
cas_kwic_single_pattern( corpus = tifkremlinen::kremlin_en, pattern = "West" )
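The difference between partial and full-word matching described for full_words_only can be checked with plain regular expressions. Using `\\b` word boundaries here is an assumption about one way to obtain full-word matching, not necessarily how the package does it.

```r
# "ratio" occurs inside "irrational", so a plain regex match succeeds.
partial_match <- grepl("ratio", "irrational", ignore.case = TRUE, perl = TRUE)

# Wrapping the pattern in word boundaries restricts matches to full words.
full_word_match <- grepl("\\bratio\\b", "irrational", ignore.case = TRUE, perl = TRUE)
full_word_hit <- grepl("\\bratio\\b", "the ratio is high", ignore.case = TRUE, perl = TRUE)

c(partial_match, full_word_match, full_word_hit)
```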
Read datasets created with cas_write_dataset
cas_read_corpus( ..., update = FALSE, path = NULL, file_format = "parquet", partition = NULL, token = "full_text", corpus_folder = "corpus" )
... |
Passed to |
update |
Logical, defaults to FALSE. If FALSE, just checks if relevant corpus has been previously stored. If TRUE, it checks if more recent contents are available in the local database. |
path |
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder. |
file_format |
Defaults to "parquet". Currently, other options are not implemented. |
partition |
Defaults to NULL. If NULL, the parquet file is not
partitioned. "year" is a common alternative: if set to "year", the parquet
file is partitioned by year. If a |
token |
Defaults to "full_text", which does not tokenise the text
column. If different from |
A dataset as ArrowObject
## Not run:
cas_read_corpus()
## End(Not run)
Read contents data from local database
cas_read_db_contents_data(db_connection = NULL, db_folder = NULL, ...)
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
Read contents from local database
cas_read_db_contents_id(db_connection = NULL, db_folder = NULL, ...)
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
A data frame with three columns and data stored in the contents_id
table of the local database. The data frame has zero rows if the database
does not exist or no data was previously stored there.
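The "zero rows if nothing was stored" behaviour can be sketched with DBI and an in-memory SQLite database. This assumes the DBI and RSQLite packages are installed; `read_table_or_empty()` and the column names are hypothetical, since the three columns are not named here.

```r
library(DBI)

# Hypothetical helper: read a table if it exists, otherwise return an
# empty data frame with the expected columns (column names are assumptions).
read_table_or_empty <- function(con, table, columns = c("id", "url", "link_text")) {
  if (!DBI::dbExistsTable(con, table)) {
    empty <- as.data.frame(matrix(nrow = 0, ncol = length(columns)))
    names(empty) <- columns
    return(empty)
  }
  DBI::dbReadTable(con, table)
}

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Nothing stored yet: a data frame with zero rows is returned.
before_df <- read_table_or_empty(con, "contents_id")

DBI::dbWriteTable(
  con,
  "contents_id",
  data.frame(id = 1, url = "https://example.com/a/", link_text = "A")
)
after_df <- read_table_or_empty(con, "contents_id")

DBI::dbDisconnect(con)
```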
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  db_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()

urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)

cas_write_db_contents(urls = urls_df)
cas_read_db_contents_id()
Read index from local database
cas_read_db_download( index = FALSE, id = NULL, batch = "latest", status = 200L, db_connection = NULL, db_folder = NULL, ... )
batch |
Defaults to "latest": returns only the path to the file with the highest batch identifier available. Valid values are: "latest", "all", or a numeric identifier corresponding to the desired batch. |
status |
Defaults to 200. Keeps only files downloaded with the given status (can be more than one, given as a vector). If NULL, no filter based on status is applied. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
A data frame with three columns and data stored in the index_id
table of the local database. The data frame has zero rows if the database
does not exist or no data was previously stored there.
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  db_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()

urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)

cas_write_db_index(urls = urls_df)
cas_read_db_index()
Read status on the Internet Archive of given URLs
cas_read_db_ia(db_connection = NULL, db_folder = NULL, ...)
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
Read identifiers to be ignored from the local database
cas_read_db_ignore_id( db_connection = NULL, db_folder = NULL, index_group = NULL, disconnect_db = TRUE, ... )
A data frame with a single column, id
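One plausible use of such a data frame of identifiers is to exclude the listed rows from further processing. The sketch below uses made-up data and base R subsetting; how castarter itself applies the ignore list is not shown here and is an assumption.

```r
# Made-up data: five urls and an ignore list with a single `id` column.
urls_df <- data.frame(
  id = 1:5,
  url = paste0("https://example.com/news/", 1:5),
  stringsAsFactors = FALSE
)
ignore_df <- data.frame(id = c(2, 4))

# Drop rows whose id is on the ignore list.
kept_df <- urls_df[!urls_df$id %in% ignore_df$id, , drop = FALSE]
kept_df
```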
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "cas_read_db_ignore_id"),
  db_folder = fs::path(tempdir(), "R", "cas_read_db_ignore_id"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()

cas_write_db_ignore_id(id = sample(x = 1:100, size = 10))
cas_read_db_ignore_id()
Read index from local database
cas_read_db_index( db_connection = NULL, db_folder = NULL, index_group = NULL, ... )
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
A data frame with three columns and data stored in the index_id
table of the local database. The data frame has zero rows if the database
does not exist or no data was previously stored there.
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  db_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()

urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)

cas_write_db_index(urls = urls_df)
cas_read_db_index()
Read urls stored in the local database
cas_read_db_urls( index = FALSE, db_connection = NULL, db_folder = NULL, index_group = NULL, ... )
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
Reads data from local database
cas_read_from_db( table, db_folder = NULL, db_connection = NULL, disconnect_db = FALSE, ... )
table |
Name of the table. See readme for details. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
... |
Passed to |
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_set_db(), cas_set_db_folder(), cas_write_to_db()
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()

urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)

cas_write_to_db(
  df = urls_df,
  table = "index_id"
)

cas_read_from_db(table = "index_id")
Delete a specific table from database
cas_reset_db( table, db_connection = NULL, disconnect_db = FALSE, db_folder = NULL, ask = TRUE, ... )
table |
Name of the table. You can use
|
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to database open. |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
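The confirm-before-delete pattern controlled by the ask argument can be sketched with DBI. This assumes the DBI and RSQLite packages are installed; `remove_table_sketch()` is a hypothetical helper and does not reproduce the package's own prompts or messages.

```r
library(DBI)

# Hypothetical helper, not castarter's implementation: remove a table,
# asking for confirmation first unless `ask = FALSE`.
remove_table_sketch <- function(con, table, ask = TRUE) {
  if (!DBI::dbExistsTable(con, table)) {
    return(invisible(FALSE))
  }
  if (ask) {
    if (!interactive()) {
      stop("Confirmation needed; set ask = FALSE in non-interactive sessions.")
    }
    confirmed <- isTRUE(utils::askYesNo(paste0("Delete table '", table, "'?")))
    if (!confirmed) {
      return(invisible(FALSE))
    }
  }
  DBI::dbRemoveTable(con, table)
  invisible(TRUE)
}

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "index_id", data.frame(id = 1:3))

removed <- remove_table_sketch(con, "index_id", ask = FALSE)
still_there <- DBI::dbExistsTable(con, "index_id")

DBI::dbDisconnect(con)
```

Refusing to proceed in non-interactive sessions when `ask = TRUE` is a design choice of this sketch, made so that a script cannot delete data silently.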
Removes from the local database the table where extracted data are stored
cas_reset_db_contents_data( db_connection = NULL, db_folder = NULL, ask = TRUE, ... )
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
Removes from the local database the table where links to contents associated with their id are stored
cas_reset_db_contents_id( db_connection = NULL, db_folder = NULL, ask = TRUE, ... )
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
Removes from the local database all identifiers included in the ignore list
cas_reset_db_ignore_id(db_connection = NULL, db_folder = NULL, ask = TRUE, ...)
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "cas_reset_db_ignore_id"),
  db_folder = fs::path(tempdir(), "R", "cas_reset_db_ignore_id"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()

cas_write_db_ignore_id(id = sample(x = 1:100, size = 10))
cas_read_db_ignore_id()

cas_reset_db_ignore_id(ask = FALSE)
cas_read_db_ignore_id()
Removes from the local database the table where links to index urls are stored
cas_reset_db_index_id(db_connection = NULL, db_folder = NULL, ask = TRUE, ...)
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
Delete all files and database records for the contents pages of the current website
cas_reset_download_contents( batch = NULL, file_format = "html", db_connection = NULL, db_folder = NULL, ask = TRUE, ... )
batch |
Defaults to NULL. If given, only files and records related to the given batch are removed. If not given, all contents files are removed. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
Delete all files and database records for the index pages of the current website
cas_reset_download_index( batch = NULL, file_format = "html", db_connection = NULL, db_folder = NULL, ask = TRUE, ... )
batch |
Defaults to NULL. If given, only files and records related to the given batch are removed. If not given, all index files are removed. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
ask |
Logical, defaults to TRUE. If set to FALSE, the relevant table will be deleted without asking for confirmation from the user. |
... |
Passed to |
Restore files from compressed files
cas_restore( restore_to = NULL, restore_from = NULL, file_format = "tar.gz", index = FALSE, contents = FALSE, batch = NULL, db_connection = NULL, db_folder = NULL, ... )
restore_to |
Path to archive directory, defaults to NULL. If NULL, path is set to the project/website/archive folder. |
restore_from |
Path to archive directory, defaults to NULL. If NULL, path is set to the project/website/archive folder. |
file_format |
Defaults to "tar.gz", to ensure cross-platform compatibility. No other formats are supported at this stage. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
A path to the base folder where files are stored. Corresponds to restore_to if given, or to a temporary folder if restore_to is set to
NULL.
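The archive and restore round trip with the "tar.gz" format can be illustrated with the tools in base R's utils package. Folder and file names below are made up for the sketch, and the package's own archive layout may differ.

```r
# Create a small folder of files to archive (names are illustrative only).
source_dir <- file.path(tempdir(), "cas_archive_sketch")
dir.create(source_dir, showWarnings = FALSE)
writeLines("<html></html>", file.path(source_dir, "1.html"))

# Archive the folder as tar.gz, using paths relative to the temporary folder.
archive_file <- file.path(tempdir(), "cas_archive_sketch.tar.gz")
old_wd <- setwd(tempdir())
utils::tar(
  archive_file,
  files = "cas_archive_sketch",
  compression = "gzip",
  tar = "internal"
)
setwd(old_wd)

# Restore the archive into a separate folder and list what was extracted.
restore_to <- file.path(tempdir(), "cas_restore_sketch")
dir.create(restore_to, showWarnings = FALSE)
utils::untar(archive_file, exdir = restore_to)

restored_files <- list.files(restore_to, recursive = TRUE)
restored_files
```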
Set database connection settings for the session
cas_set_db( db_settings = NULL, driver = NULL, host = NULL, port, database, user, pwd )
db_settings |
A list of database connection settings (see example) |
driver |
A database driver. Common database drivers include |
host |
Host address, e.g. "localhost". |
port |
Port to use to connect to the database. |
database |
Database name. |
user |
Database user name. |
pwd |
Password for the database user. |
A list with all given parameters (invisibly).
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db_folder(), cas_write_to_db()
if (interactive()) {
  # Settings can be provided either as a list
  db_settings <- list(
    driver = "MySQL",
    host = "localhost",
    port = 3306,
    database = "castarter",
    user = "secret_username",
    pwd = "secret_password"
  )
  cas_set_db(db_settings)

  # or as parameters
  cas_set_db(
    driver = "MySQL",
    host = "localhost",
    port = 3306,
    database = "castarter",
    user = "secret_username",
    pwd = "secret_password"
  )
}
Consider using a folder outside your current project directory, e.g. cas_set_db_folder("~/R/cas_data/"): you will be able to use the same database in different projects, and prevent database files from being synced if you use services such as Nextcloud or Dropbox.
cas_set_db_folder(path = NULL, ...)
cas_get_db_folder(path = NULL, ...)
path |
A path to a location used for storing the database. If the folder does not exist, it will be created. |
The path to the database folder, if previously set; the same path as given to the function; or the default, cas_data, if none is given.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_write_to_db()
cas_set_db_folder(fs::path(fs::path_home_r(), "R", "cas_data"))
cas_set_db_folder(fs::path(fs::path_temp(), "cas_data"))
cas_get_db_folder()
Your project folder can be anywhere on your file system. Considering that this is where a possibly very large number of html files will be downloaded, it is usually preferable to choose a location that is not included in live backups. These settings determine the names given to these hierarchical folders: the website folder will be under the project folder, which will be under the base_folder.
cas_set_options( project = NULL, website = NULL, use_db = TRUE, base_folder = NULL, db_type = "SQLite", db_folder = NULL )
project |
Defaults to NULL. Project name, can be set once per session
with |
website |
Defaults to NULL. Website name, can be set once per session
with |
use_db |
Defaults to TRUE. If TRUE, stores information about the download process and extracted text in a local database. |
base_folder |
Defaults to NULL, can be set once per session with
|
db_folder |
Defaults to NULL. Can be set once per session with
|
Nothing, used for its side effects (setting options).
Other settings:
cas_get_options()
cas_set_options(base_folder = fs::path(fs::path_temp(), "castarter_data"))
cas_options_list <- cas_get_options()
cas_options_list
For details on parameters, see https://davidgohel.github.io/ggiraph/articles/offcran/using_ggiraph.html
cas_show_barchart_ggiraph( ggobj, data_id = NULL, tooltip = NULL, position = "stack" )
ggobj |
A ggplot2 object, typically generated with |
data_id |
Defaults to NULL. If given, unquoted name of column, passed to ggiraph. |
tooltip |
Defaults to NULL. If given, unquoted name of column, passed to ggiraph. |
position |
Defaults to "stack". Available values include "dodge". |
A girafe/htmlwidget object
Creates barchart with ggplot2
cas_show_barchart_ggplot2(ggobj, position = "stack")
ggobj |
A ggplot2 object, typically generated with |
position |
Defaults to "stack". Available values include "dodge". |
A ggplot2 object.
cas_count( corpus = tifkremlinen::kremlin_en, pattern = c("putin", "medvedev") ) |> cas_summarise(period = "year") |> cas_show_gg_base() |> cas_show_barchart_ggplot2(position = "stack")
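The visual difference between the two position values can be reproduced directly in ggplot2. This sketch assumes the ggplot2 package is installed and uses made-up counts; it is not the code used internally by `cas_show_barchart_ggplot2()`.

```r
library(ggplot2)

# Made-up yearly counts, shaped like a summarised count data frame.
count_df <- data.frame(
  date = rep(c(2019, 2020), each = 2),
  pattern = rep(c("putin", "medvedev"), times = 2),
  n = c(10, 4, 7, 6)
)

# Aesthetics only, no geometry yet.
base_plot <- ggplot(count_df, aes(x = date, y = n, fill = pattern))

# "stack" piles the patterns on top of each other for each period;
# "dodge" places them side by side.
stacked_plot <- base_plot + geom_col(position = "stack")
dodged_plot <- base_plot + geom_col(position = "dodge")
```

Stacked bars make the total per period easy to read, while dodged bars make it easier to compare patterns within the same period.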
Creates base ggplot2 object to be used by ggplot or ggiraph
cas_show_gg_base( count_df, group_by = date, n_column_name = n, pattern_column_name = pattern, group_as_factor = FALSE, font_base_size = 14 )
group_by |
Defaults to date. The unquoted name of the column to be used for grouping (e.g. date, or doc_id, or source, etc.) |
n_column_name |
Defaults to 'n'. The unquoted name of the column to be used for the count in the output. |
pattern_column_name |
Defaults to 'pattern'. The unquoted name of the column to be used for the word in the output. |
group_as_factor |
Defaults to FALSE. If TRUE, the grouping column is forced into a factor, otherwise it is kept in its current format (e.g. date, or numeric). |
A ggplot2 object with aesthetics set, but no geometry.
cas_count( corpus = tifkremlinen::kremlin_en, pattern = c("putin", "medvedev") ) |> cas_summarise(period = "year") |> cas_show_gg_base() |> cas_show_barchart_ggplot2(position = "dodge")
Create dygraphs based on a data frame typically generated with cas_count()
cas_show_ts_dygraph( count_df, date_column_name = date, n_column_name = n, pattern_column_name = pattern, range_selector = TRUE )
count_df |
count_df <- castarter::cas_count(
  corpus = castarter::cas_demo_corpus,
  pattern = c("russia", "moscow")
) %>%
  cas_summarise(before = 15, after = 15)
cas_show_ts_dygraph(count_df)
Summarise word counts for a given time period, typically calculated with cas_count()
cas_summarise( count_df, date_column_name = date, n_column_name = n, pattern_column_name = pattern, period = NULL, f = mean, period_summary_function = sum, every = 1L, before = 0L, after = 0L, complete = FALSE, auto_convert = FALSE )
count_df |
A data frame. Must include at least a column with a date or date-time column and a column with number of occurrences for the given time. |
period |
Defaults to NULL. A string describing the time unit to be used for summarising. Possible values include "year", "quarter", "month", "day", "hour", "minute", "second", "millisecond". |
f |
Defaults to |
period_summary_function |
Defaults to |
every |
The number of periods to group together. For example, if the period was set to |
before, after |
The number of values before or after the current element to
include in the sliding window. Set to |
complete |
Should the function be evaluated on complete windows only? If |
auto_convert |
Defaults to FALSE. If FALSE, the date column is returned using the same format as the input; the minimum value in the given group is used for reference (e.g. all values for January 2022 are summarised as 2022-01-01 if the data were originally given as dates). If TRUE, it tries to adapt the output to the most intuitive corresponding type: for year, a numeric column with only the year number; for quarter, the format 2022.1; for month, the format 2022-01. |
date |
Defaults to |
n |
Unquoted to |
A data frame with two columns: the name of the period, and the same name originally used for n.
## Not run:
# this assumes dates are provided in a column called date
corpus_df %>%
  cas_count(
    pattern = "example",
    group_by = date
  ) %>%
  cas_summarise(period = "year")
## End(Not run)
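How a per-period summary and a before/after sliding window combine can be illustrated in base R. This is only a minimal sketch, not castarter's implementation: it assumes counts are first summed within each period and the window function is then applied across periods, it evaluates incomplete windows at the edges (as with complete = FALSE), and both the data and the slide_window() helper are hypothetical.

```r
# Hypothetical counts by date (invented for illustration)
count_df <- data.frame(
  date = as.Date(c("2020-03-01", "2020-07-15", "2021-01-10",
                   "2021-06-30", "2022-02-02")),
  n = c(2, 4, 1, 5, 3)
)

# Step 1: summarise by period (here: year), using sum
count_df$year <- as.integer(format(count_df$date, "%Y"))
by_year <- aggregate(n ~ year, data = count_df, FUN = sum)

# Step 2: apply a function over a sliding window of periods.
# Hypothetical helper: windows are truncated at the edges of the series.
slide_window <- function(x, f = mean, before = 0L, after = 0L) {
  vapply(seq_along(x), function(i) {
    idx <- seq(max(1L, i - before), min(length(x), i + after))
    f(x[idx])
  }, numeric(1))
}

by_year$n_rolling <- slide_window(by_year$n, f = mean, before = 1L, after = 1L)
by_year
```

With before = 0 and after = 0 the window contains only the current period, so the output equals the plain per-period sums.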
Currently supports updates only when re-downloading index URLs is expected to bring new articles. It takes the first URLs of each index group, and keeps downloading new index pages as long as new links are found on each page. If no new link is found, it stops downloading and moves on to the next index group.
cas_update( extract_links_partial, extractors, post_processing = NULL, wait = 3, user_agent = NULL, ... )
extract_links_partial |
A partial function, typically created with
|
extractors |
A named list of functions. See examples for details. |
post_processing |
Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns. |
wait |
Defaults to 3. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue. |
user_agent |
Defaults to NULL. If given, passed to download method. |
... |
Passed to |
# Example of extract_links_partial:
extract_links_partial <- purrr::partial(
  .f = cas_extract_links,
  reverse_order = TRUE,
  container = "div",
  container_class = "hentry h-entry hentry_event",
  exclude_when = c("/photos", "/videos"),
  domain = "http://en.kremlin.ru/"
)
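The stopping rule described for cas_update() (keep going through index pages while they yield new links, stop at the first page with none) can be sketched in base R. Nothing is downloaded here: index_pages is a hypothetical list standing in for the links extracted from successive index pages of one index group, and known_links stands in for links already stored.

```r
# Hypothetical links found on successive index pages, newest first
index_pages <- list(
  c("/a5", "/a4"),
  c("/a3", "/a2"),
  c("/a1", "/a0"),
  c("/z9")          # never reached: the loop stops at the previous page
)
# Hypothetical links already stored from earlier runs
known_links <- c("/a0", "/a1", "/a2")

new_links <- character(0)
pages_checked <- 0L
for (page_links in index_pages) {
  pages_checked <- pages_checked + 1L
  fresh <- setdiff(page_links, c(known_links, new_links))
  if (length(fresh) == 0) {
    break  # no new link on this page: stop and move to the next index group
  }
  new_links <- c(new_links, fresh)
}

new_links
pages_checked
```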
Export the textual dataset for the current website
cas_write_corpus( corpus = NULL, to_lower = FALSE, drop_na = TRUE, drop_empty = TRUE, date = date, text = text, tif_compliant = FALSE, file_format = "parquet", partition = NULL, token = "full_text", corpus_folder = "corpus", path = NULL, db_connection = NULL, db_folder = NULL, ... )
corpus |
Defaults to NULL. If NULL, retrieves corpus from the current
website with |
to_lower |
Defaults to FALSE. Whether to convert tokens to lowercase.
Passed to |
drop_na |
Defaults to TRUE. If TRUE, items that have NA in their |
drop_empty |
Defaults to TRUE. If TRUE, items that have empty elements
("") in their |
date |
Unquoted date column, defaults to |
text |
Unquoted text column, defaults to |
tif_compliant |
Defaults to FALSE. If TRUE, it ensures that the first column is a character vector named "doc_id" and that the second column is a character vector named "text". See https://docs.ropensci.org/tif/ for details |
file_format |
Defaults to "parquet". Currently, other options are not implemented. |
partition |
Defaults to NULL. If NULL, the parquet file is not
partitioned. "year" is a common alternative: if set to "year", the parquet
file is partitioned by year. If a |
token |
Defaults to "full_text", which does not tokenise the text
column. If different from |
path |
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
... |
Passed to |
If some IDs are already present in the database, only the new ones are appended: IDs are expected to be unique.
cas_write_db_contents_data( contents_df, overwrite = FALSE, db_connection = NULL, disconnect_db = FALSE, quiet = FALSE, check_previous = TRUE, ... )
overwrite |
Logical, defaults to FALSE. If TRUE, checks if matching data are previously held in the table and overwrites them. This should be used with caution, as it may overwrite completely the selected table. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to the database open. |
quiet |
Defaults to FALSE. If set to TRUE, messages on number of lines added are not shown. |
check_previous |
Defaults to TRUE. If set to FALSE, the given input is stored in the database without checking if the same id had already been stored. |
... |
Passed to |
Check for consistency in database columns: if new columns do not match previous columns, it throws an error.
Invisibly returns only new rows added.
If some URLs are already included in the database, it appends only the new ones: URLs are expected to be unique.
cas_write_db_contents_id( urls, overwrite = FALSE, db_connection = NULL, disconnect_db = FALSE, quiet = FALSE, check_previous = TRUE, ... )
urls |
A data frame with five columns, such as
|
overwrite |
Logical, defaults to FALSE. If TRUE, checks if matching data are previously held in the table and overwrites them. This should be used with caution, as it may overwrite completely the selected table. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to the database open. |
quiet |
Defaults to FALSE. If set to TRUE, messages on number of lines added are not shown. |
check_previous |
Defaults to TRUE. If set to FALSE, the given input is stored in the database without checking if the same url had already been stored. |
... |
Passed to |
Invisibly returns only new rows added.
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  db_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()
urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)
cas_write_db_contents_id(urls = urls_df)
cas_read_db_contents_id()
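The "append only the new ones" behaviour can be pictured with plain data frames. The following is a rough base R sketch under the assumption that uniqueness is checked on the url column; it does not touch a database and is not the package's own code, and the data frames are invented.

```r
# Hypothetical rows already stored
existing <- data.frame(
  id = 1:3,
  url = paste0("https://www.example.com/news/", 1:3)
)
# Hypothetical input, partly overlapping with what is stored
incoming <- data.frame(
  id = 2:5,
  url = paste0("https://www.example.com/news/", 2:5)
)

# Keep only rows whose url is not stored yet, then append them
new_rows <- incoming[!incoming$url %in% existing$url, , drop = FALSE]
stored <- rbind(existing, new_rows)

new_rows
```

Only new_rows corresponds to what the documentation describes as being returned invisibly.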
There are two main use cases for this function:
A number of the downloaded files turned out to be irrelevant. Rather than delete every trace of them, it may be preferable to just ignore them, so that they are not processed when extracting data.
URLs originally included for download, but not yet downloaded, should be ignored and not downloaded. This may or may not be a temporary arrangement, but it is considered useful to keep the URLs in the database.
cas_write_db_ignore_id( id, db_folder = NULL, db_connection = NULL, disconnect_db = FALSE, ... )
cas_ignore_id( id, db_folder = NULL, db_connection = NULL, disconnect_db = FALSE, ... )
id |
A vector of ids. Rows with the given ids are added to the ignore table. |
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "cas_write_db_ignore_id"),
  db_folder = fs::path(tempdir(), "R", "cas_write_db_ignore_id"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()
cas_write_db_ignore_id(id = sample(x = 1:100, size = 10))
cas_read_db_ignore_id()
If some URLs are already included in the database, it appends only the new ones: URLs are expected to be unique.
cas_write_db_index( urls, overwrite = FALSE, db_connection = NULL, disconnect_db = FALSE, ... )
urls |
A data frame with three columns, with the same name and type as
|
overwrite |
Logical, defaults to FALSE. If TRUE, checks if matching data are previously held in the table and overwrites them. This should be used with caution, as it may overwrite completely the selected table. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to the database open. |
... |
Passed to |
Invisibly returns only new rows added.
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  db_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()
urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)
cas_write_db_index(urls = urls_df)
cas_read_db_index()
Write index or contents urls directly to the local database
cas_write_db_urls( urls, index = FALSE, overwrite = FALSE, db_connection = NULL, disconnect_db = FALSE, quiet = FALSE, check_previous = TRUE, ... )
urls |
A data frame with five columns, such as
|
index |
Logical, defaults to FALSE. If TRUE, downloaded files will be
considered |
overwrite |
Logical, defaults to FALSE. If TRUE, checks if matching data are previously held in the table and overwrites them. This should be used with caution, as it may overwrite completely the selected table. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to the database open. |
quiet |
Defaults to FALSE. If set to TRUE, messages on number of lines added are not shown. |
check_previous |
Defaults to TRUE. If set to FALSE, the given input is stored in the database without checking if the same url had already been stored. |
... |
Passed to |
Generic function for writing to database
cas_write_to_db( df, table, overwrite = FALSE, db_connection = NULL, disconnect_db = FALSE, ... )
df |
A data frame. Must correspond with the type of data expected for each table. |
table |
Name of the table. See readme for details. |
overwrite |
Logical, defaults to FALSE. If TRUE, checks if matching data are previously held in the table and overwrites them. This should be used with caution, as it may overwrite completely the selected table. |
db_connection |
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example). |
disconnect_db |
Defaults to FALSE. If FALSE, leaves the connection to the database open. |
... |
Passed to |
If successful, invisibly returns the same data frame provided as input and written to the database. Silently returns NULL if nothing is added, e.g. because use_db is set to FALSE.
Other database functions: cas_check_db_folder(), cas_check_use_db(), cas_connect_to_db(), cas_create_db_folder(), cas_disable_db(), cas_disconnect_from_db(), cas_enable_db(), cas_get_db_settings(), cas_read_from_db(), cas_set_db(), cas_set_db_folder()
cas_set_options(
  base_folder = fs::path(tempdir(), "R", "castarter_data"),
  project = "example_project",
  website = "example_website"
)
cas_enable_db()
urls_df <- cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)
cas_write_to_db(
  df = urls_df,
  table = "index_id"
)
Empty data frame with the same format as data stored in the index_id table
casdb_empty_index_id
A data frame with 0 rows and 3 columns:
Numeric. Column meant for unique integer identifier corresponding to a unique url
Character. A url.
Character. A textual string, by default index.
Helps you define the parameters you need for building index urls
cass_build_urls()
Nothing, but prints to the console the function call as created in the Shiny app.
Nothing, called for interactive use.
## Not run:
if (interactive()) {
  cass_build_urls()
}
## End(Not run)
Combines a vector of words into a string to be used for regex matching.
cass_combine_into_pattern(words, full_words_only = TRUE)
words |
A character vector of words to be combined for string matching. |
full_words_only |
Logical, defaults to TRUE. If TRUE, the given words are matched only when they appear as separate words. |
A character vector of length one, ready to be used for regex matching.
words <- c("dogs", "cats", "horses")
cass_combine_into_pattern(words)
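The idea of combining words into a single pattern that only matches whole words can be sketched in base R. This is an illustration under assumptions, not the function's actual output format: combine_words() is a hypothetical helper that wraps each word in \b word boundaries and joins the alternatives with "|".

```r
# Hypothetical helper: builds one regex out of several words
combine_words <- function(words, full_words_only = TRUE) {
  if (full_words_only) {
    # \\b marks a word boundary, so "cats" does not match inside "catsup"
    words <- paste0("\\b", words, "\\b")
  }
  paste(words, collapse = "|")
}

strict <- combine_words(c("dogs", "cats", "horses"))
loose <- combine_words(c("dogs", "cats", "horses"), full_words_only = FALSE)

texts <- c("I like cats", "catsup is a sauce")
grepl(strict, texts, perl = TRUE)
grepl(loose, texts, perl = TRUE)
```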
A minimal shiny app that demonstrates the functioning of related modules
cass_download_csv_app(df, type)
df |
A data frame to be exported as csv. |
A shiny app
count_df <- castarter::cas_count(
  corpus = castarter::cas_demo_corpus,
  pattern = c("russia", "moscow")
) %>%
  cas_summarise(before = 15, after = 15)
# cass_download_csv_app(count_df)
Takes a character vector and returns it with matches of pattern wrapped in html tags used for highlighting
cass_highlight(string, pattern, ignore_case = TRUE)
string |
A character vector. |
ignore_case |
Defaults to TRUE. |
pattern |
Pattern to match. |
cass_highlight(
  string = c(
    "The R Foundation for Statistical Computing",
    "R is free software and comes with ABSOLUTELY NO WARRANTY"
  ),
  pattern = "foundation|software|warranty"
)
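Wrapping matches in HTML tags can be approximated in base R with a capture group and a back-reference. The sketch below is not the package's implementation: highlight_matches() is a hypothetical helper, and the choice of the mark tag is an assumption, since the documentation does not state which tag is used.

```r
# Hypothetical helper: wraps every match of `pattern` in <mark> tags
highlight_matches <- function(string, pattern, ignore_case = TRUE) {
  gsub(
    pattern = paste0("(", pattern, ")"),  # capture the whole match
    replacement = "<mark>\\1</mark>",     # re-insert it between tags
    x = string,
    ignore.case = ignore_case,
    perl = TRUE
  )
}

highlighted <- highlight_matches(
  string = c(
    "The R Foundation for Statistical Computing",
    "R is free software and comes with ABSOLUTELY NO WARRANTY"
  ),
  pattern = "foundation|software|warranty"
)
highlighted
```

Because the matched text is re-inserted through the back-reference, its original capitalisation is preserved even when matching ignores case.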
A minimal shiny app that demonstrates the functioning of related modules
cass_show_ts_dygraph_app(count_df)
count_df |
A dataframe with three columns ( |
A shiny app
count_df <- castarter::cas_count(
  corpus = castarter::cas_demo_corpus,
  pattern = c("russia", "moscow")
) %>%
  cas_summarise(before = 15, after = 15)
# cass_show_ts_dygraph_app(count_df)
Split string into multiple inputs
cass_split_string(string, squish = TRUE, to_lower = TRUE, to_regex = FALSE)
string |
A text string, typically a user input in a shiny app. |
to_regex |
Defaults to FALSE. If TRUE collapses the split string,
separating each element with |
A character vector
cass_split_string("dogs, cats, horses")
cass_split_string(string = "dogs, cats, horses", to_regex = TRUE)
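The general logic of turning a single user input into several search terms can be sketched in base R. The details below are assumptions rather than documented behaviour: split_input() is a hypothetical helper, the comma is assumed to be the delimiter (as in the example input), and "|" is assumed to be the separator used when collapsing into a regex.

```r
# Hypothetical helper mirroring the squish / to_lower / to_regex options
split_input <- function(string, squish = TRUE, to_lower = TRUE, to_regex = FALSE) {
  parts <- strsplit(string, ",", fixed = TRUE)[[1]]
  if (squish) {
    # trim the ends and collapse repeated whitespace inside each element
    parts <- gsub("\\s+", " ", trimws(parts))
  }
  if (to_lower) {
    parts <- tolower(parts)
  }
  if (to_regex) {
    parts <- paste(parts, collapse = "|")
  }
  parts
}

split_input("Dogs, cats,   wild  horses")
split_input("Dogs, cats,   wild  horses", to_regex = TRUE)
```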