README.md
# aswan
[![Documentation Status](https://readthedocs.org/projects/aswan/badge/?version=latest)](https://aswan.readthedocs.io/en/latest)
[![codeclimate](https://img.shields.io/codeclimate/maintainability/endremborza/aswan.svg)](https://codeclimate.com/github/endremborza/aswan)
[![codecov](https://img.shields.io/codecov/c/github/endremborza/aswan)](https://codecov.io/gh/endremborza/aswan)
[![pypi](https://img.shields.io/pypi/v/aswan.svg)](https://pypi.org/project/aswan/)
[![DOI](https://zenodo.org/badge/doi/10.5281/zenodo.7477587.svg)](https://doi.org/10.5281/zenodo.7477587)
collect and organize data into a T1 data depot
named after the [Aswan Dam](https://en.wikipedia.org/wiki/Aswan_Dam)
Collect and compress data from the internet for later parsing
- quick, parallel, customizable to collect
- compressed to store
- quick to sync with a remote store
- sync to continue collecting
- sync to parse
- immutable collection
## To Setup a Remote
set the environment variables `ASWAN_AUTH_HEX` and `ASWAN_AUTH_PASS` according to the [zimmauth](https://github.com/endremborza/zimmauth) package, and `ASWAN_REMOTE` with the name of the default remote.
## Concepts
- objects
- saved by collection events
- events
- collection
- registration (v2: registration for parsing)
- (v2) parsing
- runs
- manual run vs automated run
- makes manual adding of urls easy but revertible
- has unique id
- generates events
- linked to a specific version of the code
- ideally commit hash + pip freeze
- statuses
- determined by base status + runs integrated
- contains
- what urls need to be collected
- (v2) what collected objects need to be parsed
- sqlite file, constantly trimmed
### Structure
- objects
- 00, 01, ...
- runs
- run-hash
- context.yaml
- commit-hash, pip-freeze, ...
- events.zip
- statuses
- status-hash
- context.yaml
- parent-status, integrated
- db.sqlite.zip
- current-run
- context.yaml
- events
- these to be compressed into ../runs
- status.sqlite
- there is a 'TEST' status
- cannot be integrated whatever is based on it
- a test run can be made on it...
when starting a run:
- check if current-run is empty
- if not, fail with
- find latest status
- if it has not integrated all past runs, create a new status that has
- start collection (+ registration)
- either stops or breaks, all events and objects are saved to disk
- if properly stops, move and compress stuff
- based on one that was the starter, and current run id
## Pre v1.0 laundry list
- parallelize push / pull
- parsing/connection/broken session error docs
- transferring / ignoring cookies
- template projects
- oddsportal
- updating thingy, based on latest match in season
- footy
- rotten
- boxoffice