# Data Processing Pipeline README
This repository contains a data processing pipeline, implemented with a Makefile and a Python script, that downloads, preprocesses, uploads, and checksums geospatial data related to nutrient load reduction.
## Prerequisites
Before running the pipeline, ensure you have the following prerequisites in place:
1. **Data Download**: You need to manually download the data from [here](https://figshare.com/articles/figure/DRP_NO3_TN_TP_rasters/14527638/1?file=31154728) and save it in the `data/` directory.
2. **Python Dependencies**: The preprocessing script requires Python and the following packages (see the install sketch after this list):
   - `geopandas`
   - any other packages imported by `process_data.py`
3. **AWS Credentials**: To upload results to an AWS S3 bucket, you should have AWS credentials configured on your machine.
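If the project does not pin its dependencies in a requirements file, a minimal environment can be set up roughly as follows (only `geopandas` is confirmed above; any additional packages are whatever `process_data.py` imports):
```bash
# Create an isolated environment and install the known dependency.
# Add any other packages that process_data.py imports.
python -m venv .venv
source .venv/bin/activate
pip install geopandas
```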
## Usage
### 1. Download and Unzip Data
Use the following command to download and unzip the data:
```bash
make unzip-limiting-nutrient
```
This command will download the data and unzip it into the `data/` directory.
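Under the hood this step amounts to unpacking the archive into `data/`; a rough shell equivalent is shown below. The archive filename is a placeholder, not necessarily what the Makefile uses:
```bash
# Sketch of the unzip step; adjust ARCHIVE to the file actually downloaded
# from figshare.
ARCHIVE=data/DRP_NO3_TN_TP_rasters.zip
unzip -o "$ARCHIVE" -d data/
```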
### 2. Preprocess Data
Before ingesting the data into your database, preprocess it using the Python script. Run the following command:
```bash
make process-limiting-nutrients
```
This command runs the `process_data.py` script, which preprocesses the data, including reprojecting it and calculating nutrient reduction percentages.
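If you want to bypass make, the script can also be invoked directly; this assumes the Makefile passes no extra arguments:
```bash
# Run the preprocessing script directly (assumption: no command-line
# arguments are required).
python process_data.py
```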
### 3. Upload Processed Data
To upload the processed data to an AWS S3 bucket, use the following command:
```bash
make upload_results
```
Make sure you have AWS credentials configured to access the specified S3 bucket.
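As a rough sketch of what the upload involves, the same result can be achieved with the AWS CLI; the file and bucket names below are hypothetical, and the actual target may use different paths or flags:
```bash
# One-time credential setup, if not already configured.
aws configure

# Copy the processed output to the S3 bucket. Both names are placeholders;
# check the Makefile for the real paths and the AWS_S3_BUCKET_URL value.
PROCESSED=data/processed_limiting_nutrient.geojson
BUCKET=s3://your-bucket/nutrient_load_reduction
aws s3 cp "$PROCESSED" "$BUCKET/"
```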
### 4. Generate Checksum
Generate a SHA-256 checksum for the processed data by running the following command:
```bash
make write_checksum
```
This command will calculate the checksum and save it in the `data_checksums/` directory.
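On Linux this boils down to something like the following (use `shasum -a 256` on macOS); the processed filename is a placeholder, and the Makefile's `DATA_DIR` and `checksums_dir` variables define the real paths:
```bash
# Compute a SHA-256 digest of the processed file and store it in the
# checksum directory. PROCESSED is a placeholder name.
PROCESSED=data/processed_limiting_nutrient.geojson
mkdir -p data_checksums
sha256sum "$PROCESSED" > "data_checksums/$(basename "$PROCESSED").sha256"
```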
## Configuration
You can configure the pipeline by modifying the variables at the top of the Makefile:
- `DATA_DIR`: Specify the directory where data files are stored.
- `checksums_dir`: Define the directory where checksum files will be saved.
- `AWS_S3_BUCKET_URL`: Set the AWS S3 bucket URL for uploading results.
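Besides editing the Makefile, make also accepts variable overrides on the command line, which is convenient for one-off runs; the bucket URL below is only an example:
```bash
# Override Makefile variables for a single invocation without editing the file.
make upload_results AWS_S3_BUCKET_URL=s3://your-bucket/nutrient_load_reduction
make write_checksum checksums_dir=data_checksums
```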
Feel free to adapt this pipeline to suit your specific data processing needs and directory structure.
**Note**: Make sure you have the necessary permissions and access to the data sources and AWS resources mentioned in this README before running the pipeline.