Publish AQI bulletins data on Datasette¶

This notebook publishes the AQI bulletins data we collected, using Datasette.

Steps:¶

  1. Import data into a SQLite Database
  2. Explore the SQLite db created
  3. Open the SQLite db in Datasette
  4. Deploy Datasette using Vercel

Examples
References

STEP 1: Import data into a SQLite Database¶

In [1]:
import sqlite_utils
import pandas as pd
In [107]:
# Time series data -- AQI bulletins for all Indian cities
time_series_df = pd.read_csv("C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/AllIndiaBulletins_Master_2024.csv")
In [108]:
# Clean: keep the relevant columns and rename them
time_series_df = time_series_df[['date', 'City', 'No. Stations', 'Air Quality', 'Index Value', 'Prominent Pollutant']]
time_series_df.columns = ['date', 'city', 'no_stations', 'aqi_category', 'aqi', 'prominent_pollutant']
# Normalise the inconsistently cased 'Very poor' category label
time_series_df = time_series_df.replace(to_replace='Very poor', value='Very Poor')
# Replace "OZONE" with "O3" in the 'prominent_pollutant' column
time_series_df['prominent_pollutant'] = time_series_df['prominent_pollutant'].astype(str).str.replace("OZONE", "O3", regex=False)
time_series_df['date'] = pd.to_datetime(time_series_df['date'])
In [109]:
time_series_df['year'] = time_series_df['date'].dt.year
time_series_df['month'] = time_series_df['date'].dt.strftime('%b')
In [116]:
duplicates = time_series_df[time_series_df.duplicated(subset=['date', 'city'], keep=False)]
duplicates
Out[116]:
date city no_stations aqi_category aqi prominent_pollutant year month
179252 2022-03-31 Aurangabad 1 Moderate 104 PM10 2022 Mar
179253 2022-03-31 Aurangabad 1 Poor 268 PM10 2022 Mar
179301 2022-04-01 Aurangabad 1 Very Poor 312 PM10 2022 Apr
179302 2022-04-01 Aurangabad 1 Moderate 108 PM10 2022 Apr
179670 2022-04-03 Aurangabad 1 Very Poor 317 PM10 2022 Apr
... ... ... ... ... ... ... ... ...
212132 2022-11-01 Aurangabad 1 Moderate 117 PM2.5 2022 Nov
212193 2022-11-02 Aurangabad 1 Moderate 107 PM10 2022 Nov
212194 2022-11-02 Aurangabad 1 Moderate 144 PM10 2022 Nov
212418 2022-11-03 Aurangabad 1 Moderate 169 PM10 2022 Nov
212419 2022-11-03 Aurangabad 1 Moderate 128 PM2.5 2022 Nov

68 rows × 8 columns

On these dates, the bulletins did not differentiate between the two Aurangabads, one in Maharashtra and one in Bihar, so the same date-city pair appears twice.

We remove these duplicates, keeping the first occurrence of each (date, city) pair, before inserting the data into the database.
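As a quick sanity check, here is a minimal sketch (not a cell from the original pipeline) verifying that the deduplication used below leaves no duplicate pairs; pandas' drop_duplicates keeps the first occurrence by default:

deduped = time_series_df.drop_duplicates(subset=["date", "city"])
# No (date, city) pair should appear more than once after deduplication
assert not deduped.duplicated(subset=["date", "city"]).any()
print(f"Dropped {len(time_series_df) - len(deduped)} duplicate rows")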

In [8]:
db = sqlite_utils.Database("C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/aqi-bulletins.db")
# Creates the SQLite file if it does not already exist, otherwise connects to it
In [120]:
# Add a table in the db and insert all rows from csv in it
table_name = "aqi_bulletins"
db[table_name].insert_all(
    time_series_df.drop_duplicates(subset=["date", "city"]).to_dict(orient="records"),
    alter=True,
    pk=("date", "city"),  # composite primary key, so no auto-created rowid
)
Out[120]:
<Table aqi_bulletins (date, city, no_stations, aqi_category, aqi, prominent_pollutant, year, month)>
In [121]:
# Keep date as a YYYY-MM-DD string (strip the time component added on insert)
db.execute("""
    UPDATE aqi_bulletins
    SET date = substr(date, 1, 10)
""")
db.conn.commit()
In [9]:
# To avoid facet timeout errors, create indexes on the columns we want to facet on
db.execute("CREATE INDEX IF NOT EXISTS idx_city ON aqi_bulletins(city)")
db.execute("CREATE INDEX IF NOT EXISTS idx_aqicategory ON aqi_bulletins(aqi_category)")
db.conn.commit()
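To check that SQLite can actually use these indexes for facet-style queries, here is a minimal sketch with EXPLAIN QUERY PLAN (the exact plan text varies across SQLite versions):

# Facet queries group by the faceted column, so the planner should report
# a scan using idx_city rather than a full table scan
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT city, count(*) FROM aqi_bulletins GROUP BY city"
).fetchall()
print(plan)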

STEP 2: Explore the SQLite db created¶

In [3]:
# See all tables in the db
db.tables
Out[3]:
[<Table aqi_bulletins (date, city, no_stations, aqi_category, aqi, prominent_pollutant, year, month)>]
In [4]:
for table_name in db.table_names():
    row_count = db[table_name].count
    print(f"{table_name}: {row_count} rows")
aqi_bulletins: 387686 rows
In [125]:
# Inspect schema
print(db["aqi_bulletins"].schema)
CREATE TABLE [aqi_bulletins] (
   [date] TEXT,
   [city] TEXT,
   [no_stations] INTEGER,
   [aqi_category] TEXT,
   [aqi] INTEGER,
   [prominent_pollutant] TEXT,
   [year] INTEGER,
   [month] TEXT,
   PRIMARY KEY ([date], [city])
)
In [126]:
# Show first 5 rows in the table
list(db["aqi_bulletins"].rows)[:5]
Out[126]:
[{'date': '2015-05-01',
  'city': 'Varanasi',
  'no_stations': 0,
  'aqi_category': 'Moderate',
  'aqi': 157,
  'prominent_pollutant': 'PM10',
  'year': 2015,
  'month': 'May'},
 {'date': '2015-05-01',
  'city': 'Kanpur',
  'no_stations': 0,
  'aqi_category': 'Moderate',
  'aqi': 175,
  'prominent_pollutant': 'PM10',
  'year': 2015,
  'month': 'May'},
 {'date': '2015-05-01',
  'city': 'Faridabad',
  'no_stations': 0,
  'aqi_category': 'Moderate',
  'aqi': 173,
  'prominent_pollutant': 'PM10',
  'year': 2015,
  'month': 'May'},
 {'date': '2015-05-01',
  'city': 'Ahmedabad',
  'no_stations': 0,
  'aqi_category': 'Moderate',
  'aqi': 168,
  'prominent_pollutant': 'PM2.5',
  'year': 2015,
  'month': 'May'},
 {'date': '2015-05-01',
  'city': 'Hyderabad',
  'no_stations': 0,
  'aqi_category': 'Moderate',
  'aqi': 189,
  'prominent_pollutant': 'PM2.5',
  'year': 2015,
  'month': 'May'}]
In [127]:
# Run SQL queries
## Max AQI in each month in Delhi
results = db.execute("""
    SELECT
        substr(date, 1, 7) AS month,
        MAX(aqi) AS max_aqi
    FROM
        aqi_bulletins
    WHERE
        city = 'Delhi'
    GROUP BY
        month
    ORDER BY
        month
""")
list(results)
Out[127]:
[('2015-08', 246),
 ('2015-09', 302),
 ('2016-03', 352),
 ('2016-10', 445),
 ('2016-11', 497),
 ('2017-04', 383),
 ('2017-05', 368),
 ('2017-12', 469),
 ('2018-06', 447),
 ('2019-02', 382),
 ('2019-07', 317),
 ('2021-01', 460)]
In [128]:
db.conn.close()

STEP 3: Open the SQLite db in Datasette¶

In [40]:
import json

metadata = {
    "title": "Air Quality Index (AQI) Bulletins",
    "description": "This dataset contains daily air quality bulletins released by CPCB from 2015-24",
    "license": "GODL",
    "license_url": "https://www.data.gov.in/Godl",
    "source": "Central Pollution Control Board (CPCB)",
    "source_url": "https://cpcb.nic.in/AQI_Bulletin.php",
    "about": "UrbanEmissions.info",
    "about_url": "https://github.com/urbanemissionsinfo/AQI_bulletins",
    "databases": {
        "aqi-bulletins": {
            "tables": {
                "aqi_bulletins": {
                    "facets": ["city", "aqi_category", "year", "month"]  # facets to show by default
                }
            }
        }
    }
}

with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)
In [69]:
#!pip install datasette
#!pip install datasette-vega #For charting options
#!datasette "C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/aqi-bulletins.db" --port 8001

import subprocess
proc = subprocess.Popen(["datasette",
                         "C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/aqi-bulletins.db",
                         "-m", "metadata.json",
                         "--config", "suggest_facets:off", #Don't show suggested facets
                         "--config", "default_facet_size:10", #Show only top 10 in facet
                         "--config", "facet_time_limit_ms:1000", #Increased facet query timeout limit to 1000ms from default 200ms.
                         "--port", "8001",
                        ])

You can now browse the database in Datasette locally at http://localhost:8001/
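As a quick check that the server actually came up (a sketch; it assumes the subprocess above has had a moment to start), you can hit Datasette's built-in /-/versions.json endpoint:

import json, time, urllib.request

time.sleep(2)  # give the subprocess a moment to start
with urllib.request.urlopen("http://localhost:8001/-/versions.json") as resp:
    print(json.load(resp)["datasette"])  # Datasette version info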

In [70]:
# Kill process
proc.terminate()
proc.wait()  # wait for the process to cleanly exit
Out[70]:
1

STEP 4: Deploy Datasette using Vercel¶

For pilot use cases, you can simply expose the local instance with ngrok (e.g., ngrok http 8001).

The following procedure is to deploy Datasette on Vercel.

  1. Keep aqi-bulletins.db and metadata.json in a single directory (just to keep things simple)
  2. Create a new vercel.json file in the same folder (contents below)
  3. Create a new requirements.txt file in the same folder (contents below)
  4. Create a new api.py file in the same folder (contents below)
  5. In your terminal, go to the directory containing the above files
  6. Run pip install uvicorn
  7. Run uvicorn api:app --reload --host 0.0.0.0 --port 8000 -- this starts Datasette locally; you can check that it is running fine at http://localhost:8000/
  8. Install the Vercel CLI tool
  9. Run vercel login and complete the login/signup
  10. Run vercel --prod and complete all the steps. Your Datasette app will be published.

You can find the published AQI-Bulletins Datasette app here: AQI Bulletins Datasette

vercel.json:

{
  "version": 2,
  "builds": [
    {
      "src": "api.py",
      "use": "@vercel/python"
    }
  ],
  "routes": [
    { "src": "/(.*)", "dest": "api.py" }
  ]
}

requirements.txt

datasette
datasette-vega
<any other Datasette plugins>

api.py

from datasette.app import Datasette
import os
import json

db_path = os.path.join(os.getcwd(), "aqi-bulletins.db")
metadata_path = os.path.join(os.getcwd(), "metadata.json")


with open(metadata_path) as f:
    metadata = json.load(f)


settings = {
    "suggest_facets": False,
    "default_facet_size": 10,
    "facet_time_limit_ms": 1000,
}
ds = Datasette([db_path], metadata=metadata, settings=settings)

# This is what Vercel uses as the ASGI entrypoint
app = ds.app()
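Before deploying, you can smoke-test the same ASGI app from a notebook cell using Datasette's internal client (a sketch; datasette.client must be awaited, which notebook cells allow):

# Request the table's JSON endpoint through Datasette's internal client
response = await ds.client.get("/aqi-bulletins/aqi_bulletins.json")
print(response.status_code)  # expect 200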

Examples¶

Select all rows that have PM2.5¶

SELECT * FROM aqi_bulletins WHERE prominent_pollutant LIKE '%PM2.5%';
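The same query also works over Datasette's JSON API; a minimal sketch, assuming the local instance from Step 3 is still running on port 8001:

import json, urllib.parse, urllib.request

sql = "SELECT * FROM aqi_bulletins WHERE prominent_pollutant LIKE '%PM2.5%' LIMIT 5"
url = "http://localhost:8001/aqi-bulletins.json?" + urllib.parse.urlencode({"sql": sql})
with urllib.request.urlopen(url) as resp:
    print(json.load(resp)["rows"])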

References¶

  1. 10-minute video tutorial on using sqlite-utils
  2. Datasette config settings
  3. Datasette plugins: Cluster map, full text search
  4. Simon Willison on Datasette