Publish AQI bulletins data on Datasette¶
This notebook publishes the AQI bulletins data we collected as a browsable Datasette instance.
Steps:¶
STEP 1: Import data into a SQLite Database¶
In [1]:
import sqlite_utils
import pandas as pd
In [107]:
# Time series data -- AQI Bulletins data of all Indian Cities
time_series_df = pd.read_csv("C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/AllIndiaBulletins_Master_2024.csv")
In [108]:
# Clean: keep the relevant columns and standardise names and values
time_series_df = time_series_df[['date', 'City', 'No. Stations', 'Air Quality', 'Index Value', 'Prominent Pollutant']]
time_series_df.columns=['date', 'city', 'no_stations', 'aqi_category','aqi','prominent_pollutant']
time_series_df = time_series_df.replace(to_replace='Very poor', value='Very Poor')
# Replace "OZONE" with "O3" in the 'prominent_pollutant' column
time_series_df['prominent_pollutant'] = time_series_df['prominent_pollutant'].astype(str).str.replace("OZONE", "O3", regex=False)
time_series_df['date'] = pd.to_datetime(time_series_df['date'])
In [109]:
time_series_df['year'] = time_series_df['date'].dt.year
time_series_df['month'] = time_series_df['date'].dt.strftime('%b')
In [116]:
duplicates = time_series_df[time_series_df.duplicated(subset=['date', 'city'], keep=False)]
duplicates
Out[116]:
| | date | city | no_stations | aqi_category | aqi | prominent_pollutant | year | month |
|---|---|---|---|---|---|---|---|---|
| 179252 | 2022-03-31 | Aurangabad | 1 | Moderate | 104 | PM10 | 2022 | Mar |
| 179253 | 2022-03-31 | Aurangabad | 1 | Poor | 268 | PM10 | 2022 | Mar |
| 179301 | 2022-04-01 | Aurangabad | 1 | Very Poor | 312 | PM10 | 2022 | Apr |
| 179302 | 2022-04-01 | Aurangabad | 1 | Moderate | 108 | PM10 | 2022 | Apr |
| 179670 | 2022-04-03 | Aurangabad | 1 | Very Poor | 317 | PM10 | 2022 | Apr |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 212132 | 2022-11-01 | Aurangabad | 1 | Moderate | 117 | PM2.5 | 2022 | Nov |
| 212193 | 2022-11-02 | Aurangabad | 1 | Moderate | 107 | PM10 | 2022 | Nov |
| 212194 | 2022-11-02 | Aurangabad | 1 | Moderate | 144 | PM10 | 2022 | Nov |
| 212418 | 2022-11-03 | Aurangabad | 1 | Moderate | 169 | PM10 | 2022 | Nov |
| 212419 | 2022-11-03 | Aurangabad | 1 | Moderate | 128 | PM2.5 | 2022 | Nov |

68 rows × 8 columns
On these dates, the bulletins did not differentiate between the two cities named Aurangabad, one in Maharashtra and one in Bihar, so both appear under the same name. We drop these duplicate (date, city) rows before inserting the data.
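As a quick sanity check, we can confirm that deduplicating on (date, city) leaves no conflicting rows; a minimal sketch (the deduped name is just for illustration):

deduped = time_series_df.drop_duplicates(subset=["date", "city"])
assert not deduped.duplicated(subset=["date", "city"]).any()
print(f"Dropped {len(time_series_df) - len(deduped)} duplicate rows")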
In [8]:
db = sqlite_utils.Database("C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/aqi-bulletins.db")
# this creates or connects to an SQLite file
In [120]:
# Create a table in the db and insert the deduplicated rows into it
table_name = "aqi_bulletins"
db[table_name].insert_all(
    time_series_df.drop_duplicates(subset=["date", "city"]).to_dict(orient="records"),
    alter=True,
    pk=("date", "city"),  # Composite primary key; prevents an auto-created rowid
)
Out[120]:
<Table aqi_bulletins (date, city, no_stations, aqi_category, aqi, prominent_pollutant, year, month)>
In [121]:
# Normalise stored dates to YYYY-MM-DD (strip the time component)
db.execute("""
UPDATE aqi_bulletins
SET date = substr(date, 1, 10)
""")
db.conn.commit()
In [9]:
# To avoid facet timeout errors, create indexes on the columns we want to facet by
db.execute("CREATE INDEX IF NOT EXISTS idx_city ON aqi_bulletins(city)")
db.execute("CREATE INDEX IF NOT EXISTS idx_aqicategory ON aqi_bulletins(aqi_category)")
db.conn.commit()
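To confirm the indexes were created, sqlite_utils can list them per table; a minimal sketch, which should print idx_city and idx_aqicategory with their columns:

for idx in db["aqi_bulletins"].indexes:
    print(idx.name, idx.columns)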
STEP 2: Explore the SQLite db created¶
In [3]:
# See all tables in the db
db.tables
Out[3]:
[<Table aqi_bulletins (date, city, no_stations, aqi_category, aqi, prominent_pollutant, year, month)>]
In [4]:
for table_name in db.table_names():
    row_count = db[table_name].count
    print(f"{table_name}: {row_count} rows")
aqi_bulletins: 387686 rows
In [125]:
# Inspect schema
print(db["aqi_bulletins"].schema)
CREATE TABLE [aqi_bulletins] (
   [date] TEXT,
   [city] TEXT,
   [no_stations] INTEGER,
   [aqi_category] TEXT,
   [aqi] INTEGER,
   [prominent_pollutant] TEXT,
   [year] INTEGER,
   [month] TEXT,
   PRIMARY KEY ([date], [city])
)
In [126]:
# Show the first 5 rows in the table without materialising all of them
from itertools import islice
list(islice(db[table_name].rows, 5))
Out[126]:
[{'date': '2015-05-01', 'city': 'Varanasi', 'no_stations': 0, 'aqi_category': 'Moderate', 'aqi': 157, 'prominent_pollutant': 'PM10', 'year': 2015, 'month': 'May'},
 {'date': '2015-05-01', 'city': 'Kanpur', 'no_stations': 0, 'aqi_category': 'Moderate', 'aqi': 175, 'prominent_pollutant': 'PM10', 'year': 2015, 'month': 'May'},
 {'date': '2015-05-01', 'city': 'Faridabad', 'no_stations': 0, 'aqi_category': 'Moderate', 'aqi': 173, 'prominent_pollutant': 'PM10', 'year': 2015, 'month': 'May'},
 {'date': '2015-05-01', 'city': 'Ahmedabad', 'no_stations': 0, 'aqi_category': 'Moderate', 'aqi': 168, 'prominent_pollutant': 'PM2.5', 'year': 2015, 'month': 'May'},
 {'date': '2015-05-01', 'city': 'Hyderabad', 'no_stations': 0, 'aqi_category': 'Moderate', 'aqi': 189, 'prominent_pollutant': 'PM2.5', 'year': 2015, 'month': 'May'}]
In [127]:
# Run SQL queries
## Max AQI in each month in Delhi
results = db.execute("""
SELECT
substr(date, 1, 7) AS month,
MAX(aqi) AS max_aqi
FROM
aqi_bulletins
WHERE
city = 'Delhi'
GROUP BY
month
ORDER BY
month
""")
list(results)
Out[127]:
[('2015-08', 246), ('2015-09', 302), ('2016-03', 352), ('2016-10', 445), ('2016-11', 497), ('2017-04', 383), ('2017-05', 368), ('2017-12', 469), ('2018-06', 447), ('2019-02', 382), ('2019-07', 317), ('2021-01', 460)]
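If you prefer pandas for further inspection or plotting, the same query can be read straight into a DataFrame while the connection is still open; a minimal sketch (the max_aqi_delhi name is just for illustration):

max_aqi_delhi = pd.read_sql_query(
    """
    SELECT substr(date, 1, 7) AS month, MAX(aqi) AS max_aqi
    FROM aqi_bulletins
    WHERE city = 'Delhi'
    GROUP BY month
    ORDER BY month
    """,
    db.conn,  # pandas accepts the underlying sqlite3 connection
)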
In [128]:
db.conn.close()
STEP 3: Opening the SQLite db in Datasette¶
In [40]:
import json
metadata = {
    "title": "Air Quality Index (AQI) Bulletins",
    "description": "This dataset contains the daily air quality bulletins released by CPCB from 2015 to 2024",
    "license": "GODL",
    "license_url": "https://www.data.gov.in/Godl",
    "source": "Central Pollution Control Board (CPCB)",
    "source_url": "https://cpcb.nic.in/AQI_Bulletin.php",
    "about": "UrbanEmissions.info",
    "about_url": "https://github.com/urbanemissionsinfo/AQI_bulletins",
    "databases": {
        "aqi-bulletins": {
            "tables": {
                "aqi_bulletins": {
                    "facets": ["city", "aqi_category", "year", "month"]  # Columns to offer as facets in the table view
                }
            }
        }
    }
}
with open("metadata.json", "w") as f:
json.dump(metadata, f, indent=4)
In [69]:
#!pip install datasette
#!pip install datasette-vega #For charting options
#!datasette "C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/aqi_bulletins.db" --port 8001
import subprocess
proc = subprocess.Popen([
    "datasette",
    "C:/Users/dskcy/UEInfo/AQI_bulletins/data/Processed/aqi-bulletins.db",
    "-m", "metadata.json",
    "--config", "suggest_facets:off",        # Don't show suggested facets
    "--config", "default_facet_size:10",     # Show only the top 10 values per facet
    "--config", "facet_time_limit_ms:1000",  # Raise the facet query timeout from the default 200ms to 1000ms
    "--port", "8001",
])
You can now browse the database in Datasette locally at http://localhost:8001/
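Datasette also serves every table as JSON, which is handy for scripted checks against the running server. A minimal sketch, assuming the requests package is installed (city__exact and _size are standard Datasette table filters):

import requests

resp = requests.get(
    "http://localhost:8001/aqi-bulletins/aqi_bulletins.json",
    params={"city__exact": "Delhi", "_size": 5},
)
print(resp.json()["rows"])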
In [70]:
# Kill process
proc.terminate()
proc.wait() # wait for the process to cleanly exit
Out[70]:
1
STEP 4: Deploy¶
For pilot use cases, you can expose the local server through an ngrok tunnel, as sketched below.
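A minimal sketch of the ngrok route, assuming the ngrok CLI is installed and authenticated; it tunnels the local Datasette port to a public URL that ngrok prints to its console:

import subprocess

ngrok_proc = subprocess.Popen(["ngrok", "http", "8001"])  # Forwards a public URL to localhost:8001
# When done:
# ngrok_proc.terminate()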
The following procedure deploys Datasette on Vercel:
- Keep `aqi-bulletins.db` and `metadata.json` in a single directory (just to make it easy).
- Create a new `vercel.json` file in the same folder.
- Create a new `requirements.txt` file in the same folder.
- Create a new `api.py` file in the same folder.
- In your terminal, go to the directory containing the above files and run `pip install uvicorn`, then `uvicorn api:app --reload --host 0.0.0.0 --port 8000`. This starts Datasette locally; you can check that it is running fine at `http://localhost:8000/`.
- Install the Vercel CLI tool.
- Run `vercel login` and complete the login/signup.
- Run `vercel --prod` and complete all the steps. Your Datasette app will be published.
You can find the published AQI-Bulletins Datasette app here: AQI Bulletins Datasette
vercel.json:
{
    "version": 2,
    "builds": [
        {
            "src": "api.py",
            "use": "@vercel/python"
        }
    ],
    "routes": [
        { "src": "/(.*)", "dest": "api.py" }
    ]
}
requirements.txt:
datasette
datasette-vega
<any other Datasette plugins>
api.py:
from datasette.app import Datasette
import os
import json
db_path = os.path.join(os.getcwd(), "aqi-bulletins.db")
metadata_path = os.path.join(os.getcwd(), "metadata.json")
with open(metadata_path) as f:
    metadata = json.load(f)

settings = {
    "suggest_facets": False,
    "default_facet_size": 10,
    "facet_time_limit_ms": 1000,
}
ds = Datasette([db_path], metadata=metadata, settings=settings)
# This is what Vercel uses as the ASGI entrypoint
app = ds.app()