What is in Data Is Plural?

Using semantic search on the popular dataset archive.

At What Can Data Do?, we delve into the wealth of data found in Data Is Plural (DIP), Jeremy Singer-Vine's weekly newsletter that showcases a rich selection of datasets. As a long-time admirer, I often turn to this resource when seeking out well-curated data for my research. Jeremy’s compilation boasts over 1,750 datasets and continues to expand each week.

One challenge I’ve encountered is navigating the archive—it’s not the most search-friendly. While Jeremy provides a Google Sheet with details on the featured datasets, it doesn’t quite facilitate inquiries such as: “Which datasets in the archive relate to global energy production?”

To address this, I embarked on enhancing the archive’s searchability. By applying semantic search to the entirety of Data is Plural’s publications, my goal was not only to learn but also to forge a tool that elevates the ease of data discovery. Let’s dive in…

What is semantic search, and how does it work?

Semantic search transcends traditional keyword matching, delving into the intent and contextual nuance behind a query to fetch the most pertinent information. Rather than simply scanning for specific terms, it employs techniques that consider elements such as:

  • Contextual Understanding: Words shift in meaning based on their use. Semantic search aims to grasp this context, refining its accuracy.

  • Recognition of Synonyms: It appreciates that different terms can share meanings and provides relevant findings even when exact search terms are absent from the content.

  • Natural Language Processing (NLP): Semantic search utilizes NLP to interpret the searcher’s intent, managing conversational queries with finesse.

  • Conceptual Insight: It searches for ideas and meanings rather than just phrase matches, capturing the essence behind the words.

  • Discerning User Intent: Semantic search adjusts results to align with whether a user aims to purchase, explore, locate, or troubleshoot.

  • Tailored Results: It may also integrate personal data, such as past behavior, locale, or time, to customize the search experience.

Powered by machine learning, ontologies, and NLP, semantic search is adept at closing the gap between what users seek and the content they find. This technology proves invaluable in customer support, e-commerce, knowledge management, and content discovery, ensuring users connect more intuitively with the information they require.
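To make the contrast with keyword search concrete, here is a minimal sketch using the same sentence-transformers model we rely on later in this post; the toy query and documents are purely illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

query = "renewable power generation"
docs = [
    "A worldwide dataset of utility-scale solar and wind facilities.",
    "A catalogue of every lighthouse on the Great Lakes.",
]

# A naive keyword search finds nothing: no query term appears verbatim in the first document.
print(any(word in docs[0].lower() for word in query.lower().split()))   # False

# Semantic search should still score the energy dataset as the better match.
query_embedding = model.encode(query, convert_to_tensor=True)
doc_embeddings = model.encode(docs, convert_to_tensor=True)
print(util.pytorch_cos_sim(query_embedding, doc_embeddings))            # expect a higher score for docs[0]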

What is the all-mpnet-base-v2 sentence transformer, and how does it work?

The all-mpnet-base-v2 is a pre-trained model provided by the Hugging Face Sentence Transformers library, which is a Python framework for state-of-the-art sentence, text, and image embeddings. The model is based on the MPNet architecture, which stands for “Masked and Permuted Pre-training for Language Understanding.”

MPNet is an innovative pre-training method that combines the strengths of BERT (Bidirectional Encoder Representations from Transformers) and XLNet architectures to effectively capture the contextual representations of words in a sentence. MPNet is designed to understand the context from both the left and right sides of a token in the input sequence.

Here’s how the all-mpnet-base-v2 model works in the context of sentence transformers:

  1. Pre-training:

    • MPNet is pre-trained on a large corpus of text data using self-supervised learning. It uses a novel pre-training objective that masks some of the input tokens and permutes others, allowing the model to learn a deep understanding of language context and word relationships.

  2. Sentence Embeddings:

    • The all-mpnet-base-v2 model is fine-tuned to produce sentence embeddings. This means that for any given sentence or paragraph, the model generates a fixed-size vector representation that captures its semantic meaning. These embeddings can be used for a variety of downstream tasks such as semantic search, clustering, and classification.

  3. Fine-tuning for Specific Tasks:

    • Although the model is pre-trained on a general corpus, it can be further fine-tuned on task-specific data to improve performance on particular tasks. For example, if you want to use the model for legal document analysis, you could fine-tune it on a corpus of legal texts.

  4. Usage in Applications:

    • In applications, when you input a sentence to the all-mpnet-base-v2 model, it processes the text through multiple transformer layers and generates an embedding vector that can then be used for semantic comparisons, search, or as input features to other machine learning models.

The Sentence Transformers library makes it easy to use these models. You can load the all-mpnet-base-v2 model and generate embeddings for your sentences with just a few lines of code. These embeddings are designed to be semantically meaningful, such that sentences with similar meanings are close to each other in the embedding space, which is beneficial for many NLP tasks that require understanding sentence similarity.
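As a rough sketch of those “few lines of code” (the example sentences here are made up for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

sentences = [
    "Global statistics on electricity generation.",
    "Worldwide data about producing electric power.",
    "A list of Shakespeare's sonnets.",
]

# Each sentence becomes one fixed-size vector.
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)   # torch.Size([3, 768])

# Paraphrases sit close together in the embedding space...
print(util.pytorch_cos_sim(embeddings[0], embeddings[1]))
# ...while an unrelated sentence sits farther away.
print(util.pytorch_cos_sim(embeddings[0], embeddings[2]))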

What is cosine similarity, and how do I calculate it?

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. The cosine similarity, S, between two vectors A and B is calculated as follows:

S(A, B) = dot(A,B) / (||A|| * ||B||)

where:

  • dot(A, B) is the dot product of the vectors,

  • ||A|| and ||B|| are the norms (magnitudes) of vectors A and B.

This measure is a metric of the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It should not be confused with a physical space, as it can be used for high dimensional spaces, e.g., text classification in an information retrieval context where each term is a dimension and the terms in a document are modeled as a vector for comparison with other documents.
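As a quick sanity check of the formula, here is a small NumPy sketch (the example vectors are arbitrary):

import numpy as np

def cosine_similarity(a, b):
    # S(A, B) = dot(A, B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same orientation as a, twice the magnitude
c = np.array([-1.0, -2.0, -3.0])  # diametrically opposed to a

print(cosine_similarity(a, b))    # 1.0  -- orientation matches, magnitude is ignored
print(cosine_similarity(a, c))    # -1.0 -- opposite orientation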

How we are going to apply Semantic Search to the DIP catalog

  • Use the Hugging Face open-source model library

  • Use the Sentence Transformer all-mpnet-base-v2 model, which maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

  • Use the cosine similarity measure to assess the “distance” between the descriptions in the catalog and the posed questions.

Step 1 - Install the Python libraries needed for semantic search

!pip -q install transformers[torch] torch scikit-learn protobuf==3.20.0 sentence-transformers

Step 2 - Import the requisite libraries

import pandas as pd
import numpy as np
import time
from sentence_transformers import SentenceTransformer, util

Step 3 - Load the Data is Plural Catalog Spreadsheet

# Load the CSV and preprocess
SHEET_ID = '1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk'
SHEET_NAME = 'Items'
url = f'https://docs.google.com/spreadsheets/d/{SHEET_ID}/gviz/tq?tqx=out:csv&sheet={SHEET_NAME}'
data_df = pd.read_csv(url)

# Take a peek at the first several rows of the data
data_df.head(3)

  | edition    | position | headline                                            | text                                               | links                                              | hattips
0 | 2015.10.21 | 1        | Every place name in the United States.              | Sometimes, bureaucracy creates poetry. Since 1...  | http://geonames.usgs.gov/index.html\nhttp://ge...  | https://twitter.com/emilymbadger/status/653982...
1 | 2015.10.21 | 2        | “There’s finally federal data on low-income co...   | The Hechinger Report casts doubt on the Pell g...  | http://hechingerreport.org/theres-finally-fede...  | NaN
2 | 2015.10.21 | 3        | What police-related data does your city publish?    | The Police Open Data Census, created by Code f...  | https://codeforamerica.github.io/PoliceOpenDat...  | NaN

Step 4 - Create descriptions and embedding vectors for each dataset

(takes several minutes)

# Create a list of the headlines and description of each dataset in the catalog
descriptions = (data_df['headline'] + " " + data_df['text']).tolist()

# Save the links to each dataset to reference when we are performing the semantic search.
links = data_df['links'].tolist()

# Load the Sentence Transformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

start_time = time.time()

# Convert dataset descriptions to embeddings and store them
embeddings = model.encode(descriptions, convert_to_tensor=True)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"The embedding of {len(descriptions)} descriptions took {elapsed_time:4.0f} seconds to run")

print("An example dataset description and its associated embedding...")
print(descriptions[0])
print(f"The vectors are {len(embeddings[0])} in length.\nThe first 10 elements of the vector for this description are:\n{embeddings[0][0:10]}")
      
The embedding of 1750 descriptions took  169 seconds to run
An example dataset description and its associated embedding...
Every place name in the United States. Sometimes, bureaucracy creates poetry. Since 1890, the U.S. Board on Geographic Names has been cataloguing, standardizing, and promulgating official names for the places we hike, swim, work, and call home. Along the way, it began publishing Geographic Names Information System (GNIS), a searchable and downloadable database containing all of its domestic nomenclature. In Alaska alone, the database lists names for 167 dams, 303 post offices, 666 glaciers, 2,704 capes, and 9,575 streams. My favorite: Confusion Creek. [h/t @emilymbadger]
The vectors are 768 in length.
The first 10 elements of the vector for this description are:
tensor([ 0.0570,  0.0512, -0.0187, -0.0174, -0.0091,  0.0062,  0.0331, -0.0004,
         0.0364, -0.0083])

Step 5 - Create a utility function that returns the closest matches to a question

"""
This function takes the following arguments:
 - question (str)
 - embeddings (tensor)
 - descriptions (list)
 - links (list)
 - top_k (scalar)
 
And returns the top_k closest descriptions to the question that was posed.
The routine uses the cosine similarity to assess the "distance" between the descriptions of the datasets
and the question that was posed.
"""

def find_closest_matches(question, embeddings, descriptions, links, top_k=5):
    question_embedding = model.encode(question, convert_to_tensor=True)
    
    # Compute cosine similarities
    similarities = [util.pytorch_cos_sim(question_embedding, desc_embedding).item() for desc_embedding in embeddings]
    
    # Get the indices of top matches
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    
    return [(descriptions[idx], similarities[idx], links[idx]) for idx in top_indices]
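Before wiring the helper into an interactive loop, a quick one-off call is a reasonable sanity check (the example question is arbitrary, and this assumes the model, embeddings, descriptions, and links from the earlier steps are still in memory):

# Ask a throwaway question and print the three closest catalog entries
matches = find_closest_matches("earthquake damage records", embeddings, descriptions, links, top_k=3)
for desc, score, link in matches:
    print(f"{score:.3f}  {desc[:80]}...")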

Step 6 - Answer questions about the Data Is Plural catalog

For each question, return similar datasets from the catalog, including their descriptions and links to their sources.

# Define the minimum similarity for a match
MIN_SIM = 0.6

while True:
    # Ask for a question
    question = input("\nPlease enter your question (or type 'exit' to quit): ")
    
    if question.lower() == 'exit':
        break
    
    # Find and print closest matches
    matches = find_closest_matches(question, embeddings, descriptions, links, 25)
    print("\nTop matches:")
    for i, (desc, score, link) in enumerate(matches, 1):
        if score > MIN_SIM:
            print(f"\n{i}. {desc}\n(Score: {score:.4f}) References:")
            for r in link.split():
                print(f"* {r}")
        

Please enter your question (or type 'exit' to quit):  Which datasets in the archive relate to global energy production?

Please enter your question (or type 'exit' to quit):  exit

Top matches:

1. Grid emissions. Ember, an “energy think tank that uses data-driven insights to shift the world from coal to clean electricity,” has begun  compiling annual and monthly statistics on electricity demand, generation, and estimated greenhouse gas emissions by country, standardized from national and international sources. The annual estimates span two decades and 200+ countries and territories; the monthly dataset provides somewhat less coverage. Both can also be explored online. Related: Singularity’s Open Grid Emissions initiative estimates the hourly grid emissions of balancing authorities and power plants in the US, currently for 2019 and 2020. Previously: Other energy-related datasets. [h/t Philippe Quirion]
(Score: 0.6854) References:
* https://ember-climate.org/
* https://twitter.com/nicolasfulghum/status/1572974235932364800
* https://ember-climate.org/data-catalogue/yearly-electricity-data/
* https://ember-climate.org/data-catalogue/monthly-electricity-data/
* https://ember-climate.org/data/data-explorer/
* https://singularity.energy/
* https://singularity.energy/open-grid-emissions
* https://medium.com/singularity-energy/validating-real-time-electricity-emissions-rates-with-an-hourly-historical-benchmark-a9990a2c9049
* https://www.eia.gov/todayinenergy/detail.php?id=27152
* https://www.data-is-plural.com/collections/energy

2. European energy imports/exports. The EU’s Eurostat office publishes a range of statistical datasets on energy usage and economics, including annual imports and exports of petroleum, natural gas, and coal between European countries and their trading partners. Related: The Energy Information Administration tracks US imports and exports of petroleum, natural gas, and coal. As seen in: How Europe is dependent on Russian gas (New Statesman) and Why the Toughest Sanctions on Russia Are the Hardest for Europe to Wield (New York Times). Previously: European gas storage (DIP 2022.01.26), state-owned oil companies (DIP 2019.05.01), and global and gas infrastructure (DIP 2018.06.06). [h/t Lisa Charlotte Muth]
(Score: 0.6618) References:
* https://en.wikipedia.org/wiki/Eurostat
* https://ec.europa.eu/eurostat/databrowser/explore/all/envir?lang=en&subtheme=nrg&display=list&sort=category&extractionId=NRG_TI_SFF__custom_2293925
* https://ec.europa.eu/eurostat/databrowser/explore/all/envir?lang=en&subtheme=nrg.nrg_quant.nrg_quanta.nrg_t.nrg_ti&display=list&sort=category&extractionId=NRG_TI_SFF__custom_2293925
* https://ec.europa.eu/eurostat/databrowser/explore/all/envir?lang=en&subtheme=nrg.nrg_quant.nrg_quanta.nrg_t.nrg_te&display=list&sort=category&extractionId=NRG_TI_SFF__custom_2293925
* https://www.eia.gov/
* https://www.eia.gov/petroleum/data.php#imports
* https://www.eia.gov/naturalgas/data.php#imports
* https://www.eia.gov/coal/data.php#imports
* https://www.newstatesman.com/chart-of-the-day/2022/02/how-europe-is-dependent-on-russian-gas
* https://www.nytimes.com/2022/02/25/business/economy/russia-europe-sanctions-gas-oil.html
* https://agsi.gie.eu/
* https://www.data-is-plural.com/archive/2022-01-26-edition/
* https://www.nationaloilcompanydata.org/
* https://www.data-is-plural.com/archive/2019-05-01-edition/
* https://edx.netl.doe.gov/dataset/global-oil-gas-features-database
* https://www.data-is-plural.com/archive/2018-06-06-edition/

3. Wind and solar power. The Global Energy Monitor’s Global Wind Power Tracker is “a worldwide dataset of utility-scale wind facilities,” focusing on those with planned or installed capacities of at least 10 megawatts. It provides each facility’s name, location, status, capacity, installation type, owner, and other details. The project launched last week alongside a sibling dataset, the Global Solar Power Tracker. They join a growing collection of trackers from the organization, including those examining coal infrastructure, steel plants, and oil and gas resources. [h/t Nathaniel Hoffman]
(Score: 0.6298) References:
* https://globalenergymonitor.org/about/our-story/
* https://globalenergymonitor.org/projects/global-wind-power-tracker/
* https://globalenergymonitor.org/press-release/new-trackers-showing-country-by-country-build-out-of-utility-scale-solar-and-wind/
* https://globalenergymonitor.org/projects/global-solar-power-tracker/
* https://globalenergymonitor.org/about/our-story/
* https://globalenergymonitor.org/projects/
* https://globalenergymonitor.org/projects/global-coal-tracker/
* https://globalenergymonitor.org/projects/global-steel-plant-tracker/
* https://globalenergymonitor.org/projects/global-oil-gas-extraction-tracker/

4. Power plants. The Global Power Plant Database, published by the World Resources Institute, “is a comprehensive, open source database of power plants around the world” and contains “information on plant capacity, generation, ownership, and fuel type.” The current edition, released in June 2018, covers 28,600+ power plants in 164 countries — including more than 1,000 each in Brazil, Canada, China, Great Britain, France, and the United States. Previously: U.S. power plants (DIP 2016.02.10). [h/t Kelly Rose + Paul Deane]
(Score: 0.6279) References:
* http://datasets.wri.org/dataset/globalpowerplantdatabase
* https://www.wri.org/our-work
* https://www.eia.gov/electricity/data/eia923/
* https://tinyletter.com/data-is-plural/letters/data-is-plural-2016-02-10-edition

5. Oil and gas. The Joint Organisations Data Initiative (JODI) coordinates the collection, standardization, and publication of oil and gas data from around the world; the 100+ countries that participate represent the vast majority of global production. The oil data goes back to 2002; the gas data goes back to 2009. Both datasets are updated monthly and track a range of subproducts (e.g., crude oil, diesel, jet fuel) and flows (e.g., imports, exports, production) for each country. Previously: Global and gas infrastructure (DIP 2018.06.06) and state-owned oil companies (DIP 2019.05.01).
(Score: 0.6203) References:
* https://www.jodidata.org/about-jodi/history.aspx
* https://www.jodidata.org/about-jodi/jodi-world-databases.aspx
* https://www.jodidata.org/gas/database/data-downloads.aspx
* https://edx.netl.doe.gov/dataset/global-oil-gas-features-database
* https://tinyletter.com/data-is-plural/letters/data-is-plural-2018-06-06-edition
* https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-05-01-edition
* https://tinyletter.com/data-is-plural/letters/data-is-plural-2019-05-01-edition

And with that, we wrap up for the moment…

Don’t forget to hit subscribe on the What Can Data Do? YouTube channel. Stay tuned for those succinct data nuggets tailor-made for the ever-busy data enthusiasts. In data, we trust!