Finding Jobs on Twitter using the OpenAI Embeddings API
Recently, I've spent a lot of time looking at job listings and reading headlines about LLMs - in particular, those from OpenAI. I wanted to understand more about embeddings, and I also wanted to find interesting job listings with a little less manual work involved. So, I thought I'd take a shot at using the OpenAI Embeddings API to help me find the jobs I'd be interested in.
A Better Way to Find Job Listings
Most job boards I browse post new listings to their Twitter accounts, and if they don't, it's likely another board with the same listings has done so. I felt like Twitter would be the best place to collect a bunch of recent, random listings and start working from there - as opposed to scraping a bunch of different job boards individually.
My rough outline was something like the following:
- Scrape relevant tweets about open positions (e.g. "we are hiring")
- Provide a few keywords relevant to my job search (e.g. "EMEA, Go, Typescript")
- Score each keyword against each scraped tweet to measure relatedness.
- Export the results into Google Sheets to find interesting listings.
Scraping Tweets with Twint
The first part was quite straightforward - using Twint, I set up a reusable function in case I wanted to perform multiple sets of searches in one go. It queries a number of tweets (in this case, 5000) matching the search query "hiring remote developer" and saves them to a CSV.
import twint
import nest_asyncio  # For Jupyter

nest_asyncio.apply()  # For Jupyter

def search(query, limit):
    c = twint.Config()
    c.Search = query
    c.Limit = limit
    c.Store_csv = True
    c.Output = "./data.csv"
    c.Hide_output = True
    twint.run.Search(c)

search("hiring remote developer", 5000)
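Before paying for embeddings, it's worth deduplicating the scraped tweets, since job boards often cross-post identical listings. A small helper along these lines would do it (the `dedupe_tweets` name is my own; the `tweet` column is the one twint writes to its CSV output):

```python
import pandas as pd

def dedupe_tweets(path="./data.csv"):
    """Drop exact-duplicate tweets so we don't pay to embed them twice."""
    df = pd.read_csv(path)
    df = df.drop_duplicates(subset="tweet").reset_index(drop=True)
    df.to_csv(path, index=False)
    return df
```

Treat this as a sketch - depending on how boards rephrase their posts, you may want fuzzier matching than exact duplicates.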
Next, I created a list of keywords relevant to my job search.
If you're following along, you should spend some time experimenting with different phrasing or variations of these keywords.
Importantly, I also included some keywords that are not relevant to my skills. That way, if I found a promising-looking listing, I could check its scores against those keywords to see whether it leaned toward fields or expertise I'm unfamiliar with or trying to avoid.
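The list itself is just a few strings. The ones below are illustrative stand-ins (mixing the example search terms from earlier with a couple of deliberately off-target fields), not my exact list:

```python
# Illustrative keyword list -- a mix of terms I actually want to match
# and a few deliberately unrelated ones to use as negative signals.
keywords = [
    "EMEA",
    "Go",
    "Typescript",
    # Off-target fields, to flag listings leaning toward expertise I'm avoiding:
    "machine learning",
    "blockchain",
]
```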
Getting Embeddings from OpenAI
Next, I needed to get embeddings for these tweets from the OpenAI API. This can take a while, depending on how many entries you're retrieving embeddings for and how many tokens each one contains.
I repeated the process for the keywords - first getting each keyword's embedding from the API, then measuring the distance between it and each tweet's embedding to get a similarity score.
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity
import pandas as pd

openai.api_key = "YOUR_API_KEY"

df = pd.read_csv("./data.csv")

# Embed every tweet (one API call per row)
df['embedding'] = df['tweet'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))

# Embed each keyword, then score it against every tweet
for k in keywords:
    query = get_embedding(k, engine="text-embedding-ada-002")
    df[k] = df['embedding'].apply(lambda x: cosine_similarity(x, query))

df.to_csv("./data.csv")
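For intuition, cosine similarity is nothing exotic - it's just the dot product of the two vectors scaled by their norms. A minimal numpy version (the `cosine_sim` name is mine) looks like:

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product of the vectors, scaled by the product of their norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical directions score 1.0, orthogonal vectors score 0.0, and everything in between indicates how related two embeddings are.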
Finally, I saved only the columns I was interested in to CSV and JSON. Now I've got a searchable list of jobs - and while no single result jumps out at you, the rankings are still pretty clear.
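The export step can be as simple as selecting the tweet text plus the per-keyword score columns and writing them out. A sketch (the `export_results` helper and the sort order are my own choices, not necessarily what the notebook does):

```python
import pandas as pd

def export_results(df, keywords, path_prefix="./results"):
    # Keep just the tweet text and the per-keyword similarity scores;
    # the raw embedding column is huge and useless in a spreadsheet.
    cols = ["tweet"] + list(keywords)
    out = df[cols].sort_values(keywords[0], ascending=False)
    out.to_csv(path_prefix + ".csv", index=False)
    out.to_json(path_prefix + ".json", orient="records")
    return out
```

The CSV imports straight into Google Sheets, and the JSON is handy if you want to build anything else on top of the results.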
Feel free to clone the notebook and give it a try yourself.
Running Things Locally
While not expensive, getting these embeddings from OpenAI's API did cost me a few dollars - and given the current state of hobbyist ML and its reproducibility (if you've got a decent graphics card), it only makes sense to try running things locally.
I had the easiest time getting up and running with MiniLM v2 from HuggingFace, with a few tutorials to aid me along the way. Setup took a bit longer than with the OpenAI API, but I experimented with creating a search index of all my e-books and it worked really well.
It was a bit slow, but across a number of fiction and non-fiction books of various genres and niches, the semantic relevance was impressive. I think it could be a pretty useful way to search through things like textbooks or instruction manuals. When run on prose, it felt more like some sort of poetry generator, despite turning up pretty good results for my keywords.
If you'd like to try running the e-book notebook yourself, it's available in this GitHub repo.