Finding Jobs on Twitter using the OpenAI Embeddings API
Recently, I've spent a lot of time looking at job listings and reading headlines about LLMs - in particular, those from OpenAI. I wanted to understand more about embeddings, and I also wanted to find interesting job listings with a little less manual work involved. So, I thought I'd take a shot at using the OpenAI Embeddings API to help me find the jobs I'd be interested in.
A Better Way to Find Job Listings
Most job boards I browse post new listings to their Twitter accounts, and if they don't, it's likely another board with the same listings has done so. I felt like Twitter would be the best place to collect a bunch of recent, random listings and start working from there - as opposed to scraping a bunch of different job boards individually.
My rough outline was something like the following:
- Scrape relevant tweets about open positions (e.g. "we are hiring")
- Provide a few keywords relevant to my job search (e.g. "EMEA, Go, Typescript")
- Score each keyword against each scraped tweet to measure relatedness.
- Export the results into Google Sheets to find interesting listings.
Scraping Tweets with Twint
The first part was quite straightforward - using Twint, I set up a re-usable function in case I wanted to perform multiple sets of searches in one go. This queries a number of tweets (in this case, 5000) with the search query "hiring remote developer", and saves them to a CSV.
import twint
import nest_asyncio  # For Jupyter
nest_asyncio.apply()  # For Jupyter

# Search for tweets matching the query and append them to a CSV
def search(query, limit):
    c = twint.Config()
    c.Search = query
    c.Limit = limit
    c.Store_csv = True
    c.Output = "./data.csv"
    c.Hide_output = True
    twint.run.Search(c)

search("hiring remote developer", 5000)
Next, I created a list of keywords relevant to my job search.
If you're following along, you should spend some time experimenting with different phrasing or variations of these keywords.
keywords = ["front-end developer", "back-end developer", "full-stack developer", "javascript", "python", "golang"]
Importantly, I provided some keywords here which are not relevant to my skills. That way, if I found a promising-looking listing, I could use those keywords to check whether it also leans heavily on fields or expertise I'm unfamiliar with or trying to avoid.
For example, one might filter for listings with high relevance to "front-end developer" and "javascript", then refine that list by excluding jobs that also score highly for "back-end developer".
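Jumping ahead a little, here's a minimal sketch of that refinement once each tweet has a per-keyword similarity column (as computed later in this post). The 0.8 cut-offs are arbitrary values for illustration, not tuned thresholds:
# Sketch: keep listings that score well for the keywords I care about,
# and drop ones that also score highly for a keyword I want to avoid.
# The 0.8 thresholds are illustrative assumptions, not tuned values.
relevant = df[
    (df["front-end developer"] > 0.8)
    & (df["javascript"] > 0.8)
    & (df["back-end developer"] < 0.8)
]
print(relevant[["tweet", "link"]])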
Next, I needed to get embeddings for these tweets from the OpenAI API. This can take a while, depending on the number of entries you want to retrieve embeddings for and the number of tokens contained in those entries.
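Before kicking off a large batch, one way to gauge how long the run might take (and roughly what it will cost) is to count the tokens locally. A minimal sketch, assuming the tiktoken package and the cl100k_base encoding used by text-embedding-ada-002:
# Sketch: estimate the total token count before requesting embeddings
import tiktoken
import pandas as pd

enc = tiktoken.get_encoding("cl100k_base")
tweets = pd.read_csv("./data.csv")["tweet"].astype(str)
total_tokens = sum(len(enc.encode(t)) for t in tweets)
print(f"{total_tokens} tokens across {len(tweets)} tweets")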
Getting Embeddings from OpenAI
First I got an embedding for each tweet, then repeated the process for the keywords - getting each keyword's embedding from the API and computing the cosine similarity between it and every tweet's embedding to get a relatedness score.
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity
import pandas as pd

openai.api_key = "YOUR_API_KEY"

# Get an embedding for every scraped tweet
df = pd.read_csv("./data.csv")
df['embedding'] = df['tweet'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))

# Embed each keyword and score its similarity against every tweet
for k in keywords:
    query = get_embedding(k, engine="text-embedding-ada-002")
    df[k] = df['embedding'].apply(lambda x: cosine_similarity(x, query))

df.to_csv("./data.csv")
Finally, I saved only the values I was interested in to CSV and JSON. Now I've got a searchable list of jobs, and while nothing jumps out at a glance, the scores are still pretty clear.
In the example result below - a back-end listing - "back-end developer" scores highest of the three role designations, and "javascript" scores highest of the languages.
import json

# Keep only the tweet text, keyword scores, and link
columns = ["tweet"] + keywords + ["link"]
df[columns].to_csv("./results.csv")

with open("results.json", "w") as f:
    f.write(json.dumps(json.loads(df[columns].to_json(orient="records")), indent=4))
# Example Result
{
    "tweet": "New Remote Back-End Programming Job! Ebury: Node.js Developer. Hiring in: Latin America Only. Apply here! https://t.co/uDBU37InuH #remotework #remotejobs",
    "front-end developer": 0.8201700164,
    "back-end developer": 0.8407147429,
    "full-stack developer": 0.8260435996,
    "javascript": 0.7764954557,
    "python": 0.7493150403,
    "golang": 0.7548696562,
    "link": "https://twitter.com/weworkremotely/status/1621109144869560320"
},
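To actually dig through the exported file, sorting by one of the keyword columns is usually enough. A quick sketch, assuming the results.csv produced above:
# Sketch: rank the exported results by a single keyword's similarity score
import pandas as pd

results = pd.read_csv("./results.csv")
top = results.sort_values("front-end developer", ascending=False)
print(top[["tweet", "front-end developer", "link"]].head(10))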
Feel free to clone the notebook and give it a try yourself.
Running Things Locally
While not expensive, getting these embeddings from OpenAI's API did cost me a few dollars - and with the current state of hobbyist ML and its reproducibility (if you've got a decent graphics card), it only makes sense to try running things locally.
I had the easiest time getting up and running with MiniLM v2 from HuggingFace, with a few tutorials to aid me along the way. Things took a bit longer to set up, but I experimented with creating a search index of all my e-books and it worked really well.
It was a bit slow, but across a number of fiction and non-fiction books of various genres and niches, the actual semantic relevance was impressive. I think it could be a pretty useful way to search through things like textbooks or instruction manuals. When run on prose, it felt more like some sort of poetry generator, despite putting together pretty good results based on my keywords.
If you'd like to try running the e-book notebook yourself, it's available in this Github Repo.
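If you'd rather swap a local model into the tweet-scoring workflow from earlier, the sentence-transformers package makes that a short script too. A minimal sketch, assuming the all-MiniLM-L6-v2 checkpoint and the data.csv from the scraping step:
# Sketch: the same keyword-scoring step, swapped to a local MiniLM model
from sentence_transformers import SentenceTransformer, util
import pandas as pd

model = SentenceTransformer("all-MiniLM-L6-v2")

df = pd.read_csv("./data.csv")
tweet_embeddings = model.encode(df["tweet"].astype(str).tolist(), convert_to_tensor=True)

keywords = ["front-end developer", "javascript"]
for k in keywords:
    keyword_embedding = model.encode(k, convert_to_tensor=True)
    # util.cos_sim returns a 1 x N tensor of similarity scores
    df[k] = util.cos_sim(keyword_embedding, tweet_embeddings)[0].cpu().numpy()

df.to_csv("./local_scores.csv")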