How to Build a Job Scraper API with FastAPI, Playwright, Langchain and AWS Bedrock
This article will guide AI developers on creating a job scraping API. You will learn how to build an API using FastAPI, scrape websites using Playwright, and extract structured data from HTML using Langchain and AWS Bedrock.
Development Environment
To build the job scraping API, you first need to set up your Python development environment and install the required packages.
In addition to Python 3, you will be using:
virtualenv – a tool for creating isolated Python environments (this article uses the equivalent built-in venv module).
FastAPI – a modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints.
Playwright – a library to automate Chromium, Firefox, and WebKit with a single API.
Langchain – a framework for developing applications powered by language models.
AWS Bedrock - a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as Anthropic and Cohere through a single API, along with a broad set of capabilities.
Anthropic Claude 3 Haiku - one of Anthropic's large language models.
Pydantic - a data validation and settings management library using Python type hinting.
Uvicorn - an ASGI web server implementation for Python.
python-multipart - a streaming multipart parser for Python.
First, it is best practice to create a virtual environment before you begin any Python project. You will create one using Python's built-in venv module, which works like the virtualenv tool. A virtual environment isolates your Python setup on a per-project basis, so changes made to one Python project won't affect another.
On Windows, Linux, or macOS:
mkdir job_scraper
cd job_scraper
The next step is to create your virtual environment, which will be called environment. Make sure the name you choose is lowercase, with no special characters or spaces.
python3 -m venv environment
For Windows:
Activate the virtual environment:
environment\Scripts\activate
For Linux and macOS:
Activate the virtual environment:
source environment/bin/activate
You will know the virtual environment is active when the prompt in your console is prefixed with (environment).
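The prompt prefix can also be confirmed from inside Python. The sketch below is a minimal check, assuming the environment was created with python3 -m venv as shown above; the function name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
```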
Next, you will install the dependencies. Create a requirements.txt file. This file lists the packages to be installed with the pip install command.
Using the text editor of your choice, create a new file in the job_scraper directory and save it as requirements.txt. In your requirements.txt file, type the following:
boto3==1.34.13
botocore==1.34.77
fastapi==0.109.2
langchain==0.3.11
langchain-core==0.3.24
langchain_aws
playwright==1.47.0
pydantic==2.10.3
python-decouple==3.8
python-multipart==0.0.9
starlette==0.36.3
uvicorn==0.27.1
On the command line use the following command to install the dependencies:
pip install -r requirements.txt
On the command line, you will see the packages installing one after the other. Playwright also needs a browser binary, which pip does not download, so run playwright install chromium once the packages are installed. When that is done, you can move on to the next phase of building your Job Scraper API.
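If you want to confirm that the installation worked, a short script can report which packages are present. This is only a sketch: the REQUIRED list is copied by hand from requirements.txt, and installed_versions is a helper name invented for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from requirements.txt.
REQUIRED = [
    "fastapi", "uvicorn", "playwright", "langchain", "langchain-aws",
    "pydantic", "boto3", "python-decouple", "python-multipart",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(REQUIRED).items():
    print(f"{name:20} {installed or 'MISSING'}")
```

Any line that prints MISSING points to a package that pip did not install into the active environment.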
Creating our Job Scraper API
The code for your Job Scraper API will be broken apart into several small functions, each with one task that it performs well. Your Job Scraper API does two main things, each split across multiple functions. The first scrapes a job board's career page to get the list of job URLs available on the page. The second visits each job URL and extracts the job information.
To start, you will need to define your Pydantic schemas for validating the data. Create a new file schemas.py:
Python
# schemas.py
from typing import Union, List, Dict

from pydantic import BaseModel, Field


class JobInformationURL(BaseModel):
    urls: Union[List[str], None] = Field(description="List of job_url")


class JobInformationSchema(BaseModel):
    job_description: Union[str, None] = Field(description="Job description")
    job_title: Union[str, None] = Field(description="Job title")
    company_name: Union[str, None] = Field(description="Job company")
    company_website: Union[str, None] = Field(description="Job company website")
    location_type: Union[str, None] = Field(description="Job location type")
    location: Union[List[str], None] = Field(description="The List of location")
    apply_url: Union[str, None] = Field(description="Job url")
    commitment: Union[str, None] = Field(description="The List of commitment. It is either 'full-time' or 'part-time' or 'contract' or 'internship'")
These schemas define the structure of the data you expect back from the model on AWS Bedrock, ensuring consistency and allowing for validation.
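To see what that validation looks like in practice, here is a small sketch that uses a trimmed copy of JobInformationSchema. It assumes Pydantic 2 (as pinned in requirements.txt), and the sample payloads are invented for illustration.

```python
from typing import List, Union

from pydantic import BaseModel, Field, ValidationError


class JobInformationSchema(BaseModel):
    # Trimmed copy of the schema above, kept short for illustration.
    job_title: Union[str, None] = Field(description="Job title")
    location: Union[List[str], None] = Field(description="The List of location")


# A well-formed payload parses into a typed object.
job = JobInformationSchema.model_validate(
    {"job_title": "Backend Engineer", "location": ["Lagos", "Remote"]}
)

# A payload with the wrong type for `location` is rejected.
try:
    JobInformationSchema.model_validate({"job_title": "Backend Engineer", "location": 42})
    rejected = False
except ValidationError:
    rejected = True

print(job.location, rejected)
```

Note that in Pydantic 2 a field declared this way, without a default, may be None but must still be present in the payload.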
Before moving on with your code, you need to create an AWS account and set up your credentials.
AWS Account and Credentials Setup
Create an AWS Account: If you don't have one already, sign up for an AWS account at https://aws.amazon.com/.
Create an IAM User:
Go to the IAM (Identity and Access Management) console in your AWS account.
Create a new user with programmatic access.
Attach the AmazonBedrockFullAccess policy to this user. Note: For production, follow the principle of least privilege and grant only necessary permissions.
Get Access Keys:
- After creating the user, you will get an Access Key ID and a Secret Access Key. Download and store these securely.
Configure AWS Credentials:
Option 1: Using environment variables (Recommended for development)
- Set the following environment variables in your .env file:
AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION=YOUR_AWS_REGION
- Replace YOUR_ACCESS_KEY_ID, YOUR_SECRET_ACCESS_KEY, and YOUR_AWS_REGION with your actual credentials and desired region (for example, us-east-1 or us-west-2).
Verify Bedrock Access
- Ensure that the region you are using has access to the Anthropic Claude 3 Haiku model in Amazon Bedrock.
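Before calling Bedrock, it can help to confirm that all three settings are actually filled in. The sketch below parses .env-style text by hand purely for illustration; the article's code reads these values with python-decouple's config, and the helper names parse_env_text and missing_settings are made up for this example.

```python
from typing import Dict, List, Mapping

REQUIRED_SETTINGS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required setting names that are absent or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Placeholder values only; never commit real keys.
sample = "AWS_ACCESS_KEY_ID=abc\nAWS_SECRET_ACCESS_KEY=\n# region not set yet\n"
print(missing_settings(parse_env_text(sample)))
```

In a real project you would pass the contents of your own .env file instead of the sample string.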
Using Anthropic Claude via Bedrock
You will be using the boto3 library to interact with AWS Bedrock. Anthropic Claude 3 Haiku is used because its 200k-token context window leaves room for the contents of a web page in a single prompt. The LangChain framework ties the pieces together: it builds the prompt, calls the model, and parses the reply. Create a prompts.py file that uses the Anthropic Claude 3 Haiku model:
# prompts.py
import asyncio

from langchain_core.prompts import PromptTemplate
from langchain_aws import ChatBedrock
from langchain.output_parsers import PydanticOutputParser
from langchain_core.exceptions import OutputParserException
from decouple import config
import boto3
from botocore.exceptions import ClientError

from .schemas import JobInformationSchema, JobInformationURL


async def extract_job_urls(home_page_html_document):
    home_page_html_document_output_parser = PydanticOutputParser(pydantic_object=JobInformationURL)
    home_page_html_document_format_instructions = home_page_html_document_output_parser.get_format_instructions()

    home_page_html_document_template = """
    Extract the List of job_urls in this home_page_html_document.
    The home_page_html_document is: {home_page_html_document}
    Format instructions: {format_instructions},
    """

    home_page_html_document_prompt = PromptTemplate(
        template=home_page_html_document_template,
        input_variables=['home_page_html_document'],
        partial_variables={'format_instructions': home_page_html_document_format_instructions},
    )

    bedrock_runtime = boto3.client('bedrock-runtime',
                                   region_name=config('AWS_DEFAULT_REGION'),
                                   aws_access_key_id=config('AWS_ACCESS_KEY_ID'),
                                   aws_secret_access_key=config('AWS_SECRET_ACCESS_KEY'))

    llm = ChatBedrock(model_id='anthropic.claude-3-haiku-20240307-v1:0', client=bedrock_runtime, model_kwargs={"temperature": 0.0})

    # Chain: fill the prompt, call the model, parse the reply into JobInformationURL
    home_page_html_document_response = home_page_html_document_prompt | llm | home_page_html_document_output_parser

    tasks = [
        home_page_html_document_response.ainvoke({"home_page_html_document": home_page_html_document})
    ]
    list_of_tasks = await asyncio.gather(*tasks)
    return list_of_tasks
This function, extract_job_urls, takes the HTML content of a job board's homepage and uses Claude 3 Haiku on AWS Bedrock to extract a list of URLs pointing to individual job postings. It uses PromptTemplate to format the request and PydanticOutputParser to structure the output.
- You import boto3 to interact with Bedrock.
- You create a Bedrock client using boto3.client('bedrock-runtime', ...).
- You use ChatBedrock from langchain_aws to create a model instance, specifying 'anthropic.claude-3-haiku-20240307-v1:0' as the model_id.
- You use the llm object to invoke the model.
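To make the "structure the output" step more concrete, the sketch below shows, with the standard library only, the kind of work an output parser does: find the JSON in the model's reply and check its shape. It is a hand-rolled illustration, not how PydanticOutputParser is implemented, and the sample reply is invented.

```python
import json
from typing import List, Optional


def parse_job_urls(model_reply: str) -> Optional[List[str]]:
    """Pull the `urls` list out of a reply that contains one JSON object."""
    start = model_reply.find("{")
    end = model_reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the model reply")
    payload = json.loads(model_reply[start:end + 1])
    urls = payload.get("urls")
    if urls is None:
        return None
    # Mirror the JobInformationURL schema: a list of strings, or null.
    if not (isinstance(urls, list) and all(isinstance(u, str) for u in urls)):
        raise ValueError("`urls` must be a list of strings or null")
    return urls


# Hypothetical reply: models sometimes wrap the JSON in extra text.
reply = 'Here is the result:\n{"urls": ["/somecompany/jobs/1", "/somecompany/jobs/2"]}'
urls = parse_job_urls(reply)
print(urls)
```

The real parser goes further by returning a JobInformationURL instance, which is why the endpoint can read url_object.urls directly.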
Next, you will create another function to extract the job information:
async def extract_job_information(html_document, apply_url):
    html_document_output_parser = PydanticOutputParser(pydantic_object=JobInformationSchema)
    html_document_format_instructions = html_document_output_parser.get_format_instructions()

    html_document_template = """
    Extract the following fields: job_description, job_title, company_name, company_website, location_type, the List of location,
    commitment, and the apply_url in this html_document.
    The job_description should not output html tags.
    The location_type is either "remote" or "onsite" or "hybrid".
    The commitment is either "full-time" or "part-time" or "contract" or "internship".
    Remove all html tags in the job_description field.
    The apply_url is {apply_url}
    The html_document is: {html_document}
    Format instructions: {format_instructions},
    """

    html_document_prompt = PromptTemplate(
        template=html_document_template,
        input_variables=['html_document', "apply_url"],
        partial_variables={'format_instructions': html_document_format_instructions},
    )

    bedrock = boto3.client('bedrock-runtime',
                           region_name=config('AWS_DEFAULT_REGION'),
                           aws_access_key_id=config('AWS_ACCESS_KEY_ID'),
                           aws_secret_access_key=config('AWS_SECRET_ACCESS_KEY'))

    llm = ChatBedrock(model_id='anthropic.claude-3-haiku-20240307-v1:0', client=bedrock, model_kwargs={"temperature": 0.0})

    # Chain: fill the prompt, call the model, parse the reply into JobInformationSchema
    html_document_response = html_document_prompt | llm | html_document_output_parser

    tasks = [
        html_document_response.ainvoke({"html_document": html_document, "apply_url": apply_url})
    ]
    list_of_tasks = await asyncio.gather(*tasks)
    return list_of_tasks
This function extract_job_information extracts details like job title, description, company information, etc., from a job posting's HTML.
Now, let's create your web scraper. Create a new file scraper.py:
Python
# scraper.py
from typing import List

from playwright.async_api import async_playwright, Page


class WebScraper:
    def __init__(self, url: str, is_paginated: bool = False):
        self.url = url
        self.is_paginated = is_paginated

    async def scrape(self) -> List[str]:
        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(
                headless=True,  # Ensure headless mode
                args=[
                    '--no-sandbox',
                    '--disable-setuid-sandbox',
                    '--disable-dev-shm-usage'
                ]
            )
            page = await browser.new_page()
            # timeout=0 disables the navigation timeout
            await page.goto(self.url, wait_until="domcontentloaded", timeout=0)

            if self.is_paginated:
                documents = await self._scrape_paginated(page)
            else:
                documents = [await self._scrape_single_page(page)]

            await browser.close()
            return documents

    async def _scrape_single_page(self, page: Page) -> str:
        return await page.content()

    async def _scrape_paginated(self, page: Page) -> List[str]:
        documents = []
        page_number = 1
        while True:
            try:
                document = await self._scrape_single_page(page)
                documents.append(document)
                # print(f"Scraped page {page_number}")
                # Click the button labelled with the next page number;
                # this raises once no such button exists, which ends the loop
                await page.get_by_role("button", name=f"{page_number + 1}").click()
                await page.wait_for_load_state(state="domcontentloaded", timeout=0)
                page_number += 1
                # await asyncio.sleep(2)  # To avoid overwhelming the server
            except Exception as e:
                print(f"Error on page {page_number}: {str(e)}")
                break
        return documents
The WebScraper class uses Playwright to automate a Chromium browser. It can handle both single-page and paginated websites. The scrape method is the main entry point, while _scrape_single_page and _scrape_paginated handle specific scraping logic.
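The stop condition of the pagination loop is easier to see without a browser. The sketch below replays the same control flow against a stand-in page object; FakePage, click_page_button, and collect_pages are invented for this illustration and are not part of Playwright.

```python
import asyncio
from typing import List


class FakePage:
    """Stand-in for a Playwright page that serves a fixed list of pages."""

    def __init__(self, pages: List[str]):
        self._pages = pages
        self._index = 0

    async def content(self) -> str:
        return self._pages[self._index]

    async def click_page_button(self, number: int) -> None:
        # Page buttons are 1-based. A missing button raises, standing in
        # for the error a Playwright click raises when nothing matches.
        if number > len(self._pages):
            raise TimeoutError(f"no button labelled {number}")
        self._index = number - 1


async def collect_pages(page: FakePage) -> List[str]:
    documents = []
    page_number = 1
    while True:
        documents.append(await page.content())
        try:
            await page.click_page_button(page_number + 1)
        except TimeoutError:
            break  # no further page button: pagination is finished
        page_number += 1
    return documents


documents = asyncio.run(collect_pages(FakePage(["<p>1</p>", "<p>2</p>", "<p>3</p>"])))
print(len(documents))
```

As in _scrape_paginated, the current page is saved before the click is attempted, so the last page is kept even though the click after it fails.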
The next step is to create some utility functions. Create a utils.py file:
Python
# utils.py
import re
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = []

    def handle_data(self, d):
        self.text.append(d)

    def get_data(self):
        return ''.join(self.text)


def remove_html_tags(text):
    """
    Remove HTML tags from a string, handling nested tags and preserving text content.

    Args:
        text (str): The input string containing HTML tags

    Returns:
        str: The input string with HTML tags removed
    """
    if not isinstance(text, str):
        raise TypeError("Input must be a string")

    # First, use regex to remove script and style elements and their contents
    text = re.sub(r'(?is)<(script|style).*?>.*?</\1>', '', text)

    # Then use HTMLParser to handle the remaining HTML
    stripper = MLStripper()
    stripper.feed(text)
    return stripper.get_data()


def fix_url(url):
    # Prefix a relative Greenhouse path with the board's base URL
    return f'https://boards.greenhouse.io{url}'
The remove_html_tags function removes HTML tags from a string, which is useful for cleaning up the job description text. The fix_url function helps construct complete URLs from relative paths found on Greenhouse pages.
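Here is a quick, self-contained check of both ideas. It restates the tag stripper in compact form and, as an alternative to fix_url, uses urljoin from the standard library, which resolves relative paths and leaves absolute URLs untouched. The names strip_tags, absolute_url, and GREENHOUSE_BASE are made up for this sketch.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

GREENHOUSE_BASE = "https://boards.greenhouse.io"


class _TextCollector(HTMLParser):
    """Collect only the text nodes of a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    # Drop <script>/<style> blocks together with their contents first.
    html = re.sub(r"(?is)<(script|style).*?>.*?</\1>", "", html)
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


def absolute_url(url: str) -> str:
    # urljoin keeps absolute URLs as they are and resolves relative
    # paths against the base.
    return urljoin(GREENHOUSE_BASE, url)


print(strip_tags("<p>Hello <b>World</b></p><script>var x = 1;</script>"))
print(absolute_url("/somecompany/jobs/123"))
print(absolute_url("https://example.com/jobs/1"))
```

With a helper like absolute_url, the caller would not need to check whether a URL already starts with https:// before fixing it.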
Finally, create a new file and call it main.py. After that, you will import the libraries that will help with the writing of this project.
import asyncio
from typing import Annotated, Union
from fastapi import FastAPI, status, Form, Header, Response, Body
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from .prompts import extract_job_information, extract_job_urls
from .scraper import WebScraper
from .utils import fix_url, remove_html_tags
from decouple import config
The fastapi library makes it easy to build APIs in Python. You import functions from prompts.py file, scraper.py file, and utils.py file.
Now, let's tie everything together in your main.py file:
Python
# main.py continued...
app = FastAPI()

origins = [
    "http://localhost:3000",
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    # allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class URL(BaseModel):
    url: str


@app.post("/jobs/", status_code=status.HTTP_201_CREATED)
async def scrape_job_description(url: URL, response: Response):
    url = url.url
    if 'greenhouse' not in url:
        response.status_code = status.HTTP_400_BAD_REQUEST
        return {"data": "Error", "status": status.HTTP_400_BAD_REQUEST, "message": "This is not a greenhouse career page. Nothing to scrape."}

    # Step 1: scrape the career page (and any further pages of results)
    scraper1 = WebScraper(url=url, is_paginated=True)
    documents = await scraper1.scrape()

    # Step 2: ask the model for the job URLs on each scraped page
    page_tasks = []
    for document in documents:
        page_tasks.append(extract_job_urls(home_page_html_document=document))
    items = await asyncio.gather(*page_tasks)

    total_urls = []
    for item in items:
        for url_object in item:
            if url_object.urls is None:
                response.status_code = status.HTTP_400_BAD_REQUEST
                return {"data": "Error", "status": status.HTTP_400_BAD_REQUEST, "message": "The career page is empty. Nothing to scrape."}
            total_urls.extend(url_object.urls)

    # Step 3: scrape each job page and queue the extraction of its details
    tasks = []
    for url in total_urls:
        if 'https://' in url:
            scraper2 = WebScraper(url=url)
            documents = await scraper2.scrape()
            soup = remove_html_tags(documents[0])
            tasks.append(extract_job_information(html_document=soup, apply_url=url))
        else:
            fixed_url = fix_url(url=url)
            scraper2 = WebScraper(url=fixed_url)
            documents = await scraper2.scrape()
            soup = remove_html_tags(documents[0])
            tasks.append(extract_job_information(html_document=soup, apply_url=fixed_url))

    list_of_tasks = await asyncio.gather(*tasks)

    # Step 4: flatten the per-job result lists into one list
    cleaned_list = []
    for job_list in list_of_tasks:
        for job_info in job_list:
            cleaned_list.append(job_info)
    return {"data": cleaned_list, "status": status.HTTP_201_CREATED}
This code defines the main FastAPI application and the /jobs/ endpoint. It first validates that the input URL is a Greenhouse career page. Then, it uses WebScraper to scrape the main page and extract_job_urls to get a list of individual job URLs. It scrapes each job page, extracts the information using extract_job_information, and finally returns the structured data.
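In the endpoint, each job page is scraped with await inside the for loop, so the pages load one after another. One possible refinement, sketched below under the assumption that the job board tolerates a few parallel requests, is to run the scrapes concurrently with a cap. The helper gather_limited and the stand-in fake_scrape are invented for this example; fake_scrape sleeps instead of opening a browser.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    urls: List[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int = 3,
) -> List[str]:
    """Run worker(url) for every URL with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(url: str) -> str:
        async with semaphore:
            return await worker(url)

    # gather returns results in the same order as `urls`.
    return await asyncio.gather(*(run_one(url) for url in urls))


async def fake_scrape(url: str) -> str:
    # Hypothetical stand-in for WebScraper(url).scrape().
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(gather_limited([f"/jobs/{i}" for i in range(5)], fake_scrape))
print(len(pages))
```

The cap matters because every WebScraper.scrape call launches its own Chromium instance, so unbounded concurrency could exhaust memory on a small machine.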
Testing and Deploying the Job Scraper API
To test that your API is working properly, you run the script from the command line. You should make sure that your computer is connected to the internet.
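First, go to the project directory:
Plaintext
cd job_scraper
Then activate the virtual environment. You should see the (environment) prompt prefix that was explained earlier in this article.
Make sure the required packages are installed:
Plaintext
pip install -r requirements.txt
Then, start the API server using uvicorn. Because main.py and prompts.py use relative imports (for example, from .prompts import ...), the app has to be loaded as part of the job_scraper package, so run uvicorn from the parent directory:
Plaintext
cd ..
uvicorn job_scraper.main:app --reload
job_scraper.main refers to the file main.py inside the job_scraper directory, and app refers to the FastAPI instance created inside it. --reload enables auto-reloading so the server restarts when you make changes to the code.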
You can then use a tool like curl or Postman to send a POST request to http://localhost:8000/jobs/ with a JSON payload like this:
JSON
{
"url": "https://boards.greenhouse.io/somecompany"
}
Replace "https://boards.greenhouse.io/somecompany" with a real Greenhouse career page URL.
The API should respond with a JSON payload containing the scraped job information if successful. Otherwise, it will return an appropriate error message.
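If you would rather test from Python than from curl or Postman, the request can be built with the standard library. This is a minimal sketch: build_jobs_request and post_jobs are names made up for the example, and the default address assumes the server is running locally on port 8000 as described above.

```python
import json
import urllib.request


def build_jobs_request(
    career_page_url: str,
    api_url: str = "http://localhost:8000/jobs/",
) -> urllib.request.Request:
    """Build the POST request that the /jobs/ endpoint expects."""
    body = json.dumps({"url": career_page_url}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_jobs(career_page_url: str) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_jobs_request(career_page_url)
    # Scraping every job page can take a while, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())


request = build_jobs_request("https://boards.greenhouse.io/somecompany")
print(request.get_method(), request.full_url)
```

With the server running, calling post_jobs with a real Greenhouse career page URL returns the same JSON payload you would see in Postman.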
What’s Next
You now have a functioning Job Scraper API. You can improve your API by making it handle errors more robustly. You can also develop another API that supports other job boards. I encourage you to play around with FastAPI, Playwright, and Langchain to find out more interesting things that you can do.
This comprehensive guide provides a solid foundation for building and deploying a job scraping API.
You can reach me on Twitter IyanuAshiri or check out my GitHub profile iyanuashiri.