Extractor API Documentation
Description:
This API extracts structured data from web pages, including text, HTML content, page titles, images, videos, and social media metadata. It is aimed at developers who want to integrate web scraping into an application and at data analysts who need to collect and process web content. The API accepts both POST and GET requests.
Base URL: https://kafkai.io/api/v1.0/extractor-api/
Authentication
To access this API, you need to be authenticated as a registered user. User authentication is required for both POST and GET requests.
Note: You must obtain an access token to authenticate your requests.
Endpoint Overview
POST Method
- Endpoint: /
- Description: Extract data from a web page by sending a POST request with the URL you want to extract.
- Input:
  - url (string, required): The URL of the web page to extract.
  - sanitize (boolean, optional): Set to true to sanitize the extracted text with AI (default is false).
Note: Ensure you include the Authorization header with your access token for user authentication.
Example
Suppose you want to extract data from the web page "https://google.com" and sanitize the text using AI. Here's how you can do it using curl:
curl -X POST \
  -H "Authorization: Token <your-api-key>" \
  -d "url=https://google.com" \
  -d "sanitize=true" \
  https://kafkai.io/api/v1.0/extractor-api/
You can achieve the same using Python Requests:
import requests

# Endpoint for new extractions (POST)
url = "https://kafkai.io/api/v1.0/extractor-api/"
headers = {
    "Authorization": "Token <your-api-key>",
}
# Form fields: the page to extract and the optional AI sanitization flag
data = {
    "url": "https://google.com/",
    "sanitize": True,
}
response = requests.post(url, headers=headers, data=data)
print(response.json())
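One detail worth noting: the curl example sends sanitize=true in lowercase, whereas Requests form-encodes the Python value True as "True". This document does not say whether the API treats both spellings alike, so the following is a minimal sketch that encodes the flag explicitly to match the curl example. It uses only the standard library, and the helper name build_form_body is hypothetical, not part of the API.

```python
from urllib.parse import urlencode


def build_form_body(page_url: str, sanitize: bool = False) -> str:
    """Form-encode the POST parameters (hypothetical helper, not part of the API)."""
    fields = {"url": page_url}
    if sanitize:
        # Send lowercase "true" to match the curl example in this document.
        fields["sanitize"] = "true"
    return urlencode(fields)


body = build_form_body("https://google.com/", sanitize=True)
print(body)
```

An equivalent approach with Requests is to pass "sanitize": "true" as a string in the data dictionary rather than the boolean True.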
GET Method
- Endpoint: /<uid>/
- Description: Retrieve previously extracted data based on a unique identifier (uid).
- Input:
  - uid (string, required): The unique identifier of the previously extracted data.
Note: Ensure you include the Authorization header with your access token for user authentication.
Example
Suppose you want to retrieve previously extracted data with the unique identifier "YOUR_UID". Here's how you can do it using curl:
curl -H "Authorization: Token <your-api-key>" https://kafkai.io/api/v1.0/extractor-api/YOUR_UID/
You can achieve the same using Python Requests:
import requests
url = "https://kafkai.io/api/v1.0/extractor-api/YOUR_UID/"
headers = {
    "Authorization": "Token <your-api-key>",
}
response = requests.get(url, headers=headers)
print(response.json())
Response
Both POST and GET requests return a JSON response containing the extracted data. The structure of the response is as follows:
{
  "url": "https://google.com",
  "title": "The internet life Page",
  "text": "This is the extracted text content.",
  "html": "<html>...</html>",
  "images_list": ["https://google.com/image1.jpg", "https://google.com/image2.jpg"],
  "videos_list": ["https://google.com/video1.mp4"],
  "social_meta_data": {
    "twitter": {
      "title": "Twitter Content Title",
      "description": "Twitter Content Description"
    }
  },
  "page_meta_data": {
    "description": "Page Description",
    "charset": "UTF-8"
  }
}
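As an illustration of working with this structure, here is a minimal sketch that maps the response onto a small Python dataclass. The class name ExtractedPage and the function parse_extraction are hypothetical, and the sketch assumes that any field may be absent for a given page, which this document does not confirm.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ExtractedPage:
    """Illustrative container for the response fields shown above."""
    url: str = ""
    title: str = ""
    text: str = ""
    html: str = ""
    images: list = field(default_factory=list)
    videos: list = field(default_factory=list)
    social_meta: dict = field(default_factory=dict)
    page_meta: dict = field(default_factory=dict)


def parse_extraction(payload: dict) -> ExtractedPage:
    # Use .get() with defaults on the assumption that fields can be missing.
    return ExtractedPage(
        url=payload.get("url", ""),
        title=payload.get("title", ""),
        text=payload.get("text", ""),
        html=payload.get("html", ""),
        images=payload.get("images_list", []),
        videos=payload.get("videos_list", []),
        social_meta=payload.get("social_meta_data", {}),
        page_meta=payload.get("page_meta_data", {}),
    )


# Abbreviated copy of the sample response from this section.
sample = json.loads("""
{
  "url": "https://google.com",
  "title": "The internet life Page",
  "text": "This is the extracted text content.",
  "images_list": ["https://google.com/image1.jpg", "https://google.com/image2.jpg"],
  "social_meta_data": {"twitter": {"title": "Twitter Content Title"}}
}
""")
page = parse_extraction(sample)
print(page.title, len(page.images))
```

In practice, the dictionary returned by response.json() in the Requests examples could be passed to parse_extraction in place of the sample.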
Notes
- For both the POST and GET methods, ensure you include the Authorization header with a valid access token for user authentication.
- The sanitize parameter in the POST request allows you to clean the extracted text content using AI. Set it to true for AI-based text cleaning.
- The uid parameter in the GET request is the unique identifier of the previously extracted data.
- The API will return a 404 Not Found response if the specified uid is not found.
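The 404 behaviour can be handled explicitly on the client side. The following is a minimal sketch using only the Python standard library. The function names build_get_request and fetch_extraction are hypothetical, and returning None for an unknown uid is one possible design choice rather than anything the API prescribes.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://kafkai.io/api/v1.0/extractor-api/"


def build_get_request(uid: str, token: str) -> urllib.request.Request:
    """Build the authenticated GET request for a previously extracted page."""
    return urllib.request.Request(
        f"{BASE_URL}{uid}/",
        headers={"Authorization": f"Token {token}"},
    )


def fetch_extraction(uid: str, token: str):
    """Return the parsed JSON response, or None if the uid is not found (404)."""
    try:
        with urllib.request.urlopen(build_get_request(uid, token), timeout=30) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # The uid does not match any stored extraction.
            return None
        # Other errors (for example, failed authentication) are re-raised.
        raise
```

A caller can then test the result against None to distinguish an unknown uid from a successful lookup, while any other HTTP error still surfaces as an exception.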