Extractor API Documentation

Description:

This API extracts structured data from web pages: text, HTML content, page titles, images, videos, and social media metadata. It is aimed at developers who want to add web scraping to an application and at data analysts who need web content. You submit a page for extraction with a POST request and retrieve previously extracted data with a GET request.

Base URL: https://kafkai.io/api/v1.0/extractor-api/

Authentication

To access this API, you need to be authenticated as a registered user. User authentication is required for both POST and GET requests.

Note: You must obtain an access token and send it in the Authorization header of every request, in the form "Token <your-api-key>".
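
As a minimal sketch, the Authorization header used by the examples in this document can be built once and reused for every request. The helper name build_auth_headers and the environment variable KAFKAI_API_TOKEN are assumptions made for illustration; they are not part of the API.

```python
import os


def build_auth_headers(token: str) -> dict:
    """Return the Authorization header in the "Token <key>" form used in this document."""
    token = (token or "").strip()
    if not token:
        raise ValueError("an access token is required")
    return {"Authorization": f"Token {token}"}


# Hypothetical: read the token from an environment variable instead of hard-coding it.
headers = build_auth_headers(os.environ.get("KAFKAI_API_TOKEN", "<your-api-key>"))
```

The resulting dictionary can be passed as the headers argument of requests.post or requests.get.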

Endpoint Overview

POST Method

  • Endpoint: /
  • Description: Extract data from a web page by sending a POST request with the URL of the page you want to extract.
  • Input:
    • url (string, required): The URL of the web page to extract.
    • sanitize (boolean, optional): Set to true to sanitize the extracted text with AI (default is false).

Note: Ensure you include the Authorization header with your access token for user authentication.
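
The two inputs above can be assembled and checked before a request is sent. The sketch below is one way to do that; the function name build_extract_payload is hypothetical. It writes sanitize as the lowercase string "true", mirroring the curl example in this document, because this document does not say which other boolean spellings the server accepts.

```python
from urllib.parse import urlparse


def build_extract_payload(url: str, sanitize: bool = False) -> dict:
    """Build the form fields for the POST endpoint: url (required), sanitize (optional)."""
    parsed = urlparse(url)
    # Reject values that are not absolute http(s) URLs before contacting the API.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")

    payload = {"url": url}
    if sanitize:
        # Assumption: lowercase "true", as in the curl example.
        payload["sanitize"] = "true"
    return payload
```

When sanitize is left at False the field is omitted, which relies on the documented default of false.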

Example

Let's say you want to extract data from the web page "https://google.com" and sanitize the text using AI. Here's how you can do it using curl:

curl -X POST -H "Authorization: Token <your-api-key>" -d "url=https://google.com" \
  -d "sanitize=true" https://kafkai.io/api/v1.0/extractor-api/

You can achieve the same using Python Requests:

import requests

url = "https://kafkai.io/api/v1.0/extractor-api/"
headers = {
    "Authorization": "Token <your-api-key>",
}
data = {
    "url": "https://google.com/",
    "sanitize": True,
}

response = requests.post(url, headers=headers, data=data)
print(response.json())

GET Method

  • Endpoint: /<uid>/
  • Description: Retrieve previously extracted data based on a unique identifier (uid).
  • Input:
    • uid (string, required): The unique identifier of the previously extracted data.

Note: Ensure you include the Authorization header with your access token for user authentication.

Example

Suppose you want to retrieve previously extracted data with the unique identifier "YOUR_UID". Here's how you can do it using curl:

curl -H "Authorization: Token <your-api-key>" https://kafkai.io/api/v1.0/extractor-api/YOUR_UID/

You can achieve the same using Python Requests:

import requests

url = "https://kafkai.io/api/v1.0/extractor-api/YOUR_UID/"
headers = {
    "Authorization": "Token <your-api-key>",
}

response = requests.get(url, headers=headers)
print(response.json())

Response

The API returns a JSON response containing the extracted data for both POST and GET requests. The structure of the response is as follows:

{
    "url": "https://google.com",
    "title": "The internet life Page",
    "text": "This is the extracted text content.",
    "html": "<html>...</html>",
    "images_list": ["https://google.com/image1.jpg", "https://google.com/image2.jpg"],
    "videos_list": ["https://google.com/video1.mp4"],
    "social_meta_data": {
        "twitter": {
            "title": "Twitter Content Title",
            "description": "Twitter Content Description"
        }
    },
    "page_meta_data": {
        "description": "Page Description",
        "charset": "UTF-8"
    }
}
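
To show how the fields above might be consumed, here is a small sketch that reduces a decoded response to a summary. The function name summarize_extraction is hypothetical. It reads every field with .get() because this document does not state which fields are always present.

```python
def summarize_extraction(payload: dict) -> dict:
    """Condense a decoded extractor response into a few headline values."""
    social = payload.get("social_meta_data") or {}
    page_meta = payload.get("page_meta_data") or {}
    return {
        "title": payload.get("title", ""),
        "text_length": len(payload.get("text") or ""),
        "image_count": len(payload.get("images_list") or []),
        "video_count": len(payload.get("videos_list") or []),
        # Names of the social networks that supplied metadata, e.g. "twitter".
        "social_networks": sorted(social),
        "description": page_meta.get("description"),
    }
```

With the Requests examples in this document, it would be called as summarize_extraction(response.json()).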

Notes

  • For both the POST and GET methods, include the Authorization header with a valid access token.
  • The sanitize parameter in the POST request cleans the extracted text content using AI when set to true. It defaults to false.
  • The uid parameter in the GET request is the unique identifier of the previously extracted data.
  • The API will return a 404 Not Found response if the specified uid is not found.
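
The 404 case can be handled separately from other failures. The sketch below assumes a requests.Session (or any object with a compatible get() method) is passed in; the function name fetch_extraction and the 30-second timeout are illustrative choices, not part of the API.

```python
BASE_URL = "https://kafkai.io/api/v1.0/extractor-api/"


def fetch_extraction(session, uid: str, token: str):
    """Return the extracted data for uid, or None if the API answers 404 Not Found."""
    response = session.get(
        f"{BASE_URL}{uid}/",
        headers={"Authorization": f"Token {token}"},
        timeout=30,  # illustrative value
    )
    if response.status_code == 404:
        # Documented behaviour: the specified uid was not found.
        return None
    # Any other error status (for example a rejected token) raises an exception.
    response.raise_for_status()
    return response.json()
```

With Requests installed, this could be called as fetch_extraction(requests.Session(), "YOUR_UID", "<your-api-key>").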