Crawling a web page in Kotlin with multiple threads and async programming (Kotlin web crawler, async)

To crawl web pages in Kotlin using multiple threads and async programming, you can use Kotlin coroutines for concurrency, together with a library like Jsoup to fetch and parse HTML. Kotlin's Dispatchers.IO lets you run the network operations efficiently on a thread pool optimized for blocking IO.

Below is an example of how you can implement web crawling with multi-threading and asynchronous behavior using Kotlin coroutines and the Jsoup library.

Steps:

  1. Use Coroutines: To handle concurrency.
  2. Use Jsoup: To parse and extract data from HTML.
  3. Use Dispatchers.IO: To perform network tasks asynchronously.
  4. Launch multiple coroutines: For crawling multiple URLs concurrently.

Example Code:

import kotlinx.coroutines.*
import org.jsoup.Jsoup
import java.io.IOException

// Function to fetch and parse the HTML of a webpage
suspend fun crawlPage(url: String): String? = withContext(Dispatchers.IO) {
    try {
        println("Crawling: $url on thread ${Thread.currentThread().name}")
        val doc = Jsoup.connect(url).get()
        val title = doc.title()  // Extract the title of the page (or any data you want)
        return@withContext title
    } catch (e: IOException) {
        println("Error fetching the page: $url")
        return@withContext null
    }
}

// Function to crawl multiple pages concurrently
fun crawlMultiplePages(urls: List<String>) {
    runBlocking {
        val jobs = urls.map { url ->
            async {
                val result = crawlPage(url)
                result?.let {
                    println("Title of $url: $it")
                }
            }
        }
        jobs.awaitAll()  // Wait for all crawls to finish
    }
}

fun main() {
    // List of URLs to crawl
    val urls = listOf(
        "https://www.example.com",
        "https://www.wikipedia.org",
        "https://kotlinlang.org",
        "https://www.github.com",
        "https://news.ycombinator.com"
    )

    println("Starting web crawling with multiple threads and async...")
    crawlMultiplePages(urls)
    println("Finished web crawling.")
}

Explanation:

  1. crawlPage(): This function fetches a web page using Jsoup.connect(url).get() and extracts its title. It runs on the Dispatchers.IO coroutine dispatcher, which is optimized for IO-bound tasks such as networking.

  2. crawlMultiplePages(): This function accepts a list of URLs and launches an async coroutine for each one, so the pages are crawled concurrently. jobs.awaitAll() suspends until every crawl has completed before proceeding further.

  3. runBlocking {}: This is used in crawlMultiplePages() to block the main thread until all asynchronous tasks are completed.

  4. Dispatchers.IO: This is used to offload the network tasks to a thread pool that is optimized for blocking IO operations.
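crawlPage() above extracts only the page title, but Jsoup's selector API can pull out other elements the same way. A minimal sketch, parsing a static HTML snippet instead of fetching a live page (in the crawler, `doc` would come from Jsoup.connect(url).get()):

```kotlin
import org.jsoup.Jsoup

fun main() {
    // A small inline document standing in for a fetched page
    val html = """
        <html><head><title>Example</title></head>
        <body>
          <a href="https://kotlinlang.org">Kotlin</a>
          <img src="/logo.png" alt="logo">
        </body></html>
    """.trimIndent()
    val doc = Jsoup.parse(html)

    println("Title: ${doc.title()}")
    // Select every anchor tag that has an href attribute
    doc.select("a[href]").forEach { link ->
        println("Link: ${link.attr("href")} (${link.text()})")
    }
    // Select every image tag and read its src attribute
    doc.select("img[src]").forEach { img ->
        println("Image: ${img.attr("src")}")
    }
}
```

select() takes standard CSS selectors, so the same pattern works for headings, tables, or any other element you want to extract.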

Output:

When run, the program will:

  • Fetch each URL concurrently using multiple threads.
  • Print the title of each page (or any other extracted data).
  • Handle errors such as timeouts or failed connections.

Dependencies:

To run this code, you need to add Jsoup and Kotlin Coroutines to your dependencies.

  1. Maven dependencies:
<dependency>
    <groupId>org.jetbrains.kotlinx</groupId>
    <artifactId>kotlinx-coroutines-core</artifactId>
    <version>1.6.4</version> <!-- Version may vary -->
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version> <!-- Version may vary -->
</dependency>
  2. Gradle dependencies:
implementation "org.jetbrains.kotlinx:kotlinx-coroutines-core:1.6.4"
implementation "org.jsoup:jsoup:1.14.3"
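
If your project uses the Gradle Kotlin DSL (build.gradle.kts) rather than the Groovy DSL above, the equivalent declarations use function-call syntax:

```kotlin
dependencies {
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core:1.6.4") // Version may vary
    implementation("org.jsoup:jsoup:1.14.3") // Version may vary
}
```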

Enhancements:

  • Error handling: You can extend the error handling to retry failed requests or log errors through a proper logging framework instead of println.
  • Data extraction: Instead of just fetching the title, you can extract other data such as links, images, etc., using Jsoup's DOM parsing methods.
  • Rate limiting: If you're crawling many pages, you might want to implement rate limiting to avoid overwhelming the target servers.
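
The retry and rate-limiting ideas above can be sketched as follows. This is a minimal sketch, not a drop-in implementation: the maxRetries default, the backoff delay, and the permit count of 3 are illustrative assumptions, and kotlinx.coroutines' Semaphore is used to cap how many requests run at once.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit
import org.jsoup.Jsoup
import java.io.IOException

// Allow at most 3 requests in flight at any time (illustrative limit)
val requestLimit = Semaphore(permits = 3)

// Fetch a page title, retrying on IO errors with a simple linear backoff
suspend fun crawlWithRetry(url: String, maxRetries: Int = 3): String? =
    withContext(Dispatchers.IO) {
        requestLimit.withPermit {
            repeat(maxRetries) { attempt ->
                try {
                    return@withPermit Jsoup.connect(url).get().title()
                } catch (e: IOException) {
                    println("Attempt ${attempt + 1} failed for $url: ${e.message}")
                    delay(1000L * (attempt + 1))  // wait longer after each failure
                }
            }
            null  // give up after maxRetries attempts
        }
    }
```

In the main example, crawlPage(url) could then be replaced by crawlWithRetry(url) without changing the rest of the code.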

This implementation provides a simple way to perform web crawling in Kotlin using coroutines and asynchronous programming.
