.halguru-webscraping.yaml: |
Represents the configuration settings for a website crawler or scraper. |
MaxLevel |
The maximum level (link depth) to process when crawling the website. |
MaxPages |
The maximum number of pages to process for the website. |
Pages[].ContainsText |
Only web pages that contain this text will be processed. |
Pages[].ContainsXpath |
If defined, ContainsText is checked only within the content matched by this XPath. |
Pages[].ContentXPath |
If not empty, only the HTML elements matched by this XPath are processed. |
Pages[].Description |
Provides details or information about the web page or a specific element within the web page. |
Pages[].DisabledTags[] |
List of HTML tags that should be disabled or ignored during processing. |
Pages[].Features[].IncludeHtml |
Determines if the raw HTML representation of a specific web feature is extracted and added to the feature's output during processing. |
Pages[].Features[].IncludeText |
Controls whether the extracted plain text, processed via relevant scraping logic, is added to the resulting feature output. |
Pages[].Features[].Name |
The name of the website feature. |
Pages[].Features[].NameRegex |
The regular expression pattern to identify the name component of a website feature. |
Pages[].Features[].NameXpath |
The XPath expression used to locate the name of a specific feature within the website content. |
Pages[].Features[].NormalizeWhitespaces |
When enabled, consecutive whitespace characters are collapsed into a single space, facilitating cleaner and more standardized output after web scraping. |
Pages[].Features[].RemoveHtmlAttributes |
When enabled, attributes are stripped from HTML elements in the feature's output, giving cleaner extracted data. |
Pages[].Features[].RemoveHtmlTags |
Determines if the raw HTML content will have tags stripped for plain text processing. |
Pages[].Features[].TagName |
The tag name used to identify or categorize the website feature. |
Pages[].Features[].ValueRegex |
The regular expression used to extract specific value matches from the HTML content of a website. |
Pages[].Features[].ValueXpath |
The XPath expression used to locate and extract the value of a specific feature within a website's HTML content. |
Pages[].Features[] |
Represents a collection of features extracted or associated with a specific web page, defining key elements or properties of interest within the page. |
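As a rough illustration of how the feature settings might fit together, here is a hypothetical fragment. It assumes the `Pages[].Features[]` key paths map to nested YAML lists in the usual way. The page name, feature name, XPath, and regular expression are invented for the example and do not come from a real configuration.

```yaml
# Hypothetical sketch: every value below is an illustrative assumption.
Pages:
  - Name: product                        # assumed page name
    Features:
      - Name: price                      # fixed feature name
        ValueXpath: "//span[@class='price']"   # where the value is located
        ValueRegex: '[0-9]+(\.[0-9]+)?'  # assumed to keep only the numeric match
        IncludeText: true                # add extracted plain text to the output
        IncludeHtml: false               # leave raw HTML out of the output
        NormalizeWhitespaces: true       # collapse runs of whitespace
```

How `ValueXpath` and `ValueRegex` interact when both are set is not stated here, so the combination shown is a guess worth verifying against the tool's behavior.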
Pages[].Files[].Name |
The name of the website file. |
Pages[].Files[].NameRegex |
The regular expression pattern to identify the name component of a website file. |
Pages[].Files[].NameXpath |
The XPath expression used to locate the name of a specific file within the website content. |
Pages[].Files[].TagName |
The tag name used to identify or categorize the website file. |
Pages[].Files[].UrlRegex |
The regular expression pattern for matching URLs associated with the website file. |
Pages[].Files[].UrlXpath |
The XPath expression used to locate the URL of a website file within the page content. |
Pages[].Files[] |
Represents a collection of files associated with the webpage for processing or extraction. |
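A file entry might look like the following hypothetical fragment. It assumes the `Pages[].Files[]` key paths nest as ordinary YAML lists. The page name, file name, XPath, and regular expression are made up for illustration.

```yaml
# Hypothetical sketch: every value below is an illustrative assumption.
Pages:
  - Name: downloads                # assumed page name
    Files:
      - Name: manual               # fixed file name
        UrlXpath: "//a/@href"      # candidate links on the page
        UrlRegex: '\.pdf$'         # assumed to match only URLs ending in .pdf
```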
Pages[].IncludeHtml |
Indicates whether the HTML content of the web page should be included in the output during the web scraping process. |
Pages[].IncludeText |
Determines whether the textual content of a web page should be included during the web scraping process. |
Pages[].Name |
The name of the web page or a specific element within the web page. |
Pages[].NormalizeWhitespaces |
Indicates whether whitespaces should be normalized in the text content extracted from an HTML node. |
Pages[].RemoveHtmlAttributes |
Indicates whether HTML attributes should be removed during the web scraping process. |
Pages[].RemoveHtmlTags |
Indicates whether HTML tags should be removed from the content of a web page during processing. |
Pages[].TagName |
The tag name of an HTML element or feature within a website page. |
Pages[].UrlContains |
Filter to process only web pages whose URL contains the specified substring. |
Pages[] |
The collection of page configurations for the website. |
StartUrl |
The starting URL for the website. |
UrlsStartWith[] |
The collection of URL prefixes used to filter and process relevant URLs. |
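Putting the top-level keys together, a minimal `.halguru-webscraping.yaml` might look roughly like this sketch. It assumes the key paths listed here map directly onto nested YAML. The URLs, limits, and page settings are placeholder values chosen for illustration only.

```yaml
# Hypothetical sketch: every value below is an illustrative assumption.
StartUrl: "https://example.com/docs/"    # where crawling begins
UrlsStartWith:
  - "https://example.com/docs/"          # only follow URLs with this prefix
MaxLevel: 3                              # placeholder level limit
MaxPages: 100                            # placeholder page limit
Pages:
  - Name: article                        # assumed page name
    UrlContains: "/docs/"                # process only URLs with this substring
    ContainsText: "Installation"         # process only pages containing this text
    ContentXPath: "//main"               # restrict processing to this subtree
    DisabledTags:
      - script
      - style
    IncludeText: true
    RemoveHtmlTags: true
    NormalizeWhitespaces: true
```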