Pages[]
Gets or sets the collection of website page configurations.
Pages:
  - Name: Any text
    TagName: Any text
    Description: Any text
    IncludeHtml: true
    IncludeText: true
    NormalizeWhitespaces: true
    RemoveHtmlTags: true
    RemoveHtmlAttributes: true
    UrlContains: Any text
    ContainsText: Any text
    ContainsXpath: Any text
    ContentXPath: Any text
    DisabledTags: []
    Features: []
    Files: []
Properties
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| Name | Text | ✔️ | The name of the web page or a specific element within the web page. |
| TagName | Text | ✔️ | The tag name of an HTML element or feature within a website page. |
| Description | Text | | Provides details or information about the web page or a specific element within the web page. |
| IncludeHtml | Boolean | ✔️ | Indicates whether the HTML content of the web page should be included in the output during the web scraping process. |
| IncludeText | Boolean | ✔️ | Indicates whether the textual content of the web page should be included during the web scraping process. |
| NormalizeWhitespaces | Boolean | ✔️ | Indicates whether whitespace should be normalized in the text content extracted from an HTML node. |
| RemoveHtmlTags | Boolean | ✔️ | Indicates whether HTML tags should be removed from the content of a web page during processing. |
| RemoveHtmlAttributes | Boolean | ✔️ | Indicates whether HTML attributes should be removed during the web scraping process. |
| UrlContains | Text | | Filter: only web pages whose URL contains the specified substring are processed. |
| ContainsText | Text | | Filter: only web pages that contain the specified text are processed. |
| ContainsXpath | Text | | If defined, ContainsText is checked only within the content matched by this XPath. |
| ContentXPath | Text | | If not empty, only the HTML elements matched by this XPath are used as content. |
| DisabledTags | List | ✔️ | List of HTML tags that should be disabled or ignored during processing. |
| Features | List | ✔️ | The collection of features extracted from or associated with the web page, defining the elements or properties of interest within the page. |
| Files | List | ✔️ | The collection of files associated with the web page for processing or extraction. |
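
The properties above can be combined to target different kinds of pages. The following is a minimal sketch with two entries, not a reference configuration: every value (page names, URL fragments, XPath expressions, tag names) is hypothetical, and it assumes that `DisabledTags` accepts plain HTML tag names and that optional properties can simply be omitted. `Features` and `Files` are left empty because their item structure is not described in this section.

```yaml
# Illustrative values only; the property names come from the table above,
# everything else is an assumption made for this example.
Pages:
  # Entry 1: plain-text extraction of article pages.
  - Name: Blog article
    TagName: article
    Description: Long-form posts published under the /blog/ path
    IncludeHtml: false
    IncludeText: true
    NormalizeWhitespaces: true
    RemoveHtmlTags: true
    RemoveHtmlAttributes: true
    UrlContains: /blog/            # URL filter
    ContentXPath: "//article"      # restrict content to this subtree
    DisabledTags: [script, style]  # assumed format: plain tag names
    Features: []
    Files: []

  # Entry 2: keep markup for product pages that show a purchase form.
  - Name: Product page
    TagName: product
    Description: Pages that offer an item for sale
    IncludeHtml: true
    IncludeText: true
    NormalizeWhitespaces: true
    RemoveHtmlTags: false
    RemoveHtmlAttributes: true
    UrlContains: /products/
    ContainsText: Add to cart                 # text filter
    ContainsXpath: "//form[@id='purchase']"   # where ContainsText is checked
    DisabledTags: []
    Features: []
    Files: []
```

In the second entry, `ContainsText` and `ContainsXpath` work as a pair: according to the table, the text check is limited to the XPath when one is defined, so the phrase would have to appear inside the purchase form rather than anywhere on the page.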
| Property | Value |
| --- | --- |
| Path | Pages[] |
| Internal Type | WebScrappingModels.PageHalItem |
| Internal Root Type | WebScrapingHalConfiguration |
| File Extension | .halguru-webscraping.yaml |
| JSON Schema | halguru-webscraping-schema.json |
| Last updated | 2025-12-05 |
| Autogenerated | Yes |
| AI powered | Yes |
| Core version | 1.75.0 |